Announce: new-aops-1 for 2.6.21-rc3

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Announce: new-aops-1 for 2.6.21-rc3
@ 2007-03-15 16:17 Nick Piggin
  2007-03-15 19:32 ` Joel Becker
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Nick Piggin @ 2007-03-15 16:17 UTC (permalink / raw)
  To: Linux Filesystems, Mark Fasheh
  Cc: reiserfs-list, linux-ext4, xfs, nfs, cluster-devel,
	jfs-discussion

OK, I've gone through and fixed several bugs until the thing actually
survives fsx-linux for both ext2 and ext3 ordered and writeback (both
when using the new aops, and the legacy prepare_write path). Actually
ext3 sometimes breaks, but it does in unpatched kernels anyway.

At 15 patches (including the initial buffered write deadlock fixes),
it is too much to keep posting -- not much has fundamentally changed,
so I'll just post occasionally if we make big changes. The quilt
format is probably easier for someone wishing to work on it anyway.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/

(excludes the OCFS2 patch that Mark sent, in anticipation of an update)

It would be really nice if filesystem developers could take a look
at the new interfaces some time, because otherwise they might get stuck
with it :) So I'm cc'ing a few filesystems that come to mind, that I 
haven't heard anything from. 

Thanks,
Nick

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 16:17 Announce: new-aops-1 for 2.6.21-rc3 Nick Piggin
@ 2007-03-15 19:32 ` Joel Becker
  2007-03-15 19:57   ` Nick Piggin
  2007-03-15 19:53 ` Mark Fasheh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Joel Becker @ 2007-03-15 19:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: jfs-discussion, Mark Fasheh, xfs, cluster-devel, reiserfs-list,
	nfs, Linux Filesystems, linux-ext4

On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> At 15 patches (including the initial buffered write deadlock fixes),
> it is too much to keep posting -- not much has fundamentally changed,
> so I'll just post occasionally if we make big changes. The quilt
> format is probably easier for someone wishing to work on it anyway.
> 
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/

	For future drops, can you provide the unpacked patches too, so
lazy people like me can read them in the browser?  Thanks.

Joel

-- 

"Here's something to think about:  How come you never see a headline
 like ``Psychic Wins Lottery''?"
	- Jay Leno

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 19:32 ` Joel Becker
@ 2007-03-15 19:57   ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2007-03-15 19:57 UTC (permalink / raw)
  To: Joel Becker
  Cc: Linux Filesystems, Mark Fasheh, reiserfs-list, linux-ext4, xfs,
	nfs, cluster-devel, jfs-discussion

On Thu, Mar 15, 2007 at 12:32:45PM -0700, Joel Becker wrote:
> On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> > At 15 patches (including the initial buffered write deadlock fixes),
> > it is too much to keep posting -- not much has fundamentally changed,
> > so I'll just post occasionally if we make big changes. The quilt
> > format is probably easier for someone wishing to work on it anyway.
> > 
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/new-aops/
> 
> 	For future drops, can you provide the unpacked patches too, so
> lazy people like me can read them in the browser?  Thanks.

Sorry, I did intend to unpack that, but forgot. It's done now, the
new directory containing the patches is under the same URL as above.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 16:17 Announce: new-aops-1 for 2.6.21-rc3 Nick Piggin
  2007-03-15 19:32 ` Joel Becker
@ 2007-03-15 19:53 ` Mark Fasheh
  2007-03-15 19:57   ` Nick Piggin
  2007-03-15 21:08 ` Mark Fasheh
  2007-03-15 23:47 ` Mark Fasheh
  3 siblings, 1 reply; 10+ messages in thread
From: Mark Fasheh @ 2007-03-15 19:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: jfs-discussion, xfs, cluster-devel, reiserfs-list, nfs,
	Linux Filesystems, linux-ext4

On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> OK, I've gone through and fixed several bugs until the thing actually
> survives fsx-linux for both ext2 and ext3 ordered and writeback (both
> when using the new aops, and the legacy prepare_write path). Actually
> ext3 sometimes breaks, but it does in unpatched kernels anyway.
> 
> At 15 patches (including the initial buffered write deadlock fixes),
> it is too much to keep posting -- not much has fundamentally changed,
> so I'll just post occasionally if we make big changes. The quilt
> format is probably easier for someone wishing to work on it anyway.

Hmm, we still left out some exports...
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>
[PATCH] Export simple_write_begin, simple_write_end

These are used by configfs, which can be built as a module.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

---

 fs/libfs.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

36f5d6a135c9f3f30fee3d0e4ffa887e1803ac95
diff --git a/fs/libfs.c b/fs/libfs.c
index d687819..51f9748 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -656,6 +656,8 @@ EXPORT_SYMBOL(dcache_dir_open);
 EXPORT_SYMBOL(dcache_readdir);
 EXPORT_SYMBOL(generic_read_dir);
 EXPORT_SYMBOL(get_sb_pseudo);
+EXPORT_SYMBOL(simple_write_begin);
+EXPORT_SYMBOL(simple_write_end);
 EXPORT_SYMBOL(simple_commit_write);
 EXPORT_SYMBOL(simple_dir_inode_operations);
 EXPORT_SYMBOL(simple_dir_operations);
-- 
1.3.3


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 19:53 ` Mark Fasheh
@ 2007-03-15 19:57   ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2007-03-15 19:57 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: jfs-discussion, xfs, cluster-devel, reiserfs-list, nfs,
	Linux Filesystems, linux-ext4

On Thu, Mar 15, 2007 at 12:53:51PM -0700, Mark Fasheh wrote:
> On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> > OK, I've gone through and fixed several bugs until the thing actually
> > survives fsx-linux for both ext2 and ext3 ordered and writeback (both
> > when using the new aops, and the legacy prepare_write path). Actually
> > ext3 sometimes breaks, but it does in unpatched kernels anyway.
> > 
> > At 15 patches (including the initial buffered write deadlock fixes),
> > it is too much to keep posting -- not much has fundamentally changed,
> > so I'll just post occasionally if we make big changes. The quilt
> > format is probably easier for someone wishing to work on it anyway.
> 
> Hmm, we still left out some exports...

Thanks, applied.

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 16:17 Announce: new-aops-1 for 2.6.21-rc3 Nick Piggin
  2007-03-15 19:32 ` Joel Becker
  2007-03-15 19:53 ` Mark Fasheh
@ 2007-03-15 21:08 ` Mark Fasheh
  2007-03-15 23:47 ` Mark Fasheh
  3 siblings, 0 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-03-15 21:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: jfs-discussion, xfs, cluster-devel, reiserfs-list, nfs,
	Linux Filesystems, linux-ext4

On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> OK, I've gone through and fixed several bugs until the thing actually
> survives fsx-linux for both ext2 and ext3 ordered and writeback (both
> when using the new aops, and the legacy prepare_write path). Actually
> ext3 sometimes breaks, but it does in unpatched kernels anyway.

Attached is a bugfix for a crash folks who use an initrd will hit early on.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Populate pagep in simple_write_begin()

This wasn't getting passed back to callers.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>

cbf20bf51ddd6434db935ba29f845a85f3b1ec65
diff --git a/fs/libfs.c b/fs/libfs.c
index 51f9748..602496a 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -357,6 +357,8 @@ int simple_write_begin(struct file *file
 	if (!page)
 		return -ENOMEM;
 
+	*pagep = page;
+
 	return simple_prepare_write(file, page, from, from+len);
 }
 
-- 
1.3.3


-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 16:17 Announce: new-aops-1 for 2.6.21-rc3 Nick Piggin
                   ` (2 preceding siblings ...)
  2007-03-15 21:08 ` Mark Fasheh
@ 2007-03-15 23:47 ` Mark Fasheh
  2007-03-20  5:36   ` Nick Piggin
  2007-04-01  5:42   ` Nick Piggin
  3 siblings, 2 replies; 10+ messages in thread
From: Mark Fasheh @ 2007-03-15 23:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: jfs-discussion, xfs, cluster-devel, reiserfs-list, nfs,
	Linux Filesystems, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 464 bytes --]

On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> (excludes the OCFS2 patch that Mark sent, in anticipation of an update)

Attached is said patch. I needed to export __grab_cache_page (ext2/ext3 also
need this if they're to be built as modules), so a patch to do that is also
attached.

This passed some preliminary testing on a two node cluster I have here at
Oracle.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com


[-- Attachment #2: 0001-ocfs2-Convert-to-new-aops.txt --]
[-- Type: text/plain, Size: 5478 bytes --]

From: Mark Fasheh <mark.fasheh@oracle.com>

ocfs2: Convert to new aops

Turn ocfs2_prepare_write() and ocfs2_commit_write() into ocfs2_write_begin()
and ocfs2_write_end(). This conveniently eliminates the need for
AOP_TRUNCATED_PAGE during write.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


e28911070b02362a9a3a543646da84a8fbf9f63b
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 875c114..cbec0e1 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -293,29 +293,67 @@ int ocfs2_prepare_write_nolock(struct in
 }
 
 /*
- * ocfs2_prepare_write() can be an outer-most ocfs2 call when it is called
- * from loopback.  It must be able to perform its own locking around
- * ocfs2_get_block().
+ * ocfs2_write_begin() can be an outer-most ocfs2 call when it is
+ * called from elsewhere in the kernel. It must be able to perform its
+ * own locking around ocfs2_get_block().
  */
-static int ocfs2_prepare_write(struct file *file, struct page *page,
-			       unsigned from, unsigned to)
+static int ocfs2_write_begin(struct file *file, struct address_space *mapping,
+			     loff_t pos, unsigned len, unsigned flags,
+			     struct page **pagep, void **fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
+	struct buffer_head *di_bh = NULL;
+	struct page *page = NULL;
 	int ret;
 
-	mlog_entry("(0x%p, 0x%p, %u, %u)\n", file, page, from, to);
-
-	ret = ocfs2_meta_lock_with_page(inode, NULL, 0, page);
+	ret = ocfs2_meta_lock(inode, &di_bh, 1);
 	if (ret != 0) {
 		mlog_errno(ret);
+		return ret;
+	}
+
+	ret = ocfs2_data_lock(inode, 1);
+	if (ret) {
+		ocfs2_meta_unlock(inode, 1);
+
+		mlog_errno(ret);
+		return ret;
+	}
+
+	/*
+	 * Lock the page out here to preserve ordering with
+	 * ip_alloc_sem.
+	 */
+	page = __grab_cache_page(mapping, pos >> PAGE_CACHE_SHIFT);
+	if (!page) {
+		ret = -ENOMEM;
+		mlog_errno(ret);
 		goto out;
 	}
 
-	ret = ocfs2_prepare_write_nolock(inode, page, from, to);
+	*pagep = page;
 
-	ocfs2_meta_unlock(inode, 0);
+	down_read(&OCFS2_I(inode)->ip_alloc_sem);
+	ret = block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				ocfs2_get_block);
+	up_read(&OCFS2_I(inode)->ip_alloc_sem);
 out:
-	mlog_exit(ret);
+	if (ret == 0) {
+		*fsdata = di_bh;
+	} else {
+		/*
+		 * Error return - the caller won't call
+		 * ocfs2_write_end, so drop cluster locks here.
+		 */
+		brelse(di_bh);
+		if (page) {
+			unlock_page(page);
+			page_cache_release(page);
+		}
+		ocfs2_data_unlock(inode, 1);
+		ocfs2_meta_unlock(inode, 1);
+	}
+
 	return ret;
 }
 
@@ -388,16 +426,18 @@ out:
 	return handle;
 }
 
-static int ocfs2_commit_write(struct file *file, struct page *page,
-			      unsigned from, unsigned to)
+static int ocfs2_write_end(struct file *file, struct address_space *mapping,
+			   loff_t pos, unsigned len, unsigned copied,
+			   struct page *page, void *fsdata)
 {
 	int ret;
-	struct buffer_head *di_bh = NULL;
+	unsigned from, to;
+	struct buffer_head *di_bh = fsdata;
 	struct inode *inode = page->mapping->host;
 	handle_t *handle = NULL;
 	struct ocfs2_dinode *di;
 
-	mlog_entry("(0x%p, 0x%p, %u, %u)\n", file, page, from, to);
+	mlog_entry("(0x%p, 0x%p)\n", file, page);
 
 	/* NOTE: ocfs2_file_aio_write has ensured that it's safe for
 	 * us to continue here without rechecking the I/O against
@@ -412,22 +452,13 @@ static int ocfs2_commit_write(struct fil
 	 *    stale inode allocation image (i_size, i_clusters, etc).
 	 */
 
-	ret = ocfs2_meta_lock_with_page(inode, &di_bh, 1, page);
-	if (ret != 0) {
-		mlog_errno(ret);
-		goto out;
-	}
-
-	ret = ocfs2_data_lock_with_page(inode, 1, page);
-	if (ret != 0) {
-		mlog_errno(ret);
-		goto out_unlock_meta;
-	}
+	from = pos & (PAGE_CACHE_SIZE - 1);
+	to = from + len;
 
 	handle = ocfs2_start_walk_page_trans(inode, page, from, to);
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
-		goto out_unlock_data;
+		goto out_unlock;
 	}
 
 	/* Mark our buffer early. We'd rather catch this error up here
@@ -441,8 +472,10 @@ static int ocfs2_commit_write(struct fil
 	}
 
 	/* might update i_size */
-	ret = generic_commit_write(file, page, from, to);
-	if (ret < 0) {
+	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+	if (copied < 0) {
+		ret = copied;
+		copied = 0;
 		mlog_errno(ret);
 		goto out_commit;
 	}
@@ -458,23 +491,30 @@ static int ocfs2_commit_write(struct fil
 	di->i_size = cpu_to_le64((u64)i_size_read(inode));
 
 	ret = ocfs2_journal_dirty(handle, di_bh);
-	if (ret < 0) {
+	if (ret < 0)
 		mlog_errno(ret);
-		goto out_commit;
-	}
 
+	ret = 0;
 out_commit:
 	ocfs2_commit_trans(OCFS2_SB(inode->i_sb), handle);
-out_unlock_data:
+out_unlock:
 	ocfs2_data_unlock(inode, 1);
-out_unlock_meta:
 	ocfs2_meta_unlock(inode, 1);
-out:
+
+	if (ret) {
+		/*
+		 * We caught an error before block_write_end() -
+		 * unlock and free the page.
+		 */
+		unlock_page(page);
+		page_cache_release(page);
+	}
+
 	if (di_bh)
 		brelse(di_bh);
 
 	mlog_exit(ret);
-	return ret;
+	return copied ? copied : ret;
 }
 
 static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block)
@@ -678,8 +718,8 @@ out:
 const struct address_space_operations ocfs2_aops = {
 	.readpage	= ocfs2_readpage,
 	.writepage	= ocfs2_writepage,
-	.prepare_write	= ocfs2_prepare_write,
-	.commit_write	= ocfs2_commit_write,
+	.write_begin	= ocfs2_write_begin,
+	.write_end	= ocfs2_write_end,
 	.bmap		= ocfs2_bmap,
 	.sync_page	= block_sync_page,
 	.direct_IO	= ocfs2_direct_IO,
-- 
1.3.3


[-- Attachment #3: 0002-Export-__grab_cache_page.txt --]
[-- Type: text/plain, Size: 521 bytes --]

From: Mark Fasheh <mark.fasheh@oracle.com>

[PATCH] Export __grab_cache_page

Needed at least by ocfs2 and ext[23].

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>


ec4c66f0e6012a182105405aa11813fbf836629f
diff --git a/mm/filemap.c b/mm/filemap.c
index 327c20f..c4a2d68 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2196,6 +2196,7 @@ repeat:
 	}
 	return page;
 }
+EXPORT_SYMBOL(__grab_cache_page);
 
 static ssize_t generic_perform_write_2copy(struct file *file,
 				struct iov_iter *i, loff_t pos)
-- 
1.3.3


[-- Attachment #4: Type: text/plain, Size: 345 bytes --]

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV

[-- Attachment #5: Type: text/plain, Size: 140 bytes --]

_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 23:47 ` Mark Fasheh
@ 2007-03-20  5:36   ` Nick Piggin
  2007-04-01  5:42   ` Nick Piggin
  1 sibling, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2007-03-20  5:36 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: Linux Filesystems, reiserfs-list, linux-ext4, xfs, nfs,
	cluster-devel, jfs-discussion

On Thu, Mar 15, 2007 at 04:47:13PM -0700, Mark Fasheh wrote:
> On Thu, Mar 15, 2007 at 05:17:04PM +0100, Nick Piggin wrote:
> > (excludes the OCFS2 patch that Mark sent, in anticipation of an update)
> 
> Attached is said patch. I needed to export __grab_cache_page (ext2/ext3 also
> need this if they're to be built as modules), so a patch to do that is also
> attached.
> 
> This passed some preliminary testing on a two node cluster I have here at
> Oracle.

Thanks Mark, I've merged these.

Nick

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-03-15 23:47 ` Mark Fasheh
  2007-03-20  5:36   ` Nick Piggin
@ 2007-04-01  5:42   ` Nick Piggin
  2007-04-02  2:14     ` Nick Piggin
  1 sibling, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2007-04-01  5:42 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Linux Filesystems, Steven Whitehouse

Hi,

Progress is coming along: as well as the several fixes Mark has posted, we've
got a GFS2 patch from Steve as well, which also enables us to get rid of the
AOPS_TRUNCATED_PAGE handling in the legacy ->prepare_write paths. And I've
been auditing code and doing stress and code coverage tests for various
error paths...

I was hoping to have a cont_prepare_write filesystem converted before doing
a next patchset, but there are enough fixes and new stuff now that I'll
probably do one tomorrow.

Anyway my recent testing involved things like poisioning the memory of new
buffers, and injecting failures into various places (eg. the atomic
usercopies), then running fsx-linux and stuff on top of ext2 with various
block sizes. This found one subtle problem (the first, below), and
inspection found a couple of others.

With these fixes, the stress tests now survive as long as I've cared to wait.
Any comments always appreciated.

Thanks,
Nick

--

- The first hunk fixes a case where new buffers that were only partially
  cleared (because the write partially overlaps a buffer, so that region is
not cleared), and then the write fails before filling up this portion of the
buffer. Because we have already cleared buffer_new, page_zero_new_buffers
does not clear it properly. This results in uninitialised data going onto
disk (and in pagecache).  The solution is to only clear_buffer_new when we
know everything has been initialised, which makes sense.

- The second hunk fixes a thinko.

- The last hunk fixes a problem where we turn any short write to a !uptodate
page into a zero-length write, for the reason explained in comments. However
it was not taking this contraction into account in page_zero_new_buffers and
__block_commit_write, which is inconsistent.

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1852,17 +1852,8 @@ static int __block_prepare_write(struct 
 		if (!buffer_uptodate(*wait_bh))
 			err = -EIO;
 	}
-	if (!err) {
-		bh = head;
-		do {
-			if (buffer_new(bh))
-				clear_buffer_new(bh);
-		} while ((bh = bh->b_this_page) != head);
-		return 0;
-	}
-
-	/* Error case: */
-	page_zero_new_buffers(page, from, to);
+	if (unlikely(err))
+		page_zero_new_buffers(page, from, to);
 	return err;
 }
 
@@ -1895,7 +1886,7 @@ void page_zero_new_buffers(struct page *
 					end = min(to, block_end);
 
 					kaddr = kmap_atomic(page, KM_USER0);
-					memset(kaddr+start, 0, block_end-end);
+					memset(kaddr+start, 0, end - start);
 					flush_dcache_page(page);
 					kunmap_atomic(kaddr, KM_USER0);
 					set_buffer_uptodate(bh);
@@ -2015,30 +2006,29 @@ int block_write_end(struct file *file, s
 
 	start = pos & (PAGE_CACHE_SIZE - 1);
 
-	if (unlikely(copied < len))
+	if (unlikely(copied < len)) {
+		/*
+		 * The buffers that were written will now be uptodate, so we
+		 * don't have to worry about a readpage reading them and
+		 * overwriting a partial write. However if we have encountered
+		 * a short write and only partially written into a buffer, it
+		 * will not be marked uptodate, so a readpage might come in and
+		 * destroy our partial write.
+		 *
+		 * Do the simplest thing, and just treat any short write to a
+		 * non uptodate page as a zero-length write, and force the
+		 * caller to redo the whole thing.
+		 */
+		if (!PageUptodate(page))
+			copied = 0;
+
 		page_zero_new_buffers(page, start+copied, start+len);
+	}
 	flush_dcache_page(page);
 
 	/* This could be a short (even 0-length) commit */
 	__block_commit_write(inode, page, start, start+copied);
 
-	/*
-	 * The buffers that were written will now be uptodate, so we don't
-	 * have to worry about a readpage reading them and overwriting
-	 * a partial write. However if we have encountered a short write
-	 * and only partially written into a buffer, it will not be marked
-	 * uptodate, so a readpage might come in and destroy our partial
-	 * write.
-	 *
-	 * Do the simplest thing, and just treat any short write to a non
-	 * uptodate page as a zero-length write, and force the caller to
-	 * redo the whole thing.
-	 */
-	if (unlikely(copied < len)) {
-		if (!PageUptodate(page))
-			copied = 0;
-	}
-
 	unlock_page(page);
 	mark_page_accessed(page); /* XXX: put this in caller? */
 	page_cache_release(page);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Announce: new-aops-1 for 2.6.21-rc3
  2007-04-01  5:42   ` Nick Piggin
@ 2007-04-02  2:14     ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2007-04-02  2:14 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Linux Filesystems, Steven Whitehouse

Hmm, spotted another bug in here too, which also affects mainline AFAIKS.

If the page is already uptodate, then a new buffer will get marked uptodate
but will remain "new"... thus if the read portion of the RMW operation fails,
then the error handler will zero this guy out "to prevent stale data
exposure".

Of course, the data is not stale because it is uptodate, so zeroing out will
cause data loss.

I fixed it thusly, because I think buffer_new should always be cleared
before marking a buffer uptodate. I don't know whether I'll bother submitting
this seperately from the new-aops lineup, but if anyone would like to,
feel free to add my SOB.

--
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1813,7 +1813,9 @@ static int __block_prepare_write(struct 
 				unmap_underlying_metadata(bh->b_bdev,
 							bh->b_blocknr);
 				if (PageUptodate(page)) {
+					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
+					mark_buffer_dirty(bh);
 					continue;
 				}
 				if (block_end > to || block_start < from) {

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-04-02  2:14 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-15 16:17 Announce: new-aops-1 for 2.6.21-rc3 Nick Piggin
2007-03-15 19:32 ` Joel Becker
2007-03-15 19:57   ` Nick Piggin
2007-03-15 19:53 ` Mark Fasheh
2007-03-15 19:57   ` Nick Piggin
2007-03-15 21:08 ` Mark Fasheh
2007-03-15 23:47 ` Mark Fasheh
2007-03-20  5:36   ` Nick Piggin
2007-04-01  5:42   ` Nick Piggin
2007-04-02  2:14     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).