* [patch][rfc] fs: fix nobh error handling
@ 2007-08-07 5:51 Nick Piggin
2007-08-08 1:09 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2007-08-07 5:51 UTC (permalink / raw)
To: Andrew Morton, linux-fsdevel; +Cc: Dave Kleikamp, Badari Pulavarty
Hi,
So I seem to have hit some resistance to the idea of removing "nobh" mode
with the new aops patches. I don't need to reiterate my reasons for disliking
nobh mode (if the buffer layer is crap, it should be improved rather than
circumvented -- see fsblock). But at any rate, this view does not seem to
be unanimous, and it is completely sensible to be averse to removing things
that exist and somewhat work today.
So I've been mulling over the best way to have nobh and the new aops coexist
nicely. I think I've got a reasonable idea now... the main problem with
nobh mode is that it blissfully throws away most of the buffer state; when
an error does happen, it's up shit creek. This becomes more of a problem
with error recovery after a "short write" allowed by the new aop API: the
short write is actually going to happen relatively often as part of normal
operation, rather than just when the disk subsystem fails.
Still, the fix for existing error handling will look basically the same as
the fix for short write error handling. So I propose this patch against
head in order not to confuse the issues (and to fix the existing bugs in
release kernels). If this gets accepted, then I can subsequently look at
doing _additional_ patches for the new aops series in -mm.
Note: I actually did observe some of these problems by using IO error
injection. I wouldn't say the patch is completely stress tested yet, but
it did survive the same tests without any file corruption.
Nick
---
nobh mode error handling is not just pretty slack, it's wrong.
One cannot zero out the whole page to ensure new blocks are zeroed,
because it just brings the whole page "uptodate" with zeroes even if
that may not be the correct uptodate data. Also, other parts of the
page may already contain dirty data which would get lost by zeroing
it out. Thirdly, the writeback of zeroes to the new blocks will also
erase existing blocks. All these conditions are pagecache and/or
filesystem corruption.
The problem comes about because we didn't keep track of which buffers
actually are new or old. However it is not enough just to keep only
this state, because at the point we start dirtying parts of the page
(new blocks, with zeroes), the handling of IO errors becomes impossible
without buffers because the page may only be partially uptodate, in
which case the page flags alone cannot capture the state of the parts
of the page.
So allocate all buffers for the page upfront, but leave them unattached
so that they don't pick up any other references and can be freed when
we're done. If the error path is hit, then zero the new buffers as the
regular buffer path does, then attach the buffers to the page so that
it can actually be written out correctly and be subject to the normal
IO error handling paths.
As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
--
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2272,51 +2272,53 @@ int nobh_prepare_write(struct page *page
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize = 1 << blkbits;
- struct buffer_head map_bh;
- struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
+ struct buffer_head *head, *bh;
unsigned block_in_page;
- unsigned block_start;
+ unsigned block_start, block_end;
sector_t block_in_file;
char *kaddr;
int nr_reads = 0;
- int i;
int ret = 0;
int is_mapped_to_disk = 1;
+ if (PagePrivate(page))
+ return block_prepare_write(page, from, to, get_block);
+
if (PageMappedToDisk(page))
return 0;
+ head = alloc_page_buffers(page, blocksize, 1);
+
block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
- map_bh.b_page = page;
/*
* We loop across all blocks in the page, whether or not they are
* part of the affected region. This is so we can discover if the
* page is fully mapped-to-disk.
*/
- for (block_start = 0, block_in_page = 0;
+ for (block_start = 0, block_in_page = 0, bh = head;
block_start < PAGE_CACHE_SIZE;
- block_in_page++, block_start += blocksize) {
- unsigned block_end = block_start + blocksize;
+ block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
int create;
- map_bh.b_state = 0;
+ block_end = block_start + blocksize;
+ bh->b_state = 0;
create = 1;
if (block_start >= to)
create = 0;
- map_bh.b_size = blocksize;
ret = get_block(inode, block_in_file + block_in_page,
- &map_bh, create);
+ bh, create);
if (ret)
goto failed;
- if (!buffer_mapped(&map_bh))
+ if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(&map_bh))
- unmap_underlying_metadata(map_bh.b_bdev,
- map_bh.b_blocknr);
- if (PageUptodate(page))
+ if (buffer_new(bh))
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ if (PageUptodate(page)) {
+ set_buffer_uptodate(bh);
continue;
- if (buffer_new(&map_bh) || !buffer_mapped(&map_bh)) {
+ }
+ if (buffer_new(bh) || !buffer_mapped(bh)) {
kaddr = kmap_atomic(page, KM_USER0);
if (block_start < from)
memset(kaddr+block_start, 0, from-block_start);
@@ -2326,49 +2328,26 @@ int nobh_prepare_write(struct page *page
kunmap_atomic(kaddr, KM_USER0);
continue;
}
- if (buffer_uptodate(&map_bh))
+ if (buffer_uptodate(bh))
continue; /* reiserfs does this */
if (block_start < from || block_end > to) {
- struct buffer_head *bh = alloc_buffer_head(GFP_NOFS);
-
- if (!bh) {
- ret = -ENOMEM;
- goto failed;
- }
- bh->b_state = map_bh.b_state;
- atomic_set(&bh->b_count, 0);
- bh->b_this_page = NULL;
- bh->b_page = page;
- bh->b_blocknr = map_bh.b_blocknr;
- bh->b_size = blocksize;
- bh->b_data = (char *)(long)block_start;
- bh->b_bdev = map_bh.b_bdev;
- bh->b_private = NULL;
- read_bh[nr_reads++] = bh;
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_nobh;
+ submit_bh(READ, bh);
+ nr_reads++;
}
}
if (nr_reads) {
- struct buffer_head *bh;
-
/*
* The page is locked, so these buffers are protected from
* any VM or truncate activity. Hence we don't need to care
* for the buffer_head refcounts.
*/
- for (i = 0; i < nr_reads; i++) {
- bh = read_bh[i];
- lock_buffer(bh);
- bh->b_end_io = end_buffer_read_nobh;
- submit_bh(READ, bh);
- }
- for (i = 0; i < nr_reads; i++) {
- bh = read_bh[i];
+ for (bh = head; bh; bh = bh->b_this_page) {
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
ret = -EIO;
- free_buffer_head(bh);
- read_bh[i] = NULL;
}
if (ret)
goto failed;
@@ -2377,21 +2356,54 @@ int nobh_prepare_write(struct page *page
if (is_mapped_to_disk)
SetPageMappedToDisk(page);
+ do {
+ bh = head;
+ head = head->b_this_page;
+ free_buffer_head(bh);
+ } while (head);
+
return 0;
failed:
- for (i = 0; i < nr_reads; i++) {
- if (read_bh[i])
- free_buffer_head(read_bh[i]);
- }
-
/*
- * Error recovery is pretty slack. Clear the page and mark it dirty
- * so we'll later zero out any blocks which _were_ allocated.
+ * Error recovery is a bit difficult. We need to zero out blocks that
+ * were allocated and dirty them to ensure they get written out.
+ * Buffers need to be attached to the page at this point, otherwise
+ * the handling of potential IO errors during writeout would be
+ * impossible.
*/
- zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
- SetPageUptodate(page);
- set_page_dirty(page);
+ spin_lock(&page->mapping->private_lock);
+ bh = head;
+ block_start = 0;
+ do {
+ if (PageUptodate(page))
+ set_buffer_uptodate(bh);
+ if (PageDirty(page))
+ set_buffer_dirty(bh);
+
+ block_end = block_start+blocksize;
+ if (block_end <= from)
+ goto next;
+ if (block_start >= to)
+ goto next;
+
+ if (buffer_new(bh)) {
+ clear_buffer_new(bh);
+ if (!buffer_uptodate(bh)) {
+ zero_user_page(page, block_start, bh->b_size, KM_USER0);
+ set_buffer_uptodate(bh);
+ }
+ mark_buffer_dirty(bh);
+ }
+next:
+ block_start = block_end;
+ if (!bh->b_this_page)
+ bh->b_this_page = head;
+ bh = bh->b_this_page;
+ } while (bh != head);
+ attach_page_buffers(page, head);
+ spin_unlock(&page->mapping->private_lock);
+
return ret;
}
EXPORT_SYMBOL(nobh_prepare_write);
@@ -2406,6 +2418,9 @@ int nobh_commit_write(struct file *file,
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ if (PagePrivate(page))
+ return generic_commit_write(file, page, from, to);
+
SetPageUptodate(page);
set_page_dirty(page);
if (pos > inode->i_size) {
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-07 5:51 [patch][rfc] fs: fix nobh error handling Nick Piggin
@ 2007-08-08 1:09 ` Andrew Morton
2007-08-08 2:18 ` Nick Piggin
0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2007-08-08 1:09 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, Dave Kleikamp, Badari Pulavarty
On Tue, 7 Aug 2007 07:51:29 +0200
Nick Piggin <npiggin@suse.de> wrote:
> nobh mode error handling is not just pretty slack, it's wrong.
>
> One cannot zero out the whole page to ensure new blocks are zeroed,
> because it just brings the whole page "uptodate" with zeroes even if
> that may not be the correct uptodate data. Also, other parts of the
> page may already contain dirty data which would get lost by zeroing
> it out. Thirdly, the writeback of zeroes to the new blocks will also
> erase existing blocks. All these conditions are pagecache and/or
> filesystem corruption.
>
> The problem comes about because we didn't keep track of which buffers
> actually are new or old. However it is not enough just to keep only
> this state, because at the point we start dirtying parts of the page
> (new blocks, with zeroes), the handling of IO errors becomes impossible
> without buffers because the page may only be partially uptodate, in
> which case the page flags alone cannot capture the state of the parts
> of the page.
>
> So allocate all buffers for the page upfront, but leave them unattached
> so that they don't pick up any other references and can be freed when
> we're done. If the error path is hit, then zero the new buffers as the
> regular buffer path does, then attach the buffers to the page so that
> it can actually be written out correctly and be subject to the normal
> IO error handling paths.
>
> As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
> systems.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
>
With this change, nobh_prepare_write() can magically attach buffers to the
page. But a filesystem which is running in nobh mode wouldn't expect that,
and could quite legitimately go BUG, or leak data or, more seriously and
much less fixably, just go and overwrite page->private, because it "knows"
that nobody else is using ->private.
I'd have thought that it would be better to not attach the buffers and to
go ahead and do whatever synchronous IO is needed in the error recovery
code, then free those buffers again.
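(Purely for illustration -- not code from this thread -- a rough sketch of that alternative, against the unattached buffer list the patch builds and reusing the fs/buffer.c helpers end_buffer_write_sync() and zero_user_page(); everything else here is assumption:)
failed:
	/*
	 * Sketch only: zero and synchronously write out any newly allocated
	 * blocks, then free the never-attached buffers instead of attaching
	 * them.  If the synchronous write fails as well there is no good
	 * answer, which is the objection raised in the follow-up.
	 */
	for (bh = head, block_start = 0; bh;
			bh = bh->b_this_page, block_start += blocksize) {
		if (block_start >= to || block_start + blocksize <= from)
			continue;
		if (buffer_new(bh) && !buffer_uptodate(bh)) {
			zero_user_page(page, block_start, blocksize, KM_USER0);
			set_buffer_uptodate(bh);
			lock_buffer(bh);
			get_bh(bh);	/* end_buffer_write_sync() drops this */
			bh->b_end_io = end_buffer_write_sync;
			submit_bh(WRITE, bh);
			wait_on_buffer(bh);
		}
	}
	do {
		bh = head;
		head = head->b_this_page;
		free_buffer_head(bh);
	} while (head);
	return ret;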
Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
there which should be page_has_buffers().
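(For reference, in kernels of this era page_has_buffers() is defined in include/linux/buffer_head.h as a plain alias for the private flag, so the change is about using the accessor that names the intent rather than any behavioural difference:)
#define page_has_buffers(page)	PagePrivate(page)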
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 1:09 ` Andrew Morton
@ 2007-08-08 2:18 ` Nick Piggin
2007-08-08 2:33 ` Andrew Morton
0 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2007-08-08 2:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Dave Kleikamp, Badari Pulavarty
On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
> On Tue, 7 Aug 2007 07:51:29 +0200
> Nick Piggin <npiggin@suse.de> wrote:
>
> > nobh mode error handling is not just pretty slack, it's wrong.
> >
> > One cannot zero out the whole page to ensure new blocks are zeroed,
> > because it just brings the whole page "uptodate" with zeroes even if
> > that may not be the correct uptodate data. Also, other parts of the
> > page may already contain dirty data which would get lost by zeroing
> > it out. Thirdly, the writeback of zeroes to the new blocks will also
> > erase existing blocks. All these conditions are pagecache and/or
> > filesystem corruption.
> >
> > The problem comes about because we didn't keep track of which buffers
> > actually are new or old. However it is not enough just to keep only
> > this state, because at the point we start dirtying parts of the page
> > (new blocks, with zeroes), the handling of IO errors becomes impossible
> > without buffers because the page may only be partially uptodate, in
> > which case the page flags alone cannot capture the state of the parts
> > of the page.
> >
> > So allocate all buffers for the page upfront, but leave them unattached
> > so that they don't pick up any other references and can be freed when
> > we're done. If the error path is hit, then zero the new buffers as the
> > regular buffer path does, then attach the buffers to the page so that
> > it can actually be written out correctly and be subject to the normal
> > IO error handling paths.
> >
> > As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
> > systems.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> >
>
> With this change, nobh_prepare_write() can magically attach buffers to the
> page. But a filesystem which is running in nobh mode wouldn't expect that,
> and could quite legitimately go BUG, or leak data or, more seriously and
> much less fixably, just go and overwrite page->private, because it "knows"
> that nobody else is using ->private.
I was fairly sure that a filesystem can not assume buffers won't be
attached, because there are various error case paths that do exactly
the same thing (eg. nobh_writepage can call __block_write_full_page
which will attach buffers).
Does any filesystem assume this? Is it not already broken?
> I'd have thought that it would be better to not attach the buffers and to
> go ahead and do whatever synchronous IO is needed in the error recovery
> code, then free those buffers again.
It is hard because if the synchronous IO fails, then what do you do?
You could try making it up as you go along, but of course if we _can_
attach the buffers here then it would be preferable to do that. IMO.
> Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> there which should be page_has_buffers().
Yeah, I guess the whole thing needs more commenting :P
page_has_buffers... right, I'll change that.
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 2:18 ` Nick Piggin
@ 2007-08-08 2:33 ` Andrew Morton
2007-08-08 3:12 ` Nick Piggin
2007-08-20 4:33 ` Nick Piggin
0 siblings, 2 replies; 9+ messages in thread
From: Andrew Morton @ 2007-08-08 2:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-fsdevel, Dave Kleikamp, Badari Pulavarty
On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin <npiggin@suse.de> wrote:
> On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
> > On Tue, 7 Aug 2007 07:51:29 +0200
> > Nick Piggin <npiggin@suse.de> wrote:
> >
> > > nobh mode error handling is not just pretty slack, it's wrong.
> > >
> > > One cannot zero out the whole page to ensure new blocks are zeroed,
> > > because it just brings the whole page "uptodate" with zeroes even if
> > > that may not be the correct uptodate data. Also, other parts of the
> > > page may already contain dirty data which would get lost by zeroing
> > > it out. Thirdly, the writeback of zeroes to the new blocks will also
> > > erase existing blocks. All these conditions are pagecache and/or
> > > filesystem corruption.
> > >
> > > The problem comes about because we didn't keep track of which buffers
> > > actually are new or old. However it is not enough just to keep only
> > > this state, because at the point we start dirtying parts of the page
> > > (new blocks, with zeroes), the handling of IO errors becomes impossible
> > > without buffers because the page may only be partially uptodate, in
> > > which case the page flags alone cannot capture the state of the parts
> > > of the page.
> > >
> > > So allocate all buffers for the page upfront, but leave them unattached
> > > so that they don't pick up any other references and can be freed when
> > > we're done. If the error path is hit, then zero the new buffers as the
> > > regular buffer path does, then attach the buffers to the page so that
> > > it can actually be written out correctly and be subject to the normal
> > > IO error handling paths.
> > >
> > > As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
> > > systems.
> > >
> > > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > >
> >
> > With this change, nobh_prepare_write() can magically attach buffers to the
> > page. But a filesystem which is running in nobh mode wouldn't expect that,
> > and could quite legitimately go BUG, or leak data or, more seriously and
> > much less fixably, just go and overwrite page->private, because it "knows"
> > that nobody else is using ->private.
>
> I was fairly sure that a filesystem can not assume buffers won't be
> attached, because there are various error case paths that do exactly
> the same thing (eg. nobh_writepage can call __block_write_full_page
> which will attach buffers).
oh crap, that's sad. Either we broke it later on or I misremembered.
> Does any filesystem assume this? Is it not already broken?
Yes, it would be broken.
>
> > I'd have thought that it would be better to not attach the buffers and to
> > go ahead and do whatever synchronous IO is needed in the error recovery
> > code, then free those buffers again.
>
> It is hard because if the synchronous IO fails, then what do you do?
Do what we usually do when an IO error happens: crash the kernel? (Sorry,
have been spending too long at bugzilla.kernel.org)
> You could try making it up as you go along, but of course if we _can_
> attach the buffers here then it would be preferable to do that. IMO.
>
>
> > Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> > there which should be page_has_buffers().
>
> Yeah, I guess the whole thing needs more commenting :P
> page_has_buffers... right, I'll change that.
Did it get much testing?
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 2:33 ` Andrew Morton
@ 2007-08-08 3:12 ` Nick Piggin
2007-08-08 13:07 ` Dave Kleikamp
2007-08-20 4:33 ` Nick Piggin
1 sibling, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2007-08-08 3:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Dave Kleikamp, Badari Pulavarty
On Tue, Aug 07, 2007 at 07:33:47PM -0700, Andrew Morton wrote:
> On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
> > >
> > > With this change, nobh_prepare_write() can magically attach buffers to the
> > > page. But a filesystem which is running in nobh mode wouldn't expect that,
> > > and could quite legitimately go BUG, or leak data or, more seriously and
> > > much less fixably, just go and overwrite page->private, because it "knows"
> > > that nobody else is using ->private.
> >
> > I was fairly sure that a filesystem can not assume buffers won't be
> > attached, because there are various error case paths that do exactly
> > the same thing (eg. nobh_writepage can call __block_write_full_page
> > which will attach buffers).
>
> oh crap, that's sad. Either we broke it later on or I misremembered.
>
> > Does any filesystem assume this? Is it not already broken?
>
> Yes, it would be broken.
>
> >
> > > I'd have thought that it would be better to not attach the buffers and to
> > > go ahead and do whatever synchronous IO is needed in the error recovery
> > > code, then free those buffers again.
> >
> > It is hard because if the synchronous IO fails, then what do you do?
>
> Do what we usually do when an IO error happens: crash the kernel? (Sorry,
> have been spending too long at bugzilla.kernel.org)
Heh.. I guess there is still a chance to retry the IO with sync or
fsync. I'm not surprised if the "normal" pagecache error handling
paths don't work so well either, but at least if we can duplicate
as little code as possible it might get fixed up one day.
> > You could try making it up as you go along, but of course if we _can_
> > attach the buffers here then it would be preferable to do that. IMO.
> >
> >
> > > Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> > > there which should be page_has_buffers().
> >
> > Yeah, I guess the whole thing needs more commenting :P
> > page_has_buffers... right, I'll change that.
>
> Did it get much testing?
A little. Obviously it only really changes anything when an IO error hits,
and I found that ext3/jbd gives up and goes readonly pretty quickly when I
inject IO errors into the block device. What I really want to do is just
inject faults at nobh_prepare_write and do some longer tests.
I'll do that today.
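(As an illustration of that kind of testing -- nothing like this is in the posted patch -- a crude, hypothetical hack inside the get_block loop of nobh_prepare_write(), right after a successful get_block() call, could drive the recovery path:)
	{
		static atomic_t nobh_fail_count = ATOMIC_INIT(0);
		/* pretend roughly one call in 64 hit an error */
		if ((atomic_inc_return(&nobh_fail_count) & 63) == 0) {
			ret = -EIO;
			goto failed;
		}
	}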
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 3:12 ` Nick Piggin
@ 2007-08-08 13:07 ` Dave Kleikamp
2007-08-08 14:39 ` Mingming Cao
0 siblings, 1 reply; 9+ messages in thread
From: Dave Kleikamp @ 2007-08-08 13:07 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, Badari Pulavarty
On Wed, 2007-08-08 at 05:12 +0200, Nick Piggin wrote:
> On Tue, Aug 07, 2007 at 07:33:47PM -0700, Andrew Morton wrote:
> > On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin <npiggin@suse.de> wrote:
> >
> > > On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
> > > >
> > > > With this change, nobh_prepare_write() can magically attach buffers to the
> > > > page. But a filesystem which is running in nobh mode wouldn't expect that,
> > > > and could quite legitimately go BUG, or leak data or, more seriously and
> > > > much less fixably, just go and overwrite page->private, because it "knows"
> > > > that nobody else is using ->private.
> > >
> > > I was fairly sure that a filesystem can not assume buffers won't be
> > > attached, because there are various error case paths that do exactly
> > > the same thing (eg. nobh_writepage can call __block_write_full_page
> > > which will attach buffers).
> >
> > oh crap, that's sad. Either we broke it later on or I misremembered.
> >
> > > Does any filesystem assume this? Is it not already broken?
> >
> > Yes, it would be broken.
> >
> > >
> > > > I'd have thought that it would be better to not attach the buffers and to
> > > > go ahead and do whatever synchronous IO is needed in the error recovery
> > > > code, then free those buffers again.
> > >
> > > It is hard because if the synchronous IO fails, then what do you do?
> >
> > Do what we usually do when an IO error happens: crash the kernel? (Sorry,
> > have been spending too long at bugzilla.kernel.org)
>
> Heh.. I guess there is still a chance to retry the IO with sync or
> fsync. I'm not surprised if the "normal" pagecache error handling
> paths don't work so well either, but at least if we can duplicate
> as little code as possible it might get fixed up one day.
>
>
>
> > > You could try making it up as you go along, but of course if we _can_
> > > attach the buffers here then it would be preferable to do that. IMO.
> > >
> > >
> > > > Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> > > > there which should be page_has_buffers().
> > >
> > > Yeah, I guess the whole thing needs more commenting :P
> > > page_has_buffers... right, I'll change that.
> >
> > Did it get much testing?
>
> A little. Obviously it only really changes anything when an IO error hits,
> and I found that ext3/jbd gives up and goes readonly pretty quickly when I
> inject IO errors into the block device. What I really want to do is just
> inject faults at nobh_prepare_write and do some longer tests.
>
> I'll do that today.
For jfs's sake, I don't really care if it ever uses nobh again. I
originally started using it because I figured the movement was away from
buffer heads and jfs seemed just as happy with the nobh functions (after
a few bugs were flushed out). I don't think jfs really benefitted
though.
That said, I don't really know who cares about the nobh option in ext3.
Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 13:07 ` Dave Kleikamp
@ 2007-08-08 14:39 ` Mingming Cao
2007-08-09 1:45 ` Nick Piggin
0 siblings, 1 reply; 9+ messages in thread
From: Mingming Cao @ 2007-08-08 14:39 UTC (permalink / raw)
To: Dave Kleikamp; +Cc: Nick Piggin, Andrew Morton, linux-fsdevel, Badari Pulavarty
On Wed, 2007-08-08 at 08:07 -0500, Dave Kleikamp wrote:
> On Wed, 2007-08-08 at 05:12 +0200, Nick Piggin wrote:
> > On Tue, Aug 07, 2007 at 07:33:47PM -0700, Andrew Morton wrote:
> > > On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin <npiggin@suse.de> wrote:
> > >
> > > > On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
> > > > >
> > > > > With this change, nobh_prepare_write() can magically attach buffers to the
> > > > > page. But a filesystem which is running in nobh mode wouldn't expect that,
> > > > > and could quite legitimately go BUG, or leak data or, more seriously and
> > > > > much less fixably, just go and overwrite page->private, because it "knows"
> > > > > that nobody else is using ->private.
> > > >
> > > > I was fairly sure that a filesystem can not assume buffers won't be
> > > > attached, because there are various error case paths that do exactly
> > > > the same thing (eg. nobh_writepage can call __block_write_full_page
> > > > which will attach buffers).
> > >
> > > oh crap, that's sad. Either we broke it later on or I misremembered.
> > >
> > > > Does any filesystem assume this? Is it not already broken?
> > >
> > > Yes, it would be broken.
> > >
> > > >
> > > > > I'd have thought that it would be better to not attach the buffers and to
> > > > > go ahead and do whatever synchronous IO is needed in the error recovery
> > > > > code, then free those buffers again.
> > > >
> > > > It is hard because if the synchronous IO fails, then what do you do?
> > >
> > > Do what we usually do when an IO error happens: crash the kernel? (Sorry,
> > > have been spending too long at bugzilla.kernel.org)
> >
> > Heh.. I guess there is still a chance to retry the IO with sync or
> > fsync. I'm not surprised if the "normal" pagecache error handling
> > paths don't work so well either, but at least if we can duplicate
> > as little code as possible it might get fixed up one day.
> >
> >
> >
> > > > You could try making it up as you go along, but of course if we _can_
> > > > attach the buffers here then it would be preferable to do that. IMO.
> > > >
> > > >
> > > > > Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> > > > > there which should be page_has_buffers().
> > > >
> > > > Yeah, I guess the whole thing needs more commenting :P
> > > > page_has_buffers... right, I'll change that.
> > >
> > > Did it get much testing?
> >
> > A little. Obviously it only really changes anything when an IO error hits,
> > and I found that ext3/jbd gives up and goes readonly pretty quickly when I
> > inject IO errors into the block device. What I really want to do is just
> > inject faults at nobh_prepare_write and do some longer tests.
> >
> > I'll do that today.
>
> For jfs's sake, I don't really care if it ever uses nobh again. I
> originally started using it because I figured the movement was away from
> buffer heads and jfs seemed just as happy with the nobh functions (after
> a few bugs were flushed out). I don't think jfs really benefitted
> though.
>
> That said, I don't really know who cares about the nobh option in ext3.
>
Actually IBM/LTC use the nobh option in ext3 on our internal kernel
development server, to control the consumption of a large amount of low
memory.
Mingming
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 14:39 ` Mingming Cao
@ 2007-08-09 1:45 ` Nick Piggin
0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2007-08-09 1:45 UTC (permalink / raw)
To: Mingming Cao
Cc: Dave Kleikamp, Andrew Morton, linux-fsdevel, Badari Pulavarty
On Wed, Aug 08, 2007 at 07:39:42AM -0700, Mingming Cao wrote:
> On Wed, 2007-08-08 at 08:07 -0500, Dave Kleikamp wrote:
> >
> > For jfs's sake, I don't really care if it ever uses nobh again. I
> > originally started using it because I figured the movement was away from
> > buffer heads and jfs seemed just as happy with the nobh functions (after
> > a few bugs were flushed out). I don't think jfs really benefitted
> > though.
> >
> > That said, I don't really know who cares about the nobh option in ext3.
> >
>
> Actually IBM/LTC use the nobh option in ext3 on our internal kernel
> development server, to control the consumption of large amount of low
> memory space.
Fair enough... but you mean to say that page reclaim does not control
the consumption of low memory well enough? I.e. what is the exact benefit
of using nobh?
(I'm not trying to argue to remove it... yet... just interested)
* Re: [patch][rfc] fs: fix nobh error handling
2007-08-08 2:33 ` Andrew Morton
2007-08-08 3:12 ` Nick Piggin
@ 2007-08-20 4:33 ` Nick Piggin
1 sibling, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2007-08-20 4:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-fsdevel, Dave Kleikamp, Badari Pulavarty
On Tue, Aug 07, 2007 at 07:33:47PM -0700, Andrew Morton wrote:
> On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin <npiggin@suse.de> wrote:
>
> > You could try making it up as you go along, but of course if we _can_
> > attach the buffers here then it would be preferable to do that. IMO.
> >
> >
> > > Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
> > > there which should be page_has_buffers().
> >
> > Yeah, I guess the whole thing needs more commenting :P
> > page_has_buffers... right, I'll change that.
>
> Did it get much testing?
OK, ext2 seems fairly robust and survives hundreds of cycles of the test
case that results in corruption without the patch (drop caches, unaligned
write(2) into non-zero file data, watch the entire block get zeroed out
on failure).
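(A rough userspace sketch of that reproducer follows; the file name, size and offset are made up, and the kernel-side write-failure injection is assumed to be set up separately:)
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
	char buf[100];
	int fd;
	/* drop the page cache so the write path has to map and read blocks */
	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd >= 0) {
		write(fd, "3", 1);
		close(fd);
	}
	/* unaligned write(2) into the middle of existing non-zero file data */
	fd = open("testfile", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));
	if (lseek(fd, 1000, SEEK_SET) < 0)
		perror("lseek");
	if (write(fd, buf, sizeof(buf)) < 0)
		perror("write");
	close(fd);
	/*
	 * Without the fix, an injected failure in nobh_prepare_write() here
	 * can leave the whole surrounding block zero-filled on disk.
	 */
	return 0;
}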
Also survives fsx-linux, but that appears to mostly work out of cache,
so I don't think the write-failure injection was exercised too much.
Added comments, changed PagePrivate to page_has_buffers.
--
nobh mode error handling is not just pretty slack, it's wrong.
One cannot zero out the whole page to ensure new blocks are zeroed,
because it just brings the whole page "uptodate" with zeroes even if
that may not be the correct uptodate data. Also, other parts of the
page may already contain dirty data which would get lost by zeroing
it out. Thirdly, the writeback of zeroes to the new blocks will also
erase existing blocks. All these conditions are pagecache and/or
filesystem corruption.
The problem comes about because we didn't keep track of which buffers
actually are new or old. However it is not enough just to keep only
this state, because at the point we start dirtying parts of the page
(new blocks, with zeroes), the handling of IO errors becomes impossible
without buffers because the page may only be partially uptodate, in
which case the page flags alone cannot capture the state of the parts
of the page.
So allocate all buffers for the page upfront, but leave them unattached
so that they don't pick up any other references and can be freed when
we're done. If the error path is hit, then zero the new buffers as the
regular buffer path does, then attach the buffers to the page so that
it can actually be written out correctly and be subject to the normal
IO error handling paths.
As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
--
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2272,51 +2272,64 @@ int nobh_prepare_write(struct page *page
struct inode *inode = page->mapping->host;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize = 1 << blkbits;
- struct buffer_head map_bh;
- struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
+ struct buffer_head *head, *bh;
unsigned block_in_page;
- unsigned block_start;
+ unsigned block_start, block_end;
sector_t block_in_file;
char *kaddr;
int nr_reads = 0;
- int i;
int ret = 0;
int is_mapped_to_disk = 1;
+ if (page_has_buffers(page))
+ return block_prepare_write(page, from, to, get_block);
+
if (PageMappedToDisk(page))
return 0;
+ /*
+ * Allocate buffers so that we can keep track of state, and potentially
+ * attach them to the page if an error occurs. In the common case of
+ * no error, they will just be freed again without ever being attached
+ * to the page (which is all OK, because we're under the page lock).
+ *
+ * Be careful: the buffer linked list is a NULL terminated one, rather
+ * than the circular one we're used to.
+ */
+ head = alloc_page_buffers(page, blocksize, 0);
+ if (!head)
+ return -ENOMEM;
+
block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
- map_bh.b_page = page;
/*
* We loop across all blocks in the page, whether or not they are
* part of the affected region. This is so we can discover if the
* page is fully mapped-to-disk.
*/
- for (block_start = 0, block_in_page = 0;
+ for (block_start = 0, block_in_page = 0, bh = head;
block_start < PAGE_CACHE_SIZE;
- block_in_page++, block_start += blocksize) {
- unsigned block_end = block_start + blocksize;
+ block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
int create;
- map_bh.b_state = 0;
+ block_end = block_start + blocksize;
+ bh->b_state = 0;
create = 1;
if (block_start >= to)
create = 0;
- map_bh.b_size = blocksize;
ret = get_block(inode, block_in_file + block_in_page,
- &map_bh, create);
+ bh, create);
if (ret)
goto failed;
- if (!buffer_mapped(&map_bh))
+ if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(&map_bh))
- unmap_underlying_metadata(map_bh.b_bdev,
- map_bh.b_blocknr);
- if (PageUptodate(page))
+ if (buffer_new(bh))
+ unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
+ if (PageUptodate(page)) {
+ set_buffer_uptodate(bh);
continue;
- if (buffer_new(&map_bh) || !buffer_mapped(&map_bh)) {
+ }
+ if (buffer_new(bh) || !buffer_mapped(bh)) {
kaddr = kmap_atomic(page, KM_USER0);
if (block_start < from)
memset(kaddr+block_start, 0, from-block_start);
@@ -2326,49 +2339,26 @@ int nobh_prepare_write(struct page *page
kunmap_atomic(kaddr, KM_USER0);
continue;
}
- if (buffer_uptodate(&map_bh))
+ if (buffer_uptodate(bh))
continue; /* reiserfs does this */
if (block_start < from || block_end > to) {
- struct buffer_head *bh = alloc_buffer_head(GFP_NOFS);
-
- if (!bh) {
- ret = -ENOMEM;
- goto failed;
- }
- bh->b_state = map_bh.b_state;
- atomic_set(&bh->b_count, 0);
- bh->b_this_page = NULL;
- bh->b_page = page;
- bh->b_blocknr = map_bh.b_blocknr;
- bh->b_size = blocksize;
- bh->b_data = (char *)(long)block_start;
- bh->b_bdev = map_bh.b_bdev;
- bh->b_private = NULL;
- read_bh[nr_reads++] = bh;
+ lock_buffer(bh);
+ bh->b_end_io = end_buffer_read_nobh;
+ submit_bh(READ, bh);
+ nr_reads++;
}
}
if (nr_reads) {
- struct buffer_head *bh;
-
/*
* The page is locked, so these buffers are protected from
* any VM or truncate activity. Hence we don't need to care
* for the buffer_head refcounts.
*/
- for (i = 0; i < nr_reads; i++) {
- bh = read_bh[i];
- lock_buffer(bh);
- bh->b_end_io = end_buffer_read_nobh;
- submit_bh(READ, bh);
- }
- for (i = 0; i < nr_reads; i++) {
- bh = read_bh[i];
+ for (bh = head; bh; bh = bh->b_this_page) {
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
ret = -EIO;
- free_buffer_head(bh);
- read_bh[i] = NULL;
}
if (ret)
goto failed;
@@ -2377,21 +2367,54 @@ int nobh_prepare_write(struct page *page
if (is_mapped_to_disk)
SetPageMappedToDisk(page);
+ do {
+ bh = head;
+ head = head->b_this_page;
+ free_buffer_head(bh);
+ } while (head);
+
return 0;
failed:
- for (i = 0; i < nr_reads; i++) {
- if (read_bh[i])
- free_buffer_head(read_bh[i]);
- }
-
/*
- * Error recovery is pretty slack. Clear the page and mark it dirty
- * so we'll later zero out any blocks which _were_ allocated.
+ * Error recovery is a bit difficult. We need to zero out blocks that
+ * were newly allocated, and dirty them to ensure they get written out.
+ * Buffers need to be attached to the page at this point, otherwise
+ * the handling of potential IO errors during writeout would be hard
+ * (could try doing synchronous writeout, but what if that fails too?)
*/
- zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
- SetPageUptodate(page);
- set_page_dirty(page);
+ spin_lock(&page->mapping->private_lock);
+ bh = head;
+ block_start = 0;
+ do {
+ if (PageUptodate(page))
+ set_buffer_uptodate(bh);
+ if (PageDirty(page))
+ set_buffer_dirty(bh);
+
+ block_end = block_start+blocksize;
+ if (block_end <= from)
+ goto next;
+ if (block_start >= to)
+ goto next;
+
+ if (buffer_new(bh)) {
+ clear_buffer_new(bh);
+ if (!buffer_uptodate(bh)) {
+ zero_user_page(page, block_start, bh->b_size, KM_USER0);
+ set_buffer_uptodate(bh);
+ }
+ mark_buffer_dirty(bh);
+ }
+next:
+ block_start = block_end;
+ if (!bh->b_this_page)
+ bh->b_this_page = head;
+ bh = bh->b_this_page;
+ } while (bh != head);
+ attach_page_buffers(page, head);
+ spin_unlock(&page->mapping->private_lock);
+
return ret;
}
EXPORT_SYMBOL(nobh_prepare_write);
@@ -2406,6 +2429,9 @@ int nobh_commit_write(struct file *file,
struct inode *inode = page->mapping->host;
loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+ if (page_has_buffers(page))
+ return generic_commit_write(file, page, from, to);
+
SetPageUptodate(page);
set_page_dirty(page);
if (pos > inode->i_size) {