writepage fs corruption fixes

All of lore.kernel.org
 help / color / mirror / Atom feed

* writepage fs corruption fixes
@ 2004-07-09  4:01 Andrea Arcangeli
  2004-07-09  4:06 ` Andrea Arcangeli
  2004-07-09  4:29 ` Andrew Morton
  0 siblings, 2 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-09  4:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Linus Torvalds, Chris Mason

Hello,

I believe I found three bugs in the writepage code.

1) The major one (the only one I believe that was triggering [only on
ext2 due the fact mpage is working only there]) is the marking of the bh
clean despite we could still run into the "confused" path. After that
the confused path really becomes confused and it writes nothing and fs
corruption triggers silenty (the reugular writepage only writes bh that
are marked dirty, it never attempts to submit_bh anything marked clean).
The mpage-writepage code must never mark the bh clean as far as it
wants to still fallback in the regular writepage which depends on the bh
to be dirty (i.e. the "goto confused" path). This could only triggers
with memory pressure (it also needs buffer_heads_over_limit == 0, and
that is frequent under mm pressure).

2) Second bug is that we must always alloc a bio with some vector on it
(even in the PF_MEMALLOC path), otherwise submit_bio will BUG() (this
one never triggered though).

3) Third bug is in the regular writepage, the nr_underway == 0 code was
walking buffers on an unlocked page without keeping the bh pinned, and
in turn the bh could be released under it by the VM. Fix is to delay the
put_bh loop. (this might have triggered but it's not certain)

This patch should fix all the above three bugs. beware, it's almost
untested but it looks very good. I've only tried mounting ext2 and doing
some I/O on it.  applies cleanly to the kernel CVS.

Thanks a lot to Chris for his fine debugging that localized the problem
in the writepage code.

I don't yet know if this is enough to close all corruption in practice
but I'm quite optimistic (at the moment at least ;).

--- sles/fs/buffer.c.~1~	2004-07-05 03:08:09.000000000 +0200
+++ sles/fs/buffer.c	2004-07-09 05:16:47.544011656 +0200
@@ -1586,10 +1586,9 @@ __bread(struct block_device *bdev, secto
 EXPORT_SYMBOL(__bread);

 /*
- * invalidate_bh_lrus() is called rarely - at unmount.  Because it is only for
- * unmount it only needs to ensure that all buffers from the target device are
- * invalidated on return and it doesn't need to worry about new buffers from
- * that device being added - the unmount code has to prevent that.
+ * invalidate_bh_lrus() is called rarely - but not only at unmount.
+ * This doesn't race because it runs in each cpu either in irq
+ * or with preempt disabled.
  */
 static void invalidate_bh_lru(void *arg)
 {
@@ -1912,19 +1911,19 @@ static int __block_write_full_page(struc

 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);	/* Keeps try_to_free_buffers() away */
-	unlock_page(page);
-
 	/*
-	 * The page may come unlocked any time after the *first* submit_bh()
-	 * call.  Be careful with its buffers.
+	 * The page and its bh will not go away from under us
+	 * because we pinned all the bh with get_bh and we'll
+	 * release them only after we finished.
 	 */
+	unlock_page(page);
+
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			submit_bh(WRITE, bh);
 			nr_underway++;
 		}
-		put_bh(bh);
 		bh = next;
 	} while (bh != head);

@@ -1949,6 +1948,15 @@ done:
 		end_page_writeback(page);
 		wbc->pages_skipped++;	/* We didn't write this page */
 	}
+
+	/* can finally release the bh and after that the page can be freed */
+	bh = head;
+	do {
+		struct buffer_head *next = bh->b_this_page;
+		put_bh(bh);
+		bh = next;
+	} while (bh != head);
+
 	return err;

 recover:
@@ -1984,7 +1992,6 @@ recover:
 			submit_bh(WRITE, bh);
 			nr_underway++;
 		}
-		put_bh(bh);
 		bh = next;
 	} while (bh != head);
 	goto done;
--- sles/fs/mpage.c.~1~	2004-07-05 03:08:08.000000000 +0200
+++ sles/fs/mpage.c	2004-07-09 05:11:13.543787408 +0200
@@ -105,7 +105,7 @@ mpage_alloc(struct block_device *bdev,
 	bio = bio_alloc(gfp_flags, nr_vecs);

 	if (bio == NULL && (current->flags & PF_MEMALLOC)) {
-		while (!bio && (nr_vecs /= 2))
+		while (!bio && (nr_vecs /= 2) && nr_vecs)
 			bio = bio_alloc(gfp_flags, nr_vecs);
 	}

@@ -520,6 +520,17 @@ alloc_new:
 	}

 	/*
+	 * Must try to add the page before marking the buffer clean or
+	 * the confused fail path above (OOM) will be very confused when
+	 * it finds all bh marked clean (i.e. it will not write anything)
+	 */
+	length = first_unmapped << blkbits;
+	if (bio_add_page(bio, page, length, 0) < length) {
+		bio = mpage_bio_submit(WRITE, bio);
+		goto alloc_new;
+	}
+
+	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
 	 * sure to only clean buffers which we know we'll be writing.
 	 */
@@ -539,12 +550,6 @@ alloc_new:
 			try_to_free_buffers(page);
 	}

-	length = first_unmapped << blkbits;
-	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
-		goto alloc_new;
-	}
-
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
 	unlock_page(page);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:01 writepage fs corruption fixes Andrea Arcangeli
@ 2004-07-09  4:06 ` Andrea Arcangeli
  2004-07-09  4:19   ` Andrea Arcangeli
  2004-07-09  4:29 ` Andrew Morton
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-09  4:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Linus Torvalds, Chris Mason

On Fri, Jul 09, 2004 at 06:01:51AM +0200, Andrea Arcangeli wrote:
> --- sles/fs/mpage.c.~1~	2004-07-05 03:08:08.000000000 +0200
> +++ sles/fs/mpage.c	2004-07-09 05:11:13.543787408 +0200
> @@ -105,7 +105,7 @@ mpage_alloc(struct block_device *bdev,
>  	bio = bio_alloc(gfp_flags, nr_vecs);
>  
>  	if (bio == NULL && (current->flags & PF_MEMALLOC)) {
> -		while (!bio && (nr_vecs /= 2))
> +		while (!bio && (nr_vecs /= 2) && nr_vecs)
>  			bio = bio_alloc(gfp_flags, nr_vecs);
>  	}
>  

the above change is a "noop", the above one wasn't really a bug (I just
misread the code), the rest of the patch is needed.  reattached after
filtering out the noop.

--- sles/fs/buffer.c.~1~	2004-07-05 03:08:09.000000000 +0200
+++ sles/fs/buffer.c	2004-07-09 05:16:47.544011656 +0200
@@ -1586,10 +1586,9 @@ __bread(struct block_device *bdev, secto
 EXPORT_SYMBOL(__bread);
 
 /*
- * invalidate_bh_lrus() is called rarely - at unmount.  Because it is only for
- * unmount it only needs to ensure that all buffers from the target device are
- * invalidated on return and it doesn't need to worry about new buffers from
- * that device being added - the unmount code has to prevent that.
+ * invalidate_bh_lrus() is called rarely - but not only at unmount.
+ * This doesn't race because it runs in each cpu either in irq
+ * or with preempt disabled.
  */
 static void invalidate_bh_lru(void *arg)
 {
@@ -1912,19 +1911,19 @@ static int __block_write_full_page(struc
 
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);	/* Keeps try_to_free_buffers() away */
-	unlock_page(page);
-
 	/*
-	 * The page may come unlocked any time after the *first* submit_bh()
-	 * call.  Be careful with its buffers.
+	 * The page and its bh will not go away from under us
+	 * because we pinned all the bh with get_bh and we'll
+	 * release them only after we finished.
 	 */
+	unlock_page(page);
+
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			submit_bh(WRITE, bh);
 			nr_underway++;
 		}
-		put_bh(bh);
 		bh = next;
 	} while (bh != head);
 
@@ -1949,6 +1948,15 @@ done:
 		end_page_writeback(page);
 		wbc->pages_skipped++;	/* We didn't write this page */
 	}
+
+	/* can finally release the bh and after that the page can be freed */
+	bh = head;
+	do {
+		struct buffer_head *next = bh->b_this_page;
+		put_bh(bh);
+		bh = next;
+	} while (bh != head);
+
 	return err;
 
 recover:
@@ -1984,7 +1992,6 @@ recover:
 			submit_bh(WRITE, bh);
 			nr_underway++;
 		}
-		put_bh(bh);
 		bh = next;
 	} while (bh != head);
 	goto done;
--- sles/fs/mpage.c.~1~	2004-07-05 03:08:08.000000000 +0200
+++ sles/fs/mpage.c	2004-07-09 05:11:13.543787408 +0200
@@ -520,6 +520,17 @@ alloc_new:
 	}
 
 	/*
+	 * Must try to add the page before marking the buffer clean or
+	 * the confused fail path above (OOM) will be very confused when
+	 * it finds all bh marked clean (i.e. it will not write anything)
+	 */
+	length = first_unmapped << blkbits;
+	if (bio_add_page(bio, page, length, 0) < length) {
+		bio = mpage_bio_submit(WRITE, bio);
+		goto alloc_new;
+	}
+
+	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
 	 * sure to only clean buffers which we know we'll be writing.
 	 */
@@ -539,12 +550,6 @@ alloc_new:
 			try_to_free_buffers(page);
 	}
 
-	length = first_unmapped << blkbits;
-	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
-		goto alloc_new;
-	}
-
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
 	unlock_page(page);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:06 ` Andrea Arcangeli
@ 2004-07-09  4:19   ` Andrea Arcangeli
  0 siblings, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-09  4:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Linus Torvalds, Chris Mason

On Fri, Jul 09, 2004 at 06:06:11AM +0200, Andrea Arcangeli wrote:
> @@ -1912,19 +1911,19 @@ static int __block_write_full_page(struc
>  
>  	BUG_ON(PageWriteback(page));
>  	set_page_writeback(page);	/* Keeps try_to_free_buffers() away */
> -	unlock_page(page);
> -
>  	/*
> -	 * The page may come unlocked any time after the *first* submit_bh()
> -	 * call.  Be careful with its buffers.
> +	 * The page and its bh will not go away from under us
> +	 * because we pinned all the bh with get_bh and we'll
> +	 * release them only after we finished.
>  	 */
> +	unlock_page(page);
> +
>  	do {
>  		struct buffer_head *next = bh->b_this_page;
>  		if (buffer_async_write(bh)) {
>  			submit_bh(WRITE, bh);
>  			nr_underway++;
>  		}
> -		put_bh(bh);
>  		bh = next;
>  	} while (bh != head);
>  
> @@ -1949,6 +1948,15 @@ done:
>  		end_page_writeback(page);
>  		wbc->pages_skipped++;	/* We didn't write this page */
>  	}
> +
> +	/* can finally release the bh and after that the page can be freed */
> +	bh = head;
> +	do {
> +		struct buffer_head *next = bh->b_this_page;
> +		put_bh(bh);
> +		bh = next;
> +	} while (bh != head);
> +
>  	return err;
>  
>  recover:
> @@ -1984,7 +1992,6 @@ recover:
>  			submit_bh(WRITE, bh);
>  			nr_underway++;
>  		}
> -		put_bh(bh);
>  		bh = next;
>  	} while (bh != head);
>  	goto done;

not that the above change was wrong but I got convinced the above is
also a noop, thanks to the writeback bitflag that prevents the page to
go away.

so the only bug that remains was really the first one (number 1) and
infact this is the only one that I really think has been reproduced.

so this should be a final update. Sorry for the triple spamming ;).

btw, I left in all patches a comment about invalidate_bh_lru that is
misleading since that functions runs with BLKFLSBUF for example, but
that function is safe to run also outside unmount path so no problem
there, just wrong comment.

--- sles/fs/buffer.c.~1~	2004-07-05 03:08:09.000000000 +0200
+++ sles/fs/buffer.c	2004-07-09 05:16:47.544011656 +0200
@@ -1586,10 +1586,9 @@ __bread(struct block_device *bdev, secto
 EXPORT_SYMBOL(__bread);
 
 /*
- * invalidate_bh_lrus() is called rarely - at unmount.  Because it is only for
- * unmount it only needs to ensure that all buffers from the target device are
- * invalidated on return and it doesn't need to worry about new buffers from
- * that device being added - the unmount code has to prevent that.
+ * invalidate_bh_lrus() is called rarely - but not only at unmount.
+ * This doesn't race because it runs in each cpu either in irq
+ * or with preempt disabled.
  */
 static void invalidate_bh_lru(void *arg)
 {
--- sles/fs/mpage.c.~1~	2004-07-05 03:08:08.000000000 +0200
+++ sles/fs/mpage.c	2004-07-09 05:11:13.543787408 +0200
@@ -520,6 +520,17 @@ alloc_new:
 	}
 
 	/*
+	 * Must try to add the page before marking the buffer clean or
+	 * the confused fail path above (OOM) will be very confused when
+	 * it finds all bh marked clean (i.e. it will not write anything)
+	 */
+	length = first_unmapped << blkbits;
+	if (bio_add_page(bio, page, length, 0) < length) {
+		bio = mpage_bio_submit(WRITE, bio);
+		goto alloc_new;
+	}
+
+	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
 	 * sure to only clean buffers which we know we'll be writing.
 	 */
@@ -539,12 +550,6 @@ alloc_new:
 			try_to_free_buffers(page);
 	}
 
-	length = first_unmapped << blkbits;
-	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
-		goto alloc_new;
-	}
-
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
 	unlock_page(page);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:01 writepage fs corruption fixes Andrea Arcangeli
  2004-07-09  4:06 ` Andrea Arcangeli
@ 2004-07-09  4:29 ` Andrew Morton
  2004-07-09  4:42   ` Andrea Arcangeli
  1 sibling, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2004-07-09  4:29 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, torvalds, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
> 1) The major one (the only one I believe that was triggering [only on
>  ext2 due the fact mpage is working only there]) is the marking of the bh
>  clean despite we could still run into the "confused" path. After that
>  the confused path really becomes confused and it writes nothing and fs
>  corruption triggers silenty (the reugular writepage only writes bh that
>  are marked dirty, it never attempts to submit_bh anything marked clean).
>  The mpage-writepage code must never mark the bh clean as far as it
>  wants to still fallback in the regular writepage which depends on the bh
>  to be dirty (i.e. the "goto confused" path). This could only triggers
>  with memory pressure (it also needs buffer_heads_over_limit == 0, and
>  that is frequent under mm pressure).

ooh, nasty, yes.  You must be testing the crap out if it.

>  3) Third bug is in the regular writepage, the nr_underway == 0 code was
>  walking buffers on an unlocked page without keeping the bh pinned, and
>  in turn the bh could be released under it by the VM. Fix is to delay the
>  put_bh loop. (this might have triggered but it's not certain)

PG_writeback protects the page from truncate, from invalidate and from page
reclaim.  pagevec_strip() won't touch the buffers due to the
PageWriteback() test in try_to_release_page().  So I think we're OK in
there.  I can add a couple more coment fixes for this.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:29 ` Andrew Morton
@ 2004-07-09  4:42   ` Andrea Arcangeli
  2004-07-09  4:56     ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-09  4:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

Hi Andrew,

On Thu, Jul 08, 2004 at 09:29:23PM -0700, Andrew Morton wrote:
> ooh, nasty, yes.  You must be testing the crap out if it.

;) thanks for the immediate review!

BTW, the new mpage code looks great, it's a pity that reiserfs and ext3
don't use it yet.

> PG_writeback protects the page from truncate, from invalidate and from page
> reclaim.  pagevec_strip() won't touch the buffers due to the
> PageWriteback() test in try_to_release_page().  So I think we're OK in
> there.  I can add a couple more coment fixes for this.

Indeed, sorry. I should have watched more carefully for the writeback
bitflag but I was still in "anything that looks a bug close it and try
again" mode when I made the two "noop" changes ;)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:42   ` Andrea Arcangeli
@ 2004-07-09  4:56     ` Andrew Morton
  2004-07-09 12:43       ` Chris Mason
  2004-07-10  0:16       ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2004-07-09  4:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, torvalds, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
> BTW, the new mpage code looks great,

You should have seen the first version!  But after all the bugs were fixed
and the real world hit it, some spaghetti got in there.

> it's a pity that reiserfs and ext3 don't use it yet.

JFS, hfs, hfsplus and ext2 are using it.

Unfortunately it's hard to use mpage_writepages() even in ext3's writeback
mode, because ext3_get_block() assumes that it is called with a transaction
open.  Not impossible though I guess - use a different get_block() which
opens a transaction for itself...  But only open it if the page isn't
already mapped to disk.  (/me gets itchy fingers)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:56     ` Andrew Morton
@ 2004-07-09 12:43       ` Chris Mason
  2004-07-10  0:16       ` Andrea Arcangeli
  1 sibling, 0 replies; 15+ messages in thread
From: Chris Mason @ 2004-07-09 12:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andrea Arcangeli, linux-kernel, torvalds

On Fri, 2004-07-09 at 00:56, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > BTW, the new mpage code looks great,
> 
> You should have seen the first version!  But after all the bugs were fixed
> and the real world hit it, some spaghetti got in there.
> 
> > it's a pity that reiserfs and ext3 don't use it yet.
> 
> JFS, hfs, hfsplus and ext2 are using it.
> 
> Unfortunately it's hard to use mpage_writepages() even in ext3's writeback
> mode, because ext3_get_block() assumes that it is called with a transaction
> open.  Not impossible though I guess - use a different get_block() which
> opens a transaction for itself...  But only open it if the page isn't
> already mapped to disk.  (/me gets itchy fingers)

Thanks Andrea!  

The real problem for ext3 and reiserfs using writepages isn't the
transaction I thought, but the data=ordered buffer based writeback.  The
page lock and page writeback bit are critical to the writepages locking,
which we don't take when doing data=ordered writes.

Since we're fixing writepages, here's a much more minor fix.  When the
page has buffers, and fs blocksize is < then the page size, and file eof
falls in the second to last buffer in the page, we will goto
page_is_mapped, which doesn't zero the bytes past eof.

The fix is to move up page_is_mapped slightly.

-chris

Index: linux.t/fs/mpage.c
===================================================================
--- linux.t.orig/fs/mpage.c	2004-07-01 11:07:17.000000000 -0400
+++ linux.t/fs/mpage.c	2004-07-08 12:51:44.915892528 -0400
@@ -490,6 +489,8 @@ mpage_writepage(struct bio *bio, struct 
 
 	first_unmapped = page_block;
 
+page_is_mapped:
+
 	end_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
 	if (page->index >= end_index) {
 		unsigned offset = i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
@@ -503,8 +504,6 @@ mpage_writepage(struct bio *bio, struct 
 		kunmap_atomic(kaddr, KM_USER0);
 	}
 
-page_is_mapped:
-
 	/*
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-09  4:56     ` Andrew Morton
  2004-07-09 12:43       ` Chris Mason
@ 2004-07-10  0:16       ` Andrea Arcangeli
  2004-07-10  1:07         ` Andrea Arcangeli
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-10  0:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

On Thu, Jul 08, 2004 at 09:56:45PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > BTW, the new mpage code looks great,
> 
> You should have seen the first version!  But after all the bugs were fixed
> and the real world hit it, some spaghetti got in there.

fun ;)

> Unfortunately it's hard to use mpage_writepages() even in ext3's writeback
> mode, because ext3_get_block() assumes that it is called with a transaction
> open.  Not impossible though I guess - use a different get_block() which
> opens a transaction for itself...  But only open it if the page isn't
> already mapped to disk.  (/me gets itchy fingers)

it would be great to enable it. at least readpages already works.

Unfortunately I wasn't reproducing this bug (despite the bug was real),
infact it should trigger only near oom.

So I returning back into debugging, and  am I right this is the real fix
I was looking for? Think i_size == 4097, this buggy below code was used
to write only 1 block, but it should write _two_ of them. This also
explains why it only triggers on ia64 with page_size > 4k and not on x86 :).

I hope this time I'm done with it.

--- sles/fs/mpage.c.~1~	2004-07-09 23:48:33.233205496 +0200
+++ sles/fs/mpage.c	2004-07-10 02:11:59.922789800 +0200
@@ -460,7 +460,7 @@ mpage_writepage(struct bio *bio, struct 
 	 */
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (i_size_read(inode) - 1) >> blkbits;
+	last_block = (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits;
 	map_bh.b_page = page;
 	for (page_block = 0; page_block < blocks_per_page; ) {
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  0:16       ` Andrea Arcangeli
@ 2004-07-10  1:07         ` Andrea Arcangeli
  2004-07-10  4:30           ` Andrew Morton
  2004-07-10  4:59           ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-10  1:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

On Sat, Jul 10, 2004 at 02:16:00AM +0200, Andrea Arcangeli wrote:
> I hope this time I'm done with it.

I'm not after more thinking it seems the below is not a bug.

I'm lost since I can't find bugs anymore in this function but I need to
find something.

> 
> --- sles/fs/mpage.c.~1~	2004-07-09 23:48:33.233205496 +0200
> +++ sles/fs/mpage.c	2004-07-10 02:11:59.922789800 +0200
> @@ -460,7 +460,7 @@ mpage_writepage(struct bio *bio, struct 
>  	 */
>  	BUG_ON(!PageUptodate(page));
>  	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
> -	last_block = (i_size_read(inode) - 1) >> blkbits;
> +	last_block = (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits;
>  	map_bh.b_page = page;
>  	for (page_block = 0; page_block < blocks_per_page; ) {
>  

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  1:07         ` Andrea Arcangeli
@ 2004-07-10  4:30           ` Andrew Morton
  2004-07-10  4:59           ` Andrea Arcangeli
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2004-07-10  4:30 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, torvalds, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
> On Sat, Jul 10, 2004 at 02:16:00AM +0200, Andrea Arcangeli wrote:
> > I hope this time I'm done with it.
> 
> I'm not after more thinking it seems the below is not a bug.

Yes, I was scratching my head.  If that code was wrong, fsx-linux on
1k-blocksize ext2-nobh with memory pressure wouldn't last long.

> I'm lost since I can't find bugs anymore in this function but I need to
> find something.

What is the failure scenario?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  1:07         ` Andrea Arcangeli
  2004-07-10  4:30           ` Andrew Morton
@ 2004-07-10  4:59           ` Andrea Arcangeli
  2004-07-10  5:56             ` Andrew Morton
  1 sibling, 1 reply; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-10  4:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

On Sat, Jul 10, 2004 at 03:07:38AM +0200, Andrea Arcangeli wrote:
> On Sat, Jul 10, 2004 at 02:16:00AM +0200, Andrea Arcangeli wrote:
> > I hope this time I'm done with it.
> 
> I'm not after more thinking it seems the below is not a bug.
> 
> I'm lost since I can't find bugs anymore in this function but I need to
> find something.

Ok, this is the last try (at least for this week ;), if this doesn't fix
anything too then I giveup and I will reasonably start to suspect/blame
on the compiler early next week ;).

I believe reading the i_size from memory multiple times can generate fs
corruption. The "offset" and the "end_index" were not coherent. this is
writepages and it runs w/o the i_sem, so the i_size can change from
under us anytime. If a parallel write happens while writepages run, the
i_size could advance from 4095 to 4100. With the current 2.6 code that
could translate in end_index = 0 and offset = 4. That's broken because
end_index and offset could be not coherent. Either end_index=1 and
offset =4, or end_index = 0 and offset = 4095. When they lose coherency
the memset can zeroout actual data. The below patch fixes that (it's at
least a theoretical bug).

I don't really expect this tiny race to fix the bug in practice after the
more serious bugs we covered yesterday didn't fix it (more likely the
compiler will get involved into the equation soon ;).

This is also an optimization for 32bit archs that needs special locking
to read 64bit i_size coherenty.

This also includes Chris's fix (which looks valid to me), and it
includes my yesterday's fix for the memory-pressure on the mempool bio
allocator with GFP_NOFS.

--- sles/fs/mpage.c.~1~	2004-07-09 23:48:33.233205496 +0200
+++ sles/fs/mpage.c	2004-07-10 06:32:24.784449776 +0200
@@ -404,6 +404,7 @@ mpage_writepage(struct bio *bio, struct 
 	struct block_device *boundary_bdev = NULL;
 	int length;
 	struct buffer_head map_bh;
+	loff_t i_size = i_size_read(inode);

 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -460,7 +461,7 @@ mpage_writepage(struct bio *bio, struct 
 	 */
 	BUG_ON(!PageUptodate(page));
 	block_in_file = page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (i_size_read(inode) - 1) >> blkbits;
+	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_page = page;
 	for (page_block = 0; page_block < blocks_per_page; ) {

@@ -490,9 +491,11 @@ mpage_writepage(struct bio *bio, struct 

 	first_unmapped = page_block;

-	end_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+page_is_mapped:
+
+	end_index = i_size >> PAGE_CACHE_SHIFT;
 	if (page->index >= end_index) {
-		unsigned offset = i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
+		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
 		char *kaddr;

 		if (page->index > end_index || !offset)
@@ -503,8 +506,6 @@ mpage_writepage(struct bio *bio, struct 
 		kunmap_atomic(kaddr, KM_USER0);
 	}

-page_is_mapped:
-
 	/*
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
@@ -519,6 +520,12 @@ alloc_new:
 			goto confused;
 	}

+	length = first_unmapped << blkbits;
+	if (bio_add_page(bio, page, length, 0) < length) {
+		bio = mpage_bio_submit(WRITE, bio);
+		goto alloc_new;
+	}
+
 	/*
 	 * OK, we have our BIO, so we can now mark the buffers clean.  Make
 	 * sure to only clean buffers which we know we'll be writing.
@@ -539,12 +546,6 @@ alloc_new:
 			try_to_free_buffers(page);
 	}

-	length = first_unmapped << blkbits;
-	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(WRITE, bio);
-		goto alloc_new;
-	}
-
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
 	unlock_page(page);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  4:59           ` Andrea Arcangeli
@ 2004-07-10  5:56             ` Andrew Morton
  2004-07-10  6:11               ` Andrea Arcangeli
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2004-07-10  5:56 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, torvalds, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
> +page_is_mapped:
>  +
>  +	end_index = i_size >> PAGE_CACHE_SHIFT;
>   	if (page->index >= end_index) {
>  -		unsigned offset = i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
>  +		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
>   		char *kaddr;
>   
>   		if (page->index > end_index || !offset)
>  @@ -503,8 +506,6 @@ mpage_writepage(struct bio *bio, struct 
>   		kunmap_atomic(kaddr, KM_USER0);
>   	}
>   
>  -page_is_mapped:

What's the thinking behind moving the page_is_mapped label here?

We've established that we have found `first_unmapped' number of uptodate
and dirty buffers at the "front" of the page, and we're about to stick
(first_unmapped<<blkbits) bytes of this page into the BIO for writeout. 
Hence everything which will go into the BIO is known to be uptodate and
dirty.  So I'm wondering why this change was made.

The change is correct, though.  It prevents us from writing non-zero data
between i_size and the end of the final bh to the file. 
block_write_full_page() does it too:

	/*
	 * The page straddles i_size.  It must be zeroed out on each and every
	 * writepage invokation because it may be mmapped.  "A file is mapped
	 * in multiples of the page size.  For a file that is not a multiple of
	 * the  page size, the remaining memory is zeroed when mapped, and
	 * writes to that region are not written out to the file."
	 */

(Note that this is a "best effort" thing - userspace could still write
non-zero data into the mmapped page outside i_size even while I/O is in
flight.  Can't do much about that).

But was this the reason for you making this change?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  5:56             ` Andrew Morton
@ 2004-07-10  6:11               ` Andrea Arcangeli
  2004-07-10  6:13                 ` Andrew Morton
  2004-07-14 16:14                 ` Andrea Arcangeli
  0 siblings, 2 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-10  6:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

On Fri, Jul 09, 2004 at 10:56:34PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > +page_is_mapped:
> >  +
> >  +	end_index = i_size >> PAGE_CACHE_SHIFT;
> >   	if (page->index >= end_index) {
> >  -		unsigned offset = i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
> >  +		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
> >   		char *kaddr;
> >   
> >   		if (page->index > end_index || !offset)
> >  @@ -503,8 +506,6 @@ mpage_writepage(struct bio *bio, struct 
> >   		kunmap_atomic(kaddr, KM_USER0);
> >   	}
> >   
> >  -page_is_mapped:
> 
> What's the thinking behind moving the page_is_mapped label here?
> 
> We've established that we have found `first_unmapped' number of uptodate
> and dirty buffers at the "front" of the page, and we're about to stick
> (first_unmapped<<blkbits) bytes of this page into the BIO for writeout. 
> Hence everything which will go into the BIO is known to be uptodate and
> dirty.  So I'm wondering why this change was made.
> 
> 
> The change is correct, though.  It prevents us from writing non-zero data
> between i_size and the end of the final bh to the file. 
> block_write_full_page() does it too:

not sure to understand. The change is exactly to prevents us from
writing non-zero data between i_size and the end of the final bh. 

> (Note that this is a "best effort" thing - userspace could still write

sure I understand this is a best effort. I infact wanted to suggesting
that we could do it only once maybe (it's not guaranteed anyways, it's
up to userspace to do guarantee it, kernel cannot).

> But was this the reason for you making this change?

Just because block_write_full_page is doing it, so I think Chris was
correct making that change and moving the label. Either we remove it
from both, or we do it in both places. If we remove it we've just to be
careful on how to deal with the _first_ clearing. the prepare_write is
already ok, if the buffer is uptodate then we know we can write it. the
non-bh case needs to be checked and I belive that's the reason you were
doing the thing only for the non-bh case.

the only real fixes are the other two ones, the page_is_mapped move is
just because I agree with Chris it's more correct to do the same thing
as block_write_full_page to get the very same behaviour from both.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  6:11               ` Andrea Arcangeli
@ 2004-07-10  6:13                 ` Andrew Morton
  2004-07-14 16:14                 ` Andrea Arcangeli
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2004-07-10  6:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, torvalds, mason

Andrea Arcangeli <andrea@suse.de> wrote:
>
>  the only real fixes are the other two ones, the page_is_mapped move is
>  just because I agree with Chris it's more correct to do the same thing
>  as block_write_full_page to get the very same behaviour from both.

OK, cool.  I was just making sure I wasn't missing something ;)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: writepage fs corruption fixes
  2004-07-10  6:11               ` Andrea Arcangeli
  2004-07-10  6:13                 ` Andrew Morton
@ 2004-07-14 16:14                 ` Andrea Arcangeli
  1 sibling, 0 replies; 15+ messages in thread
From: Andrea Arcangeli @ 2004-07-14 16:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, torvalds, mason

JFYI, the two writepages silent-fs-corruption fixes I posted in this
thread are reported to fix the ext2 fs corruption after 24 hours into
the test.  Need to pass 72 hours before it's officially fixed though
(only 1/3 is passed so far). Chris's memset(0) correctness fix is also
applied.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2004-07-14 16:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-09  4:01 writepage fs corruption fixes Andrea Arcangeli
2004-07-09  4:06 ` Andrea Arcangeli
2004-07-09  4:19   ` Andrea Arcangeli
2004-07-09  4:29 ` Andrew Morton
2004-07-09  4:42   ` Andrea Arcangeli
2004-07-09  4:56     ` Andrew Morton
2004-07-09 12:43       ` Chris Mason
2004-07-10  0:16       ` Andrea Arcangeli
2004-07-10  1:07         ` Andrea Arcangeli
2004-07-10  4:30           ` Andrew Morton
2004-07-10  4:59           ` Andrea Arcangeli
2004-07-10  5:56             ` Andrew Morton
2004-07-10  6:11               ` Andrea Arcangeli
2004-07-10  6:13                 ` Andrew Morton
2004-07-14 16:14                 ` Andrea Arcangeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.