From: Andrew Morton <akpm@zip.com.au>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Linus Torvalds <torvalds@transmeta.com>,
torrey.hoffman@myrio.com, linux-kernel@vger.kernel.org,
Alexander Viro <viro@math.psu.edu>,
Marcelo Tosatti <marcelo@conectiva.com.br>,
"Stephen C. Tweedie" <sct@redhat.com>
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes
Date: Sat, 29 Dec 2001 22:19:52 -0800 [thread overview]
Message-ID: <3C2EB208.B2BA7CBF@zip.com.au> (raw)
In-Reply-To: <D52B19A7284D32459CF20D579C4B0C0211CB0F@mail0.myrio.com> <200112201946.fBKJkNw01262@penguin.transmeta.com> <20011221004251.K1477@athlon.random>, <20011221004251.K1477@athlon.random>; <20011221024910.L1477@athlon.random> <3C22CF16.C78B1F19@zip.com.au>, <3C22CF16.C78B1F19@zip.com.au>; from akpm@zip.com.au on Thu, Dec 20, 2001 at 09:56:40PM -0800 <20011229164056.H1356@athlon.random>
[ this thread started at http://www.lib.uaa.alaska.edu/linux-kernel/archive/2001-Week-51/1001.html ]
Andrea Arcangeli wrote:
>
> On Thu, Dec 20, 2001 at 09:56:40PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > ...
> > > actually while testing it I unfortunately found also an fs corruption
> > > bug in the ->prepare_write/commit_write/writepage/direct_IO callbacks.
> > > It's a longstanding one, since get_block born. In short, if get_block
> > > fails while mapping on a certain page during
> > > prepare_write/writepage/direct_IO (like it can happen with a full
> > > filesystem, incidentally ext2 on /dev/ram0 during my testing that is
> > > only 4M and so it overflows fast), the blocks before the ENOSPC failure
> > > will be allocated, but the i_size won't be update accordingly (no commit
> > > write, because prepare_write failed in the middle). for the
> > > generic_file_write users it's easily fixable with an hack (truncating
> > > backwards because we don't know how far we got allocating blocks when we
> > > return from prepare_write), similar hack for the direct_IO one (that
> > > commits only once at the end in function of the direct_IO generated).
> > > But for writepage is quite a pain, infact I believe the writepage blocks
> > > should be reserved in-core, to guarantee we will never fail a truncate
> > > with ENOSPC. With the shared mappings we're effectively doing allocate
> > > on flush... but with the lack of space reservation that makes it
> > > unreliable :)
> >
> > The -ac kernels handled this by zeroing out the accidentally-allocated
> > blocks at __block_prepare_write.
>
> actually my fix seems cleaner because it puts the "clearing" in one single
> place.
I think so too. Even for ext3, which has a very complex truncate,
it appears to be OK.
> > > So for now I did an hack to cure the other two (writepage can still
> > > corrupt the fs as said). I think the right fix (ala 2.5) is to change
> > > the API so we can use the last blocks too, but the below will cure 2.4
> > > and for writepage the right fix IMHO is to do the reservation of the
> > > space.
> >
> > This is better in a way because it reclaims the eztra few blocks. But
> > the -ac approach also works for writepage().
>
> yes, it can solve the metadata corruption (assuming the locking is
Where can metadata corruption occur? A few extra blocks outside
i_size for ext2 directories isn't going to cause corruption, is it?
Or are you referring to i_blocks accounting being incorrect?
> right, I can as well call ->truncate within writepage but it's not
> obvious at all that it won't race because we don't hold the i_sem within
> writepage), but the data corruption still holds.
For sure - holding i_sem on truncate is abolutely required.
> I mean, there's no
> failure path to notify userspace that a certain page fault is writing
> into a page over an hole, that we don't have space to later allocate on
> disk. so to me it sounds like MAP_SHARED should preallocate the space of
> the holes so you will know that the writes into the MAP_SHARED segments
> won't be lost (current state of things will lead to silent corruption
> and pinned dirty pages in ram, aka broken allocate on flush like
> previously said).
Um. How does this differ from an I/O error on flush?
Would it be necessary to preallocate the holes at mmap() time? Mad
hand-waving: Could we not perform the instantiation at pagefault time,
and give the caller SIGBUS if we cannot allocate the blocks? Or if
there's an IO error, or quota exceeded.
> > Why was that code not brought across?
>
> Who developed that code? Can the author of the code forward port it to
> 2.4.18pre and post a patch to the list so we can review? thanks,
> Avoiding the matadata corruption would be a good start at least for 2.4,
> then we should just focus on the writepage locking that could race with the
> other "create=1" get_blocks. If it doesn't race I will certainly agree
> on that approch for 2.4.
>
It appeared in 2.4.2-ac25, and it looks like sct was the author:
o Fix higmem block_prepare_write crash (Stephen Tweedie)
Which is interesting - from the changelog it looks like he was
fixing a different problem! I always though that code was there
to prevent leakage of stale blocks. Stephen?
A 2.4.18-pre1 version is below. It compiles, but I haven't actually
exercised that code path.
It's not a pretty fix, IMO. It leaves dangling blocks outside i_size
which will make fsck unhappy. It also breaks ext3 a little bit -
those blocks should be journalled and in theory we'd need to
clone off private copies of __block_prepare_write() and
unmap_underlying_metadata() to do this. Which would be irritating,
but not the end of the world. (Should have done this in the -ac
version of ext3, but I never noticed it).
--- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Dec 29 21:53:46 2001
@@ -1639,6 +1639,17 @@ static int __block_prepare_write(struct
}
return 0;
out:
+ bh = head;
+ block_start = 0;
+ do {
+ if (buffer_new(bh) && !buffer_uptodate(bh)) {
+ memset(kaddr+block_start, 0, bh->b_size);
+ set_bit(BH_Uptodate, &bh->b_state);
+ mark_buffer_dirty(bh);
+ }
+ block_start += bh->b_size;
+ bh = bh->b_this_page;
+ } while (bh != head);
return err;
}
Question: can someone please define BH_New? Its lifecycle seems
very vague. We never actually seem to *clear* it anywhere for
ext2, and it appears that the kernel will keep on treating a
clearly non-new buffer as "new" all the time. ext3 explicitly
clears BH_New in get_block(), if it finds the block was already
present in the file. I did this because we need the newness
info for internal purposes.
-
next prev parent reply other threads:[~2001-12-30 6:23 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2001-12-20 19:06 ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes Torrey Hoffman
2001-12-20 19:46 ` Linus Torvalds
2001-12-20 22:56 ` Alexander Viro
2001-12-20 23:42 ` Andrea Arcangeli
2001-12-21 1:49 ` Andrea Arcangeli
[not found] ` <3C22CF16.C78B1F19@zip.com.au>
2001-12-29 15:40 ` Andrea Arcangeli
2001-12-30 6:19 ` Andrew Morton [this message]
2001-12-30 6:33 ` Alexander Viro
2001-12-30 6:38 ` Andrew Morton
2001-12-30 7:17 ` Alexander Viro
2001-12-30 10:15 ` ramdisk corruption problems - was: RE: pivot_root and initrd Alan Cox
2001-12-31 0:08 ` ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes Andrea Arcangeli
2001-12-30 7:08 ` Andrew Morton
2001-12-30 7:29 ` Alexander Viro
2001-12-30 7:59 ` ramdisk corruption problems - was: RE: pivot_root and initrdkern " Andrew Morton
2001-12-30 17:40 ` Linus Torvalds
2001-12-31 0:28 ` Andrea Arcangeli
2001-12-31 0:35 ` Linus Torvalds
2001-12-31 1:00 ` Andrea Arcangeli
2001-12-31 0:05 ` ramdisk corruption problems - was: RE: pivot_root and initrd kern " Andrea Arcangeli
2002-01-05 11:43 ` Andrew Morton
2002-01-05 14:04 ` Trond Myklebust
2002-01-07 3:08 ` Andrea Arcangeli
2002-01-07 3:49 ` Andrew Morton
2002-01-07 4:31 ` Andrea Arcangeli
2001-12-30 23:56 ` Andrea Arcangeli
2001-12-31 10:06 ` Daniel Phillips
2002-01-04 16:38 ` Stephen C. Tweedie
2002-01-05 7:53 ` Andrew Morton
2002-01-07 1:08 ` Andrea Arcangeli
2001-12-21 1:38 ` Tachino Nobuhiro
2001-12-21 1:51 ` Everyone else but TWO Andre Hedrick
[not found] <Pine.GSO.4.21.0112210151020.15555-100000@weyl.math.psu.edu>
2001-12-21 23:11 ` ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes Linus Torvalds
2001-12-21 23:39 ` Alexander Viro
-- strict thread matches above, loose matches on Subject: below --
2001-12-18 20:14 Torrey Hoffman
2001-12-20 12:19 ` Tachino Nobuhiro
2001-12-18 1:44 Torrey Hoffman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3C2EB208.B2BA7CBF@zip.com.au \
--to=akpm@zip.com.au \
--cc=andrea@suse.de \
--cc=linux-kernel@vger.kernel.org \
--cc=marcelo@conectiva.com.br \
--cc=sct@redhat.com \
--cc=torrey.hoffman@myrio.com \
--cc=torvalds@transmeta.com \
--cc=viro@math.psu.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox