From: Anton Altaparmakov <aia21@cam.ac.uk>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>,
Linux Kernel <linux-kernel@vger.kernel.org>,
Linux Filesystems <linux-fsdevel@vger.kernel.org>,
Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [patch 9/9] mm: fix pagecache write deadlocks
Date: Sun, 4 Feb 2007 17:40:35 +0000 (GMT) [thread overview]
Message-ID: <Pine.LNX.4.64.0702041737110.19190@hermes-1.csi.cam.ac.uk> (raw)
In-Reply-To: <20070204031039.46b56dbb.akpm@linux-foundation.org>
On Sun, 4 Feb 2007, Andrew Morton wrote:
> On Sun, 4 Feb 2007 10:59:58 +0000 (GMT) Anton Altaparmakov <aia21@cam.ac.uk> wrote:
> > On Sun, 4 Feb 2007, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 09:51:07 +0100 (CET) Nick Piggin <npiggin@suse.de> wrote:
> > > > 2. If we find the destination page is non uptodate, unlock it (this could be
> > > > made slightly more optimal), then find and pin the source page with
> > > > get_user_pages. Relock the destination page and continue with the copy.
> > > > However, instead of a usercopy (which might take a fault), copy the data
> > > > via the kernel address space.
> > >
> > > argh. We just can't go adding all this gunk into the write() path.
> > >
> > > mmap_sem, a full pte-walk, taking of pte-page locks, etc. For every page.
> > > Even single-process write() will suffer, let along multithreaded stuff,
> > > where mmap_sem contention may be the bigger problem.
> > >
> > > I was going to do some quick measurements of this, but the code oopses
> > > on power4 (http://userweb.kernel.org/~akpm/s5000402.jpg)
> > >
> > > We need to think different.
> >
> > How about leaving the existing code with the following minor
> > modifications:
> >
> > Instead of calling filemap_copy_from_user{,_iovec}() do only the atomic
> > bit with pagefaults disabled, i.e. instead of filemap_copy_from_user() we
> > would do (could of course move into a helper function of course):
> >
> > pagefault_disable()
> > kaddr = kmap_atomic(page, KM_USER0);
> > left = __copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
> > kunmap_atomic(kaddr, KM_USER0);
> > pagefault_enable()
> >
> > if (unlikely(left)) {
> > /* The user space page got unmapped before we got to copy it. */
> > ...
> > }
> >
> > Thus the 99.999% (or more!) of the time the code would just work as it
> > always has and there is no bug and no speed impact. Only in the very rare
> > and hard to trigger race condition that the user space page after being
> > faulted in got thrown out again before we did the atomic memory copy do we
> > run into the above "..." code path.
>
> Right. And what I wanted to do here is to zero out the uncopied part of
> the page (if it wasn't uptodate), then run commit_write(), then retry the
> whole thing.
>
> iirc, we ruled that out because those temporary zeroes are exposed to
> userspace. But the kernel used to do that anyway for a long time (years)
> until someone noticed, and we'll only do it in your 0.0001% case anyway.
>
> (Actually, perhaps we can prevent it by not marking the page uptodate in
> this case. But that'll cause a read()er to try to bring it uptodate...)
My thinking was not marking the page uptodate. But yes that causes the
problem of a concurrent readpage reading uninitialized disk blocks that
prepare_write allocated.
> > I would propose to call out a different function altogether which could do
> > a multitude of things including drop the lock on the destination page
> > (maintaining a reference on the page!), allocate a temporary page, copy
> > from the user space page into the temporary page, then lock the
> > destination page again, and copy from the temporary page into the
> > destination page.
>
> The problem with all these things is that as soon as we unlock the page,
> it's visible to read(). And in fact, as soon as we mark it uptodate it's
> visible to mmap, even if it's still locked.
Yes we definitely cannot mark it uptodate before it really is uptodate.
Either we have to not mark it uptodate or we have to zero it or we have to
think of some other magic...
> > This would be slow but who cares given it would only happen incredibly
> > rarely and on majority of machines it would never happen at all.
> >
> > The only potential problem I can see is that the destination page could be
> > truncated whilst it is unlocked. I can see two possible solutions to
> > this:
>
> truncate's OK: we're holding i_mutex.
How about excluding readpage() (in addition to truncate if Nick is right
and some cases of truncate do not hold i_mutex) with an extra page flag as
I proposed for truncate exclusion? Then it would not matter that
prepare_write might have allocated blocks and might expose stale data.
It would go to sleep and wait on the bit to be cleared instead of trying
to bring the page uptodate. It can then lock the page and either find it
uptodate (because commit_write did it) or not and then bring it uptodate.
Then we could safely fault in the page, copy from it into a temporary
page, then lock the destination page again and copy into it.
This is getting more involved as a patch again... )-: But at least it
does not affect the common case except for having to check the new page
flag in every readpage() and truncate() call. But at least the checks
could be with an "if (unlikely(newpageflag()))" so should not be too bad.
Have I missed anything this time?
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
next prev parent reply other threads:[~2007-02-04 17:40 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-02-04 8:49 [patch 0/9] buffered write deadlock fix Nick Piggin
2007-02-04 8:49 ` [patch 1/9] fs: libfs buffered write leak fix Nick Piggin
2007-02-04 8:50 ` [patch 2/9] mm: revert "generic_file_buffered_write(): handle zero length iovec segments" Nick Piggin
2007-02-04 8:50 ` [patch 3/9] mm: revert "generic_file_buffered_write(): deadlock on vectored write" Nick Piggin
2007-02-04 8:50 ` [patch 4/9] mm: generic_file_buffered_write cleanup Nick Piggin
2007-02-04 8:50 ` [patch 5/9] mm: debug write deadlocks Nick Piggin
2007-02-04 8:50 ` [patch 6/9] mm: be sure to trim blocks Nick Piggin
2007-02-04 8:50 ` [patch 7/9] mm: cleanup pagecache insertion operations Nick Piggin
2007-02-04 8:50 ` [patch 8/9] mm: generic_file_buffered_write iovec cleanup Nick Piggin
2007-02-04 8:51 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-02-04 9:44 ` Andrew Morton
2007-02-04 10:15 ` Nick Piggin
2007-02-04 10:26 ` Christoph Hellwig
2007-02-04 10:30 ` Andrew Morton
2007-02-04 10:46 ` Nick Piggin
2007-02-04 10:50 ` Nick Piggin
2007-02-04 10:56 ` Andrew Morton
2007-02-04 11:03 ` Nick Piggin
2007-02-04 11:15 ` Andrew Morton
2007-02-04 15:10 ` Nick Piggin
2007-02-04 18:36 ` Andrew Morton
2007-02-06 2:25 ` Nick Piggin
2007-02-06 4:41 ` Nick Piggin
2007-02-06 5:30 ` Andrew Morton
2007-02-06 5:49 ` Nick Piggin
2007-02-06 5:53 ` Nick Piggin
2007-02-04 10:59 ` Anton Altaparmakov
2007-02-04 11:10 ` Andrew Morton
2007-02-04 11:22 ` Nick Piggin
2007-02-04 17:40 ` Anton Altaparmakov [this message]
2007-02-06 2:09 ` Nick Piggin
2007-02-06 13:13 ` Anton Altaparmakov
-- strict thread matches above, loose matches on Subject: below --
2007-01-29 10:31 [patch 0/9] buffered write deadlock fix Nick Piggin
2007-01-29 10:33 ` [patch 9/9] mm: fix pagecache write deadlocks Nick Piggin
2007-01-29 11:11 ` Nick Piggin
2007-02-02 23:53 ` Andrew Morton
2007-02-03 1:38 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0702041737110.19190@hermes-1.csi.cam.ac.uk \
--to=aia21@cam.ac.uk \
--cc=akpm@linux-foundation.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).