From: Andrew Morton <akpm@linux-foundation.org>
To: Nick Piggin <npiggin@suse.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
	Linux Filesystems <linux-fsdevel@vger.kernel.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [patch 9/9] mm: fix pagecache write deadlocks
Date: Mon, 5 Feb 2007 21:30:06 -0800	[thread overview]
Message-ID: <20070205213006.0ea2d918.akpm@linux-foundation.org> (raw)
In-Reply-To: <20070206044146.GA11856@wotan.suse.de>

On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@suse.de> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@suse.de> wrote:
> > > 
> > > > They're not likely to hit the deadlocks, either. Probability gets more
> > > > likely after my patch to lock the page in the fault path. But practically,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > > 
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> > 
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here.  The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
> 
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
> 

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the tail end of the copy_from_user() on fault is actually incorrect.
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them.  ->readpage() does that.

So...  what happens if we do

	lock_page()
	prepare_write()
	if (copy_from_user_atomic()) {	/* faulted: bytes left uncopied */
		readpage()		/* brings the page uptodate, but unlocks it */
		wait_on_page()
		lock_page()		/* must relock before committing */
	}
	commit_write()
	unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
  not uptodate.  In the former case, don't call readpage at all.  In the
  latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
  and will bring the non-uptodate ones up to date from disk.

  Some of the data which we copied from userspace may get overwritten
  from backing store, but that's OK.

seems crazy, but it's close.  We do have the minor problem that readpage
went and unlocked the page so we need to relock it.  I bet there are holes
in there.




Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user().  If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
  from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
  same page, we know that we've locked it, so nobody will be able to unmap
  it while we're copying from it.
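
As an illustrative sketch (kernel-style pseudocode, not a real patch; the
probe has to be a non-faulting, atomic access since we hold the page lock):

	char c;
again:
	lock_page(page);
	/*
	 * Probe the source with an atomic get_user().  If the source
	 * happens to be this same, now-locked page, nobody can unmap
	 * it while we copy from it.
	 */
	if (get_user(c, buf)) {
		/* Would fault: drop the lock and try again. */
		unlock_page(page);
		goto again;
	}
	prepare_write();
	copy_from_user_atomic();
	commit_write();
	unlock_page(page);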

Close, but no cigar!  This is still vulnerable to Hugh's ab/ba deadlock
scenario.


btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic.  Are you doing that in the
implementations which you've been working on?  I forget...
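
As a rough sketch, that prefault pass would look something like this
(names follow the writev() path; note that nothing pins the user pages,
so the later atomic copy can still fault and a fallback is still needed):

	for (seg = 0; seg < nr_segs; seg++) {
		const char __user *p = iov[seg].iov_base;
		const char __user *end = p + iov[seg].iov_len;
		volatile char c;

		/* touch one byte in each user page backing this segment */
		for (; p < end; p += PAGE_SIZE)
			if (__get_user(c, p))
				return -EFAULT;
		if (iov[seg].iov_len && __get_user(c, end - 1))
			return -EFAULT;
	}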

