From: Andrew Morton <akpm@linux-foundation.org>
To: Nick Piggin <npiggin@suse.de>
Cc: Sage Weil <sage@newdream.net>,
Trond Myklebust <trond.myklebust@fys.uio.no>,
linux-fsdevel@vger.kernel.org,
Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [patch] mm: close page_mkwrite races (try 3)
Date: Wed, 15 Apr 2009 18:38:47 -0700 [thread overview]
Message-ID: <20090415183847.d4fa1efb.akpm@linux-foundation.org> (raw)
In-Reply-To: <20090415082507.GA23674@wotan.suse.de>
On Wed, 15 Apr 2009 10:25:07 +0200 Nick Piggin <npiggin@suse.de> wrote:
> OK, that one had some rough patches (in the changelog and the patch itself).
> One more try... if it's still misunderstandable, I give up :)
>
> --
>
> Change page_mkwrite to allow implementations to return with the page locked,
> and also change it's callers (in page fault paths) to hold the lock until the
> page is marked dirty. This allows the filesystem to have full control of page
> dirtying events coming from the VM.
>
> Rather than simply hold the page locked over the page_mkwrite call, we call
> page_mkwrite with the page unlocked and allow callers to return with it locked,
> so filesystems can avoid LOR conditions with page lock.
All right, I give up. What's LOR?
> The problem with the current scheme is this: a filesystem that wants to
> associate some metadata with a page as long as the page is dirty, will perform
> this manipulation in its ->page_mkwrite. It currently then must return with the
> page unlocked and may not hold any other locks (according to existing
> page_mkwrite convention).
>
> In this window, the VM could write out the page, clearing page-dirty. The
> filesystem has no good way to detect that a dirty pte is about to be attached,
> so it will happily write out the page, at which point, the filesystem may
> manipulate the metadata to reflect that the page is no longer dirty.
>
> It is not always possible to perform the required metadata manipulation in
> ->set_page_dirty, because that function cannot block or fail. The filesystem
> may need to allocate some data structure, for example.
>
> And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite
> is allowed to fail, so we must not allow any window where the page could be
> written to if page_mkwrite does fail.
>
> This solution of holding the page locked over the 3 critical operations
> (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
> closes out races nicely, preventing page cleaning for writeout being initiated
> in that window. This provides the filesystem with a strong synchronisation
> against the VM here.
>
> - Sage needs this race closed for ceph filesystem.
> - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
I wonder which kernel version(s) we should put this in.
Going BUG isn't nice, but that report is against 2.6.27. Is the BUG
super-rare, or did we avoid it via other means, or what?
> - I need it for fsblock.
> - I suspect other filesystems may need it too (eg. btrfs).
> - I have converted buffer.c to the new locking. Even simple block allocation
> under dirty pages might be susceptible to i_size changing under partial page
> at the end of file (we also have a buffer.c-side problem here, but it cannot
> be fixed properly without this patch).
> - Other filesystems (eg. NFS, maybe btrfs) will need to change their
> page_mkwrite functions themselves.
>
> [ This also moves page_mkwrite another step closer to fault, which should
> eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
> filesystem calldown and page lock/unlock cycle in __do_fault. ]
>
>
> ...
>
> @@ -1980,9 +1989,11 @@ static int do_wp_page(struct mm_struct *
> */
> page_table = pte_offset_map_lock(mm, pmd, address,
> &ptl);
> - page_cache_release(old_page);
> - if (!pte_same(*page_table, orig_pte))
> + if (!pte_same(*page_table, orig_pte)) {
> + unlock_page(old_page);
> + page_cache_release(old_page);
> goto unlock;
> + }
>
> page_mkwrite = 1;
> }
> @@ -2105,16 +2116,31 @@ unlock:
> *
> * do_no_page is protected similarly.
> */
> - wait_on_page_locked(dirty_page);
> - set_page_dirty_balance(dirty_page, page_mkwrite);
> + if (!page_mkwrite) {
> + wait_on_page_locked(dirty_page);
> + set_page_dirty_balance(dirty_page, page_mkwrite);
> + }
> put_page(dirty_page);
> + if (page_mkwrite) {
> + struct address_space *mapping = dirty_page->mapping;
> +
> + set_page_dirty(dirty_page);
> + unlock_page(dirty_page);
> + page_cache_release(dirty_page);
> + balance_dirty_pages_ratelimited(mapping);
hm. I wonder what prevents (prevented) *mapping from vanishing under
our feet here.
> + }
> }
> return ret;
> oom_free_new:
> page_cache_release(new_page);
> oom:
> - if (old_page)
> + if (old_page) {
> + if (page_mkwrite) {
> + unlock_page(old_page);
> + page_cache_release(old_page);
> + }
> page_cache_release(old_page);
> + }
> return VM_FAULT_OOM;
>
>
> ...
>
> @@ -2736,19 +2757,29 @@ static int __do_fault(struct mm_struct *
> pte_unmap_unlock(page_table, ptl);
>
> out:
> - unlock_page(vmf.page);
> -out_unlocked:
> - if (anon)
> - page_cache_release(vmf.page);
> - else if (dirty_page) {
> + if (dirty_page) {
> + struct address_space *mapping = page->mapping;
> +
> if (vma->vm_file)
> file_update_time(vma->vm_file);
>
> - set_page_dirty_balance(dirty_page, page_mkwrite);
> + if (set_page_dirty(dirty_page))
> + page_mkwrite = 1;
> + unlock_page(dirty_page);
> put_page(dirty_page);
> + if (page_mkwrite)
> + balance_dirty_pages_ratelimited(mapping);
> + } else {
> + unlock_page(vmf.page);
> + if (anon)
> + page_cache_release(vmf.page);
> }
>
> return ret;
> +
> +unwritable_page:
> + page_cache_release(page);
> + return ret;
> }
Whoa. Running file_update_time() under lock_page() opens a whole can
of worms, doesn't it? That thing can do journal commits and all sorts
of stuff. And I don't think this ordering is necessary here?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-04-16 1:38 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-04-14 7:11 [patch] mm: close page_mkwrite races (try 2) Nick Piggin
2009-04-15 8:25 ` [patch] mm: close page_mkwrite races (try 3) Nick Piggin
2009-04-16 1:38 ` Andrew Morton [this message]
2009-04-16 2:03 ` Nick Piggin
2009-04-16 2:23 ` Nick Piggin
2009-04-28 18:57 ` Ravikiran G Thirumalai
2009-04-29 7:12 ` Nick Piggin
2009-04-29 7:24 ` Andrew Morton
2009-04-29 7:45 ` Nick Piggin
2009-04-29 12:39 ` Trond Myklebust
2009-04-29 15:27 ` Andrew Morton
2009-04-29 16:45 ` Trond Myklebust
2009-04-30 14:22 ` Rince
2009-04-30 14:33 ` Rince
2009-05-02 22:12 ` Rince
2009-05-03 1:38 ` Trond Myklebust
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090415183847.d4fa1efb.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
--cc=sage@newdream.net \
--cc=trond.myklebust@fys.uio.no \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).