From: Linus Torvalds <torvalds@linux-foundation.org>
To: Hugh Dickins <hugh@veritas.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Nick Piggin <npiggin@suse.de>,
Linux Memory Management List <linux-mm@kvack.org>
Subject: Re: [patch] mm: fix anon_vma races
Date: Mon, 20 Oct 2008 08:17:35 -0700 (PDT) [thread overview]
Message-ID: <alpine.LFD.2.00.0810200742300.3518@nehalem.linux-foundation.org> (raw)
In-Reply-To: <Pine.LNX.4.64.0810200427270.5543@blonde.site>
On Mon, 20 Oct 2008, Hugh Dickins wrote:
>
> When you say "to the point where we don't need to care about anything
> else", are you there agreeing with Nick that your smp_wmb() and
> smp_read_barrier_depends() are no longer needed?
Yes. The anon_vma only has two fields: the spinlock itself, and the list.
And with all list allocations being inside the spinlock, and with the
spinlock itself being a memory barrier, I'm now convinced that the worry
about memory ordering was unnecessary.
Well, not unnecessary, because I think the discussion was good, and I
think we fixed another bug, but the smp_wmb++smp_read_barrier_depends does
seem to be a non-issue in this path.
> But this is all _irrelevant_ : the tricky test to worry about in
> page_lock_anon_vma() is of page_mapped() i.e. does this page currently
> have any ptes in userspace, not of page_mapping() or page->mapping.
I'm not arguing for removing the page_mapped() we have now, I'm just
wondering about the one Nick wanted to add at the end.
> In the case of file pages, it is in some places crucial to check
> page->mapping against a racing file truncation; but there's no such
> issue with anon pages, their page->mapping pointing to anon_vma is
> left set until the page is finally freed, it is not volatile.
>
> But in the case of anon pages, what page->mapping points to may be
> volatile, in the sense that that memory might at some point get reused
> for a different anon_vma, or the slab page below it get freed and
> reused for a different purpose completely: that's what we have to
> careful of in the case of anon pages, and it's RCU and the
> page_mapped() test which guard that.
.. and I'm not worried about the slab page. It's stable, since we hold the
RCU read-lock. No worries about that one either.
It's the "struct page" itself - the one we use for lookup in
page_lock_anon_vma(). And I'm worried about the need for *re-doing* the
page_mapped() test.
The problem that makes "page_lock_anon_vma()" such a total disaster is
that yes, it does locking, but it does locking _after_ the lookup, and the
lock doesn't actually protect any of the data that it is using for the
lookup itself.
And yes, we have various tricks to try to make the data "safe" even if we
race with the lookup, like the RCU stability of the anon_vma allocation,
so that even if we race, we don't do anything bad. And I don't worry about
the anon_vma, that part looks fine.
But page_remove_rmap() is run totally unlocked wrt page_lock_anon_vma(),
it looks like. page_remove_rmap() is run under the pte lock, and
page_lock_anon_vma() is run under no lock at all that I can see that is
reliable.
So yes, we have the same kind of "delay the destroy" wrt page->mapping (ie
page_remove_rmap() doesn't actually clear page->mapping at all), but none
of this runs under any lock at all.
So as far as I can tell, the only reason "page_lock_anon_vma()" is safe is
because we hopefully always hold an elevated page count when we call it,
so at least the struct page itself will never get freed, so page->mapping
is safe just because it's not cleared.
But assuming all that is true, it boils down to this:
- the anon_vma we get may be the wrong one or a stale one (since
page_remove_rmap ran concurrently and we ended up freeing the
anon_vma), but it's always a "valid" anon_vma in the sense that it's
initialized and the list is always pointing to *some* stable set of
vm_area_struct's.
- if we raced, and the anon_vma is stale, we're going to walk over
some bogus list that pertains to a totally different page, but WHO
REALLY CARES? If it is about another page that got that anon_vma
reallocated to it, all the code that traverses the list of vma's still
has to check the page tables _anyway_. And that will never trigger, and
that will get the pte lock for checking anyway, so at _that_ point do
we correctly finally synchronize with a lock that is actually relevant.
- the "anon_vma->lock" is totally irrelevant wrt "page_mapped()", and I'm
not seeing what it could possibly help to re-check after getting that
lock.
So what I'm trying to figure out is why Nick wanted to add another check
for page_mapped(). I'm not seeing what it is supposed to protect against.
(Yes, we have checks for "page_mapped()" inside the "try_tp_unmap_xyz()"
loops, but those are for a different reason - they're there to exit the
loop early when we know there's no point. They don't claim to be about
locking serialization).
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-10-20 15:17 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-16 4:10 [patch] mm: fix anon_vma races Nick Piggin
2008-10-17 22:14 ` Hugh Dickins
2008-10-17 23:05 ` Linus Torvalds
2008-10-18 0:13 ` Hugh Dickins
2008-10-18 0:25 ` Linus Torvalds
2008-10-18 1:53 ` Nick Piggin
2008-10-18 2:50 ` Paul Mackerras
2008-10-18 2:57 ` Linus Torvalds
2008-10-18 5:49 ` Nick Piggin
2008-10-18 10:49 ` Paul Mackerras
2008-10-18 17:00 ` Linus Torvalds
2008-10-18 18:44 ` Matthew Wilcox
2008-10-19 2:54 ` Nick Piggin
2008-10-19 2:53 ` Nick Piggin
2008-10-17 23:13 ` Peter Zijlstra
2008-10-17 23:53 ` Linus Torvalds
2008-10-18 0:42 ` Linus Torvalds
2008-10-18 1:08 ` Linus Torvalds
2008-10-18 1:32 ` Nick Piggin
2008-10-18 2:11 ` Linus Torvalds
2008-10-18 2:25 ` Nick Piggin
2008-10-18 2:35 ` Nick Piggin
2008-10-18 2:53 ` Linus Torvalds
2008-10-18 5:20 ` Nick Piggin
2008-10-18 10:38 ` Peter Zijlstra
2008-10-19 9:52 ` Hugh Dickins
2008-10-19 10:51 ` Peter Zijlstra
2008-10-19 12:39 ` Hugh Dickins
2008-10-19 18:25 ` Linus Torvalds
2008-10-19 18:45 ` Peter Zijlstra
2008-10-19 19:00 ` Hugh Dickins
2008-10-20 4:03 ` Hugh Dickins
2008-10-20 15:17 ` Linus Torvalds [this message]
2008-10-20 18:21 ` Hugh Dickins
2008-10-21 2:56 ` Nick Piggin
2008-10-21 3:25 ` Linus Torvalds
2008-10-21 4:33 ` Nick Piggin
2008-10-21 12:58 ` Hugh Dickins
2008-10-21 15:59 ` Christoph Lameter
2008-10-22 9:29 ` Nick Piggin
2008-10-21 4:34 ` Nick Piggin
2008-10-21 13:55 ` Hugh Dickins
2008-10-21 2:44 ` Nick Piggin
2008-10-18 19:14 ` Hugh Dickins
2008-10-19 3:03 ` Nick Piggin
2008-10-19 7:07 ` Hugh Dickins
2008-10-20 3:26 ` Hugh Dickins
2008-10-21 2:45 ` Nick Piggin
2008-10-19 1:13 ` Hugh Dickins
2008-10-19 2:41 ` Nick Piggin
2008-10-19 9:45 ` Hugh Dickins
2008-10-21 3:59 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.2.00.0810200742300.3518@nehalem.linux-foundation.org \
--to=torvalds@linux-foundation.org \
--cc=a.p.zijlstra@chello.nl \
--cc=hugh@veritas.com \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox