From: Linus Torvalds <torvalds@linux-foundation.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>, "H. Peter Anvin" <hpa@zytor.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Jan Kara <jack@suse.cz>, Dave Hansen <dave.hansen@intel.com>,
"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Russell King - ARM Linux <linux@arm.linux.org.uk>,
Tony Luck <tony.luck@intel.com>
Subject: Re: Dirty/Access bits vs. page content
Date: Sun, 27 Apr 2014 09:21:45 -0700 [thread overview]
Message-ID: <CA+55aFwLumAqA6mYyPKRZYOCr2TRPxUVdCKhHMg0nYN_KbBDbQ@mail.gmail.com> (raw)
In-Reply-To: <20140427072034.GC1429@laptop.programming.kicks-ass.net>
On Sun, Apr 27, 2014 at 12:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> OK, so I've been thinking and figured I either mis-understand how the
> hardware works or don't understand how Linus' patch will actually fully
> fix the issue.
>
> So what both try_to_unmap_one() and zap_pte_range() end up doing is
> clearing the PTE entry and then flushing the TLBs.
Right. And we can't do it in the other order, since if we flush the
TLBs first, another cpu (or even this cpu, doing speculative memory
accesses) could re-fill them between the flush and the clearing of the
page table.
So that's race #1, easily fixed by ordering the TLB flush after
clearing the page table entry. We've never done this particular race
wrong, afaik. It's _so_ obvious that it's the only one we've always
gotten right.
On to the ones we've historically gotten wrong:
> However, that still leaves a window where there are remote TLB entries.
> What if any of those remote entries cause a write (or have a dirty bit
> cached) while we've already removed the PTE entry.
So this is race #2: the race between clearing the entry and a TLB miss
loading it and marking it accessed or dirty.
That race is handled by the fact that the CPU does the accessed/dirty
bit update as an atomic read-modify-write operation, and actually
re-checks the PTE entry as it does so.
So in theory a CPU could just remember what address it loaded the TLB
entry from, and do a blind "set the dirty bit" with just an atomic
"or" operation. In fact, for a while I thought that CPU's could do
that, and the TLB flushing sequence would be:
entry = atomic_xchg(pte, 0);
flush_tlb();
entry |= *pte;
so that we'd catch any races with the A/D bit getting set.
It turns out no CPU actually does that, and I'm not sure we ever had
that code sequence in the kernel (but some code archaeologist might go
look).
What CPU's actually do is simpler both for them and for us: instead of
remembering where they loaded the TLB entry from, they re-walk the
page table when they mark the TLB dirty, and if the page table entry
is no longer present (or if it's non-writable), the store will fault
instead of marking the TLB entry dirty.
So race #2 doesn't need the above complex sequence, but it still
*does* need that TLB entry to be loaded with an atomic exchange with
zero (or at least with something that clears the present bit, zero
obviously being the simplest such value). So a simple
entry = atomic_xchg(pte, 0);
is sufficient for this race (except we call it "ptep_get_and_clear()" ;)
Of course, *If* a CPU were to remember the address it loaded the TLB
entry from, then such a CPU might as well make the TLB be part of the
cache-coherency domain, and then we wouldn't need to do any TLB
flushing at all. I wish.
> Will the hardware fault when it does a translation and needs to update
> the dirty/access bits while the PTE entry is !present?
Yes indeed, see above (but see how broken hardware _could_ work, which
would be really painful for us).
What we are fighting is race #3: the TLB happily exists on this or
other CPU's, an dis _not_ getting updated (so no re-walk), but _is_
getting used.
And we've actually had a fix for race #3 for a long time: the whole
"don't free the pages until after the flush" is very much this issue.
So it's not a new condition by any means (as far as I can tell, the
mmu_gather infrastructure was introduced in 2.4.9.13, so 2001 - the
exact commit predates even BK history).
But this new issue is related to race #3, but purely in software: when
we do the "set_page_dirty()" before doing the TLB flush, we need to
protect against our cleaning that bit until after the flush.
And we've now had three different ways to fix that race, one
introducing a new race (my original two-patch series that delayed the
set_page_dirty() the same way we delay the page freeing), one using a
new lock entirely (Hugh latest patch - mapping->i_mmap_mutex isn't a
new lock, but in this context it is), and one that extends the rules
we already had in place for the single-PTE cases (do the
set_page_dirty() and TLB flushing atomically wrt the page table lock,
which makes it atomic wrt mkclean_one).
And the reason I think I'll merge my patch rather than Hugh's (despite
Hugh's being smaller) is exactly the fact that it doesn't really
introduce any new locking rules - it just fixes the fact that we
didn't really realize how important it was, and didn't follow the same
rules as the single-pte cases did.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2014-04-27 16:21 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1398032742.19682.11.camel@pasglop>
[not found] ` <CA+55aFz1sK+PF96LYYZY7OB7PBpxZu-uNLWLvPiRz-tJsBqX3w@mail.gmail.com>
[not found] ` <1398054064.19682.32.camel@pasglop>
[not found] ` <1398057630.19682.38.camel@pasglop>
[not found] ` <CA+55aFwWHBtihC3w9E4+j4pz+6w7iTnYhTf4N3ie15BM9thxLQ@mail.gmail.com>
[not found] ` <53558507.9050703@zytor.com>
[not found] ` <CA+55aFxGm6J6N=4L7exLUFMr1_siNGHpK=wApd9GPCH1=63PPA@mail.gmail.com>
[not found] ` <53559F48.8040808@intel.com>
2014-04-22 0:31 ` Dirty/Access bits vs. page content Linus Torvalds
2014-04-22 0:44 ` Linus Torvalds
2014-04-22 5:15 ` Tony Luck
2014-04-22 14:55 ` Linus Torvalds
2014-04-22 7:34 ` Peter Zijlstra
2014-04-22 7:54 ` Peter Zijlstra
2014-04-22 21:36 ` Linus Torvalds
2014-04-22 21:46 ` Dave Hansen
2014-04-22 22:08 ` Linus Torvalds
2014-04-22 22:41 ` Dave Hansen
2014-04-23 2:44 ` Linus Torvalds
2014-04-23 3:08 ` Hugh Dickins
2014-04-23 4:23 ` Linus Torvalds
2014-04-23 6:14 ` Benjamin Herrenschmidt
2014-04-23 18:41 ` Jan Kara
2014-04-23 19:33 ` Linus Torvalds
2014-04-24 6:51 ` Peter Zijlstra
2014-04-24 18:40 ` Hugh Dickins
2014-04-24 19:45 ` Linus Torvalds
2014-04-24 20:02 ` Hugh Dickins
2014-04-24 23:46 ` Linus Torvalds
2014-04-25 1:37 ` Benjamin Herrenschmidt
2014-04-25 2:41 ` Benjamin Herrenschmidt
2014-04-25 2:46 ` Linus Torvalds
2014-04-25 2:50 ` H. Peter Anvin
2014-04-25 3:03 ` Linus Torvalds
2014-04-25 12:01 ` Hugh Dickins
2014-04-25 13:51 ` Peter Zijlstra
2014-04-25 19:41 ` Hugh Dickins
2014-04-26 18:07 ` Peter Zijlstra
2014-04-27 7:20 ` Peter Zijlstra
2014-04-27 12:20 ` Hugh Dickins
2014-04-27 19:33 ` Peter Zijlstra
2014-04-27 19:47 ` Linus Torvalds
2014-04-27 20:09 ` Hugh Dickins
2014-04-28 9:25 ` Peter Zijlstra
2014-04-28 10:14 ` Peter Zijlstra
2014-04-27 16:21 ` Linus Torvalds [this message]
2014-04-27 23:13 ` Benjamin Herrenschmidt
2014-04-25 16:54 ` Dave Hansen
2014-04-25 18:41 ` Hugh Dickins
2014-04-25 22:00 ` Dave Hansen
2014-04-26 3:11 ` Hugh Dickins
2014-04-26 3:48 ` Linus Torvalds
2014-04-25 17:56 ` Linus Torvalds
2014-04-25 19:13 ` Hugh Dickins
2014-04-25 16:30 ` Dave Hansen
2014-04-23 20:11 ` Hugh Dickins
2014-04-24 8:49 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CA+55aFwLumAqA6mYyPKRZYOCr2TRPxUVdCKhHMg0nYN_KbBDbQ@mail.gmail.com \
--to=torvalds@linux-foundation.org \
--cc=benh@kernel.crashing.org \
--cc=dave.hansen@intel.com \
--cc=hpa@zytor.com \
--cc=hughd@google.com \
--cc=jack@suse.cz \
--cc=linux-arch@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux@arm.linux.org.uk \
--cc=peterz@infradead.org \
--cc=tony.luck@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).