From: Will Deacon <will.deacon@arm.com>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: akpm@linux-foundation.org,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
jstancek@redhat.com, mgorman@suse.de, minchan@kernel.org,
mm-commits@vger.kernel.org, namit@vmware.com,
Peter Zijlstra <peterz@infradead.org>,
stable@vger.kernel.org, yang.shi@linux.alibaba.com
Subject: Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
Date: Mon, 3 Jun 2019 18:57:19 +0100
Message-ID: <20190603175719.GA13018@fuggles.cambridge.arm.com>
In-Reply-To: <1559569861.n3f6bbdn43.astroid@bobo.none>
Hi Nick,
On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
> Will Deacon's on June 3, 2019 8:30 pm:
> > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> >> Peter Zijlstra's on May 31, 2019 7:49 pm:
> >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> >> I don't think it's very nice to set fullmm and freed_tables for this
> >> >> case though. Is this concurrent zapping an important fast path? It
> >> >> must have been, in order to justify all this complexity to the mm, so
> >> >> we don't want to tie this boat anchor to it AFAIKS?
> >> >
> >> > I'm not convinced it's an important fast path; AFAICT it is an
> >> > unfortunate correctness issue caused by allowing concurrent frees.
> >>
> >> I mean -- concurrent freeing was an important fastpath, right?
> >> And concurrent freeing means that you hit this case. So this
> >> case itself should be important too.
> >
> > I honestly don't think we (Peter and I) know. Our first involvement with
> > this was because TLBs were found to contain stale user entries:
> >
> > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/
> >
> > so the initial work to support the concurrent freeing was done separately
> > and, I assume, motivated by some real workloads. I would also very much
> > like to know more about that, since nothing remotely realistic has surfaced
> > in this discussion, other than some historical glibc thing which has long
> > since been fixed.
>
> Well, it seems like it is important. While the complexity is carried
> in the mm, we should not skimp on this last small piece.
As I say, I really don't know. But yes, if we can do something better we
should.
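For reference, a rough userspace model of the fix being discussed (the struct and field names mimic mmu_gather, but this is a sketch of the idea, not the actual patch): when the unmap raced with another flusher, widen the flush instead of trusting the locally gathered state:

```c
#include <assert.h>
#include <stdbool.h>

/* Rough model: field names mimic mmu_gather, but this is a sketch,
 * not the actual patch. */
struct flush_model {
	unsigned long start, end;  /* range gathered by this thread */
	bool fullmm;
	bool freed_tables;
};

/* If another thread raced with this unmap (mm_tlb_flush_nested() in
 * the kernel), we cannot know what it already flushed, so widen the
 * flush to the whole mm, including page-table state, instead of
 * trusting (or resetting) the locally gathered range. */
static void finish_force_flush(struct flush_model *tlb, bool nested)
{
	if (nested) {
		tlb->fullmm = true;
		tlb->freed_tables = true;
	}
	/* ...followed by the arch's tlb_flush(tlb)... */
}
```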
> >> >> Is the problem just that the freed page tables flags get cleared by
> >> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> >> set properly for the munmap?
> >> >
> >> > That's insufficient; as argued in my initial suggestion:
> >> >
> >> > https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
> >> >
> >> > Since we don't know what was flushed by the concurrent flushes, we must
> >> > flush all state (page sizes, tables, etc.).
> >>
> >> Page tables should not be concurrently freed, I think. Just don't clear
> >> those page-table free flags and it should be okay. Page sizes, yes,
> >> but we accounted for that in the arch code. I could see reason to
> >> add a flag to the gather struct like "concurrent_free" and set that
> >> from the generic code, which the arch has to take care of.
> >
> > I think you're correct that two CPUs cannot free the page tables
> > concurrently (I misunderstood this initially), although I also think
> > there may be some subtle issues if tlb->freed_tables is not set,
> > depending on the architecture. Roughly speaking, if one CPU is clearing
> > a PMD as part of munmap() and another CPU in madvise() does only last-level
> > TLB invalidation, then I think there's the potential for the invalidation
> > to be ineffective if observing a cleared PMD doesn't imply that the last
> > level has been unmapped from the perspective of the page-table walker.
>
> That should not be the case because the last level table should have
> had all entries cleared before the pointer to it has been cleared.
The devil is in the detail here, and I think specifically it depends
what you mean by "before". Does that mean memory barrier, or special
page-table walker barrier, or TLB invalidation or ...?
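To make the assumption concrete, here is a hedged userspace model (C11 atomics; the pmd/pte names are illustrative, this is not kernel code) of "entries cleared before the pointer is cleared", using a release store to publish the pointer clear:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPTES 4

/* Illustrative leaf table: a nonzero entry stands in for a valid
 * translation. */
static _Atomic unsigned long pte[NPTES];

/* Illustrative table pointer: NULL once the table has been "freed". */
static _Atomic(_Atomic unsigned long *) pmd = pte;

static void unmap_and_free_table(void)
{
	/* Clear every leaf entry first... */
	for (size_t i = 0; i < NPTES; i++)
		atomic_store_explicit(&pte[i], 0, memory_order_relaxed);
	/* ...then publish the pointer clear with release ordering: the
	 * entry clears are ordered before the pointer clear. */
	atomic_store_explicit(&pmd, NULL, memory_order_release);
}

/* Walker: may race with the teardown and start from the old table,
 * but by the time the pointer clear is visible, the entries it can
 * read through either path are already invalid. */
static unsigned long walk(size_t i)
{
	_Atomic unsigned long *table =
		atomic_load_explicit(&pmd, memory_order_acquire);
	if (!table)
		return 0;
	return atomic_load_explicit(&table[i], memory_order_relaxed);
}
```

Whether the hardware page-table walker honours the equivalent ordering without an extra barrier or invalidation is exactly the architecture-specific question here.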
> So the page table walker could begin from the now-freed page table,
> but it would never instantiate a valid TLB entry from there. So a
> TLB invalidation would behave properly despite not flushing page
> tables.
>
> Powerpc at least would want to avoid over-flushing here, AFAIKS.
For arm64 it really depends how often this hits. Simply not setting
tlb->freed_tables would also break things for us, because we have an
optimisation where we elide invalidation in the fullmm && !freed_tables
case, since this is indicative of the mm going away and so we simply
avoid reallocating its ASID.
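Roughly, the elision we rely on looks like the following (an illustrative model; the struct and function are hypothetical, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the two mmu_gather flags discussed above. */
struct gather_model {
	bool fullmm;        /* tearing down the entire address space */
	bool freed_tables;  /* page-table pages were freed by this gather */
};

/* arm64-style elision: when the whole mm is going away and no page
 * tables were freed, invalidation can be skipped because the mm's ASID
 * is simply never reallocated. If freed_tables were cleared behind our
 * back, this path would be taken incorrectly. */
static bool can_elide_invalidation(const struct gather_model *tlb)
{
	return tlb->fullmm && !tlb->freed_tables;
}
```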
> >> > But it looks like benchmarks (for the one test-case we have) seem to
> >> > favour flushing the world over flushing a smaller range.
> >>
> >> Testing a 16MB unmap is possibly not a good benchmark; I didn't run
> >> it exactly, but it looks likely to go beyond the range-flush threshold
> >> and flush the entire PID anyway.
> >
> > If we can get a better idea of what a "good benchmark" might look like (i.e.
> > something that is representative of the cases in which real workloads are
> > likely to trigger this path) then we can definitely try to optimise around
> > that.
>
> Hard to say, unfortunately. A smaller unmap range to start with, but
> even then, in a TLB over-flushing case, an unmap micro-benchmark is
> not a great test, because you'd like to see more impact from other
> useful entries being flushed (e.g., you need an actual working set).
Right, sounds like somebody needs to do some better analysis than what's
been done so far.
> > In the meantime, I would really like to see this patch land in mainline
> > since it fixes a regression.
>
> Sorry I didn't provide input earlier. I would like to improve the fix or
> at least make an option for archs to provide an optimised way to flush
> this case, so it would be nice not to fix archs this way and then have
> to change the fix significantly right away.
Please send patches ;)
> But the bug does need to be fixed, of course; if it needs more thought,
> maybe it's best to take this fix for the next release.
Agreed.
Will