public inbox for stable@vger.kernel.org
From: Will Deacon <will.deacon@arm.com>
To: Nicholas Piggin <npiggin@gmail.com>
Cc: akpm@linux-foundation.org,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	jstancek@redhat.com, mgorman@suse.de, minchan@kernel.org,
	mm-commits@vger.kernel.org, namit@vmware.com,
	Peter Zijlstra <peterz@infradead.org>,
	stable@vger.kernel.org, yang.shi@linux.alibaba.com
Subject: Re: + mm-mmu_gather-remove-__tlb_reset_range-for-force-flush.patch added to -mm tree
Date: Mon, 3 Jun 2019 18:57:19 +0100
Message-ID: <20190603175719.GA13018@fuggles.cambridge.arm.com>
In-Reply-To: <1559569861.n3f6bbdn43.astroid@bobo.none>

Hi Nick,

On Tue, Jun 04, 2019 at 12:10:37AM +1000, Nicholas Piggin wrote:
> Will Deacon's on June 3, 2019 8:30 pm:
> > On Mon, Jun 03, 2019 at 12:11:38PM +1000, Nicholas Piggin wrote:
> >> Peter Zijlstra's on May 31, 2019 7:49 pm:
> >> > On Fri, May 31, 2019 at 12:46:56PM +1000, Nicholas Piggin wrote:
> >> >> I don't think it's very nice to set fullmm and freed_tables for this 
> >> >> case though. Is this concurrent zapping an important fast path? It
> >> >> must have been, in order to justify all this complexity to the mm, so
> >> >> we don't want to tie this boat anchor to it AFAIKS?
> >> > 
> >> > I'm not convinced it's an important fast path, afaict it is an
> >> > unfortunate correctness issue caused by allowing concurrent frees.
> >> 
> >> I mean -- concurrent freeing was an important fastpath, right?
> >> And concurrent freeing means that you hit this case. So this
> >> case itself should be important too.
> > 
> > I honestly don't think we (Peter and I) know. Our first involvement with
> > this was because TLBs were found to contain stale user entries:
> > 
> > https://lore.kernel.org/linux-arm-kernel/1817839533.20996552.1557065445233.JavaMail.zimbra@redhat.com/
> > 
> > so the initial work to support the concurrent freeing was done separately
> > and, I assume, motivated by some real workloads. I would also very much
> > like to know more about that, since nothing remotely realistic has surfaced
> > in this discussion, other than some historical glibc thing which has long
> > since been fixed.
> 
> Well, it seems like it is important. While the complexity is carried
> in the mm, we should not skimp on this last small piece.

As I say, I really don't know. But yes, if we can do something better we
should.

> >> >> Is the problem just that the freed page tables flags get cleared by
> >> >> __tlb_reset_range()? Why not just remove that then, so the bits are
> >> >> set properly for the munmap?
> >> > 
> >> > That's insufficient; as argued in my initial suggestion:
> >> > 
> >> >   https://lkml.kernel.org/r/20190509103813.GP2589@hirez.programming.kicks-ass.net
> >> > 
> >> > Since we don't know what was flushed by the concurrent flushes, we must
> >> > flush all state (page sizes, tables etc..).
> >> 
> >> Page tables should not be concurrently freed I think. Just don't clear
> >> those page table free flags and it should be okay. Page sizes yes,
> >> but we accommodated for that in the arch code. I could see reason to
> >> add a flag to the gather struct like "concurrent_free" and set that
> >> from the generic code, which the arch has to take care of.
> > 
> > I think you're correct that two CPUs cannot free the page tables
> > concurrently (I misunderstood this initially), although I also think
> > there may be some subtle issues if tlb->freed_tables is not set,
> > depending on the architecture. Roughly speaking, if one CPU is clearing
> > a PMD as part of munmap() and another CPU in madvise() does only last-level
> > TLB invalidation, then I think there's the potential for the invalidation
> > to be ineffective if observing a cleared PMD doesn't imply that the last
> > level has been unmapped from the perspective of the page-table walker.
> 
> That should not be the case because the last level table should have
> had all entries cleared before the pointer to it has been cleared.

The devil is in the detail here, and I think specifically it depends
on what you mean by "before". Does that mean memory barrier, or special
page-table walker barrier, or TLB invalidation or ...?

> So the page table walker could begin from the now-freed page table,
> but it would never instantiate a valid TLB entry from there. So a
> TLB invalidation would behave properly despite not flushing page
> tables.
> 
> Powerpc at least would want to avoid over flushing here, AFAIKS.

For arm64 it really depends how often this hits. Simply not setting
tlb->freed_tables would also break things for us, because we have an
optimisation where we elide invalidation in the fullmm && !freed_tables
case, since this is indicative of the mm going away and so we simply
avoid reallocating its ASID.
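
A minimal sketch of that decision, modelled in plain userspace C rather
than taken from the kernel source. The names gather_model, flush_action
and pick_flush are hypothetical, and only the two flags discussed above
are represented:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two mmu_gather flags discussed above. */
struct gather_model {
	bool fullmm;        /* the whole address space is being torn down */
	bool freed_tables;  /* page-table pages were freed in this gather */
};

/* Hypothetical labels for the three outcomes in this model. */
enum flush_action {
	FLUSH_NONE,   /* elide invalidation; the ASID is not reallocated */
	FLUSH_MM,     /* invalidate everything belonging to this mm */
	FLUSH_RANGE,  /* invalidate only the gathered range */
};

/*
 * fullmm && !freed_tables is taken to mean "the mm is going away", so
 * invalidation is skipped. Every other combination flushes something.
 */
static enum flush_action pick_flush(const struct gather_model *tlb)
{
	if (tlb->fullmm)
		return tlb->freed_tables ? FLUSH_MM : FLUSH_NONE;
	return FLUSH_RANGE;
}
```

In this model, forcing fullmm without also setting freed_tables lands
in FLUSH_NONE, which is the breakage described above; setting both
flags gives FLUSH_MM.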

> >> > But it looks like benchmarks (for the one test-case we have) seem to
> >> > favour flushing the world over flushing a smaller range.
> >> 
> >> Testing on 16MB unmap is possibly not a good benchmark, I didn't run
> >> it exactly but it looks likely to go beyond the range flush threshold
> >> and flush the entire PID anyway.
> > 
> > If we can get a better idea of what a "good benchmark" might look like (i.e.
> > something that is representative of the cases in which real workloads are
> > likely to trigger this path) then we can definitely try to optimise around
> > that.
> 
> Hard to say unfortunately. A smaller unmap range to start with, but
> even then when you have a TLB over-flushing case, then an unmap micro
> benchmark is not a great test because you'd like to see more impact of
> other useful entries being flushed (e.g., you need an actual working
> set).

Right, sounds like somebody needs to do some better analysis than what's
been done so far.

> > In the meantime, I would really like to see this patch land in mainline
> > since it fixes a regression.
> 
> Sorry I didn't provide input earlier. I would like to improve the fix or 
> at least make an option for archs to provide an optimised way to flush 
> this case, so it would be nice not to fix archs this way and then have 
> to change the fix significantly right away.

Please send patches ;)

> But the bug does need to be fixed, of course. If more thought is needed
> about it, maybe it's best to take this fix for the next release.

Agreed.

Will
