Re: [PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jan Stancek <jstancek@redhat.com>
To: Yang Shi <yang.shi@linux.alibaba.com>,
	Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <namit@vmware.com>, Will Deacon <will.deacon@arm.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	stable@vger.kernel.org,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	 aneesh kumar <aneesh.kumar@linux.vnet.ibm.com>,
	npiggin@gmail.com,  minchan@kernel.org,
	Mel Gorman <mgorman@suse.de>,  Jan Stancek <jstancek@redhat.com>
Subject: Re: [PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush
Date: Thu, 9 May 2019 17:06:38 -0400 (EDT)	[thread overview]
Message-ID: <249230644.21949166.1557435998550.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <84720bb8-bf3d-8c10-d675-0670f13b2efc@linux.alibaba.com>


----- Original Message -----
> 
> 
> On 5/9/19 11:24 AM, Peter Zijlstra wrote:
> > On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
> >>> On May 9, 2019, at 3:38 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >>> index 99740e1dd273..fe768f8d612e 100644
> >>> --- a/mm/mmu_gather.c
> >>> +++ b/mm/mmu_gather.c
> >>> @@ -244,15 +244,20 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >>> 		unsigned long start, unsigned long end)
> >>> {
> >>> 	/*
> >>> -	 * If there are parallel threads are doing PTE changes on same range
> >>> -	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >>> -	 * flush by batching, a thread has stable TLB entry can fail to flush
> >>> -	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >>> -	 * forcefully if we detect parallel PTE batching threads.
> >>> +	 * Sensible comment goes here..
> >>> 	 */
> >>> -	if (mm_tlb_flush_nested(tlb->mm)) {
> >>> -		__tlb_reset_range(tlb);
> >>> -		__tlb_adjust_range(tlb, start, end - start);
> >>> +	if (mm_tlb_flush_nested(tlb->mm) && !tlb->full_mm) {
> >>> +		/*
> >>> +		 * Since we're can't tell what we actually should have
> >>> +		 * flushed flush everything in the given range.
> >>> +		 */
> >>> +		tlb->start = start;
> >>> +		tlb->end = end;
> >>> +		tlb->freed_tables = 1;
> >>> +		tlb->cleared_ptes = 1;
> >>> +		tlb->cleared_pmds = 1;
> >>> +		tlb->cleared_puds = 1;
> >>> +		tlb->cleared_p4ds = 1;
> >>> 	}
> >>>
> >>> 	tlb_flush_mmu(tlb);
> >> As a simple optimization, I think it is possible to hold multiple nesting
> >> counters in the mm, similar to tlb_flush_pending, for freed_tables,
> >> cleared_ptes, etc.
> >>
> >> The first time you set tlb->freed_tables, you also atomically increase
> >> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
> >> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
> > That sounds fraught with races and expensive; I would much prefer to not
> > go there for this arguably rare case.
> >
> > Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
> > races and doesn't see that PTE. Therefore CPU-0 sets and counts
> > cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
> > it will see cleared_ptes count increased and flush that granularity,
> > OTOH if CPU-1 flushes after CPU-0 completes, it will not and potentiall
> > miss an invalidate it should have had.
> >
> > This whole concurrent mmu_gather stuff is horrible.
> >
> >    /me ponders more....
> >
> > So I think the fundamental race here is this:
> >
> > 	CPU-0				CPU-1
> >
> > 	tlb_gather_mmu(.start=1,	tlb_gather_mmu(.start=2,
> > 		       .end=3);			       .end=4);
> >
> > 	ptep_get_and_clear_full(2)
> > 	tlb_remove_tlb_entry(2);
> > 	__tlb_remove_page();
> > 					if (pte_present(2)) // nope
> >
> > 					tlb_finish_mmu();
> >
> > 					// continue without TLBI(2)
> > 					// whoopsie
> >
> > 	tlb_finish_mmu();
> > 	  tlb_flush()		->	TLBI(2)
> 
> I'm not quite sure if this is the case Jan really met. But, according to
> his test, once correct tlb->freed_tables and tlb->cleared_* are set, his
> test works well.

My theory was following sequence:

t1: map_write_unmap()                 t2: dummy()

  map_address = mmap()
  map_address[i] = 'b'
  munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()
        tlb->freed_tables = 0
        tlb->cleared_pmds = 0
    __flush_tlb_range(last_level = 0)
  ...
  map_address = mmap()
    map_address[i] = 'b'
      <page fault loop>
      # PTE appeared valid to me,
      # so I suspected stale TLB entry at higher level as result of "freed_tables = 0"


I'm happy to apply/run any debug patches to get more data that would help.

> 
> >
> >
> > And we can fix that by having tlb_finish_mmu() sync up. Never let a
> > concurrent tlb_finish_mmu() complete until all concurrenct mmu_gathers
> > have completed.
> 
> Not sure if this will scale well.
> 
> >
> > This should not be too hard to make happen.
> 
>

next prev parent reply	other threads:[~2019-05-09 21:06 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-07 21:34 [PATCH] mm: mmu_gather: remove __tlb_reset_range() for force flush Yang Shi
2019-05-09  8:37 ` Will Deacon
2019-05-09 10:38   ` Peter Zijlstra
2019-05-09 10:54     ` Peter Zijlstra
2019-05-09 18:35       ` Yang Shi
2019-05-09 18:40         ` Peter Zijlstra
2019-05-09 12:44     ` Jan Stancek
2019-05-09 17:36     ` Nadav Amit
2019-05-09 18:24       ` Peter Zijlstra
2019-05-09 19:10         ` Yang Shi
2019-05-09 21:06           ` Jan Stancek [this message]
2019-05-09 21:48             ` Yang Shi
2019-05-09 22:12               ` Jan Stancek
     [not found]         ` <04668E51-FD87-4D53-A066-5A35ABC3A0D6@vmware.com>
     [not found]           ` <20190509191120.GD2623@hirez.programming.kicks-ass.net>
2019-05-09 21:21             ` Nadav Amit
2019-05-13  8:36               ` Peter Zijlstra
2019-05-13  9:11                 ` Nadav Amit
2019-05-13 11:30                   ` Peter Zijlstra
2019-05-13 16:37                   ` Will Deacon
2019-05-13 17:06                     ` Nadav Amit
2019-05-14  8:58                       ` Mel Gorman
2019-05-13  9:12                 ` Peter Zijlstra
2019-05-13  9:21                   ` Nadav Amit
2019-05-13 11:27                     ` Peter Zijlstra
2019-05-13 17:41                       ` Nadav Amit
2019-05-09 18:22     ` Yang Shi
2019-05-09 19:56     ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=249230644.21949166.1557435998550.JavaMail.zimbra@redhat.com \
    --to=jstancek@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=minchan@kernel.org \
    --cc=namit@vmware.com \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=will.deacon@arm.com \
    --cc=yang.shi@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.