linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* mm,tlb: revert 4647706ebeee?
@ 2018-07-06 17:03 Rik van Riel
  2018-07-06 17:10 ` [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range" Rik van Riel
  2018-07-07 15:25 ` mm,tlb: revert 4647706ebeee? Nicholas Piggin
  0 siblings, 2 replies; 6+ messages in thread
From: Rik van Riel @ 2018-07-06 17:03 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Michal Hocko, kirill.shutemov,
	Minchan Kim, Mel Gorman, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

Hello,

It looks like last summer, there were 2 sets of patches
in flight to fix the issue of simultaneous mprotect/madvise
calls unmapping PTEs, and some pages not being flushed from
the TLB before returning to userspace.

Minchan posted these patches:
56236a59556c ("mm: refactor TLB gathering API")
99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")

Around the same time, Mel posted:
4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")

They both appear to solve the same bug.

Only one of the two solutions is needed.

However, 4647706ebeee appears to introduce extra TLB
flushes - one per VMA, instead of one over the entire
range unmapped, and also extra flushes when there are
no simultaneous unmappers of the same mm.

For that reason, it seems like we should revert
4647706ebeee and keep only Minchan's solution in
the kernel.

Am I overlooking any reason why we should not revert
4647706ebeee?

kind regards,

Rik
-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range"
  2018-07-06 17:03 mm,tlb: revert 4647706ebeee? Rik van Riel
@ 2018-07-06 17:10 ` Rik van Riel
  2018-07-16 13:12   ` Mel Gorman
  2018-07-07 15:25 ` mm,tlb: revert 4647706ebeee? Nicholas Piggin
  1 sibling, 1 reply; 6+ messages in thread
From: Rik van Riel @ 2018-07-06 17:10 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Michal Hocko, kirill.shutemov,
	Minchan Kim, Mel Gorman, kernel-team

There was a bug in Linux that could cause madvise (and mprotect?)
system calls to return to userspace without the TLB having been
flushed for all the pages involved.

This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.

This was noticed in the summer of 2017, at which time two solutions
were created:
56236a59556c ("mm: refactor TLB gathering API")
99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")

We need only one of these solutions, and the former appears to be
a little more efficient than the latter, so revert that one.

This reverts commit 4647706ebeee6e50f7b9f922b095f4ec94d581c3.
---
 mm/memory.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..9d472e00fc2d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1603,20 +1603,8 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 	mmu_notifier_invalidate_range_start(mm, start, end);
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+	for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, end, NULL);
-
-		/*
-		 * zap_page_range does not specify whether mmap_sem should be
-		 * held for read or write. That allows parallel zap_page_range
-		 * operations to unmap a PTE and defer a flush meaning that
-		 * this call observes pte_none and fails to flush the TLB.
-		 * Rather than adding a complex API, ensure that no stale
-		 * TLB entries exist when this call returns.
-		 */
-		flush_tlb_range(vma, start, end);
-	}
-
 	mmu_notifier_invalidate_range_end(mm, start, end);
 	tlb_finish_mmu(&tlb, start, end);
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: mm,tlb: revert 4647706ebeee?
  2018-07-06 17:03 mm,tlb: revert 4647706ebeee? Rik van Riel
  2018-07-06 17:10 ` [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range" Rik van Riel
@ 2018-07-07 15:25 ` Nicholas Piggin
  2018-07-10  0:13   ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Nicholas Piggin @ 2018-07-07 15:25 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
	Michal Hocko, kirill.shutemov, Minchan Kim, Mel Gorman,
	kernel-team

On Fri, 06 Jul 2018 13:03:55 -0400
Rik van Riel <riel@surriel.com> wrote:

> Hello,
> 
> It looks like last summer, there were 2 sets of patches
> in flight to fix the issue of simultaneous mprotect/madvise
> calls unmapping PTEs, and some pages not being flushed from
> the TLB before returning to userspace.
> 
> Minchan posted these patches:
> 56236a59556c ("mm: refactor TLB gathering API")
> 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> 
> Around the same time, Mel posted:
> 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> 
> They both appear to solve the same bug.
> 
> Only one of the two solutions is needed.
> 
> However, 4647706ebeee appears to introduce extra TLB
> flushes - one per VMA, instead of one over the entire
> range unmapped, and also extra flushes when there are
> no simultaneous unmappers of the same mm.
> 
> For that reason, it seems like we should revert
> 4647706ebeee and keep only Minchan's solution in
> the kernel.
> 
> Am I overlooking any reason why we should not revert
> 4647706ebeee?

Yes I think so. Discussed here recently:

https://marc.info/?l=linux-mm&m=152878780528037&w=2

Actually we realized that powerpc does not implement the mmu
gather flushing quite right so it needs a fix before this
revert. But I propose the revert for next merge window.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm,tlb: revert 4647706ebeee?
  2018-07-07 15:25 ` mm,tlb: revert 4647706ebeee? Nicholas Piggin
@ 2018-07-10  0:13   ` Andrew Morton
  2018-07-10  5:04     ` Nicholas Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2018-07-10  0:13 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Rik van Riel, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Michal Hocko, kirill.shutemov, Minchan Kim, Mel Gorman,
	kernel-team, Aneesh Kumar K.V, Nadav Amit

On Sun, 8 Jul 2018 01:25:38 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> On Fri, 06 Jul 2018 13:03:55 -0400
> Rik van Riel <riel@surriel.com> wrote:
> 
> > Hello,
> > 
> > It looks like last summer, there were 2 sets of patches
> > in flight to fix the issue of simultaneous mprotect/madvise
> > calls unmapping PTEs, and some pages not being flushed from
> > the TLB before returning to userspace.
> > 
> > Minchan posted these patches:
> > 56236a59556c ("mm: refactor TLB gathering API")
> > 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> > 
> > Around the same time, Mel posted:
> > 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> > 
> > They both appear to solve the same bug.
> > 
> > Only one of the two solutions is needed.
> > 
> > However, 4647706ebeee appears to introduce extra TLB
> > flushes - one per VMA, instead of one over the entire
> > range unmapped, and also extra flushes when there are
> > no simultaneous unmappers of the same mm.
> > 
> > For that reason, it seems like we should revert
> > 4647706ebeee and keep only Minchan's solution in
> > the kernel.
> > 
> > Am I overlooking any reason why we should not revert
> > 4647706ebeee?
> 
> Yes I think so. Discussed here recently:
> 
> https://marc.info/?l=linux-mm&m=152878780528037&w=2

Unclear if that was an ack ;)

> Actually we realized that powerpc does not implement the mmu
> gather flushing quite right so it needs a fix before this
> revert. But I propose the revert for next merge window.

Yes, I have Rik's patch for 4.19-rc1.  I added yourself, Aneesh and
Nadav to cc so you'll see it fly past.  If poss, please do get this all
tested before the time comes and let me know?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: mm,tlb: revert 4647706ebeee?
  2018-07-10  0:13   ` Andrew Morton
@ 2018-07-10  5:04     ` Nicholas Piggin
  0 siblings, 0 replies; 6+ messages in thread
From: Nicholas Piggin @ 2018-07-10  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Michal Hocko, kirill.shutemov, Minchan Kim, Mel Gorman,
	kernel-team, Aneesh Kumar K.V, Nadav Amit, linux-arch

On Mon, 9 Jul 2018 17:13:56 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 8 Jul 2018 01:25:38 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Fri, 06 Jul 2018 13:03:55 -0400
> > Rik van Riel <riel@surriel.com> wrote:
> >   
> > > Hello,
> > > 
> > > It looks like last summer, there were 2 sets of patches
> > > in flight to fix the issue of simultaneous mprotect/madvise
> > > calls unmapping PTEs, and some pages not being flushed from
> > > the TLB before returning to userspace.
> > > 
> > > Minchan posted these patches:
> > > 56236a59556c ("mm: refactor TLB gathering API")
> > > 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> > > 
> > > Around the same time, Mel posted:
> > > 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> > > 
> > > They both appear to solve the same bug.
> > > 
> > > Only one of the two solutions is needed.
> > > 
> > > However, 4647706ebeee appears to introduce extra TLB
> > > flushes - one per VMA, instead of one over the entire
> > > range unmapped, and also extra flushes when there are
> > > no simultaneous unmappers of the same mm.
> > > 
> > > For that reason, it seems like we should revert
> > > 4647706ebeee and keep only Minchan's solution in
> > > the kernel.
> > > 
> > > Am I overlooking any reason why we should not revert
> > > 4647706ebeee?  
> > 
> > Yes I think so. Discussed here recently:
> > 
> > https://marc.info/?l=linux-mm&m=152878780528037&w=2  
> 
> Unclear if that was an ack ;)
>

Sure, I'm thinking Rik's mail is a ack for my patch :)

No actually I think it's okay, but was in the middle of testing
my series when Aneesh pointed out a bit was missing from powerpc,
so I had to go off and fix that, I think that's upstream now. So
need to go back and re-test this revert.

Wouldn't hurt for other arch maintainers to have a look I guess
(cc linux-arch):

The problem powerpc had is that mmu_gather flushing will flush a
single page size based on the ptes it encounters when we zap. If
we hit a different page size, it flushes and switches to the new
size. If we have concurrent zaps on the same range, the other
thread may have cleared a large page pte so we won't see that and
will only do a small page flush for that range. Which means we can
return before the other thread invalidated our TLB for the large
pages in the range we wanted to flush.

I suspect most arches are probably okay, but if you make any TLB
flush choices based on the pte contents, then you could be exposed.
Except in the case of archs like sparc and powerpc/hash which do
the flushing in arch_leave_lazy_mmu_mode(), because that is called
under the same page table lock, so there can't be concurrent zap.

A quick look through the archs doesn't show anything obvious, but
please take a look at your arch.

And I'll try to do a bit more testing.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range"
  2018-07-06 17:10 ` [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range" Rik van Riel
@ 2018-07-16 13:12   ` Mel Gorman
  0 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2018-07-16 13:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
	Michal Hocko, kirill.shutemov, Minchan Kim, kernel-team

On Fri, Jul 06, 2018 at 01:10:19PM -0400, Rik van Riel wrote:
> There was a bug in Linux that could cause madvise (and mprotect?)
> system calls to return to userspace without the TLB having been
> flushed for all the pages involved.
> 
> This could happen when multiple threads of a process made simultaneous
> madvise and/or mprotect calls.
> 
> This was noticed in the summer of 2017, at which time two solutions
> were created:
> 56236a59556c ("mm: refactor TLB gathering API")
> 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> and
> 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> 
> We need only one of these solutions, and the former appears to be
> a little more efficient than the latter, so revert that one.
> 
> This reverts commit 4647706ebeee6e50f7b9f922b095f4ec94d581c3.
> ---
>  mm/memory.c | 14 +-------------
>  1 file changed, 1 insertion(+), 13 deletions(-)

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-07-16 13:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-07-06 17:03 mm,tlb: revert 4647706ebeee? Rik van Riel
2018-07-06 17:10 ` [PATCH] Revert "mm: always flush VMA ranges affected by zap_page_range" Rik van Riel
2018-07-16 13:12   ` Mel Gorman
2018-07-07 15:25 ` mm,tlb: revert 4647706ebeee? Nicholas Piggin
2018-07-10  0:13   ` Andrew Morton
2018-07-10  5:04     ` Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).