From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nicholas Piggin
Subject: Re: mm,tlb: revert 4647706ebeee?
Date: Tue, 10 Jul 2018 15:04:10 +1000
Message-ID: <20180710150410.4207bbfa@roar.ozlabs.ibm.com>
References: <1530896635.5350.25.camel@surriel.com>
 <20180708012538.51b2c672@roar.ozlabs.ibm.com>
 <20180709171356.87d834e125f06e0cdaa72f85@linux-foundation.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20180709171356.87d834e125f06e0cdaa72f85@linux-foundation.org>
Sender: linux-kernel-owner@vger.kernel.org
To: Andrew Morton
Cc: Rik van Riel, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org",
 Michal Hocko, "kirill.shutemov", Minchan Kim, Mel Gorman, kernel-team,
 "Aneesh Kumar K.V", Nadav Amit, linux-arch
List-Id: linux-arch.vger.kernel.org

On Mon, 9 Jul 2018 17:13:56 -0700 Andrew Morton wrote:

> On Sun, 8 Jul 2018 01:25:38 +1000 Nicholas Piggin wrote:
>
> > On Fri, 06 Jul 2018 13:03:55 -0400
> > Rik van Riel wrote:
> >
> > > Hello,
> > >
> > > It looks like last summer, there were 2 sets of patches
> > > in flight to fix the issue of simultaneous mprotect/madvise
> > > calls unmapping PTEs, and some pages not being flushed from
> > > the TLB before returning to userspace.
> > >
> > > Minchan posted these patches:
> > > 56236a59556c ("mm: refactor TLB gathering API")
> > > 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
> > >
> > > Around the same time, Mel posted:
> > > 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
> > >
> > > They both appear to solve the same bug.
> > >
> > > Only one of the two solutions is needed.
> > >
> > > However, 4647706ebeee appears to introduce extra TLB
> > > flushes - one per VMA, instead of one over the entire
> > > range unmapped, and also extra flushes when there are
> > > no simultaneous unmappers of the same mm.
> > >
> > > For that reason, it seems like we should revert
> > > 4647706ebeee and keep only Minchan's solution in
> > > the kernel.
> > >
> > > Am I overlooking any reason why we should not revert
> > > 4647706ebeee?
> >
> > Yes I think so. Discussed here recently:
> >
> > https://marc.info/?l=linux-mm&m=152878780528037&w=2
>
> Unclear if that was an ack ;)
>

Sure, I'm thinking Rik's mail is an ack for my patch :)

No, actually I think it's okay, but I was in the middle of testing my
series when Aneesh pointed out a bit was missing from powerpc, so I had
to go off and fix that; I think that's upstream now. So I need to go
back and re-test this revert.

Wouldn't hurt for other arch maintainers to have a look I guess
(cc linux-arch):

The problem powerpc had is that mmu_gather flushing will flush a single
page size based on the ptes it encounters when we zap. If we hit a
different page size, it flushes and switches to the new size. If we have
concurrent zaps on the same range, the other thread may have cleared a
large page pte, so we won't see that and will only do a small page flush
for that range. Which means we can return before the other thread has
invalidated our TLB for the large pages in the range we wanted to flush.

I suspect most archs are probably okay, but if you make any TLB flush
choices based on the pte contents, then you could be exposed. Except in
the case of archs like sparc and powerpc/hash which do the flushing in
arch_leave_lazy_mmu_mode(), because that is called under the same page
table lock, so there can't be a concurrent zap.

A quick look through the archs doesn't show anything obvious, but please
take a look at your arch.

And I'll try to do a bit more testing.

Thanks,
Nick
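
The race described for powerpc can be illustrated with a small toy model. This is only a sketch and not kernel code: ToyGather, clear_ptes, finish, run_race and the flush_all_sizes switch are all hypothetical names invented for the illustration, the "TLB" is a plain set of (address, page size) pairs, and flushing every page size is shown merely as one conceivable conservative behaviour, not as what any architecture actually does.

```python
SMALL, LARGE = "small", "large"  # illustrative page-size labels


class ToyGather:
    """Toy stand-in for mmu_gather: remembers one page size at a time."""

    def __init__(self, tlb, start, end):
        self.tlb, self.start, self.end = tlb, start, end
        self.page_size = None

    def saw_pte(self, size):
        # Hitting a different page size flushes, then switches to the new size.
        if self.page_size not in (None, size):
            self.flush(self.page_size)
        self.page_size = size

    def flush(self, size):
        # Invalidate only TLB entries of `size` inside [start, end).
        stale = {(a, s) for (a, s) in self.tlb
                 if s == size and self.start <= a < self.end}
        self.tlb.difference_update(stale)


def clear_ptes(page_table, gather):
    """Clear every pte in the gather's range, noting the page sizes seen."""
    for addr in range(gather.start, gather.end):
        size = page_table.pop(addr, None)
        if size is not None:
            gather.saw_pte(size)


def finish(gather, flush_all_sizes=False):
    """Final flush; falls back to a small-page flush if no pte was seen."""
    if flush_all_sizes:
        sizes = (SMALL, LARGE)  # hypothetical conservative behaviour
    else:
        sizes = (gather.page_size or SMALL,)
    for size in sizes:
        gather.flush(size)


def run_race(flush_all_sizes):
    page_table = {0: LARGE, 1: SMALL}   # addr -> page size of the pte
    tlb = {(0, LARGE), (1, SMALL)}      # translations cached on our CPU
    other = ToyGather(tlb, 0, 2)
    ours = ToyGather(tlb, 0, 2)
    # The other thread clears the large pte but has not flushed yet.
    other.saw_pte(page_table.pop(0))
    # Our thread zaps the same range and only sees the small pte.
    clear_ptes(page_table, ours)
    finish(ours, flush_all_sizes)
    return tlb  # what is still cached when our thread returns


if __name__ == "__main__":
    print("size chosen from ptes:", run_race(False))
    print("flush all sizes:      ", run_race(True))
```

In this model, choosing the flush size from the ptes that are still present leaves the large-page translation cached when our thread returns, which is the exposure described above; a flush that does not depend on pte contents leaves nothing behind.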