All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <aaron.lu@intel.com>
To: Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"'Kirill A. Shutemov'" <kirill.shutemov@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [PATCH] mremap: fix race between mremap() and page cleanning
Date: Thu, 17 Nov 2016 15:45:38 +0800	[thread overview]
Message-ID: <20161117074538.GA1713@aaronlu.sh.intel.com> (raw)
In-Reply-To: <026b73f6-ca1d-e7bb-766c-4aaeb7071ce6@intel.com>

[-- Attachment #1: Type: text/plain, Size: 10089 bytes --]

+LKML.

Also attached the kernel patch that enlarges the race window and the
user space test code raceremap.c, which you can put in will-it-scale's
tests directory and run it as:
# ./raceremap_threads -t 2 -s 1

Make sure "cpu0 runs" appeared in the first line.

If you get SEGFAULT:
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu0 runs
cpu1 runs
cpu0: going to remap
testcase:mremap
warmup
cpu1: going to clean the page
cpu1: going to write 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
Segmentation fault

That means the race doesn't occur.

If you get "*cpu1 wrote 2 gets lost":
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu1 runs
testcase:mremap
warmup
cpu0 runs
cpu0: going to remap
cpu1: going to clean the page
cpu1: going to write 2
cpu1 wrote 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
*cpu1 wrote 2 gets lost

That means the race occurred.

Thanks,
Aaron

On Thu, Nov 10, 2016 at 05:16:33PM +0800, Aaron Lu wrote:
> Prior to 3.15, there was a race between zap_pte_range() and
> page_mkclean() where writes to a page could be lost.  Dave Hansen
> discovered by inspection that there is a similar race between
> move_ptes() and  page_mkclean().
> 
> We've been able to reproduce the issue by enlarging the race window with
> a msleep(), but have not been able to hit it without modifying the code.
> So, we think it's a real issue, but is difficult or impossible to hit
> in practice.
> 
> The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
> 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
> this patch is to fix the race between page_mkclean() and mremap().
> 
> Here is one possible way to hit the race: suppose a process mmapped a
> file with READ | WRITE and SHARED, it has two threads and they are bound
> to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1
> did a write to addr X so that CPU1 now has a writable TLB for addr X
> on it. Thread 2 starts mremaping from addr X to Y while thread 1 cleaned
> the page and then did another write to the old addr X again. The 2nd
> write from thread 1 could succeed but the value will get lost.
> 
>         thread 1                           thread 2
>      (bound to CPU1)                    (bound to CPU2)
> 
>   1: write 1 to addr X to get a
>      writeable TLB on this CPU
> 
>                                         2: mremap starts
> 
>                                         3: move_ptes emptied PTE for addr X
>                                            and setup new PTE for addr Y and
>                                            then dropped PTL for X and Y
> 
>   4: page laundering for N by doing
>      fadvise FADV_DONTNEED. When done,
>      pageframe N is deemed clean.
> 
>   5: *write 2 to addr X
> 
>                                         6: tlb flush for addr X
> 
>   7: munmap (Y, pagesize) to make the
>      page unmapped
> 
>   8: fadvise with FADV_DONTNEED again
>      to kick the page off the pagecache
> 
>   9: pread the page from file to verify
>      the value. If 1 is there, it means
>      we have lost the written 2.
> 
>   *the write may or may not cause segmentation fault, it depends on
>   if the TLB is still on the CPU.
> 
> Please note that this is only one specific way of how the race could
> occur, it didn't mean that the race could only occur in exact the above
> config, e.g. more than 2 threads could be involved and fadvise() could
> be done in another thread, etc.
> 
> For anonymous pages, they could race between mremap() and page reclaim:
> THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD
> gets unmapped/splitted/pagedout before the flush tlb happened for the old
> huge PMD in move_page_tables() and we could still write data to it.
> The normal anonymous page has similar situation.
> 
> To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
> if any, did the flush before dropping the PTL. If we did the flush for
> every move_ptes()/move_huge_pmd() call then we do not need to do the
> flush in move_pages_tables() for the whole range. But if we didn't, we
> still need to do the whole range flush.
> 
> Alternatively, we can track which part of the range is flushed in
> move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
> range in move_page_tables(). But that would require multiple tlb flushes
> for the different sub-ranges and should be less efficient than the
> single whole range flush.
> 
> KBuild test on my Sandybridge desktop doesn't show any noticeable change.
> v4.9-rc4:
> real    5m14.048s
> user    32m19.800s
> sys     4m50.320s
> 
> With this commit:
> real    5m13.888s
> user    32m19.330s
> sys     4m51.200s
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  include/linux/huge_mm.h |  2 +-
>  mm/huge_memory.c        |  9 ++++++++-
>  mm/mremap.c             | 30 +++++++++++++++++++++---------
>  3 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9b9f65d99873..e35e6de633b9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -22,7 +22,7 @@ extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned char *vec);
>  extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  			 unsigned long new_addr, unsigned long old_end,
> -			 pmd_t *old_pmd, pmd_t *new_pmd);
> +			 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush);
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned long addr, pgprot_t newprot,
>  			int prot_numa);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cdcd25cb30fe..eff3de359d50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1426,11 +1426,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  
>  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		  unsigned long new_addr, unsigned long old_end,
> -		  pmd_t *old_pmd, pmd_t *new_pmd)
> +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>  {
>  	spinlock_t *old_ptl, *new_ptl;
>  	pmd_t pmd;
>  	struct mm_struct *mm = vma->vm_mm;
> +	bool force_flush = false;
>  
>  	if ((old_addr & ~HPAGE_PMD_MASK) ||
>  	    (new_addr & ~HPAGE_PMD_MASK) ||
> @@ -1455,6 +1456,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		new_ptl = pmd_lockptr(mm, new_pmd);
>  		if (new_ptl != old_ptl)
>  			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
> +			force_flush = true;
>  		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
>  
> @@ -1467,6 +1470,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
>  		if (new_ptl != old_ptl)
>  			spin_unlock(new_ptl);
> +		if (force_flush)
> +			flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
> +		else
> +			*need_flush = true;
>  		spin_unlock(old_ptl);
>  		return true;
>  	}
> diff --git a/mm/mremap.c b/mm/mremap.c
> index da22ad2a5678..6ccecc03f56a 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -104,11 +104,13 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		unsigned long old_addr, unsigned long old_end,
>  		struct vm_area_struct *new_vma, pmd_t *new_pmd,
> -		unsigned long new_addr, bool need_rmap_locks)
> +		unsigned long new_addr, bool need_rmap_locks, bool *need_flush)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_pte, *new_pte, pte;
>  	spinlock_t *old_ptl, *new_ptl;
> +	bool force_flush = false;
> +	unsigned long len = old_end - old_addr;
>  
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -146,6 +148,14 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  				   new_pte++, new_addr += PAGE_SIZE) {
>  		if (pte_none(*old_pte))
>  			continue;
> +
> +		/*
> +		 * We are remapping a dirty PTE, make sure to
> +		 * flush TLB before we drop the PTL for the
> +		 * old PTE or we may race with page_mkclean().
> +		 */
> +		if (pte_present(*old_pte) && pte_dirty(*old_pte))
> +			force_flush = true;
>  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
> @@ -156,6 +166,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
>  	pte_unmap(new_pte - 1);
> +	if (force_flush)
> +		flush_tlb_range(vma, old_end - len, old_end);
> +	else
> +		*need_flush = true;
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
>  	if (need_rmap_locks)
>  		drop_rmap_locks(vma);
> @@ -201,13 +215,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  				if (need_rmap_locks)
>  					take_rmap_locks(vma);
>  				moved = move_huge_pmd(vma, old_addr, new_addr,
> -						    old_end, old_pmd, new_pmd);
> +						    old_end, old_pmd, new_pmd,
> +						    &need_flush);
>  				if (need_rmap_locks)
>  					drop_rmap_locks(vma);
> -				if (moved) {
> -					need_flush = true;
> +				if (moved)
>  					continue;
> -				}
>  			}
>  			split_huge_pmd(vma, old_pmd, old_addr);
>  			if (pmd_trans_unstable(old_pmd))
> @@ -220,11 +233,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			extent = next - new_addr;
>  		if (extent > LATENCY_LIMIT)
>  			extent = LATENCY_LIMIT;
> -		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
> -			  new_vma, new_pmd, new_addr, need_rmap_locks);
> -		need_flush = true;
> +		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
> +			  new_pmd, new_addr, need_rmap_locks, &need_flush);
>  	}
> -	if (likely(need_flush))
> +	if (need_flush)
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> -- 
> 2.5.5
> 

[-- Attachment #2: 0001-mremap-add-a-2s-delay-for-MAP_FIXED-case.patch --]
[-- Type: text/plain, Size: 0 bytes --]



WARNING: multiple messages have this Message-ID (diff)
From: Aaron Lu <aaron.lu@intel.com>
To: Linux Memory Management List <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"'Kirill A. Shutemov'" <kirill.shutemov@linux.intel.com>,
	Huang Ying <ying.huang@intel.com>
Subject: Re: [PATCH] mremap: fix race between mremap() and page cleanning
Date: Thu, 17 Nov 2016 15:45:38 +0800	[thread overview]
Message-ID: <20161117074538.GA1713@aaronlu.sh.intel.com> (raw)
In-Reply-To: <026b73f6-ca1d-e7bb-766c-4aaeb7071ce6@intel.com>

[-- Attachment #1: Type: text/plain, Size: 10089 bytes --]

+LKML.

Also attached the kernel patch that enlarges the race window and the
user space test code raceremap.c, which you can put in will-it-scale's
tests directory and run it as:
# ./raceremap_threads -t 2 -s 1

Make sure "cpu0 runs" appeared in the first line.

If you get SEGFAULT:
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu0 runs
cpu1 runs
cpu0: going to remap
testcase:mremap
warmup
cpu1: going to clean the page
cpu1: going to write 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
Segmentation fault

That means the race doesn't occur.

If you get "*cpu1 wrote 2 gets lost":
[aaron@aaronlu will-it-scale]$ sudo ./raceremap_threads -t 2 -s 1
cpu1 runs
testcase:mremap
warmup
cpu0 runs
cpu0: going to remap
cpu1: going to clean the page
cpu1: going to write 2
cpu1 wrote 2
min:0 max:0 total:0
min:0 max:0 total:0
cpu0: remap done
*cpu1 wrote 2 gets lost

That means the race occurred.

Thanks,
Aaron

On Thu, Nov 10, 2016 at 05:16:33PM +0800, Aaron Lu wrote:
> Prior to 3.15, there was a race between zap_pte_range() and
> page_mkclean() where writes to a page could be lost.  Dave Hansen
> discovered by inspection that there is a similar race between
> move_ptes() and  page_mkclean().
> 
> We've been able to reproduce the issue by enlarging the race window with
> a msleep(), but have not been able to hit it without modifying the code.
> So, we think it's a real issue, but is difficult or impossible to hit
> in practice.
> 
> The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
> 'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
> this patch is to fix the race between page_mkclean() and mremap().
> 
> Here is one possible way to hit the race: suppose a process mmapped a
> file with READ | WRITE and SHARED, it has two threads and they are bound
> to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread 1
> did a write to addr X so that CPU1 now has a writable TLB for addr X
> on it. Thread 2 starts mremaping from addr X to Y while thread 1 cleaned
> the page and then did another write to the old addr X again. The 2nd
> write from thread 1 could succeed but the value will get lost.
> 
>         thread 1                           thread 2
>      (bound to CPU1)                    (bound to CPU2)
> 
>   1: write 1 to addr X to get a
>      writeable TLB on this CPU
> 
>                                         2: mremap starts
> 
>                                         3: move_ptes emptied PTE for addr X
>                                            and setup new PTE for addr Y and
>                                            then dropped PTL for X and Y
> 
>   4: page laundering for N by doing
>      fadvise FADV_DONTNEED. When done,
>      pageframe N is deemed clean.
> 
>   5: *write 2 to addr X
> 
>                                         6: tlb flush for addr X
> 
>   7: munmap (Y, pagesize) to make the
>      page unmapped
> 
>   8: fadvise with FADV_DONTNEED again
>      to kick the page off the pagecache
> 
>   9: pread the page from file to verify
>      the value. If 1 is there, it means
>      we have lost the written 2.
> 
>   *the write may or may not cause segmentation fault, it depends on
>   if the TLB is still on the CPU.
> 
> Please note that this is only one specific way of how the race could
> occur, it didn't mean that the race could only occur in exact the above
> config, e.g. more than 2 threads could be involved and fadvise() could
> be done in another thread, etc.
> 
> For anonymous pages, they could race between mremap() and page reclaim:
> THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge PMD
> gets unmapped/splitted/pagedout before the flush tlb happened for the old
> huge PMD in move_page_tables() and we could still write data to it.
> The normal anonymous page has similar situation.
> 
> To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
> if any, did the flush before dropping the PTL. If we did the flush for
> every move_ptes()/move_huge_pmd() call then we do not need to do the
> flush in move_pages_tables() for the whole range. But if we didn't, we
> still need to do the whole range flush.
> 
> Alternatively, we can track which part of the range is flushed in
> move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
> range in move_page_tables(). But that would require multiple tlb flushes
> for the different sub-ranges and should be less efficient than the
> single whole range flush.
> 
> KBuild test on my Sandybridge desktop doesn't show any noticeable change.
> v4.9-rc4:
> real    5m14.048s
> user    32m19.800s
> sys     4m50.320s
> 
> With this commit:
> real    5m13.888s
> user    32m19.330s
> sys     4m51.200s
> 
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  include/linux/huge_mm.h |  2 +-
>  mm/huge_memory.c        |  9 ++++++++-
>  mm/mremap.c             | 30 +++++++++++++++++++++---------
>  3 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9b9f65d99873..e35e6de633b9 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -22,7 +22,7 @@ extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned char *vec);
>  extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  			 unsigned long new_addr, unsigned long old_end,
> -			 pmd_t *old_pmd, pmd_t *new_pmd);
> +			 pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush);
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  			unsigned long addr, pgprot_t newprot,
>  			int prot_numa);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cdcd25cb30fe..eff3de359d50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1426,11 +1426,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  
>  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		  unsigned long new_addr, unsigned long old_end,
> -		  pmd_t *old_pmd, pmd_t *new_pmd)
> +		  pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
>  {
>  	spinlock_t *old_ptl, *new_ptl;
>  	pmd_t pmd;
>  	struct mm_struct *mm = vma->vm_mm;
> +	bool force_flush = false;
>  
>  	if ((old_addr & ~HPAGE_PMD_MASK) ||
>  	    (new_addr & ~HPAGE_PMD_MASK) ||
> @@ -1455,6 +1456,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		new_ptl = pmd_lockptr(mm, new_pmd);
>  		if (new_ptl != old_ptl)
>  			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +		if (pmd_present(*old_pmd) && pmd_dirty(*old_pmd))
> +			force_flush = true;
>  		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
>  		VM_BUG_ON(!pmd_none(*new_pmd));
>  
> @@ -1467,6 +1470,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
>  		set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
>  		if (new_ptl != old_ptl)
>  			spin_unlock(new_ptl);
> +		if (force_flush)
> +			flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
> +		else
> +			*need_flush = true;
>  		spin_unlock(old_ptl);
>  		return true;
>  	}
> diff --git a/mm/mremap.c b/mm/mremap.c
> index da22ad2a5678..6ccecc03f56a 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -104,11 +104,13 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  		unsigned long old_addr, unsigned long old_end,
>  		struct vm_area_struct *new_vma, pmd_t *new_pmd,
> -		unsigned long new_addr, bool need_rmap_locks)
> +		unsigned long new_addr, bool need_rmap_locks, bool *need_flush)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_pte, *new_pte, pte;
>  	spinlock_t *old_ptl, *new_ptl;
> +	bool force_flush = false;
> +	unsigned long len = old_end - old_addr;
>  
>  	/*
>  	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> @@ -146,6 +148,14 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  				   new_pte++, new_addr += PAGE_SIZE) {
>  		if (pte_none(*old_pte))
>  			continue;
> +
> +		/*
> +		 * We are remapping a dirty PTE, make sure to
> +		 * flush TLB before we drop the PTL for the
> +		 * old PTE or we may race with page_mkclean().
> +		 */
> +		if (pte_present(*old_pte) && pte_dirty(*old_pte))
> +			force_flush = true;
>  		pte = ptep_get_and_clear(mm, old_addr, old_pte);
>  		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
> @@ -156,6 +166,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>  	if (new_ptl != old_ptl)
>  		spin_unlock(new_ptl);
>  	pte_unmap(new_pte - 1);
> +	if (force_flush)
> +		flush_tlb_range(vma, old_end - len, old_end);
> +	else
> +		*need_flush = true;
>  	pte_unmap_unlock(old_pte - 1, old_ptl);
>  	if (need_rmap_locks)
>  		drop_rmap_locks(vma);
> @@ -201,13 +215,12 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  				if (need_rmap_locks)
>  					take_rmap_locks(vma);
>  				moved = move_huge_pmd(vma, old_addr, new_addr,
> -						    old_end, old_pmd, new_pmd);
> +						    old_end, old_pmd, new_pmd,
> +						    &need_flush);
>  				if (need_rmap_locks)
>  					drop_rmap_locks(vma);
> -				if (moved) {
> -					need_flush = true;
> +				if (moved)
>  					continue;
> -				}
>  			}
>  			split_huge_pmd(vma, old_pmd, old_addr);
>  			if (pmd_trans_unstable(old_pmd))
> @@ -220,11 +233,10 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  			extent = next - new_addr;
>  		if (extent > LATENCY_LIMIT)
>  			extent = LATENCY_LIMIT;
> -		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
> -			  new_vma, new_pmd, new_addr, need_rmap_locks);
> -		need_flush = true;
> +		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
> +			  new_pmd, new_addr, need_rmap_locks, &need_flush);
>  	}
> -	if (likely(need_flush))
> +	if (need_flush)
>  		flush_tlb_range(vma, old_end-len, old_addr);
>  
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
> -- 
> 2.5.5
> 

[-- Attachment #2: 0001-mremap-add-a-2s-delay-for-MAP_FIXED-case.patch --]
[-- Type: text/plain, Size: 4339 bytes --]

>From c529dfa6bdfc643a9c3debb4b61b9b0c13b0862e Mon Sep 17 00:00:00 2001
From: Aaron Lu <aaron.lu@intel.com>
Date: Thu, 17 Nov 2016 15:11:08 +0800
Subject: [PATCH] mremap: add a 2s delay for MAP_FIXED case

Add a 2s delay for MAP_FIXED case to enlarge the race window so that we
can hit the race in user space.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 fs/exec.c          |  2 +-
 include/linux/mm.h |  2 +-
 mm/mremap.c        | 19 ++++++++++++-------
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 4e497b9ee71e..1e49ce9a23bd 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -619,7 +619,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * process cleanup to remove whatever mess we made.
 	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length, false))
+				       vma, new_start, length, false, false))
 		return -ENOMEM;
 
 	lru_add_drain();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a92c8d73aeaf..5e35fe3d914a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1392,7 +1392,7 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks);
+		bool need_rmap_locks, bool delay);
 extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, pgprot_t newprot,
 			      int dirty_accountable, int prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index da22ad2a5678..8e35279ca622 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -22,6 +22,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
+#include <linux/delay.h>
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
@@ -166,7 +167,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks)
+		bool need_rmap_locks, bool delay)
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
@@ -224,8 +225,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			  new_vma, new_pmd, new_addr, need_rmap_locks);
 		need_flush = true;
 	}
-	if (likely(need_flush))
+	if (likely(need_flush)) {
+		if (delay)
+			msleep(2000);
 		flush_tlb_range(vma, old_end-len, old_addr);
+	}
 
 	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
@@ -234,7 +238,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 static unsigned long move_vma(struct vm_area_struct *vma,
 		unsigned long old_addr, unsigned long old_len,
-		unsigned long new_len, unsigned long new_addr, bool *locked)
+		unsigned long new_len, unsigned long new_addr,
+		bool *locked, bool delay)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
@@ -273,7 +278,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
-				     need_rmap_locks);
+				     need_rmap_locks, delay);
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
@@ -287,7 +292,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * and then proceed to unmap new area instead of old.
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
-				 true);
+				 true, delay);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -442,7 +447,7 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (offset_in_page(ret))
 		goto out1;
 
-	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked);
+	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, true);
 	if (!(offset_in_page(ret)))
 		goto out;
 out1:
@@ -576,7 +581,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			goto out;
 		}
 
-		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked);
+		ret = move_vma(vma, addr, old_len, new_len, new_addr, &locked, false);
 	}
 out:
 	if (offset_in_page(ret)) {
-- 
2.5.5


[-- Attachment #3: raceremap.c --]
[-- Type: text/plain, Size: 2717 bytes --]

#define _GNU_SOURCE
#define _XOPEN_SOURCE 500
#include <sched.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <sys/io.h>

#define BUFLEN 4096

static char wistmpfile[] = "/mnt/willitscale.XXXXXX";

char *testcase_description = "mremap";

char *buf;
char *newbuf = (char *)0x700000000000;
#define FILE_SIZE (4096*128)

static void mdelay(int ms)
{
	int i;

	// gain io permission for the delay
	assert(ioperm(0x80, 8, 1) == 0);

	for (i = 0; i < ms; i++)
		inb(0x80);
}

void testcase_prepare(void)
{
	int fd = mkstemp(wistmpfile);

	assert(fd >= 0);
	assert(pwrite(fd, "X", 1, FILE_SIZE-1) == 1);
	buf = mmap(NULL, FILE_SIZE, PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, 0);
	assert(buf != (void *)-1);
	close(fd);
}

static volatile int step = 0;

void testcase(unsigned long long *iterations)
{
	int cpu = sched_getcpu();
	int fd = open(wistmpfile, O_RDWR);
	off_t offset = sched_getcpu() * BUFLEN;
	long counterread = 0;
	long *counterbuf = (void *)&buf[offset];
	assert(fd >= 0);

	printf("cpu%d runs\n", cpu);

	while (1) {
		int ret;

		if (cpu == 0) {
			void *tmpbuf;

			// wait for step 1 done
			while (step < 1);

			// step 2: start mremap to have the old PTE emptied
			printf("cpu%d: going to remap\n", cpu);
			step = 2;
			tmpbuf = mremap(buf, FILE_SIZE, FILE_SIZE,
					MREMAP_FIXED | MREMAP_MAYMOVE,
					newbuf);
			assert(tmpbuf == newbuf);
			printf("cpu%d: remap done\n", cpu);
			pause();
		}

		// step 1: dirty the old PTE
		*counterbuf = 1;

		step = 1;
		while (step < 2);

		// step 3: clean this page
		// delay a little while to give mremap some time
		// to empty the old PTE and setup new PTE
		mdelay(1000);
		printf("cpu%d: going to clean the page\n", cpu);
		posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED);

		// step 4: now the page is cleaned, its new PTE is
		// write protected but since mremap didn't flush tlb
		// for the old PTE yet, we could still access the old
		// addr and that will not dirty anything
		printf("cpu%d: going to write 2\n", cpu);
		*counterbuf = 2;
		printf("cpu%d wrote 2\n", cpu);

		// step 5: drop this page from page cache and then
		// read it back to verify if the last write gets lost
		// munmap the page first, or the FADV_DONTNEED won't
		// kick the page out of page cache
		munmap(newbuf + offset, BUFLEN);
		posix_fadvise(fd, offset, BUFLEN, POSIX_FADV_DONTNEED);
		ret = pread(fd, &counterread, sizeof(counterread), offset);
		assert(ret == sizeof(counterread));

		if (counterread != 2) {
			printf("*cpu%d wrote 2 gets lost\n", cpu);
			fflush(stdout);
		}
		exit(0);
	}
}

void testcase_cleanup(void)
{
	unlink(wistmpfile);
}

  reply	other threads:[~2016-11-17  7:45 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-10  9:16 [PATCH] mremap: fix race between mremap() and page cleanning Aaron Lu
2016-11-17  7:45 ` Aaron Lu [this message]
2016-11-17  7:45   ` Aaron Lu
2016-11-17 17:53 ` Linus Torvalds
2016-11-18  2:48   ` Aaron Lu
2016-11-28  8:37     ` [PATCH 0/2] use mmu gather logic for tlb flush in mremap Aaron Lu
2016-11-28  8:37       ` Aaron Lu
2016-11-28  8:39       ` [PATCH 1/2] tlb: export tlb_flush_mmu_tlbonly Aaron Lu
2016-11-28  8:39         ` Aaron Lu
2016-11-28  8:40       ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Aaron Lu
2016-11-28  8:40         ` Aaron Lu
2016-11-28 17:15         ` Linus Torvalds
2016-11-28 17:15           ` Linus Torvalds
2016-11-29  2:57           ` [PATCH] mremap: move_ptes: check pte dirty after its removal Aaron Lu
2016-11-29  2:57             ` Aaron Lu
2016-11-29  3:06             ` Linus Torvalds
2016-11-29  3:06               ` Linus Torvalds
2016-11-29  3:22               ` Aaron Lu
2016-11-29  3:22                 ` Aaron Lu
2016-11-29  5:27               ` [PATCH update] " Aaron Lu
2016-11-29  5:27                 ` Aaron Lu
2016-11-28 17:32         ` [PATCH 2/2] mremap: use mmu gather logic for tlb flush in mremap Dave Hansen
2016-11-28 17:32           ` Dave Hansen
2016-11-28 17:42           ` Linus Torvalds
2016-11-28 17:42             ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161117074538.GA1713@aaronlu.sh.intel.com \
    --to=aaron.lu@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.