* [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings @ 2017-02-15 21:46 Mike Kravetz 2017-02-16 18:41 ` Andrea Arcangeli 0 siblings, 1 reply; 11+ messages in thread From: Mike Kravetz @ 2017-02-15 21:46 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Mike Kravetz, Andrea Arcangeli, Andrew Morton, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. 
David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> --- mm/hugetlb.c | 25 +++++++++++++++++++++-- mm/userfaultfd.c | 60 +++++++++++++++++++++++++++++++++++++++++--------------- 2 files changed, 67 insertions(+), 18 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d0d1d08..41f6c51 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, __SetPageUptodate(page); set_page_huge_active(page); + /* + * If shared, add to page cache + */ + if (dst_vma->vm_flags & VM_SHARED) { + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); + + ret = huge_add_to_page_cache(page, mapping, idx); + if (ret) + goto out_release_nounlock; + } + ptl = huge_pte_lockptr(h, dst_mm, dst_pte); spin_lock(ptl); @@ -4036,8 +4048,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!huge_pte_none(huge_ptep_get(dst_pte))) goto out_release_unlock; - ClearPagePrivate(page); - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + if (dst_vma->vm_flags & VM_SHARED) { + page_dup_rmap(page, true); + } else { + ClearPagePrivate(page); + hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + } _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE); if (dst_vma->vm_flags & VM_WRITE) @@ -4054,11 +4070,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, update_mmu_cache(dst_vma, dst_addr, dst_pte); spin_unlock(ptl); + if (dst_vma->vm_flags & VM_SHARED) + unlock_page(page); ret = 0; out: return ret; out_release_unlock: spin_unlock(ptl); +out_release_nounlock: + if (dst_vma->vm_flags & VM_SHARED) + unlock_page(page); put_page(page); goto out; } diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b861cf9..6703308 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -211,11 +211,9 @@ static __always_inline ssize_t 
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Make sure the vma is not shared, that the remaining dst - * range is both valid and fully within a single existing vma. + * Make sure the remaining dst range is both valid and + * fully within a single existing vma. */ - if (dst_vma->vm_flags & VM_SHARED) - goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; @@ -226,11 +224,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Ensure the dst_vma has a anon_vma. + * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; - if (unlikely(anon_vma_prepare(dst_vma))) - goto out_unlock; + if (!(dst_vma->vm_flags & VM_SHARED)) { + if (unlikely(anon_vma_prepare(dst_vma))) + goto out_unlock; + } h = hstate_vma(dst_vma); @@ -306,18 +306,45 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, if (page) { /* * We encountered an error and are about to free a newly - * allocated huge page. It is possible that there was a - * reservation associated with the page that has been - * consumed. See the routine restore_reserve_on_error - * for details. Unfortunately, we can not call - * restore_reserve_on_error now as it would require holding - * mmap_sem. Clear the PagePrivate flag so that the global + * allocated huge page. + * + * Reservation handling is very subtle, and is different for + * private and shared mappings. See the routine + * restore_reserve_on_error for details. Unfortunately, we + * can not call restore_reserve_on_error now as it would + * require holding mmap_sem. + * + * If a reservation for the page existed in the reservation + * map of a private mapping, the map was modified to indicate + * the reservation was consumed when the page was allocated. + * We clear the PagePrivate flag now so that the global * reserve count will not be incremented in free_huge_page. 
* The reservation map will still indicate the reservation * was consumed and possibly prevent later page allocation. - * This is better than leaking a global reservation. + * This is better than leaking a global reservation. If no + * reservation existed, it is still safe to clear PagePrivate + * as no adjustments to reservation counts were made during + * allocation. + * + * The reservation map for shared mappings indicates which + * pages have reservations. When a huge page is allocated + * for an address with a reservation, no change is made to + * the reserve map. In this case PagePrivate will be set + * to indicate that the global reservation count should be + * incremented when the page is freed. This is the desired + * behavior. However, when a huge page is allocated for an + * address without a reservation a reservation entry is added + * to the reservation map, and PagePrivate will not be set. + * When the page is freed, the global reserve count will NOT + * be incremented and it will appear as though we have leaked + * reserved page. In this case, set PagePrivate so that the + * global reserve count will be incremented to match the + * reservation map entry which was created. */ - ClearPagePrivate(page); + if (dst_vma->vm_flags & VM_SHARED) + SetPagePrivate(page); + else + ClearPagePrivate(page); put_page(page); } BUG_ON(copied < 0); @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, goto out_unlock; err = -EINVAL; - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && + dst_vma->vm_flags & VM_SHARED) goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-15 21:46 [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings Mike Kravetz @ 2017-02-16 18:41 ` Andrea Arcangeli 2017-02-17 0:18 ` Mike Kravetz 2017-02-21 13:25 ` Kirill A. Shutemov 0 siblings, 2 replies; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-16 18:41 UTC (permalink / raw) To: Mike Kravetz Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote: > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d0d1d08..41f6c51 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > __SetPageUptodate(page); > set_page_huge_active(page); > > + /* > + * If shared, add to page cache > + */ > + if (dst_vma->vm_flags & VM_SHARED) { Minor nitpick, this could be a: int vm_shared = dst_vma->vm_flags & VM_SHARED; (int faster than bool here as VM_SHARED won't have to be converted into 0|1) > @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, > goto out_unlock; > > err = -EINVAL; > - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) > + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && > + dst_vma->vm_flags & VM_SHARED) > goto out_unlock; Other minor nitpick, this could have been faster as: if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED) Thinking twice, the only case we need to rule out is shmem_zero_setup (it's not anon vmas can be really VM_SHARED or they wouldn't be anon vmas in the first place) so even the above is superfluous because shmem_zero_setup does: vma->vm_ops = &shmem_vm_ops; So I would turn it into: /* * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but * it will overwrite vm_ops, so vma_is_anonymous must return false. 
*/ if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)) goto out_unlock; Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> --- Changing topic: thinking about the last part, I was wondering about vma_is_anonymous because it's well known issue, it's only guaranteed fully correct during page faults because the weird cases that can return a false positive cannot generate page faults and they run remap_pfn_range before releasing the mmap_sem for writing within mmap(2) itself. More than a year ago I discussed with Kirill (CC'ed) the vma_is_anonymous() false positives: some VM_IO vma leaves vma->vm_ops NULL, which is generally non concerning because putting an anon page in there shouldn't have any adverse side effects for the rest of the system. It was critical for khugepaged, because khugepaged will run out of the app own control, so khugepaged must be fully accurate of it'll just activate on VM_IO special mappings, but that's definitely not the case for userfaultfd. Still I was wondering if we could be more strict by adding a vma->vm_flags & VM_NO_KHUGEPAGED check so that VM_IO regions will not pass registration (only in fs/userfaultfd.c:vma_can_userfault; mm/userfaultfd.c doesn't need any of that as it requires registration to be passed first and vm_userfaultfd_ctx already armed). A fully accurate vma_is_anonymous could be implemented like this: vma_is_anonymous && !(vm_flags & VM_NO_KHUGEPAGED) Which is what khugepaged internally uses. This will also work for MAP_PRIVATE on /dev/zero (required to work by the non cooperative case). Side note: MAP_PRIVATE /dev/zero is very special and it's the only case in the kernel a "true" anon vma has vm_file set. 
I think it'd be cleaner to "decouple" the vma from /dev/zero the same way MAP_SHARED|MAP_ANON "couples" the vma to a pseudo /dev/zero, so anon vmas would always have vm_file NULL (not done because it'd risk to break stuff by changing the /proc/pid/maps output), but that's besides the point and the above solution deployed already by khugepaged already works with the current /dev/zero MAP_PRIVATE code. Kirill what's your take on making the registration checks stricter? If we would add a vma_is_anonymous_not_in_fault implemented like above vma_can_userfault would just need a s/vma_is_anonymous/vma_is_anonymous_not_in_fault/ and it would be more strict. khugepaged could be then converted to use it too instead of hardcoding this vm_flags check. Unless I'm mistaken I would consider such a change to the registration code, purely a cleanup to add a more strict check. Alternatively we could just leave things as is and depend on apps using VM_IO with remap_pfn_range with vm_file set and vm_ops NULL, not to screw themselves up by filling their regions with anon pages. I don't see how it could break anything (except themselves) if they do, but I'd appreciate a second thought on that. And at least for the remap_pfn_range users it won't even get that far, because a graceful -EEXIST would be returned instead. The change would effectively convert a -EEXIST returned later into a stricter -EINVAL returned early at registration time, for a case that is erratic by design. Thanks, Andrea ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-16 18:41 ` Andrea Arcangeli @ 2017-02-17 0:18 ` Mike Kravetz 2017-02-17 15:52 ` Andrea Arcangeli 2017-02-21 13:25 ` Kirill A. Shutemov 1 sibling, 1 reply; 11+ messages in thread From: Mike Kravetz @ 2017-02-17 0:18 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On 02/16/2017 10:41 AM, Andrea Arcangeli wrote: > On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote: >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index d0d1d08..41f6c51 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, >> __SetPageUptodate(page); >> set_page_huge_active(page); >> >> + /* >> + * If shared, add to page cache >> + */ >> + if (dst_vma->vm_flags & VM_SHARED) { > > Minor nitpick, this could be a: > > int vm_shared = dst_vma->vm_flags & VM_SHARED; > > (int faster than bool here as VM_SHARED won't have to be converted into 0|1) > >> @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, >> goto out_unlock; >> >> err = -EINVAL; >> - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) >> + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && >> + dst_vma->vm_flags & VM_SHARED) >> goto out_unlock; > > Other minor nitpick, this could have been faster as: > > if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED) > > Thinking twice, the only case we need to rule out is shmem_zero_setup > (it's not anon vmas can be really VM_SHARED or they wouldn't be anon > vmas in the first place) so even the above is superfluous because > shmem_zero_setup does: > > vma->vm_ops = &shmem_vm_ops; > > So I would turn it into: > > > /* > * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but > * it will overwrite vm_ops, so 
vma_is_anonymous must return false. > */ > if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)) > goto out_unlock; > > > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> > --- Thanks Andrea, I incorporated your suggestions into a new version of the patch. While changing (dst_vma->vm_flags & VM_SHARED) to integers, I noticed an issue in the error path of __mcopy_atomic_hugetlb(). > */ > - ClearPagePrivate(page); > + if (dst_vma->vm_flags & VM_SHARED) > + SetPagePrivate(page); > + else > + ClearPagePrivate(page); > put_page(page); We can not use dst_vma here as it may be different than the vma for which the page was originally allocated, or even NULL. Remember, we may drop mmap_sem and look up dst_vma again. Therefore, we need to save the value of (dst_vma->vm_flags & VM_SHARED) for the vma which was used when the page was allocated. This change as well as your suggestions are in the patch below: >From e13d3b4ab24f2c423a8b0d645f0ea715c285880d Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Wed, 15 Feb 2017 13:02:41 -0800 Subject: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. 
David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> --- mm/hugetlb.c | 26 ++++++++++++++++++-- mm/userfaultfd.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 82 insertions(+), 18 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d0d1d08..2e0e815 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3993,6 +3993,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, unsigned long src_addr, struct page **pagep) { + int vm_shared = dst_vma->vm_flags & VM_SHARED; struct hstate *h = hstate_vma(dst_vma); pte_t _dst_pte; spinlock_t *ptl; @@ -4029,6 +4030,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, __SetPageUptodate(page); set_page_huge_active(page); + /* + * If shared, add to page cache + */ + if (vm_shared) { + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); + + ret = huge_add_to_page_cache(page, mapping, idx); + if (ret) + goto out_release_nounlock; + } + ptl = huge_pte_lockptr(h, dst_mm, dst_pte); spin_lock(ptl); @@ -4036,8 +4049,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!huge_pte_none(huge_ptep_get(dst_pte))) goto out_release_unlock; - ClearPagePrivate(page); - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + if (vm_shared) { + page_dup_rmap(page, true); + } else { + ClearPagePrivate(page); + hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + } _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE); if (dst_vma->vm_flags & VM_WRITE) @@ -4054,11 +4071,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, update_mmu_cache(dst_vma, dst_addr, dst_pte); spin_unlock(ptl); + if (vm_shared) + unlock_page(page); ret = 0; out: return ret; out_release_unlock: spin_unlock(ptl); +out_release_nounlock: + if (vm_shared) + unlock_page(page); put_page(page); goto out; } diff --git 
a/mm/userfaultfd.c b/mm/userfaultfd.c index b861cf9..2fc0e76 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -154,6 +154,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, unsigned long len, bool zeropage) { + int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; + int vm_shared = dst_vma->vm_flags & VM_SHARED; ssize_t err; pte_t *dst_pte; unsigned long src_addr, dst_addr; @@ -211,14 +213,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Make sure the vma is not shared, that the remaining dst - * range is both valid and fully within a single existing vma. + * Make sure the remaining dst range is both valid and + * fully within a single existing vma. */ - if (dst_vma->vm_flags & VM_SHARED) - goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; + + vm_shared = dst_vma->vm_flags & VM_SHARED; } if (WARN_ON(dst_addr & (vma_hpagesize - 1) || @@ -226,11 +228,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Ensure the dst_vma has a anon_vma. + * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; - if (unlikely(anon_vma_prepare(dst_vma))) - goto out_unlock; + if (!vm_shared) { + if (unlikely(anon_vma_prepare(dst_vma))) + goto out_unlock; + } h = hstate_vma(dst_vma); @@ -267,6 +271,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, dst_addr, src_addr, &page); mutex_unlock(&hugetlb_fault_mutex_table[hash]); + vm_alloc_shared = vm_shared; cond_resched(); @@ -306,18 +311,49 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, if (page) { /* * We encountered an error and are about to free a newly - * allocated huge page. It is possible that there was a - * reservation associated with the page that has been - * consumed. See the routine restore_reserve_on_error - * for details. 
Unfortunately, we can not call - * restore_reserve_on_error now as it would require holding - * mmap_sem. Clear the PagePrivate flag so that the global + * allocated huge page. + * + * Reservation handling is very subtle, and is different for + * private and shared mappings. See the routine + * restore_reserve_on_error for details. Unfortunately, we + * can not call restore_reserve_on_error now as it would + * require holding mmap_sem. + * + * If a reservation for the page existed in the reservation + * map of a private mapping, the map was modified to indicate + * the reservation was consumed when the page was allocated. + * We clear the PagePrivate flag now so that the global * reserve count will not be incremented in free_huge_page. * The reservation map will still indicate the reservation * was consumed and possibly prevent later page allocation. - * This is better than leaking a global reservation. + * This is better than leaking a global reservation. If no + * reservation existed, it is still safe to clear PagePrivate + * as no adjustments to reservation counts were made during + * allocation. + * + * The reservation map for shared mappings indicates which + * pages have reservations. When a huge page is allocated + * for an address with a reservation, no change is made to + * the reserve map. In this case PagePrivate will be set + * to indicate that the global reservation count should be + * incremented when the page is freed. This is the desired + * behavior. However, when a huge page is allocated for an + * address without a reservation a reservation entry is added + * to the reservation map, and PagePrivate will not be set. + * When the page is freed, the global reserve count will NOT + * be incremented and it will appear as though we have leaked + * reserved page. In this case, set PagePrivate so that the + * global reserve count will be incremented to match the + * reservation map entry which was created. 
+ * + * Note that vm_alloc_shared is based on the flags of the vma + * for which the page was originally allocated. dst_vma could + * be different or NULL on error. */ - ClearPagePrivate(page); + if (vm_alloc_shared) + SetPagePrivate(page); + else + ClearPagePrivate(page); put_page(page); } BUG_ON(copied < 0); @@ -386,8 +422,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, goto out_unlock; err = -EINVAL; - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) + /* + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but + * it will overwrite vm_ops, so vma_is_anonymous must return false. + */ + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && + dst_vma->vm_flags & VM_SHARED)) goto out_unlock; + if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 0:18 ` Mike Kravetz @ 2017-02-17 15:52 ` Andrea Arcangeli 2017-02-17 20:17 ` Andrew Morton 0 siblings, 1 reply; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-17 15:52 UTC (permalink / raw) To: Mike Kravetz Cc: Andrew Morton, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Thu, Feb 16, 2017 at 04:18:51PM -0800, Mike Kravetz wrote: > Thanks Andrea, I incorporated your suggestions into a new version of the patch. > While changing (dst_vma->vm_flags & VM_SHARED) to integers, I noticed an issue > in the error path of __mcopy_atomic_hugetlb(). Indeed good point! > + int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; > + int vm_shared = dst_vma->vm_flags & VM_SHARED; Other minor nitpick, this could have been: int vm_shared = vm_alloc_shared; But I'm sure gcc will emit the same asm. For greppability (if such word exist) calling it vm_shared_alloc would have been preferable. We can clean it up post upstream merge or it should be diffed against mm latest or it may cause more rejects. Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> The patches were not against latest -mm so I solved the rejects during merge in my tree. Then I looked at the result of all my merges after everything is applied and I think I spotted a merge gone wrong in this patch: https://ozlabs.org/~akpm/mmots/broken-out/userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found.patch Below is a hand edited git diff that shows the only meaningful difference. The below should be included in userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found.patch or as -fix2 at the end. Everything else is identical which is great. Mike Rapoport could you verify the below hunk is missing in mm? 
Once it'll all be merged upstream then there will be less merge crunch as we've been working somewhat in parallel on the same files, so this is resulting in more merge rejects than ideal :). diff --git a/../mm/mm/userfaultfd.c b/mm/userfaultfd.c index 830bed7..3ec9aad 100644 --- a/../mm/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -199,6 +201,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, dst_vma = find_vma(dst_mm, dst_start); if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) goto out_unlock; + /* + * Only allow __mcopy_atomic_hugetlb on userfaultfd + * registered ranges. + */ + if (!dst_vma->vm_userfaultfd_ctx.ctx) + goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) @@ -214,16 +224,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges. - */ - if (!dst_vma->vm_userfaultfd_ctx.ctx) - goto out_unlock; - - /* * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; Thanks, Andrea ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 15:52 ` Andrea Arcangeli @ 2017-02-17 20:17 ` Andrew Morton 2017-02-17 20:51 ` Andrea Arcangeli 0 siblings, 1 reply; 11+ messages in thread From: Andrew Morton @ 2017-02-17 20:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Fri, 17 Feb 2017 16:52:41 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote: > Everything else is identical which is great. Mike Rapoport could you > verify the below hunk is missing in mm? > > Once it'll all be merged upstream then there will be less merge crunch > as we've been working somewhat in parallel on the same files, so this > is resulting in more merge rejects than ideal :). > > diff --git a/../mm/mm/userfaultfd.c b/mm/userfaultfd.c > index 830bed7..3ec9aad 100644 > --- a/../mm/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -199,6 +201,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > dst_vma = find_vma(dst_mm, dst_start); > if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) > goto out_unlock; > + /* > + * Only allow __mcopy_atomic_hugetlb on userfaultfd > + * registered ranges. > + */ > + if (!dst_vma->vm_userfaultfd_ctx.ctx) > + goto out_unlock; > > if (dst_start < dst_vma->vm_start || > dst_start + len > dst_vma->vm_end) > @@ -214,16 +224,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > goto out_unlock; > > /* > - * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges. > - */ > - if (!dst_vma->vm_userfaultfd_ctx.ctx) > - goto out_unlock; > - > - /* > * If not shared, ensure the dst_vma has a anon_vma. 
> */ I merged this up and a small issue remains: : /* : * Validate alignment based on huge page size : */ : err = -EINVAL; : if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1)) : goto out_unlock; : :retry: : /* : * On routine entry dst_vma is set. If we had to drop mmap_sem and : * retry, dst_vma will be set to NULL and we must lookup again. : */ : if (!dst_vma) { : err = -ENOENT; : dst_vma = find_vma(dst_mm, dst_start); : if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) : goto out_unlock; : /* : * Only allow __mcopy_atomic_hugetlb on userfaultfd : * registered ranges. : */ : if (!dst_vma->vm_userfaultfd_ctx.ctx) : goto out_unlock; : : if (dst_start < dst_vma->vm_start || : dst_start + len > dst_vma->vm_end) : goto out_unlock; : : err = -EINVAL; : if (vma_hpagesize != vma_kernel_pagesize(dst_vma)) : goto out_unlock; : } : : if (WARN_ON(dst_addr & (vma_hpagesize - 1) || : (len - copied) & (vma_hpagesize - 1))) : goto out_unlock; The value of `err' here is EINVAL. That sems appropriate, but it only happens by sheer luck. : /* : * If not shared, ensure the dst_vma has a anon_vma. : */ : err = -ENOMEM; : if (!(dst_vma->vm_flags & VM_SHARED)) { : if (unlikely(anon_vma_prepare(dst_vma))) : goto out_unlock; : } So... --- a/mm/userfaultfd.c~userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found-fix-2-fix +++ a/mm/userfaultfd.c @@ -215,6 +215,7 @@ retry: goto out_unlock; } + err = -EINVAL; if (WARN_ON(dst_addr & (vma_hpagesize - 1) || (len - copied) & (vma_hpagesize - 1))) goto out_unlock; _ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 20:17 ` Andrew Morton @ 2017-02-17 20:51 ` Andrea Arcangeli 2017-02-17 21:08 ` Andrew Morton 0 siblings, 1 reply; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-17 20:51 UTC (permalink / raw) To: Andrew Morton Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Fri, Feb 17, 2017 at 12:17:38PM -0800, Andrew Morton wrote: > I merged this up and a small issue remains: Great! > The value of `err' here is EINVAL. That sems appropriate, but it only > happens by sheer luck. It might have been programmer luck but just for completeness, at runtime no luck was needed (the temporary setting to ENOENT is undone before the if clause is closed). Your addition is surely safer just in case of future changes missing how we inherited the EINVAL in both branches, thanks! (plus the compiler should be able to optimize it away until after it will be needed) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-17 20:51 ` Andrea Arcangeli
@ 2017-02-17 21:08 ` Andrew Morton
  2017-02-17 21:34 ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2017-02-17 21:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov,
      Kirill A. Shutemov

On Fri, 17 Feb 2017 21:51:24 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Fri, Feb 17, 2017 at 12:17:38PM -0800, Andrew Morton wrote:
> > I merged this up and a small issue remains:
>
> Great!
>
> > The value of `err' here is EINVAL.  That seems appropriate, but it only
> > happens by sheer luck.
>
> It might have been programmer luck, but just for completeness: at
> runtime no luck was needed (the temporary setting to ENOENT is undone
> before the if clause is closed).  Your addition is surely safer in case
> future changes miss how we inherited the EINVAL in both branches,
> thanks!  (Plus the compiler should be able to optimize the assignment
> away until it is actually needed.)

OK.  I had a bunch more rejects to fix in that function.  Below is the
final result - please check it carefully.  I'll release another mmotm
tree in the next few hours for runtime testing.

static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
					      struct vm_area_struct *dst_vma,
					      unsigned long dst_start,
					      unsigned long src_start,
					      unsigned long len,
					      bool zeropage)
{
	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
	int vm_shared = dst_vma->vm_flags & VM_SHARED;
	ssize_t err;
	pte_t *dst_pte;
	unsigned long src_addr, dst_addr;
	long copied;
	struct page *page;
	struct hstate *h;
	unsigned long vma_hpagesize;
	pgoff_t idx;
	u32 hash;
	struct address_space *mapping;

	/*
	 * There is no default zero huge page for all huge page sizes as
	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
	 * by THP.  Since we can not reliably insert a zero page, this
	 * feature is not supported.
	 */
	if (zeropage) {
		up_read(&dst_mm->mmap_sem);
		return -EINVAL;
	}

	src_addr = src_start;
	dst_addr = dst_start;
	copied = 0;
	page = NULL;
	vma_hpagesize = vma_kernel_pagesize(dst_vma);

	/*
	 * Validate alignment based on huge page size
	 */
	err = -EINVAL;
	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
		goto out_unlock;

retry:
	/*
	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
	 * retry, dst_vma will be set to NULL and we must lookup again.
	 */
	if (!dst_vma) {
		err = -ENOENT;
		dst_vma = find_vma(dst_mm, dst_start);
		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
			goto out_unlock;
		/*
		 * Only allow __mcopy_atomic_hugetlb on userfaultfd
		 * registered ranges.
		 */
		if (!dst_vma->vm_userfaultfd_ctx.ctx)
			goto out_unlock;

		if (dst_start < dst_vma->vm_start ||
		    dst_start + len > dst_vma->vm_end)
			goto out_unlock;

		vm_shared = dst_vma->vm_flags & VM_SHARED;

		err = -EINVAL;
		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
			goto out_unlock;
	}

	err = -EINVAL;
	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
		    (len - copied) & (vma_hpagesize - 1)))
		goto out_unlock;

	if (dst_start < dst_vma->vm_start ||
	    dst_start + len > dst_vma->vm_end)
		goto out_unlock;

	/*
	 * If not shared, ensure the dst_vma has a anon_vma.
	 */
	err = -ENOMEM;
	if (!vm_shared) {
		if (unlikely(anon_vma_prepare(dst_vma)))
			goto out_unlock;
	}

	h = hstate_vma(dst_vma);

	while (src_addr < src_start + len) {
		pte_t dst_pteval;

		BUG_ON(dst_addr >= dst_start + len);
		VM_BUG_ON(dst_addr & ~huge_page_mask(h));

		/*
		 * Serialize via hugetlb_fault_mutex
		 */
		idx = linear_page_index(dst_vma, dst_addr);
		mapping = dst_vma->vm_file->f_mapping;
		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
								idx, dst_addr);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);

		err = -ENOMEM;
		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
		if (!dst_pte) {
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			goto out_unlock;
		}

		err = -EEXIST;
		dst_pteval = huge_ptep_get(dst_pte);
		if (!huge_pte_none(dst_pteval)) {
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			goto out_unlock;
		}

		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
						dst_addr, src_addr, &page);

		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
		vm_alloc_shared = vm_shared;

		cond_resched();

		if (unlikely(err == -EFAULT)) {
			up_read(&dst_mm->mmap_sem);
			BUG_ON(!page);

			err = copy_huge_page_from_user(page,
						(const void __user *)src_addr,
						pages_per_huge_page(h), true);
			if (unlikely(err)) {
				err = -EFAULT;
				goto out;
			}
			down_read(&dst_mm->mmap_sem);

			dst_vma = NULL;
			goto retry;
		} else
			BUG_ON(page);

		if (!err) {
			dst_addr += vma_hpagesize;
			src_addr += vma_hpagesize;
			copied += vma_hpagesize;

			if (fatal_signal_pending(current))
				err = -EINTR;
		}
		if (err)
			break;
	}

out_unlock:
	up_read(&dst_mm->mmap_sem);
out:
	if (page) {
		/*
		 * We encountered an error and are about to free a newly
		 * allocated huge page.
		 *
		 * Reservation handling is very subtle, and is different for
		 * private and shared mappings.  See the routine
		 * restore_reserve_on_error for details.  Unfortunately, we
		 * can not call restore_reserve_on_error now as it would
		 * require holding mmap_sem.
		 *
		 * If a reservation for the page existed in the reservation
		 * map of a private mapping, the map was modified to indicate
		 * the reservation was consumed when the page was allocated.
		 * We clear the PagePrivate flag now so that the global
		 * reserve count will not be incremented in free_huge_page.
		 * The reservation map will still indicate the reservation
		 * was consumed and possibly prevent later page allocation.
		 * This is better than leaking a global reservation.  If no
		 * reservation existed, it is still safe to clear PagePrivate
		 * as no adjustments to reservation counts were made during
		 * allocation.
		 *
		 * The reservation map for shared mappings indicates which
		 * pages have reservations.  When a huge page is allocated
		 * for an address with a reservation, no change is made to
		 * the reserve map.  In this case PagePrivate will be set
		 * to indicate that the global reservation count should be
		 * incremented when the page is freed.  This is the desired
		 * behavior.  However, when a huge page is allocated for an
		 * address without a reservation a reservation entry is added
		 * to the reservation map, and PagePrivate will not be set.
		 * When the page is freed, the global reserve count will NOT
		 * be incremented and it will appear as though we have leaked
		 * reserved page.  In this case, set PagePrivate so that the
		 * global reserve count will be incremented to match the
		 * reservation map entry which was created.
		 *
		 * Note that vm_alloc_shared is based on the flags of the vma
		 * for which the page was originally allocated.  dst_vma could
		 * be different or NULL on error.
		 */
		if (vm_alloc_shared)
			SetPagePrivate(page);
		else
			ClearPagePrivate(page);

		put_page(page);
	}
	BUG_ON(copied < 0);
	BUG_ON(err > 0);
	BUG_ON(!copied && !err);
	return copied ? copied : err;
}
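The reservation bookkeeping described in that long comment can be
condensed into a toy userspace model (not kernel code; `toy_page`,
`resv_count`, and the helper names are invented, with `priv` standing
in for PagePrivate).  It shows why the error path sets PagePrivate for
shared mappings but clears it for private ones: the goal is that the
global reserve count ends up matching what the reservation map records.

```c
#include <stdbool.h>

/* "priv" mirrors PagePrivate: the free path hands a reservation back
 * to the global pool only when it is set. */
struct toy_page { bool priv; };
static long resv_count;	/* model of the global reservation counter */

static void toy_free_huge_page(struct toy_page *p)
{
	if (p->priv)	/* free_huge_page re-increments the reserve */
		resv_count++;
}

/* Error path of the copy: decide priv from the flags of the vma the
 * page was originally allocated for, as the code does via
 * vm_alloc_shared (dst_vma itself may be different or NULL by now). */
static void toy_error_free(struct toy_page *p, bool vm_alloc_shared)
{
	p->priv = vm_alloc_shared;	/* SetPagePrivate vs ClearPagePrivate */
	toy_free_huge_page(p);
}
```

For the shared case a reservation map entry was created at allocation
time, so the global count must be incremented on free to match; for the
private case the map already records the reservation as consumed, so
incrementing would leak in the other direction.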
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-17 21:08 ` Andrew Morton
@ 2017-02-17 21:34 ` Andrea Arcangeli
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2017-02-17 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov,
      Kirill A. Shutemov

On Fri, Feb 17, 2017 at 01:08:55PM -0800, Andrew Morton wrote:
> I had a bunch more rejects to fix in that function.  Below is the final
> result - please check it carefully.

Sure, reviewed, and this is the diff that remains (the vm_shared
assignment location is irrelevant, I put it at the end as it's only
needed later and not checked in the out_unlock path; err = -EINVAL is
also fine to stay):

diff --git a/tmp/x b/mm/userfaultfd.c
index a3ba029..3ec9aad 100644
--- a/tmp/x
+++ b/mm/userfaultfd.c
@@ -63,22 +212,17 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    dst_start + len > dst_vma->vm_end)
 			goto out_unlock;
 
-		vm_shared = dst_vma->vm_flags & VM_SHARED;
-
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
 			goto out_unlock;
+
+		vm_shared = dst_vma->vm_flags & VM_SHARED;
 	}
 
-	err = -EINVAL;
 	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
 		    (len - copied) & (vma_hpagesize - 1)))
 		goto out_unlock;
 
-	if (dst_start < dst_vma->vm_start ||
-	    dst_start + len > dst_vma->vm_end)
-		goto out_unlock;
-
 	/*
 	 * If not shared, ensure the dst_vma has a anon_vma.
 	 */

In short, only the last 4 lines of the above can be applied.
__mcopy_atomic_hugetlb in the first pass (i.e. dst_vma not NULL) is
invoked after those checks have already been run in the caller:

	if (dst_start < dst_vma->vm_start ||
	    dst_start + len > dst_vma->vm_end)
		goto out_unlock;

	err = -EINVAL;
	/*
	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
	 * it will overwrite vm_ops, so vma_is_anonymous must return false.
	 */
	if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
	    dst_vma->vm_flags & VM_SHARED))
		goto out_unlock;

	/*
	 * If this is a HUGETLB vma, pass off to appropriate routine
	 */
	if (is_vm_hugetlb_page(dst_vma))
		return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
					      src_start, len, zeropage);

As usual hugetlbfs takes its own tangent out of the main VM code after
various checks that also apply to hugetlbfs have already been done.  In
the "retry" case dst_vma is set to NULL and is searched again and
revalidated, so we repeat the check.  The first time it's not needed;
the second time it would be a repetition, and so it's a noop.
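The revalidation performed after a retry can be sketched as a single
predicate in a toy userspace model (not kernel code; `model_vma` and
`revalidate` are invented names; `uffd_ctx` stands in for
`vm_userfaultfd_ctx.ctx`).  It mirrors the order of checks in the
`if (!dst_vma)` block, including which error each failure yields:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_vma {
	unsigned long vm_start, vm_end;
	unsigned long pagesize;	/* vma_kernel_pagesize() result */
	bool is_hugetlb;
	void *uffd_ctx;		/* non-NULL if userfaultfd-registered */
};

static long revalidate(const struct model_vma *vma, unsigned long dst_start,
		       unsigned long len, unsigned long vma_hpagesize)
{
	/* err starts as -ENOENT for the whole lookup phase... */
	if (!vma || !vma->is_hugetlb)
		return -ENOENT;
	if (!vma->uffd_ctx)
		return -ENOENT;
	if (dst_start < vma->vm_start || dst_start + len > vma->vm_end)
		return -ENOENT;
	/* ...and only the page-size mismatch downgrades to -EINVAL */
	if (vma_hpagesize != vma->pagesize)
		return -EINVAL;
	return 0;
}
```

This makes concrete why the range check need not be repeated outside
the block: on first entry the caller already validated the range, and
on retry this predicate re-runs it against the freshly looked-up vma.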
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-16 18:41 ` Andrea Arcangeli
  2017-02-17  0:18 ` Mike Kravetz
@ 2017-02-21 13:25 ` Kirill A. Shutemov
  2017-02-22 15:15 ` Andrea Arcangeli
  1 sibling, 1 reply; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-02-21 13:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Thu, Feb 16, 2017 at 07:41:00PM +0100, Andrea Arcangeli wrote:
> Kirill, what's your take on making the registration checks stricter?
> If we added a vma_is_anonymous_not_in_fault implemented like above,
> vma_can_userfault would just need a
> s/vma_is_anonymous/vma_is_anonymous_not_in_fault/ and it would be more
> strict.  khugepaged could then be converted to use it too instead of
> hardcoding this vm_flags check.  Unless I'm mistaken, I would consider
> such a change to the registration code purely a cleanup to add a
> stricter check.

[sorry for the late response]

I think a stricter vma_is_anonymous() is a good thing.

But I don't see a point in introducing one more helper.  Let's just
make the existing helper work better.

-- 
 Kirill A. Shutemov
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-21 13:25 ` Kirill A. Shutemov
@ 2017-02-22 15:15 ` Andrea Arcangeli
  2017-02-23 14:56 ` Kirill A. Shutemov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2017-02-22 15:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Tue, Feb 21, 2017 at 04:25:45PM +0300, Kirill A. Shutemov wrote:
> I think a stricter vma_is_anonymous() is a good thing.
>
> But I don't see a point in introducing one more helper.  Let's just
> make the existing helper work better.

That would be simpler, agreed.  The point of having an "unsafe" faster
version was only for code running in page fault context, where the
additional check is unnecessary.
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-22 15:15 ` Andrea Arcangeli
@ 2017-02-23 14:56 ` Kirill A. Shutemov
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-02-23 14:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Wed, Feb 22, 2017 at 04:15:07PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 21, 2017 at 04:25:45PM +0300, Kirill A. Shutemov wrote:
> > I think a stricter vma_is_anonymous() is a good thing.
> >
> > But I don't see a point in introducing one more helper.  Let's just
> > make the existing helper work better.
>
> That would be simpler, agreed.  The point of having an "unsafe" faster
> version was only for code running in page fault context, where the
> additional check is unnecessary.

Well, I don't think the cost of the additional check is significant
here.  And we can bring ->vm_ops a bit closer to ->vm_flags to avoid a
potential cache miss.

-- 
 Kirill A. Shutemov
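The stricter helper being discussed can be sketched in a toy userspace
model (hypothetical; the real `vma_is_anonymous` lives in the kernel's
mm headers, the `toy_` names and the simplified structs are invented
here, and the stricter semantics were only under discussion in this
thread).  The point is that `!vm_ops` alone would also accept a bogus
vma that claims VM_SHARED, since MAP_ANONYMOUS|MAP_SHARED mappings are
backed by shmem and shmem_zero_setup sets vm_ops:

```c
#include <stdbool.h>
#include <stddef.h>

#define VM_SHARED 0x08UL	/* flag value borrowed from the kernel headers */

struct toy_vma {
	const void *vm_ops;
	unsigned long vm_flags;
};

/* Current helper: anonymous == no vm_ops. */
static bool toy_vma_is_anonymous(const struct toy_vma *vma)
{
	return !vma->vm_ops;
}

/* Stricter variant: a vma with no vm_ops but VM_SHARED set cannot be
 * a valid anonymous mapping, so refuse to call it anonymous. */
static bool toy_vma_is_anonymous_strict(const struct toy_vma *vma)
{
	return !vma->vm_ops && !(vma->vm_flags & VM_SHARED);
}
```

Folding the extra condition into the existing helper, as Kirill
suggests, costs one flags test per call, which is why the thread
concludes the cost is insignificant compared to keeping two helpers.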
end of thread, other threads: [~2017-02-23 14:56 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2017-02-15 21:46 [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings Mike Kravetz
2017-02-16 18:41 ` Andrea Arcangeli
2017-02-17  0:18 ` Mike Kravetz
2017-02-17 15:52 ` Andrea Arcangeli
2017-02-17 20:17 ` Andrew Morton
2017-02-17 20:51 ` Andrea Arcangeli
2017-02-17 21:08 ` Andrew Morton
2017-02-17 21:34 ` Andrea Arcangeli
2017-02-21 13:25 ` Kirill A. Shutemov
2017-02-22 15:15 ` Andrea Arcangeli
2017-02-23 14:56 ` Kirill A. Shutemov