* [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings @ 2017-02-15 21:46 Mike Kravetz 2017-02-16 18:41 ` Andrea Arcangeli 0 siblings, 1 reply; 11+ messages in thread From: Mike Kravetz @ 2017-02-15 21:46 UTC (permalink / raw) To: linux-mm, linux-kernel Cc: Mike Kravetz, Andrea Arcangeli, Andrew Morton, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. 
David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> --- mm/hugetlb.c | 25 +++++++++++++++++++++-- mm/userfaultfd.c | 60 +++++++++++++++++++++++++++++++++++++++++--------------- 2 files changed, 67 insertions(+), 18 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d0d1d08..41f6c51 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, __SetPageUptodate(page); set_page_huge_active(page); + /* + * If shared, add to page cache + */ + if (dst_vma->vm_flags & VM_SHARED) { + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); + + ret = huge_add_to_page_cache(page, mapping, idx); + if (ret) + goto out_release_nounlock; + } + ptl = huge_pte_lockptr(h, dst_mm, dst_pte); spin_lock(ptl); @@ -4036,8 +4048,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!huge_pte_none(huge_ptep_get(dst_pte))) goto out_release_unlock; - ClearPagePrivate(page); - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + if (dst_vma->vm_flags & VM_SHARED) { + page_dup_rmap(page, true); + } else { + ClearPagePrivate(page); + hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + } _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE); if (dst_vma->vm_flags & VM_WRITE) @@ -4054,11 +4070,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, update_mmu_cache(dst_vma, dst_addr, dst_pte); spin_unlock(ptl); + if (dst_vma->vm_flags & VM_SHARED) + unlock_page(page); ret = 0; out: return ret; out_release_unlock: spin_unlock(ptl); +out_release_nounlock: + if (dst_vma->vm_flags & VM_SHARED) + unlock_page(page); put_page(page); goto out; } diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b861cf9..6703308 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -211,11 +211,9 @@ static __always_inline ssize_t 
__mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Make sure the vma is not shared, that the remaining dst - * range is both valid and fully within a single existing vma. + * Make sure the remaining dst range is both valid and + * fully within a single existing vma. */ - if (dst_vma->vm_flags & VM_SHARED) - goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; @@ -226,11 +224,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Ensure the dst_vma has a anon_vma. + * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; - if (unlikely(anon_vma_prepare(dst_vma))) - goto out_unlock; + if (!(dst_vma->vm_flags & VM_SHARED)) { + if (unlikely(anon_vma_prepare(dst_vma))) + goto out_unlock; + } h = hstate_vma(dst_vma); @@ -306,18 +306,45 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, if (page) { /* * We encountered an error and are about to free a newly - * allocated huge page. It is possible that there was a - * reservation associated with the page that has been - * consumed. See the routine restore_reserve_on_error - * for details. Unfortunately, we can not call - * restore_reserve_on_error now as it would require holding - * mmap_sem. Clear the PagePrivate flag so that the global + * allocated huge page. + * + * Reservation handling is very subtle, and is different for + * private and shared mappings. See the routine + * restore_reserve_on_error for details. Unfortunately, we + * can not call restore_reserve_on_error now as it would + * require holding mmap_sem. + * + * If a reservation for the page existed in the reservation + * map of a private mapping, the map was modified to indicate + * the reservation was consumed when the page was allocated. + * We clear the PagePrivate flag now so that the global * reserve count will not be incremented in free_huge_page. 
* The reservation map will still indicate the reservation * was consumed and possibly prevent later page allocation. - * This is better than leaking a global reservation. + * This is better than leaking a global reservation. If no + * reservation existed, it is still safe to clear PagePrivate + * as no adjustments to reservation counts were made during + * allocation. + * + * The reservation map for shared mappings indicates which + * pages have reservations. When a huge page is allocated + * for an address with a reservation, no change is made to + * the reserve map. In this case PagePrivate will be set + * to indicate that the global reservation count should be + * incremented when the page is freed. This is the desired + * behavior. However, when a huge page is allocated for an + * address without a reservation a reservation entry is added + * to the reservation map, and PagePrivate will not be set. + * When the page is freed, the global reserve count will NOT + * be incremented and it will appear as though we have leaked + * reserved page. In this case, set PagePrivate so that the + * global reserve count will be incremented to match the + * reservation map entry which was created. */ - ClearPagePrivate(page); + if (dst_vma->vm_flags & VM_SHARED) + SetPagePrivate(page); + else + ClearPagePrivate(page); put_page(page); } BUG_ON(copied < 0); @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, goto out_unlock; err = -EINVAL; - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && + dst_vma->vm_flags & VM_SHARED) goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-15 21:46 [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings Mike Kravetz @ 2017-02-16 18:41 ` Andrea Arcangeli 2017-02-17 0:18 ` Mike Kravetz 2017-02-21 13:25 ` Kirill A. Shutemov 0 siblings, 2 replies; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-16 18:41 UTC (permalink / raw) To: Mike Kravetz Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote: > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d0d1d08..41f6c51 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > __SetPageUptodate(page); > set_page_huge_active(page); > > + /* > + * If shared, add to page cache > + */ > + if (dst_vma->vm_flags & VM_SHARED) { Minor nitpick, this could be a: int vm_shared = dst_vma->vm_flags & VM_SHARED; (int faster than bool here as VM_SHARED won't have to be converted into 0|1) > @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, > goto out_unlock; > > err = -EINVAL; > - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) > + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && > + dst_vma->vm_flags & VM_SHARED) > goto out_unlock; Other minor nitpick, this could have been faster as: if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED) Thinking twice, the only case we need to rule out is shmem_zero_setup (it's not anon vmas can be really VM_SHARED or they wouldn't be anon vmas in the first place) so even the above is superfluous because shmem_zero_setup does: vma->vm_ops = &shmem_vm_ops; So I would turn it into: /* * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but * it will overwrite vm_ops, so vma_is_anonymous must return false. 
*/ if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)) goto out_unlock; Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> --- Changing topic: thinking about the last part, I was wondering about vma_is_anonymous because it's well known issue, it's only guaranteed fully correct during page faults because the weird cases that can return a false positive cannot generate page faults and they run remap_pfn_range before releasing the mmap_sem for writing within mmap(2) itself. More than a year ago I discussed with Kirill (CC'ed) the vma_is_anonymous() false positives: some VM_IO vma leaves vma->vm_ops NULL, which is generally non concerning because putting an anon page in there shouldn't have any adverse side effects for the rest of the system. It was critical for khugepaged, because khugepaged will run out of the app own control, so khugepaged must be fully accurate of it'll just activate on VM_IO special mappings, but that's definitely not the case for userfaultfd. Still I was wondering if we could be more strict by adding a vma->vm_flags & VM_NO_KHUGEPAGED check so that VM_IO regions will not pass registration (only in fs/userfaultfd.c:vma_can_userfault; mm/userfaultfd.c doesn't need any of that as it requires registration to be passed first and vm_userfaultfd_ctx already armed). A fully accurate vma_is_anonymous could be implemented like this: vma_is_anonymous && !(vm_flags & VM_NO_KHUGEPAGED) Which is what khugepaged internally uses. This will also work for MAP_PRIVATE on /dev/zero (required to work by the non cooperative case). Side note: MAP_PRIVATE /dev/zero is very special and it's the only case in the kernel a "true" anon vma has vm_file set. 
I think it'd be cleaner to "decouple" the vma from /dev/zero the same way MAP_SHARED|MAP_ANON "couples" the vma to a pseudo /dev/zero, so anon vmas would always have vm_file NULL (not done because it'd risk to break stuff by changing the /proc/pid/maps output), but that's besides the point and the above solution deployed already by khugepaged already works with the current /dev/zero MAP_PRIVATE code. Kirill what's your take on making the registration checks stricter? If we would add a vma_is_anonymous_not_in_fault implemented like above vma_can_userfault would just need a s/vma_is_anonymous/vma_is_anonymous_not_in_fault/ and it would be more strict. khugepaged could be then converted to use it too instead of hardcoding this vm_flags check. Unless I'm mistaken I would consider such a change to the registration code, purely a cleanup to add a more strict check. Alternatively we could just leave things as is and depend on apps using VM_IO with remap_pfn_range with vm_file set and vm_ops NULL, not to screw themselves up by filling their regions with anon pages. I don't see how it could break anything (except themselves) if they do, but I'd appreciate a second thought on that. And at least for the remap_pfn_range users it won't even get that far, because a graceful -EEXIST would be returned instead. The change would effectively convert a -EEXIST returned later into a stricter -EINVAL returned early at registration time, for a case that is erratic by design. Thanks, Andrea ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-16 18:41 ` Andrea Arcangeli @ 2017-02-17 0:18 ` Mike Kravetz 2017-02-17 15:52 ` Andrea Arcangeli 2017-02-21 13:25 ` Kirill A. Shutemov 1 sibling, 1 reply; 11+ messages in thread From: Mike Kravetz @ 2017-02-17 0:18 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On 02/16/2017 10:41 AM, Andrea Arcangeli wrote: > On Wed, Feb 15, 2017 at 01:46:50PM -0800, Mike Kravetz wrote: >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index d0d1d08..41f6c51 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -4029,6 +4029,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, >> __SetPageUptodate(page); >> set_page_huge_active(page); >> >> + /* >> + * If shared, add to page cache >> + */ >> + if (dst_vma->vm_flags & VM_SHARED) { > > Minor nitpick, this could be a: > > int vm_shared = dst_vma->vm_flags & VM_SHARED; > > (int faster than bool here as VM_SHARED won't have to be converted into 0|1) > >> @@ -386,7 +413,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, >> goto out_unlock; >> >> err = -EINVAL; >> - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) >> + if (!vma_is_shmem(dst_vma) && !is_vm_hugetlb_page(dst_vma) && >> + dst_vma->vm_flags & VM_SHARED) >> goto out_unlock; > > Other minor nitpick, this could have been faster as: > > if (vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED) > > Thinking twice, the only case we need to rule out is shmem_zero_setup > (it's not anon vmas can be really VM_SHARED or they wouldn't be anon > vmas in the first place) so even the above is superfluous because > shmem_zero_setup does: > > vma->vm_ops = &shmem_vm_ops; > > So I would turn it into: > > > /* > * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but > * it will overwrite vm_ops, so 
vma_is_anonymous must return false. > */ > if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && dst_vma->vm_flags & VM_SHARED)) > goto out_unlock; > > > Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> > --- Thanks Andrea, I incorporated your suggestions into a new version of the patch. While changing (dst_vma->vm_flags & VM_SHARED) to integers, I noticed an issue in the error path of __mcopy_atomic_hugetlb(). > */ > - ClearPagePrivate(page); > + if (dst_vma->vm_flags & VM_SHARED) > + SetPagePrivate(page); > + else > + ClearPagePrivate(page); > put_page(page); We can not use dst_vma here as it may be different than the vma for which the page was originally allocated, or even NULL. Remember, we may drop mmap_sem and look up dst_vma again. Therefore, we need to save the value of (dst_vma->vm_flags & VM_SHARED) for the vma which was used when the page was allocated. This change as well as your suggestions are in the patch below: >From e13d3b4ab24f2c423a8b0d645f0ea715c285880d Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Wed, 15 Feb 2017 13:02:41 -0800 Subject: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings When userfaultfd hugetlbfs support was originally added, it followed the pattern of anon mappings and did not support any vmas marked VM_SHARED. As such, support was only added for private mappings. Remove this limitation and support shared mappings. The primary functional change required is adding pages to the page cache. More subtle changes are required for huge page reservation handling in error paths. A lengthy comment in the code describes the reservation handling. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. 
David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> --- mm/hugetlb.c | 26 ++++++++++++++++++-- mm/userfaultfd.c | 74 ++++++++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 82 insertions(+), 18 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d0d1d08..2e0e815 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3993,6 +3993,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, unsigned long src_addr, struct page **pagep) { + int vm_shared = dst_vma->vm_flags & VM_SHARED; struct hstate *h = hstate_vma(dst_vma); pte_t _dst_pte; spinlock_t *ptl; @@ -4029,6 +4030,18 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, __SetPageUptodate(page); set_page_huge_active(page); + /* + * If shared, add to page cache + */ + if (vm_shared) { + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); + + ret = huge_add_to_page_cache(page, mapping, idx); + if (ret) + goto out_release_nounlock; + } + ptl = huge_pte_lockptr(h, dst_mm, dst_pte); spin_lock(ptl); @@ -4036,8 +4049,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, if (!huge_pte_none(huge_ptep_get(dst_pte))) goto out_release_unlock; - ClearPagePrivate(page); - hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + if (vm_shared) { + page_dup_rmap(page, true); + } else { + ClearPagePrivate(page); + hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); + } _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE); if (dst_vma->vm_flags & VM_WRITE) @@ -4054,11 +4071,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, update_mmu_cache(dst_vma, dst_addr, dst_pte); spin_unlock(ptl); + if (vm_shared) + unlock_page(page); ret = 0; out: return ret; out_release_unlock: spin_unlock(ptl); +out_release_nounlock: + if (vm_shared) + unlock_page(page); put_page(page); goto out; } diff --git 
a/mm/userfaultfd.c b/mm/userfaultfd.c index b861cf9..2fc0e76 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -154,6 +154,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, unsigned long len, bool zeropage) { + int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; + int vm_shared = dst_vma->vm_flags & VM_SHARED; ssize_t err; pte_t *dst_pte; unsigned long src_addr, dst_addr; @@ -211,14 +213,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Make sure the vma is not shared, that the remaining dst - * range is both valid and fully within a single existing vma. + * Make sure the remaining dst range is both valid and + * fully within a single existing vma. */ - if (dst_vma->vm_flags & VM_SHARED) - goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; + + vm_shared = dst_vma->vm_flags & VM_SHARED; } if (WARN_ON(dst_addr & (vma_hpagesize - 1) || @@ -226,11 +228,13 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Ensure the dst_vma has a anon_vma. + * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; - if (unlikely(anon_vma_prepare(dst_vma))) - goto out_unlock; + if (!vm_shared) { + if (unlikely(anon_vma_prepare(dst_vma))) + goto out_unlock; + } h = hstate_vma(dst_vma); @@ -267,6 +271,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, dst_addr, src_addr, &page); mutex_unlock(&hugetlb_fault_mutex_table[hash]); + vm_alloc_shared = vm_shared; cond_resched(); @@ -306,18 +311,49 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, if (page) { /* * We encountered an error and are about to free a newly - * allocated huge page. It is possible that there was a - * reservation associated with the page that has been - * consumed. See the routine restore_reserve_on_error - * for details. 
Unfortunately, we can not call - * restore_reserve_on_error now as it would require holding - * mmap_sem. Clear the PagePrivate flag so that the global + * allocated huge page. + * + * Reservation handling is very subtle, and is different for + * private and shared mappings. See the routine + * restore_reserve_on_error for details. Unfortunately, we + * can not call restore_reserve_on_error now as it would + * require holding mmap_sem. + * + * If a reservation for the page existed in the reservation + * map of a private mapping, the map was modified to indicate + * the reservation was consumed when the page was allocated. + * We clear the PagePrivate flag now so that the global * reserve count will not be incremented in free_huge_page. * The reservation map will still indicate the reservation * was consumed and possibly prevent later page allocation. - * This is better than leaking a global reservation. + * This is better than leaking a global reservation. If no + * reservation existed, it is still safe to clear PagePrivate + * as no adjustments to reservation counts were made during + * allocation. + * + * The reservation map for shared mappings indicates which + * pages have reservations. When a huge page is allocated + * for an address with a reservation, no change is made to + * the reserve map. In this case PagePrivate will be set + * to indicate that the global reservation count should be + * incremented when the page is freed. This is the desired + * behavior. However, when a huge page is allocated for an + * address without a reservation a reservation entry is added + * to the reservation map, and PagePrivate will not be set. + * When the page is freed, the global reserve count will NOT + * be incremented and it will appear as though we have leaked + * reserved page. In this case, set PagePrivate so that the + * global reserve count will be incremented to match the + * reservation map entry which was created. 
+ * + * Note that vm_alloc_shared is based on the flags of the vma + * for which the page was originally allocated. dst_vma could + * be different or NULL on error. */ - ClearPagePrivate(page); + if (vm_alloc_shared) + SetPagePrivate(page); + else + ClearPagePrivate(page); put_page(page); } BUG_ON(copied < 0); @@ -386,8 +422,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, goto out_unlock; err = -EINVAL; - if (!vma_is_shmem(dst_vma) && dst_vma->vm_flags & VM_SHARED) + /* + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but + * it will overwrite vm_ops, so vma_is_anonymous must return false. + */ + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && + dst_vma->vm_flags & VM_SHARED)) goto out_unlock; + if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) goto out_unlock; -- 2.7.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 0:18 ` Mike Kravetz @ 2017-02-17 15:52 ` Andrea Arcangeli 2017-02-17 20:17 ` Andrew Morton 0 siblings, 1 reply; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-17 15:52 UTC (permalink / raw) To: Mike Kravetz Cc: Andrew Morton, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Thu, Feb 16, 2017 at 04:18:51PM -0800, Mike Kravetz wrote: > Thanks Andrea, I incorporated your suggestions into a new version of the patch. > While changing (dst_vma->vm_flags & VM_SHARED) to integers, I noticed an issue > in the error path of __mcopy_atomic_hugetlb(). Indeed good point! > + int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; > + int vm_shared = dst_vma->vm_flags & VM_SHARED; Other minor nitpick, this could have been: int vm_shared = vm_alloc_shared; But I'm sure gcc will emit the same asm. For greppability (if such word exist) calling it vm_shared_alloc would have been preferable. We can clean it up post upstream merge or it should be diffed against mm latest or it may cause more rejects. Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> The patches were not against latest -mm so I solved the rejects during merge in my tree. Then I looked at the result of all my merges after everything is applied and I think I spotted a merge gone wrong in this patch: https://ozlabs.org/~akpm/mmots/broken-out/userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found.patch Below is a hand edited git diff that shows the only meaningful difference. The below should be included in userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found.patch or as -fix2 at the end. Everything else is identical which is great. Mike Rapoport could you verify the below hunk is missing in mm? 
Once it'll all be merged upstream then there will be less merge crunch as we've been working somewhat in parallel on the same files, so this is resulting in more merge rejects than ideal :). diff --git a/../mm/mm/userfaultfd.c b/mm/userfaultfd.c index 830bed7..3ec9aad 100644 --- a/../mm/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -199,6 +201,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, dst_vma = find_vma(dst_mm, dst_start); if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) goto out_unlock; + /* + * Only allow __mcopy_atomic_hugetlb on userfaultfd + * registered ranges. + */ + if (!dst_vma->vm_userfaultfd_ctx.ctx) + goto out_unlock; if (dst_start < dst_vma->vm_start || dst_start + len > dst_vma->vm_end) @@ -214,16 +224,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, goto out_unlock; /* - * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges. - */ - if (!dst_vma->vm_userfaultfd_ctx.ctx) - goto out_unlock; - - /* * If not shared, ensure the dst_vma has a anon_vma. */ err = -ENOMEM; Thanks, Andrea ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 15:52 ` Andrea Arcangeli @ 2017-02-17 20:17 ` Andrew Morton 2017-02-17 20:51 ` Andrea Arcangeli 0 siblings, 1 reply; 11+ messages in thread From: Andrew Morton @ 2017-02-17 20:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Fri, 17 Feb 2017 16:52:41 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote: > Everything else is identical which is great. Mike Rapoport could you > verify the below hunk is missing in mm? > > Once it'll all be merged upstream then there will be less merge crunch > as we've been working somewhat in parallel on the same files, so this > is resulting in more merge rejects than ideal :). > > diff --git a/../mm/mm/userfaultfd.c b/mm/userfaultfd.c > index 830bed7..3ec9aad 100644 > --- a/../mm/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -199,6 +201,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > dst_vma = find_vma(dst_mm, dst_start); > if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) > goto out_unlock; > + /* > + * Only allow __mcopy_atomic_hugetlb on userfaultfd > + * registered ranges. > + */ > + if (!dst_vma->vm_userfaultfd_ctx.ctx) > + goto out_unlock; > > if (dst_start < dst_vma->vm_start || > dst_start + len > dst_vma->vm_end) > @@ -214,16 +224,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > goto out_unlock; > > /* > - * Only allow __mcopy_atomic_hugetlb on userfaultfd registered ranges. > - */ > - if (!dst_vma->vm_userfaultfd_ctx.ctx) > - goto out_unlock; > - > - /* > * If not shared, ensure the dst_vma has a anon_vma. 
> */ I merged this up and a small issue remains: : /* : * Validate alignment based on huge page size : */ : err = -EINVAL; : if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1)) : goto out_unlock; : :retry: : /* : * On routine entry dst_vma is set. If we had to drop mmap_sem and : * retry, dst_vma will be set to NULL and we must lookup again. : */ : if (!dst_vma) { : err = -ENOENT; : dst_vma = find_vma(dst_mm, dst_start); : if (!dst_vma || !is_vm_hugetlb_page(dst_vma)) : goto out_unlock; : /* : * Only allow __mcopy_atomic_hugetlb on userfaultfd : * registered ranges. : */ : if (!dst_vma->vm_userfaultfd_ctx.ctx) : goto out_unlock; : : if (dst_start < dst_vma->vm_start || : dst_start + len > dst_vma->vm_end) : goto out_unlock; : : err = -EINVAL; : if (vma_hpagesize != vma_kernel_pagesize(dst_vma)) : goto out_unlock; : } : : if (WARN_ON(dst_addr & (vma_hpagesize - 1) || : (len - copied) & (vma_hpagesize - 1))) : goto out_unlock; The value of `err' here is EINVAL. That sems appropriate, but it only happens by sheer luck. : /* : * If not shared, ensure the dst_vma has a anon_vma. : */ : err = -ENOMEM; : if (!(dst_vma->vm_flags & VM_SHARED)) { : if (unlikely(anon_vma_prepare(dst_vma))) : goto out_unlock; : } So... --- a/mm/userfaultfd.c~userfaultfd-mcopy_atomic-return-enoent-when-no-compatible-vma-found-fix-2-fix +++ a/mm/userfaultfd.c @@ -215,6 +215,7 @@ retry: goto out_unlock; } + err = -EINVAL; if (WARN_ON(dst_addr & (vma_hpagesize - 1) || (len - copied) & (vma_hpagesize - 1))) goto out_unlock; _ ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-17 20:17 ` Andrew Morton @ 2017-02-17 20:51 ` Andrea Arcangeli 2017-02-17 21:08 ` Andrew Morton 0 siblings, 1 reply; 11+ messages in thread From: Andrea Arcangeli @ 2017-02-17 20:51 UTC (permalink / raw) To: Andrew Morton Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport, Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov, Kirill A. Shutemov On Fri, Feb 17, 2017 at 12:17:38PM -0800, Andrew Morton wrote: > I merged this up and a small issue remains: Great! > The value of `err' here is EINVAL. That sems appropriate, but it only > happens by sheer luck. It might have been programmer luck but just for completeness, at runtime no luck was needed (the temporary setting to ENOENT is undone before the if clause is closed). Your addition is surely safer just in case of future changes missing how we inherited the EINVAL in both branches, thanks! (plus the compiler should be able to optimize it away until after it will be needed) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-17 20:51 ` Andrea Arcangeli
@ 2017-02-17 21:08 ` Andrew Morton
  2017-02-17 21:34 ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2017-02-17 21:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov,
      Kirill A. Shutemov

On Fri, 17 Feb 2017 21:51:24 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Fri, Feb 17, 2017 at 12:17:38PM -0800, Andrew Morton wrote:
> > I merged this up and a small issue remains:
>
> Great!
>
> > The value of `err' here is EINVAL.  That seems appropriate, but it only
> > happens by sheer luck.
>
> It might have been programmer luck, but just for completeness: at
> runtime no luck was needed (the temporary setting to ENOENT is undone
> before the if clause is closed).  Your addition is surely safer in case
> future changes miss how we inherited the EINVAL in both branches,
> thanks!  (Plus the compiler should be able to optimize the assignment
> away until it is actually needed.)

OK.  I had a bunch more rejects to fix in that function.  Below is the
final result - please check it carefully.  I'll release another mmotm
tree in the next few hours for runtime testing.

static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
					      struct vm_area_struct *dst_vma,
					      unsigned long dst_start,
					      unsigned long src_start,
					      unsigned long len,
					      bool zeropage)
{
	int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED;
	int vm_shared = dst_vma->vm_flags & VM_SHARED;
	ssize_t err;
	pte_t *dst_pte;
	unsigned long src_addr, dst_addr;
	long copied;
	struct page *page;
	struct hstate *h;
	unsigned long vma_hpagesize;
	pgoff_t idx;
	u32 hash;
	struct address_space *mapping;

	/*
	 * There is no default zero huge page for all huge page sizes as
	 * supported by hugetlb.  A PMD_SIZE huge pages may exist as used
	 * by THP.  Since we can not reliably insert a zero page, this
	 * feature is not supported.
	 */
	if (zeropage) {
		up_read(&dst_mm->mmap_sem);
		return -EINVAL;
	}

	src_addr = src_start;
	dst_addr = dst_start;
	copied = 0;
	page = NULL;
	vma_hpagesize = vma_kernel_pagesize(dst_vma);

	/*
	 * Validate alignment based on huge page size
	 */
	err = -EINVAL;
	if (dst_start & (vma_hpagesize - 1) || len & (vma_hpagesize - 1))
		goto out_unlock;

retry:
	/*
	 * On routine entry dst_vma is set.  If we had to drop mmap_sem and
	 * retry, dst_vma will be set to NULL and we must lookup again.
	 */
	if (!dst_vma) {
		err = -ENOENT;
		dst_vma = find_vma(dst_mm, dst_start);
		if (!dst_vma || !is_vm_hugetlb_page(dst_vma))
			goto out_unlock;
		/*
		 * Only allow __mcopy_atomic_hugetlb on userfaultfd
		 * registered ranges.
		 */
		if (!dst_vma->vm_userfaultfd_ctx.ctx)
			goto out_unlock;

		if (dst_start < dst_vma->vm_start ||
		    dst_start + len > dst_vma->vm_end)
			goto out_unlock;

		vm_shared = dst_vma->vm_flags & VM_SHARED;

		err = -EINVAL;
		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
			goto out_unlock;
	}

	err = -EINVAL;
	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
		    (len - copied) & (vma_hpagesize - 1)))
		goto out_unlock;

	if (dst_start < dst_vma->vm_start ||
	    dst_start + len > dst_vma->vm_end)
		goto out_unlock;

	/*
	 * If not shared, ensure the dst_vma has a anon_vma.
	 */
	err = -ENOMEM;
	if (!vm_shared) {
		if (unlikely(anon_vma_prepare(dst_vma)))
			goto out_unlock;
	}

	h = hstate_vma(dst_vma);

	while (src_addr < src_start + len) {
		pte_t dst_pteval;

		BUG_ON(dst_addr >= dst_start + len);
		VM_BUG_ON(dst_addr & ~huge_page_mask(h));

		/*
		 * Serialize via hugetlb_fault_mutex
		 */
		idx = linear_page_index(dst_vma, dst_addr);
		mapping = dst_vma->vm_file->f_mapping;
		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
								idx, dst_addr);
		mutex_lock(&hugetlb_fault_mutex_table[hash]);

		err = -ENOMEM;
		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
		if (!dst_pte) {
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			goto out_unlock;
		}

		err = -EEXIST;
		dst_pteval = huge_ptep_get(dst_pte);
		if (!huge_pte_none(dst_pteval)) {
			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
			goto out_unlock;
		}

		err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma,
						dst_addr, src_addr, &page);

		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
		vm_alloc_shared = vm_shared;

		cond_resched();

		if (unlikely(err == -EFAULT)) {
			up_read(&dst_mm->mmap_sem);
			BUG_ON(!page);

			err = copy_huge_page_from_user(page,
						(const void __user *)src_addr,
						pages_per_huge_page(h), true);
			if (unlikely(err)) {
				err = -EFAULT;
				goto out;
			}
			down_read(&dst_mm->mmap_sem);

			dst_vma = NULL;
			goto retry;
		} else
			BUG_ON(page);

		if (!err) {
			dst_addr += vma_hpagesize;
			src_addr += vma_hpagesize;
			copied += vma_hpagesize;

			if (fatal_signal_pending(current))
				err = -EINTR;
		}
		if (err)
			break;
	}

out_unlock:
	up_read(&dst_mm->mmap_sem);
out:
	if (page) {
		/*
		 * We encountered an error and are about to free a newly
		 * allocated huge page.
		 *
		 * Reservation handling is very subtle, and is different for
		 * private and shared mappings.  See the routine
		 * restore_reserve_on_error for details.  Unfortunately, we
		 * can not call restore_reserve_on_error now as it would
		 * require holding mmap_sem.
		 *
		 * If a reservation for the page existed in the reservation
		 * map of a private mapping, the map was modified to indicate
		 * the reservation was consumed when the page was allocated.
		 * We clear the PagePrivate flag now so that the global
		 * reserve count will not be incremented in free_huge_page.
		 * The reservation map will still indicate the reservation
		 * was consumed and possibly prevent later page allocation.
		 * This is better than leaking a global reservation.  If no
		 * reservation existed, it is still safe to clear PagePrivate
		 * as no adjustments to reservation counts were made during
		 * allocation.
		 *
		 * The reservation map for shared mappings indicates which
		 * pages have reservations.  When a huge page is allocated
		 * for an address with a reservation, no change is made to
		 * the reserve map.  In this case PagePrivate will be set
		 * to indicate that the global reservation count should be
		 * incremented when the page is freed.  This is the desired
		 * behavior.  However, when a huge page is allocated for an
		 * address without a reservation a reservation entry is added
		 * to the reservation map, and PagePrivate will not be set.
		 * When the page is freed, the global reserve count will NOT
		 * be incremented and it will appear as though we have leaked
		 * reserved page.  In this case, set PagePrivate so that the
		 * global reserve count will be incremented to match the
		 * reservation map entry which was created.
		 *
		 * Note that vm_alloc_shared is based on the flags of the vma
		 * for which the page was originally allocated.  dst_vma could
		 * be different or NULL on error.
		 */
		if (vm_alloc_shared)
			SetPagePrivate(page);
		else
			ClearPagePrivate(page);

		put_page(page);
	}
	BUG_ON(copied < 0);
	BUG_ON(err > 0);
	BUG_ON(!copied && !err);
	return copied ? copied : err;
}
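The reservation bookkeeping described in that long comment can be
condensed into a toy userspace model (not kernel code; `toy_page`,
`resv_count`, and the helper names are invented, with `priv` standing
in for PagePrivate).  It shows why the error path sets PagePrivate for
shared mappings but clears it for private ones: the goal is that the
global reserve count ends up matching what the reservation map records.

```c
#include <stdbool.h>

/* "priv" mirrors PagePrivate: the free path hands a reservation back
 * to the global pool only when it is set. */
struct toy_page { bool priv; };
static long resv_count;	/* model of the global reservation counter */

static void toy_free_huge_page(struct toy_page *p)
{
	if (p->priv)	/* free_huge_page re-increments the reserve */
		resv_count++;
}

/* Error path of the copy: decide priv from the flags of the vma the
 * page was originally allocated for, as the code does via
 * vm_alloc_shared (dst_vma itself may be different or NULL by now). */
static void toy_error_free(struct toy_page *p, bool vm_alloc_shared)
{
	p->priv = vm_alloc_shared;	/* SetPagePrivate vs ClearPagePrivate */
	toy_free_huge_page(p);
}
```

For the shared case a reservation map entry was created at allocation
time, so the global count must be incremented on free to match; for the
private case the map already records the reservation as consumed, so
incrementing would leak in the other direction.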
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-17 21:08 ` Andrew Morton
@ 2017-02-17 21:34 ` Andrea Arcangeli
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2017-02-17 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mike Kravetz, linux-mm, linux-kernel, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov,
      Kirill A. Shutemov

On Fri, Feb 17, 2017 at 01:08:55PM -0800, Andrew Morton wrote:
> I had a bunch more rejects to fix in that function.  Below is the final
> result - please check it carefully.

Sure, reviewed, and this is the diff that remains (the vm_shared
assignment location is irrelevant, I put it at the end as it's only
needed later and not checked in the out_unlock path; err = -EINVAL is
also fine to stay):

diff --git a/tmp/x b/mm/userfaultfd.c
index a3ba029..3ec9aad 100644
--- a/tmp/x
+++ b/mm/userfaultfd.c
@@ -63,22 +212,17 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    dst_start + len > dst_vma->vm_end)
 			goto out_unlock;
 
-		vm_shared = dst_vma->vm_flags & VM_SHARED;
-
 		err = -EINVAL;
 		if (vma_hpagesize != vma_kernel_pagesize(dst_vma))
 			goto out_unlock;
+
+		vm_shared = dst_vma->vm_flags & VM_SHARED;
 	}
 
-	err = -EINVAL;
 	if (WARN_ON(dst_addr & (vma_hpagesize - 1) ||
 		    (len - copied) & (vma_hpagesize - 1)))
 		goto out_unlock;
 
-	if (dst_start < dst_vma->vm_start ||
-	    dst_start + len > dst_vma->vm_end)
-		goto out_unlock;
-
 	/*
 	 * If not shared, ensure the dst_vma has a anon_vma.
 	 */

In short, only the last 4 lines of the above can be applied.
__mcopy_atomic_hugetlb in the first pass (i.e. dst_vma not NULL) is
invoked after those checks have already been run in the caller:

	if (dst_start < dst_vma->vm_start ||
	    dst_start + len > dst_vma->vm_end)
		goto out_unlock;

	err = -EINVAL;
	/*
	 * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but
	 * it will overwrite vm_ops, so vma_is_anonymous must return false.
	 */
	if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) &&
	    dst_vma->vm_flags & VM_SHARED))
		goto out_unlock;

	/*
	 * If this is a HUGETLB vma, pass off to appropriate routine
	 */
	if (is_vm_hugetlb_page(dst_vma))
		return __mcopy_atomic_hugetlb(dst_mm, dst_vma, dst_start,
					      src_start, len, zeropage);

As usual hugetlbfs takes its own tangent out of the main VM code after
various checks that also apply to hugetlbfs have already been done.  In
the "retry" case dst_vma is set to NULL and is searched again and
revalidated, so we repeat the check.  The first time it's not needed;
the second time it would be a repetition, and so it's a noop.
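The revalidation performed after a retry can be sketched as a single
predicate in a toy userspace model (not kernel code; `model_vma` and
`revalidate` are invented names; `uffd_ctx` stands in for
`vm_userfaultfd_ctx.ctx`).  It mirrors the order of checks in the
`if (!dst_vma)` block, including which error each failure yields:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_vma {
	unsigned long vm_start, vm_end;
	unsigned long pagesize;	/* vma_kernel_pagesize() result */
	bool is_hugetlb;
	void *uffd_ctx;		/* non-NULL if userfaultfd-registered */
};

static long revalidate(const struct model_vma *vma, unsigned long dst_start,
		       unsigned long len, unsigned long vma_hpagesize)
{
	/* err starts as -ENOENT for the whole lookup phase... */
	if (!vma || !vma->is_hugetlb)
		return -ENOENT;
	if (!vma->uffd_ctx)
		return -ENOENT;
	if (dst_start < vma->vm_start || dst_start + len > vma->vm_end)
		return -ENOENT;
	/* ...and only the page-size mismatch downgrades to -EINVAL */
	if (vma_hpagesize != vma->pagesize)
		return -EINVAL;
	return 0;
}
```

This makes concrete why the range check need not be repeated outside
the block: on first entry the caller already validated the range, and
on retry this predicate re-runs it against the freshly looked-up vma.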
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-16 18:41 ` Andrea Arcangeli
  2017-02-17  0:18 ` Mike Kravetz
@ 2017-02-21 13:25 ` Kirill A. Shutemov
  2017-02-22 15:15 ` Andrea Arcangeli
  1 sibling, 1 reply; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-02-21 13:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Thu, Feb 16, 2017 at 07:41:00PM +0100, Andrea Arcangeli wrote:
> Kirill, what's your take on making the registration checks stricter?
> If we added a vma_is_anonymous_not_in_fault implemented like above,
> vma_can_userfault would just need a
> s/vma_is_anonymous/vma_is_anonymous_not_in_fault/ and it would be more
> strict.  khugepaged could then be converted to use it too instead of
> hardcoding this vm_flags check.  Unless I'm mistaken, I would consider
> such a change to the registration code purely a cleanup to add a
> stricter check.

[sorry for the late response]

I think a stricter vma_is_anonymous() is a good thing.

But I don't see a point in introducing one more helper.  Let's just
make the existing helper work better.

-- 
 Kirill A. Shutemov
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-21 13:25 ` Kirill A. Shutemov
@ 2017-02-22 15:15 ` Andrea Arcangeli
  2017-02-23 14:56 ` Kirill A. Shutemov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2017-02-22 15:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Tue, Feb 21, 2017 at 04:25:45PM +0300, Kirill A. Shutemov wrote:
> I think a stricter vma_is_anonymous() is a good thing.
>
> But I don't see a point in introducing one more helper.  Let's just
> make the existing helper work better.

That would be simpler, agreed.  The point of having an "unsafe" faster
version was only for code running in page fault context, where the
additional check is unnecessary.
* Re: [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings
  2017-02-22 15:15 ` Andrea Arcangeli
@ 2017-02-23 14:56 ` Kirill A. Shutemov
  0 siblings, 0 replies; 11+ messages in thread
From: Kirill A. Shutemov @ 2017-02-23 14:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mike Kravetz, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
      Dr. David Alan Gilbert, Hillf Danton, Pavel Emelyanov

On Wed, Feb 22, 2017 at 04:15:07PM +0100, Andrea Arcangeli wrote:
> On Tue, Feb 21, 2017 at 04:25:45PM +0300, Kirill A. Shutemov wrote:
> > I think a stricter vma_is_anonymous() is a good thing.
> >
> > But I don't see a point in introducing one more helper.  Let's just
> > make the existing helper work better.
>
> That would be simpler, agreed.  The point of having an "unsafe" faster
> version was only for code running in page fault context, where the
> additional check is unnecessary.

Well, I don't think the cost of the additional check is significant
here.  And we can bring ->vm_ops a bit closer to ->vm_flags to avoid a
potential cache miss.

-- 
 Kirill A. Shutemov
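The stricter helper being discussed can be sketched in a toy userspace
model (hypothetical; the real `vma_is_anonymous` lives in the kernel's
mm headers, the `toy_` names and the simplified structs are invented
here, and the stricter semantics were only under discussion in this
thread).  The point is that `!vm_ops` alone would also accept a bogus
vma that claims VM_SHARED, since MAP_ANONYMOUS|MAP_SHARED mappings are
backed by shmem and shmem_zero_setup sets vm_ops:

```c
#include <stdbool.h>
#include <stddef.h>

#define VM_SHARED 0x08UL	/* flag value borrowed from the kernel headers */

struct toy_vma {
	const void *vm_ops;
	unsigned long vm_flags;
};

/* Current helper: anonymous == no vm_ops. */
static bool toy_vma_is_anonymous(const struct toy_vma *vma)
{
	return !vma->vm_ops;
}

/* Stricter variant: a vma with no vm_ops but VM_SHARED set cannot be
 * a valid anonymous mapping, so refuse to call it anonymous. */
static bool toy_vma_is_anonymous_strict(const struct toy_vma *vma)
{
	return !vma->vm_ops && !(vma->vm_flags & VM_SHARED);
}
```

Folding the extra condition into the existing helper, as Kirill
suggests, costs one flags test per call, which is why the thread
concludes the cost is insignificant compared to keeping two helpers.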
end of thread, other threads: [~2017-02-23 14:56 UTC | newest]

Thread overview: 11+ messages
-- links below jump to the message on this page --
2017-02-15 21:46 [PATCH] userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings Mike Kravetz
2017-02-16 18:41 ` Andrea Arcangeli
2017-02-17  0:18 ` Mike Kravetz
2017-02-17 15:52 ` Andrea Arcangeli
2017-02-17 20:17 ` Andrew Morton
2017-02-17 20:51 ` Andrea Arcangeli
2017-02-17 21:08 ` Andrew Morton
2017-02-17 21:34 ` Andrea Arcangeli
2017-02-21 13:25 ` Kirill A. Shutemov
2017-02-22 15:15 ` Andrea Arcangeli
2017-02-23 14:56 ` Kirill A. Shutemov