Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>,
	'Andrew Morton' <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	"'Dr. David Alan Gilbert'" <dgilbert@redhat.com>,
	'Shaohua Li' <shli@fb.com>,
	'Pavel Emelyanov' <xemul@parallels.com>,
	'Mike Rapoport' <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
Date: Wed, 16 Nov 2016 10:53:39 -0800	[thread overview]
Message-ID: <8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com> (raw)
In-Reply-To: <20161116182809.GC26185@redhat.com>

On 11/16/2016 10:28 AM, Andrea Arcangeli wrote:
> Hello Mike,
> 
> On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
>> -- 
>> Mike Kravetz
>>
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
>>
>> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
>> PAGE_SIZE pages.  However, this prevents page faults in the subsequent
>> call to copy_from_user().  This is OK in the case where the routine
>> is copied with mmap_sema held.  However, in another case we want to
>> allow page faults.  So, add a new argument allow_pagefault to indicate
>> if the routine should allow page faults.
>>
>> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
>> error paths) was recently submitted and is being added to -mm tree.  It
>> addresses the issue huge page reservations when a huge page is allocated,
>> and free'ed before being instantiated in an address space.  This would
>> typically happen in error paths.  The routine __mcopy_atomic_hugetlb has
>> such an error path, so it will need to call restore_reserve_on_error()
>> before free'ing the huge page.  restore_reserve_on_error is currently
>> only visible in mm/hugetlb.c.  So, add it to a header file so that it
>> can be used in mm/userfaultfd.c.  Another option would be to move
>> __mcopy_atomic_hugetlb into mm/hugetlb.c
> 
> It would have been better to split this in two patches.
> 
>> @@ -302,8 +302,10 @@ static __always_inline ssize_t
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  out_unlock:
>>  	up_read(&dst_mm->mmap_sem);
>>  out:
>> -	if (page)
>> +	if (page) {
>> +		restore_reserve_on_error(h, dst_vma, dst_addr, page);
>>  		put_page(page);
>> +	}
>>  	BUG_ON(copied < 0);
> 
> If the revalidation fails dst_vma could even be NULL.
> 
> We get there with page not NULL only if something in the revalidation
> fails effectively... I'll have to drop the above change as the fix
> will hurt more than the vma reservation not being restored. Didn't
> think too much about it, but there was no obvious way to restore the
> reservation of a vma, after we drop the mmap_sem. However if we don't
> drop the mmap_sem, we'd recurse into it, and it'll deadlock in current
> implementation if a down_write is already pending somewhere else. In
> this specific case fairness is not an issue, but it's not checking
> it's the same thread taking it again, so it's doesn't allow to recurse
> (checking it's the same thread would make it slower).

Thanks for reviewing this Andrea.  You are correct, we can not call
restore_reserve_on_error without mmap_sem held.

I was running some tests with error injection to exercise the error
path and noticed the reservation leaks as the system eventually ran
out of huge pages.  I need to think about it some more, but we may
want to at least do something like the following before put_page (with
a BIG comment):

	if (unlikely(PagePrivate(page)))
		ClearPagePrivate(page);

That would at least keep the global reservation count from increasing.
Let me look into that.

> I also fixed the gup support for userfaultfd, could you review it?

I will take a look, and 'may' have a test that can be modified for this.

-- 
Mike Kravetz

> Beware, untested... will test it shortly with qemu postcopy live
> migration with hugetlbfs instead of THP (that currently gracefully
> complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
> badaddr and DEBUG_VM=y clearly showed the stack trace of where
> FAULT_FLAG_ALLOW_RETRY was missing).
> 
> I think this enhancement is needed by Oracle too, so that you don't
> get an error from I/O syscalls, and you instead get an userfault.
> 
> We need to update the selftest to trigger userfaults not only with the
> CPU but with O_DIRECT too.
> 
> Note, the FOLL_NOWAIT is needed to offload the userfaults to async
> page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
> = NULL), if that fails it offload a blocking (*nonblocking = 1) fault
> through async page fault kernel thread while guest scheduler schedule
> away the blocked process. So the userfaults behave like SSD swapins
> from disk hitting on a single guest thread and not the whole host vcpu
> thread. Clearly hugetlbfs cannot ever block for I/O, FOLL_NOWAIT is
> only useful to avoid blocking in the vcpu thread in
> handle_userfault().
> 
> From ff1ce62ee0acb14ed71621ba99f01f008a5d212d Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2016 18:34:20 +0100
> Subject: [PATCH 1/1] userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
> 
> Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
> get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
> will work on hugetlbfs. This is required for fully functional
> userfaultfd non-present support on hugetlbfs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/linux/hugetlb.h |  5 +++--
>  mm/gup.c                |  2 +-
>  mm/hugetlb.c            | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 44 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index bf02b7e..542416d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -65,7 +65,8 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			 struct page **, struct vm_area_struct **,
> -			 unsigned long *, unsigned long *, long, unsigned int);
> +			 unsigned long *, unsigned long *, long, unsigned int,
> +			 int *);
>  void unmap_hugepage_range(struct vm_area_struct *,
>  			  unsigned long, unsigned long, struct page *);
>  void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> @@ -138,7 +139,7 @@ static inline unsigned long hugetlb_total_pages(void)
>  	return 0;
>  }
>  
> -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w)	({ BUG(); 0; })
> +#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
>  #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
>  #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
>  static inline void hugetlb_report_meminfo(struct seq_file *m)
> diff --git a/mm/gup.c b/mm/gup.c
> index ec4f827..36e88a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -572,7 +572,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (is_vm_hugetlb_page(vma)) {
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags);
> +						gup_flags, nonblocking);
>  				continue;
>  			}
>  		}
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ce8ecb..022750d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4039,7 +4039,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 struct page **pages, struct vm_area_struct **vmas,
>  			 unsigned long *position, unsigned long *nr_pages,
> -			 long i, unsigned int flags)
> +			 long i, unsigned int flags, int *nonblocking)
>  {
>  	unsigned long pfn_offset;
>  	unsigned long vaddr = *position;
> @@ -4102,16 +4102,43 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		    ((flags & FOLL_WRITE) &&
>  		      !huge_pte_write(huge_ptep_get(pte)))) {
>  			int ret;
> +			unsigned int fault_flags = 0;
>  
>  			if (pte)
>  				spin_unlock(ptl);
> -			ret = hugetlb_fault(mm, vma, vaddr,
> -				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> -			if (!(ret & VM_FAULT_ERROR))
> -				continue;
> -
> -			remainder = 0;
> -			break;
> +			if (flags & FOLL_WRITE)
> +				fault_flags |= FAULT_FLAG_WRITE;
> +			if (nonblocking)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> +			if (flags & FOLL_NOWAIT)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_RETRY_NOWAIT;
> +			if (flags & FOLL_TRIED) {
> +				VM_WARN_ON_ONCE(fault_flags &
> +						FAULT_FLAG_ALLOW_RETRY);
> +				fault_flags |= FAULT_FLAG_TRIED;
> +			}
> +			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> +			if (ret & VM_FAULT_ERROR) {
> +				remainder = 0;
> +				break;
> +			}
> +			if (ret & VM_FAULT_RETRY) {
> +				if (nonblocking)
> +					*nonblocking = 0;
> +				*nr_pages = 0;
> +				/*
> +				 * VM_FAULT_RETRY must not return an
> +				 * error, it will return zero
> +				 * instead.
> +				 *
> +				 * No need to update "position" as the
> +				 * caller will not check it after
> +				 * *nr_pages is set to 0.
> +				 */
> +				return i;
> +			}
> +			continue;
>  		}
>  
>  		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> @@ -4140,6 +4167,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  	}
>  	*nr_pages = remainder;
> +	/*
> +	 * setting position is actually required only if remainder is
> +	 * not zero but it's faster not to add a "if (remainder)"
> +	 * branch.
> +	 */
>  	*position = vaddr;
>  
>  	return i ? i : -EFAULT;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-11-16 18:53 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
2016-11-03  7:41   ` Hillf Danton
2016-11-03 17:52     ` Mike Rapoport
2016-11-04 15:40     ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
2016-11-03  8:01   ` Hillf Danton
2016-11-03 17:24     ` Mike Rapoport
2016-11-04 16:40       ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
2016-11-04 15:42     ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
2016-11-03 10:15   ` Hillf Danton
2016-11-03 17:33     ` Mike Kravetz
2016-11-03 19:14       ` Mike Kravetz
2016-11-04  6:43         ` Hillf Danton
2016-11-04 19:36         ` Andrea Arcangeli
2016-11-04 20:34           ` Mike Kravetz
2016-11-08 21:06           ` Mike Kravetz
2016-11-16 18:28             ` Andrea Arcangeli
2016-11-16 18:53               ` Mike Kravetz [this message]
2016-11-17 15:40                 ` Andrea Arcangeli
2016-11-17 19:26                   ` Mike Kravetz
2016-11-18  0:05                     ` Andrea Arcangeli
2016-11-18  5:52                       ` Mike Kravetz
2016-11-22  1:16                         ` Mike Kravetz
2016-11-23  6:38                           ` Hillf Danton
2016-12-15 19:02                             ` Andrea Arcangeli
2016-12-16  3:54                               ` Hillf Danton
2016-11-17 19:41               ` Mike Kravetz
2016-11-04 16:35       ` Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
2016-11-04  7:02   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
2016-11-04  7:39   ` Hillf Danton
2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
2016-11-04  8:59   ` Hillf Danton
2016-11-04 14:53     ` Mike Rapoport
2016-11-04 15:44     ` Mike Rapoport
2016-11-04 16:56       ` Andrea Arcangeli
2016-11-18  0:37       ` Andrea Arcangeli
2016-11-20 12:10         ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dgilbert@redhat.com \
    --cc=hillf.zj@alibaba-inc.com \
    --cc=linux-mm@kvack.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=shli@fb.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).