From: Mike Kravetz <mike.kravetz@oracle.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>,
'Andrew Morton' <akpm@linux-foundation.org>,
linux-mm@kvack.org,
"'Dr. David Alan Gilbert'" <dgilbert@redhat.com>,
'Shaohua Li' <shli@fb.com>,
'Pavel Emelyanov' <xemul@parallels.com>,
'Mike Rapoport' <rppt@linux.vnet.ibm.com>
Subject: Re: [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
Date: Wed, 16 Nov 2016 10:53:39 -0800 [thread overview]
Message-ID: <8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com> (raw)
In-Reply-To: <20161116182809.GC26185@redhat.com>
On 11/16/2016 10:28 AM, Andrea Arcangeli wrote:
> Hello Mike,
>
> On Tue, Nov 08, 2016 at 01:06:06PM -0800, Mike Kravetz wrote:
>> --
>> Mike Kravetz
>>
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
>>
>> The new routine copy_huge_page_from_user() uses kmap_atomic() to map
>> PAGE_SIZE pages. However, this prevents page faults in the subsequent
>> call to copy_from_user(). This is OK in the case where the routine
>> is copied with mmap_sema held. However, in another case we want to
>> allow page faults. So, add a new argument allow_pagefault to indicate
>> if the routine should allow page faults.
>>
>> A patch (mm/hugetlb: fix huge page reservation leak in private mapping
>> error paths) was recently submitted and is being added to -mm tree. It
>> addresses the issue huge page reservations when a huge page is allocated,
>> and free'ed before being instantiated in an address space. This would
>> typically happen in error paths. The routine __mcopy_atomic_hugetlb has
>> such an error path, so it will need to call restore_reserve_on_error()
>> before free'ing the huge page. restore_reserve_on_error is currently
>> only visible in mm/hugetlb.c. So, add it to a header file so that it
>> can be used in mm/userfaultfd.c. Another option would be to move
>> __mcopy_atomic_hugetlb into mm/hugetlb.c
>
> It would have been better to split this in two patches.
>
>> @@ -302,8 +302,10 @@ static __always_inline ssize_t
>> __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>> out_unlock:
>> up_read(&dst_mm->mmap_sem);
>> out:
>> - if (page)
>> + if (page) {
>> + restore_reserve_on_error(h, dst_vma, dst_addr, page);
>> put_page(page);
>> + }
>> BUG_ON(copied < 0);
>
> If the revalidation fails dst_vma could even be NULL.
>
> We get there with page not NULL only if something in the revalidation
> fails effectively... I'll have to drop the above change as the fix
> will hurt more than the vma reservation not being restored. Didn't
> think too much about it, but there was no obvious way to restore the
> reservation of a vma, after we drop the mmap_sem. However if we don't
> drop the mmap_sem, we'd recurse into it, and it'll deadlock in current
> implementation if a down_write is already pending somewhere else. In
> this specific case fairness is not an issue, but it's not checking
> it's the same thread taking it again, so it's doesn't allow to recurse
> (checking it's the same thread would make it slower).
Thanks for reviewing this Andrea. You are correct, we can not call
restore_reserve_on_error without mmap_sem held.
I was running some tests with error injection to exercise the error
path and noticed the reservation leaks as the system eventually ran
out of huge pages. I need to think about it some more, but we may
want to at least do something like the following before put_page (with
a BIG comment):
if (unlikely(PagePrivate(page)))
ClearPagePrivate(page);
That would at least keep the global reservation count from increasing.
Let me look into that.
> I also fixed the gup support for userfaultfd, could you review it?
I will take a look, and 'may' have a test that can be modified for this.
--
Mike Kravetz
> Beware, untested... will test it shortly with qemu postcopy live
> migration with hugetlbfs instead of THP (that currently gracefully
> complains about FAULT_FLAG_ALLOW_RETRY missing, KVM ioctl returns
> badaddr and DEBUG_VM=y clearly showed the stack trace of where
> FAULT_FLAG_ALLOW_RETRY was missing).
>
> I think this enhancement is needed by Oracle too, so that you don't
> get an error from I/O syscalls, and you instead get an userfault.
>
> We need to update the selftest to trigger userfaults not only with the
> CPU but with O_DIRECT too.
>
> Note, the FOLL_NOWAIT is needed to offload the userfaults to async
> page faults. KVM tries an async fault first (FOLL_NOWAIT, nonblocking
> = NULL), if that fails it offload a blocking (*nonblocking = 1) fault
> through async page fault kernel thread while guest scheduler schedule
> away the blocked process. So the userfaults behave like SSD swapins
> from disk hitting on a single guest thread and not the whole host vcpu
> thread. Clearly hugetlbfs cannot ever block for I/O, FOLL_NOWAIT is
> only useful to avoid blocking in the vcpu thread in
> handle_userfault().
>
> From ff1ce62ee0acb14ed71621ba99f01f008a5d212d Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2016 18:34:20 +0100
> Subject: [PATCH 1/1] userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
>
> Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
> get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features
> will work on hugetlbfs. This is required for fully functional
> userfaultfd non-present support on hugetlbfs.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/linux/hugetlb.h | 5 +++--
> mm/gup.c | 2 +-
> mm/hugetlb.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
> 3 files changed, 44 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index bf02b7e..542416d 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -65,7 +65,8 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
> int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> struct page **, struct vm_area_struct **,
> - unsigned long *, unsigned long *, long, unsigned int);
> + unsigned long *, unsigned long *, long, unsigned int,
> + int *);
> void unmap_hugepage_range(struct vm_area_struct *,
> unsigned long, unsigned long, struct page *);
> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> @@ -138,7 +139,7 @@ static inline unsigned long hugetlb_total_pages(void)
> return 0;
> }
>
> -#define follow_hugetlb_page(m,v,p,vs,a,b,i,w) ({ BUG(); 0; })
> +#define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n) ({ BUG(); 0; })
> #define follow_huge_addr(mm, addr, write) ERR_PTR(-EINVAL)
> #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
> static inline void hugetlb_report_meminfo(struct seq_file *m)
> diff --git a/mm/gup.c b/mm/gup.c
> index ec4f827..36e88a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -572,7 +572,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> if (is_vm_hugetlb_page(vma)) {
> i = follow_hugetlb_page(mm, vma, pages, vmas,
> &start, &nr_pages, i,
> - gup_flags);
> + gup_flags, nonblocking);
> continue;
> }
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9ce8ecb..022750d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4039,7 +4039,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page **pages, struct vm_area_struct **vmas,
> unsigned long *position, unsigned long *nr_pages,
> - long i, unsigned int flags)
> + long i, unsigned int flags, int *nonblocking)
> {
> unsigned long pfn_offset;
> unsigned long vaddr = *position;
> @@ -4102,16 +4102,43 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> ((flags & FOLL_WRITE) &&
> !huge_pte_write(huge_ptep_get(pte)))) {
> int ret;
> + unsigned int fault_flags = 0;
>
> if (pte)
> spin_unlock(ptl);
> - ret = hugetlb_fault(mm, vma, vaddr,
> - (flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> - if (!(ret & VM_FAULT_ERROR))
> - continue;
> -
> - remainder = 0;
> - break;
> + if (flags & FOLL_WRITE)
> + fault_flags |= FAULT_FLAG_WRITE;
> + if (nonblocking)
> + fault_flags |= FAULT_FLAG_ALLOW_RETRY;
> + if (flags & FOLL_NOWAIT)
> + fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> + FAULT_FLAG_RETRY_NOWAIT;
> + if (flags & FOLL_TRIED) {
> + VM_WARN_ON_ONCE(fault_flags &
> + FAULT_FLAG_ALLOW_RETRY);
> + fault_flags |= FAULT_FLAG_TRIED;
> + }
> + ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
> + if (ret & VM_FAULT_ERROR) {
> + remainder = 0;
> + break;
> + }
> + if (ret & VM_FAULT_RETRY) {
> + if (nonblocking)
> + *nonblocking = 0;
> + *nr_pages = 0;
> + /*
> + * VM_FAULT_RETRY must not return an
> + * error, it will return zero
> + * instead.
> + *
> + * No need to update "position" as the
> + * caller will not check it after
> + * *nr_pages is set to 0.
> + */
> + return i;
> + }
> + continue;
> }
>
> pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
> @@ -4140,6 +4167,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> spin_unlock(ptl);
> }
> *nr_pages = remainder;
> + /*
> + * setting position is actually required only if remainder is
> + * not zero but it's faster not to add a "if (remainder)"
> + * branch.
> + */
> *position = vaddr;
>
> return i ? i : -EFAULT;
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2016-11-16 18:53 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-02 19:33 [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 01/33] userfaultfd: document _IOR/_IOW Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 02/33] userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 03/33] userfaultfd: convert BUG() to WARN_ON_ONCE() Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 04/33] userfaultfd: use vma_is_anonymous Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 05/33] userfaultfd: non-cooperative: Split the find_userfault() routine Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 06/33] userfaultfd: non-cooperative: Add ability to report non-PF events from uffd descriptor Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 07/33] userfaultfd: non-cooperative: report all available features to userland Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 08/33] userfaultfd: non-cooperative: Add fork() event Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 09/33] userfaultfd: non-cooperative: Add fork() event, build warning fix Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 10/33] userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 11/33] userfaultfd: non-cooperative: Add mremap() event Andrea Arcangeli
2016-11-03 7:41 ` Hillf Danton
2016-11-03 17:52 ` Mike Rapoport
2016-11-04 15:40 ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Andrea Arcangeli
2016-11-03 8:01 ` Hillf Danton
2016-11-03 17:24 ` Mike Rapoport
2016-11-04 16:40 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED requestg Andrea Arcangeli
2016-11-04 15:42 ` [PATCH 12/33] userfaultfd: non-cooperative: Add madvise() event for MADV_DONTNEED request Mike Rapoport
2016-11-02 19:33 ` [PATCH 13/33] userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 14/33] userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for " Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 15/33] userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY Andrea Arcangeli
2016-11-03 10:15 ` Hillf Danton
2016-11-03 17:33 ` Mike Kravetz
2016-11-03 19:14 ` Mike Kravetz
2016-11-04 6:43 ` Hillf Danton
2016-11-04 19:36 ` Andrea Arcangeli
2016-11-04 20:34 ` Mike Kravetz
2016-11-08 21:06 ` Mike Kravetz
2016-11-16 18:28 ` Andrea Arcangeli
2016-11-16 18:53 ` Mike Kravetz [this message]
2016-11-17 15:40 ` Andrea Arcangeli
2016-11-17 19:26 ` Mike Kravetz
2016-11-18 0:05 ` Andrea Arcangeli
2016-11-18 5:52 ` Mike Kravetz
2016-11-22 1:16 ` Mike Kravetz
2016-11-23 6:38 ` Hillf Danton
2016-12-15 19:02 ` Andrea Arcangeli
2016-12-16 3:54 ` Hillf Danton
2016-11-17 19:41 ` Mike Kravetz
2016-11-04 16:35 ` Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 16/33] userfaultfd: hugetlbfs: add userfaultfd hugetlb hook Andrea Arcangeli
2016-11-04 7:02 ` Hillf Danton
2016-11-02 19:33 ` [PATCH 17/33] userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 18/33] userfaultfd: hugetlbfs: add userfaultfd_hugetlb test Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 19/33] userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 20/33] userfaultfd: introduce vma_can_userfault Andrea Arcangeli
2016-11-04 7:39 ` Hillf Danton
2016-11-02 19:33 ` [PATCH 21/33] userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 22/33] userfaultfd: shmem: introduce vma_is_shmem Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 23/33] userfaultfd: shmem: add tlbflush.h header for microblaze Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 24/33] userfaultfd: shmem: use shmem_mcopy_atomic_pte for shared memory Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 25/33] userfaultfd: shmem: add userfaultfd hook for shared memory faults Andrea Arcangeli
2016-11-04 8:59 ` Hillf Danton
2016-11-04 14:53 ` Mike Rapoport
2016-11-04 15:44 ` Mike Rapoport
2016-11-04 16:56 ` Andrea Arcangeli
2016-11-18 0:37 ` Andrea Arcangeli
2016-11-20 12:10 ` Mike Rapoport
2016-11-02 19:33 ` [PATCH 26/33] userfaultfd: shmem: allow registration of shared memory ranges Andrea Arcangeli
2016-11-02 19:33 ` [PATCH 27/33] userfaultfd: shmem: add userfaultfd_shmem test Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 28/33] userfaultfd: shmem: lock the page before adding it to pagecache Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 29/33] userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 30/33] userfaultfd: non-cooperative: selftest: introduce userfaultfd_open Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 31/33] userfaultfd: non-cooperative: selftest: add ufd parameter to copy_page Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 32/33] userfaultfd: non-cooperative: selftest: add test for FORK, MADVDONTNEED and REMAP events Andrea Arcangeli
2016-11-02 19:34 ` [PATCH 33/33] mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock Andrea Arcangeli
2016-11-02 20:07 ` [PATCH 00/33] userfaultfd tmpfs/hugetlbfs/non-cooperative Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8ee2c6db-7ee4-285f-4c68-75fd6e799c0d@oracle.com \
--to=mike.kravetz@oracle.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=dgilbert@redhat.com \
--cc=hillf.zj@alibaba-inc.com \
--cc=linux-mm@kvack.org \
--cc=rppt@linux.vnet.ibm.com \
--cc=shli@fb.com \
--cc=xemul@parallels.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).