From: Andrea Arcangeli <aarcange@redhat.com>
To: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
	Jan Kara <jack@suse.cz>, Neil Brown <neilb@suse.de>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	Andrew Jones <drjones@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Taras Glek <tglek@mozilla.com>,
	Juan Quintela <quintela@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Isaku Yamahata <yamahata@valinux.co.jp>,
	Mel Gorman <mgorman@suse.de>,
	Android Kernel Team <kernel-team@android.com>,
	Mel Gorman <mel@csn.ul.ie>,
	"\\\"Dr. David Alan Gilbert\\\"" <dgilbert@redhat.com>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Mike Hommey <mh@glandium.org>, Keith Packard <keithp@keithp.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	Minchan Kim <minchan@kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: [Qemu-devel] [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in handle_userfault()
Date: Wed,  2 Jul 2014 18:50:16 +0200
Message-ID: <1404319816-30229-11-git-send-email-aarcange@redhat.com>
In-Reply-To: <1404319816-30229-1-git-send-email-aarcange@redhat.com>

This optimizes the userfault handler to repeat the fault without
returning to userland if it's a page fault, and it teaches it to
handle FOLL_NOWAIT if it's a nonblocking gup invocation from KVM. The
FOLL_NOWAIT part is more than an optimization: if FOLL_NOWAIT is set,
the gup caller assumes the mmap_sem cannot be released (and it may
assume that the structures protected by it, potentially read earlier,
cannot have become stale).
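
For illustration only, here is a sketch of the caller-side contract a
FOLL_NOWAIT gup user such as KVM relies on. This is not part of the
patch: the wrapper name lookup_page_nowait is made up, while
__get_user_pages() and the FOLL_* flags are the existing kernel API.

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Hypothetical helper: look up a single page without ever sleeping
 * on a fault.  FOLL_NOWAIT makes the fault path fail with
 * VM_FAULT_RETRY instead of blocking, and it guarantees the mmap_sem
 * is not dropped, so structures read under the semaphore around this
 * call cannot have become stale.
 */
static struct page *lookup_page_nowait(struct mm_struct *mm,
				       unsigned long addr, int write)
{
	struct page *page;
	long ret;

	down_read(&mm->mmap_sem);
	ret = __get_user_pages(current, mm, addr, 1,
			       FOLL_GET | FOLL_NOWAIT |
			       (write ? FOLL_WRITE : 0),
			       &page, NULL, NULL);
	up_read(&mm->mmap_sem);

	/* ret == 1 means the page was resident and is now pinned */
	return ret == 1 ? page : NULL;
}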

The locking rules needed to comply with the FAULT_FLAG_KILLABLE,
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT flags look quite
convoluted (and are not well documented, aside from a "Caution"
comment in __lock_page_or_retry), so this is not a trivial change; in
turn it is kept as an incremental step at the end of the patchset.
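
As a reading aid for review, here is my summary of the retry protocol
as a standalone helper. Nothing below is part of the patch and the
function name userfault_retry_case is invented; it only restates the
rules documented by the comment block added to fs/userfaultfd.c.

#include <linux/mm.h>

/*
 * Reading aid, not kernel code: when may handle_userfault() return
 * VM_FAULT_RETRY, and must the mmap_sem be released first?
 * Returns 0 if retry is not allowed, 1 if retry is allowed only
 * after releasing the mmap_sem, 2 if retry is allowed with the
 * mmap_sem still held.
 */
static inline int userfault_retry_case(unsigned int flags,
				       int fatal_signal)
{
	/*
	 * A fatal signal plus FAULT_FLAG_KILLABLE permits a retry
	 * even without FAULT_FLAG_ALLOW_RETRY, but only after the
	 * mmap_sem has been released.
	 */
	if (fatal_signal && (flags & FAULT_FLAG_KILLABLE))
		return 1;
	if (!(flags & FAULT_FLAG_ALLOW_RETRY))
		return 0;
	/*
	 * FOLL_NOWAIT gup callers set FAULT_FLAG_RETRY_NOWAIT and
	 * rely on the mmap_sem staying held across the retry.
	 */
	if (flags & FAULT_FLAG_RETRY_NOWAIT)
		return 2;
	return 1;
}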

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c            | 68 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/userfaultfd.h |  6 ++--
 mm/huge_memory.c            |  8 +++---
 mm/memory.c                 |  4 +--
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index deed8cb..b8b0fb7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -155,12 +155,29 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	kref_put(&ctx->kref, userfaultfd_free);
 }
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * Even if FAULT_FLAG_ALLOW_RETRY is not set, as long as
+ * FAULT_FLAG_KILLABLE is set, VM_FAULT_RETRY can still be returned
+ * if and only if a fatal signal is pending, and the mmap_sem must be
+ * released before returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
+	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -188,10 +205,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current) || ctx->released) {
+			/*
+			 * If we have to fail because the task was
+			 * killed or the file was released, simulate
+			 * VM_FAULT_SIGBUS, or return to userland
+			 * through VM_FAULT_RETRY if we come from a
+			 * page fault.
+			 */
+			ret = VM_FAULT_SIGBUS;
+			if (fatal_signal_pending(current) &&
+			    (flags & FAULT_FLAG_KILLABLE)) {
+				/*
+				 * If FAULT_FLAG_KILLABLE is set and
+				 * there's a fatal signal pending, we
+				 * can return VM_FAULT_RETRY
+				 * regardless of whether
+				 * FAULT_FLAG_ALLOW_RETRY is set, as
+				 * long as we release the mmap_sem.
+				 * The page fault will then return
+				 * straight to userland to handle the
+				 * fatal signal.
+				 */
+				up_read(&mm->mmap_sem);
+				ret = VM_FAULT_RETRY;
+			}
+			break;
+		}
+		if (!uwq.pending) {
+			ret = 0;
+			if (flags & FAULT_FLAG_ALLOW_RETRY) {
+				ret = VM_FAULT_RETRY;
+				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+					up_read(&mm->mmap_sem);
+			}
 			break;
-		if (!uwq.pending)
+		}
+		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+		     flags) ==
+		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+			ret = VM_FAULT_RETRY;
+			/*
+			 * The mmap_sem must not be released if
+			 * FAULT_FLAG_RETRY_NOWAIT is set, even though
+			 * we return VM_FAULT_RETRY (FOLL_NOWAIT case).
+			 */
 			break;
+		}
 		spin_unlock(&ctx->fault_wqh.lock);
 		up_read(&mm->mmap_sem);
 
@@ -211,7 +271,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	 */
 	userfaultfd_ctx_put(ctx);
 
-	return 0;
+	return ret;
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
index 8200a71..b7caef5 100644
--- a/include/linux/userfaultfd.h
+++ b/include/linux/userfaultfd.h
@@ -26,11 +26,13 @@
 
 #ifdef CONFIG_USERFAULTFD
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address);
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags);
 
 #else /* CONFIG_USERFAULTFD */
 
-static int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address, unsigned int flags)
 {
 	return VM_FAULT_SIGBUS;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6efd80..e1a74a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -714,7 +714,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page)
+					struct page *page, unsigned int flags)
 {
 	pgtable_t pgtable;
 	spinlock_t *ptl;
@@ -753,7 +753,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			pte_free(mm, pgtable);
-			ret = handle_userfault(vma, haddr);
+			ret = handle_userfault(vma, haddr, flags);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
@@ -835,7 +835,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_none(*pmd)) {
 			if (vma->vm_flags & VM_USERFAULT) {
 				spin_unlock(ptl);
-				ret = handle_userfault(vma, haddr);
+				ret = handle_userfault(vma, haddr, flags);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
 				set_huge_zero_page(pgtable, mm, vma,
@@ -863,7 +863,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index a6a04ed..44506e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,7 +2645,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return handle_userfault(vma, address);
+			return handle_userfault(vma, address, flags);
 		}
 		goto setpte;
 	}
@@ -2679,7 +2679,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_uncharge_page(page);
 		page_cache_release(page);
-		return handle_userfault(vma, address);
+		return handle_userfault(vma, address, flags);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
