* [PATCH 0/5] userfaultfd21 updates v2
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
Hello everyone,
This is an update for userfaultfd to synchronize -mm with the code
in the userfaultfd21 git branch.
It includes: two fixes for minor problems found with the selftest
(qemu wouldn't trigger them), one debuggability improvement for gdb,
the selftest itself, and one check to verify the API is followed in
some cases.
The wakeone patch is present in the userfault21 branch but is
deferred here because it's just a minor optimization. The "require
UFFDIO_API before other ioctls" patch has been updated according to
the upstream review of the previous submission of this update.
Andrea Arcangeli (5):
userfaultfd: require UFFDIO_API before other ioctls
userfaultfd: allow signals to interrupt a userfault
userfaultfd: propagate the full address in THP faults
userfaultfd: avoid missing wakeups during refile in userfaultfd_read
userfaultfd: selftest
fs/userfaultfd.c | 65 +++-
mm/huge_memory.c | 10 +-
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/run_vmtests | 11 +
tools/testing/selftests/vm/userfaultfd.c | 636 +++++++++++++++++++++++++++++++
5 files changed, 715 insertions(+), 10 deletions(-)
create mode 100644 tools/testing/selftests/vm/userfaultfd.c
* [PATCH 1/5] userfaultfd: require UFFDIO_API before other ioctls
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
UFFDIO_API was already required before read/poll could work. This
makes the code stricter and requires UFFDIO_API before all other
ioctls as well.

All users should already have been calling UFFDIO_API before invoking
the other ioctls, but this makes the requirement explicit. It also
ensures we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctls; the
current API should already cover even the non-cooperative usage fine.
This is just for the longer-term future, in case.
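For illustration, a minimal userland sketch of the enforced handshake
(hedged: it assumes the in-tree uapi header like the selftest later in
this series does, and hardcodes the x86_64 syscall number):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/userfaultfd.h>

	int main(void)
	{
		struct uffdio_api api = { .api = UFFD_API };
		int uffd = syscall(323 /* __NR_userfaultfd on x86_64 */,
				   O_CLOEXEC | O_NONBLOCK);

		if (uffd < 0)
			return 1;
		/*
		 * UFFDIO_API must come first: with this patch any other
		 * ioctl returns -EINVAL while the context is still in
		 * UFFD_STATE_WAIT_API.
		 */
		if (ioctl(uffd, UFFDIO_API, &api) || api.api != UFFD_API)
			return 1;
		/* only from here on may UFFDIO_REGISTER etc. be used */
		return 0;
	}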
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
fs/userfaultfd.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 89067cf..901d52a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -577,7 +577,6 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf,
if (ctx->state == UFFD_STATE_WAIT_API)
return -EINVAL;
- BUG_ON(ctx->state != UFFD_STATE_RUNNING);
for (;;) {
if (count < sizeof(msg))
@@ -1115,6 +1114,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
int ret = -EINVAL;
struct userfaultfd_ctx *ctx = file->private_data;
+ if (cmd != UFFDIO_API && ctx->state == UFFD_STATE_WAIT_API)
+ return -EINVAL;
+
switch(cmd) {
case UFFDIO_API:
ret = userfaultfd_api(ctx, arg);
* [PATCH 2/5] userfaultfd: allow signals to interrupt a userfault
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
This is only simple to achieve if the userfault is going to return to
userland (not to the kernel), because we can avoid returning
VM_FAULT_RETRY even though we temporarily released the mmap_sem: the
fault will then simply be retried by userland. This is safe at least
on x86 and powerpc (the two archs with the syscall implemented so
far).
Hint to verify for which archs this is safe: after handle_mm_fault
returns, the fault code in arch/*/mm/fault.c must not access any data
structures protected by the mmap_sem until up_read(&mm->mmap_sem) is
called.
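For illustration, the shape this requires of the arch fault handler (a
hedged sketch, not a literal excerpt from any arch):

	fault = handle_mm_fault(mm, vma, address, flags);
	/*
	 * From here on nothing protected by the mmap_sem may be
	 * touched: handle_userfault() may have dropped and retaken
	 * the mmap_sem before returning 0 instead of VM_FAULT_RETRY,
	 * so "vma" can be stale by now.
	 */
	if (unlikely(fault & VM_FAULT_RETRY))
		return; /* mmap_sem already released by the fault code */
	up_read(&mm->mmap_sem);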
This has two main benefits: signals run with lower latency in
production (signals aren't blocked by userfaults, and userfaults are
immediately repeated after signal processing), and gdb can trivially
debug threads blocked in this kind of userfault coming directly from
userland.
On a side note: while gdb needs signals to be processed, coredumps
have always worked perfectly with userfaults, no matter whether the
userfault is triggered by GUP, a kernel copy_user, or directly from
userland.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
fs/userfaultfd.c | 35 ++++++++++++++++++++++++++++++++---
1 file changed, 32 insertions(+), 3 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 901d52a..851d575 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -262,7 +262,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
struct userfaultfd_ctx *ctx;
struct userfaultfd_wait_queue uwq;
int ret;
- bool must_wait;
+ bool must_wait, return_to_userland;
BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
@@ -327,6 +327,9 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
uwq.msg = userfault_msg(address, flags, reason);
uwq.ctx = ctx;
+ return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+ (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+
spin_lock(&ctx->fault_pending_wqh.lock);
/*
* After the __add_wait_queue the uwq is visible to userland
@@ -338,14 +341,16 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
* following the spin_unlock to happen before the list_add in
* __add_wait_queue.
*/
- set_current_state(TASK_KILLABLE);
+ set_current_state(return_to_userland ? TASK_INTERRUPTIBLE :
+ TASK_KILLABLE);
spin_unlock(&ctx->fault_pending_wqh.lock);
must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
up_read(&mm->mmap_sem);
if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
- !fatal_signal_pending(current))) {
+ (return_to_userland ? !signal_pending(current) :
+ !fatal_signal_pending(current)))) {
wake_up_poll(&ctx->fd_wqh, POLLIN);
schedule();
ret |= VM_FAULT_MAJOR;
@@ -353,6 +358,30 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
__set_current_state(TASK_RUNNING);
+ if (return_to_userland) {
+ if (signal_pending(current) &&
+ !fatal_signal_pending(current)) {
+ /*
+ * If we got a SIGSTOP or SIGCONT and this is
+ * a normal userland page fault, just let
+ * userland return so the signal will be
+ * handled and gdb debugging works. The page
+ * fault code immediately after we return from
+ * this function is going to release the
+ * mmap_sem and it's not depending on it
+ * (unlike gup would if we were not to return
+ * VM_FAULT_RETRY).
+ *
+ * If a fatal signal is pending we still take
+ * the streamlined VM_FAULT_RETRY failure path
+ * and there's no need to retake the mmap_sem
+ * in such case.
+ */
+ down_read(&mm->mmap_sem);
+ ret = 0;
+ }
+ }
+
/*
* Here we race with the list_del; list_add in
* userfaultfd_ctx_read(), however because we don't ever run
* [PATCH 3/5] userfaultfd: propagate the full address in THP faults
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
THP faults were not propagating the original fault address. The
latest version of the API with uffd.arg.pagefault.address is supposed
to propagate the full address through THP faults.

This was not a kernel-crashing bug and it didn't risk corrupting user
memory, but it would cause a SIGBUS failure because the wrong page
was being copied.

For various reasons this wasn't easily reproducible in the qemu
workload, but the stress test exposed the problem immediately.
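For illustration, with 4k pages and 2MB THPs (hypothetical numbers): a
fault at area_dst + 0x205000 used to be reported with the THP-aligned
address area_dst + 0x200000. A monitor resolving faults with 4k
UFFDIO_COPYs, like the selftest later in this series does:

	offset = (char *)msg.arg.pagefault.address - area_dst;
	offset &= ~(page_size-1);
	copy_page(offset);

would then copy the page at offset 0x200000 instead of 0x205000,
leaving the faulting page missing.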
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/huge_memory.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 68fb507..f9f3337 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -717,13 +717,14 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd,
+ unsigned long address, pmd_t *pmd,
struct page *page, gfp_t gfp,
unsigned int flags)
{
struct mem_cgroup *memcg;
pgtable_t pgtable;
spinlock_t *ptl;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
VM_BUG_ON_PAGE(!PageCompound(page), page);
@@ -765,7 +766,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
mem_cgroup_cancel_charge(page, memcg);
put_page(page);
pte_free(mm, pgtable);
- ret = handle_userfault(vma, haddr, flags,
+ ret = handle_userfault(vma, address, flags,
VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
return ret;
@@ -841,7 +842,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (pmd_none(*pmd)) {
if (userfaultfd_missing(vma)) {
spin_unlock(ptl);
- ret = handle_userfault(vma, haddr, flags,
+ ret = handle_userfault(vma, address, flags,
VM_UFFD_MISSING);
VM_BUG_ON(ret & VM_FAULT_FALLBACK);
} else {
@@ -865,7 +866,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, flags);
+ return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
+ flags);
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* [PATCH 4/5] userfaultfd: avoid missing wakeups during refile in userfaultfd_read
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
During the refile in userfaultfd_read both waitqueues could look
empty to the lockless wake_userfault(). Use a seqcount to prevent
this false negative, which could otherwise leave a userfault blocked.
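For illustration, the window being closed looks like this (a sketch of
the two lockless paths, not the patched code itself):

	userfaultfd_read (refile)       wake_userfault (lockless check)
	-------------------------       -------------------------------
	list_del(&uwq->wq.task_list);
	                                waitqueue_active(&fault_pending_wqh)
	                                  -> false
	                                waitqueue_active(&fault_wqh)
	                                  -> false: no wakeup issued
	__add_wait_queue(&fault_wqh,
	                 &uwq->wq);

The seqcount makes the waker retry its lockless check whenever a
refile ran concurrently, so the blocked userfault cannot be missed.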
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
fs/userfaultfd.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 851d575..6a117f8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -45,6 +45,8 @@ struct userfaultfd_ctx {
wait_queue_head_t fault_wqh;
/* waitqueue head for the pseudo fd to wakeup poll/read */
wait_queue_head_t fd_wqh;
+ /* a refile sequence protected by fault_pending_wqh lock */
+ struct seqcount refile_seq;
/* pseudo fd refcounting */
atomic_t refcount;
/* userfaultfd syscall flags */
@@ -547,6 +549,15 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
uwq = find_userfault(ctx);
if (uwq) {
/*
+ * Use a seqcount to repeat the lockless check
+ * in wake_userfault() to avoid missing
+ * wakeups because during the refile both
+ * waitqueue could become empty if this is the
+ * only userfault.
+ */
+ write_seqcount_begin(&ctx->refile_seq);
+
+ /*
* The fault_pending_wqh.lock prevents the uwq
* to disappear from under us.
*
@@ -570,6 +581,8 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
list_del(&uwq->wq.task_list);
__add_wait_queue(&ctx->fault_wqh, &uwq->wq);
+ write_seqcount_end(&ctx->refile_seq);
+
/* careful to always initialize msg if ret == 0 */
*msg = uwq->msg;
spin_unlock(&ctx->fault_pending_wqh.lock);
@@ -647,6 +660,9 @@ static void __wake_userfault(struct userfaultfd_ctx *ctx,
static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
struct userfaultfd_wake_range *range)
{
+ unsigned seq;
+ bool need_wakeup;
+
/*
* To be sure waitqueue_active() is not reordered by the CPU
* before the pagetable update, use an explicit SMP memory
@@ -662,8 +678,13 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
* userfaults yet. So we take the spinlock only when we're
* sure we've userfaults to wake.
*/
- if (waitqueue_active(&ctx->fault_pending_wqh) ||
- waitqueue_active(&ctx->fault_wqh))
+ do {
+ seq = read_seqcount_begin(&ctx->refile_seq);
+ need_wakeup = waitqueue_active(&ctx->fault_pending_wqh) ||
+ waitqueue_active(&ctx->fault_wqh);
+ cond_resched();
+ } while (read_seqcount_retry(&ctx->refile_seq, seq));
+ if (need_wakeup)
__wake_userfault(ctx, range);
}
@@ -1219,6 +1240,7 @@ static void init_once_userfaultfd_ctx(void *mem)
init_waitqueue_head(&ctx->fault_pending_wqh);
init_waitqueue_head(&ctx->fault_wqh);
init_waitqueue_head(&ctx->fd_wqh);
+ seqcount_init(&ctx->refile_seq);
}
/**
* [PATCH 5/5] userfaultfd: selftest
From: Andrea Arcangeli @ 2015-07-08 10:50 UTC
To: Andrew Morton; +Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov, Dave Hansen
This test allocates two virtual areas and bounces the physical memory
across the two virtual areas using only userfaultfd.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
tools/testing/selftests/vm/Makefile | 3 +
tools/testing/selftests/vm/run_vmtests | 11 +
tools/testing/selftests/vm/userfaultfd.c | 636 +++++++++++++++++++++++++++++++
3 files changed, 650 insertions(+)
create mode 100644 tools/testing/selftests/vm/userfaultfd.c
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 231b9a0..0d68547 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -8,10 +8,13 @@ BINARIES += hugetlbfstest
BINARIES += map_hugetlb
BINARIES += thuge-gen
BINARIES += transhuge-stress
+BINARIES += userfaultfd
all: $(BINARIES)
%: %.c
$(CC) $(CFLAGS) -o $@ $^ -lrt
+userfaultfd: userfaultfd.c
+ $(CC) $(CFLAGS) -O2 -o $@ $^ -lpthread
TEST_PROGS := run_vmtests
TEST_FILES := $(BINARIES)
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 49ece11..831adeb 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -86,6 +86,17 @@ else
echo "[PASS]"
fi
+echo "--------------------"
+echo "running userfaultfd"
+echo "--------------------"
+./userfaultfd 128 32
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+ exitcode=1
+else
+ echo "[PASS]"
+fi
+
#cleanup
umount $mnt
rm -rf $mnt
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
new file mode 100644
index 0000000..0c0b839
--- /dev/null
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -0,0 +1,636 @@
+/*
+ * Stress userfaultfd syscall.
+ *
+ * Copyright (C) 2015 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * This test allocates two virtual areas and bounces the physical
+ * memory across the two virtual areas (from area_src to area_dst)
+ * using userfaultfd.
+ *
+ * There are three threads running per CPU:
+ *
+ * 1) one per-CPU thread takes a per-page pthread_mutex in a random
+ * page of the area_dst (while the physical page may still be in
+ * area_src), increments a per-page counter in the same page, and
+ * checks its value against a verification region.
+ *
+ * 2) another per-CPU thread handles the userfaults generated by
+ * thread 1 above. The userfaultfd blocking-read and poll() modes are
+ * both exercised, interleaved across bounces.
+ *
+ * 3) one last per-CPU thread transfers the memory in the background
+ * at maximum bandwidth (if not already transferred by thread
+ * 2). Each CPU thread takes care of transferring a portion of the
+ * area.
+ *
+ * When all threads of type 3 have completed the transfer, one bounce is
+ * complete. area_src and area_dst are then swapped. All threads are
+ * respawned and so the bounce is immediately restarted in the
+ * opposite direction.
+ *
+ * The per-CPU threads of type 1, by triggering userfaults inside
+ * pthread_mutex_lock, will also verify the atomicity of the memory
+ * transfer (UFFDIO_COPY).
+ *
+ * The program takes two parameters: the amount of physical memory in
+ * megabytes (MiB) of the area and the number of bounces to execute.
+ *
+ * # 100MiB 99999 bounces
+ * ./userfaultfd 100 99999
+ *
+ * # 1GiB 99 bounces
+ * ./userfaultfd 1000 99
+ *
+ * # 10MiB-~6GiB 999 bounces, continue forever unless an error triggers
+ * while ./userfaultfd $[RANDOM % 6000 + 10] 999; do true; done
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <time.h>
+#include <signal.h>
+#include <poll.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <pthread.h>
+#include "../../../../include/uapi/linux/userfaultfd.h"
+
+#ifdef __x86_64__
+#define __NR_userfaultfd 323
+#elif defined(__i386__)
+#define __NR_userfaultfd 359
+#elif defined(__powerpc__)
+#define __NR_userfaultfd 364
+#else
+#error "missing __NR_userfaultfd definition"
+#endif
+
+static unsigned long nr_cpus, nr_pages, nr_pages_per_cpu, page_size;
+
+#define BOUNCE_RANDOM (1<<0)
+#define BOUNCE_RACINGFAULTS (1<<1)
+#define BOUNCE_VERIFY (1<<2)
+#define BOUNCE_POLL (1<<3)
+static int bounces;
+
+static unsigned long long *count_verify;
+static int uffd, finished, *pipefd;
+static char *area_src, *area_dst;
+static char *zeropage;
+pthread_attr_t attr;
+
+/* pthread_mutex_t starts at page offset 0 */
+#define area_mutex(___area, ___nr) \
+ ((pthread_mutex_t *) ((___area) + (___nr)*page_size))
+/*
+ * count is placed in the page after pthread_mutex_t, naturally aligned
+ * to avoid misalignment faults on non-x86 archs.
+ */
+#define area_count(___area, ___nr) \
+ ((volatile unsigned long long *) ((unsigned long) \
+ ((___area) + (___nr)*page_size + \
+ sizeof(pthread_mutex_t) + \
+ sizeof(unsigned long long) - 1) & \
+ ~(unsigned long)(sizeof(unsigned long long) \
+ - 1)))
+
+static int my_bcmp(char *str1, char *str2, size_t n)
+{
+ unsigned long i;
+ for (i = 0; i < n; i++)
+ if (str1[i] != str2[i])
+ return 1;
+ return 0;
+}
+
+static void *locking_thread(void *arg)
+{
+ unsigned long cpu = (unsigned long) arg;
+ struct random_data rand;
+ unsigned long page_nr = *(&(page_nr)); /* uninitialized warning */
+ int32_t rand_nr;
+ unsigned long long count;
+ char randstate[64];
+ unsigned int seed;
+ time_t start;
+
+ if (bounces & BOUNCE_RANDOM) {
+ seed = (unsigned int) time(NULL) - bounces;
+ if (!(bounces & BOUNCE_RACINGFAULTS))
+ seed += cpu;
+ bzero(&rand, sizeof(rand));
+ bzero(&randstate, sizeof(randstate));
+ if (initstate_r(seed, randstate, sizeof(randstate), &rand))
+ fprintf(stderr, "srandom_r error\n"), exit(1);
+ } else {
+ page_nr = -bounces;
+ if (!(bounces & BOUNCE_RACINGFAULTS))
+ page_nr += cpu * nr_pages_per_cpu;
+ }
+
+ while (!finished) {
+ if (bounces & BOUNCE_RANDOM) {
+ if (random_r(&rand, &rand_nr))
+ fprintf(stderr, "random_r 1 error\n"), exit(1);
+ page_nr = rand_nr;
+ if (sizeof(page_nr) > sizeof(rand_nr)) {
+ if (random_r(&rand, &rand_nr))
+ fprintf(stderr, "random_r 2 error\n"), exit(1);
+ page_nr |= ((unsigned long) rand_nr) << 32;
+ }
+ } else
+ page_nr += 1;
+ page_nr %= nr_pages;
+
+ start = time(NULL);
+ if (bounces & BOUNCE_VERIFY) {
+ count = *area_count(area_dst, page_nr);
+ if (!count)
+ fprintf(stderr,
+ "page_nr %lu wrong count %Lu %Lu\n",
+ page_nr, count,
+ count_verify[page_nr]), exit(1);
+
+
+ /*
+ * We can't use bcmp (or memcmp) because that
+ * returns 0 erroneously if the memory is
+ * changing under it (even if the end of the
+ * page is never changing and always
+ * different).
+ */
+#if 1
+ if (!my_bcmp(area_dst + page_nr * page_size, zeropage,
+ page_size))
+ fprintf(stderr,
+ "my_bcmp page_nr %lu wrong count %Lu %Lu\n",
+ page_nr, count,
+ count_verify[page_nr]), exit(1);
+#else
+ unsigned long loops;
+
+ loops = 0;
+ /* uncomment the below line to test with mutex */
+ /* pthread_mutex_lock(area_mutex(area_dst, page_nr)); */
+ while (!bcmp(area_dst + page_nr * page_size, zeropage,
+ page_size)) {
+ loops += 1;
+ if (loops > 10)
+ break;
+ }
+ /* uncomment below line to test with mutex */
+ /* pthread_mutex_unlock(area_mutex(area_dst, page_nr)); */
+ if (loops) {
+ fprintf(stderr,
+ "page_nr %lu all zero thread %lu %p %lu\n",
+ page_nr, cpu, area_dst + page_nr * page_size,
+ loops);
+ if (loops > 10)
+ exit(1);
+ }
+#endif
+ }
+
+ pthread_mutex_lock(area_mutex(area_dst, page_nr));
+ count = *area_count(area_dst, page_nr);
+ if (count != count_verify[page_nr]) {
+ fprintf(stderr,
+ "page_nr %lu memory corruption %Lu %Lu\n",
+ page_nr, count,
+ count_verify[page_nr]), exit(1);
+ }
+ count++;
+ *area_count(area_dst, page_nr) = count_verify[page_nr] = count;
+ pthread_mutex_unlock(area_mutex(area_dst, page_nr));
+
+ if (time(NULL) - start > 1)
+ fprintf(stderr,
+ "userfault too slow %ld "
+ "possible false positive with overcommit\n",
+ time(NULL) - start);
+ }
+
+ return NULL;
+}
+
+static int copy_page(unsigned long offset)
+{
+ struct uffdio_copy uffdio_copy;
+
+ if (offset >= nr_pages * page_size)
+ fprintf(stderr, "unexpected offset %lu\n",
+ offset), exit(1);
+ uffdio_copy.dst = (unsigned long) area_dst + offset;
+ uffdio_copy.src = (unsigned long) area_src + offset;
+ uffdio_copy.len = page_size;
+ uffdio_copy.mode = 0;
+ uffdio_copy.copy = 0;
+ if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy)) {
+		/* real retval in uffdio_copy.copy */
+ if (uffdio_copy.copy != -EEXIST)
+ fprintf(stderr, "UFFDIO_COPY error %Ld\n",
+ uffdio_copy.copy), exit(1);
+ } else if (uffdio_copy.copy != page_size) {
+ fprintf(stderr, "UFFDIO_COPY unexpected copy %Ld\n",
+ uffdio_copy.copy), exit(1);
+ } else
+ return 1;
+ return 0;
+}
+
+static void *uffd_poll_thread(void *arg)
+{
+ unsigned long cpu = (unsigned long) arg;
+ struct pollfd pollfd[2];
+ struct uffd_msg msg;
+ int ret;
+ unsigned long offset;
+ char tmp_chr;
+ unsigned long userfaults = 0;
+
+ pollfd[0].fd = uffd;
+ pollfd[0].events = POLLIN;
+ pollfd[1].fd = pipefd[cpu*2];
+ pollfd[1].events = POLLIN;
+
+ for (;;) {
+ ret = poll(pollfd, 2, -1);
+ if (!ret)
+ fprintf(stderr, "poll error %d\n", ret), exit(1);
+ if (ret < 0)
+ perror("poll"), exit(1);
+ if (pollfd[1].revents & POLLIN) {
+ if (read(pollfd[1].fd, &tmp_chr, 1) != 1)
+ fprintf(stderr, "read pipefd error\n"),
+ exit(1);
+ break;
+ }
+ if (!(pollfd[0].revents & POLLIN))
+ fprintf(stderr, "pollfd[0].revents %d\n",
+ pollfd[0].revents), exit(1);
+ ret = read(uffd, &msg, sizeof(msg));
+ if (ret < 0) {
+ if (errno == EAGAIN)
+ continue;
+ perror("nonblocking read error"), exit(1);
+ }
+ if (msg.event != UFFD_EVENT_PAGEFAULT)
+ fprintf(stderr, "unexpected msg event %u\n",
+ msg.event), exit(1);
+ if (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+ fprintf(stderr, "unexpected write fault\n"), exit(1);
+ offset = (char *)msg.arg.pagefault.address - area_dst;
+ offset &= ~(page_size-1);
+ if (copy_page(offset))
+ userfaults++;
+ }
+ return (void *)userfaults;
+}
+
+pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+static void *uffd_read_thread(void *arg)
+{
+ unsigned long *this_cpu_userfaults;
+ struct uffd_msg msg;
+ unsigned long offset;
+ int ret;
+
+ this_cpu_userfaults = (unsigned long *) arg;
+ *this_cpu_userfaults = 0;
+
+ pthread_mutex_unlock(&uffd_read_mutex);
+ /* from here cancellation is ok */
+
+ for (;;) {
+ ret = read(uffd, &msg, sizeof(msg));
+ if (ret != sizeof(msg)) {
+ if (ret < 0)
+ perror("blocking read error"), exit(1);
+ else
+ fprintf(stderr, "short read\n"), exit(1);
+ }
+ if (msg.event != UFFD_EVENT_PAGEFAULT)
+ fprintf(stderr, "unexpected msg event %u\n",
+ msg.event), exit(1);
+ if (bounces & BOUNCE_VERIFY &&
+ msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
+ fprintf(stderr, "unexpected write fault\n"), exit(1);
+ offset = (char *)msg.arg.pagefault.address - area_dst;
+ offset &= ~(page_size-1);
+ if (copy_page(offset))
+ (*this_cpu_userfaults)++;
+ }
+ return (void *)NULL;
+}
+
+static void *background_thread(void *arg)
+{
+ unsigned long cpu = (unsigned long) arg;
+ unsigned long page_nr;
+
+ for (page_nr = cpu * nr_pages_per_cpu;
+ page_nr < (cpu+1) * nr_pages_per_cpu;
+ page_nr++)
+ copy_page(page_nr * page_size);
+
+ return NULL;
+}
+
+static int stress(unsigned long *userfaults)
+{
+ unsigned long cpu;
+ pthread_t locking_threads[nr_cpus];
+ pthread_t uffd_threads[nr_cpus];
+ pthread_t background_threads[nr_cpus];
+ void **_userfaults = (void **) userfaults;
+
+ finished = 0;
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ if (pthread_create(&locking_threads[cpu], &attr,
+ locking_thread, (void *)cpu))
+ return 1;
+ if (bounces & BOUNCE_POLL) {
+ if (pthread_create(&uffd_threads[cpu], &attr,
+ uffd_poll_thread, (void *)cpu))
+ return 1;
+ } else {
+ if (pthread_create(&uffd_threads[cpu], &attr,
+ uffd_read_thread,
+ &_userfaults[cpu]))
+ return 1;
+ pthread_mutex_lock(&uffd_read_mutex);
+ }
+ if (pthread_create(&background_threads[cpu], &attr,
+ background_thread, (void *)cpu))
+ return 1;
+ }
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ if (pthread_join(background_threads[cpu], NULL))
+ return 1;
+
+ /*
+ * Be strict and immediately zap area_src, the whole area has
+	 * been transferred already by the background threads. The
+ * area_src could then be faulted in in a racy way by still
+ * running uffdio_threads reading zeropages after we zapped
+ * area_src (but they're guaranteed to get -EEXIST from
+ * UFFDIO_COPY without writing zero pages into area_dst
+ * because the background threads already completed).
+ */
+ if (madvise(area_src, nr_pages * page_size, MADV_DONTNEED)) {
+ perror("madvise");
+ return 1;
+ }
+
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ char c;
+ if (bounces & BOUNCE_POLL) {
+ if (write(pipefd[cpu*2+1], &c, 1) != 1) {
+ fprintf(stderr, "pipefd write error\n");
+ return 1;
+ }
+ if (pthread_join(uffd_threads[cpu], &_userfaults[cpu]))
+ return 1;
+ } else {
+ if (pthread_cancel(uffd_threads[cpu]))
+ return 1;
+ if (pthread_join(uffd_threads[cpu], NULL))
+ return 1;
+ }
+ }
+
+ finished = 1;
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ if (pthread_join(locking_threads[cpu], NULL))
+ return 1;
+
+ return 0;
+}
+
+static int userfaultfd_stress(void)
+{
+ void *area;
+ char *tmp_area;
+ unsigned long nr;
+ struct uffdio_register uffdio_register;
+ struct uffdio_api uffdio_api;
+ unsigned long cpu;
+ int uffd_flags;
+ unsigned long userfaults[nr_cpus];
+
+ if (posix_memalign(&area, page_size, nr_pages * page_size)) {
+ fprintf(stderr, "out of memory\n");
+ return 1;
+ }
+ area_src = area;
+ if (posix_memalign(&area, page_size, nr_pages * page_size)) {
+ fprintf(stderr, "out of memory\n");
+ return 1;
+ }
+ area_dst = area;
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ fprintf(stderr,
+ "userfaultfd syscall not available in this kernel\n");
+ return 1;
+ }
+ uffd_flags = fcntl(uffd, F_GETFD, NULL);
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = 0;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api)) {
+ fprintf(stderr, "UFFDIO_API\n");
+ return 1;
+ }
+ if (uffdio_api.api != UFFD_API) {
+ fprintf(stderr, "UFFDIO_API error %Lu\n", uffdio_api.api);
+ return 1;
+ }
+
+ count_verify = malloc(nr_pages * sizeof(unsigned long long));
+ if (!count_verify) {
+ perror("count_verify");
+ return 1;
+ }
+
+ for (nr = 0; nr < nr_pages; nr++) {
+ *area_mutex(area_src, nr) = (pthread_mutex_t)
+ PTHREAD_MUTEX_INITIALIZER;
+ count_verify[nr] = *area_count(area_src, nr) = 1;
+ }
+
+ pipefd = malloc(sizeof(int) * nr_cpus * 2);
+ if (!pipefd) {
+ perror("pipefd");
+ return 1;
+ }
+ for (cpu = 0; cpu < nr_cpus; cpu++) {
+ if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) {
+ perror("pipe");
+ return 1;
+ }
+ }
+
+ if (posix_memalign(&area, page_size, page_size)) {
+ fprintf(stderr, "out of memory\n");
+ return 1;
+ }
+ zeropage = area;
+ bzero(zeropage, page_size);
+
+ pthread_mutex_lock(&uffd_read_mutex);
+
+ pthread_attr_init(&attr);
+ pthread_attr_setstacksize(&attr, 16*1024*1024);
+
+ while (bounces--) {
+ unsigned long expected_ioctls;
+
+ printf("bounces: %d, mode:", bounces);
+ if (bounces & BOUNCE_RANDOM)
+ printf(" rnd");
+ if (bounces & BOUNCE_RACINGFAULTS)
+ printf(" racing");
+ if (bounces & BOUNCE_VERIFY)
+ printf(" ver");
+ if (bounces & BOUNCE_POLL)
+ printf(" poll");
+ printf(", ");
+ fflush(stdout);
+
+ if (bounces & BOUNCE_POLL)
+ fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK);
+ else
+ fcntl(uffd, F_SETFL, uffd_flags & ~O_NONBLOCK);
+
+ /* register */
+ uffdio_register.range.start = (unsigned long) area_dst;
+ uffdio_register.range.len = nr_pages * page_size;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) {
+ fprintf(stderr, "register failure\n");
+ return 1;
+ }
+ expected_ioctls = (1 << _UFFDIO_WAKE) |
+ (1 << _UFFDIO_COPY) |
+ (1 << _UFFDIO_ZEROPAGE);
+ if ((uffdio_register.ioctls & expected_ioctls) !=
+ expected_ioctls) {
+ fprintf(stderr,
+ "unexpected missing ioctl for anon memory\n");
+ return 1;
+ }
+
+ /*
+ * The madvise done previously isn't enough: some
+ * uffd_thread could have read userfaults (one of
+ * those already resolved by the background thread)
+ * and it may be in the process of calling
+ * UFFDIO_COPY. UFFDIO_COPY will read the zapped
+ * area_src and it would map a zero page in it (of
+ * course such a UFFDIO_COPY is perfectly safe as it'd
+ * return -EEXIST). The problem comes at the next
+ * bounce though: that racing UFFDIO_COPY would
+		 * generate zeropages in the area_src, thus invalidating
+		 * the previous MADV_DONTNEED. Without this additional
+		 * MADV_DONTNEED those zeropage leftovers in the
+ * area_src would lead to -EEXIST failure during the
+ * next bounce, effectively leaving a zeropage in the
+ * area_dst.
+ *
+		 * Try commenting out this madvise to see the memory
+		 * corruption being caught pretty quickly.
+ *
+		 * Also, khugepaged is only inhibited from re-collapsing
+		 * THPs after MADV_DONTNEED once the range is registered
+		 * with UFFDIO_REGISTER, so the MADV_DONTNEED must be
+		 * issued here, after registering.
+ */
+ if (madvise(area_dst, nr_pages * page_size, MADV_DONTNEED)) {
+ perror("madvise 2");
+ return 1;
+ }
+
+ /* bounce pass */
+ if (stress(userfaults))
+ return 1;
+
+ /* unregister */
+ if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) {
+			fprintf(stderr, "unregister failure\n");
+ return 1;
+ }
+
+ /* verification */
+ if (bounces & BOUNCE_VERIFY) {
+ for (nr = 0; nr < nr_pages; nr++) {
+ if (my_bcmp(area_dst,
+ area_dst + nr * page_size,
+ sizeof(pthread_mutex_t))) {
+ fprintf(stderr,
+ "error mutex 2 %lu\n",
+ nr);
+ bounces = 0;
+ }
+ if (*area_count(area_dst, nr) != count_verify[nr]) {
+ fprintf(stderr,
+ "error area_count %Lu %Lu %lu\n",
+ *area_count(area_src, nr),
+ count_verify[nr],
+ nr);
+ bounces = 0;
+ }
+ }
+ }
+
+ /* prepare next bounce */
+ tmp_area = area_src;
+ area_src = area_dst;
+ area_dst = tmp_area;
+
+ printf("userfaults:");
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ printf(" %lu", userfaults[cpu]);
+ printf("\n");
+ }
+
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 3)
+ fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+ nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+ page_size = sysconf(_SC_PAGE_SIZE);
+ if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) >
+ page_size)
+ fprintf(stderr, "Impossible to run this test\n"), exit(2);
+ nr_pages_per_cpu = atol(argv[1]) * 1024*1024 / page_size /
+ nr_cpus;
+ if (!nr_pages_per_cpu) {
+ fprintf(stderr, "invalid MiB\n");
+ fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+ }
+ bounces = atoi(argv[2]);
+ if (bounces <= 0) {
+ fprintf(stderr, "invalid bounces\n");
+ fprintf(stderr, "Usage: <MiB> <bounces>\n"), exit(1);
+ }
+ nr_pages = nr_pages_per_cpu * nr_cpus;
+ printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n",
+ nr_pages, nr_pages_per_cpu);
+ return userfaultfd_stress();
+}
* Re: [PATCH 0/5] userfaultfd21 updates v2
From: Dave Hansen @ 2015-08-28 15:13 UTC
To: Andrea Arcangeli, Andrew Morton
Cc: linux-mm, Kirill A. Shutemov, Pavel Emelyanov
Hi Andrea,
Is there a way you can think of to use userfaultfd without having a
separate thread sitting there watching the file descriptor? The
current model doesn't seem like it would be possible to use with a
single-threaded app, for instance.
Is there a reason we couldn't generate a signal and then have the
userfaultfd handling done inside the signal handler?
* Re: [PATCH 0/5] userfaultfd21 updates v2
From: Andrea Arcangeli @ 2015-08-28 16:32 UTC
To: Dave Hansen; +Cc: Andrew Morton, linux-mm, Kirill A. Shutemov, Pavel Emelyanov
Hi Dave,
On Fri, Aug 28, 2015 at 08:13:18AM -0700, Dave Hansen wrote:
> Hi Andrea,
>
> Is there a way you can think of to use userfaultfd without having a
> separate thread sitting there watching the file descriptor? The
> current model doesn't seem like it would be possible to use with a
> single-threaded app, for instance.
>
> Is there a reason we couldn't generate a signal and then have the
> userfaultfd handling done inside the signal handler?
Originally it worked precisely like that, much like volatile pages or
the regular PROT_NONE+SIGSEGV would do (it only avoided the vma
mangling). However, that can't work for syscalls and get_user_pages.
I thought of taking care of get_user_pages called by the KVM shadow
page fault handler in a special way, but then there's O_DIRECT (or
any other get_user_pages user) that may be invoked by userland on top
of the userfault memory and would equally run into a get_user_pages
on a userfault area.
How do you run a signal in a single-threaded app when get_user_pages
finds it has been called on a userfault region, or when copy-user
returns -EFAULT? It's unthinkable to break the syscall API with new
retvals and require all apps to change the error checks of their
syscalls to use userfaultfd safely. We could try to play tricks with
restarting syscalls within glibc, but that SIGBUS would be
indistinguishable from a real SIGBUS. Even updating qemu alone to
accept new retvals from read/write O_DIRECT syscalls sounds like a
bad idea compared to the current userfaultfd API, which is entirely
transparent to all syscalls and also to the KVM shadow page fault
handler.
I entirely dropped the old MADV_USERFAULT/NOUSERFAULT madvise; it has
basically become the UFFDIO_REGISTER/UNREGISTER ioctls. It looked
like a bad idea to start with MADV_USERFAULT and signals, as you may
later notice you need a syscall anyway and have to rewrite the code
around userfaultfd... better to start right away with userfaultfd and
avoid the risk of wasting time.
Right now copy-user and get_user_pages just block in the kernel, so
userfaults are effectively invisible to the single-threaded workflow:
from the caller's point of view the get_user_pages API is totally
userfaultfd agnostic, but you need a separate thread to handle the
fault.
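For example, a minimal sketch of such a fault-handling thread (see the
selftest's uffd_read_thread for the full version; uffd, area_src,
area_dst and page_size are assumed to be set up as in the selftest):

	static void *monitor_thread(void *arg)
	{
		struct uffd_msg msg;
		struct uffdio_copy copy;

		for (;;) {
			/* blocks until some thread hits a missing page */
			if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
				break;
			if (msg.event != UFFD_EVENT_PAGEFAULT)
				continue;
			copy.dst = msg.arg.pagefault.address &
				~(unsigned long)(page_size - 1);
			copy.src = (unsigned long)area_src +
				(copy.dst - (unsigned long)area_dst);
			copy.len = page_size;
			copy.mode = 0;
			copy.copy = 0;
			/* atomically fills the page and wakes the faulter */
			ioctl(uffd, UFFDIO_COPY, &copy);
		}
		return NULL;
	}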
The fact that the kernel, in the blocked fault, talks with the
userland thread directly over the fd should be more efficient too.
The only downside is a schedule() call when blocking, but on the plus
side we don't have to invoke the signal code at all, which also isn't
free (even if potentially less costly than schedule() when there are
lots of runnable tasks and CPU overcommit). In addition, signals
themselves can trip on userfaults now, and gdb will also successfully
send SIGSTOP to userfault-blocked faults if they didn't happen in
kernel context (where, again, signals can't run, not even
SIGSTOP/SIGCONT).
The only requirement added has been to have VM_FAULT_RETRY set in
every fault or get_user_pages that can hit userfaultfd-registered
regions, so we can drop the mmap_sem before blocking. We have to drop
the mmap_sem, or we'd allow a userland thread to keep the mmap_sem
held indefinitely, which wouldn't be safe (ps may block indefinitely,
etc.). If there's a bug and VM_FAULT_RETRY is missing, SIGBUS is
raised, O_DIRECT or the KVM page fault returns some noticeable error
(as if it were a real SIGBUS), and a rate-limited printk is emitted
along with the offending stack trace, which must be fixed to pass
VM_FAULT_RETRY.
On a side note, I'm about to extend this VM_FAULT_RETRY handling so
that if the userfault is invoked when FAULT_FLAG_TRIED is set (which
means VM_FAULT_RETRY was also previously set) we can still block and
drop the mmap_sem. This is the major not-self-contained change needed
to introduce wrprotect fault tracking to userfaultfd (so that
postcopy live snapshotting becomes possible, and distributed shared
memory with readonly-shared write-exclusive semantics also becomes
possible). With wrprotect faults, the pagetable-based vma-less
wrprotection can be armed with a UFFDIO_ ioctl while the page fault
is already in the VM_FAULT_RETRY path for another reason and has
already consumed the retry, so we may hit a wrprotect userfault in
the FAULT_FLAG_TRIED case (that cannot happen with the missing-fault
mode, so we didn't need to alter the page fault logic yet and the
stock VM_FAULT_RETRY sufficed so far).
Thanks,
Andrea