* [PATCH 00/17] RFC: userfault v2
@ 2014-10-03 17:07 Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
                   ` (16 more replies)
  0 siblings, 17 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Hello everyone,

There's a large To/Cc list for this RFC because it adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on the changes are
welcome sooner rather than later.

The major change compared to the previous RFC I sent a few months ago
is that the userfaultfd protocol now supports dynamic range
registration. There can be an unlimited number of userfaultfds per
process, so each shared library can use its own userfaultfd on its own
memory, independently of other shared libraries or the main
program. This functionality was suggested by Andy Lutomirski (more
details on this are in the commit header of the last patch of this
patchset).

In addition, the mmap_sem complexities have been sorted out. The real
userfault patchset starts from patch number 7. Patches 1-6 will be
submitted separately for merging; applied standalone they provide a
scalability improvement by reducing the mmap_sem hold times during
I/O. I included patches 1-6 here too because they're a hard dependency
for the userfault patchset. The userfaultfd syscall depends on the
first fault always having FAULT_FLAG_ALLOW_RETRY set (the later retry
faults don't matter: it's fine to clear FAULT_FLAG_ALLOW_RETRY for the
retry faults, following the current model).

The combination of these features is what I would propose to implement
postcopy live migration in qemu, and in general demand paging of
remote memory hosted on different cloud nodes.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used on
top of MADV_USERFAULT, to make the userfault unnoticeable to the
syscall (no error will be returned). This latter feature is more
advanced than what volatile ranges alone could do with SIGBUS so far
(but it's optional: if the process doesn't register the memory in a
userfaultfd, the regular SIGBUS will fire, and if the fd is closed
SIGBUS will also fire for any blocked userfault that was waiting for a
userfaultfd_write ack).
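
To make the plain SIGBUS path concrete, here is a minimal userland
sketch (illustrative only: it assumes the MADV_USERFAULT value
proposed below and that SIGBUS reports the faulting address in
si_addr; resolving the fault is only hinted at, since the resolution
protocol is defined by the later patches):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_USERFAULT
    #define MADV_USERFAULT 18    /* value proposed by this patchset */
    #endif

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *fault_addr = si->si_addr;
        /* a real handler would fill the missing page at fault_addr
         * here, e.g. fetch it over the network and move it in place
         * with remap_anon_pages */
        (void)fault_addr; (void)sig; (void)ctx;
    }

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;
        struct sigaction sa;
        char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        /* ask for userfaults on the still unmapped range */
        madvise(area, len, MADV_USERFAULT);

        area[0] = 1;    /* raises SIGBUS instead of allocating a page */
        return 0;
    }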

userfaultfd is also a generic enough feature that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler, so guest
processes that aren't waiting for userfaults can keep running in the
guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory: vmsplice will likely still be used in the case of local
postcopy live migration, just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network; it's zerocopy and doesn't touch the vma, it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict, to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently, resulting for example in subtle memory
corruption. remap_anon_pages would return -EEXIST in that case. If
there are holes in the source range remap_anon_pages will return
-ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could be
4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set; remap_anon_pages will then just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set if doing the userfaults with PAGE_SIZE granularity and the
developer wants to benefit from the strict -ENOENT behavior.
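
To sketch how the destination side of postcopy could move a just
received, 2M aligned page in place (illustrative only: the syscall
number, the (dst, src, len, flags) argument order, the
RAP_ALLOW_SRC_HOLES value and the return convention are assumptions
here, the authoritative definitions are in the syscall patch):

    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_remap_anon_pages
    #error "arch specific, added by this series' syscall table patches"
    #endif

    #ifndef RAP_ALLOW_SRC_HOLES
    #define RAP_ALLOW_SRC_HOLES 0x01UL   /* assumed value */
    #endif

    #define HPAGE_SIZE (2UL * 1024 * 1024)

    /* assumed argument order: dst, src, len, flags */
    static long remap_anon_pages(void *dst, void *src, unsigned long len,
                                 unsigned long flags)
    {
        return syscall(__NR_remap_anon_pages, dst, src, len, flags);
    }

    /* move a freshly received 2M page from a staging buffer into the
     * faulting guest range; both addresses 2M aligned so THP isn't split */
    static int place_page(void *guest_addr, void *staging_buf)
    {
        long ret = remap_anon_pages(guest_addr, staging_buf, HPAGE_SIZE,
                                    RAP_ALLOW_SRC_HOLES);
        return ret < 0 ? -1 : 0;   /* EEXIST/ENOENT handling omitted */
    }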

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model, as an
alternative to zapping re-dirtied pages with MADV_DONTNEED at 4k
granularity before starting the guest in the destination node), where
vectoring isn't going to provide much performance advantage (thanks
to the coarser THP granularity).

On the rmap side remap_anon_pages doesn't add much complexity: there's
no need for nonlinear anon vmas to support it because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (as qemu can in some cases).

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages. Or it could be used for other
similar things with tmpfs in the future; I've been discussing how to
extend it to tmpfs, for example. Currently, if MADV_USERFAULT is set
on a non-anonymous vma, it returns -EINVAL, and that's enough to
provide backwards compatibility once MADV_USERFAULT is extended to
tmpfs. An orthogonal problem then will be to identify the optimal
mechanism to atomically resolve a tmpfs backed userfault (like
remap_anon_pages does optimally for anonymous memory), but that's
beyond the scope of the userfault functionality (in theory
remap_anon_pages is also orthogonal and I could split it off into a
separate patchset if somebody prefers). Of course remap_file_pages
would do the job too, but it would create rmap nonlinearity, which
isn't optimal.

The code can be found here:

git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea

Andrea Arcangeli (15):
  mm: gup: add get_user_pages_locked and get_user_pages_unlocked
  mm: gup: use get_user_pages_unlocked within get_user_pages_fast
  mm: gup: make get_user_pages_fast and __get_user_pages_fast latency
    conscious
  mm: gup: use get_user_pages_fast and get_user_pages_unlocked
  mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  mm: madvise MADV_USERFAULT
  mm: PT lock: export double_pt_lock/unlock
  mm: rmap preparation for remap_anon_pages
  mm: swp_entry_swapcount
  mm: sys_remap_anon_pages
  waitqueue: add nr wake parameter to __wake_up_locked_key
  userfaultfd: add new syscall to provide memory externalization
  userfaultfd: make userfaultfd_write non blocking
  powerpc: add remap_anon_pages and userfaultfd
  userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER

Andres Lagar-Cavilla (2):
  mm: gup: add FOLL_TRIED
  kvm: Faults which trigger IO release the mmap_sem

 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/mips/mm/gup.c                     |   8 +-
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/powerpc/include/asm/systbl.h      |   2 +
 arch/powerpc/include/asm/unistd.h      |   2 +-
 arch/powerpc/include/uapi/asm/unistd.h |   2 +
 arch/powerpc/mm/gup.c                  |   6 +-
 arch/s390/kvm/kvm-s390.c               |   4 +-
 arch/s390/mm/gup.c                     |   6 +-
 arch/sh/mm/gup.c                       |   6 +-
 arch/sparc/mm/gup.c                    |   6 +-
 arch/x86/mm/gup.c                      | 235 +++++++----
 arch/x86/syscalls/syscall_32.tbl       |   2 +
 arch/x86/syscalls/syscall_64.tbl       |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 drivers/dma/iovlock.c                  |  10 +-
 drivers/iommu/amd_iommu_v2.c           |   6 +-
 drivers/media/pci/ivtv/ivtv-udma.c     |   6 +-
 drivers/scsi/st.c                      |  10 +-
 drivers/video/fbdev/pvr2fb.c           |   5 +-
 fs/Makefile                            |   1 +
 fs/proc/task_mmu.c                     |   5 +-
 fs/userfaultfd.c                       | 722 +++++++++++++++++++++++++++++++++
 include/linux/huge_mm.h                |  11 +-
 include/linux/ksm.h                    |   4 +-
 include/linux/mm.h                     |  15 +-
 include/linux/mm_types.h               |  13 +-
 include/linux/swap.h                   |   6 +
 include/linux/syscalls.h               |   5 +
 include/linux/userfaultfd.h            |  55 +++
 include/linux/wait.h                   |   5 +-
 include/uapi/asm-generic/mman-common.h |   3 +
 init/Kconfig                           |  11 +
 kernel/sched/wait.c                    |   7 +-
 kernel/sys_ni.c                        |   2 +
 mm/fremap.c                            | 506 +++++++++++++++++++++++
 mm/gup.c                               | 182 ++++++++-
 mm/huge_memory.c                       | 208 ++++++++--
 mm/ksm.c                               |   2 +-
 mm/madvise.c                           |  22 +-
 mm/memory.c                            |  14 +
 mm/mempolicy.c                         |   4 +-
 mm/mlock.c                             |   3 +-
 mm/mmap.c                              |  39 +-
 mm/mprotect.c                          |   3 +-
 mm/mremap.c                            |   2 +-
 mm/nommu.c                             |  23 ++
 mm/process_vm_access.c                 |   7 +-
 mm/rmap.c                              |   9 +
 mm/swapfile.c                          |  13 +
 mm/util.c                              |  10 +-
 net/ceph/pagevec.c                     |   9 +-
 net/sunrpc/sched.c                     |   2 +-
 virt/kvm/async_pf.c                    |   4 +-
 virt/kvm/kvm_main.c                    |   4 +-
 56 files changed, 2025 insertions(+), 236 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h


* [PATCH 01/17] mm: gup: add FOLL_TRIED
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 18:15   ` Linus Torvalds
  2014-10-03 17:07 ` [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

From: Andres Lagar-Cavilla <andreslc@google.com>

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h | 1 +
 mm/gup.c           | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc8..0f4196a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b..af7ea3e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
 	if (*flags & FOLL_NOWAIT)
 		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
+	if (*flags & FOLL_TRIED) {
+		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
+		fault_flags |= FAULT_FLAG_TRIED;
+	}
 
 	ret = handle_mm_fault(mm, vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {


* [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

We can leverage the VM_FAULT_RETRY functionality in the page fault
paths better by using either get_user_pages_locked or
get_user_pages_unlocked.

The former allows conversion of get_user_pages invocations that will
have to pass a "&locked" parameter to know if the mmap_sem was dropped
during the call. For example, from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

The latter is suitable only as a drop-in replacement for the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

Here tsk, mm, the intermediate "..." parameters and "pages" can be any
value as before. Just the last parameter of get_user_pages (vmas) must
be NULL for get_user_pages_locked|unlocked to be usable (the latter
original form wouldn't have been safe anyway if vmas wasn't NULL; for
the former we just make it explicit by dropping the parameter).

If vmas is not NULL these two methods cannot be used.

This patch then applies the new forms in various places, in some cases
also replacing get_user_pages with get_user_pages_fast whenever tsk
and mm are current and current->mm. get_user_pages_unlocked differs
from get_user_pages_fast only if mm is not current->mm (like when
get_user_pages works on some other process's mm). Whenever tsk and mm
match current and current->mm, get_user_pages_fast must always be used
to increase performance and get the pages locklessly (only with irqs
disabled).
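
For instance (an illustrative driver-style snippet, not taken from the
sites converted by this patch), a call site that pins pages of the
current task can drop the mmap_sem handling entirely:

    /* before: tsk == current, mm == current->mm, vmas == NULL */
    down_read(&current->mm->mmap_sem);
    ret = get_user_pages(current, current->mm, uaddr, nr_pages,
                         1 /* write */, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);

    /* after: lockless fast path, internally falling back to the
       slow path (and then to get_user_pages_unlocked) as needed */
    ret = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);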

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
---
 include/linux/mm.h |   7 +++
 mm/gup.c           | 178 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/nommu.c         |  23 +++++++
 3 files changed, 197 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f4196a..8900ba9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1196,6 +1196,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		    unsigned long start, unsigned long nr_pages,
 		    int write, int force, struct page **pages,
 		    struct vm_area_struct **vmas);
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages,
+		    int *locked);
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+		    unsigned long start, unsigned long nr_pages,
+		    int write, int force, struct page **pages);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct kvec;
diff --git a/mm/gup.c b/mm/gup.c
index af7ea3e..6f2f757 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,166 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	return 0;
 }
 
+static inline long __get_user_pages_locked(struct task_struct *tsk,
+					   struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   int write, int force,
+					   struct page **pages,
+					   struct vm_area_struct **vmas,
+					   int *locked,
+					   bool notify_drop)
+{
+	int flags = FOLL_TOUCH;
+	long ret, pages_done;
+	bool lock_dropped;
+
+	if (locked) {
+		/* if VM_FAULT_RETRY can be returned, vmas become invalid */
+		BUG_ON(vmas);
+		/* check caller initialized locked */
+		BUG_ON(*locked != 1);
+	}
+
+	if (pages)
+		flags |= FOLL_GET;
+	if (write)
+		flags |= FOLL_WRITE;
+	if (force)
+		flags |= FOLL_FORCE;
+
+	pages_done = 0;
+	lock_dropped = false;
+	for (;;) {
+		ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
+				       vmas, locked);
+		if (!locked)
+			/* VM_FAULT_RETRY couldn't trigger, bypass */
+			return ret;
+
+		/* VM_FAULT_RETRY cannot return errors */
+		if (!*locked) {
+			BUG_ON(ret < 0);
+			BUG_ON(ret >= nr_pages);
+		}
+
+		if (!pages)
+			/* If it's a prefault don't insist harder */
+			return ret;
+
+		if (ret > 0) {
+			nr_pages -= ret;
+			pages_done += ret;
+			if (!nr_pages)
+				break;
+		}
+		if (*locked) {
+			/* VM_FAULT_RETRY didn't trigger */
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		/* VM_FAULT_RETRY triggered, so seek to the faulting offset */
+		pages += ret;
+		start += ret << PAGE_SHIFT;
+
+		/*
+		 * Repeat on the address that fired VM_FAULT_RETRY
+		 * without FAULT_FLAG_ALLOW_RETRY but with
+		 * FAULT_FLAG_TRIED.
+		 */
+		*locked = 1;
+		lock_dropped = true;
+		down_read(&mm->mmap_sem);
+		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
+				       pages, NULL, NULL);
+		if (ret != 1) {
+			BUG_ON(ret > 1);
+			if (!pages_done)
+				pages_done = ret;
+			break;
+		}
+		nr_pages--;
+		pages_done++;
+		if (!nr_pages)
+			break;
+		pages++;
+		start += PAGE_SIZE;
+	}
+	if (notify_drop && lock_dropped && *locked) {
+		/*
+		 * We must let the caller know we temporarily dropped the lock
+		 * and so the critical section protected by it was lost.
+		 */
+		up_read(&mm->mmap_sem);
+		*locked = 0;
+	}
+	return pages_done;
+}
+
+/*
+ * We can leverage the VM_FAULT_RETRY functionality in the page fault
+ * paths better by using either get_user_pages_locked() or
+ * get_user_pages_unlocked().
+ *
+ * get_user_pages_locked() is suitable to replace the form:
+ *
+ *      down_read(&mm->mmap_sem);
+ *      do_something()
+ *      get_user_pages(tsk, mm, ..., pages, NULL);
+ *      up_read(&mm->mmap_sem);
+ *
+ *  to:
+ *
+ *      int locked = 1;
+ *      down_read(&mm->mmap_sem);
+ *      do_something()
+ *      get_user_pages_locked(tsk, mm, ..., pages, &locked);
+ *      if (locked)
+ *          up_read(&mm->mmap_sem);
+ */
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, NULL, locked, true);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+/*
+ * get_user_pages_unlocked() is suitable to replace the form:
+ *
+ *      down_read(&mm->mmap_sem);
+ *      get_user_pages(tsk, mm, ..., pages, NULL);
+ *      up_read(&mm->mmap_sem);
+ *
+ *  with:
+ *
+ *      get_user_pages_unlocked(tsk, mm, ..., pages);
+ *
+ * It is functionally equivalent to get_user_pages_fast so
+ * get_user_pages_fast should be used instead, if the two parameters
+ * "tsk" and "mm" are respectively equal to current and current->mm,
+ * or if "force" shall be set to 1 (get_user_pages_fast misses the
+ * "force" parameter).
+ */
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	int locked = 1;
+	down_read(&mm->mmap_sem);
+	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				      pages, NULL, &locked, false);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -629,22 +789,18 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
  * use the correct cache flushing APIs.
  *
  * See also get_user_pages_fast, for performance critical applications.
+ *
+ * get_user_pages should be phased out in favor of
+ * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
+ * should use get_user_pages because it cannot pass
+ * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
 long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages, int write,
 		int force, struct page **pages, struct vm_area_struct **vmas)
 {
-	int flags = FOLL_TOUCH;
-
-	if (pages)
-		flags |= FOLL_GET;
-	if (write)
-		flags |= FOLL_WRITE;
-	if (force)
-		flags |= FOLL_FORCE;
-
-	return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
-				NULL);
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
+				       pages, vmas, NULL, false);
 }
 EXPORT_SYMBOL(get_user_pages);
 
diff --git a/mm/nommu.c b/mm/nommu.c
index a881d96..3918b0f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   int write, int force, struct page **pages,
+			   int *locked)
+{
+	return get_user_pages(tsk, mm, start, nr_pages, write, force,
+			      pages, NULL);
+}
+EXPORT_SYMBOL(get_user_pages_locked);
+
+long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
+			     unsigned long start, unsigned long nr_pages,
+			     int write, int force, struct page **pages)
+{
+	long ret;
+	down_read(&mm->mmap_sem);
+	ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
+			     pages, NULL);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+EXPORT_SYMBOL(get_user_pages_unlocked);
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping


* [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/mips/mm/gup.c       | 8 +++-----
 arch/powerpc/mm/gup.c    | 6 ++----
 arch/s390/kvm/kvm-s390.c | 4 +---
 arch/s390/mm/gup.c       | 6 ++----
 arch/sh/mm/gup.c         | 6 ++----
 arch/sparc/mm/gup.c      | 6 ++----
 arch/x86/mm/gup.c        | 7 +++----
 7 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c..20884f5 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -301,11 +301,9 @@ slow_irqon:
 	start += nr << PAGE_SHIFT;
 	pages += nr;
 
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-				(end - start) >> PAGE_SHIFT,
-				write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+				      (end - start) >> PAGE_SHIFT,
+				      write, 0, pages);
 
 	/* Have to be a bit careful with return values */
 	if (nr > 0) {
diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index d874668..b70c34a 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-				     nr_pages - nr, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      nr_pages - nr, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..37ca29a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
 	hva = gmap_fault(gpa, vcpu->arch.gmap);
 	if (IS_ERR_VALUE(hva))
 		return (long)hva;
-	down_read(&mm->mmap_sem);
-	rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
 
 	return rc < 0 ? rc : 0;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce46..5c586c7 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	/* Try to get the remaining pages with get_user_pages */
 	start += nr << PAGE_SHIFT;
 	pages += nr;
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start,
-			     nr_pages - nr, write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
+	ret = get_user_pages_unlocked(current, mm, start,
+			     nr_pages - nr, write, 0, pages);
 	/* Have to be a bit careful with return values */
 	if (nr > 0)
 		ret = (ret < 0) ? nr : ret + nr;
diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
index 37458f3..e15f52a 100644
--- a/arch/sh/mm/gup.c
+++ b/arch/sh/mm/gup.c
@@ -257,10 +257,8 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed043..fa7de7d 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -219,10 +219,8 @@ slow:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+			(end - start) >> PAGE_SHIFT, write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef..2ab183b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -388,10 +388,9 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;
 
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(current, mm, start,
-			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
-		up_read(&mm->mmap_sem);
+		ret = get_user_pages_unlocked(current, mm, start,
+					      (end - start) >> PAGE_SHIFT,
+					      write, 0, pages);
 
 		/* Have to be a bit careful with return values */
 		if (nr > 0) {


* [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 18:23   ` Linus Torvalds
  2014-10-03 17:07 ` [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

This teaches gup_fast and __gup_fast to re-enable irqs and run
cond_resched(), if possible, every BATCH_PAGES pages.

The same must be implemented by the other archs as well, and it's a
requirement before converting more get_user_pages() calls to
get_user_pages_fast() as an optimization (instead of using
get_user_pages_unlocked, which would be slower).
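
Illustratively (arithmetic only, not from the patch): with
BATCH_PAGES = 512 and 4k pages, pinning a 16MB buffer (4096 pages) is
split into eight batches, each keeping irqs disabled for at most 2MB
worth of pagetable walking, with irqs re-enabled and cond_resched()
run between batches.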

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/mm/gup.c | 234 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 149 insertions(+), 85 deletions(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 2ab183b..917d8c1 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -12,6 +12,12 @@
 
 #include <asm/pgtable.h>
 
+/*
+ * Keep irq disabled for no more than BATCH_PAGES pages.
+ * Matches PTRS_PER_PTE (or half in non-PAE kernels).
+ */
+#define BATCH_PAGES	512
+
 static inline pte_t gup_get_pte(pte_t *ptep)
 {
 #ifndef CONFIG_X86_PAE
@@ -250,6 +256,40 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 	return 1;
 }
 
+static inline int __get_user_pages_fast_batch(unsigned long start,
+					      unsigned long end,
+					      int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long next;
+	unsigned long flags;
+	pgd_t *pgdp;
+	int nr = 0;
+
+	/*
+	 * This doesn't prevent pagetable teardown, but does prevent
+	 * the pagetables and pages from being freed on x86.
+	 *
+	 * So long as we atomically load page table pointers versus teardown
+	 * (which we do on x86, with the above PAE exception), we can follow the
+	 * address down to the the page and take a ref on it.
+	 */
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, start);
+	do {
+		pgd_t pgd = *pgdp;
+
+		next = pgd_addr_end(start, end);
+		if (pgd_none(pgd))
+			break;
+		if (!gup_pud_range(pgd, start, next, write, pages, &nr))
+			break;
+	} while (pgdp++, start = next, start != end);
+	local_irq_restore(flags);
+
+	return nr;
+}
+
 /*
  * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
  * back to the regular GUP.
@@ -257,31 +297,55 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr, len, end;
-	unsigned long next;
-	unsigned long flags;
-	pgd_t *pgdp;
-	int nr = 0;
+	unsigned long len, end, batch_pages;
+	int nr, ret;
 
 	start &= PAGE_MASK;
-	addr = start;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 	end = start + len;
+	/*
+	 * get_user_pages() handles nr_pages == 0 gracefully, but
+	 * gup_fast starts walking the first pagetable in a do {}
+	 * while() fashion so it's not robust to handle nr_pages ==
+	 * 0. There's no point in being permissive about end < start
+	 * either. So this check verifies both nr_pages being non
+	 * zero, and that "end" didn't overflow.
+	 */
+	VM_BUG_ON(end <= start);
 	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
 					(void __user *)start, len)))
 		return 0;
 
-	/*
-	 * XXX: batch / limit 'nr', to avoid large irq off latency
-	 * needs some instrumenting to determine the common sizes used by
-	 * important workloads (eg. DB2), and whether limiting the batch size
-	 * will decrease performance.
-	 *
-	 * It seems like we're in the clear for the moment. Direct-IO is
-	 * the main guy that batches up lots of get_user_pages, and even
-	 * they are limited to 64-at-a-time which is not so many.
-	 */
+	ret = 0;
+	for (;;) {
+		batch_pages = nr_pages;
+		if (batch_pages > BATCH_PAGES && !irqs_disabled())
+			batch_pages = BATCH_PAGES;
+		len = (unsigned long) batch_pages << PAGE_SHIFT;
+		end = start + len;
+		nr = __get_user_pages_fast_batch(start, end, write, pages);
+		VM_BUG_ON(nr > batch_pages);
+		nr_pages -= nr;
+		ret += nr;
+		if (!nr_pages || nr != batch_pages)
+			break;
+		start += len;
+		pages += batch_pages;
+	}
+
+	return ret;
+}
+
+static inline int get_user_pages_fast_batch(unsigned long start,
+					    unsigned long end,
+					    int write, struct page **pages)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long next;
+	pgd_t *pgdp;
+	int nr = 0;
+	unsigned long orig_start = start;
+
 	/*
 	 * This doesn't prevent pagetable teardown, but does prevent
 	 * the pagetables and pages from being freed on x86.
@@ -290,18 +354,24 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 * (which we do on x86, with the above PAE exception), we can follow the
 	 * address down to the the page and take a ref on it.
 	 */
-	local_irq_save(flags);
-	pgdp = pgd_offset(mm, addr);
+	local_irq_disable();
+	pgdp = pgd_offset(mm, start);
 	do {
 		pgd_t pgd = *pgdp;
 
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(pgd))
+		next = pgd_addr_end(start, end);
+		if (pgd_none(pgd)) {
+			VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
 			break;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		}
+		if (!gup_pud_range(pgd, start, next, write, pages, &nr)) {
+			VM_BUG_ON(nr >= (end-orig_start) >> PAGE_SHIFT);
 			break;
-	} while (pgdp++, addr = next, addr != end);
-	local_irq_restore(flags);
+		}
+	} while (pgdp++, start = next, start != end);
+	local_irq_enable();
+
+	cond_resched();
 
 	return nr;
 }
@@ -326,80 +396,74 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
-	unsigned long addr, len, end;
-	unsigned long next;
-	pgd_t *pgdp;
-	int nr = 0;
+	unsigned long len, end, batch_pages;
+	int nr, ret;
+	unsigned long orig_start;
 
 	start &= PAGE_MASK;
-	addr = start;
+	orig_start = start;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
 
 	end = start + len;
-	if (end < start)
-		goto slow_irqon;
+	/*
+	 * get_user_pages() handles nr_pages == 0 gracefully, but
+	 * gup_fast starts walking the first pagetable in a do {}
+	 * while() fashion so it's not robust to handle nr_pages ==
+	 * 0. There's no point in being permissive about end < start
+	 * either. So this check verifies both nr_pages being non
+	 * zero, and that "end" didn't overflow.
+	 */
+	VM_BUG_ON(end <= start);
 
+	nr = ret = 0;
 #ifdef CONFIG_X86_64
 	if (end >> __VIRTUAL_MASK_SHIFT)
 		goto slow_irqon;
 #endif
+	for (;;) {
+		batch_pages = min(nr_pages, BATCH_PAGES);
+		len = (unsigned long) batch_pages << PAGE_SHIFT;
+		end = start + len;
+		nr = get_user_pages_fast_batch(start, end, write, pages);
+		VM_BUG_ON(nr > batch_pages);
+		nr_pages -= nr;
+		ret += nr;
+		if (!nr_pages)
+			break;
+		if (nr < batch_pages)
+			goto slow_irqon;
+		start += len;
+		pages += batch_pages;
+	}
 
-	/*
-	 * XXX: batch / limit 'nr', to avoid large irq off latency
-	 * needs some instrumenting to determine the common sizes used by
-	 * important workloads (eg. DB2), and whether limiting the batch size
-	 * will decrease performance.
-	 *
-	 * It seems like we're in the clear for the moment. Direct-IO is
-	 * the main guy that batches up lots of get_user_pages, and even
-	 * they are limited to 64-at-a-time which is not so many.
-	 */
-	/*
-	 * This doesn't prevent pagetable teardown, but does prevent
-	 * the pagetables and pages from being freed on x86.
-	 *
-	 * So long as we atomically load page table pointers versus teardown
-	 * (which we do on x86, with the above PAE exception), we can follow the
-	 * address down to the the page and take a ref on it.
-	 */
-	local_irq_disable();
-	pgdp = pgd_offset(mm, addr);
-	do {
-		pgd_t pgd = *pgdp;
-
-		next = pgd_addr_end(addr, end);
-		if (pgd_none(pgd))
-			goto slow;
-		if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
-			goto slow;
-	} while (pgdp++, addr = next, addr != end);
-	local_irq_enable();
-
-	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
-	return nr;
-
-	{
-		int ret;
+	VM_BUG_ON(ret != (end - orig_start) >> PAGE_SHIFT);
+	return ret;
 
-slow:
-		local_irq_enable();
 slow_irqon:
-		/* Try to get the remaining pages with get_user_pages */
-		start += nr << PAGE_SHIFT;
-		pages += nr;
-
-		ret = get_user_pages_unlocked(current, mm, start,
-					      (end - start) >> PAGE_SHIFT,
-					      write, 0, pages);
-
-		/* Have to be a bit careful with return values */
-		if (nr > 0) {
-			if (ret < 0)
-				ret = nr;
-			else
-				ret += nr;
-		}
+	/* Try to get the remaining pages with get_user_pages */
+	start += nr << PAGE_SHIFT;
+	pages += nr;
 
-		return ret;
+	/*
+	 * "nr" was the get_user_pages_fast_batch last retval, "ret"
+	 * was the sum of all get_user_pages_fast_batch retvals, now
+	 * "nr" becomes the sum of all get_user_pages_fast_batch
+	 * retvals and "ret" will become the get_user_pages_unlocked
+	 * retval.
+	 */
+	nr = ret;
+
+	ret = get_user_pages_unlocked(current, mm, start,
+				      (end - start) >> PAGE_SHIFT,
+				      write, 0, pages);
+
+	/* Have to be a bit careful with return values */
+	if (nr > 0) {
+		if (ret < 0)
+			ret = nr;
+		else
+			ret += nr;
 	}
+
+	return ret;
 }


* [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Just an optimization.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/dma/iovlock.c              | 10 ++--------
 drivers/iommu/amd_iommu_v2.c       |  6 ++----
 drivers/media/pci/ivtv/ivtv-udma.c |  6 ++----
 drivers/scsi/st.c                  | 10 ++--------
 drivers/video/fbdev/pvr2fb.c       |  5 +----
 mm/process_vm_access.c             |  7 ++-----
 mm/util.c                          | 10 ++--------
 net/ceph/pagevec.c                 |  9 ++++-----
 8 files changed, 17 insertions(+), 46 deletions(-)

diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index bb48a57..12ea7c3 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
 		pages += page_list->nr_pages;
 
 		/* pin pages down */
-		down_read(&current->mm->mmap_sem);
-		ret = get_user_pages(
-			current,
-			current->mm,
+		ret = get_user_pages_fast(
 			(unsigned long) iov[i].iov_base,
 			page_list->nr_pages,
 			1,	/* write */
-			0,	/* force */
-			page_list->pages,
-			NULL);
-		up_read(&current->mm->mmap_sem);
+			page_list->pages);
 
 		if (ret != page_list->nr_pages)
 			goto unpin;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 5f578e8..6963b73 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
 
 	write = !!(fault->flags & PPR_FAULT_WRITE);
 
-	down_read(&fault->state->mm->mmap_sem);
-	npages = get_user_pages(NULL, fault->state->mm,
-				fault->address, 1, write, 0, &page, NULL);
-	up_read(&fault->state->mm->mmap_sem);
+	npages = get_user_pages_unlocked(NULL, fault->state->mm,
+					 fault->address, 1, write, 0, &page);
 
 	if (npages == 1) {
 		put_page(page);
diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
index 7338cb2..96d866b 100644
--- a/drivers/media/pci/ivtv/ivtv-udma.c
+++ b/drivers/media/pci/ivtv/ivtv-udma.c
@@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
 	}
 
 	/* Get user pages for DMA Xfer */
-	down_read(&current->mm->mmap_sem);
-	err = get_user_pages(current, current->mm,
-			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
-	up_read(&current->mm->mmap_sem);
+	err = get_user_pages_unlocked(current, current->mm,
+			user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
 
 	if (user_dma.page_count != err) {
 		IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index aff9689..c89dcfa 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
 		return -ENOMEM;
 
         /* Try to fault in all of the necessary pages */
-	down_read(&current->mm->mmap_sem);
         /* rw==READ means read from drive, write into memory area */
-	res = get_user_pages(
-		current,
-		current->mm,
+	res = get_user_pages_fast(
 		uaddr,
 		nr_pages,
 		rw == READ,
-		0, /* don't force */
-		pages,
-		NULL);
-	up_read(&current->mm->mmap_sem);
+		pages);
 
 	/* Errors and no page mapped should return here */
 	if (res < nr_pages)
diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
index 167cfff..ff81f65 100644
--- a/drivers/video/fbdev/pvr2fb.c
+++ b/drivers/video/fbdev/pvr2fb.c
@@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
 	if (!pages)
 		return -ENOMEM;
 
-	down_read(&current->mm->mmap_sem);
-	ret = get_user_pages(current, current->mm, (unsigned long)buf,
-			     nr_pages, WRITE, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
+	ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
 
 	if (ret < nr_pages) {
 		nr_pages = ret;
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 5077afc..b159769 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		size_t bytes;
 
 		/* Get the pages we're interested in */
-		down_read(&mm->mmap_sem);
-		pages = get_user_pages(task, mm, pa, pages,
-				      vm_write, 0, process_pages, NULL);
-		up_read(&mm->mmap_sem);
-
+		pages = get_user_pages_unlocked(task, mm, pa, pages,
+						vm_write, 0, process_pages);
 		if (pages <= 0)
 			return -EFAULT;
 
diff --git a/mm/util.c b/mm/util.c
index 093c973..1b93f2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
 				int nr_pages, int write, struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
-	int ret;
-
-	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, start, nr_pages,
-					write, 0, pages, NULL);
-	up_read(&mm->mmap_sem);
-
-	return ret;
+	return get_user_pages_unlocked(current, mm, start, nr_pages,
+				       write, 0, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 5550130..5504783 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 
-	down_read(&current->mm->mmap_sem);
 	while (got < num_pages) {
-		rc = get_user_pages(current, current->mm,
-		    (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
-		    num_pages - got, write_page, 0, pages + got, NULL);
+		rc = get_user_pages_fast((unsigned long)data +
+					 ((unsigned long)got * PAGE_SIZE),
+					 num_pages - got,
+					 write_page, pages + got);
 		if (rc < 0)
 			break;
 		BUG_ON(rc == 0);
 		got += rc;
 	}
-	up_read(&current->mm->mmap_sem);
 	if (rc < 0)
 		goto fail;
 	return pages;


* [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 17:07 ` [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

From: Andres Lagar-Cavilla <andreslc@google.com>

When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.

If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.

This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 virt/kvm/async_pf.c | 4 +---
 virt/kvm/kvm_main.c | 4 ++--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d6a3d09..44660ae 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
 
 	might_sleep();
 
-	down_read(&mm->mmap_sem);
-	get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
-	up_read(&mm->mmap_sem);
+	get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
 	kvm_async_page_present_sync(vcpu, apf);
 
 	spin_lock(&vcpu->async_pf.lock);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 95519bc..921bce7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 					      addr, write_fault, page);
 		up_read(&current->mm->mmap_sem);
 	} else
-		npages = get_user_pages_fast(addr, 1, write_fault,
-					     page);
+		npages = get_user_pages_unlocked(current, current->mm, addr, 1,
+						 write_fault, 0, page);
 	if (npages != 1)
 		return npages;
 


* [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (5 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-07  9:03   ` Kirill A. Shutemov
       [not found]   ` <1412356087-16115-8-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

We are running out of the 32 bits in vm_flags; this is a noop change
for 64bit archs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/proc/task_mmu.c       | 4 ++--
 include/linux/huge_mm.h  | 4 ++--
 include/linux/ksm.h      | 4 ++--
 include/linux/mm_types.h | 2 +-
 mm/huge_memory.c         | 2 +-
 mm/ksm.c                 | 2 +-
 mm/madvise.c             | 2 +-
 mm/mremap.c              | 2 +-
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c341568..ee1c3a2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	/*
 	 * Don't forget to update Documentation/ on changes.
 	 */
-	static const char mnemonics[BITS_PER_LONG][2] = {
+	static const char mnemonics[BITS_PER_LONG+1][2] = {
 		/*
 		 * In case if we meet a flag we don't know about.
 		 */
-		[0 ... (BITS_PER_LONG-1)] = "??",
+		[0 ... (BITS_PER_LONG)] = "??",
 
 		[ilog2(VM_READ)]	= "rd",
 		[ilog2(VM_WRITE)]	= "wr",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb..3aa10e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
-			    unsigned long *vm_flags, int advice);
+			    vm_flags_t *vm_flags, int advice);
 extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
@@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
 #define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
-				   unsigned long *vm_flags, int advice)
+				   vm_flags_t *vm_flags, int advice)
 {
 	BUG();
 	return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..8b35253 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,7 +18,7 @@ struct mem_cgroup;
 
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags);
+		unsigned long end, int advice, vm_flags_t *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
 
@@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
 
 #ifdef CONFIG_MMU
 static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+		unsigned long end, int advice, vm_flags_t *vm_flags)
 {
 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286..2c876d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -217,7 +217,7 @@ struct page_frag {
 #endif
 };
 
-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d9a21d06..e913a19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1942,7 +1942,7 @@ out:
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-		     unsigned long *vm_flags, int advice)
+		     vm_flags_t *vm_flags, int advice)
 {
 	switch (advice) {
 	case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index fb75902..faf319e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
 }
 
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-		unsigned long end, int advice, unsigned long *vm_flags)
+		unsigned long end, int advice, vm_flags_t *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..d5aee71 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	int error = 0;
 	pgoff_t pgoff;
-	unsigned long new_flags = vma->vm_flags;
+	vm_flags_t new_flags = vma->vm_flags;
 
 	switch (behavior) {
 	case MADV_NORMAL:
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..fa7db87 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
-	unsigned long vm_flags = vma->vm_flags;
+	vm_flags_t vm_flags = vma->vm_flags;
 	unsigned long new_pgoff;
 	unsigned long moved_len;
 	unsigned long excess = 0;


* [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (6 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 23:13   ` Mike Hommey
  2014-10-07 10:36   ` Kirill A. Shutemov
  2014-10-03 17:07 ` [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a SIGBUS signal is
sent instead of allocating a new page. The SIGBUS signal handler then
resolves the page fault in userland by calling the remap_anon_pages
syscall.
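
For illustration, a minimal sketch of that flow (not part of this
patch; includes and the setup of the mappings are omitted, the
MADV_USERFAULT value is the x86 one defined below, the
remap_anon_pages syscall number 321 comes from a later patch in this
series, and "region", "region_size" and "staging" are assumed to
describe two equally sized, page aligned anonymous mappings, with
"staging" already populated with the data to serve):

  #define MADV_USERFAULT        18
  #define SYS_remap_anon_pages  321

  static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
  {
	long psize = getpagesize();
	char *addr = (char *)((unsigned long)info->si_addr & ~(psize - 1));

	/* the data for "addr" has been received into "staging": move
	   it atomically into place, the faulting access then restarts */
	syscall(SYS_remap_anon_pages, addr, staging + (addr - region),
		psize, 0);
  }

	/* registration, done once before touching "region" */
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };
	sigaction(SIGBUS, &sa, NULL);
	madvise(region, region_size, MADV_USERFAULT);

A complete, runnable test program following this pattern is included
in the sys_remap_anon_pages patch later in the series.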

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.

Here I enlarged the vm_flags to 64bit because we have run out of bits
(a no-op on 64bit kernels). An alternative would be to find some
combination of existing flags that are mutually exclusive when set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 ++
 arch/mips/include/uapi/asm/mman.h      |  3 ++
 arch/parisc/include/uapi/asm/mman.h    |  3 ++
 arch/xtensa/include/uapi/asm/mman.h    |  3 ++
 fs/proc/task_mmu.c                     |  1 +
 include/linux/mm.h                     |  1 +
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/huge_memory.c                       | 60 +++++++++++++++++++++-------------
 mm/madvise.c                           | 17 ++++++++++
 mm/memory.c                            | 13 ++++++++
 10 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	71		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee1c3a2..6033cb8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -568,6 +568,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_HUGEPAGE)]	= "hg",
 		[ilog2(VM_NOHUGEPAGE)]	= "nh",
 		[ilog2(VM_MERGEABLE)]	= "mg",
+		[ilog2(VM_USERFAULT)]	= "uf",
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8900ba9..bf3df07 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -139,6 +139,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_USERFAULT	0x100000000ULL	/* Trigger user faults if not mapped */
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36..dbf1e70 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -52,6 +52,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_DONTDUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e913a19..b402d60 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -721,12 +721,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
-		return VM_FAULT_OOM;
+	if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK);
+		return VM_FAULT_FALLBACK;
+	}
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
 		mem_cgroup_cancel_charge(page, memcg);
+		put_page(page);
 		return VM_FAULT_OOM;
 	}
 
@@ -746,6 +750,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (vma->vm_flags & VM_USERFAULT) {
+			spin_unlock(ptl);
+			mem_cgroup_cancel_charge(page, memcg);
+			put_page(page);
+			pte_free(mm, pgtable);
+			return VM_FAULT_SIGBUS;
+		}
+
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
@@ -756,6 +770,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		atomic_long_inc(&mm->nr_ptes);
 		spin_unlock(ptl);
+		count_vm_event(THP_FAULT_ALLOC);
 	}
 
 	return 0;
@@ -776,20 +791,17 @@ static inline struct page *alloc_hugepage_vma(int defrag,
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	atomic_long_inc(&mm->nr_ptes);
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -811,6 +823,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
+		int ret;
 		pgtable = pte_alloc_one(mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
@@ -821,14 +834,24 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			return VM_FAULT_FALLBACK;
 		}
 		ptl = pmd_lock(mm, pmd);
-		set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-				zero_page);
+		ret = 0;
+		set = false;
+		if (pmd_none(*pmd)) {
+			if (vma->vm_flags & VM_USERFAULT)
+				ret = VM_FAULT_SIGBUS;
+			else {
+				set_huge_zero_page(pgtable, mm, vma,
+						   haddr, pmd,
+						   zero_page);
+				set = true;
+			}
+		}
 		spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
 		}
-		return 0;
+		return ret;
 	}
 	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 			vma, haddr, numa_node_id(), 0);
@@ -836,14 +859,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
-
-	count_vm_event(THP_FAULT_ALLOC);
-	return 0;
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -878,16 +894,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
@@ -2148,7 +2162,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+			if (!(vma->vm_flags & VM_USERFAULT) &&
+			    ++none <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2569,7 +2584,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
 		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+			if (!(vma->vm_flags & VM_USERFAULT) &&
+			    ++none <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;
diff --git a/mm/madvise.c b/mm/madvise.c
index d5aee71..24620c0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_USERFAULT:
+		if (vma->vm_ops) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_USERFAULT;
+		break;
+	case MADV_NOUSERFAULT:
+		if (vma->vm_ops) {
+			WARN_ON(new_flags & VM_USERFAULT);
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_USERFAULT;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -408,6 +423,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 #endif
+	case MADV_USERFAULT:
+	case MADV_NOUSERFAULT:
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
 		return 1;
diff --git a/mm/memory.c b/mm/memory.c
index e229970..16e4c8a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2645,6 +2645,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (vma->vm_flags & VM_USERFAULT) {
+			pte_unmap_unlock(page_table, ptl);
+			return VM_FAULT_SIGBUS;
+		}
 		goto setpte;
 	}
 
@@ -2672,6 +2677,14 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (vma->vm_flags & VM_USERFAULT) {
+		pte_unmap_unlock(page_table, ptl);
+		mem_cgroup_cancel_charge(page, memcg);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 	mem_cgroup_commit_charge(page, memcg, false);


* [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (7 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
@ 2014-10-03 17:07 ` Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:07 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Those two helpers are needed by remap_anon_pages.
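
A condensed sketch of the intended calling pattern (mirroring how the
remap_anon_pages patch later in the series uses the helpers; the locks
are always taken in a stable pointer order, so two remaps working on
the same pair of page tables in opposite directions cannot deadlock):

	spinlock_t *dst_ptl = pte_lockptr(mm, dst_pmd);
	spinlock_t *src_ptl = pte_lockptr(mm, src_pmd);

	double_pt_lock(dst_ptl, src_ptl);
	/* both ptes are stable here: re-validate them and move the entry */
	double_pt_unlock(dst_ptl, src_ptl);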

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/mm.h |  4 ++++
 mm/fremap.c        | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf3df07..71dbe03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1408,6 +1408,10 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
+/* mm/fremap.c */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..1e509f7 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -281,3 +281,32 @@ out_freed:
 
 	return err;
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+}


* [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (8 preceding siblings ...)
  2014-10-03 17:07 ` [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-03 18:31   ` Linus Torvalds
  2014-10-07 11:10   ` [Qemu-devel] " Kirill A. Shutemov
       [not found] ` <1412356087-16115-1-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

remap_anon_pages (unlike remap_file_pages) tries to be non-intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters
page->mapping and page->index, and it does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places, which do
rmap walks without taking the page lock first, must be updated to
re-check that page->mapping didn't change after they obtained the
anon_vma lock. remap_anon_pages takes the anon_vma lock for writing
before altering page->mapping, so if page->mapping is still the same
after obtaining the anon_vma lock (without the page lock), the rmap
walks can go ahead safely (and remap_anon_pages will wait for them to
complete before proceeding).
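
In condensed form, the re-check those lockless rmap walkers need
(this is the same logic the hunks below add to split_huge_page_to_list
and page_lock_anon_vma_read, shown here with the write lock as in
split_huge_page):

	again:
		mapping = ACCESS_ONCE(page->mapping);
		anon_vma = page_get_anon_vma(page);
		if (!anon_vma)
			goto out;	/* page was unmapped from under us */
		anon_vma_lock_write(anon_vma);
		if (unlikely(mapping != ACCESS_ONCE(page->mapping))) {
			/* remap_anon_pages changed the anon_vma, retry */
			anon_vma_unlock_write(anon_vma);
			put_anon_vma(anon_vma);
			goto again;
		}
		/* page->mapping is stable, the rmap walk can proceed */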

remap_anon_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing, don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_anon_pages holds the mmap_sem for reading.

Overall this is a fairly small change to the rmap code, notably less
intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped in only one
vma, but this is not a limitation when the syscall is used to handle
userland page faults with MADV_USERFAULT. The source addresses passed
to remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK,
to avoid any risk of the mapcount of the pages increasing if fork runs
in parallel in another thread before or while remap_anon_pages runs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 24 ++++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b402d60..4277ed7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1921,6 +1921,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1932,10 +1933,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_anon_pages_huge_pmd can change the anon_vma
+		 * from under us until we obtain the anon_vma
+		 * lock. Verify that we obtained the anon_vma lock
+		 * before remap_anon_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
@@ -2460,6 +2475,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 * remap_anon_pages is prevented to race as well by the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..6d875eb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_anon_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock


* [PATCH 11/17] mm: swp_entry_swapcount
       [not found] ` <1412356087-16115-1-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-10-03 17:08   ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel-qX2TKyscuCcdnm+yROfE0A, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minc

Provide a new swapfile method for remap_anon_pages to verify that a
swap entry is mapped in only one vma before relocating the swap entry
to a different virtual address. Otherwise, if the swap entry is mapped
in multiple vmas, the page could get mapped in a non-linear way in
some anon_vma when it is swapped back in.
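
The intended use, taken from the sys_remap_anon_pages patch later in
the series, is a simple exclusivity check on the swap entry before it
is moved to the destination pte:

	entry = pte_to_swp_entry(orig_src_pte);
	if (swp_entry_swapcount(entry) != 1)
		return -EBUSY;	/* shared with another mapping, refuse */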

Signed-off-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8197452..af9977c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,6 +458,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -559,6 +560,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e..4cc9af6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (p) {
+		count = swap_count(p->swap_map[swp_offset(entry)]);
+		spin_unlock(&p->lock);
+	}
+	return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content


* [PATCH 12/17] mm: sys_remap_anon_pages
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (10 preceding siblings ...)
       [not found] ` <1412356087-16115-1-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-04 13:13   ` Andi Kleen
  2014-10-03 17:08 ` [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non-shared anonymous pages because those can be
relocated without generating non-linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
set with madvise(MADV_DONTFORK).

MADV_DONTFORK set on the source vma prevents remap_anon_pages from
failing if the process forks during the userland page fault.

The thread triggering the SIGBUS signal handler by touching an
unmapped hole in the MADV_USERFAULT region should take care to receive
the data belonging to the faulting virtual address into the source
vma. The data can come from the network, storage or any other I/O
device. After the data has been safely received in the private area of
the source vma, the handler calls remap_anon_pages to map the page at
the faulting address in the destination vma atomically, and finally it
returns from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical in the source
and destination vmas.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads try to
both call remap_anon_pages on the same destination address at the same
time, the second thread will get an explicit error from this syscall.

The syscall returns "len" if successful. The syscall however can be
interrupted by fatal signals or errors. If interrupted it will return
the number of bytes successfully remapped before the interruption if
any, or the negative error if none. It will never return zero: either
it returns an error or an amount of bytes successfully moved. If the
retval reports a "short" remap, userland should repeat the
remap_anon_pages syscall with src+retval, dst+retval, len-retval if it
wants to know about the error that interrupted it.
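
A sketch of the retry loop this convention implies for userland (src,
dst and len page aligned; SYS_remap_anon_pages as wired up for x86 by
this patch):

	while (len) {
		long ret = syscall(SYS_remap_anon_pages, dst, src, len, 0);
		if (ret < 0)
			break;	/* errno reports what interrupted the remap */
		/* "ret" bytes were moved before the interruption, if any */
		dst += ret;
		src += ret;
		len -= ret;
	}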

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors from materializing if there are holes in the source virtual
range being remapped. The holes will be accounted as successfully
remapped in the retval of the syscall. This is mostly useful to remap
hugepage-aligned virtual regions without knowing whether they contain
transparent hugepages or not, while avoiding the risk of having to
split the hugepmd during the remap.

The main difference from mremap is that, when used to fill holes in
unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
remap_anon_pages won't create lots of unmergeable vmas. mremap instead
would create lots of vmas (because of non-linear vma->vm_pgoff),
leading to -ENOMEM failures (the number of vmas is limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <pthread.h>
 #include <strings.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <errno.h>
 #include <string.h>
 #include <signal.h>
 #include <sys/syscall.h>
 #include <sys/types.h>

 #define USE_USERFAULT
 #define THP

 #define MADV_USERFAULT	18

 #define SIZE (1024*1024*1024)

 #define SYS_remap_anon_pages 321

 static volatile unsigned char *c, *tmp;

 void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
 {
 	unsigned char *addr = info->si_addr;
 	int len = 4096;
 	int ret;

 	addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
 #ifdef THP
 	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
 	len = 2*1024*1024;
 #endif
 	if (addr >= c && addr < c + SIZE) {
 		unsigned long offset = addr - c;
 		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
 		if (ret != len)
 			perror("sigbus remap_anon_pages"), exit(1);
 		//printf("sigbus offset %lu\n", offset);
 		return;
 	}

 	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
 }

 int main()
 {
 	struct sigaction sa;
 	int ret;
 	unsigned long i;
 #ifndef THP
 	/*
 	 * Fails with THP due lack of alignment because of memset
 	 * pre-filling the destination
 	 */
 	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (c == MAP_FAILED)
 		perror("mmap"), exit(1);
 	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
 		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
 	if (tmp == MAP_FAILED)
 		perror("mmap"), exit(1);
 #else
 	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
 	if (ret)
 		perror("posix_memalign"), exit(1);
 #endif
 	/*
 	 * MADV_USERFAULT must run before memset, to avoid THP 2m
 	 * faults to map memory into "tmp", if "tmp" isn't allocated
 	 * with hugepage alignment.
 	 */
 	if (madvise((void *)c, SIZE, MADV_USERFAULT))
 		perror("madvise"), exit(1);
 	memset((void *)tmp, 0xaa, SIZE);

 	sa.sa_sigaction = userfault_sighandler;
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = SA_SIGINFO;
 	sigaction(SIGBUS, &sa, NULL);

 #ifndef USE_USERFAULT
 	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE, 0);
 	if (ret != SIZE)
 		perror("remap_anon_pages"), exit(1);
 #endif

 	for (i = 0; i < SIZE; i += 4096) {
 		if ((i/4096) % 2) {
 			/* exercise read and write MADV_USERFAULT */
 			c[i+1] = 0xbb;
 		}
 		if (c[i] != 0xaa)
 			printf("error %x offset %lu\n", c[i], i), exit(1);
 	}
 	printf("remap_anon_pages functions correctly\n");

 	return 0;
 }
===

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h          |   7 +
 include/linux/syscalls.h         |   4 +
 kernel/sys_ni.c                  |   1 +
 mm/fremap.c                      | 477 +++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                 | 110 +++++++++
 7 files changed, 601 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..2d0594c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	remap_anon_pages	sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..41e8f3e 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	common	remap_anon_pages	sys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3aa10e0..8a85fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,13 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     pmd_t dst_pmdval,
+				     struct vm_area_struct *dst_vma,
+				     struct vm_area_struct *src_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..3d4bb05 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -451,6 +451,10 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+				     unsigned long src_start,
+				     unsigned long len,
+				     unsigned long flags);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4dd..2bc7bef 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,6 +178,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
 cond_syscall(compat_sys_move_pages);
 cond_syscall(compat_sys_migrate_pages);
 
diff --git a/mm/fremap.c b/mm/fremap.c
index 1e509f7..9337637 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -310,3 +310,480 @@ void double_pt_unlock(spinlock_t *ptl1,
 	if (ptl1 != ptl2)
 		spin_unlock(ptl2);
 }
+
+#define RAP_ALLOW_SRC_HOLES (1UL<<0)
+
+/*
+ * The mmap_sem for reading is held by the caller. Just move the page
+ * from src_pmd to dst_pmd if possible, and return true if succeeded
+ * in moving the page.
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+				pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				spinlock_t *dst_ptl,
+				spinlock_t *src_ptl,
+				unsigned long flags)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EEXIST;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte)) {
+		if (!(flags & RAP_ALLOW_SRC_HOLES))
+			return -ENOENT;
+		else
+			/* nothing to do to remap an hole */
+			return 0;
+	}
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EBUSY;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		if (!pte_same(ptep_clear_flush(src_vma, src_addr, src_pte),
+			      orig_src_pte))
+			BUG();
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(mm, src_addr);
+	} else {
+		if (pte_file(orig_src_pte))
+			return -EFAULT;
+
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, src_pmd, src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EBUSY;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		/*
+		 * Note that we didn't run this because the pmd was
+		 * missing, the *pmd may be already established and in
+		 * turn it may also be a trans_huge_pmd.
+		 */
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) will receive the faulting page in the source vma
+ * through the network, storage or any other I/O device (MADV_DONTFORK
+ * in the source vma avoids remap_anon_pages to fail with -EBUSY if
+ * the process forks before remap_anon_pages is called), then it will
+ * call remap_anon_pages to map the page in the faulting address in
+ * the destination vma.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical from the
+ * source and destination vma.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail respectively with -ENOENT or
+ * -EEXIST. This provides a very strict behavior to avoid any chance
+ * of memory corruption going unnoticed if there are userland race
+ * conditions. Only one thread should resolve the userland page fault
+ * at any given time for any given faulting address. This means that
+ * if two threads try to both call remap_anon_pages on the same
+ * destination address at the same time, the second thread will get an
+ * explicit error from this syscall.
+ *
+ * The syscall retval will return "len" if successful. The syscall
+ * however can be interrupted by fatal signals or errors. If
+ * interrupted it will return the number of bytes successfully
+ * remapped before the interruption if any, or the negative error if
+ * none. It will never return zero. Either it will return an error or
+ * an amount of bytes successfully moved. If the retval reports a
+ * "short" remap, the remap_anon_pages syscall should be repeated by
+ * userland with src+retval, dst+retval, len-retval if it wants to know
+ * about the error that interrupted it.
+ *
+ * The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
+ * errors to materialize if there are holes in the source virtual
+ * range that is being remapped. The holes will be accounted as
+ * successfully remapped in the retval of the syscall. This is mostly
+ * useful to remap hugepage naturally aligned virtual regions without
+ * knowing if there are transparent hugepage in the regions or not,
+ * but preventing the risk of having to split the hugepmd during the
+ * remap.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE4(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len, unsigned long, flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *src_vma, *dst_vma;
+	long err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+	long moved = 0;
+
+	/*
+	 * Sanitize the syscall parameters:
+	 */
+	if (src_start & ~PAGE_MASK)
+		return err;
+	if (dst_start & ~PAGE_MASK)
+		return err;
+	if (len & ~PAGE_MASK)
+		return err;
+	if (flags & ~RAP_ALLOW_SRC_HOLES)
+		return err;
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	if (unlikely(src_start + len <= src_start))
+		return err;
+	if (unlikely(dst_start + len <= dst_start))
+		return err;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	/*
+	 * Ensure the dst_vma has a anon_vma or this page
+	 * would get a NULL anon_vma when moved in the
+	 * dst_vma.
+	 */
+	err = -ENOMEM;
+	if (unlikely(anon_vma_prepare(dst_vma)))
+		goto out;
+
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		spinlock_t *ptl;
+		pmd_t dst_pmdval;
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				src_pmd = mm_alloc_pmd(mm, src_addr);
+				if (unlikely(!src_pmd)) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+		dst_pmd = mm_alloc_pmd(mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+
+		dst_pmdval = pmd_read_atomic(dst_pmd);
+		/*
+		 * If the dst_pmd is mapped as THP don't
+		 * override it and just be strict.
+		 */
+		if (unlikely(pmd_trans_huge(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma, &ptl) == 1) {
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.  These
+			 * checks don't actually need the PT lock but
+			 * it's good to do it here to optimize this
+			 * block away at build time if
+			 * CONFIG_TRANSPARENT_HUGEPAGE is not set.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(dst_pmdval) ||
+			    src_start + len - src_addr < HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_anon_pages_huge_pmd(mm,
+								dst_pmd,
+								src_pmd,
+								dst_pmdval,
+								dst_vma,
+								src_vma,
+								dst_addr,
+								src_addr);
+				cond_resched();
+
+				if (!err) {
+					dst_addr += HPAGE_PMD_SIZE;
+					src_addr += HPAGE_PMD_SIZE;
+					moved += HPAGE_PMD_SIZE;
+				}
+
+				if ((!err || err == -EAGAIN) &&
+				    fatal_signal_pending(current))
+					err = -EINTR;
+
+				if (err && err != -EAGAIN)
+					break;
+
+				continue;
+			}
+		}
+
+		if (pmd_none(*src_pmd)) {
+			if (!(flags & RAP_ALLOW_SRC_HOLES)) {
+				err = -ENOENT;
+				break;
+			} else {
+				if (unlikely(__pte_alloc(mm, src_vma, src_pmd,
+							 src_addr))) {
+					err = -ENOMEM;
+					break;
+				}
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(dst_pmdval)) &&
+		    unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If an huge pmd materialized from under us fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(mm, dst_pmd);
+		src_ptl = pte_lockptr(mm, src_pmd);
+
+		err = remap_anon_pages_pte(mm,
+					   dst_pte, src_pte, src_pmd,
+					   dst_vma, src_vma,
+					   dst_addr, src_addr,
+					   dst_ptl, src_ptl, flags);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if (!err) {
+			dst_addr += PAGE_SIZE;
+			src_addr += PAGE_SIZE;
+			moved += PAGE_SIZE;
+		}
+
+		if ((!err || err == -EAGAIN) &&
+		    fatal_signal_pending(current))
+			err = -EINTR;
+
+		if (err && err != -EAGAIN)
+			break;
+	}
+
+out:
+	up_read(&mm->mmap_sem);
+	BUG_ON(moved < 0);
+	BUG_ON(err > 0);
+	BUG_ON(!moved && !err);
+	return moved ? moved : err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4277ed7..9c66428 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1555,6 +1555,116 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 /*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+			      pmd_t *dst_pmd, pmd_t *src_pmd,
+			      pmd_t dst_pmdval,
+			      struct vm_area_struct *dst_vma,
+			      struct vm_area_struct *src_vma,
+			      unsigned long dst_addr,
+			      unsigned long src_addr)
+{
+	pmd_t _dst_pmd, src_pmdval;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+	spinlock_t *src_ptl, *dst_ptl;
+	pgtable_t pgtable;
+
+	src_pmdval = *src_pmd;
+	src_ptl = pmd_lockptr(mm, src_pmd);
+
+	BUG_ON(!pmd_trans_huge(src_pmdval));
+	BUG_ON(pmd_trans_splitting(src_pmdval));
+	BUG_ON(!pmd_none(dst_pmdval));
+	BUG_ON(!spin_is_locked(src_ptl));
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	src_page = pmd_page(src_pmdval);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(src_ptl);
+		return -EBUSY;
+	}
+
+	get_page(src_page);
+	spin_unlock(src_ptl);
+
+	mmu_notifier_invalidate_range_start(mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	dst_ptl = pmd_lockptr(mm, dst_pmd);
+	double_pt_lock(src_ptl, dst_ptl);
+	if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+		     !pmd_same(*dst_pmd, dst_pmdval) ||
+		     page_mapcount(src_page) != 1)) {
+		double_pt_unlock(src_ptl, dst_ptl);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (!pmd_same(pmdp_clear_flush(src_vma, src_addr, src_pmd),
+		      src_pmdval))
+		BUG();
+	_dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
+	_dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
+	set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, pgtable);
+	double_pt_unlock(src_ptl, dst_ptl);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+
+/*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *


* [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (11 preceding siblings ...)
  2014-10-03 17:08 ` [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Userfaultfd needs to wake all waiters on a waitqueue (by passing 0 as
the nr parameter), instead of the current hardcoded 1 (which would
wake just the first waiter in the head list).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ++++---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6fb1ba5..f8271cb 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -144,7 +144,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -175,7 +176,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m)						\
 	__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)					\
-	__wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+	__wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)				\
 	__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)				\
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 15cab1a..d848738 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+			  void *key)
 {
-	__wake_up_common(q, mode, 1, 0, key);
+	__wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
 	else if (waitqueue_active(q))
-		__wake_up_locked_key(q, mode, key);
+		__wake_up_locked_key(q, mode, 1, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79..39b7496 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
 	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
 	ret = atomic_dec_and_test(&task->tk_count);
 	if (waitqueue_active(wq))
-		__wake_up_locked_key(wq, TASK_NORMAL, &k);
+		__wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
 	spin_unlock_irqrestore(&wq->lock, flags);
 	return ret;
 }


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (12 preceding siblings ...)
  2014-10-03 17:08 ` [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Once a userfaultfd is created, MADV_USERFAULT regions talk through
the userfaultfd protocol with the thread responsible for the memory
externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULTFD_PROTOCOL version into the userfault fd (a 64bit write). If
the kernel knows that version, it will ack it by letting userland read
64 bits back from the userfault fd, containing the same
USERFAULTFD_PROTOCOL version that userland asked for. Otherwise
userland will read the __u64 value -1ULL (aka
USERFAULTFD_UNKNOWN_PROTOCOL) and it will have to try again by writing
an older protocol version, if one is also suitable for its usage, and
read it back again until it stops reading -1ULL. After that the
userfaultfd protocol starts.

The protocol then consists of 64bit reads from the userfault fd that
provide userland with the fault addresses. After a userfault address
has been read and the fault has been resolved by userland, the
application must write back 128 bits in the form of a [ start, end ]
range (64 bits each) to tell the kernel that such a range has been
mapped. Multiple read userfaults can be resolved with a single range
write. poll() can be used to know when there are new userfaults to
read (POLLIN) and when there are threads waiting for a wakeup through
a range write (POLLOUT).
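
As a rough illustration only, here is a minimal userland sketch of the
handshake and of the fault-service loop described above (the syscall
number is the x86_64 one added by this patch and is not a stable ABI;
how the missing page actually gets mapped, e.g. with remap_anon_pages,
is left out):

	#define _GNU_SOURCE
	#include <stdint.h>
	#include <unistd.h>

	#define __NR_userfaultfd	322	/* x86_64 number from this patch */
	#define USERFAULTFD_PROTOCOL	((uint64_t)0xaa)

	static int serve_userfaults(void)
	{
		uint64_t proto = USERFAULTFD_PROTOCOL, addr, range[2];
		long psize = sysconf(_SC_PAGESIZE);
		int ufd = syscall(__NR_userfaultfd, 0);

		if (ufd < 0)
			return -1;
		/* handshake: write the requested protocol, read the ack back */
		if (write(ufd, &proto, sizeof(proto)) != sizeof(proto))
			return -1;
		if (read(ufd, &proto, sizeof(proto)) != sizeof(proto) ||
		    proto == (uint64_t)-1)
			return -1;	/* protocol unknown to the kernel */

		for (;;) {
			/* each 64bit read returns one faulting address */
			if (read(ufd, &addr, sizeof(addr)) != sizeof(addr))
				break;
			/* ... resolve the fault at addr here ... */
			range[0] = addr & ~((uint64_t)psize - 1);
			range[1] = range[0] + psize;	/* end == start + len */
			if (write(ufd, range, sizeof(range)) != sizeof(range))
				break;
		}
		close(ufd);
		return 0;
	}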

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                      |   1 +
 fs/userfaultfd.c                 | 643 +++++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h         |   1 +
 include/linux/userfaultfd.h      |  42 +++
 init/Kconfig                     |  11 +
 kernel/sys_ni.c                  |   1 +
 mm/huge_memory.c                 |  24 +-
 mm/memory.c                      |   5 +-
 10 files changed, 720 insertions(+), 10 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 2d0594c..782038c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,4 @@
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
 357	i386	remap_anon_pages	sys_remap_anon_pages
+358	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 41e8f3e..3d5601f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,7 @@
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	remap_anon_pages	sys_remap_anon_pages
+322	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 90c8852..00dfe77 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES)	+= anon_inodes.o
 obj-$(CONFIG_SIGNALFD)		+= signalfd.o
 obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
+obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
 obj-$(CONFIG_COMPAT)		+= compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..62b827e
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,643 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include <linux/hashtable.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/file.h>
+#include <linux/bug.h>
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#include <linux/userfaultfd.h>
+
+struct userfaultfd_ctx {
+	/* pseudo fd refcounting */
+	atomic_t refcount;
+	/* waitqueue head for the userfaultfd page faults */
+	wait_queue_head_t fault_wqh;
+	/* waitqueue head for the pseudo fd to wakeup poll/read */
+	wait_queue_head_t fd_wqh;
+	/* userfaultfd syscall flags */
+	unsigned int flags;
+	/* state machine */
+	unsigned int state;
+	/* released */
+	bool released;
+};
+
+struct userfaultfd_wait_queue {
+	unsigned long address;
+	wait_queue_t wq;
+	bool pending;
+	struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+	USERFAULTFD_STATE_ASK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_PROTOCOL,
+	USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+	USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information per mm that is being scanned
+ * @link: link to the mm_slots hash list
+ * @mm: the mm that this information is valid for
+ * @ctx: userfaultfd context for this mm
+ */
+struct mm_slot {
+	struct hlist_node link;
+	struct mm_struct *mm;
+	struct userfaultfd_ctx ctx;
+	struct rcu_head rcu_head;
+};
+
+#define MM_USERLANDFD_HASH_BITS 10
+static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
+
+static DEFINE_MUTEX(mm_userlandfd_mutex);
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+
+	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
+				   (unsigned long)mm)
+		if (slot->mm == mm)
+			return slot;
+
+	return NULL;
+}
+
+static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
+					 struct mm_slot *mm_slot)
+{
+	mm_slot->mm = mm;
+	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
+}
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+				     int wake_flags, void *key)
+{
+	unsigned long *range = key;
+	int ret;
+	struct userfaultfd_wait_queue *uwq;
+
+	uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+	ret = 0;
+	/* don't wake the pending ones, to avoid reads blocking */
+	if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+		goto out;
+	if (range[0] > uwq->address || range[1] <= uwq->address)
+		goto out;
+	ret = wake_up_state(wq->private, mode);
+	if (ret)
+		/* wake only once, autoremove behavior */
+		list_del_init(&wq->task_list);
+out:
+	return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns not zero.
+ */
+static int userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+	/*
+	 * If it's already released don't get it. This can race
+	 * against userfaultfd_release, if the race triggers it'll be
+	 * handled safely by the handle_userfault main loop
+	 * (userfaultfd_release will take the mmap_sem for writing to
+	 * flush out all in-flight userfaults). This check is only an
+	 * optimization.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return 0;
+	return atomic_inc_not_zero(&ctx->refcount);
+}
+
+static void userfaultfd_free(struct userfaultfd_ctx *ctx)
+{
+	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+
+	mutex_lock(&mm_userlandfd_mutex);
+	hash_del_rcu(&mm_slot->link);
+	mutex_unlock(&mm_userlandfd_mutex);
+
+	kfree_rcu(mm_slot, rcu_head);
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount))
+		userfaultfd_free(ctx);
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mm_slot *slot;
+	struct userfaultfd_ctx *ctx;
+	struct userfaultfd_wait_queue uwq;
+	int ret;
+
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	rcu_read_lock();
+	slot = get_mm_slot(mm);
+	if (!slot) {
+		rcu_read_unlock();
+		return VM_FAULT_SIGBUS;
+	}
+	ctx = &slot->ctx;
+	if (!userfaultfd_ctx_get(ctx)) {
+		rcu_read_unlock();
+		return VM_FAULT_SIGBUS;
+	}
+	rcu_read_unlock();
+
+	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+	uwq.wq.private = current;
+	uwq.address = address;
+	uwq.pending = true;
+	uwq.ctx = ctx;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	/*
+	 * After the __add_wait_queue the uwq is visible to userland
+	 * through poll/read().
+	 */
+	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (fatal_signal_pending(current)) {
+			/*
+			 * If we have to fail because the task is
+			 * killed just retry the fault either by
+			 * returning to userland or through
+			 * VM_FAULT_RETRY if we come from a page fault
+			 * and a fatal signal is pending.
+			 */
+			ret = 0;
+			if (flags & FAULT_FLAG_KILLABLE) {
+				/*
+				 * If FAULT_FLAG_KILLABLE is set and
+				 * there's a fatal signal pending, we
+				 * can return VM_FAULT_RETRY
+				 * regardless of whether
+				 * FAULT_FLAG_ALLOW_RETRY is set, as
+				 * long as we release the mmap_sem.
+				 * The page fault will then return
+				 * straight to userland to handle the
+				 * fatal signal.
+				 */
+				up_read(&mm->mmap_sem);
+				ret = VM_FAULT_RETRY;
+			}
+			break;
+		}
+		if (!uwq.pending || ACCESS_ONCE(ctx->released)) {
+			ret = 0;
+			if (flags & FAULT_FLAG_ALLOW_RETRY) {
+				ret = VM_FAULT_RETRY;
+				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+					up_read(&mm->mmap_sem);
+			}
+			break;
+		}
+		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+		     flags) ==
+		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+			ret = VM_FAULT_RETRY;
+			/*
+			 * The mmap_sem must not be released if
+			 * FAULT_FLAG_RETRY_NOWAIT is set despite we
+			 * return VM_FAULT_RETRY (FOLL_NOWAIT case).
+			 */
+			break;
+		}
+		spin_unlock(&ctx->fault_wqh.lock);
+		up_read(&mm->mmap_sem);
+
+		wake_up_poll(&ctx->fd_wqh, POLLIN);
+		schedule();
+
+		down_read(&mm->mmap_sem);
+		spin_lock(&ctx->fault_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * ctx may go away after this if the userfault pseudo fd is
+	 * released by another CPU.
+	 */
+	userfaultfd_ctx_put(ctx);
+
+	return ret;
+}
+
+static int userfaultfd_release(struct inode *inode, struct file *file)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+	__u64 range[2] = { 0ULL, -1ULL };
+
+	ACCESS_ONCE(ctx->released) = true;
+
+	/*
+	 * Flush page faults out of all CPUs to avoid race conditions
+	 * against ctx->released. All page faults must be retried
+	 * without returning VM_FAULT_SIGBUS if the get_mm_slot and
+	 * userfaultfd_ctx_get both succeeds but ctx->released is set.
+	 */
+	down_write(&mm_slot->mm->mmap_sem);
+	up_write(&mm_slot->mm->mmap_sem);
+
+	spin_lock(&ctx->fault_wqh.lock);
+	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	wake_up_poll(&ctx->fd_wqh, POLLHUP);
+	userfaultfd_ctx_put(ctx);
+	return 0;
+}
+
+static inline unsigned long find_userfault(struct userfaultfd_ctx *ctx,
+					   struct userfaultfd_wait_queue **uwq,
+					   unsigned int events_filter)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *_uwq;
+	unsigned int events = 0;
+
+	BUG_ON(!events_filter);
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		_uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (_uwq->pending) {
+			if (!(events & POLLIN) && (events_filter & POLLIN)) {
+				events |= POLLIN;
+				if (uwq)
+					*uwq = _uwq;
+			}
+		} else if (events_filter & POLLOUT)
+			events |= POLLOUT;
+		if (events == events_filter)
+			break;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return events;
+}
+
+static unsigned int userfaultfd_poll(struct file *file, poll_table *wait)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+
+	poll_wait(file, &ctx->fd_wqh, wait);
+
+	switch (ctx->state) {
+	case USERFAULTFD_STATE_ASK_PROTOCOL:
+		return POLLOUT;
+	case USERFAULTFD_STATE_ACK_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL:
+		return POLLIN;
+	case USERFAULTFD_STATE_RUNNING:
+		return find_userfault(ctx, NULL, POLLIN|POLLOUT);
+	default:
+		BUG();
+	}
+}
+
+static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
+				    __u64 *addr)
+{
+	ssize_t ret;
+	DECLARE_WAITQUEUE(wait, current);
+	struct userfaultfd_wait_queue *uwq = NULL;
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		return -EINVAL;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL) {
+		*addr = USERFAULTFD_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_RUNNING;
+		return 0;
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL) {
+		*addr = USERFAULTFD_UNKNOWN_PROTOCOL;
+		ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+		return 0;
+	}
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, &uwq, POLLIN)) {
+			uwq->pending = false;
+			*addr = uwq->address;
+			ret = 0;
+			break;
+		}
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+		if (no_wait) {
+			ret = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	if (ret == 0) {
+		if (waitqueue_active(&ctx->fd_wqh))
+			wake_up_locked_poll(&ctx->fd_wqh, POLLOUT);
+	}
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t ret;
+	/* careful to always initialize addr if ret == 0 */
+	__u64 uninitialized_var(addr);
+
+	if (count < sizeof(addr))
+		return -EINVAL;
+	ret = userfaultfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &addr);
+	if (ret < 0)
+		return ret;
+
+	return put_user(addr, (__u64 __user *) buf) ? -EFAULT : sizeof(addr);
+}
+
+static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
+{
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	int ret = -ENOENT;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			continue;
+		if (uwq->address >= range[0] &&
+		    uwq->address < range[1]) {
+			ret = 0;
+			/* wake all in the range and autoremove */
+			__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0,
+					     range);
+			break;
+		}
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct userfaultfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 range[2];
+	DECLARE_WAITQUEUE(wait, current);
+
+	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
+		__u64 protocol;
+		if (count < sizeof(__u64))
+			return -EINVAL;
+		if (copy_from_user(&protocol, buf, sizeof(protocol)))
+			return -EFAULT;
+		if (protocol != USERFAULTFD_PROTOCOL) {
+			/* we'll offer the supported protocol in the ack */
+			printk_once(KERN_INFO
+				    "userfaultfd protocol not available\n");
+			ctx->state = USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL;
+		} else
+			ctx->state = USERFAULTFD_STATE_ACK_PROTOCOL;
+		return sizeof(protocol);
+	} else if (ctx->state == USERFAULTFD_STATE_ACK_PROTOCOL)
+		return -EINVAL;
+
+	BUG_ON(ctx->state != USERFAULTFD_STATE_RUNNING);
+
+	if (count < sizeof(range))
+		return -EINVAL;
+	if (copy_from_user(&range, buf, sizeof(range)))
+		return -EFAULT;
+	if (range[0] >= range[1])
+		return -ERANGE;
+
+	spin_lock(&ctx->fd_wqh.lock);
+	__add_wait_queue(&ctx->fd_wqh, &wait);
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		/* always take the fd_wqh lock before the fault_wqh lock */
+		if (find_userfault(ctx, NULL, POLLOUT)) {
+			if (!wake_userfault(ctx, range)) {
+				res = sizeof(range);
+				break;
+			}
+		}
+		if (signal_pending(current)) {
+			res = -ERESTARTSYS;
+			break;
+		}
+		if (file->f_flags & O_NONBLOCK) {
+			res = -EAGAIN;
+			break;
+		}
+		spin_unlock(&ctx->fd_wqh.lock);
+		schedule();
+		spin_lock(&ctx->fd_wqh.lock);
+	}
+	__remove_wait_queue(&ctx->fd_wqh, &wait);
+	__set_current_state(TASK_RUNNING);
+	spin_unlock(&ctx->fd_wqh.lock);
+
+	return res;
+}
+
+#ifdef CONFIG_PROC_FS
+static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct userfaultfd_ctx *ctx = f->private_data;
+	int ret;
+	wait_queue_t *wq;
+	struct userfaultfd_wait_queue *uwq;
+	unsigned long pending = 0, total = 0;
+
+	spin_lock(&ctx->fault_wqh.lock);
+	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
+		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+		if (uwq->pending)
+			pending++;
+		total++;
+	}
+	spin_unlock(&ctx->fault_wqh.lock);
+
+	/*
+	 * If more protocols are added, they will all be shown
+	 * separated by a space. Like this:
+	 *	protocols: 0xaa 0xbb
+	 */
+	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocols:\t%Lx\n",
+			 pending, total, USERFAULTFD_PROTOCOL);
+
+	return ret;
+}
+#endif
+
+static const struct file_operations userfaultfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= userfaultfd_show_fdinfo,
+#endif
+	.release	= userfaultfd_release,
+	.poll		= userfaultfd_poll,
+	.read		= userfaultfd_read,
+	.write		= userfaultfd_write,
+	.llseek		= noop_llseek,
+};
+
+/**
+ * userfaultfd_file_create - Creates a userfaultfd file pointer.
+ * @flags: Flags for the userfaultfd file.
+ *
+ * This function creates a userfaultfd file pointer, without installing
+ * it into the fd table. This is useful when the userfaultfd file is
+ * used during the initialization of data structures that require
+ * extra setup after the userfaultfd creation. So the userfaultfd
+ * creation is split into the file pointer creation phase, and the
+ * file descriptor installation phase.  In this way races with
+ * userspace closing the newly installed file descriptor can be
+ * avoided.  Returns a userfaultfd file pointer, or a proper error
+ * pointer.
+ */
+static struct file *userfaultfd_file_create(int flags)
+{
+	struct file *file;
+	struct mm_slot *mm_slot;
+
+	/* Check the UFFD_* constants for consistency.  */
+	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(UFFD_NONBLOCK != O_NONBLOCK);
+
+	file = ERR_PTR(-EINVAL);
+	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
+		goto out;
+
+	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	file = ERR_PTR(-ENOMEM);
+	if (!mm_slot)
+		goto out;
+
+	mutex_lock(&mm_userlandfd_mutex);
+	file = ERR_PTR(-EBUSY);
+	if (get_mm_slot(current->mm))
+		goto out_free_unlock;
+
+	atomic_set(&mm_slot->ctx.refcount, 1);
+	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
+	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
+	mm_slot->ctx.flags = flags;
+	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	mm_slot->ctx.released = false;
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
+				  &mm_slot->ctx,
+				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+	if (IS_ERR(file))
+	out_free_unlock:
+		kfree(mm_slot);
+	else
+		insert_to_mm_userlandfd_hash(current->mm,
+					     mm_slot);
+	mutex_unlock(&mm_userlandfd_mutex);
+out:
+	return file;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	int fd, error;
+	struct file *file;
+
+	error = get_unused_fd_flags(flags & UFFD_SHARED_FCNTL_FLAGS);
+	if (error < 0)
+		return error;
+	fd = error;
+
+	file = userfaultfd_file_create(flags);
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto err_put_unused_fd;
+	}
+	fd_install(fd, file);
+
+	return fd;
+
+err_put_unused_fd:
+	put_unused_fd(fd);
+
+	return error;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3d4bb05..c5cd88d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -811,6 +811,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
new file mode 100644
index 0000000..b7caef5
--- /dev/null
+++ b/include/linux/userfaultfd.h
@@ -0,0 +1,42 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#include <linux/fcntl.h>
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (UFFD_SHARED_FCNTL_FLAGS)
+
+#ifdef CONFIG_USERFAULTFD
+
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags);
+
+#else /* CONFIG_USERFAULTFD */
+
+static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+			    unsigned int flags)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+#endif
+
+#endif /* _LINUX_USERFAULTFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..d57127e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1518,6 +1518,17 @@ config EVENTFD
 
 	  If unsure, say Y.
 
+config USERFAULTFD
+	bool "Enable userfaultfd() system call"
+	select ANON_INODES
+	default y
+	depends on MMU
+	help
+	  Enable the userfaultfd() system call that allows page faults
+	  to be trapped and handled in userland.
+
+	  If unsure, say Y.
+
 config SHMEM
 	bool "Use full shmem filesystem" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2bc7bef..fe6ab0c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -200,6 +200,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c66428..10e6408 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -713,7 +714,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
-					struct page *page)
+					struct page *page, unsigned int flags)
 {
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
@@ -753,11 +754,15 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 		/* Deliver the page fault to userland */
 		if (vma->vm_flags & VM_USERFAULT) {
+			int ret;
+
 			spin_unlock(ptl);
 			mem_cgroup_cancel_charge(page, memcg);
 			put_page(page);
 			pte_free(mm, pgtable);
-			return VM_FAULT_SIGBUS;
+			ret = handle_userfault(vma, haddr, flags);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
 		}
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
@@ -837,16 +842,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret = 0;
 		set = false;
 		if (pmd_none(*pmd)) {
-			if (vma->vm_flags & VM_USERFAULT)
-				ret = VM_FAULT_SIGBUS;
-			else {
+			if (vma->vm_flags & VM_USERFAULT) {
+				spin_unlock(ptl);
+				ret = handle_userfault(vma, haddr, flags);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
 				set_huge_zero_page(pgtable, mm, vma,
 						   haddr, pmd,
 						   zero_page);
+				spin_unlock(ptl);
 				set = true;
 			}
-		}
-		spin_unlock(ptl);
+		} else
+			spin_unlock(ptl);
 		if (!set) {
 			pte_free(mm, pgtable);
 			put_huge_zero_page();
@@ -859,7 +867,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, flags);
 }
 
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index 16e4c8a..e80772b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -61,6 +61,7 @@
 #include <linux/string.h>
 #include <linux/dma-debug.h>
 #include <linux/debugfs.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -2648,7 +2649,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (vma->vm_flags & VM_USERFAULT) {
 			pte_unmap_unlock(page_table, ptl);
-			return VM_FAULT_SIGBUS;
+			return handle_userfault(vma, address, flags);
 		}
 		goto setpte;
 	}
@@ -2682,7 +2683,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(page_table, ptl);
 		mem_cgroup_cancel_charge(page, memcg);
 		page_cache_release(page);
-		return VM_FAULT_SIGBUS;
+		return handle_userfault(vma, address, flags);
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (13 preceding siblings ...)
  2014-10-03 17:08 ` [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

It is generally inefficient to request the wakeup of userfault ranges
for which not a single userfault address has been read through
userfaultfd_read earlier and is in turn waiting for a wakeup. However
it may come in handy to wake up the same userfault range twice in case
multiple threads fault on the same address. But we should still return
an error so that, if the application thinks this occurrence can never
happen, it will know it hit a bug. So just return -ENOENT instead of
blocking.
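
In other words, after this change the manager thread can treat the
range write as a plain non-blocking command; a rough sketch of the
caller side, reusing the ufd and range variables from the previous
patch's example and assuming <errno.h> is included:

	ssize_t res = write(ufd, range, sizeof(range));
	if (res == sizeof(range))
		;	/* at least one blocked fault in the range was woken */
	else if (res < 0 && errno == ENOENT)
		;	/* nobody was waiting in [range[0], range[1]) */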

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c | 34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 62b827e..2667d0d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -458,9 +458,7 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 				 size_t count, loff_t *ppos)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
-	ssize_t res;
 	__u64 range[2];
-	DECLARE_WAITQUEUE(wait, current);
 
 	if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
 		__u64 protocol;
@@ -488,34 +486,12 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 	if (range[0] >= range[1])
 		return -ERANGE;
 
-	spin_lock(&ctx->fd_wqh.lock);
-	__add_wait_queue(&ctx->fd_wqh, &wait);
-	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		/* always take the fd_wqh lock before the fault_wqh lock */
-		if (find_userfault(ctx, NULL, POLLOUT)) {
-			if (!wake_userfault(ctx, range)) {
-				res = sizeof(range);
-				break;
-			}
-		}
-		if (signal_pending(current)) {
-			res = -ERESTARTSYS;
-			break;
-		}
-		if (file->f_flags & O_NONBLOCK) {
-			res = -EAGAIN;
-			break;
-		}
-		spin_unlock(&ctx->fd_wqh.lock);
-		schedule();
-		spin_lock(&ctx->fd_wqh.lock);
-	}
-	__remove_wait_queue(&ctx->fd_wqh, &wait);
-	__set_current_state(TASK_RUNNING);
-	spin_unlock(&ctx->fd_wqh.lock);
+	/* always take the fd_wqh lock before the fault_wqh lock */
+	if (find_userfault(ctx, NULL, POLLOUT))
+		if (!wake_userfault(ctx, range))
+			return sizeof(range);
 
-	return res;
+	return -ENOENT;
 }
 
 #ifdef CONFIG_PROC_FS


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (14 preceding siblings ...)
  2014-10-03 17:08 ` [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  2014-10-03 17:08 ` [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

Add the syscall numbers.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/include/asm/systbl.h      | 2 ++
 arch/powerpc/include/asm/unistd.h      | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 2 ++
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 7d8a600..ef03a80 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -365,3 +365,5 @@ SYSCALL_SPU(renameat2)
 SYSCALL_SPU(seccomp)
 SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
+SYSCALL_SPU(remap_anon_pages)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 4e9af3f..36b79c3 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls		361
+#define __NR_syscalls		363
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 0688fc0..5514c57 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -383,5 +383,7 @@
 #define __NR_seccomp		358
 #define __NR_getrandom		359
 #define __NR_memfd_create	360
+#define __NR_remap_anon_pages	361
+#define __NR_userfaultfd	362
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER
  2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
                   ` (15 preceding siblings ...)
  2014-10-03 17:08 ` [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
@ 2014-10-03 17:08 ` Andrea Arcangeli
  16 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-03 17:08 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-kernel, linux-mm, linux-api
  Cc: Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek,
	Jan Kara <jac>

This adds two commands to the userfaultfd protocol.

To register memory regions into userfaultfd you can write 16 bytes as:

	 [ start|0x1, end ]

to unregister write:

	 [ start|0x2, end ]

End is "start+len" (not start+len-1). Same as vma->vm_end.

This also enforces the constraint that start and end must both be page
aligned (so the last two bits become available to implement the
USERFAULTFD_RANGE_REGISTER|UNREGISTER commands).
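
For illustration, registering and later unregistering a page-aligned
range through the write() interface would look roughly like the sketch
below (ufd, start and len are assumed to come from the caller; the
0x1/0x2 command bits are the ones introduced by this patch):

	uint64_t cmd[2];

	/* register [start, start+len) with this userfaultfd */
	cmd[0] = (uint64_t)start | 0x1;	/* USERFAULTFD_RANGE_REGISTER */
	cmd[1] = (uint64_t)start + len;	/* end, page aligned */
	if (write(ufd, cmd, sizeof(cmd)) != sizeof(cmd))
		return -1;	/* e.g. -EBUSY: owned by another userfaultfd */

	/* unregister the same range */
	cmd[0] = (uint64_t)start | 0x2;	/* USERFAULTFD_RANGE_UNREGISTER */
	if (write(ufd, cmd, sizeof(cmd)) != sizeof(cmd))
		return -1;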

This way there can be multiple userfaultfds for each process and each
one can register its own virtual memory ranges.

If a userfaultfd tries to register a virtual memory range that is
already registered with a different userfaultfd, -EBUSY will be
returned by the write() syscall.

A userfaultfd can register allocated ranges that don't have
MADV_USERFAULT set, but as long as MADV_USERFAULT is not set, no
userfault will fire on them.

Only if MADV_USERFAULT is set on the virtual memory range and a
userfaultfd has registered the same range will the userfaultfd
protocol engage.

If only MADV_USERFAULT is set and there's no userfaultfd registered on
a memory range, only a SIGBUS will be raised and the page fault will
not engage the userfaultfd protocol.
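
Putting the two conditions together, the manager typically does
something like the sketch below on the area it wants to externalize
(area and len are placeholders; MADV_USERFAULT is added by an earlier
patch in this series):

	/* arm the range so missing-page faults are delivered as userfaults */
	if (madvise(area, len, MADV_USERFAULT))
		return -1;
	/* then register [area, area + len) through write() as shown above */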

This also makes handle_userfault() safe against race conditions with
regard to the mmap_sem by requiring FAULT_FLAG_ALLOW_RETRY to be set
the first time a fault is raised by any thread. In turn, to work
reliably, userfaultfd depends on the gup_locked|unlocked patchset
being applied.

If get_user_pages() is run on virtual memory ranges registered into
the userfaultfd, handle_userfault() will return VM_FAULT_SIGBUS and
gup() will return -EFAULT, because get_user_pages() doesn't allow
handle_userfault() to release the mmap_sem and in turn we cannot
safely engage the userfaultfd protocol. So the remaining
get_user_pages() calls must be restricted to memory ranges that we
know are not tracked through the userfaultfd protocol for the
userfaultfd to be reliable.

The only exception of a get_user_pages() that can safely run into a
userfaultfd and trigger a -EFAULT is ptrace. ptrace would otherwise
hang, so it's actually ok if it gets a -EFAULT instead of hanging. But
it would also be ok to phase out get_user_pages() completely and have
ptrace hang on the userfault (the hang can be resolved by sending
SIGKILL to gdb or whatever process is calling ptrace). We could also
decide to retain the current -EFAULT behavior of ptrace by using
get_user_pages_locked with a NULL locked parameter, so that the
FAULT_FLAG_ALLOW_RETRY flag will not be set. Either way would be
safe.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/userfaultfd.c            | 411 +++++++++++++++++++++++++++-----------------
 include/linux/mm.h          |   2 +-
 include/linux/mm_types.h    |  11 ++
 include/linux/userfaultfd.h |  19 +-
 mm/madvise.c                |   3 +-
 mm/mempolicy.c              |   4 +-
 mm/mlock.c                  |   3 +-
 mm/mmap.c                   |  39 +++--
 mm/mprotect.c               |   3 +-
 9 files changed, 320 insertions(+), 175 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2667d0d..49bbd3b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -23,6 +23,7 @@
 #include <linux/anon_inodes.h>
 #include <linux/syscalls.h>
 #include <linux/userfaultfd.h>
+#include <linux/mempolicy.h>
 
 struct userfaultfd_ctx {
 	/* pseudo fd refcounting */
@@ -37,6 +38,8 @@ struct userfaultfd_ctx {
 	unsigned int state;
 	/* released */
 	bool released;
+	/* mm with one or more vmas attached to this userfaultfd_ctx */
+	struct mm_struct *mm;
 };
 
 struct userfaultfd_wait_queue {
@@ -49,6 +52,10 @@ struct userfaultfd_wait_queue {
 #define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
 #define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
 
+#define USERFAULTFD_RANGE_REGISTER ((__u64) 0x1)
+#define USERFAULTFD_RANGE_UNREGISTER ((__u64) 0x2)
+#define USERFAULTFD_RANGE_MASK (~((__u64) 0x3))
+
 enum {
 	USERFAULTFD_STATE_ASK_PROTOCOL,
 	USERFAULTFD_STATE_ACK_PROTOCOL,
@@ -56,43 +63,6 @@ enum {
 	USERFAULTFD_STATE_RUNNING,
 };
 
-/**
- * struct mm_slot - userlandfd information per mm that is being scanned
- * @link: link to the mm_slots hash list
- * @mm: the mm that this information is valid for
- * @ctx: userfaultfd context for this mm
- */
-struct mm_slot {
-	struct hlist_node link;
-	struct mm_struct *mm;
-	struct userfaultfd_ctx ctx;
-	struct rcu_head rcu_head;
-};
-
-#define MM_USERLANDFD_HASH_BITS 10
-static DEFINE_HASHTABLE(mm_userlandfd_hash, MM_USERLANDFD_HASH_BITS);
-
-static DEFINE_MUTEX(mm_userlandfd_mutex);
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
-	struct mm_slot *slot;
-
-	hash_for_each_possible_rcu(mm_userlandfd_hash, slot, link,
-				   (unsigned long)mm)
-		if (slot->mm == mm)
-			return slot;
-
-	return NULL;
-}
-
-static void insert_to_mm_userlandfd_hash(struct mm_struct *mm,
-					 struct mm_slot *mm_slot)
-{
-	mm_slot->mm = mm;
-	hash_add_rcu(mm_userlandfd_hash, &mm_slot->link, (unsigned long)mm);
-}
-
 static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
 				     int wake_flags, void *key)
 {
@@ -122,30 +92,10 @@ out:
  *
  * Returns: In case of success, returns not zero.
  */
-static int userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
 {
-	/*
-	 * If it's already released don't get it. This can race
-	 * against userfaultfd_release, if the race triggers it'll be
-	 * handled safely by the handle_userfault main loop
-	 * (userfaultfd_release will take the mmap_sem for writing to
-	 * flush out all in-flight userfaults). This check is only an
-	 * optimization.
-	 */
-	if (unlikely(ACCESS_ONCE(ctx->released)))
-		return 0;
-	return atomic_inc_not_zero(&ctx->refcount);
-}
-
-static void userfaultfd_free(struct userfaultfd_ctx *ctx)
-{
-	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
-
-	mutex_lock(&mm_userlandfd_mutex);
-	hash_del_rcu(&mm_slot->link);
-	mutex_unlock(&mm_userlandfd_mutex);
-
-	kfree_rcu(mm_slot, rcu_head);
+	if (!atomic_inc_not_zero(&ctx->refcount))
+		BUG();
 }
 
 /**
@@ -158,8 +108,10 @@ static void userfaultfd_free(struct userfaultfd_ctx *ctx)
  */
 static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 {
-	if (atomic_dec_and_test(&ctx->refcount))
-		userfaultfd_free(ctx);
+	if (atomic_dec_and_test(&ctx->refcount)) {
+		mmdrop(ctx->mm);
+		kfree(ctx);
+	}
 }
 
 /*
@@ -181,25 +133,55 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
-	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
-	rcu_read_lock();
-	slot = get_mm_slot(mm);
-	if (!slot) {
-		rcu_read_unlock();
+	ctx = vma->vm_userfaultfd_ctx.ctx;
+	if (!ctx)
 		return VM_FAULT_SIGBUS;
-	}
-	ctx = &slot->ctx;
-	if (!userfaultfd_ctx_get(ctx)) {
-		rcu_read_unlock();
+
+	BUG_ON(ctx->mm != mm);
+
+	/*
+	 * If it's already released don't get it. This avoids looping
+	 * in __get_user_pages if userfaultfd_release waits on the
+	 * caller of handle_userfault to release the mmap_sem.
+	 */
+	if (unlikely(ACCESS_ONCE(ctx->released)))
+		return VM_FAULT_SIGBUS;
+
+	/* check that we can return VM_FAULT_RETRY */
+	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+		/*
+		 * Validate the invariant that nowait must allow retry
+		 * to be sure not to return SIGBUS erroneously on
+		 * nowait invocations.
+		 */
+		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+#ifdef CONFIG_DEBUG_VM
+		if (printk_ratelimit()) {
+			printk(KERN_WARNING
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			dump_stack();
+		}
+#endif
 		return VM_FAULT_SIGBUS;
 	}
-	rcu_read_unlock();
+
+	/*
+	 * Handle nowait, not much to do other than tell it to retry
+	 * and wait.
+	 */
+	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+		return VM_FAULT_RETRY;
+
+	/* take the reference before dropping the mmap_sem */
+	userfaultfd_ctx_get(ctx);
+
+	/* be gentle and immediately relinquish the mmap_sem */
+	up_read(&mm->mmap_sem);
 
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
@@ -214,60 +196,15 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 */
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current)) {
-			/*
-			 * If we have to fail because the task is
-			 * killed just retry the fault either by
-			 * returning to userland or through
-			 * VM_FAULT_RETRY if we come from a page fault
-			 * and a fatal signal is pending.
-			 */
-			ret = 0;
-			if (flags & FAULT_FLAG_KILLABLE) {
-				/*
-				 * If FAULT_FLAG_KILLABLE is set and
-				 * there's a fatal signal pending, we
-				 * can return VM_FAULT_RETRY
-				 * regardless of whether
-				 * FAULT_FLAG_ALLOW_RETRY is set, as
-				 * long as we release the mmap_sem.
-				 * The page fault will then return
-				 * straight to userland to handle the
-				 * fatal signal.
-				 */
-				up_read(&mm->mmap_sem);
-				ret = VM_FAULT_RETRY;
-			}
-			break;
-		}
-		if (!uwq.pending || ACCESS_ONCE(ctx->released)) {
-			ret = 0;
-			if (flags & FAULT_FLAG_ALLOW_RETRY) {
-				ret = VM_FAULT_RETRY;
-				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
-					up_read(&mm->mmap_sem);
-			}
-			break;
-		}
-		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
-		     flags) ==
-		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
-			ret = VM_FAULT_RETRY;
-			/*
-			 * The mmap_sem must not be released if
-			 * FAULT_FLAG_RETRY_NOWAIT is set despite we
-			 * return VM_FAULT_RETRY (FOLL_NOWAIT case).
-			 */
+		set_current_state(TASK_KILLABLE);
+		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
+		    fatal_signal_pending(current))
 			break;
-		}
 		spin_unlock(&ctx->fault_wqh.lock);
-		up_read(&mm->mmap_sem);
 
 		wake_up_poll(&ctx->fd_wqh, POLLIN);
 		schedule();
 
-		down_read(&mm->mmap_sem);
 		spin_lock(&ctx->fault_wqh.lock);
 	}
 	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
@@ -276,30 +213,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	/*
 	 * ctx may go away after this if the userfault pseudo fd is
-	 * released by another CPU.
+	 * already released.
 	 */
 	userfaultfd_ctx_put(ctx);
 
-	return ret;
+	return VM_FAULT_RETRY;
 }
 
 static int userfaultfd_release(struct inode *inode, struct file *file)
 {
 	struct userfaultfd_ctx *ctx = file->private_data;
-	struct mm_slot *mm_slot = container_of(ctx, struct mm_slot, ctx);
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
 	__u64 range[2] = { 0ULL, -1ULL };
 
 	ACCESS_ONCE(ctx->released) = true;
 
 	/*
-	 * Flush page faults out of all CPUs to avoid race conditions
-	 * against ctx->released. All page faults must be retried
-	 * without returning VM_FAULT_SIGBUS if the get_mm_slot and
-	 * userfaultfd_ctx_get both succeeds but ctx->released is set.
+	 * Flush page faults out of all CPUs. NOTE: all page faults
+	 * must be retried without returning VM_FAULT_SIGBUS if
+	 * userfaultfd_ctx_get() succeeds but vma->vm_userfaultfd_ctx
+	 * changes while handle_userfault released the mmap_sem. So
+	 * it's critical that released is set to true (above), before
+	 * taking the mmap_sem for writing.
 	 */
-	down_write(&mm_slot->mm->mmap_sem);
-	up_write(&mm_slot->mm->mmap_sem);
+	down_write(&mm->mmap_sem);
+	prev = NULL;
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_userfaultfd_ctx.ctx != ctx)
+			continue;
+		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
+				 vma->vm_flags, vma->anon_vma,
+				 vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_USERFAULTFD_CTX);
+		if (prev)
+			vma = prev;
+		else
+			prev = vma;
+		vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+	}
+	up_write(&mm->mmap_sem);
 
+	/*
+	 * After no new page faults can wait on this fault_wqh, flush
+	 * the last page faults that may have been already waiting on
+	 * the fault_wqh.
+	 */
 	spin_lock(&ctx->fault_wqh.lock);
 	__wake_up_locked_key(&ctx->fault_wqh, TASK_NORMAL, 0, range);
 	spin_unlock(&ctx->fault_wqh.lock);
@@ -454,6 +414,140 @@ static int wake_userfault(struct userfaultfd_ctx *ctx, __u64 *range)
 	return ret;
 }
 
+static ssize_t userfaultfd_range_register(struct userfaultfd_ctx *ctx,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (!vma)
+		return -ENOMEM;
+	if (vma->vm_start >= end)
+		return -EINVAL;
+
+	prev = vma->vm_prev;
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	/* we got an overlap so start the splitting */
+	do {
+		if (vma->vm_userfaultfd_ctx.ctx == ctx)
+			goto next;
+		if (vma->vm_userfaultfd_ctx.ctx) {
+			ret = -EBUSY;
+			break;
+		}
+		prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 ((struct vm_userfaultfd_ctx){ ctx }));
+		if (prev) {
+			vma = prev;
+			vma->vm_userfaultfd_ctx.ctx = ctx;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret < 0)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret < 0)
+				break;
+		}
+		vma->vm_userfaultfd_ctx.ctx = ctx;
+	next:
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+	up_write(&mm->mmap_sem);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_range_unregister(struct userfaultfd_ctx *ctx,
+					    unsigned long start,
+					    unsigned long end)
+{
+	struct mm_struct *mm = ctx->mm;
+	struct vm_area_struct *vma, *prev;
+	int ret;
+
+	down_write(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (!vma)
+		return -ENOMEM;
+	if (vma->vm_start >= end)
+		return -EINVAL;
+
+	prev = vma->vm_prev;
+	if (vma->vm_start < start)
+		prev = vma;
+
+	ret = 0;
+	/* we got an overlap so start the splitting */
+	do {
+		if (!vma->vm_userfaultfd_ctx.ctx)
+			goto next;
+		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
+			ret = -EBUSY;
+			break;
+		}
+		prev = vma_merge(mm, prev, start, end, vma->vm_flags,
+				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
+				 vma_policy(vma),
+				 NULL_VM_USERFAULTFD_CTX);
+		if (prev) {
+			vma = prev;
+			vma->vm_userfaultfd_ctx = NULL_VM_USERFAULTFD_CTX;
+			goto next;
+		}
+		if (vma->vm_start < start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret < 0)
+				break;
+		}
+		if (vma->vm_end > end) {
+			ret = split_vma(mm, vma, end, 0);
+			if (ret < 0)
+				break;
+		}
+		vma->vm_userfaultfd_ctx.ctx = NULL;
+	next:
+		start = vma->vm_end;
+		vma = vma->vm_next;
+	} while (vma && vma->vm_start < end);
+	up_write(&mm->mmap_sem);
+
+	return ret;
+}
+
+static ssize_t userfaultfd_handle_range(struct userfaultfd_ctx *ctx,
+					__u64 *range)
+{
+	unsigned long start, end;
+
+	start = range[0] & USERFAULTFD_RANGE_MASK;
+	end = range[1];
+	BUG_ON(end <= start);
+	if (end > TASK_SIZE)
+		return -ENOMEM;
+
+	if (range[0] & USERFAULTFD_RANGE_REGISTER) {
+		BUG_ON(range[0] & USERFAULTFD_RANGE_UNREGISTER);
+		return userfaultfd_range_register(ctx, start, end);
+	} else {
+		BUG_ON(!(range[0] & USERFAULTFD_RANGE_UNREGISTER));
+		return userfaultfd_range_unregister(ctx, start, end);
+	}
+}
+
 static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 				 size_t count, loff_t *ppos)
 {
@@ -483,9 +577,24 @@ static ssize_t userfaultfd_write(struct file *file, const char __user *buf,
 		return -EINVAL;
 	if (copy_from_user(&range, buf, sizeof(range)))
 		return -EFAULT;
-	if (range[0] >= range[1])
+	/* the range mask requires 2 bits */
+	BUILD_BUG_ON(PAGE_SHIFT < 2);
+	if (range[0] & ~PAGE_MASK & USERFAULTFD_RANGE_MASK)
+		return -EINVAL;
+	if ((range[0] & ~USERFAULTFD_RANGE_MASK) == ~USERFAULTFD_RANGE_MASK)
+		return -EINVAL;
+	if (range[1] & ~PAGE_MASK)
+		return -EINVAL;
+	if ((range[0] & PAGE_MASK) >= (range[1] & PAGE_MASK))
 		return -ERANGE;
 
+	/* handle the register/unregister commands */
+	if (range[0] & ~USERFAULTFD_RANGE_MASK) {
+		ssize_t ret = userfaultfd_handle_range(ctx, range);
+		BUG_ON(ret > 0);
+		return ret < 0 ? ret : sizeof(range);
+	}
+
 	/* always take the fd_wqh lock before the fault_wqh lock */
 	if (find_userfault(ctx, NULL, POLLOUT))
 		if (!wake_userfault(ctx, range))
@@ -552,7 +661,9 @@ static const struct file_operations userfaultfd_fops = {
 static struct file *userfaultfd_file_create(int flags)
 {
 	struct file *file;
-	struct mm_slot *mm_slot;
+	struct userfaultfd_ctx *ctx;
+
+	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
 	BUILD_BUG_ON(UFFD_CLOEXEC != O_CLOEXEC);
@@ -562,33 +673,25 @@ static struct file *userfaultfd_file_create(int flags)
 	if (flags & ~UFFD_SHARED_FCNTL_FLAGS)
 		goto out;
 
-	mm_slot = kmalloc(sizeof(*mm_slot), GFP_KERNEL);
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
 	file = ERR_PTR(-ENOMEM);
-	if (!mm_slot)
+	if (!ctx)
 		goto out;
 
-	mutex_lock(&mm_userlandfd_mutex);
-	file = ERR_PTR(-EBUSY);
-	if (get_mm_slot(current->mm))
-		goto out_free_unlock;
-
-	atomic_set(&mm_slot->ctx.refcount, 1);
-	init_waitqueue_head(&mm_slot->ctx.fault_wqh);
-	init_waitqueue_head(&mm_slot->ctx.fd_wqh);
-	mm_slot->ctx.flags = flags;
-	mm_slot->ctx.state = USERFAULTFD_STATE_ASK_PROTOCOL;
-	mm_slot->ctx.released = false;
-
-	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops,
-				  &mm_slot->ctx,
+	atomic_set(&ctx->refcount, 1);
+	init_waitqueue_head(&ctx->fault_wqh);
+	init_waitqueue_head(&ctx->fd_wqh);
+	ctx->flags = flags;
+	ctx->state = USERFAULTFD_STATE_ASK_PROTOCOL;
+	ctx->released = false;
+	ctx->mm = current->mm;
+	/* prevent the mm struct from being freed */
+	atomic_inc(&ctx->mm->mm_count);
+
+	file = anon_inode_getfile("[userfaultfd]", &userfaultfd_fops, ctx,
 				  O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
 	if (IS_ERR(file))
-	out_free_unlock:
-		kfree(mm_slot);
-	else
-		insert_to_mm_userlandfd_hash(current->mm,
-					     mm_slot);
-	mutex_unlock(&mm_userlandfd_mutex);
+		kfree(ctx);
 out:
 	return file;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 71dbe03..cd60938 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1779,7 +1779,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *);
+	struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
 	struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c876d1..bb78fa8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -238,6 +238,16 @@ struct vm_region {
 						* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+	struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_USERFAULTFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -308,6 +318,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/include/linux/userfaultfd.h b/include/linux/userfaultfd.h
index b7caef5..25f49db 100644
--- a/include/linux/userfaultfd.h
+++ b/include/linux/userfaultfd.h
@@ -29,14 +29,27 @@
 int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 		     unsigned int flags);
 
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
 #else /* CONFIG_USERFAULTFD */
 
-static int handle_userfault(struct vm_area_struct *vma, unsigned long address,
-			    unsigned int flags)
+static inline int handle_userfault(struct vm_area_struct *vma,
+				   unsigned long address,
+				   unsigned int flags)
 {
 	return VM_FAULT_SIGBUS;
 }
 
-#endif
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+					struct vm_userfaultfd_ctx vm_ctx)
+{
+	return true;
+}
+
+#endif /* CONFIG_USERFAULTFD */
 
 #endif /* _LINUX_USERFAULTFD_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 24620c0..4bb9a68 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -117,7 +117,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-				vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8f5330d..bf54e9c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -769,8 +769,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
-				  vma->anon_vma, vma->vm_file, pgoff,
-				  new_pol);
+				 vma->anon_vma, vma->vm_file, pgoff,
+				 new_pol, vma->vm_userfaultfd_ctx);
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index ce84cb0..ccb537e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
-			  vma->vm_file, pgoff, vma_policy(vma));
+			  vma->vm_file, pgoff, vma_policy(vma),
+			  vma->vm_userfaultfd_ctx);
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index c0a3637..303f45b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
+#include <linux/userfaultfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -901,7 +902,8 @@ again:			remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-			struct file *file, unsigned long vm_flags)
+				struct file *file, unsigned long vm_flags,
+				struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -917,6 +919,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (vma->vm_ops && vma->vm_ops->close)
 		return 0;
+	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+		return 0;
 	return 1;
 }
 
@@ -947,9 +951,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		     struct anon_vma *anon_vma, struct file *file,
+		     pgoff_t vm_pgoff,
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -966,9 +972,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
-	struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+		    struct anon_vma *anon_vma, struct file *file,
+		    pgoff_t vm_pgoff,
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-	if (is_mergeable_vma(vma, file, vm_flags) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1011,7 +1019,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 		     	struct anon_vma *anon_vma, struct file *file,
-			pgoff_t pgoff, struct mempolicy *policy)
+			pgoff_t pgoff, struct mempolicy *policy,
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1038,14 +1047,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (prev && prev->vm_end == addr &&
   			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
-						anon_vma, file, pgoff)) {
+					    anon_vma, file, pgoff,
+					    vm_userfaultfd_ctx)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
 		if (next && end == next->vm_start &&
 				mpol_equal(policy, vma_policy(next)) &&
 				can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen) &&
+						     anon_vma, file,
+						     pgoff+pglen,
+						     vm_userfaultfd_ctx) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1066,7 +1078,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (next && end == next->vm_start &&
  			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
-					anon_vma, file, pgoff+pglen)) {
+					     anon_vma, file, pgoff+pglen,
+					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = vma_adjust(prev, prev->vm_start,
 				addr, prev->vm_pgoff, NULL);
@@ -1548,7 +1561,8 @@ munmap_back:
 	/*
 	 * Can we just expand an old mapping?
 	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags, NULL, file, pgoff, NULL);
+	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
+			NULL, file, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2670,7 +2684,7 @@ static unsigned long do_brk(unsigned long addr, unsigned long len)
 
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-					NULL, NULL, pgoff, NULL);
+			NULL, NULL, pgoff, NULL, NULL_VM_USERFAULTFD_CTX);
 	if (vma)
 		goto out;
 
@@ -2829,7 +2843,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			    vma->vm_userfaultfd_ctx);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557..2ee5aa7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -294,7 +294,8 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	 */
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags,
-			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma));
+			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+			   vma->vm_userfaultfd_ctx);
 	if (*pprev) {
 		vma = *pprev;
 		goto success;


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/17] mm: gup: add FOLL_TRIED
  2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
@ 2014-10-03 18:15   ` Linus Torvalds
  2014-10-03 20:55     ` Paolo Bonzini
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-10-03 18:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>

This needs more explanation than that one-liner comment. Make the
commit message explain why the new FOLL_TRIED flag exists.

      Linus

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> From: Andres Lagar-Cavilla <andreslc@google.com>
>
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious
  2014-10-03 17:07 ` [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
@ 2014-10-03 18:23   ` Linus Torvalds
  2014-10-06 14:14     ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-10-03 18:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>

On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> This teaches gup_fast and __gup_fast to re-enable irqs and
> cond_resched() if possible every BATCH_PAGES.

This is disgusting.

Many (most?) __gup_fast() users just want a single page, and the
stupid overhead of the multi-page version is already unnecessary.
This just makes things much worse.

Quite frankly, we should make a single-page version of __gup_fast(),
and convert existing users to use that. After that, the few multi-page
users could have this extra latency control stuff.

And yes, the single-page version of get_user_pages_fast() is actually
latency-critical. shared futexes hit it hard, and yes, I've seen this
in profiles.

                  Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-03 17:08 ` [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
@ 2014-10-03 18:31   ` Linus Torvalds
  2014-10-06  8:55     ` Dr. David Alan Gilbert
  2014-10-07 11:10   ` [Qemu-devel] " Kirill A. Shutemov
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2014-10-03 18:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tg>

On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Overall this looks a fairly small change to the rmap code, notably
> less intrusive than the nonlinear vmas created by remap_file_pages.

Considering that remap_file_pages() was an unmitigated disaster, and
-mm has a patch to remove it entirely, I'm not at all convinced this
is a good argument.

We thought remap_file_pages() was a good idea, and it really really
really wasn't. Almost nobody used it, why would the anonymous page
case be any different?

            Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 01/17] mm: gup: add FOLL_TRIED
  2014-10-03 18:15   ` Linus Torvalds
@ 2014-10-03 20:55     ` Paolo Bonzini
  0 siblings, 0 replies; 49+ messages in thread
From: Paolo Bonzini @ 2014-10-03 20:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love

> This needs more explanation than that one-liner comment. Make the
> commit message explain why the new FOLL_TRIED flag exists.

This patch actually is extracted from a 3.18 commit in the KVM tree,
https://git.kernel.org/cgit/virt/kvm/kvm.git/commit/?h=next&id=234b239b.

Here is how that patch uses the flag:

    /*
     * The previous call has now waited on the IO. Now we can
     * retry and complete. Pass TRIED to ensure we do not re
     * schedule async IO (see e.g. filemap_fault).
     */
    down_read(&mm->mmap_sem);
    npages = __get_user_pages(tsk, mm, addr, 1, flags | FOLL_TRIED,
                              pagep, NULL, NULL);


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
@ 2014-10-03 23:13   ` Mike Hommey
  2014-10-06 17:24     ` Andrea Arcangeli
  2014-10-07 10:36   ` Kirill A. Shutemov
  1 sibling, 1 reply; 49+ messages in thread
From: Mike Hommey @ 2014-10-03 23:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Taras Glek, Jan Kara

On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> userland touches a still unmapped virtual address, a sigbus signal is
> sent instead of allocating a new page. The sigbus signal handler will
> then resolve the page fault in userland by calling the
> remap_anon_pages syscall.

What does "unmapped virtual address" mean in this context?

Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 12/17] mm: sys_remap_anon_pages
  2014-10-03 17:08 ` [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
@ 2014-10-04 13:13   ` Andi Kleen
  2014-10-06 17:00     ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Andi Kleen @ 2014-10-04 13:13 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, linux-mm, linux-api, Linus Torvalds

Andrea Arcangeli <aarcange@redhat.com> writes:

> This new syscall will move anon pages across vmas, atomically and
> without touching the vmas.
>
> It only works on non shared anonymous pages because those can be
> relocated without generating non linear anon_vmas in the rmap code.

...

> It is an alternative to mremap.

Why a new syscall? Couldn't mremap do this transparently?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-03 18:31   ` Linus Torvalds
@ 2014-10-06  8:55     ` Dr. David Alan Gilbert
  2014-10-06 16:41       ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-06  8:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Andrew Morton, Sasha Levin, Hugh Dickins, Peter Feiner,
	\Dr. David Alan Gilbert\, Christopher Covington, Johannes Weiner,
	Android Kernel Team, Robert

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Overall this looks a fairly small change to the rmap code, notably
> > less intrusive than the nonlinear vmas created by remap_file_pages.
> 
> Considering that remap_file_pages() was an unmitigated disaster, and
> -mm has a patch to remove it entirely, I'm not at all convinced this
> is a good argument.
> 
> We thought remap_file_pages() was a good idea, and it really really
> really wasn't. Almost nobody used it, why would the anonymous page
> case be any different?

I've posted code that uses this interface to qemu-devel and it works nicely;
so chalk up at least one user.

For the postcopy case I'm using it for, we need to place a page atomically;
  some thread might try to access it, and it must either
     1) get caught by userfault etc., or
     2) succeed in its access,

and we'll have that happening somewhere between thousands and millions of times
to pages in no particular order, so we need to avoid creating millions of mappings.

Dave



> 
>             Linus
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious
  2014-10-03 18:23   ` Linus Torvalds
@ 2014-10-06 14:14     ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-06 14:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: qemu-devel, KVM list, Linux Kernel Mailing List, linux-mm,
	Linux API, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko <dm>

Hello,

On Fri, Oct 03, 2014 at 11:23:53AM -0700, Linus Torvalds wrote:
> On Fri, Oct 3, 2014 at 10:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > This teaches gup_fast and __gup_fast to re-enable irqs and
> > cond_resched() if possible every BATCH_PAGES.
> 
> This is disgusting.
> 
> Many (most?) __gup_fast() users just want a single page, and the
> stupid overhead of the multi-page version is already unnecessary.
> This just makes things much worse.
> 
> Quite frankly, we should make a single-page version of __gup_fast(),
> and convert existign users to use that. After that, the few multi-page
> users could have this extra latency control stuff.

Ok. I couldn't think of a better way to add the latency control other
than reducing nr_pages in an outer loop instead of altering the inner
calls, but this is what I got after implementing it... If somebody has
a cleaner way to implement the latency control, that's welcome and I'd
be glad to switch to it.

> And yes, the single-page version of get_user_pages_fast() is actually
> latency-critical. shared futexes hit it hard, and yes, I've seen this
> in profiles.

KVM would save a few cycles from a single-page version too. I just
thought further optimizations could be added later and this was better
than nothing.

Since I have no better idea how to implement the latency control, for
now I'll just drop this controversial patch, and I'll convert those
get_user_pages callers to gup_unlocked instead of converting them to
gup_fast, which is more than enough to obtain the mmap_sem hold-time
scalability improvement (that also solves the mmap_sem trouble for the
userfaultfd). gup_unlocked isn't as good as gup_fast but it's at least
better than the current get_user_pages().

I got into this gup_fast latency control purely because there were a
few get_user_pages callers that could have been converted to
get_user_pages_fast, as they were using "current" and "current->mm" as
the first two parameters, except for the risk of disabling irqs for
too long. So I tried to do the right thing and fix gup_fast, but I'll
leave this further optimization queued for later.

About the missing commit header for the other patch: Paolo already
replied to it. To clarify a bit further, in short I expect the
FOLL_TRIED flag to be merged through the KVM git tree, which already
contains it. I'll add a comment to the commit header to specify
that. Sorry for the confusion about that patch.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-06  8:55     ` Dr. David Alan Gilbert
@ 2014-10-06 16:41       ` Andrea Arcangeli
  2014-10-07 12:47         ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-06 16:41 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Linus Torvalds, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Andrew Morton, Sasha Levin, Hugh Dickins, Peter Feiner,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko <dmitry>

Hello,

On Mon, Oct 06, 2014 at 09:55:41AM +0100, Dr. David Alan Gilbert wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > >
> > > Overall this looks a fairly small change to the rmap code, notably
> > > less intrusive than the nonlinear vmas created by remap_file_pages.
> > 
> > Considering that remap_file_pages() was an unmitigated disaster, and
> > -mm has a patch to remove it entirely, I'm not at all convinced this
> > is a good argument.
> > 
> > We thought remap_file_pages() was a good idea, and it really really
> > really wasn't. Almost nobody used it, why would the anonymous page
> > case be any different?
> 
> I've posted code that uses this interface to qemu-devel and it works nicely;
> so chalk up at least one user.
> 
> For the postcopy case I'm using it for, we need to place a page, atomically
>   some thread might try and access it, and must either
>      1) get caught by userfault etc or
>      2) must succeed in it's access
> 
> and we'll have that happening somewhere between thousands and millions of times
> to pages in no particular order, so we need to avoid creating millions of mappings.

Yes, that's our current use case.

Of course if somebody has better ideas on how to resolve an anonymous
userfault they're welcome.

How to resolve a userfault is orthogonal to how to detect it, how to
notify userland about it, and how to be notified when the userfault
has been resolved. The latter is what the userfault and userfaultfd
do. The former is what remap_anon_pages is used for, but we could use
something else too if there are better ways. mremap would clearly work
too, but it would be less strict (it could lead to silent data
corruption if there are bugs in the userland code), it would be
slower, and it would eventually hit a -ENOMEM failure because there
would be too many vmas.

I could in theory drop remap_anon_pages from this patchset, but
without an optimal way to resolve a userfault, the rest isn't so
useful.

We're currently discussing what would be the best way to resolve a
MAP_SHARED userfault on tmpfs in fact (that's not sorted yet), but so
far it seems remap_anon_pages fits the bill for anonymous memory.

remap_anon_pages is not as problematic to maintain as remap_file_pages
for the reason explained in the commit header, but there are other
reasons: it doesn't require the special pte_file encoding and it
changes nothing about how anonymous page faults work. All it requires
is a loop to catch a changed page->index (previously page->index
couldn't change, now it can; that's the only thing it changes).

remap_file_pages' complexity derives from not being allowed to change
page->index during a move because the page mapcount may be bigger than
1, while that is precisely what remap_anon_pages does.

As long as this "rmap preparation" is the only constraint that
remap_anon_pages introduces in terms of rmap, it looks like a nice,
not-too-intrusive solution for resolving anonymous userfaults
efficiently.

Introducing remap_anon_pages in fact doesn't reduce the
simplification derived from the removal of remap_file_pages.

Conversely, removing remap_anon_pages later would only have the
benefit of removing this very patch 10/17, and nothing else.

In short remap_anon_pages does this (heavily simplified):

   pte = *src_pte;
   *src_pte = 0;
   pte_page(pte)->index = adjusted according to src_vma/dst_vma->vm_pgoff
   *dst_pte = pte;

It guarantees not to modify the vmas, and in turn it doesn't require
taking the mmap_sem for writing.

To use remap_anon_pages, each thread has to create its own temporary
vma, with MADV_DONTFORK set on it, as the source region where it
receives data from the network (MADV_DONTFORK is not formally required
by the syscall's strict checks, but then the application must never
fork: if MADV_DONTFORK isn't set remap_anon_pages could return -EBUSY,
although there's no risk of silent data corruption even if the thread
forks without setting it). Then, after the data is fully received,
remap_anon_pages atomically moves the page from the temporary vma to
the address where the userfault triggered (other threads may be
attempting to access the userfault address at the same time, but
thanks to the atomic behavior of remap_anon_pages they will never see
partial data coming from the network).

remap_anon_pages as a side effect creates a hole in the temporary
(source) vma, so the next recv() syscall receiving data from the
network will fault in a new anonymous page without requiring any
further malloc/free or other kind of vma mangling.
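
A rough userland sketch of that flow (illustration only: there is no
wrapper for the new syscall yet, so the __NR_remap_anon_pages number
below is a placeholder, and error handling plus short-read handling
are omitted):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SZ 4096

/*
 * "tmp" is the thread's scratch vma, set up once with:
 *	tmp = mmap(NULL, PAGE_SZ, PROT_READ|PROT_WRITE,
 *		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
 *	madvise(tmp, PAGE_SZ, MADV_DONTFORK);
 */
static void resolve_userfault(int net_fd, char *tmp,
			      unsigned long fault_addr)
{
	/* receive the missing page from the source node */
	read(net_fd, tmp, PAGE_SZ);

	/* atomically move the page into the faulting address */
	syscall(__NR_remap_anon_pages, fault_addr,
		(unsigned long)tmp, PAGE_SZ, 0);

	/* tmp now has a hole: the next read() into it simply
	 * faults in a fresh anonymous page */
}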

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 12/17] mm: sys_remap_anon_pages
  2014-10-04 13:13   ` Andi Kleen
@ 2014-10-06 17:00     ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-06 17:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-mm, linux-api, Linus Torvalds

Hi,

On Sat, Oct 04, 2014 at 06:13:27AM -0700, Andi Kleen wrote:
> Andrea Arcangeli <aarcange@redhat.com> writes:
> 
> > This new syscall will move anon pages across vmas, atomically and
> > without touching the vmas.
> >
> > It only works on non shared anonymous pages because those can be
> > relocated without generating non linear anon_vmas in the rmap code.
> 
> ...
> 
> > It is an alternative to mremap.
> 
> Why a new syscall? Couldn't mremap do this transparently?

The difference between remap_anon_pages and mremap is that mremap
fundamentally moves vmas and not pages (the pages are moved too only
because they remain attached to their respective vmas), while
remap_anon_pages moves anonymous pages zerocopy across vmas and never
touches any vma.

mremap for example would also nuke the source vma; remap_anon_pages
just moves the pages inside the vmas instead, so it doesn't require
allocating new vmas in the area that receives the data.

We could certainly change mremap to try to detect when the mapcount of
the anonymous page is 1, downgrade the mmap_sem to down_read, and then
behave like remap_anon_pages internally by updating page->index if all
pages in the range can be updated. However, to provide the same strict
checks that remap_anon_pages does and to leave the source vma intact,
mremap would need new flags altering the normal mremap semantics
(which silently wipe out the destination range and get rid of the
source range), and it would require running a
remap_anon_pages-detection routine that isn't zero cost.

Unless we add even more flags to mremap, we wouldn't have the absolute
guarantee that the vma tree is not altered in case userland is not
doing everything right (e.g. if userland forgot MADV_DONTFORK).

Separating the two looked better; mremap was never meant to be
efficient at moving one page at a time (or one THP at a time).
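
Just to illustrate, moving a single page with the existing mremap
interface would look like this (illustration only): the source vma is
split around the hole and a new vma shows up at the destination, so
repeating it for every resolved fault quickly multiplies the number of
vmas, and every call takes the mmap_sem for writing.

#define _GNU_SOURCE
#include <sys/mman.h>

static void *move_one_page_with_mremap(void *src_page, void *dst_addr)
{
	/* moves the mapping at src_page to dst_addr, splitting vmas
	 * on both sides instead of just moving the page */
	return mremap(src_page, 4096, 4096,
		      MREMAP_MAYMOVE | MREMAP_FIXED, dst_addr);
}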

Embedding remap_anon_pages inside mremap didn't look worthwhile
considering that, as a result, mremap would run slower when it cannot
behave like remap_anon_pages, and it would also run slower than
remap_anon_pages when it could.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-03 23:13   ` Mike Hommey
@ 2014-10-06 17:24     ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-06 17:24 UTC (permalink / raw)
  To: Mike Hommey
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko <dmitry.adamu>

Hi,

On Sat, Oct 04, 2014 at 08:13:36AM +0900, Mike Hommey wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
> 
> What does "unmapped virtual address" mean in this context?

To clarify this I added a second sentence to the commit header:

"still unmapped virtual address" of the previous sentence in this
context means that the pte/trans_huge_pmd is null. It means it's an
hole inside the anonymous vma (the kind of hole that doesn't account
for RSS but only virtual size of the process). It is the same state
all anonymous virtual memory is, right after mmap. The same state that
if you read from it, will map a zeropage into the faulting virtual
address. If the page is swapped out, it will not trigger userfaults.
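
For example, the SIGBUS-only usage looks more or less like this (just
a sketch: the MADV_USERFAULT define below is a placeholder for
whatever value the patchset ends up using, it's not in any released
header, and error handling is omitted):

#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>

#ifndef MADV_USERFAULT
#define MADV_USERFAULT	18	/* placeholder, see the patchset */
#endif

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	/* info->si_addr is the still unmapped address that got
	 * touched; the handler resolves it (e.g. remap_anon_pages) */
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};
	char *area;

	sigaction(SIGBUS, &sa, NULL);

	area = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	madvise(area, 1UL << 20, MADV_USERFAULT);

	/* touching a hole in "area" now raises SIGBUS instead of
	 * mapping the zeropage/allocating memory */
	return 0;
}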

If something isn't clear let me know.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
  2014-10-03 17:07 ` [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
@ 2014-10-07  9:03   ` Kirill A. Shutemov
       [not found]   ` <1412356087-16115-8-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07  9:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek

On Fri, Oct 03, 2014 at 07:07:57PM +0200, Andrea Arcangeli wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/proc/task_mmu.c       | 4 ++--
>  include/linux/huge_mm.h  | 4 ++--
>  include/linux/ksm.h      | 4 ++--
>  include/linux/mm_types.h | 2 +-
>  mm/huge_memory.c         | 2 +-
>  mm/ksm.c                 | 2 +-
>  mm/madvise.c             | 2 +-
>  mm/mremap.c              | 2 +-
>  8 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  	/*
>  	 * Don't forget to update Documentation/ on changes.
>  	 */
> -	static const char mnemonics[BITS_PER_LONG][2] = {
> +	static const char mnemonics[BITS_PER_LONG+1][2] = {

I believe here and below it should be BITS_PER_LONG_LONG instead: it
will catch unknown vmflags. And the +1 is not needed on 64-bit systems.

>  		/*
>  		 * In case if we meet a flag we don't know about.
>  		 */
> -		[0 ... (BITS_PER_LONG-1)] = "??",
> +		[0 ... (BITS_PER_LONG)] = "??",
>  
>  		[ilog2(VM_READ)]	= "rd",
>  		[ilog2(VM_WRITE)]	= "wr",
-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
  2014-10-03 23:13   ` Mike Hommey
@ 2014-10-07 10:36   ` Kirill A. Shutemov
  2014-10-07 10:46     ` Dr. David Alan Gilbert
  2014-10-07 13:24     ` Andrea Arcangeli
  1 sibling, 2 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07 10:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek

On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> userland touches a still unmapped virtual address, a sigbus signal is
> sent instead of allocating a new page. The sigbus signal handler will
> then resolve the page fault in userland by calling the
> remap_anon_pages syscall.

Hm. I wonder if this functionality really fits the madvise(2)
interface: as far as I understand it, it provides a way to give a
*hint* to the kernel which may or may not trigger an action from the
kernel side. I don't think an application will behave reasonably if
the kernel ignores the *advice* and allocates memory instead of
sending SIGBUS.

I would suggest considering some other interface for this
functionality: a new syscall or, perhaps, mprotect().

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 10:36   ` Kirill A. Shutemov
@ 2014-10-07 10:46     ` Dr. David Alan Gilbert
  2014-10-07 10:52       ` [Qemu-devel] " Kirill A. Shutemov
  2014-10-07 13:24     ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-07 10:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, qemu-devel, kvm, linux-kernel, linux-mm,
	linux-api, Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Andrew Morton, Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love

* Kirill A. Shutemov (kirill@shutemov.name) wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
> 
> Hm. I wounder if this functionality really fits madvise(2) interface: as
> far as I understand it, it provides a way to give a *hint* to kernel which
> may or may not trigger an action from kernel side. I don't think an
> application will behaive reasonably if kernel ignore the *advise* and will
> not send SIGBUS, but allocate memory.

Aren't DONTNEED and DONTDUMP similar cases of madvise operations that
are expected to do what they say?

> I would suggest to consider to use some other interface for the
> functionality: a new syscall or, perhaps, mprotect().

Dave

> -- 
>  Kirill A. Shutemov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 10:46     ` Dr. David Alan Gilbert
@ 2014-10-07 10:52       ` Kirill A. Shutemov
  2014-10-07 11:01         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07 10:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Robert Love, Dave Hansen, Jan Kara, kvm, Neil Brown,
	Stefan Hajnoczi, qemu-devel, linux-mm, KOSAKI Motohiro,
	Michel Lespinasse, Andrea Arcangeli, Taras Glek, Andrew Jones,
	Juan Quintela, Hugh Dickins, Isaku Yamahata, Mel Gorman,
	Sasha Levin, Android Kernel Team, Huangpeng (Peter),
	Andres Lagar-Cavilla, Christopher Covington, Anthony Liguori,
	Mike Hommey, Keith Packard

On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov (kirill@shutemov.name) wrote:
> > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > userland touches a still unmapped virtual address, a sigbus signal is
> > > sent instead of allocating a new page. The sigbus signal handler will
> > > then resolve the page fault in userland by calling the
> > > remap_anon_pages syscall.
> > 
> > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > far as I understand it, it provides a way to give a *hint* to kernel which
> > may or may not trigger an action from kernel side. I don't think an
> > application will behaive reasonably if kernel ignore the *advise* and will
> > not send SIGBUS, but allocate memory.
> 
> Aren't DONTNEED and DONTDUMP  similar cases of madvise operations that are
> expected to do what they say ?

No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not
affect correctness, just make behaviour suboptimal: more memory used
than needed, or wasted space in the coredump.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 10:52       ` [Qemu-devel] " Kirill A. Shutemov
@ 2014-10-07 11:01         ` Dr. David Alan Gilbert
  2014-10-07 11:30           ` Kirill A. Shutemov
  0 siblings, 1 reply; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-07 11:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Robert Love, Dave Hansen, Jan Kara, kvm, Neil Brown,
	Stefan Hajnoczi, qemu-devel, linux-mm, KOSAKI Motohiro,
	Michel Lespinasse, Andrea Arcangeli, Taras Glek, Andrew Jones,
	Juan Quintela, Hugh Dickins, Isaku Yamahata, Mel Gorman,
	Sasha Levin, Android Kernel Team, Huangpeng (Peter),
	Andres Lagar-Cavilla, Christopher Covington, Antho

* Kirill A. Shutemov (kirill@shutemov.name) wrote:
> On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> > * Kirill A. Shutemov (kirill@shutemov.name) wrote:
> > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > > userland touches a still unmapped virtual address, a sigbus signal is
> > > > sent instead of allocating a new page. The sigbus signal handler will
> > > > then resolve the page fault in userland by calling the
> > > > remap_anon_pages syscall.
> > > 
> > > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > > far as I understand it, it provides a way to give a *hint* to kernel which
> > > may or may not trigger an action from kernel side. I don't think an
> > > application will behaive reasonably if kernel ignore the *advise* and will
> > > not send SIGBUS, but allocate memory.
> > 
> > Aren't DONTNEED and DONTDUMP  similar cases of madvise operations that are
> > expected to do what they say ?
> 
> No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not
> affect correctness, just behaviour will be suboptimal: more than needed
> memory used or wasted space in coredump.

That's not how the manpage reads for DONTNEED; it calls it out as a special
case near the top, and explicitly says what will happen if you read the
area marked as DONTNEED.

It looks like there are openssl patches that use DONTDUMP to explicitly
make sure keys etc don't land in cores.
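
For instance (both flags exist in current kernels; on private
anonymous memory DONTNEED is not a mere hint, the old contents really
are gone):

#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(p, 0xaa, len);
	madvise(p, len, MADV_DONTNEED);
	assert(p[0] == 0);	/* reads now hit zero-fill pages */

	madvise(p, len, MADV_DONTDUMP);	/* keep secrets out of cores */
	return 0;
}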

Dave

> 
> -- 
>  Kirill A. Shutemov
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-03 17:08 ` [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
  2014-10-03 18:31   ` Linus Torvalds
@ 2014-10-07 11:10   ` Kirill A. Shutemov
  2014-10-07 13:37     ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07 11:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Robert Love,
	Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi, Andrew Jones,
	KOSAKI Motohiro, Michel Lespinasse, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Sasha Levin,
	Android Kernel Team, \"Dr. David Alan Gilbert\",
	Huangpeng (Peter), Andres Lagar-Cavilla, Christopher Covington,
	Anthony Liguori

On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> There's one constraint enforced to allow this simplification: the
> source pages passed to remap_anon_pages must be mapped only in one
> vma, but this is not a limitation when used to handle userland page
> faults with MADV_USERFAULT. The source addresses passed to
> remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> avoid any risk of the mapcount of the pages increasing, if fork runs
> in parallel in another thread, before or while remap_anon_pages runs.

Have you considered triggering COW instead of adding a limitation on
the pages' mapcount? The limitation looks artificial from an interface
POV.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 11:01         ` Dr. David Alan Gilbert
@ 2014-10-07 11:30           ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07 11:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Robert Love, Dave Hansen, Jan Kara, kvm, Neil Brown,
	Stefan Hajnoczi, qemu-devel, linux-mm, KOSAKI Motohiro,
	Michel Lespinasse, Andrea Arcangeli, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Sasha Levin,
	Android Kernel Team, Andrew Jones, Huangpeng (Peter),
	Andres Lagar-Cavilla, Christopher Covington, Anthony Liguori,
	Paolo Bonzini, Keith Packard

On Tue, Oct 07, 2014 at 12:01:02PM +0100, Dr. David Alan Gilbert wrote:
> * Kirill A. Shutemov (kirill@shutemov.name) wrote:
> > On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote:
> > > * Kirill A. Shutemov (kirill@shutemov.name) wrote:
> > > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > > > userland touches a still unmapped virtual address, a sigbus signal is
> > > > > sent instead of allocating a new page. The sigbus signal handler will
> > > > > then resolve the page fault in userland by calling the
> > > > > remap_anon_pages syscall.
> > > > 
> > > > Hm. I wounder if this functionality really fits madvise(2) interface: as
> > > > far as I understand it, it provides a way to give a *hint* to kernel which
> > > > may or may not trigger an action from kernel side. I don't think an
> > > > application will behaive reasonably if kernel ignore the *advise* and will
> > > > not send SIGBUS, but allocate memory.
> > > 
> > > Aren't DONTNEED and DONTDUMP  similar cases of madvise operations that are
> > > expected to do what they say ?
> > 
> > No. If kernel would ignore MADV_DONTNEED or MADV_DONTDUMP it will not
> > affect correctness, just behaviour will be suboptimal: more than needed
> > memory used or wasted space in coredump.
> 
> That's not how the manpage reads for DONTNEED; it calls it out as a special
> case near the top, and explicitly says what will happen if you read the
> area marked as DONTNEED.

You are right. MADV_DONTNEED doesn't fit the interface either. That's
bad and we can't fix it. But it's not a reason to make this mistake
again.

Read the next sentence: "The kernel is free to ignore the advice."

Note, POSIX_MADV_DONTNEED has totally different semantics.

> It looks like there are openssl patches that use DONTDUMP to explicitly
> make sure keys etc don't land in cores.

That's nice to have. But openssl works on systems without the interface,
meaning it's not essential for functionality.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-06 16:41       ` Andrea Arcangeli
@ 2014-10-07 12:47         ` Linus Torvalds
  2014-10-07 14:19           ` Andrea Arcangeli
  2014-10-07 17:07           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-10-07 12:47 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dr. David Alan Gilbert, qemu-devel, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey,
	Taras Glek <tgle>

On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Of course if somebody has better ideas on how to resolve an anonymous
> userfault they're welcome.

So I'd *much* rather have a "write()" style interface (i.e. _copying_
bytes from user space into a newly allocated page that gets mapped)
than a "remap page" style interface.

Remapping anonymous pages involves page table games that really aren't
necessarily a good idea, and tlb invalidates for the old page etc.
Just don't do it.

           Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 10:36   ` Kirill A. Shutemov
  2014-10-07 10:46     ` Dr. David Alan Gilbert
@ 2014-10-07 13:24     ` Andrea Arcangeli
  2014-10-07 15:21       ` Kirill A. Shutemov
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-07 13:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko <dmitry.adamu>

Hi Kirill,

On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > userland touches a still unmapped virtual address, a sigbus signal is
> > sent instead of allocating a new page. The sigbus signal handler will
> > then resolve the page fault in userland by calling the
> > remap_anon_pages syscall.
> 
> Hm. I wounder if this functionality really fits madvise(2) interface: as
> far as I understand it, it provides a way to give a *hint* to kernel which
> may or may not trigger an action from kernel side. I don't think an
> application will behaive reasonably if kernel ignore the *advise* and will
> not send SIGBUS, but allocate memory.
> 
> I would suggest to consider to use some other interface for the
> functionality: a new syscall or, perhaps, mprotect().

I didn't feel like adding PROT_USERFAULT to mprotect, which looks
hardwired to just these flags:

       PROT_NONE  The memory cannot be accessed at all.

       PROT_READ  The memory can be read.

       PROT_WRITE The memory can be modified.

       PROT_EXEC  The memory can be executed.

Normally mprotect doesn't just alter the vmas, it also alters the
pte/hugepmd protection bits; that's something that is never needed
with VM_USERFAULT, so I don't feel VM_USERFAULT is a protection change
to the VMA.

mprotect is also hardwired to mangle only the VM_READ|WRITE|EXEC
flags, while madvise is ideal for setting arbitrary vma flags.

From an implementation standpoint the perfect place to set a flag in a
vma is madvise. This is what MADV_DONTFORK (which sets VM_DONTCOPY)
already does, in an identical way to MADV_USERFAULT/VM_USERFAULT.

MADV_DONTFORK is as critical as MADV_USERFAULT because people depend
on it, for example to prevent the O_DIRECT vs fork race condition that
results in silent data corruption during I/O with threads that may
fork. The other reason why MADV_DONTFORK is critical is that fork()
would otherwise fail with OOM unless full overcommit is enabled
(i.e. pci hotplug crashes the guest if you forget to set
MADV_DONTFORK).
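
Just to illustrate the semantics people depend on (sketch only, error
handling omitted):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	madvise(p, 4096, MADV_DONTFORK);	/* sets VM_DONTCOPY */

	if (fork() == 0) {
		/* the marked vma was not copied: touching p here would
		 * fault, which is the point for memory under O_DIRECT
		 * or device DMA that must not be duplicated */
		_exit(0);
	}
	wait(NULL);
	return 0;
}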

Another madvise that would generate a failure if not obeyed by the
kernel is MADV_DONTNEED: if it did nothing it could lead to OOM
killing. We don't inflate virt balloons using munmap, just to make an
example. Various other apps (maybe JVM garbage collection too) make
extensive use of MADV_DONTNEED and depend on it.

That said, I can change it to mprotect; the only thing I don't like is
that it'll result in a less clean patch, and I can't see a practical
risk in keeping it simpler with madvise, as long as we always return
-EINVAL whenever we encounter a vma type that cannot raise userfaults
yet (that is something I already enforce).

Yet another option would be to drop MADV_USERFAULT and
vm_flags&VM_USERFAULT entirely and in turn the ability to handle
userfaults with SIGBUS, and retain only the userfaultfd. The new
userfaultfd protocol requires registering each created userfaultfd
into its own private virtual memory ranges (that is to allow an
unlimited number of userfaultfds per process). Currently the
userfaultfd engages iff the fault address intersects both the
MADV_USERFAULT range and the userfaultfd registered ranges. So I could
drop MADV_USERFAULT and VM_USERFAULT and just check for
vma->vm_userfaultfd_ctx!=NULL to know if the userfaultfd protocol
needs to be engaged during the first page fault for a still unmapped
virtual address. I just thought it would be more flexible to also
allow SIGBUS without forcing people to use userfaultfd (that's in fact
the only reason to still retain madvise(MADV_USERFAULT)!).

Earlier volatile pages patches, for example, only supported the SIGBUS
behavior, and I didn't intend to force them to use userfaultfd if
they're guaranteed to access the memory with the CPU and never through
a kernel syscall (that is something the app can enforce by
design). userfaultfd becomes necessary the moment you want to handle
userfaults through syscalls/gup etc. qemu obviously requires
userfaultfd and never uses the userfaultfd-less SIGBUS behavior, as it
touches the memory in all possible ways (first and foremost with the
KVM page fault, which uses almost all variants of gup).

So here somebody should comment and choose between:

1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
   the current madvise(MADV_USERFAULT)

2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
   userfaultfd protocol as the only way for userland to catch
   userfaults (each userfaultfd must already register itself into its
   own virtual memory ranges so it's a trivial change for userfaultfd
   users that deletes just 1 or 2 lines of userland code, but it would
   prevent using the SIGBUS behavior with info->si_addr=faultaddr for
   other users)

3) keep things as they are now: use MADV_USERFAULT for SIGBUS
   userfaults, with optional intersection between the
   vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges
   with vma->vm_userfaultfd_ctx!=NULL to know if to engage the
   userfaultfd protocol instead of the plain SIGBUS

I will update the code according to the feedback, so please comment.

I implemented 3) because I thought it provided the most flexibility
for userland to choose whether to engage in the userfaultfd protocol
or to stay simple with SIGBUS if the app doesn't need to access the
userfault virtual memory from kernel code. It also provides the
cleanest and simplest implementation for setting the VM_USERFAULT flag
with madvise.

My second choice would be 2). We could always add MADV_USERFAULT
later, except then we'd be forced to set and clear VM_USERFAULT within
the userfaultfd registration to remain backwards compatible. The main
con, and the reason I didn't pick 2), is that it wouldn't be a drop-in
replacement for volatile pages, which would then be forced to use the
userfaultfd protocol too.

I don't like 1) very much, mostly because the changes to mprotect
would just make things more complex on the implementation side with
purely conceptual benefits, but it's possible too and it's feature
equivalent to 3) as far as volatile pages are concerned, so I'm
overall fine with this change if that's the preferred way.

Thanks,
Andrea


* Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 11:10   ` [Qemu-devel] " Kirill A. Shutemov
@ 2014-10-07 13:37     ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-07 13:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api, Robert Love,
	Dave Hansen, Jan Kara, Neil Brown, Stefan Hajnoczi, Andrew Jones,
	KOSAKI Motohiro, Michel Lespinasse, Taras Glek, Juan Quintela,
	Hugh Dickins, Isaku Yamahata, Mel Gorman, Sasha Levin,
	Android Kernel Team, \"Dr. David Alan Gilbert\",
	Huangpeng (Peter), Andres Lagar-Cavilla <andres>

Hi Kirill,

On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> > There's one constraint enforced to allow this simplification: the
> > source pages passed to remap_anon_pages must be mapped only in one
> > vma, but this is not a limitation when used to handle userland page
> > faults with MADV_USERFAULT. The source addresses passed to
> > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> > avoid any risk of the mapcount of the pages increasing, if fork runs
> > in parallel in another thread, before or while remap_anon_pages runs.
> 
> Have you considered triggering COW instead of adding limitation on
> pages' mapcount? The limitation looks artificial from interface POV.

I haven't considered it, mostly because I see the -EBUSY return as a
feature. I prefer to avoid the risk of userland getting a successful
retval while internally the kernel silently behaves non-zerocopy,
because some userland bug forgot to set MADV_DONTFORK on the src_vma.

COW would not be zerocopy, so it's not OK. We get sub-1msec latency
for userfaults over 10gbit and we don't want to risk wasting CPU
caches.

I did, however, consider allowing the strict behavior (i.e. the
feature) to be extended later in a backwards compatible way. We could
provide a non-zerocopy behavior with a RAP_ALLOW_COW flag that would
then turn the -EBUSY error into a copy.

It's also more complex to implement the COW now, and it would make the
code that really matters harder to review. So it may be preferable to
add this later, in a backwards compatible way, with a new
RAP_ALLOW_COW flag.

The current handling of the flags is already written in a way that
should allow backwards compatible extension with RAP_ALLOW_*:

#define RAP_ALLOW_SRC_HOLES (1UL<<0)

SYSCALL_DEFINE4(remap_anon_pages,
		unsigned long, dst_start, unsigned long, src_start,
		unsigned long, len, unsigned long, flags)
[..]
	long err = -EINVAL;
[..]
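	/* reject unknown flag bits so new RAP_ALLOW_* flags can be added later */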
	if (flags & ~RAP_ALLOW_SRC_HOLES)
		return err;


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 12:47         ` Linus Torvalds
@ 2014-10-07 14:19           ` Andrea Arcangeli
  2014-10-07 15:52             ` Andrea Arcangeli
       [not found]             ` <20141007141913.GC2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-10-07 17:07           ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-07 14:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dr. David Alan Gilbert, qemu-devel, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko <dmitry>

Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
> 
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only con I see is that we then couldn't use
recv(tmp_addr, PAGE_SIZE) followed by remap_anon_pages(faultaddr,
tmp_addr, PAGE_SIZE, ..) and retain the zerocopy behavior. Or how
could we? There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).
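
For reference, a sketch of the loop I mean; the __NR_remap_anon_pages
number below is a placeholder (the syscall only exists in this series)
and it assumes each recv() delivers exactly the one page requested for
the faulting address:

#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_remap_anon_pages
#define __NR_remap_anon_pages 317	/* placeholder, not an allocated number */
#endif

static void resolve_userfault(int sockfd, void *tmp_addr, void *faultaddr,
			      size_t page_size)
{
	/* receive the page data into a scratch page... */
	recv(sockfd, tmp_addr, page_size, MSG_WAITALL);
	/* ...then move it in place, zerocopy, instead of copying it again */
	syscall(__NR_remap_anon_pages, (unsigned long)faultaddr,
		(unsigned long)tmp_addr, page_size, 0UL);
}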

Ideally, if we could prevent the page data coming from the network
from ever becoming visible in the kernel, we could avoid the TLB flush
and also be zerocopy, but I can't see how we could achieve that.

The page data could come through an ssh pipe or anything (qemu
supports all kinds of network transports for live migration), which is
why keeping the network protocol in userland is preferable.

As things stand now, I'm afraid that with a write() syscall we
couldn't do it zerocopy. We'd still need to receive the memory in a
temporary page and then copy it to a kernel page (invisible to
userland while we write to it) to later map at the userfault address.

If it weren't for the TLB flush of the old page, the remap_anon_pages
variant would be preferable to doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Another idea that comes to mind, to add the ability to switch between
copy and TLB flush, is a RAP_FORCE_COPY flag that would do a copy
inside remap_anon_pages and leave the original page mapped in place
(such a flag would also disable the -EBUSY error if page_mapcount
is > 1).

So if the RAP_FORCE_COPY flag is set, remap_anon_pages would behave
like you suggested (but with an mremap-like interface, instead of a
write syscall) and we could benchmark the difference between copy and
TLB flush too. We could even benchmark it periodically at runtime and
switch over to the faster method (the more CPUs there are in the host
and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course, in terms of API, I could implement the exact same mechanism
described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len that is a multiple of
PAGE_SIZE), to communicate the address where the page should be mapped
and whether to behave zerocopy or to skip the TLB flush and do a copy.

Because the copy vs TLB flush trade-off can be achieved with both
interfaces, I think it really boils down to choosing between an
mremap-like interface and a file+commands protocol interface. I tend
to like mremap more; that's why I opted for a remap_anon_pages syscall
kept orthogonal to the userfaultfd functionality (remap_anon_pages
could also be used standalone as an accelerated mremap in some
circumstances), but nothing prevents just embedding the same mechanism
inside userfaultfd if a file+commands API is preferable. Or we could
add a different syscall (separate from userfaultfd) that creates
another pseudofd to write a command plus the page data into. I just
wouldn't see the point of creating a pseudofd only to copy a page
atomically; the write() syscall would look more ideal if the
userfaultfd is already open for other reasons and the pseudofd
overhead is required anyway.

The last thing to keep in mind is that, when using userfaults with
SIGBUS and without userfaultfd, remap_anon_pages would still have been
useful, so if we retain the SIGBUS behavior for volatile pages and we
don't force the usage of userfaultfd, it may be cleaner not to use
userfaultfd but a separate pseudofd to do the write() syscall on.
Otherwise the app would need to open the userfaultfd to resolve the
fault even though it's not using the userfaultfd protocol, which
doesn't look like an intuitive interface to me.

Comments welcome.

Thanks,
Andrea


* Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
  2014-10-07 13:24     ` Andrea Arcangeli
@ 2014-10-07 15:21       ` Kirill A. Shutemov
  0 siblings, 0 replies; 49+ messages in thread
From: Kirill A. Shutemov @ 2014-10-07 15:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel, kvm, linux-kernel, linux-mm, linux-api,
	Linus Torvalds, Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner,
	\"Dr. David Alan Gilbert\", Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Taras Glek

On Tue, Oct 07, 2014 at 03:24:58PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
> 
> On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote:
> > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
> > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
> > > userland touches a still unmapped virtual address, a sigbus signal is
> > > sent instead of allocating a new page. The sigbus signal handler will
> > > then resolve the page fault in userland by calling the
> > > remap_anon_pages syscall.
> > 
> > Hm. I wonder if this functionality really fits the madvise(2) interface: as
> > far as I understand it, it provides a way to give a *hint* to the kernel
> > which may or may not trigger an action from the kernel side. I don't think
> > an application will behave reasonably if the kernel ignores the *advice*
> > and allocates memory instead of sending SIGBUS.
> > 
> > I would suggest to consider to use some other interface for the
> > functionality: a new syscall or, perhaps, mprotect().
> 
> I didn't feel like adding PROT_USERFAULT to mprotect, which looks
> hardwired to just these flags:

PROT_NOALLOC maybe?

> 
>        PROT_NONE  The memory cannot be accessed at all.
> 
>        PROT_READ  The memory can be read.
> 
>        PROT_WRITE The memory can be modified.
> 
>        PROT_EXEC  The memory can be executed.

To be complete: PROT_GROWSDOWN, PROT_GROWSUP and unused PROT_SEM.

> So here somebody should comment and choose between:
> 
> 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of
>    the current madvise(MADV_USERFAULT)
> 
> 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the
>    userfaultfd protocol as the only way for userland to catch
>    userfaults (each userfaultfd must already register itself into its
>    own virtual memory ranges so it's a trivial change for userfaultfd
>    users that deletes just 1 or 2 lines of userland code, but it would
>    prevent to use the SIGBUS behavior with info->si_addr=faultaddr for
>    other users)
> 
> 3) keep things as they are now: use MADV_USERFAULT for SIGBUS
>    userfaults, with optional intersection between the
>    vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges
>    with vma->vm_userfaultfd_ctx!=NULL to know if to engage the
>    userfaultfd protocol instead of the plain SIGBUS

4) new syscall?
 
> I will update the code accordingly to feedback, so please comment.

I don't have a strong opinion on this. I just *feel* it doesn't fit
the advice semantics.

The only userspace interface I've designed has not been proven good by
time. I would listen to what the senior maintainers say. :)
 
-- 
 Kirill A. Shutemov


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 14:19           ` Andrea Arcangeli
@ 2014-10-07 15:52             ` Andrea Arcangeli
  2014-10-07 15:54               ` Andy Lutomirski
       [not found]               ` <20141007155247.GD2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]             ` <20141007141913.GC2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2014-10-07 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dr. David Alan Gilbert, qemu-devel, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko <dmitry>

On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
> mremap like interface, or file+commands protocol interface. I tend to
> like mremap more, that's why I opted for a remap_anon_pages syscall
> kept orthogonal to the userfaultfd functionality (remap_anon_pages
> could be also used standalone as an accelerated mremap in some
> circumstances) but nothing prevents to just embed the same mechanism

Sorry for the self-followup, but something else came to mind to
elaborate on this further.

In terms of interfaces, the most efficient I could think of to
minimize kernel enters/exits would be to append the "source address"
of the data received from the network transport to the
userfaultfd_write() command (by appending 8 bytes to the wakeup
command). That said, mixing the mechanism used to be notified about
userfaults with the mechanism used to resolve a userfault looks like a
complication to me. I rather liked keeping the userfaultfd protocol
very simple and doing just its thing. The userfaultfd doesn't need to
know how the userfault was resolved; even mremap would work
theoretically (until we run out of vmas). I thought it was simpler to
keep it that way. However, if we want to resolve the fault with a
"write()" syscall, this may be the most efficient way to do it: we're
already doing a write() into the pseudofd, containing the destination
address, to wake up the page fault, so I would just need to append the
source address to the wakeup command.

I probably grossly overestimated the benefits of resolving the
userfault with a zerocopy page move, sorry. So if we entirely drop the
zerocopy behavior and the TLB flush of the old page like you
suggested, the way to keep the userfaultfd mechanism decoupled from
the userfault resolution mechanism would be to implement an
atomic-copy syscall. That would work for SIGBUS userfaults too without
requiring a pseudofd then. It would be enough then to call
mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraint
that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
wouldn't page fault or call GUP into the destination address (it can't
otherwise the in-flight partial copy would be visible to the process,
breaking the atomicity of the copy), but it would fill in the
pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
currently has (in turn it would by design bypass the VM_USERFAULT
check and be ideal for resolving userfaults).
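
As a sketch of what I have in mind (the syscall obviously doesn't
exist, so the wrapper and the number below are hypothetical), the
fault handling side would reduce to:

#include <sys/syscall.h>
#include <unistd.h>

/* hypothetical wrapper; 318 is a placeholder, not an allocated slot */
static long mcopy_atomic(void *userfault_addr, void *tmp_addr, size_t len)
{
	return syscall(318, (unsigned long)userfault_addr,
		       (unsigned long)tmp_addr, len);
}

static void resolve_fault(void *userfault_addr, void *tmp_page, size_t page_size)
{
	/*
	 * tmp_page holds the data received from the migration stream;
	 * the kernel would copy it into a fresh page that stays
	 * invisible to the process until it is mapped at
	 * userfault_addr, so no partial copy can ever be observed.
	 */
	mcopy_atomic(userfault_addr, tmp_page, page_size);
}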

mcopy_atomic could then also be extended to tmpfs; it would work
without requiring the source page to be a tmpfs page too, and without
having to convert page types on the fly.

If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
course so it'd be even less intrusive than the current
remap_anon_pages and it would require zero TLB flush during its
runtime (it would just require an atomic copy).

So should I try to embed a mcopy_atomic inside userfault_write or can
I expose it to userland as a standalone new syscall? Or should I do
something different? Comments?

Thanks,
Andrea


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 15:52             ` Andrea Arcangeli
@ 2014-10-07 15:54               ` Andy Lutomirski
       [not found]               ` <20141007155247.GD2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2014-10-07 15:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Dr. David Alan Gilbert,
	qemu-devel@nongnu.org Developers, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Christopher Covington, Johannes Weiner,
	Android Kernel Team, Robert Love, Dmitry Adamushko, Neil Brown

On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
>> mremap like interface, or file+commands protocol interface. I tend to
>> like mremap more, that's why I opted for a remap_anon_pages syscall
>> kept orthogonal to the userfaultfd functionality (remap_anon_pages
>> could be also used standalone as an accelerated mremap in some
>> circumstances) but nothing prevents to just embed the same mechanism
>
> Sorry for the self followup, but something else comes to mind to
> elaborate this further.
>
> In term of interfaces, the most efficient I could think of to minimize
> the enter/exit kernel, would be to append the "source address" of the
> data received from the network transport, to the userfaultfd_write()
> command (by appending 8 bytes to the wakeup command). Said that,
> mixing the mechanism to be notified about userfaults with the
> mechanism to resolve an userfault to me looks a complication. I kind
> of liked to keep the userfaultfd protocol is very simple and doing
> just its thing. The userfaultfd doesn't need to know how the userfault
> was resolved, even mremap would work theoretically (until we run out
> of vmas). I thought it was simpler to keep it that way. However if we
> want to resolve the fault with a "write()" syscall this may be the
> most efficient way to do it, as we're already doing a write() into the
> pseudofd to wakeup the page fault that contains the destination
> address, I just need to append the source address to the wakeup command.
>
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).

At the risk of asking a possibly useless question, would it make sense
to splice data into a userfaultfd?

--Andy

>
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
>
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).
>
> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?
>
> Thanks,
> Andrea



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
       [not found]               ` <20141007155247.GD2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-10-07 16:13                 ` Peter Feiner
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Feiner @ 2014-10-07 16:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Dr. David Alan Gilbert,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Christopher Covington, Johannes Weiner,
	Android Kernel Team, Robert Love, Dmitry Adamushko, Neil Brown,
	Mike Hommey, Taras

On Tue, Oct 07, 2014 at 05:52:47PM +0200, Andrea Arcangeli wrote:
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. [...]

For posterity, I think it's worth noting that the most expensive aspect of a TLB
shootdown is the interprocessor interrupt necessary to flush other CPUs' TLBs.
On a many-core machine, copying 4K of data looks pretty cheap compared to
taking an interrupt and invalidating TLBs on many cores :-)

> [...] So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).
> 
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
> 
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).

I like this new approach. It will be good to have a single interface for
resolving anon and tmpfs userfaults.

> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?

One interesting (ab)use of userfault_write would be that the faulting process
and the fault-handling process could be different, which would be necessary
for post-copy live migration in CRIU (http://criu.org).

Aside from the aesthetic difference, I can't think of any advantage in favor of
a syscall.

Peter


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
       [not found]             ` <20141007141913.GC2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-10-07 16:56               ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2014-10-07 16:56 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Dr. David Alan Gilbert, qemu-devel-qX2TKyscuCcdnm+yROfE0A,
	KVM list, Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown, Mike Hommey, Jan Kara

On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> I see what you mean. The only cons I see is that we couldn't use then
> recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
> PAGE_SIZE, ..)  and retain the zerocopy behavior. Or how could we?
> There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

You're doing completely faulty math, and you haven't thought it through.

Your "zero-copy" case is no such thing. Who cares if some packet
receive is zero-copy, when you need to set up the anonymous page to
*receive* the zero copy into, which involves page allocation, page
zeroing, page table setup with VM and page table locking, etc etc.

The thing is, the whole concept of "zero-copy" is pure and utter
bullshit. Sun made a big deal about the whole concept back in the
nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's
crap. It's made-up and not real.

Then, once you've allocated and cleared the page, mapped it in, your
"zero-copy" model involves looking up the page in the page tables
again (locking etc), then doing that zero-copy to the page. Then, when
you remap it, you look it up in the page tables AGAIN, with locking,
move it around, have to clear the old page table entry (which involves
a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI -
since the people who do this are all threaded and want many CPU's, and
then you insert the page into the new place.

That's *insane*. It's crap. All just to try to avoid one page copy.

Don't do it. remapping games really are complete BS. They never beat
just copying the data. It's that simple.

> As things stands now, I'm afraid with a write() syscall we couldn't do
> it zerocopy.

Really, you need to rethink your whole "zerocopy" model. It's broken.
Nobody sane cares. You've bought into a model that Sun already showed
doesn't work.

The only time zero-copy works is in random benchmarks that are very
careful to not touch the data at all at any point, and also try to
make sure that the benchmark is very much single-threaded so that you
never have the whole cross-CPU IPI issue for the TLB invalidate. Then,
and only then, can zero-copy win. And it's just not a realistic
scenario.

> If it wasn't for the TLB flush of the old page, the remap_anon_pages
> variant would be more optimal than doing a copy through a write
> syscall. Is the copy cheaper than a TLB flush? I probably naively
> assumed the TLB flush was always cheaper.

A local TLB flush is cheap. That's not the problem. The problem is the
setup of the page, and the clearing of the page, and the cross-CPU TLB
flush. And the page table locking, etc etc.

So no, I'm not AT ALL worried about a single "invlpg" instruction.
That's nothing. Local CPU TLB flushes of single pages are basically
free. But that really isn't where the costs are.

Quite frankly, the *only* time page remapping has ever made sense is
when it is used for realloc() kind of purposes, where you need to
remap pages not because of zero-copy, but because you need to change
the virtual address space layout. And then you make sure it's not a
common operation, because you're not doing it as a performance win,
you're doing it because you're changing your virtual layout.

Really. Zerocopy is for benchmarks, and for things like "splice()"
when you can avoid the page tables *entirely*. But the notion that
page remapping of user pages is faster than a copy is pure and utter
garbage. It's simply not true.

So I really think you should aim for a "write()" kind of interface.

With write, you may not get the zero-copy, but on the other hand it
allows you to re-use the source buffer many times without having to
allocate new pages and map it in etc. So a "read()+write()" loop (or,
quite commonly a "generate data computationally from a compressed
source + write()" loop) is actually much more efficient than the
zero-copy remapping, because you don't have all the complexities and
overheads in creating the source page.
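
As a rough sketch of the loop I mean (dst_fd, i.e. some
userfaultfd-style fd that maps each written page atomically, is
exactly the interface being discussed here, so treat it as an
assumption):

#include <unistd.h>

/* one scratch buffer reused for every page: no per-page allocation or remap */
static void feed_pages(int src_fd, int dst_fd)
{
	char buf[4096];
	ssize_t n;

	while ((n = read(src_fd, buf, sizeof(buf))) > 0)
		write(dst_fd, buf, n);
}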

It is possible that that could involve "splice()" too, although I
don't really think the source data tends to be in page-aligned chunks.
But hey, splice() at least *can* be faster than copying (and then we
have vmsplice() not because it's magically faster, but because it can
under certain circumstances be worth it, and it kind of made sense to
allow the interface, but I really don't think it's used very much or
very useful).

               Linus


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 12:47         ` Linus Torvalds
  2014-10-07 14:19           ` Andrea Arcangeli
@ 2014-10-07 17:07           ` Dr. David Alan Gilbert
  2014-10-07 17:14             ` Paolo Bonzini
  1 sibling, 1 reply; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-07 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Paolo Bonzini, Rik van Riel, Mel Gorman, Andy Lutomirski,
	Andrew Morton, Sasha Levin, Hugh Dickins, Peter Feiner,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko <dmitry.adamush>

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface

Something like that might work for the postcopy case; it doesn't work
for some of the other uses that need to stop a page being changed by the
guest, but then need to somehow get a copy of that page internally to QEMU,
and perhaps provide it back later.  remap_anon_pages worked for those cases
as well; I can't think of another current way of doing it in userspace.

I'm thinking here of systems for making VMs with memory larger than a single
host; that's something that's not as well thought out.  I've also seen people
writing emulators who want to trap and emulate some page accesses while
still having the original data available to the emulator itself.

So yes, OK for now, but the result is less general.

Dave


> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.
> 
>            Linus
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 17:07           ` Dr. David Alan Gilbert
@ 2014-10-07 17:14             ` Paolo Bonzini
  2014-10-07 17:25               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 49+ messages in thread
From: Paolo Bonzini @ 2014-10-07 17:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Linus Torvalds
  Cc: Andrea Arcangeli, qemu-devel, KVM list, Linux Kernel Mailing List,
	linux-mm, Linux API, Andres Lagar-Cavilla, Dave Hansen,
	Rik van Riel, Mel Gorman, Andy Lutomirski, Andrew Morton,
	Sasha Levin, Hugh Dickins, Peter Feiner, Christopher Covington,
	Johannes Weiner, Android Kernel Team, Robert Love,
	Dmitry Adamushko, Neil Brown

On 07/10/2014 19:07, Dr. David Alan Gilbert wrote:
>> > 
>> > So I'd *much* rather have a "write()" style interface (ie _copying_
>> > bytes from user space into a newly allocated page that gets mapped)
>> > than a "remap page" style interface
> Something like that might work for the postcopy case; it doesn't work
> for some of the other uses that need to stop a page being changed by the
> guest, but then need to somehow get a copy of that page internally to QEMU,
> and perhaps provide it back later.

I cannot parse this.  Which uses do you have in mind?  Is it
QEMU-specific or is it for other applications of userfaults?

As long as the page is atomically mapped, I'm not sure what the
difference from remap_anon_pages is (as far as the destination page is
concerned).  Are you thinking of having userfaults enabled on the source
as well?

Paolo

> remap_anon_pages worked for those cases
> as well; I can't think of another current way of doing it in userspace.


* Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
  2014-10-07 17:14             ` Paolo Bonzini
@ 2014-10-07 17:25               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 49+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-07 17:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Linus Torvalds, Andrea Arcangeli, qemu-devel, KVM list,
	Linux Kernel Mailing List, linux-mm, Linux API,
	Andres Lagar-Cavilla, Dave Hansen, Rik van Riel, Mel Gorman,
	Andy Lutomirski, Andrew Morton, Sasha Levin, Hugh Dickins,
	Peter Feiner, Christopher Covington, Johannes Weiner,
	Android Kernel Team, Robert Love, Dmitry Adamushko <dmi>

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
> >> > 
> >> > So I'd *much* rather have a "write()" style interface (ie _copying_
> >> > bytes from user space into a newly allocated page that gets mapped)
> >> > than a "remap page" style interface
> > Something like that might work for the postcopy case; it doesn't work
> > for some of the other uses that need to stop a page being changed by the
> > guest, but then need to somehow get a copy of that page internally to QEMU,
> > and perhaps provide it back later.
> 
> I cannot parse this.  Which uses do you have in mind?  Is it for
> QEMU-specific or is it for other applications of userfaults?

> As long as the page is atomically mapped, I'm not sure what the
> difference from remap_anon_pages are (as far as the destination page is
> concerned).  Are you thinking of having userfaults enabled on the source
> as well?

What I'm talking about here is when I want to stop a page being accessed by the
guest, do something with the data in qemu, and give it back to the guest sometime
later.

The main example is memory pools for guests, where you swap RAM between a
series of VM hosts.  You have to take the page out and send it over the wire;
sometime later, if the guest tries to access it, you userfault and pull it
back.  (There are at least one or two companies selling something like this,
and at least one Linux based implementation with its own much more involved
kernel hacks.)

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
       [not found]   ` <1412356087-16115-8-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-11-06 20:08     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 49+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-06 20:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: qemu-devel-qX2TKyscuCcdnm+yROfE0A, kvm-u79uwXL29TY76Z2rM5mHXA,
	Linux Kernel Mailing List,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds,
	Andres Lagar-Cavilla, Dave Hansen, Paolo Bonzini, Rik van Riel,
	Mel Gorman, Andy Lutomirski, Andrew Morton, Sasha Levin,
	Hugh Dickins, Peter Feiner, \Dr. David Alan Gilbert\,
	Christopher Covington, Johannes Weiner, Android Kernel Team,
	Robert Love, Dmitry Adamushko, Neil Brown, Mike

On Fri, Oct 3, 2014 at 9:07 PM, Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.

What? Again?
As I see it, there are still some free bits: 0x200, 0x1000, 0x80000.

I'd prefer to reserve 0x02000000 for VM_ARCH_2.

>
> Signed-off-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/proc/task_mmu.c       | 4 ++--
>  include/linux/huge_mm.h  | 4 ++--
>  include/linux/ksm.h      | 4 ++--
>  include/linux/mm_types.h | 2 +-
>  mm/huge_memory.c         | 2 +-
>  mm/ksm.c                 | 2 +-
>  mm/madvise.c             | 2 +-
>  mm/mremap.c              | 2 +-
>  8 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>         /*
>          * Don't forget to update Documentation/ on changes.
>          */
> -       static const char mnemonics[BITS_PER_LONG][2] = {
> +       static const char mnemonics[BITS_PER_LONG+1][2] = {
>                 /*
>                  * In case if we meet a flag we don't know about.
>                  */
> -               [0 ... (BITS_PER_LONG-1)] = "??",
> +               [0 ... (BITS_PER_LONG)] = "??",
>
>                 [ilog2(VM_READ)]        = "rd",
>                 [ilog2(VM_WRITE)]       = "wr",
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 63579cb..3aa10e0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -121,7 +121,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
>  #error "hugepages can't be allocated by the buddy allocator"
>  #endif
>  extern int hugepage_madvise(struct vm_area_struct *vma,
> -                           unsigned long *vm_flags, int advice);
> +                           vm_flags_t *vm_flags, int advice);
>  extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>                                     unsigned long start,
>                                     unsigned long end,
> @@ -183,7 +183,7 @@ static inline int split_huge_page(struct page *page)
>  #define split_huge_page_pmd_mm(__mm, __address, __pmd) \
>         do { } while (0)
>  static inline int hugepage_madvise(struct vm_area_struct *vma,
> -                                  unsigned long *vm_flags, int advice)
> +                                  vm_flags_t *vm_flags, int advice)
>  {
>         BUG();
>         return 0;
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 3be6bb1..8b35253 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -18,7 +18,7 @@ struct mem_cgroup;
>
>  #ifdef CONFIG_KSM
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags);
> +               unsigned long end, int advice, vm_flags_t *vm_flags);
>  int __ksm_enter(struct mm_struct *mm);
>  void __ksm_exit(struct mm_struct *mm);
>
> @@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
>
>  #ifdef CONFIG_MMU
>  static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags)
> +               unsigned long end, int advice, vm_flags_t *vm_flags)
>  {
>         return 0;
>  }
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e0b286..2c876d1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -217,7 +217,7 @@ struct page_frag {
>  #endif
>  };
>
> -typedef unsigned long __nocast vm_flags_t;
> +typedef unsigned long long __nocast vm_flags_t;
>
>  /*
>   * A region containing a mapping of a non-memory backed file under NOMMU
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d9a21d06..e913a19 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1942,7 +1942,7 @@ out:
>  #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
>
>  int hugepage_madvise(struct vm_area_struct *vma,
> -                    unsigned long *vm_flags, int advice)
> +                    vm_flags_t *vm_flags, int advice)
>  {
>         switch (advice) {
>         case MADV_HUGEPAGE:
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fb75902..faf319e 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
>  }
>
>  int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> -               unsigned long end, int advice, unsigned long *vm_flags)
> +               unsigned long end, int advice, vm_flags_t *vm_flags)
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         int err;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938b30..d5aee71 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
>         struct mm_struct *mm = vma->vm_mm;
>         int error = 0;
>         pgoff_t pgoff;
> -       unsigned long new_flags = vma->vm_flags;
> +       vm_flags_t new_flags = vma->vm_flags;
>
>         switch (behavior) {
>         case MADV_NORMAL:
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 05f1180..fa7db87 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  {
>         struct mm_struct *mm = vma->vm_mm;
>         struct vm_area_struct *new_vma;
> -       unsigned long vm_flags = vma->vm_flags;
> +       vm_flags_t vm_flags = vma->vm_flags;
>         unsigned long new_pgoff;
>         unsigned long moved_len;
>         unsigned long excess = 0;
>

Thread overview: 49+ messages
2014-10-03 17:07 [PATCH 00/17] RFC: userfault v2 Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 01/17] mm: gup: add FOLL_TRIED Andrea Arcangeli
2014-10-03 18:15   ` Linus Torvalds
2014-10-03 20:55     ` Paolo Bonzini
2014-10-03 17:07 ` [PATCH 02/17] mm: gup: add get_user_pages_locked and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 03/17] mm: gup: use get_user_pages_unlocked within get_user_pages_fast Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 04/17] mm: gup: make get_user_pages_fast and __get_user_pages_fast latency conscious Andrea Arcangeli
2014-10-03 18:23   ` Linus Torvalds
2014-10-06 14:14     ` Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 05/17] mm: gup: use get_user_pages_fast and get_user_pages_unlocked Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 06/17] kvm: Faults which trigger IO release the mmap_sem Andrea Arcangeli
2014-10-03 17:07 ` [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-10-07  9:03   ` Kirill A. Shutemov
     [not found]   ` <1412356087-16115-8-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-11-06 20:08     ` Konstantin Khlebnikov
2014-10-03 17:07 ` [PATCH 08/17] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-10-03 23:13   ` Mike Hommey
2014-10-06 17:24     ` Andrea Arcangeli
2014-10-07 10:36   ` Kirill A. Shutemov
2014-10-07 10:46     ` Dr. David Alan Gilbert
2014-10-07 10:52       ` [Qemu-devel] " Kirill A. Shutemov
2014-10-07 11:01         ` Dr. David Alan Gilbert
2014-10-07 11:30           ` Kirill A. Shutemov
2014-10-07 13:24     ` Andrea Arcangeli
2014-10-07 15:21       ` Kirill A. Shutemov
2014-10-03 17:07 ` [PATCH 09/17] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 10/17] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-10-03 18:31   ` Linus Torvalds
2014-10-06  8:55     ` Dr. David Alan Gilbert
2014-10-06 16:41       ` Andrea Arcangeli
2014-10-07 12:47         ` Linus Torvalds
2014-10-07 14:19           ` Andrea Arcangeli
2014-10-07 15:52             ` Andrea Arcangeli
2014-10-07 15:54               ` Andy Lutomirski
     [not found]               ` <20141007155247.GD2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-10-07 16:13                 ` Peter Feiner
     [not found]             ` <20141007141913.GC2342-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-10-07 16:56               ` Linus Torvalds
2014-10-07 17:07           ` Dr. David Alan Gilbert
2014-10-07 17:14             ` Paolo Bonzini
2014-10-07 17:25               ` Dr. David Alan Gilbert
2014-10-07 11:10   ` [Qemu-devel] " Kirill A. Shutemov
2014-10-07 13:37     ` Andrea Arcangeli
     [not found] ` <1412356087-16115-1-git-send-email-aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-10-03 17:08   ` [PATCH 11/17] mm: swp_entry_swapcount Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 12/17] mm: sys_remap_anon_pages Andrea Arcangeli
2014-10-04 13:13   ` Andi Kleen
2014-10-06 17:00     ` Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 13/17] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 14/17] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 15/17] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 16/17] powerpc: add remap_anon_pages and userfaultfd Andrea Arcangeli
2014-10-03 17:08 ` [PATCH 17/17] userfaultfd: implement USERFAULTFD_RANGE_REGISTER|UNREGISTER Andrea Arcangeli
