* [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
@ 2013-05-06 19:56 Andrea Arcangeli
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT Andrea Arcangeli
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 19:56 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

Hello everyone,

this is a patchset to implement two new kernel features:
MADV_USERFAULT and remap_anon_pages.

The combination of the two features is what I would propose to
implement postcopy live migration, and in general demand paging of
remote memory hosted in different cloud nodes with KSM. It might also
be used without virt to offload parts of memory to different nodes
using some userland library and a network memory manager.

Postcopy live migration is currently implemented using a chardevice,
which remains open for the whole VM lifetime; all virtual memory then
becomes owned by the chardevice and is not anonymous anymore.

http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg05274.html

The main drawback of the chardevice design is that all the nice Linux
MM features (like swapping/THP/KSM/automatic-NUMA-balancing) are
disabled if the guest physical memory doesn't remain anonymous
memory. This is entirely solved by this alternative kernel solution.
In fact remap_anon_pages will move THP pages natively, by just
updating two pmd pointers, if alignment and length permit, without
any THP split.

The other bonus is that MADV_USERFAULT and remap_anon_pages are
implemented in the MM core, and remap_anon_pages furthermore provides
functionality similar to what is already available for filebacked
pages with remap_file_pages. That is usually more maintainable than
having MM parts in a chardevice.

In addition to asking for review of the internals, this also needs
review of the user APIs, as both features are userland visible changes.

MADV_USERFAULT is only enabled for anonymous mappings so far, but it
could be extended. To be strict, -EINVAL is returned if it is run on
non-anonymous mappings (where it would currently be a noop).

The remap_anon_pages syscall API is not vectored, as I expect it to
be used for demand paging only (where there can be just one faulting
range per fault) or for large ranges, where vectoring isn't going to
provide performance advantages.

The current behavior of remap_anon_pages is very strict, to avoid any
chance of memory corruption going unnoticed, and it will return
-EFAULT at the first sign of something unexpected (like a page already
mapped in the destination pmd/pte, potentially signaling a userland
thread race condition with two threads userfaulting on the same
destination address). mremap is not strict like that: it would
silently drop the destination range and succeed in such a condition.
So on the API side, I wonder if I should add a flag to
remap_anon_pages to provide non-strict behavior more similar to
mremap. OTOH not providing the permissive mremap behavior may actually
be better, to force userland to be strict and be sure it knows what it
is doing (otherwise it should use mremap in the first place?).
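
To make the intended interaction of the two features concrete, here
is a minimal sketch of the userland flow (illustrative only: the
MADV_USERFAULT value and syscall number are the ones this series
assigns on x86-64 and may change; setup of the signal handler and of
the two areas is omitted):

===
/*
 * Illustrative sketch: dst is armed with madvise(MADV_USERFAULT),
 * src is a private staging area marked madvise(MADV_DONTFORK), and
 * this handler is installed with sigaction(SIGBUS, ...) + SA_SIGINFO.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <signal.h>
#include <unistd.h>

#define MADV_USERFAULT       18
#define SYS_remap_anon_pages 314

static char *dst, *src;

static void sigbus(int signum, siginfo_t *info, void *ctx)
{
	unsigned long off = ((char *) info->si_addr - dst) & ~4095UL;

	/* ... receive the missing page from the remote node into src + off ... */
	if (syscall(SYS_remap_anon_pages, dst + off, src + off, 4096UL))
		_exit(1);
}
===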

Comments welcome, thanks!
Andrea

Andrea Arcangeli (4):
  mm: madvise MADV_USERFAULT
  mm: rmap preparation for remap_anon_pages
  mm: swp_entry_swapcount
  mm: sys_remap_anon_pages

 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/x86/syscalls/syscall_32.tbl       |   1 +
 arch/x86/syscalls/syscall_64.tbl       |   1 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/linux/huge_mm.h                |   6 +
 include/linux/mm.h                     |   1 +
 include/linux/mm_types.h               |   2 +-
 include/linux/swap.h                   |   6 +
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/sys_ni.c                        |   1 +
 mm/fremap.c                            | 440 +++++++++++++++++++++++++++++++++
 mm/huge_memory.c                       | 158 ++++++++++--
 mm/madvise.c                           |  16 ++
 mm/memory.c                            |  10 +
 mm/rmap.c                              |   9 +
 mm/swapfile.c                          |  13 +
 19 files changed, 667 insertions(+), 15 deletions(-)


* [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
@ 2013-05-06 19:56 ` Andrea Arcangeli
  2013-05-07 11:16   ` Andrew Jones
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 2/4] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 19:56 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a sigbus signal is
sent instead of allocating a new page. The sigbus signal handler will
then resolve the page fault in userland by calling the remap_anon_pages
syscall.

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.
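
For illustration, a minimal sketch of how a process would arm and
disarm the feature on an anonymous region (the MADV_USERFAULT and
MADV_NOUSERFAULT values below are the asm-generic ones added by this
patch; error handling kept minimal):

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define MADV_USERFAULT   18
#define MADV_NOUSERFAULT 19

int main(void)
{
	size_t len = 64UL * 1024 * 1024;
	void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		perror("mmap"), exit(1);

	/* arm: touching still unmapped pages now raises SIGBUS */
	if (madvise(p, len, MADV_USERFAULT))
		perror("madvise(MADV_USERFAULT)"), exit(1);

	/* ... resolve faults from a SIGBUS handler via remap_anon_pages ... */

	/* disarm: back to the normal anonymous page fault behavior */
	if (madvise(p, len, MADV_NOUSERFAULT))
		perror("madvise(MADV_NOUSERFAULT)"), exit(1);

	return 0;
}
===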

Here I enlarged the vm_flags to 64bit as we ran out of bits (a noop
on 64bit kernels). An alternative is to find some combination of
flags that are mutually exclusive when set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/alpha/include/uapi/asm/mman.h     |  3 +++
 arch/mips/include/uapi/asm/mman.h      |  3 +++
 arch/parisc/include/uapi/asm/mman.h    |  3 +++
 arch/xtensa/include/uapi/asm/mman.h    |  3 +++
 include/linux/mm.h                     |  1 +
 include/linux/mm_types.h               |  2 +-
 include/uapi/asm-generic/mman-common.h |  3 +++
 mm/huge_memory.c                       | 34 ++++++++++++++++++++++++----------
 mm/madvise.c                           | 16 ++++++++++++++++
 mm/memory.c                            | 10 ++++++++++
 10 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	70		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	71		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c05d7cf..f5a410e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -110,6 +110,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE	0x20000000	/* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE	0x40000000	/* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#define VM_USERFAULT	0x100000000	/* Trigger user faults if not mapped */
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..bed1c7c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -198,7 +198,7 @@ struct page_frag {
 #endif
 };
 
-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529..43f36c0 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -52,6 +52,9 @@
 					   overrides the coredump filter bits */
 #define MADV_DODUMP	17		/* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT	18		/* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19		/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03a89a2..f46aad1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -727,6 +727,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pte_free(mm, pgtable);
 	} else {
 		pmd_t entry;
+
+		/* Deliver the page fault to userland */
+		if (vma->vm_flags & VM_USERFAULT)
+			goto do_sigbus;
+
 		entry = mk_huge_pmd(page, vma);
 		page_add_new_anon_rmap(page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -737,6 +742,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	}
 
 	return 0;
+do_sigbus:
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_SIGBUS;
 }
 
 static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
@@ -761,20 +769,17 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif
 
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
 		struct page *zero_page)
 {
 	pmd_t entry;
-	if (!pmd_none(*pmd))
-		return false;
 	entry = mk_pmd(zero_page, vma->vm_page_prot);
 	entry = pmd_wrprotect(entry);
 	entry = pmd_mkhuge(entry);
 	set_pmd_at(mm, haddr, pmd, entry);
 	pgtable_trans_huge_deposit(mm, pgtable);
 	mm->nr_ptes++;
-	return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -794,6 +799,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				transparent_hugepage_use_zero_page()) {
 			pgtable_t pgtable;
 			struct page *zero_page;
+			int ret;
 			bool set;
 			pgtable = pte_alloc_one(mm, haddr);
 			if (unlikely(!pgtable))
@@ -805,14 +811,24 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				goto out;
 			}
 			spin_lock(&mm->page_table_lock);
-			set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-					zero_page);
+			ret = 0;
+			set = false;
+			if (pmd_none(*pmd)) {
+				if (vma->vm_flags & VM_USERFAULT)
+					ret = VM_FAULT_SIGBUS;
+				else {
+					set_huge_zero_page(pgtable, mm, vma,
+							   haddr, pmd,
+							   zero_page);
+					set = true;
+				}
+			}
 			spin_unlock(&mm->page_table_lock);
 			if (!set) {
 				pte_free(mm, pgtable);
 				put_huge_zero_page();
 			}
-			return 0;
+			return ret;
 		}
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
 					  vma, haddr, numa_node_id(), 0);
@@ -886,16 +902,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (is_huge_zero_pmd(pmd)) {
 		struct page *zero_page;
-		bool set;
 		/*
 		 * get_huge_zero_page() will never allocate a new page here,
 		 * since we already have a zero page to copy. It just takes a
 		 * reference.
 		 */
 		zero_page = get_huge_zero_page();
-		set = set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+		set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
 				zero_page);
-		BUG_ON(!set); /* unexpected !pmd_none(dst_pmd) */
 		ret = 0;
 		goto out_unlock;
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 7055883..2ecab73 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct * vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_USERFAULT:
+		if (vma->vm_ops) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_USERFAULT;
+		break;
+	case MADV_NOUSERFAULT:
+		if (vma->vm_ops) {
+			WARN_ON(new_flags & VM_USERFAULT);
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_USERFAULT;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -405,6 +420,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 #endif
+	case MADV_USERFAULT:
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
 		return 1;
diff --git a/mm/memory.c b/mm/memory.c
index f7a1fba..044a57c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3235,6 +3235,9 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 		if (!pte_none(*page_table))
 			goto unlock;
+		/* Deliver the page fault to userland, check inside PT lock */
+		if (vma->vm_flags & VM_USERFAULT)
+			goto sigbus;
 		goto setpte;
 	}
 
@@ -3262,6 +3265,10 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte_none(*page_table))
 		goto release;
 
+	/* Deliver the page fault to userland, check inside PT lock */
+	if (vma->vm_flags & VM_USERFAULT)
+		goto sigbus;
+
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
 setpte:
@@ -3280,6 +3287,9 @@ oom_free_page:
 	page_cache_release(page);
 oom:
 	return VM_FAULT_OOM;
+sigbus:
+	pte_unmap_unlock(page_table, ptl);
+	return VM_FAULT_SIGBUS;
 }
 
 /*


* [Qemu-devel] [PATCH 2/4] mm: rmap preparation for remap_anon_pages
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT Andrea Arcangeli
@ 2013-05-06 19:56 ` Andrea Arcangeli
  2013-05-06 19:57 ` [Qemu-devel] [PATCH 3/4] mm: swp_entry_swapcount Andrea Arcangeli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 19:56 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

remap_anon_pages (unlike remap_file_pages) tries to be non-intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters
the page->mapping and page->index. It does so while holding the page
lock. However there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places doing rmap
walks without taking the page lock first must be updated to re-check
that the page->mapping didn't change after they obtained the anon_vma
lock. remap_anon_pages takes the anon_vma lock for writing before
altering the page->mapping, so if the page->mapping is still the same
after obtaining the anon_vma lock (without the page lock), the rmap
walks can go ahead safely (and remap_anon_pages will wait for them to
complete before proceeding).

remap_anon_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_anon_pages holds the mmap_sem for reading.

Overall this is a fairly small change to the rmap code, notably less
intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped in only one
vma, but this is not a limitation when it is used to handle userland
page faults with MADV_USERFAULT. The source addresses passed to
remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK, to
avoid any risk of the mapcount of the pages increasing if fork runs
in parallel in another thread, before or while remap_anon_pages runs.
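
For illustration, a sketch (not part of the patch) of how the source
staging area would typically be set up to satisfy this constraint:

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Source staging buffer: private anonymous memory, marked
 * MADV_DONTFORK so that a concurrent fork() cannot raise the
 * mapcount of the pages while remap_anon_pages moves them away.
 */
static void *alloc_staging(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
		       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		perror("mmap"), exit(1);
	if (madvise(p, len, MADV_DONTFORK))
		perror("madvise(MADV_DONTFORK)"), exit(1);
	return p;
}
===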

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 24 ++++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f46aad1..9a2e235 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1824,6 +1824,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1835,10 +1836,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_anon_pages_huge_pmd can change the anon_vma
+		 * from under us until we obtain the anon_vma
+		 * lock. Verify that we obtained the anon_vma lock
+		 * before remap_anon_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
@@ -2294,6 +2309,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 * remap_anon_pages is prevented to race as well by the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 6280da8..86d007d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -486,6 +487,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_anon_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock


* [Qemu-devel] [PATCH 3/4] mm: swp_entry_swapcount
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT Andrea Arcangeli
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 2/4] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
@ 2013-05-06 19:57 ` Andrea Arcangeli
  2013-05-06 19:57 ` [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages Andrea Arcangeli
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 19:57 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

Provide a new swapfile method for remap_anon_pages to verify that the
swap entry is mapped in only one vma before relocating the swap entry
to a different virtual address. Otherwise, if the swap entry is
mapped in multiple vmas, the page could get mapped in a non-linear
way in some anon_vma when it is swapped back in.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/swap.h |  6 ++++++
 mm/swapfile.c        | 13 +++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..0ea2a56 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -389,6 +389,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -489,6 +490,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+	return 0;
+}
+
 #define reuse_swap_page(page)	(page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d417efd..2772382 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -671,6 +671,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+	int count = 0;
+	struct swap_info_struct *p;
+
+	p = swap_info_get(entry);
+	if (p) {
+		count = swap_count(p->swap_map[swp_offset(entry)]);
+		spin_unlock(&p->lock);
+	}
+	return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content


* [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
                   ` (2 preceding siblings ...)
  2013-05-06 19:57 ` [Qemu-devel] [PATCH 3/4] mm: swp_entry_swapcount Andrea Arcangeli
@ 2013-05-06 19:57 ` Andrea Arcangeli
  2013-05-06 20:01   ` Andrea Arcangeli
  2013-05-07 10:07 ` [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Isaku Yamahata
  2013-05-07 11:38 ` Andrew Jones
  5 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 19:57 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non-shared anonymous pages because those can be
relocated without generating non-linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will normally have
VM_DONTCOPY set with madvise(MADV_DONTFORK).

MADV_DONTFORK set on the source vma prevents remap_anon_pages from
failing if the process forks during the userland page fault.

The thread triggering the sigbus signal handler by touching an
unmapped hole in the MADV_USERFAULT region should take care to
receive the data belonging to the faulting virtual address into the
source vma. The data can come from the network, storage or any other
I/O device. After the data has been safely received into the private
area in the source vma, the thread will call remap_anon_pages to
atomically map the page at the faulting address in the destination
vma. And finally it will return from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical between the
source and destination vmas.

It can remap non-shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail with -EFAULT. This provides a very strict
behavior to avoid any chance of memory corruption going unnoticed if
there are userland race conditions. Only one thread should resolve
the userland page fault at any given time for any given faulting
address. This means that if two threads both try to call
remap_anon_pages on the same destination address at the same time,
the second thread will get an explicit -EFAULT retval from this
syscall.

The destination range with VM_USERFAULT set should be completely
empty or remap_anon_pages will fail with -EFAULT. It's recommended to
call madvise(MADV_USERFAULT) immediately after the destination range
has been allocated with malloc() or posix_memalign(), so that the
VM_USERFAULT vma will be split before a transparent hugepage fault
could fill the VM_USERFAULT region if it doesn't start hugepage
aligned. That will ensure the VM_USERFAULT area remains empty after
allocation, regardless of its alignment.

The main difference from mremap is that, when used to fill holes in
unmapped anonymous memory vmas (in combination with MADV_USERFAULT),
remap_anon_pages won't create lots of unmergeable vmas. mremap
instead would create lots of vmas (because of the non-linear
vma->vm_pgoff), leading to -ENOMEM failures (the number of vmas is
limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===

static unsigned char *c, *tmp;

void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
{
	unsigned char *addr = info->si_addr;
	int len = 4096;
	int ret;

	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
	len = 2*1024*1024;
	if (addr >= c && addr < c + SIZE) {
		unsigned long offset = addr - c;
		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len);
		if (ret)
			perror("sigbus remap_anon_pages"), exit(1);
		//printf("sigbus offset %lu\n", offset);
		return;
	}

	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
}

int main()
{
	struct sigaction sa;
	int ret;
	unsigned long i;
	/*
	 * Fails with THP due lack of alignment because of memset
	 * pre-filling the destination
	 */
	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (c == MAP_FAILED)
		perror("mmap"), exit(1);
	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (tmp == MAP_FAILED)
		perror("mmap"), exit(1);
	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
	if (ret)
		perror("posix_memalign"), exit(1);
	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
	if (ret)
		perror("posix_memalign"), exit(1);
	/*
	 * MADV_USERFAULT must run before memset, to avoid THP 2m
	 * faults to map memory into "tmp", if "tmp" isn't allocated
	 * with hugepage alignment.
	 */
	if (madvise(c, SIZE, MADV_USERFAULT))
		perror("madvise"), exit(1);
	memset(tmp, 0xaa, SIZE);

	sa.sa_sigaction = userfault_sighandler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE);
	if (ret)
		perror("remap_anon_pages"), exit(1);

	for (i = 0; i < SIZE; i += 4096) {
		if ((i/4096) % 2) {
			/* exercise read and write MADV_USERFAULT */
			c[i+1] = 0xbb;
		}
		if (c[i] != 0xaa)
			printf("error %x offset %lu\n", c[i], i), exit(1);
	}

	return 0;
}
===

For postcopy live migration to use a single network socket, the vcpu
thread triggering the fault will normally use some interprocess
communication with the thread doing the postcopy background transfer
to request a specific address, and it'll wait for an ack before
returning from the signal handler. The thread receiving the faulting
address will also resolve the concurrency of the sigbus handler
potentially firing simultaneously on the same address in different
vcpu threads (so that remap_anon_pages is called only once for each
range received). So remap_anon_pages in the above testcase runs
within the signal handler, but in production postcopy it can run in a
different thread.
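
A sketch (not part of the patch) of how the sigbus handler could
forward the fault to such a dedicated postcopy thread over a pipe and
wait for an ack, keeping the handler restricted to async-signal-safe
calls; the pipe setup and the postcopy thread itself are omitted, and
a real implementation would need per-thread ack channels (or a futex)
so concurrent vcpu faults don't consume each other's acks:

===
#define _GNU_SOURCE
#include <signal.h>
#include <unistd.h>

static int req_pipe[2];	/* handler -> postcopy thread: faulting address */
static int ack_pipe[2];	/* postcopy thread -> handler: page is mapped */

static void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
{
	unsigned long addr = (unsigned long) info->si_addr;
	char ack;

	/* write() and read() are async-signal-safe */
	if (write(req_pipe[1], &addr, sizeof(addr)) != sizeof(addr))
		_exit(1);
	/* block until the postcopy thread ran remap_anon_pages for addr */
	if (read(ack_pipe[0], &ack, 1) != 1)
		_exit(1);
}

/*
 * The postcopy thread reads addresses from req_pipe[0], deduplicates
 * faults on the same range, receives the page into the staging area,
 * calls remap_anon_pages once per range, and then writes one ack byte
 * per pending request.
 */
===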

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/huge_mm.h          |   6 +
 include/linux/syscalls.h         |   3 +
 kernel/sys_ni.c                  |   1 +
 mm/fremap.c                      | 440 +++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c                 | 100 +++++++++
 7 files changed, 552 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index e6d55f0..cd2f186 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -357,3 +357,4 @@
 348	i386	process_vm_writev	sys_process_vm_writev		compat_sys_process_vm_writev
 349	i386	kcmp			sys_kcmp
 350	i386	finit_module		sys_finit_module
+351	i386	remap_anon_pages	sys_remap_anon_pages
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..ac240fd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
 311	64	process_vm_writev	sys_process_vm_writev
 312	common	kcmp			sys_kcmp
 313	common	finit_module		sys_finit_module
+314	common	remap_anon_pages	sys_remap_anon_pages
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 528454c..f9edf11 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,12 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, pgprot_t newprot,
 			int prot_numa);
+extern int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     struct vm_area_struct *dst_vma,
+				     struct vm_area_struct *src_vma,
+				     unsigned long dst_addr,
+				     unsigned long src_addr);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 313a8e0..00a4781 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -486,6 +486,9 @@ asmlinkage long sys_mremap(unsigned long addr,
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_remap_anon_pages(unsigned long dst_start,
+				     unsigned long src_start,
+				     unsigned long len);
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags);
 asmlinkage long sys_fadvise64(int fd, loff_t offset, size_t len, int advice);
 asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 395084d..3d401eb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -172,6 +172,7 @@ cond_syscall(sys_mincore);
 cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
+cond_syscall(sys_remap_anon_pages);
 cond_syscall(compat_sys_move_pages);
 cond_syscall(compat_sys_migrate_pages);
 
diff --git a/mm/fremap.c b/mm/fremap.c
index 87da359..e4018a2 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -257,3 +257,443 @@ out:
 
 	return err;
 }
+
+static void double_pt_lock(spinlock_t *ptl1,
+			   spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+static void double_pt_unlock(spinlock_t *ptl1,
+			     spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+}
+
+/*
+ * The page_table_lock and mmap_sem for reading are held by the
+ * caller. Just move the page from src_pmd to dst_pmd if possible,
+ * and return 0 if it succeeded in moving the page.
+ */
+static int remap_anon_pages_pte(struct mm_struct *mm,
+				pte_t *dst_pte, pte_t *src_pte, pmd_t *src_pmd,
+				struct vm_area_struct *dst_vma,
+				struct vm_area_struct *src_vma,
+				unsigned long dst_addr,
+				unsigned long src_addr,
+				spinlock_t *dst_ptl,
+				spinlock_t *src_ptl)
+{
+	struct page *src_page;
+	swp_entry_t entry;
+	pte_t orig_src_pte, orig_dst_pte;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	spin_lock(dst_ptl);
+	orig_dst_pte = *dst_pte;
+	spin_unlock(dst_ptl);
+	if (!pte_none(orig_dst_pte))
+		return -EFAULT;
+
+	spin_lock(src_ptl);
+	orig_src_pte = *src_pte;
+	spin_unlock(src_ptl);
+	if (pte_none(orig_src_pte))
+		return -EFAULT;
+
+	if (pte_present(orig_src_pte)) {
+		/*
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
+		 */
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, *src_pte)) {
+			spin_unlock(src_ptl);
+			return -EAGAIN;
+		}
+		src_page = vm_normal_page(src_vma, src_addr, orig_src_pte);
+		if (!src_page || !PageAnon(src_page) ||
+		    page_mapcount(src_page) != 1) {
+			spin_unlock(src_ptl);
+			return -EFAULT;
+		}
+
+		get_page(src_page);
+		spin_unlock(src_ptl);
+
+		/* block all concurrent rmap walks */
+		lock_page(src_page);
+
+		/*
+		 * page_referenced_anon walks the anon_vma chain
+		 * without the page lock. Serialize against it with
+		 * the anon_vma lock, the page lock is not enough.
+		 */
+		src_anon_vma = page_get_anon_vma(src_page);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+		anon_vma_lock_write(src_anon_vma);
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    page_mapcount(src_page) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			anon_vma_unlock_write(src_anon_vma);
+			put_anon_vma(src_anon_vma);
+			unlock_page(src_page);
+			put_page(src_page);
+			return -EAGAIN;
+		}
+
+		BUG_ON(!PageAnon(src_page));
+		/* the PT lock is enough to keep the page pinned now */
+		put_page(src_page);
+
+		dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+		ACCESS_ONCE(src_page->mapping) = ((struct address_space *)
+						  dst_anon_vma);
+		ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma,
+								 dst_addr);
+
+		ptep_clear_flush(src_vma, src_addr, src_pte);
+
+		orig_dst_pte = mk_pte(src_page, dst_vma->vm_page_prot);
+		orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
+					     dst_vma);
+
+		set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+
+		/* unblock rmap walks */
+		unlock_page(src_page);
+
+		mmu_notifier_invalidate_page(mm, src_addr);
+	} else {
+		if (pte_file(orig_src_pte))
+			return -EFAULT;
+
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (non_swap_entry(entry)) {
+			if (is_migration_entry(entry)) {
+				migration_entry_wait(mm, src_pmd, src_addr);
+				return -EAGAIN;
+			}
+			return -EFAULT;
+		}
+
+		if (swp_entry_swapcount(entry) != 1)
+			return -EFAULT;
+
+		double_pt_lock(dst_ptl, src_ptl);
+
+		if (!pte_same(*src_pte, orig_src_pte) ||
+		    !pte_same(*dst_pte, orig_dst_pte) ||
+		    swp_entry_swapcount(entry) != 1) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
+
+		if (pte_val(ptep_get_and_clear(mm, src_addr, src_pte)) !=
+		    pte_val(orig_src_pte))
+			BUG();
+		set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
+
+		double_pt_unlock(dst_ptl, src_ptl);
+	}
+
+	return 0;
+}
+
+pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd = NULL;
+
+	pgd = pgd_offset(mm, address);
+	pud = pud_alloc(mm, pgd, address);
+	if (pud)
+		pmd = pmd_alloc(mm, pud, address);
+	return pmd;
+}
+
+/**
+ * sys_remap_anon_pages - remap arbitrary anonymous pages of an existing vma
+ * @dst_start: start of the destination virtual memory range
+ * @src_start: start of the source virtual memory range
+ * @len: length of the virtual memory range
+ *
+ * sys_remap_anon_pages remaps arbitrary anonymous pages atomically in
+ * zero copy. It only works on non shared anonymous pages because
+ * those can be relocated without generating non linear anon_vmas in
+ * the rmap code.
+ *
+ * It is the ideal mechanism to handle userspace page faults. Normally
+ * the destination vma will have VM_USERFAULT set with
+ * madvise(MADV_USERFAULT) while the source vma will have VM_DONTCOPY
+ * set with madvise(MADV_DONTFORK).
+ *
+ * The thread receiving the page during the userland page fault
+ * (MADV_USERFAULT) (inside the sigbus signal handler) will receive
+ * the faulting page in the source vma through the network, storage or
+ * any other I/O device (MADV_DONTFORK in the source vma prevents
+ * remap_anon_pages from failing if the process forks during the userland
+ * page fault), then it will call remap_anon_pages to map the page in
+ * the faulting address in the destination vma and finally it will
+ * return from the signal handler.
+ *
+ * This syscall works purely via pagetables, so it's the most
+ * efficient way to move physical non shared anonymous pages across
+ * different virtual addresses. Unlike mremap()/mmap()/munmap() it
+ * does not create any new vmas. The mapping in the destination
+ * address is atomic.
+ *
+ * It only works if the vma protection bits are identical between the
+ * source and destination vmas.
+ *
+ * It can remap non shared anonymous pages within the same vma too.
+ *
+ * If the source virtual memory range has any unmapped holes, or if
+ * the destination virtual memory range is not a whole unmapped hole,
+ * remap_anon_pages will fail with -EFAULT. This provides a very
+ * strict behavior to avoid any chance of memory corruption going
+ * unnoticed if there are userland race conditions. Only one thread
+ * should resolve the userland page fault at any given time for any
+ * given faulting address. This means that if two threads try to both
+ * call remap_anon_pages on the same destination address at the same
+ * time, the second thread will get an explicit -EFAULT retval from
+ * this syscall.
+ *
+ * The destination range with VM_USERFAULT set should be completely
+ * empty or remap_anon_pages will fail with -EFAULT. It's recommended
+ * to call madvise(MADV_USERFAULT) immediately after the destination
+ * range has been allocated with malloc() or posix_memalign(), so that
+ * the VM_USERFAULT vma will be split before a transparent hugepage
+ * fault could fill the VM_USERFAULT region. That will ensure the
+ * VM_USERFAULT area remains empty after allocation, regardless of its
+ * alignment.
+ *
+ * If there's any rmap walk that is taking the anon_vma locks without
+ * first obtaining the page lock (for example split_huge_page and
+ * page_referenced_anon), they will have to verify if the
+ * page->mapping has changed after taking the anon_vma lock. If it
+ * changed they should release the lock and retry obtaining a new
+ * anon_vma, because it means the anon_vma was changed by
+ * remap_anon_pages before the lock could be obtained. This is the
+ * only additional complexity added to the rmap code to provide this
+ * anonymous page remapping functionality.
+ */
+SYSCALL_DEFINE3(remap_anon_pages,
+		unsigned long, dst_start, unsigned long, src_start,
+		unsigned long, len)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *src_vma, *dst_vma;
+	int err = -EINVAL;
+	pmd_t *src_pmd, *dst_pmd;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	unsigned long src_addr, dst_addr;
+	int thp_aligned = -1;
+
+	/*
+	 * Sanitize the syscall parameters:
+	 */
+	src_start &= PAGE_MASK;
+	dst_start &= PAGE_MASK;
+	len &= PAGE_MASK;
+
+	/* Does the address range wrap, or is the span zero-sized? */
+	if (unlikely(src_start + len <= src_start))
+		return err;
+	if (unlikely(dst_start + len <= dst_start))
+		return err;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * Make sure the vma is not shared, that the src and dst remap
+	 * ranges are both valid and fully within a single existing
+	 * vma.
+	 */
+	src_vma = find_vma(mm, src_start);
+	if (!src_vma || (src_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (src_start < src_vma->vm_start ||
+	    src_start + len > src_vma->vm_end)
+		goto out;
+
+	dst_vma = find_vma(mm, dst_start);
+	if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
+		goto out;
+	if (dst_start < dst_vma->vm_start ||
+	    dst_start + len > dst_vma->vm_end)
+		goto out;
+
+	if (pgprot_val(src_vma->vm_page_prot) !=
+	    pgprot_val(dst_vma->vm_page_prot))
+		goto out;
+
+	/* only allow remapping if both are mlocked or both aren't */
+	if ((src_vma->vm_flags & VM_LOCKED) ^ (dst_vma->vm_flags & VM_LOCKED))
+		goto out;
+
+	err = 0;
+	for (src_addr = src_start, dst_addr = dst_start;
+	     src_addr < src_start + len; ) {
+		BUG_ON(dst_addr >= dst_start + len);
+		src_pmd = mm_find_pmd(mm, src_addr);
+		if (unlikely(!src_pmd)) {
+			err = -EFAULT;
+			break;
+		}
+		dst_pmd = mm_alloc_pmd(mm, dst_addr);
+		if (unlikely(!dst_pmd)) {
+			err = -ENOMEM;
+			break;
+		}
+		if (pmd_trans_huge_lock(src_pmd, src_vma) == 1) {
+			/*
+			 * If the dst_pmd is mapped as THP don't
+			 * override it and just be strict.
+			 */
+			if (unlikely(pmd_trans_huge(*dst_pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				err = -EFAULT;
+				break;
+			}
+
+			/*
+			 * Check if we can move the pmd without
+			 * splitting it. First check the address
+			 * alignment to be the same in src/dst.
+			 */
+			if (thp_aligned == -1)
+				thp_aligned = ((src_addr & ~HPAGE_PMD_MASK) ==
+					       (dst_addr & ~HPAGE_PMD_MASK));
+			if (!thp_aligned || (src_addr & ~HPAGE_PMD_MASK) ||
+			    !pmd_none(*dst_pmd)) {
+				spin_unlock(&mm->page_table_lock);
+				/* Fall through */
+				split_huge_page_pmd(src_vma, src_addr,
+						    src_pmd);
+			} else {
+				BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
+				err = remap_anon_pages_huge_pmd(mm,
+								dst_pmd,
+								src_pmd,
+								dst_vma,
+								src_vma,
+								dst_addr,
+								src_addr);
+				cond_resched();
+
+				if ((!err || err == -EAGAIN) &&
+				    signal_pending(current)) {
+					err = -EINTR;
+					break;
+				}
+
+				if (err == -EAGAIN)
+					continue;
+				else if (err)
+					break;
+
+				dst_addr += HPAGE_PMD_SIZE;
+				src_addr += HPAGE_PMD_SIZE;
+				continue;
+			}
+		}
+
+		/*
+		 * We held the mmap_sem for reading so MADV_DONTNEED
+		 * can zap transparent huge pages under us, or the
+		 * transparent huge page fault can establish new
+		 * transparent huge pages under us. Be strict in that
+		 * case. This also means that unmapped holes in the
+		 * source address range will lead to returning
+		 * -EFAULT.
+		 */
+		if (unlikely(pmd_trans_unstable(src_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (unlikely(pmd_none(*dst_pmd)) &&
+		    unlikely(__pte_alloc(mm, dst_vma, dst_pmd,
+					 dst_addr))) {
+			err = -ENOMEM;
+			break;
+		}
+		/* If an huge pmd materialized from under us fail */
+		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+			err = -EFAULT;
+			break;
+		}
+
+		BUG_ON(pmd_none(*dst_pmd));
+		BUG_ON(pmd_none(*src_pmd));
+		BUG_ON(pmd_trans_huge(*dst_pmd));
+		BUG_ON(pmd_trans_huge(*src_pmd));
+
+		dst_pte = pte_offset_map(dst_pmd, dst_addr);
+		src_pte = pte_offset_map(src_pmd, src_addr);
+		dst_ptl = pte_lockptr(mm, dst_pmd);
+		src_ptl = pte_lockptr(mm, src_pmd);
+
+		err = remap_anon_pages_pte(mm,
+					   dst_pte, src_pte, src_pmd,
+					   dst_vma, src_vma,
+					   dst_addr, src_addr,
+					   dst_ptl, src_ptl);
+
+		pte_unmap(dst_pte);
+		pte_unmap(src_pte);
+		cond_resched();
+
+		if ((!err || err == -EAGAIN) &&
+		    signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
+
+		if (err == -EAGAIN)
+			continue;
+		else if (err)
+			break;
+
+		dst_addr += PAGE_SIZE;
+		src_addr += PAGE_SIZE;
+	}
+
+out:
+	up_read(&mm->mmap_sem);
+	return err;
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a2e235..2979052 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1480,6 +1480,106 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 }
 
 /*
+ * The page_table_lock and mmap_sem for reading are held by the
+ * caller, but it must return after releasing the
+ * page_table_lock. Just move the page from src_pmd to dst_pmd if
+ * possible. Return zero if succeeded in moving the page, -EAGAIN if
+ * it needs to be repeated by the caller, or other errors in case of
+ * failure.
+ */
+int remap_anon_pages_huge_pmd(struct mm_struct *mm,
+			      pmd_t *dst_pmd, pmd_t *src_pmd,
+			      struct vm_area_struct *dst_vma,
+			      struct vm_area_struct *src_vma,
+			      unsigned long dst_addr,
+			      unsigned long src_addr)
+{
+	pmd_t orig_src_pmd, orig_dst_pmd;
+	struct page *src_page;
+	struct anon_vma *src_anon_vma, *dst_anon_vma;
+
+	BUG_ON(!pmd_trans_huge(*src_pmd));
+	BUG_ON(pmd_trans_splitting(*src_pmd));
+	BUG_ON(!pmd_none(*dst_pmd));
+	BUG_ON(!spin_is_locked(&mm->page_table_lock));
+	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+
+	orig_src_pmd = *src_pmd;
+	orig_dst_pmd = *dst_pmd;
+
+	src_page = pmd_page(orig_src_pmd);
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	if (unlikely(page_mapcount(src_page) != 1)) {
+		spin_unlock(&mm->page_table_lock);
+		return -EFAULT;
+	}
+
+	get_page(src_page);
+	spin_unlock(&mm->page_table_lock);
+
+	mmu_notifier_invalidate_range_start(mm, src_addr,
+					    src_addr + HPAGE_PMD_SIZE);
+
+	/* block all concurrent rmap walks */
+	lock_page(src_page);
+
+	/*
+	 * split_huge_page walks the anon_vma chain without the page
+	 * lock. Serialize against it with the anon_vma lock, the page
+	 * lock is not enough.
+	 */
+	src_anon_vma = page_get_anon_vma(src_page);
+	if (!src_anon_vma) {
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+	anon_vma_lock_write(src_anon_vma);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*src_pmd, orig_src_pmd) ||
+		     !pmd_same(*dst_pmd, orig_dst_pmd) ||
+		     page_mapcount(src_page) != 1)) {
+		spin_unlock(&mm->page_table_lock);
+		anon_vma_unlock_write(src_anon_vma);
+		put_anon_vma(src_anon_vma);
+		unlock_page(src_page);
+		put_page(src_page);
+		mmu_notifier_invalidate_range_end(mm, src_addr,
+						  src_addr + HPAGE_PMD_SIZE);
+		return -EAGAIN;
+	}
+
+	BUG_ON(!PageHead(src_page));
+	BUG_ON(!PageAnon(src_page));
+	/* the PT lock is enough to keep the page pinned now */
+	put_page(src_page);
+
+	dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON;
+	ACCESS_ONCE(src_page->mapping) = (struct address_space *) dst_anon_vma;
+	ACCESS_ONCE(src_page->index) = linear_page_index(dst_vma, dst_addr);
+
+	if (pmd_val(pmdp_clear_flush(src_vma, src_addr, src_pmd)) !=
+	    pmd_val(orig_src_pmd))
+		BUG();
+	set_pmd_at(mm, dst_addr, dst_pmd, mk_huge_pmd(src_page, dst_vma));
+	spin_unlock(&mm->page_table_lock);
+
+	anon_vma_unlock_write(src_anon_vma);
+	put_anon_vma(src_anon_vma);
+
+	/* unblock rmap walks */
+	unlock_page(src_page);
+
+	mmu_notifier_invalidate_range_end(mm, src_addr,
+					  src_addr + HPAGE_PMD_SIZE);
+	return 0;
+}
+
+/*
  * Returns 1 if a given pmd maps a stable (not under splitting) thp.
  * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
  *


* Re: [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages
  2013-05-06 19:57 ` [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages Andrea Arcangeli
@ 2013-05-06 20:01   ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-06 20:01 UTC (permalink / raw)
  To: qemu-devel, linux-kernel
  Cc: Anthony Liguori, Orit Wasserman, Juan Quintela, Hugh Dickins,
	Isaku Yamahata, Mel Gorman, Paolo Bonzini

On Mon, May 06, 2013 at 09:57:01PM +0200, Andrea Arcangeli wrote:
> ===
> 
> static unsigned char *c, *tmp;
> 
> void userfault_sighandler(int signum, siginfo_t *info, void *ctx)

oops, the lines starting with '#' in the test program got cut... so I
append it below again, which is also nicer without the leading
whitespace.

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <pthread.h>
#include <strings.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define USE_USERFAULT
#define THP

#define MADV_USERFAULT	18

#define SIZE (1024*1024*1024)

#define SYS_remap_anon_pages 314

static unsigned char *c, *tmp;

void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
{
	unsigned char *addr = info->si_addr;
	int len = 4096;
	int ret;

#ifdef THP
	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
	len = 2*1024*1024;
#endif
	if (addr >= c && addr < c + SIZE) {
		unsigned long offset = addr - c;
		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len);
		if (ret)
			perror("sigbus remap_anon_pages"), exit(1);
		//printf("sigbus offset %lu\n", offset);
		return;
	}

	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
}

int main()
{
	struct sigaction sa;
	int ret;
	unsigned long i;
#ifndef THP
	/*
	 * Fails with THP due lack of alignment because of memset
	 * pre-filling the destination
	 */
	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (c == MAP_FAILED)
		perror("mmap"), exit(1);
	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (tmp == MAP_FAILED)
		perror("mmap"), exit(1);
#else
	ret = posix_memalign((void **)&c, 2*1024*1024, SIZE);
	if (ret)
		perror("posix_memalign"), exit(1);
	ret = posix_memalign((void **)&tmp, 2*1024*1024, SIZE);
	if (ret)
		perror("posix_memalign"), exit(1);
#endif
	/*
	 * MADV_USERFAULT must run before memset, to avoid THP 2m
	 * faults to map memory into "tmp", if "tmp" isn't allocated
	 * with hugepage alignment.
	 */
	if (madvise(c, SIZE, MADV_USERFAULT))
		perror("madvise"), exit(1);
	memset(tmp, 0xaa, SIZE);

	sa.sa_sigaction = userfault_sighandler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

#ifndef USE_USERFAULT
	ret = syscall(SYS_remap_anon_pages, c, tmp, SIZE);
	if (ret)
		perror("remap_anon_pages"), exit(1);
#endif

	for (i = 0; i < SIZE; i += 4096) {
		if ((i/4096) % 2) {
			/* exercise read and write MADV_USERFAULT */
			c[i+1] = 0xbb;
		}
		if (c[i] != 0xaa)
			printf("error %x offset %lu\n", c[i], i), exit(1);
	}

	return 0;
}
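
For reference, the test builds with a plain C compiler (e.g. gcc
-Wall, file name up to you) and must run on a kernel with this series
applied, since it hardcodes MADV_USERFAULT=18 and syscall number 314;
with SIZE set to 1G it also needs enough free memory for the two test
areas.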


* Re: [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
                   ` (3 preceding siblings ...)
  2013-05-06 19:57 ` [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages Andrea Arcangeli
@ 2013-05-07 10:07 ` Isaku Yamahata
  2013-05-07 12:09   ` Andrea Arcangeli
  2013-05-07 11:38 ` Andrew Jones
  5 siblings, 1 reply; 12+ messages in thread
From: Isaku Yamahata @ 2013-05-07 10:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini

On Mon, May 06, 2013 at 09:56:57PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
> 
> this is a patchset to implement two new kernel features:
> MADV_USERFAULT and remap_anon_pages.
> 
> The combination of the two features are what I would propose to
> implement postcopy live migration, and in general demand paging of
> remote memory, hosted in different cloud nodes with KSM. It might also
> be used without virt to offload parts of memory to different nodes
> using some userland library and a network memory manager.

Interesting. The API you are proposing handles only the userland
fault case. What do you think about the kernel case, i.e. when the
KVM kernel module issues get_user_pages()? Exit to qemu with a
dedicated reason?


> Postcopy live migration is currently implemented using a chardevice,
> which remains open for the whole VM lifetime and all virtual memory
> then becomes owned by the chardevice and it's not anonymous anymore.
> 
> http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg05274.html
> 
> The main cons of the chardevice design is that all nice Linux MM
> features (like swapping/THP/KSM/automatic-NUMA-balancing) are disabled
> if the guest physical memory doesn't remain in anonymous memory. This
> is entirely solved by this alternative kernel solution. In fact
> remap_anon_pages will move THP pages natively by just updating two pmd
> pointers if alignment and length permits without any THP split.
> 
> The other bonus is that MADV_USERFAULT and remap_anon_pages are
> implemented in the MM core and remap_anon_pages furthermore provides a
> functionality similar to what is already available for filebacked
> pages with remap_file_pages. That is usually more maintainable than
> having MM parts in a chardevice.
> 
> In addition to asking review of the internals, this also need review
> the user APIs, as both those features are userland visible changes.
> 
> MADV_USERFAULT is only enabled for anonymous mappings so far but it
> could be extended. To be strict, -EINVAL is returned if run on non
> anonymous mappings (where it would currently be a noop).
> 
> The remap_anon_pages syscall API is not vectored, as I expect it used
> for demand paging only (where there can be just one faulting range per
> fault) or for large ranges where vectoring isn't going to provide
> performance advantages.

In the case of the precopy + postcopy optimization, the dirty bitmap
is sent after the precopy phase and then the clean pages are
populated. In this population phase, a vectored API could be
utilized. I'm not sure how much a vectored API would contribute to
shortening the VM-switch time, though.


> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed, and it will return
> -EFAULT at the first sign of something unexpected (like a page already
> mapped in the destination pmd/pte, potentially signaling an userland
> thread race condition with two threads userfaulting on the same
> destination address). mremap is not strict like that: it would drop
> the destination range silently and it would succeed in such a
> condition. So on the API side, I wonder if I should add a flag to
> remap_anon_pages to provide non-strict behavior more similar to
> mremap. OTOH not providing the permissive mremap behavior may actually
> be better to force userland to be strict and be sure it knows what it
> is doing (otherwise it should use mremap in the first place?).

It would be desirable to avoid complex things in the signal handler,
like sending page requests to the remote side and receiving pages
from it. So the signal handler would just queue requests to dedicated
threads and wait, and the requests would be serialized. Such
strictness is not very critical, I guess. But others might find other
use cases...

thanks,

> Comments welcome, thanks!
> Andrea
> 
> Andrea Arcangeli (4):
>   mm: madvise MADV_USERFAULT
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   1 +
>  arch/x86/syscalls/syscall_64.tbl       |   1 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  include/linux/huge_mm.h                |   6 +
>  include/linux/mm.h                     |   1 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   3 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/sys_ni.c                        |   1 +
>  mm/fremap.c                            | 440 +++++++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 158 ++++++++++--
>  mm/madvise.c                           |  16 ++
>  mm/memory.c                            |  10 +
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  19 files changed, 667 insertions(+), 15 deletions(-)
> 

-- 
yamahata

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT
  2013-05-06 19:56 ` [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT Andrea Arcangeli
@ 2013-05-07 11:16   ` Andrew Jones
  2013-05-07 11:34     ` Andrea Arcangeli
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Jones @ 2013-05-07 11:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini,
	Isaku Yamahata

On Mon, May 06, 2013 at 09:56:58PM +0200, Andrea Arcangeli wrote:
> +++ b/mm/madvise.c
> @@ -93,6 +93,21 @@ static long madvise_behavior(struct vm_area_struct * vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_USERFAULT:
> +		if (vma->vm_ops) {
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags |= VM_USERFAULT;
> +		break;
> +	case MADV_NOUSERFAULT:
> +		if (vma->vm_ops) {
> +			WARN_ON(new_flags & VM_USERFAULT);
> +			error = -EINVAL;
> +			goto out;
> +		}
> +		new_flags &= ~VM_USERFAULT;
> +		break;
>  	}
>  
>  	if (new_flags == vma->vm_flags) {
> @@ -405,6 +420,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
>  #endif
> +	case MADV_USERFAULT:

Missing MADV_NOUSERFAULT: here

>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
>  		return 1;

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT
  2013-05-07 11:16   ` Andrew Jones
@ 2013-05-07 11:34     ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-07 11:34 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini,
	Isaku Yamahata

Hi Andrew,

On Tue, May 07, 2013 at 01:16:30PM +0200, Andrew Jones wrote:
> On Mon, May 06, 2013 at 09:56:58PM +0200, Andrea Arcangeli wrote:
> > @@ -405,6 +420,7 @@ madvise_behavior_valid(int behavior)
> >  	case MADV_HUGEPAGE:
> >  	case MADV_NOHUGEPAGE:
> >  #endif
> > +	case MADV_USERFAULT:
> 
> Missing MADV_NOUSERFAULT: here

Right, thanks! Folded the below in patch 1/4.

diff --git a/mm/madvise.c b/mm/madvise.c
index 2ecab73..8828e5c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -421,6 +421,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_NOHUGEPAGE:
 #endif
 	case MADV_USERFAULT:
+	case MADV_NOUSERFAULT:
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
 		return 1;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
  2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
                   ` (4 preceding siblings ...)
  2013-05-07 10:07 ` [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Isaku Yamahata
@ 2013-05-07 11:38 ` Andrew Jones
  2013-05-07 12:19   ` Andrea Arcangeli
  5 siblings, 1 reply; 12+ messages in thread
From: Andrew Jones @ 2013-05-07 11:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini,
	Isaku Yamahata

On Mon, May 06, 2013 at 09:56:57PM +0200, Andrea Arcangeli wrote:
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed, and it will return
> -EFAULT at the first sign of something unexpected (like a page already
> mapped in the destination pmd/pte, potentially signaling an userland
> thread race condition with two threads userfaulting on the same
> destination address). mremap is not strict like that: it would drop
> the destination range silently and it would succeed in such a
> condition. So on the API side, I wonder if I should add a flag to
> remap_anon_pages to provide non-strict behavior more similar to
> mremap. OTOH not providing the permissive mremap behavior may actually
> be better to force userland to be strict and be sure it knows what it
> is doing (otherwise it should use mremap in the first place?).
> 

What about, instead of adding a new syscall (remap_anon_pages),
extending mremap with new flags that give it a strict mode?

drew

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
  2013-05-07 10:07 ` [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Isaku Yamahata
@ 2013-05-07 12:09   ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-07 12:09 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini

Hi Isaku,

On Tue, May 07, 2013 at 07:07:40PM +0900, Isaku Yamahata wrote:
> On Mon, May 06, 2013 at 09:56:57PM +0200, Andrea Arcangeli wrote:
> > Hello everyone,
> > 
> > this is a patchset to implement two new kernel features:
> > MADV_USERFAULT and remap_anon_pages.
> > 
> > The combination of the two features are what I would propose to
> > implement postcopy live migration, and in general demand paging of
> > remote memory, hosted in different cloud nodes with KSM. It might also
> > be used without virt to offload parts of memory to different nodes
> > using some userland library and a network memory manager.
> 
> Interesting. The API you are proposing handles only the userland fault case.
> What do you think about the kernel case, i.e. when the KVM kernel module
> issues get_user_pages()?
> Exit to qemu with a dedicated exit reason?

Correct. It's possible we also want a more meaningful retval from
get_user_pages (right now SIGBUS would make gup return a too generic
-EFAULT), by introducing a FOLL_USERFAULT in gup_flags.

So the KVM bits are still missing at this point.

Gleb also wants to enable async page faults in the postcopy stage, so we
immediately schedule a different guest process if the current guest
process hits a userfault within KVM.

So the protocol with the postcopy thread will tell it "fill this pfn
async" or "fill it synchronously". And Gleb would like kvm to talk to the
postcopy thread (through a pipe?) directly, to avoid exiting to
userland.

But we could also return to userland; if we do, we don't need to teach
the kernel about the postcopy thread protocol for requesting new pages
synchronously (after running out of async page faults) or asynchronously
(while async page faults are still available).

Clearly staying in the kernel is more efficient, as it avoids an
enter/exit cycle and kvm can be restarted immediately after a 9-byte
write to the pipe to the postcopy thread.

> In the precopy + postcopy optimization, the dirty bitmap is sent after the
> precopy phase and then the clean pages are populated. In this population
> phase a vectored API could be utilized. I'm not sure how much a vectored API
> would contribute to shortening the VM-switch time, though.

But the network transfer won't be vectored, will it? If we pay a kernel
enter/exit for the network transfer anyway, I assume we'd run a
remap_anon_pages after each chunk.
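
Something like this is the population loop I have in mind (just a sketch:
remap_anon_pages() and page_size refer to a placeholder wrapper like the one
in the earlier sketch in this thread, and the bitmap is assumed to hold 64
pages per word):

#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

extern long remap_anon_pages(void *dst, void *src, size_t len);
extern long page_size;

static void populate_pages(int sock, char *guest_base, char *tmp_buf,
			   const uint64_t *populate_bitmap,
			   uint64_t nr_pages)
{
	for (uint64_t pfn = 0; pfn < nr_pages; pfn++) {
		if (!(populate_bitmap[pfn / 64] & (1ULL << (pfn % 64))))
			continue;
		/* we pay the enter/exit for the recv anyway... */
		recv(sock, tmp_buf, page_size, MSG_WAITALL);
		/* ...so one remap_anon_pages per page; the moved page
		 * leaves tmp_buf empty and the next write refaults a
		 * fresh anonymous page there */
		remap_anon_pages(guest_base + pfn * page_size, tmp_buf,
				 page_size);
	}
}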

Also the postcopy thread won't transfer too much data at once in the
background. It needs to react quickly to an "urgent" userfault request
coming from a vcpu thread.

> It would be desirable to avoid doing complex things in the signal handler,
> like sending page requests to the remote host or receiving pages from it.
> So the signal handler would just queue requests to dedicated threads and
> wait, and the requests would be serialized there. Such strictness is not

Exactly, that's the idea: a separate thread will do the network transfer
and then run remap_anon_pages. And if we use async page faults
immediately, the vcpu won't need to block until we run out of async page
faults.

> very critical, I guess. But others might find other use cases...

It's still somewhat useful to be strict in my view, as it will verify
that we correctly handle the case of many vcpus userfaulting on the same
address at the same time: everyone except the first shouldn't run
remap_anon_pages.
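
In other words the userland dedup is expected to look something like this
(just a sketch, assuming one bit per guest page in a userland bitmap sized
at init time):

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t uf_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t *inflight_bitmap;	/* nr_guest_pages / 64 words */

/* returns true only for the first thread to userfault on this pfn */
static bool userfault_claim(uint64_t pfn)
{
	bool first;

	pthread_mutex_lock(&uf_lock);
	first = !(inflight_bitmap[pfn / 64] & (1ULL << (pfn % 64)));
	if (first)
		inflight_bitmap[pfn / 64] |= 1ULL << (pfn % 64);
	pthread_mutex_unlock(&uf_lock);
	return first;	/* everyone else just waits for the first to finish */
}

If a bug lets two threads slip through and both call remap_anon_pages on
the same destination anyway, the strict -EFAULT makes that visible
immediately instead of silently dropping a page.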

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages()
  2013-05-07 11:38 ` Andrew Jones
@ 2013-05-07 12:19   ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2013-05-07 12:19 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Anthony Liguori, Juan Quintela, Hugh Dickins, qemu-devel,
	linux-kernel, Orit Wasserman, Mel Gorman, Paolo Bonzini,
	Isaku Yamahata

On Tue, May 07, 2013 at 01:38:10PM +0200, Andrew Jones wrote:
> What about, instead of adding a new syscall (remap_anon_pages),
> extending mremap with new flags that give it a strict mode?

I actually thought about this and it's a very interesting argument.

When I thought about it, I felt the problem was that we won't know until
we've done some work (at least until reaching _each_ individual page
structure of every page mapped in the mremapped range) whether we can
avoid the vma mangling or not. So remap_anon_pages should be more
efficient, because it will just error out if strict mode is not possible;
it won't need to search all the page structures before it knows whether
it has to do the vma mangling or not.

It's also likely simpler than trying to teach mremap to do both things
after checking which one it can do.

Also, the same argument could be made about remap_file_pages.

But it's still an interesting point; let's see what others think about
it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-05-07 12:20 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-06 19:56 [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Andrea Arcangeli
2013-05-06 19:56 ` [Qemu-devel] [PATCH 1/4] mm: madvise MADV_USERFAULT Andrea Arcangeli
2013-05-07 11:16   ` Andrew Jones
2013-05-07 11:34     ` Andrea Arcangeli
2013-05-06 19:56 ` [Qemu-devel] [PATCH 2/4] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2013-05-06 19:57 ` [Qemu-devel] [PATCH 3/4] mm: swp_entry_swapcount Andrea Arcangeli
2013-05-06 19:57 ` [Qemu-devel] [PATCH 4/4] mm: sys_remap_anon_pages Andrea Arcangeli
2013-05-06 20:01   ` Andrea Arcangeli
2013-05-07 10:07 ` [Qemu-devel] [PATCH 0/4] madvise(MADV_USERFAULT) & sys_remap_anon_pages() Isaku Yamahata
2013-05-07 12:09   ` Andrea Arcangeli
2013-05-07 11:38 ` Andrew Jones
2013-05-07 12:19   ` Andrea Arcangeli
