qemu-devel.nongnu.org archive mirror
From: Christopher Covington <cov@codeaurora.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
	Jan Kara <jack@suse.cz>,
	kvm@vger.kernel.org, Neil Brown <neilb@suse.de>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel@nongnu.org, crml <criu@openvz.org>,
	linux-mm@kvack.org, KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
	Michel Lespinasse <walken@google.com>,
	Taras Glek <tglek@mozilla.com>,
	Juan Quintela <quintela@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Isaku Yamahata <yamahata@valinux.co.jp>,
	Mel Gorman <mgorman@suse.de>,
	Android Kernel Team <kernel-team@android.com>,
	Andrew Jones <drjones@redhat.com>, Mel Gorman <mel@csn.ul.ie>,
	"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Keith Packard <keithp@keithp.com>,
	Wenchao Xia <wenchaoqemu@gmail.com>,
	linux-kernel@vger.kernel.org, Minchan Kim <minchan@kernel.org>,
	Dmitry Adamushko <dmitry.adamushko@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Mike Hommey <mh@glandium.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
Date: Thu, 03 Jul 2014 09:45:07 -0400
Message-ID: <53B55E63.7080309@codeaurora.org>
In-Reply-To: <1404319816-30229-1-git-send-email-aarcange@redhat.com>

Hi Andrea,

On 07/02/2014 12:50 PM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.
> 
> The combination of these features is what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far. It's optional, though: if
> the process doesn't call userfaultfd, the regular SIGBUS will fire,
> and if the fd is closed SIGBUS will also fire for any blocked
> userfault that was waiting for a userfaultfd_write ack.
> 
> userfaultfd is also generic enough that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features work just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is always used with 2M naturally aligned
> addresses, transparent hugepages will not be split. If there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
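
To make the return-value rules in the two quoted paragraphs above concrete,
here is a toy userspace model of them. It is only a sketch under my own
assumptions, not the kernel code: the function names are hypothetical, pages
are modeled as booleans ("mapped" or "hole"), the range is validated before
anything is moved, and -EEXIST is checked before -ENOENT. Only the flag
value is taken from the trinity patch quoted further down.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RAP_ALLOW_SRC_HOLES (1UL << 0)     /* same value as in the trinity patch */
#define HPAGE_2M (2UL * 1024 * 1024)

/* True when addr is naturally aligned to 2M (the THP-friendly case). */
bool is_2m_aligned(uintptr_t addr)
{
	return (addr & (HPAGE_2M - 1)) == 0;
}

/*
 * Toy model, NOT the kernel implementation: dst[i]/src[i] say whether
 * page i of each range is currently mapped.
 */
int model_remap_anon_pages(bool *dst, bool *src, size_t npages,
			   unsigned long flags)
{
	size_t i;

	assert(dst != NULL && src != NULL);

	/* Assumption: validate everything first, so a failure moves nothing. */
	for (i = 0; i < npages; i++) {
		if (dst[i])
			return -EEXIST;	/* destination already populated */
		if (!src[i] && !(flags & RAP_ALLOW_SRC_HOLES))
			return -ENOENT;	/* hole in the source range */
	}

	for (i = 0; i < npages; i++) {
		if (!src[i])
			continue;	/* hole: behave as a noop */
		dst[i] = true;		/* page now lives at the destination */
		src[i] = false;		/* and is gone from the source */
	}
	return 0;
}
```

With a source range of {mapped, hole, mapped, mapped} and an empty
destination, the model returns -ENOENT without the flag, succeeds with
RAP_ALLOW_SRC_HOLES, and returns -EEXIST if the same call is repeated.
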
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much of a performance advantage (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some cases).
> 
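
The MADV_DONTFORK advice above uses only existing interfaces, so it can be
sketched today. The helper name below is hypothetical; mmap(), madvise() and
MADV_DONTFORK are the standard Linux calls. This is a minimal sketch of how a
source buffer might be set up, assuming an anonymous private mapping is what
gets handed to remap_anon_pages later.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate an anonymous source buffer that child
 * processes do not inherit, so its pages are never shared through fork().
 * Returns NULL on failure.
 */
void *alloc_dontfork_source(size_t len)
{
	void *p;

	assert(len > 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Exclude the range from any future fork() of this process. */
	if (madvise(p, len, MADV_DONTFORK) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

The caller releases the buffer with munmap(p, len) as usual.
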
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!

CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
implement its pre-copy memory migration.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt

Would it make sense to use a similar interaction model of peeking and poking
at /proc/pid/ files for post-copy memory migration facilities?
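
For concreteness, here is a minimal sketch of that peek/poke model as it
exists today for soft-dirty tracking. The helper names are mine and purely
illustrative. The interface details come from the kernel documentation:
writing "4" to clear_refs clears the soft-dirty bits, bit 55 of a 64-bit
pagemap entry is the soft-dirty bit and bit 63 is the present bit. It assumes
a kernel built with soft-dirty support and a 64-bit off_t.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_SOFT_DIRTY (1ULL << 55)	/* pagemap bit 55: pte is soft-dirty */
#define PM_PRESENT    (1ULL << 63)	/* pagemap bit 63: page present in RAM */

int pagemap_entry_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}

int pagemap_entry_present(uint64_t entry)
{
	return (entry & PM_PRESENT) != 0;
}

/* Byte offset in the pagemap file of the entry describing vaddr. */
off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
	assert(page_size > 0);
	return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* "Poke": clear the soft-dirty bits of the calling process. */
int clear_soft_dirty_self(void)
{
	ssize_t n;
	int fd = open("/proc/self/clear_refs", O_WRONLY);

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/* "Peek": read the pagemap entry covering addr in the calling process. */
int read_pagemap_self(const void *addr, uint64_t *entry)
{
	ssize_t n;
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry),
		  pagemap_offset((uintptr_t)addr, sysconf(_SC_PAGESIZE)));
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A pre-copy pass would call clear_soft_dirty_self(), let the task run, then
use read_pagemap_self() on each page and test pagemap_entry_soft_dirty() to
find what needs to be sent again.
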

Christopher

> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  include/syscalls-x86_64.h   |   2 +
>  syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
>  syscalls/syscalls.h         |   2 +
>  syscalls/userfaultfd.c      |  12 ++++++
>  4 files changed, 116 insertions(+)
>  create mode 100644 syscalls/remap_anon_pages.c
>  create mode 100644 syscalls/userfaultfd.c
> 
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
>  	{ .entry = &syscall_sched_setattr },
>  	{ .entry = &syscall_sched_getattr },
>  	{ .entry = &syscall_renameat2 },
> +	{ .entry = &syscall_remap_anon_pages },
> +	{ .entry = &syscall_userfaultfd },
>  };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE4(remap_anon_pages,
> +		unsigned long, dst_start, unsigned long, src_start,
> +		unsigned long, len, unsigned long, flags)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> +	1 * MB, 2 * MB, 4 * MB, 8 * MB,
> +	10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> +	unsigned long max_rand;
> +	if (rand_bool()) {
> +		g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_src = MAP_FAILED;
> +	if (rand_bool()) {
> +		g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> +			     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	} else
> +		g_dst = MAP_FAILED;
> +	g_size = size;
> +	g_check = 1;
> +
> +	rec->a1 = (unsigned long) g_dst;
> +	rec->a2 = (unsigned long) g_src;
> +	rec->a3 = g_size;
> +	rec->a4 = 0;
> +
> +	if (rand_bool())
> +		max_rand = -1UL;
> +	else
> +		max_rand = g_size << 1;
> +	if (rand_bool()) {
> +		rec->a3 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a1 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		rec->a2 += (rand() % max_rand) - g_size;
> +		g_check = 0;
> +	}
> +	if (rand_bool()) {
> +		if (rand_bool()) {
> +			rec->a4 = rand();
> +		} else
> +			rec->a4 = RAP_ALLOW_SRC_HOLES;
> +	}
> +	if (g_src != MAP_FAILED)
> +		memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> +	if (g_check && !rec->retval) {
> +		unsigned long size = g_size;
> +		unsigned char *dst = g_dst;
> +		while (size--)
> +			assert(dst[size] == 0xaaU);
> +	}
> +	munmap(g_src, g_size);
> +	munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> +	.name = "remap_anon_pages",
> +	.num_args = 4,
> +	.arg1name = "dst_start",
> +	.arg2name = "src_start",
> +	.arg3name = "len",
> +	.arg4name = "flags",
> +	.group = GROUP_VM,
> +	.sanitise = sanitise_remap_anon_pages,
> +	.post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
>  extern struct syscallentry syscall_sched_getattr;
>  extern struct syscallentry syscall_renameat2;
>  extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> +	.name = "userfaultfd",
> +	.num_args = 1,
> +	.arg1name = "flags",
> +	.arg1type = ARG_LEN,
> +	.rettype = RET_FD,
> +};
> 
> 
> Andrea Arcangeli (10):
>   mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
>   mm: madvise MADV_USERFAULT
>   mm: PT lock: export double_pt_lock/unlock
>   mm: rmap preparation for remap_anon_pages
>   mm: swp_entry_swapcount
>   mm: sys_remap_anon_pages
>   waitqueue: add nr wake parameter to __wake_up_locked_key
>   userfaultfd: add new syscall to provide memory externalization
>   userfaultfd: make userfaultfd_write non blocking
>   userfaultfd: use VM_FAULT_RETRY in handle_userfault()
> 
>  arch/alpha/include/uapi/asm/mman.h     |   3 +
>  arch/mips/include/uapi/asm/mman.h      |   3 +
>  arch/parisc/include/uapi/asm/mman.h    |   3 +
>  arch/x86/syscalls/syscall_32.tbl       |   2 +
>  arch/x86/syscalls/syscall_64.tbl       |   2 +
>  arch/xtensa/include/uapi/asm/mman.h    |   3 +
>  fs/Makefile                            |   1 +
>  fs/proc/task_mmu.c                     |   5 +-
>  fs/userfaultfd.c                       | 593 +++++++++++++++++++++++++++++++++
>  include/linux/huge_mm.h                |  11 +-
>  include/linux/ksm.h                    |   4 +-
>  include/linux/mm.h                     |   5 +
>  include/linux/mm_types.h               |   2 +-
>  include/linux/swap.h                   |   6 +
>  include/linux/syscalls.h               |   5 +
>  include/linux/userfaultfd.h            |  42 +++
>  include/linux/wait.h                   |   5 +-
>  include/uapi/asm-generic/mman-common.h |   3 +
>  init/Kconfig                           |  10 +
>  kernel/sched/wait.c                    |   7 +-
>  kernel/sys_ni.c                        |   2 +
>  mm/fremap.c                            | 506 ++++++++++++++++++++++++++++
>  mm/huge_memory.c                       | 209 ++++++++++--
>  mm/ksm.c                               |   2 +-
>  mm/madvise.c                           |  19 +-
>  mm/memory.c                            |  14 +
>  mm/mremap.c                            |   2 +-
>  mm/rmap.c                              |   9 +
>  mm/swapfile.c                          |  13 +
>  net/sunrpc/sched.c                     |   2 +-
>  30 files changed, 1447 insertions(+), 46 deletions(-)
>  create mode 100644 fs/userfaultfd.c
>  create mode 100644 include/linux/userfaultfd.h
> 
> 


-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.
