From: Christopher Covington <cov@codeaurora.org>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Robert Love <rlove@google.com>, Dave Hansen <dave@sr71.net>,
Jan Kara <jack@suse.cz>,
kvm@vger.kernel.org, Neil Brown <neilb@suse.de>,
Stefan Hajnoczi <stefanha@gmail.com>,
qemu-devel@nongnu.org, crml <criu@openvz.org>,
linux-mm@kvack.org, KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
Michel Lespinasse <walken@google.com>,
Taras Glek <tglek@mozilla.com>,
Juan Quintela <quintela@redhat.com>,
Hugh Dickins <hughd@google.com>,
Isaku Yamahata <yamahata@valinux.co.jp>,
Mel Gorman <mgorman@suse.de>,
Android Kernel Team <kernel-team@android.com>,
Andrew Jones <drjones@redhat.com>, Mel Gorman <mel@csn.ul.ie>,
"Huangpeng (Peter)" <peter.huangpeng@huawei.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
Anthony Liguori <anthony@codemonkey.ws>,
Paolo Bonzini <pbonzini@redhat.com>,
Keith Packard <keithp@keithp.com>,
Wenchao Xia <wenchaoqemu@gmail.com>,
linux-kernel@vger.kernel.org, Minchan Kim <minchan@kernel.org>,
Dmitry Adamushko <dmitry.adamushko@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Mike Hommey <mh@glandium.org>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [Qemu-devel] [PATCH 00/10] RFC: userfault
Date: Thu, 03 Jul 2014 09:45:07 -0400
Message-ID: <53B55E63.7080309@codeaurora.org>
In-Reply-To: <1404319816-30229-1-git-send-email-aarcange@redhat.com>

Hi Andrea,

On 07/02/2014 12:50 PM, Andrea Arcangeli wrote:
> Hello everyone,
>
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.
>
> The combination of these features is what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
>
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
>
> If the access could ever happen in kernel context through syscalls
> (not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting for a userfaultfd_write ack).
>
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features work just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
>
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
>
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
>
> If remap_anon_pages is used always with 2M naturally aligned
> addresses, transparent hugepages will not be split. If there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
>
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much of a performance advantage (thanks
> to the coarser THP granularity).
>
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need for nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some cases).
>
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
>
> The code can be found here:
>
> git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault
>
> The branch is rebased so you can get updates for example with:
>
> git fetch && git checkout -f origin/userfault
>
> Comments welcome, thanks!

CRIU uses the soft dirty bit in /proc/pid/clear_refs and /proc/pid/pagemap to
implement its pre-copy memory migration.

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/soft-dirty.txt

Would it make sense to use a similar interaction model of peeking and poking
at /proc/pid/ files for post-copy memory migration facilities?

Christopher

> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/syscalls-x86_64.h | 2 +
> syscalls/remap_anon_pages.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
> syscalls/syscalls.h | 2 +
> syscalls/userfaultfd.c | 12 ++++++
> 4 files changed, 116 insertions(+)
> create mode 100644 syscalls/remap_anon_pages.c
> create mode 100644 syscalls/userfaultfd.c
>
> diff --git a/include/syscalls-x86_64.h b/include/syscalls-x86_64.h
> index e09df43..a5b3a88 100644
> --- a/include/syscalls-x86_64.h
> +++ b/include/syscalls-x86_64.h
> @@ -324,4 +324,6 @@ struct syscalltable syscalls_x86_64[] = {
> { .entry = &syscall_sched_setattr },
> { .entry = &syscall_sched_getattr },
> { .entry = &syscall_renameat2 },
> + { .entry = &syscall_remap_anon_pages },
> + { .entry = &syscall_userfaultfd },
> };
> diff --git a/syscalls/remap_anon_pages.c b/syscalls/remap_anon_pages.c
> new file mode 100644
> index 0000000..b1e9d3c
> --- /dev/null
> +++ b/syscalls/remap_anon_pages.c
> @@ -0,0 +1,100 @@
> +/*
> + * SYSCALL_DEFINE3(remap_anon_pages,
> + unsigned long, dst_start, unsigned long, src_start,
> + unsigned long, len)
> + */
> +#include <stdlib.h>
> +#include <asm/mman.h>
> +#include <assert.h>
> +#include "arch.h"
> +#include "maps.h"
> +#include "random.h"
> +#include "sanitise.h"
> +#include "shm.h"
> +#include "syscall.h"
> +#include "tables.h"
> +#include "trinity.h"
> +#include "utils.h"
> +
> +static const unsigned long alignments[] = {
> + 1 * MB, 2 * MB, 4 * MB, 8 * MB,
> + 10 * MB, 100 * MB,
> +};
> +
> +static unsigned char *g_src, *g_dst;
> +static unsigned long g_size;
> +static int g_check;
> +
> +#define RAP_ALLOW_SRC_HOLES (1UL<<0)
> +
> +static void sanitise_remap_anon_pages(struct syscallrecord *rec)
> +{
> + unsigned long size = alignments[rand() % ARRAY_SIZE(alignments)];
> + unsigned long max_rand;
> + if (rand_bool()) {
> + g_src = mmap(NULL, size, PROT_READ|PROT_WRITE,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> + } else
> + g_src = MAP_FAILED;
> + if (rand_bool()) {
> + g_dst = mmap(NULL, size, PROT_READ|PROT_WRITE,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> + } else
> + g_dst = MAP_FAILED;
> + g_size = size;
> + g_check = 1;
> +
> + rec->a1 = (unsigned long) g_dst;
> + rec->a2 = (unsigned long) g_src;
> + rec->a3 = g_size;
> + rec->a4 = 0;
> +
> + if (rand_bool())
> + max_rand = -1UL;
> + else
> + max_rand = g_size << 1;
> + if (rand_bool()) {
> + rec->a3 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + rec->a1 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + rec->a2 += (rand() % max_rand) - g_size;
> + g_check = 0;
> + }
> + if (rand_bool()) {
> + if (rand_bool()) {
> + rec->a4 = rand();
> + } else
> + rec->a4 = RAP_ALLOW_SRC_HOLES;
> + }
> + if (g_src != MAP_FAILED)
> + memset(g_src, 0xaa, size);
> +}
> +
> +static void post_remap_anon_pages(struct syscallrecord *rec)
> +{
> + if (g_check && !rec->retval) {
> + unsigned long size = g_size;
> + unsigned char *dst = g_dst;
> + while (size--)
> + assert(dst[size] == 0xaaU);
> + }
> + munmap(g_src, g_size);
> + munmap(g_dst, g_size);
> +}
> +
> +struct syscallentry syscall_remap_anon_pages = {
> + .name = "remap_anon_pages",
> + .num_args = 4,
> + .arg1name = "dst_start",
> + .arg2name = "src_start",
> + .arg3name = "len",
> + .arg4name = "flags",
> + .group = GROUP_VM,
> + .sanitise = sanitise_remap_anon_pages,
> + .post = post_remap_anon_pages,
> +};
> diff --git a/syscalls/syscalls.h b/syscalls/syscalls.h
> index 114500c..b8eaa63 100644
> --- a/syscalls/syscalls.h
> +++ b/syscalls/syscalls.h
> @@ -370,3 +370,5 @@ extern struct syscallentry syscall_sched_setattr;
> extern struct syscallentry syscall_sched_getattr;
> extern struct syscallentry syscall_renameat2;
> extern struct syscallentry syscall_kern_features;
> +extern struct syscallentry syscall_remap_anon_pages;
> +extern struct syscallentry syscall_userfaultfd;
> diff --git a/syscalls/userfaultfd.c b/syscalls/userfaultfd.c
> new file mode 100644
> index 0000000..769fe78
> --- /dev/null
> +++ b/syscalls/userfaultfd.c
> @@ -0,0 +1,12 @@
> +/*
> + * SYSCALL_DEFINE1(userfaultfd, int, flags)
> + */
> +#include "sanitise.h"
> +
> +struct syscallentry syscall_userfaultfd = {
> + .name = "userfaultfd",
> + .num_args = 1,
> + .arg1name = "flags",
> + .arg1type = ARG_LEN,
> + .rettype = RET_FD,
> +};
>
>
> Andrea Arcangeli (10):
> mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
> mm: madvise MADV_USERFAULT
> mm: PT lock: export double_pt_lock/unlock
> mm: rmap preparation for remap_anon_pages
> mm: swp_entry_swapcount
> mm: sys_remap_anon_pages
> waitqueue: add nr wake parameter to __wake_up_locked_key
> userfaultfd: add new syscall to provide memory externalization
> userfaultfd: make userfaultfd_write non blocking
> userfaultfd: use VM_FAULT_RETRY in handle_userfault()
>
> arch/alpha/include/uapi/asm/mman.h | 3 +
> arch/mips/include/uapi/asm/mman.h | 3 +
> arch/parisc/include/uapi/asm/mman.h | 3 +
> arch/x86/syscalls/syscall_32.tbl | 2 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/xtensa/include/uapi/asm/mman.h | 3 +
> fs/Makefile | 1 +
> fs/proc/task_mmu.c | 5 +-
> fs/userfaultfd.c | 593 +++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 11 +-
> include/linux/ksm.h | 4 +-
> include/linux/mm.h | 5 +
> include/linux/mm_types.h | 2 +-
> include/linux/swap.h | 6 +
> include/linux/syscalls.h | 5 +
> include/linux/userfaultfd.h | 42 +++
> include/linux/wait.h | 5 +-
> include/uapi/asm-generic/mman-common.h | 3 +
> init/Kconfig | 10 +
> kernel/sched/wait.c | 7 +-
> kernel/sys_ni.c | 2 +
> mm/fremap.c | 506 ++++++++++++++++++++++++++++
> mm/huge_memory.c | 209 ++++++++++--
> mm/ksm.c | 2 +-
> mm/madvise.c | 19 +-
> mm/memory.c | 14 +
> mm/mremap.c | 2 +-
> mm/rmap.c | 9 +
> mm/swapfile.c | 13 +
> net/sunrpc/sched.c | 2 +-
> 30 files changed, 1447 insertions(+), 46 deletions(-)
> create mode 100644 fs/userfaultfd.c
> create mode 100644 include/linux/userfaultfd.h
>
>
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation.