linux-mm.kvack.org archive mirror
* [PATCH 00/10] RFC: userfault
@ 2014-07-02 16:50 Andrea Arcangeli
From: Andrea Arcangeli @ 2014-07-02 16:50 UTC (permalink / raw)
  To: qemu-devel, kvm, linux-mm, linux-kernel
  Cc: "Dr. David Alan Gilbert", Johannes Weiner,
	Andrew Morton, Android Kernel Team, Robert Love, Mel Gorman,
	Hugh Dickins, Dave Hansen, Rik van Riel, Dmitry Adamushko,
	Neil Brown, Andrea Arcangeli, Mike Hommey, Taras Glek, Jan Kara,
	KOSAKI Motohiro, Michel Lespinasse, Minchan Kim, Keith Packard,
	Huangpeng (Peter), Isaku Yamahata, Paolo Bonzini, Anthony Liguori,
	Stefan Hajnoczi, Wenchao Xia, Andrew Jones, Juan Quintela,
	Mel Gorman

Hello everyone,

There's a large CC list for this RFC because this adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
or on a completely different API if somebody has better ideas are
welcome now.

The combination of these features is what I would propose to
implement postcopy live migration in qemu, and in general demand
paging of remote memory, hosted in different cloud nodes.

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used
to make the userfault unnoticeable to the syscall (no error will be
returned). This latter feature is more advanced than what volatile
ranges alone could do with SIGBUS so far. It's optional, though: if the
process doesn't call userfaultfd, the regular SIGBUS will fire, and if
the fd is closed SIGBUS will also fire for any blocked userfault that
was waiting for a userfaultfd_write ack.
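
To illustrate the SIGBUS fallback described above, here is a minimal
sketch of catching SIGBUS on a page access from userland. It does not
use MADV_USERFAULT (which is only proposed in this series and is not
assumed to be in any installed header). As a stand-in, it maps an empty
temporary file, so touching the page past EOF raises SIGBUS. The names
probe_sigbus and on_sigbus are made up for the example.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* SIGBUS handler: jump back to the access site instead of dying. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)info;
	(void)ctx;
	siglongjmp(fault_env, 1);
}

/*
 * Returns 1 if touching the page raised SIGBUS, 0 if the access
 * succeeded, -1 on setup failure.
 * Stand-in only: the fault comes from a mapping past EOF of an empty
 * file, not from MADV_USERFAULT.
 */
int probe_sigbus(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	int faulted;
	char *p;
	FILE *f = tmpfile();

	if (!f)
		return -1;
	p = mmap(NULL, psz, PROT_READ, MAP_SHARED, fileno(f), 0);
	if (p == MAP_FAILED) {
		fclose(f);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(fault_env, 1) == 0) {
		(void)*(volatile char *)p;	/* faults: no backing page */
		faulted = 0;
	} else {
		faulted = 1;
	}

	sigaction(SIGBUS, &old, NULL);
	munmap(p, psz);
	fclose(f);
	return faulted;
}
```

A SIGBUS handler like this only helps for accesses from userland
context. A syscall touching the same range would just fail, which is
the case the userfaultfd is meant to cover.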

userfaultfd is also generic enough that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler so those
guest processes that aren't waiting for userfaults can keep running in
the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults. It's
not mandatory: vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary. But
remap_anon_pages is faster and ideal for transferring memory across
the network: it's zerocopy and doesn't touch the vma, it only holds
the mmap_sem for reading.

The current behavior of remap_anon_pages is very strict to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently resulting in subtle memory corruption for
example. remap_anon_pages would return -EEXIST in that case. If there
are holes in the source range remap_anon_pages will return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could
be 4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set if doing the userfaults with PAGE_SIZE granularity if the
developer wants to benefit from the strict -ENOENT behavior.
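
The strict checks described above can be modeled in plain userland C.
This is only a toy model over per-page "present" bits, not the syscall
itself. The function name, the flag value, the order of the checks and
the all-or-nothing behavior on error are all assumptions made for the
sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value: the cover letter names the flag, not its value. */
#define MODEL_RAP_ALLOW_SRC_HOLES 0x1UL

/*
 * Toy model: dst[i]/src[i] say whether page i of each range is mapped.
 * Assumption: the whole range is validated before anything is moved.
 */
int model_remap_anon_pages(bool *dst, bool *src, size_t npages,
			   unsigned long flags)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		/* never silently drop an already mapped destination */
		if (dst[i])
			return -EEXIST;
		/* a hole in the source is an error unless allowed */
		if (!src[i] && !(flags & MODEL_RAP_ALLOW_SRC_HOLES))
			return -ENOENT;
	}

	for (i = 0; i < npages; i++) {
		if (!src[i])
			continue;	/* hole: behave as a noop */
		dst[i] = true;		/* page moves to the destination */
		src[i] = false;		/* and leaves the source */
	}
	return 0;
}
```

With a source range of {present, present, hole, present} and an empty
destination, the model returns -ENOENT without the flag and moves the
three present pages with it.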

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model as an
alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
granularity before starting the guest in the destination node) where
vectoring isn't going to provide much of a performance advantage (thanks
to the THP coarser granularity).
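
For reference, this is a minimal sketch of what zapping a page with
MADV_DONTNEED means for private anonymous memory: the contents are
discarded and the next access gets a fresh zero page. It only uses the
existing madvise() call. The helper name zap_and_read_back is made up
for the example and it zaps a single page, not a guest memory range.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one anonymous page, zap it with MADV_DONTNEED and return the
 * first byte read back afterwards (-1 on failure).
 */
int zap_and_read_back(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int v;

	p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAB, psz);			/* dirty the page */
	if (madvise(p, psz, MADV_DONTNEED)) {	/* drop its contents */
		munmap(p, psz);
		return -1;
	}

	v = p[0];				/* refault: zero page */
	munmap(p, psz);
	return v;
}
```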

On the rmap side remap_anon_pages doesn't add much complexity: there's
no need for nonlinear anon vmas to support it because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (like qemu can in some cases).
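
Marking a range MADV_DONTFORK only needs the existing madvise() call.
The sketch below allocates an anonymous range, marks it, and checks
from a forked child that the range is not inherited. The helper names
alloc_dontfork and child_can_read are made up for the example, and
nothing here calls remap_anon_pages.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Allocate an anonymous range that fork() will not copy to children. */
void *alloc_dontfork(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (madvise(p, len, MADV_DONTFORK)) {
		munmap(p, len);
		return NULL;
	}
	return p;
}

/*
 * Returns 1 if a forked child can read *p, 0 if the child died on the
 * access (range not mapped in the child), -1 on fork/wait failure.
 */
int child_can_read(const volatile char *p)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		(void)p[0];	/* SIGSEGV if the range was not inherited */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Since the child never maps the pages, they keep a single mapping
across a fork, which is what the mapcount constraint needs.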

One part that hasn't been tested is the poll() syscall on the
userfaultfd because the postcopy migration thread currently is more
efficient waiting on blocking read()s (I'll write some code to test
poll() too). I also appended below a patch to trinity to exercise
remap_anon_pages and userfaultfd, and trinity completes successfully
with it.

The code can be found here:

git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea


Thread overview: 18+ messages
2014-07-02 16:50 [PATCH 00/10] RFC: userfault Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 01/10] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 02/10] mm: madvise MADV_USERFAULT Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 03/10] mm: PT lock: export double_pt_lock/unlock Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 04/10] mm: rmap preparation for remap_anon_pages Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 05/10] mm: swp_entry_swapcount Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 06/10] mm: sys_remap_anon_pages Andrea Arcangeli
2014-07-04 11:30   ` Michael Kerrisk
2014-07-02 16:50 ` [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization Andrea Arcangeli
2014-07-03  1:56   ` Andy Lutomirski
2014-07-03 13:19     ` Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 09/10] userfaultfd: make userfaultfd_write non blocking Andrea Arcangeli
2014-07-02 16:50 ` [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in handle_userfault() Andrea Arcangeli
2014-07-03  1:51 ` [PATCH 00/10] RFC: userfault Andy Lutomirski
2014-07-03 13:45 ` [Qemu-devel] " Christopher Covington
2014-07-03 14:08   ` Andrea Arcangeli
2014-07-03 15:41 ` Dave Hansen
