linux-api.vger.kernel.org archive mirror
* [LSF/MM TOPIC] userfaultfd
@ 2015-01-14 23:01 Andrea Arcangeli
  2015-01-15  9:01 ` Pavel Emelyanov
       [not found] ` <20150114230130.GR6103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 3+ messages in thread
From: Andrea Arcangeli @ 2015-01-14 23:01 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, linux-kernel, linux-api

Hello,

I would like to attend this year's (2015) LSF/MM summit. I'm
particularly interested in the MM track, where I hope to get help
finalizing the userfaultfd feature I've been working on lately.

An overview on the userfaultfd feature can be read here:

   http://lwn.net/Articles/615086/

In essence, the userfault feature can be thought of as an optimal
implementation of userland-driven demand paging, similar to what could
be built with PROT_NONE+SIGSEGV.

userfaultfd fundamentally allows memory to be managed at the
pagetable level: the page fault notification is delivered to userland,
which handles it with the proper userfaultfd commands that mangle the
address space, without involving heavyweight structures like vmas (in
fact the userfaultfd runtime load never takes the mmap_sem for
writing, just as its kernel counterpart wouldn't). The number of vmas
is limited too, so vmas aren't suitable when the faults are heavily
scattered and the address space is large. userfaultfd allows all
userfaults to happen in parallel from different threads, and it relies
on userland to use atomic copy or move commands to resolve them.

By adding more commands to the userfaultfd protocol (spoken over the
fd, like the basic atomic copy command that is needed to resolve a
userfault), in the future we can also mark regions readonly and trap
only wrprotect faults (or both wrprotect and not-present faults
simultaneously).

Different userfaultfds can already be used independently by multiple
libraries and by the main application within the same process.

Once opened, the userfaultfd can also be passed over unix domain
sockets to a manager process (use case 5 below wants to do this), so
the same manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use a userfaultfd themselves on
the same region the manager is already tracking, a corner case whose
relevance should be discussed).

There has been interest from multiple users; I hope I'm not forgetting
any:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:
   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/

2) KVM postcopy live snapshotting (allowing the memory usage to be
   limited/throttled, which fork cannot do).

3) KVM userfaults on shared memory (currently only anonymous memory
   is handled by the userfaultfd, but nothing prevents extending it to
   allow registering a tmpfs region in the userfaultfd and firing a
   userfault if the tmpfs page is not present).

4) an alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access the
   virtual regions marked volatile with the CPU. It also requires
   point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers
   (provided there is a userfaultfd command [not an external syscall
   like in the original implementation] that allows memory to be
   copied atomically into the userfaultfd "mm" rather than into the
   manager "mm"; this is the main reason the external syscalls are
   going away, and in turn the fd-less MADV_USERFAULT is going away as
   well).

6) qemu linux-user binary emulation also briefly expressed interest in
   the wrprotect fault notification for non-x86 archs. In this context
   the userfaultfd might (I'm not sure) be useful for JIT emulation,
   to efficiently protect the translated regions by wrprotecting only
   the page table, without having to split or merge vmas (the risk of
   running out of vmas isn't there for this use case, as the
   translation cache is probably limited in size and not heavily
   scattered).

7) distributed shared memory that could allow simultaneous mapping of
   regions marked readonly and collapse them on the first exclusive
   write. I'm mentioning it only as a corollary, because I'm not aware
   of anybody planning to use it that way (still, I'd like this to be
   possible too, just in case it finds its way into use later on).

The currently planned API (as hinted above) is already different from
the first version of the code posted a couple of months ago, thanks to
the valuable feedback received from the community so far.

As usual, suggestions are welcome, thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


* Re: [LSF/MM TOPIC] userfaultfd
  2015-01-14 23:01 [LSF/MM TOPIC] userfaultfd Andrea Arcangeli
@ 2015-01-15  9:01 ` Pavel Emelyanov
       [not found] ` <20150114230130.GR6103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 3+ messages in thread
From: Pavel Emelyanov @ 2015-01-15  9:01 UTC (permalink / raw)
  To: Andrea Arcangeli, lsf-pc; +Cc: linux-mm, linux-kernel, linux-api

On 01/15/2015 02:01 AM, Andrea Arcangeli wrote:
> Hello,
> 
> I would like to attend this year's (2015) LSF/MM summit. I'm
> particularly interested in the MM track, where I hope to get help
> finalizing the userfaultfd feature I've been working on lately.

I'd like to +1 this. I'm also interested in this topic, especially
in item 5 below.

> 5) postcopy live migration of binaries inside linux containers
>    (provided there is a userfaultfd command [not an external syscall
>    like in the original implementation] that allows memory to be
>    copied atomically into the userfaultfd "mm" rather than into the
>    manager "mm"; this is the main reason the external syscalls are
>    going away, and in turn the fd-less MADV_USERFAULT is going away
>    as well).

We've started to play with userfaultfd in the CRIU project [1] to do
post-copy live migration of whole containers (and of their parts).

One more use case I've seen on the CRIU mailing list is restoring a
container from on-disk images without reading the whole memory in at
restore time. The memory is put into the tasks' address spaces later,
in an on-demand manner. It's claimed that such a restore decreases the
restore time significantly.


One more thing that userfaultfd can help with is restoring COW areas.
Right now, if we have two tasks that share a physical page, but have
it mapped RO to do the COW later, we do complex tricks: restoring the
page in a common ancestor, then inheriting it over fork()-s and
mremap()-ing it into place. Probably that's an API misuse, but it
would be much simpler if the page could just be "sent" to the remote
mm via userfaultfd.

[1] http://criu.org/Main_Page

Thanks,
Pavel



* Re: [LSF/MM TOPIC] userfaultfd
       [not found] ` <20150114230130.GR6103-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-15 16:08   ` Austin S Hemmelgarn
  0 siblings, 0 replies; 3+ messages in thread
From: Austin S Hemmelgarn @ 2015-01-15 16:08 UTC (permalink / raw)
  To: Andrea Arcangeli, lsf-pc; +Cc: linux-mm, linux-kernel, linux-api


On 2015-01-14 18:01, Andrea Arcangeli wrote:
> 7) distributed shared memory that could allow simultaneous mapping of
>     regions marked readonly and collapse them on the first exclusive
>     write. I'm mentioning it only as a corollary, because I'm not
>     aware of anybody planning to use it that way (still, I'd like
>     this to be possible too, just in case it finds its way into use
>     later on).
While I haven't actually written any code for it yet, I've been
thinking about the possibility of using this to let qemu do
distributed emulation of a NUMA system (i.e., you could run qemu on a
Beowulf cluster and make it look to the guest OS like it's running on
one big NUMA system, essentially SSI clustering for people who don't
have a multi-million dollar budget). Having userfaultfd to work with
would make this exponentially easier to implement.



