From: Andrea Arcangeli <aarcange@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Grigory Makarevich <gmakarevich@google.com>,
kvm@vger.kernel.org, gleb@redhat.com,
Eric Northup <digitaleric@google.com>
Subject: Re: Demand paging for VM on KVM
Date: Thu, 20 Mar 2014 18:32:29 +0100
Message-ID: <20140320173229.GB4000@redhat.com>
In-Reply-To: <532AEABA.2070000@redhat.com>
Hi,
On Thu, Mar 20, 2014 at 02:18:50PM +0100, Paolo Bonzini wrote:
> On 20/03/2014 00:27, Grigory Makarevich wrote:
> > Hi All,
> >
> > I have been exploring different ways to implement on-demand paging for
> > VMs running in KVM.
> >
> > The core of the idea is to introduce an additional exit
> > KVM_EXIT_MEMORY_NOT_PRESENT to inform VMM's user space to process
> > access to "not yet present" guest's page.
> > Each memory slot may be instructed to keep track of ondemand bit per
> > page. If the page is marked as "ondemand", page fault will generate
> > exit to the host's
> > user-space with the information about the faulting page. Once the page
> > is filled, VMM instructs the KVM to clear "ondemand" bit for the page.
> >
> > I have working prototype and would like to consider upstreaming
> > corresponding KVM changes.
That was the original idea before userfaultfd was introduced. The
problem is what happens when qemu does an O_DIRECT read that touches
the missing memory. It's not just a matter of adding an additional
exit: the whole qemu userland would need to become aware, in various
places, of new kinds of errors coming out of legacy syscalls like
read(2), not just out of the KVM ioctl, which would be easy to control
by adding a new exit reason.
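To make the read(2) concern concrete, here is a minimal sketch in C. It
is only an analogy: a PROT_NONE page stands in for guest memory that is
"not yet present", and the helper name read_into_inaccessible_page is
hypothetical, not part of qemu or KVM. On Linux I would expect the
read() to fail with EFAULT, which is the kind of error every such call
site would have to learn to handle.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read() from a pipe into a buffer the kernel
 * cannot access. The PROT_NONE page is an assumption of this demo; it
 * merely stands in for "missing" guest memory.
 *
 * Returns the errno reported by read(), 0 if read() unexpectedly
 * succeeds, or -1 if the setup itself fails.
 */
int read_into_inaccessible_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fds[2];
    int result = -1;
    void *page;

    if (page_size <= 0)
        return -1;

    /* Reserve one page that neither userland nor the kernel may touch. */
    page = mmap(NULL, (size_t)page_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    if (pipe(fds) == 0) {
        /* Queue a few bytes so that read() has something to copy. */
        if (write(fds[1], "data", 4) == 4) {
            ssize_t n = read(fds[0], page, 4);
            result = (n < 0) ? errno : 0;
        }
        close(fds[0]);
        close(fds[1]);
    }

    munmap(page, (size_t)page_size);
    return result;
}
```

The caller gets an error back from a plain read(), far away from any
KVM ioctl, so an exit reason alone cannot cover this path.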
> >
> > To start up the discussion before sending the actual patch-set, I'd like
> > to send the patch for the kvm's api.txt. Please, let me know what you
> > think.
>
> Hi, Andrea Arcangeli is considering a similar infrastructure at the
> generic mm level. Last time I discussed it with him, his idea was
> roughly to have:
>
> * a "userfaultfd" syscall that would take a memory range and return a
> file descriptor; the file descriptor becomes readable when the first
> access happens on a page in the region, and the read gives the address
> of the access. Any thread that accesses a still-unmapped region remains
> blocked until the address of the faulting page is written back to the
> userfaultfd, or gets a SIGBUS if the userfaultfd is closed.
>
Yes. By keeping the kernel from returning to userland (no more exit to
userland through KVM_EXIT_MEMORY_NOT_PRESENT), the userfaultfd will
allow the kernel, running inside the vcpu/IO thread, to talk directly
to the migration thread (or, in Grigory's case, to the ondemand paging
manager thread). The kernel will sleep waiting for the page to be
present without returning to userland. Once it has finished (i.e.
after the network transfer and remap_anon_pages have completed), the
migration/ondemand thread will notify the kernel through the
userfaultfd to wake up any vcpu/IO thread that was waiting for the
page.

This should solve all the trouble with O_DIRECT or similar syscalls
that may access the missing KVM memory from the I/O thread, and it
will handle the spte fault case more efficiently too, by avoiding a
kernel exit/enter, as KVM_EXIT_MEMORY_NOT_PRESENT will no longer be
required.

It's not finished yet, so I have no 100% proof that this will work
exactly as described above, but I don't expect trouble as the design
is pretty straightforward.

The only slight difference compared to the description above is that
userfaultfd won't take a range of memory. Instead, the userfault
ranges will still be marked by MADV_USERFAULT. The other option would
be to specify the ranges using iovecs, but it felt less flexible to
have to specify them in the syscall invocation instead of allowing
arbitrary changes to the userfault ranges with madvise at runtime.

The userfaultfd will just bind to the whole mm, so no matter which
thread faults on memory marked MADV_USERFAULT, the faulting thread
will engage in the userfaultfd protocol without exiting to userland.

The actual syscall API will require review later anyway; that's not
the primary concern at this point.
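The sleep/fill/wake handshake described above can be sketched in plain
userland C. This is an analogy only, not the proposed kernel interface:
two pipes stand in for the userfaultfd, a forked process plays the
migration/ondemand thread, and the accessor posts the fault itself,
whereas in the design above the kernel would do that on behalf of the
faulting thread. All names (pager_loop, access_missing_page, the
SKETCH_* constants) are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_NUM_PAGES 4u
#define SKETCH_MEM_SIZE  (SKETCH_PAGE_SIZE * SKETCH_NUM_PAGES)

/* Pager side: read a faulting page index, fill the page, wake the waiter. */
static void pager_loop(unsigned char *mem, int fault_rd, int wake_wr)
{
    uint32_t idx;

    while (read(fault_rd, &idx, sizeof idx) == (ssize_t)sizeof idx) {
        if (idx >= SKETCH_NUM_PAGES)
            break;
        /* Stand-in for "network transfer + remap_anon_pages". */
        memset(mem + (size_t)idx * SKETCH_PAGE_SIZE, 0xA0 + (int)idx,
               SKETCH_PAGE_SIZE);
        if (write(wake_wr, &idx, sizeof idx) != (ssize_t)sizeof idx)
            break;
    }
}

/*
 * Accessor side: report the fault, sleep until the pager has filled the
 * page, then read it. Returns the first byte of the page, or -1 on error.
 */
int access_missing_page(uint32_t idx)
{
    int fault_pipe[2], wake_pipe[2];
    int value = -1;
    unsigned char *mem;
    pid_t pager;

    if (idx >= SKETCH_NUM_PAGES)
        return -1;
    mem = mmap(NULL, SKETCH_MEM_SIZE, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    if (pipe(fault_pipe) != 0) {
        munmap(mem, SKETCH_MEM_SIZE);
        return -1;
    }
    if (pipe(wake_pipe) != 0) {
        close(fault_pipe[0]);
        close(fault_pipe[1]);
        munmap(mem, SKETCH_MEM_SIZE);
        return -1;
    }

    pager = fork();
    if (pager == 0) {
        close(fault_pipe[1]);
        close(wake_pipe[0]);
        pager_loop(mem, fault_pipe[0], wake_pipe[1]);
        _exit(0);
    }
    close(fault_pipe[0]);
    close(wake_pipe[1]);

    if (pager > 0) {
        uint32_t done;

        /* Post the fault, then block until the pager says the page is there. */
        if (write(fault_pipe[1], &idx, sizeof idx) == (ssize_t)sizeof idx &&
            read(wake_pipe[0], &done, sizeof done) == (ssize_t)sizeof done &&
            done == idx)
            value = mem[(size_t)idx * SKETCH_PAGE_SIZE];
    }

    close(fault_pipe[1]); /* EOF on the fault pipe makes pager_loop return */
    close(wake_pipe[0]);
    if (pager > 0)
        waitpid(pager, NULL, 0);
    munmap(mem, SKETCH_MEM_SIZE);
    return value;
}
```

The point of the sketch is that the accessor never sees an error: it
simply blocks until the page has been filled, which is the behaviour
that makes O_DIRECT and similar syscalls unproblematic.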
> * a remap_anon_pages syscall that would be used in the userfaultfd I/O
> handler to make the page accessible. The handler would build the page
> in a "shadow" area with the actual contents of guest memory, and then
> remap the shadow area onto the actual guest memory.
>
> Andrea, please correct me.
>
> QEMU would use this infrastructure for post-copy migration and possibly
> also for live snapshotting of the guests. The advantage in making this
> generic rather than KVM-based is that QEMU could use it also in
> system-emulation mode (and of course anything else needing a read
> barrier could use it too).
Correct.

Comments welcome,
Andrea