From: Andrea Arcangeli <aarcange@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org, satoshi.itoh@aist.go.jp,
t.hirofuchi@aist.go.jp, qemu-devel@nongnu.org,
Isaku Yamahata <yamahata@valinux.co.jp>,
Avi Kivity <avi@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
Date: Tue, 3 Jan 2012 15:25:41 +0100
Message-ID: <20120103142541.GK4172@redhat.com>
In-Reply-To: <4F01EF86.2050600@redhat.com>

On Mon, Jan 02, 2012 at 06:55:18PM +0100, Paolo Bonzini wrote:
> On 01/02/2012 06:05 PM, Andrea Arcangeli wrote:
> > On Thu, Dec 29, 2011 at 06:01:45PM +0200, Avi Kivity wrote:
> >> On 12/29/2011 06:00 PM, Avi Kivity wrote:
> >>> The NFS client has exactly the same issue, if you mount it with the intr
> >>> option. In fact you could use the NFS client as a trivial umem/cuse
> >>> prototype.
> >>
> >> Actually, NFS can return SIGBUS, it doesn't care about restarting daemons.
> >
> > During KVMForum I suggested to a few people that it could be done
> > entirely in userland with PROT_NONE.
>
> Or MAP_NORESERVE.

MAP_NORESERVE has no effect with the default
/proc/sys/vm/overcommit_memory == 0, and in general it has no effect
until you run out of memory. It is only an on/off switch for the
commit accounting, so it's mostly a noop here.
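
To make the accounting point concrete, here is a minimal userland
sketch, assuming Linux with the default heuristic overcommit: the
MAP_NORESERVE mapping below behaves exactly like the same mapping
without the flag, pages are still allocated lazily on first touch.

/*
 * Minimal sketch, assuming Linux with the default heuristic overcommit
 * (/proc/sys/vm/overcommit_memory == 0).  MAP_NORESERVE only skips the
 * commit accounting done at mmap() time; it does not change how the
 * pages behave, so under heuristic overcommit this mapping acts
 * exactly like one created without the flag.
 */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1GB of anonymous memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	((char *)p)[0] = 1;	/* page faulted in lazily on first touch */
	munmap(p, len);
	return 0;
}
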
> Anything you do that is CUSE-based should be doable in a separate QEMU
> thread (rather than a different process that talks to CUSE). If a
> userspace CUSE-based solution could be done with acceptable performance,
> the same thing would have the same or better performance if done
> entirely within QEMU.

It should be doable somehow within qemu, and the source node could
handle one connection per vcpu thread for the async network pageins.
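
As a rough illustration of that threading model (everything below is
hypothetical, none of it comes from the posted patches): each vcpu
thread gets its own connection to the source, so one vcpu blocking on
a remote pagein does not serialize the others.

/* Hypothetical sketch only: one pagein connection per vcpu thread.
 * connect_to_source() and the one-gfn-per-request wire format are
 * assumptions, not part of the umem patches. */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 4096

int connect_to_source(void);		/* assumed helper */

struct vcpu {
	pthread_t thread;
	int pagein_fd;			/* this vcpu's private connection */
};

/* Pull one missing guest page over this vcpu's own connection; the
 * other vcpus keep running on their own sockets meanwhile. */
static int fetch_page(struct vcpu *v, uint64_t gfn, void *buf)
{
	size_t got = 0;

	if (write(v->pagein_fd, &gfn, sizeof(gfn)) != (ssize_t)sizeof(gfn))
		return -1;
	while (got < PAGE_SIZE) {
		ssize_t n = read(v->pagein_fd, (char *)buf + got,
				 PAGE_SIZE - got);
		if (n <= 0)
			return -1;
		got += n;
	}
	return 0;
}
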
> > So the problem is if we do it in
> > userland with the current functionality you'll run out of VMAs and
> > slowdown performance too much.
> >
> > But all you need is the ability to map single pages in the address
> > space.
>
> Would this also let you set different pgprots for different pages in the
> same VMA? It would be useful for write barriers in garbage collectors
> (such as boehm-gc). These do not have _that_ many VMAs, because every
> GC cycle could merge all of them back to a single VMA with PROT_READ
> permissions; however, they still put some strain on the VM subsystem.

Changing permissions sounds trickier, as more code may make
assumptions about the vma before checking the pte.
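
For reference, the write-barrier pattern mentioned above looks roughly
like this with today's API (a sketch, not boehm-gc's actual code);
every per-page mprotect() inside the heap mapping can split a vma,
which is the strain on the VM that per-page pgprots without vma
changes would avoid.

/* Sketch of a GC-style write barrier with the current API; not
 * boehm-gc's real implementation.  Each per-page mprotect() can split
 * the heap vma until the next GC cycle merges them back. */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

static void wrfault(int sig, siginfo_t *si, void *uc)
{
	uintptr_t page = (uintptr_t)si->si_addr & ~(PAGE_SIZE - 1);

	/* ... record the page as dirty for the next GC cycle ... */
	mprotect((void *)page, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

void gc_install_barrier(void *heap, size_t len)
{
	struct sigaction sa = {
		.sa_sigaction	= wrfault,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGSEGV, &sa, NULL);
	/* After a GC cycle the whole heap goes read-only again in one
	 * call, merging the split vmas back into a single one. */
	mprotect(heap, len, PROT_READ);
}
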

Adding a magic unmapped pte entry sounds fairly safe, because the
migration pte already used by migrate halts page faults and makes
them wait, so there is a precedent. I guess we could reuse the code
that already exists for the migration entry; we would just need to
fire a signal and return to userland instead of waiting. The signal
should be invoked before the page fault triggers again. Of course if
the signal handler returns and does nothing it will loop at 100% cpu
load, but that's ok. Maybe it's also possible to tweak the
permissions, but that will need a lot more thought. Specifically for
anon pages, marking them readonly sounds possible if they are
supposed to behave like regular COWs (not segfaulting or anything),
since you can already have a mixture of readonly and read-write ptes
(not to mention readonly KSM pages), but for any other case it's non
trivial. Last but not least, the API here would be like a vma-less
mremap, moving a page from one address to another without modifying
the vmas; the permission tweak sounds more like an mprotect, so I'm
unsure whether one interface could do both or whether the latter
should be an optimization to consider independently.
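
To make that flow concrete, here is a userland sketch of what the
destination side could do with such a mechanism. remap_fault_page()
stands in for the vma-less-mremap-like syscall discussed above and
fetch_page_from_source() for the network pagein; both names are purely
hypothetical.

/* Hypothetical sketch of the destination-side fault handling.
 * remap_fault_page() is the vma-less page move discussed above and
 * does not exist today; fetch_page_from_source() is an assumed helper
 * that pulls one guest page over the migration socket. */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

extern void fetch_page_from_source(void *gpa, void *buf);	/* assumed */
extern int remap_fault_page(void *dst, void *src);		/* hypothetical */

static void postcopy_fault(int sig, siginfo_t *si, void *uc)
{
	static __thread void *buf;	/* per-thread bounce page */
	void *fault_page = (void *)((uintptr_t)si->si_addr &
				    ~(PAGE_SIZE - 1));

	if (!buf)
		buf = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Receive the missing guest page into the bounce page ... */
	fetch_page_from_source(fault_page, buf);
	/* ... then move it into place without touching any vma and
	 * return, so the faulting access retries and now succeeds.
	 * If the handler returned without doing this it would just
	 * loop at 100% cpu load, as noted above. */
	remap_fault_page(fault_page, buf);
}
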

In theory I suspect we could also teach mremap to do a
non-vma-mangling mremap when the pages being moved aren't shared, so
that we can adjust the page->index of the pages instead of creating
new vmas at the dst address with an adjusted vma->vm_pgoff, but I
suspect a syscall that only works on top of fault-unmapped areas is
simpler and safer. mremap semantics require nuking the dst region
before the move starts. If we taught mremap how to handle the
fault-unmapped areas, we would only need to add one new syscall,
prepare_fault_area (or whatever name you choose).
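
Put together, the two-syscall flow on the destination would then look
roughly like this; prepare_fault_area is the name suggested above,
remap_fault_page is the hypothetical vma-less page move from the
previous sketch, and neither syscall exists today.

/* Rough sketch of the proposed two-syscall flow; the names and
 * signatures are placeholders only, neither syscall exists. */
#include <stddef.h>

extern int prepare_fault_area(void *addr, size_t len);	/* suggested above */
extern int remap_fault_page(void *dst, void *src);	/* vma-less page move */

int postcopy_dst_setup(void *guest_ram, size_t ram_size)
{
	/* 1. Mark all of guest RAM as fault-unmapped: any access fires
	 *    the signal instead of allocating memory. */
	if (prepare_fault_area(guest_ram, ram_size) < 0)
		return -1;

	/* 2. Resume the vcpus right away; missing pages are pulled in
	 *    on demand by the signal handler (previous sketch) and
	 *    pushed in the background by the migration thread, both
	 *    using the same remap_fault_page() call. */
	return 0;
}
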

The locking for a vma-less mremap still sounds tricky, but I doubt
you can avoid that locking complexity by using the chardevice, as
long as the chardevice-backed memory still allows THP, migration and
swap and you want the move to be atomic and zerocopy. I think
zerocopy would be better, especially if the network card is fast and
all vcpus are faulting into unmapped pages simultaneously, which
would trigger a heavy amount of copying from all physical cpus.

I don't mean that the current device driver doing a copy_user won't
work or is a bad idea; it's more self contained and maybe easier to
merge upstream. I'm just presenting another, more VM-integrated
zerocopy option with just 2 syscalls.

vmas must not be involved in the mremap for reliability, or too much
memory could get pinned in vmas even if we temporarily lift
/proc/sys/vm/max_map_count for the process. Plus, sending a
different signal (not sigsegv or sigbus) should be more reliable in
case the migration crashes for real.