Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Avi Kivity <avi@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org, satoshi.itoh@aist.go.jp,
	t.hirofuchi@aist.go.jp, qemu-devel@nongnu.org,
	Isaku Yamahata <yamahata@valinux.co.jp>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
Date: Thu, 12 Jan 2012 15:57:47 +0200	[thread overview]
Message-ID: <4F0EE6DB.4080702@redhat.com> (raw)
In-Reply-To: <20120103142541.GK4172@redhat.com>

On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
>  
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> > 
> > Would this also let you set different pgprots for different pages in the 
> > same VMA?  It would be useful for write barriers in garbage collectors 
> > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again. 

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task.  I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
ptes is cheap, removing them is not.  I wonder if we can make a
write-only page?  Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC.  Probably too wierd.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.


-- 
error compiling committee.c: too many arguments to function

next prev parent reply	other threads:[~2012-01-12 13:58 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-29  1:26 [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
2011-12-29  1:26 ` [Qemu-devel] [PATCH 1/2] export necessary symbols Isaku Yamahata
2011-12-29  1:26 ` [Qemu-devel] [PATCH 2/2] umem: chardevice for kvm postcopy Isaku Yamahata
2011-12-29 11:17   ` Avi Kivity
2011-12-29 12:22     ` Isaku Yamahata
2011-12-29 12:47       ` Avi Kivity
2012-01-05  4:08   ` [Qemu-devel] 回复: " thfbjyddx
2012-01-05 10:48     ` [Qemu-devel] 回??: " Isaku Yamahata
2012-01-05 11:10       ` Tommy
2012-01-05 12:18         ` Isaku Yamahata
2012-01-05 15:02           ` Tommy Tang
     [not found]           ` <4F05BB68.9050302@hotmail.com>
2012-01-05 15:05             ` Tommy Tang
2012-01-06  7:02           ` thfbjyddx
2012-01-06 17:13             ` [Qemu-devel] 回??: [PATCH 2/2] umem: chardevice for kvm?postcopy Isaku Yamahata
2011-12-29  1:31 ` [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy Isaku Yamahata
2011-12-29 11:24 ` Avi Kivity
2011-12-29 12:39   ` Isaku Yamahata
2011-12-29 12:55     ` Avi Kivity
2011-12-29 13:49       ` Isaku Yamahata
2011-12-29 13:52         ` Avi Kivity
2011-12-29 14:18           ` Isaku Yamahata
2011-12-29 14:35             ` Avi Kivity
2011-12-29 14:49               ` Isaku Yamahata
2011-12-29 14:55                 ` Avi Kivity
2011-12-29 15:53                   ` Isaku Yamahata
2011-12-29 16:00                     ` Avi Kivity
2011-12-29 16:01                       ` Avi Kivity
2012-01-02 17:05                         ` Andrea Arcangeli
2012-01-02 17:55                           ` Paolo Bonzini
2012-01-03 14:25                             ` Andrea Arcangeli
2012-01-12 13:57                               ` Avi Kivity [this message]
2012-01-13  2:06                                 ` Andrea Arcangeli
2012-01-04  3:03                           ` Isaku Yamahata
2012-01-12 13:59                             ` Avi Kivity
2012-01-13  1:09                               ` Benoit Hudzia
2012-01-13  1:31                                 ` Takuya Yoshikawa
2012-01-13  9:40                                   ` Benoit Hudzia
2012-01-13  2:03                                 ` Isaku Yamahata
2012-01-13  2:15                                   ` Isaku Yamahata
2012-01-13  9:55                                     ` Benoit Hudzia
2012-01-13  9:48                                   ` Benoit Hudzia
2012-01-13  2:09                               ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F0EE6DB.4080702@redhat.com \
    --to=avi@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=satoshi.itoh@aist.go.jp \
    --cc=t.hirofuchi@aist.go.jp \
    --cc=yamahata@valinux.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).