From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:47692)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <avi@redhat.com>) id 1RlLAL-0006yI-9U
	for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:58:11 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <avi@redhat.com>) id 1RlLAC-0005Su-FL
	for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:58:05 -0500
Received: from mx1.redhat.com ([209.132.183.28]:6590)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <avi@redhat.com>) id 1RlLAC-0005Sk-8c
	for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:57:56 -0500
Message-ID: <4F0EE6DB.4080702@redhat.com>
Date: Thu, 12 Jan 2012 15:57:47 +0200
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
References: <4EFC70BA.1080808@redhat.com>
	<20111229141802.GI19274@valinux.co.jp> <4EFC7AB8.807@redhat.com>
	<20111229144943.GJ19274@valinux.co.jp>
	<4EFC7F4F.9010202@redhat.com>
	<20111229155328.GK19274@valinux.co.jp>
	<4EFC8EAD.80306@redhat.com> <4EFC8EE9.9030802@redhat.com>
	<20120102170551.GF4172@redhat.com> <4F01EF86.2050600@redhat.com>
	<20120103142541.GK4172@redhat.com>
In-Reply-To: <20120103142541.GK4172@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char
	device for postcopy
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org, satoshi.itoh@aist.go.jp, t.hirofuchi@aist.go.jp, qemu-devel@nongnu.org, Isaku Yamahata <yamahata@valinux.co.jp>, Paolo Bonzini <pbonzini@redhat.com>

On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
>  
> > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> > 
> > Would this also let you set different pgprots for different pages in the 
> > same VMA?  It would be useful for write barriers in garbage collectors 
> > (such as boehm-gc).  These do not have _that_ many VMAs, because every 
> > GC cycles could merge all of them back to a single VMA with PROT_READ 
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again. 

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task.  I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA.  Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?

>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA.  But while adding
ptes is cheap, removing them is not.  I wonder if we can make a
write-only page?  Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC.  Probably too wierd.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.


-- 
error compiling committee.c: too many arguments to function