From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:47692)
    by lists.gnu.org with esmtp (Exim 4.71)
    (envelope-from )
    id 1RlLAL-0006yI-9U
    for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:58:11 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from )
    id 1RlLAC-0005Su-FL
    for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:58:05 -0500
Received: from mx1.redhat.com ([209.132.183.28]:6590)
    by eggs.gnu.org with esmtp (Exim 4.71)
    (envelope-from )
    id 1RlLAC-0005Sk-8c
    for qemu-devel@nongnu.org; Thu, 12 Jan 2012 08:57:56 -0500
Message-ID: <4F0EE6DB.4080702@redhat.com>
Date: Thu, 12 Jan 2012 15:57:47 +0200
From: Avi Kivity
MIME-Version: 1.0
References: <4EFC70BA.1080808@redhat.com>
    <20111229141802.GI19274@valinux.co.jp>
    <4EFC7AB8.807@redhat.com>
    <20111229144943.GJ19274@valinux.co.jp>
    <4EFC7F4F.9010202@redhat.com>
    <20111229155328.GK19274@valinux.co.jp>
    <4EFC8EAD.80306@redhat.com>
    <4EFC8EE9.9030802@redhat.com>
    <20120102170551.GF4172@redhat.com>
    <4F01EF86.2050600@redhat.com>
    <20120103142541.GK4172@redhat.com>
In-Reply-To: <20120103142541.GK4172@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 0/2][RFC] postcopy migration: Linux char device for postcopy
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Andrea Arcangeli
Cc: kvm@vger.kernel.org, satoshi.itoh@aist.go.jp, t.hirofuchi@aist.go.jp,
    qemu-devel@nongnu.org, Isaku Yamahata, Paolo Bonzini

On 01/03/2012 04:25 PM, Andrea Arcangeli wrote:
> > > > So the problem is if we do it in
> > > userland with the current functionality you'll run out of VMAs and
> > > slowdown performance too much.
> > >
> > > But all you need is the ability to map single pages in the address
> > > space.
> >
> > Would this also let you set different pgprots for different pages in the
> > same VMA? It would be useful for write barriers in garbage collectors
> > (such as boehm-gc).
> > These do not have _that_ many VMAs, because every
> > GC cycles could merge all of them back to a single VMA with PROT_READ
> > permissions; however, they still put some strain on the VM subsystem.
>
> Changing permission sounds more tricky as more code may make
> assumptions on the vma before checking the pte.
>
> Adding a magic unmapped pte entry sounds fairly safe because there's
> the migration pte already used by migrate which halts page faults and
> wait, that creates a precedent. So I guess we could reuse the same
> code that already exists for the migration entry and we'd need to fire
> a signal and returns to userland instead of waiting. The signal should
> be invoked before the page fault will trigger again.

Delivering signals is slow, and you can't use signalfd for it, because
that can be routed to a different task. I would like an fd based
protocol with an explicit ack so the other end can be implemented by the
kernel, to use with RDMA. Kind of like how vhost-net talks to a guest
via a kvm ioeventfd/irqfd.

> Of course if the
> signal returns and does nothing it'll loop at 100% cpu load but that's
> ok. Maybe it's possible to tweak the permissions but it will need a
> lot more thoughts. Specifically for anon pages marking them readonly
> sounds possible if they are supposed to behave like regular COWs (not
> segfaulting or anything), as you already can have a mixture of
> readonly and read-write ptes (not to tell readonly KSM pages), but for
> any other case it's non trivial. Last but not the least the API here
> would be like a vma-less-mremap, moving a page from one address to
> another without modifying the vmas, the permission tweak sounds more
> like an mprotect, so I'm unsure if it could do both or if it should be
> an optimization to consider independently.

Doesn't this stuff require tlb flushes across all threads?
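
To make the "fd based protocol with an explicit ack" idea above concrete, here is a minimal userspace sketch. It is an illustration only and not taken from the patch series or from any kernel interface: it uses two ordinary pipes between threads as a stand-in for whatever fd the kernel would expose, and the wire format (one 64-bit page offset per message) and the names serve_faults, request_page and fetch_page are all hypothetical.

```python
import os
import struct
import threading

# Hypothetical wire format: one native-endian 64-bit page offset per message.
MSG = struct.Struct("=Q")


def serve_faults(req_rd, ack_wr, fetch_page, memory, page_size):
    """Fill requested pages and ack each one, until the request fd hits EOF."""
    while True:
        data = os.read(req_rd, MSG.size)
        if len(data) < MSG.size:
            break  # writer closed the request fd
        (offset,) = MSG.unpack(data)
        # "Map" the page: here just a copy into a bytearray.
        memory[offset:offset + page_size] = fetch_page(offset)
        # Explicit ack: the faulting side may retry only after this arrives.
        os.write(ack_wr, MSG.pack(offset))


def request_page(req_wr, ack_rd, offset):
    """Faulting side: post the missing offset, then block until it is acked."""
    os.write(req_wr, MSG.pack(offset))
    (acked,) = MSG.unpack(os.read(ack_rd, MSG.size))
    return acked


def demo():
    page_size = 4096
    memory = bytearray(4 * page_size)
    req_rd, req_wr = os.pipe()
    ack_rd, ack_wr = os.pipe()

    def fetch_page(offset):
        # Stand-in for pulling the page from the migration source.
        return bytes([offset // page_size + 1]) * page_size

    server = threading.Thread(
        target=serve_faults,
        args=(req_rd, ack_wr, fetch_page, memory, page_size),
    )
    server.start()
    acked = request_page(req_wr, ack_rd, 2 * page_size)
    os.close(req_wr)  # EOF tells the server loop to exit
    server.join()
    for fd in (req_rd, ack_rd, ack_wr):
        os.close(fd)
    return acked, memory[2 * page_size], memory[0]
```

Calling demo() returns (8192, 3, 0): the ack names the requested offset, that page has been filled, and an unrequested page is untouched. The point of the ack is that the serving end could be anything that speaks the same fd protocol, which is what would let the kernel take that role.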
>
> In theory I suspect we could also teach mremap to do a
> not-vma-mangling mremap if we move pages that aren't shared and so we
> can adjust the page->index of the pages, instead of creating new vmas
> at the dst address with an adjusted vma->vm_pgoff, but I suspect a
> syscall that only works on top of fault-unmapped areas is simpler and
> safer. mremap semantics requires nuking the dst region before the move
> starts. If we would teach mremap how to handle the fault-unmapped
> areas we could just add one syscall prepare_fault_area (or whatever
> name you choose).
>
> The locking of doing a vma-less-mremap still sounds tricky but I doubt
> you can avoid that locking complexity by using the chardevice as long
> as the chardevice backed-memory still allows THP, migration and swap,
> if you want to do it atomic-zerocopy and I think zerocopy would be
> better especially if the network card is fast and all vcpus are
> faulting into unmapped pages simultaneously so triggering heavy amount
> of copying from all physical cpus.
>
> I don't mean the current device driver doing a copy_user won't work or
> is bad idea, it's more self contained and maybe easier to merge
> upstream. I'm just presenting another option more VM integrated
> zerocopy with just 2 syscalls.

Zerocopy is really interesting here, esp. w/ RDMA. But while adding
ptes is cheap, removing them is not. I wonder if we can make a
write-only page? Of course it's unmapped for cpu access, but we can
allow DMA write access from the NIC. Probably too weird.

>
> vmas must not be involved in the mremap for reliability, or too much
> memory could get pinned in vmas even if we temporary lift the
> /proc/sys/vm/max_map_count for the process. Plus sending another
> signal (not sigsegv or sigbus) should be more reliable in case the
> migration crashes for real.

-- 
error compiling committee.c: too many arguments to function