From mboxrd@z Thu Jan 1 00:00:00 1970
From: Xiao Guangrong
Subject: Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
Date: Tue, 17 Dec 2013 13:59:04 +0800
Message-ID: <52AFE828.3010500@linux.vnet.ibm.com>
References: <1368093011-4867-1-git-send-email-wenchaolinux@gmail.com>
 <20130509141329.GC11497@suse.de>
 <518C5B5E.4010706@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Mel Gorman, linux-mm@kvack.org, Andrew Morton, hughd@google.com,
 walken@google.com, Alexander Viro, kirill.shutemov@linux.intel.com,
 Anthony Liguori, KVM
To: Stefan Hajnoczi, wenchao
Return-path:
Received: from e28smtp01.in.ibm.com ([122.248.162.1]:58587 "EHLO
 e28smtp01.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751113Ab3LQF7O (ORCPT ); Tue, 17 Dec 2013 00:59:14 -0500
Received: from /spool/local by e28smtp01.in.ibm.com with IBM ESMTP SMTP
 Gateway: Authorized Use Only! Violators will be prosecuted for from ;
 Tue, 17 Dec 2013 11:29:11 +0530
Received: from d28relay02.in.ibm.com (d28relay02.in.ibm.com [9.184.220.59])
 by d28dlp03.in.ibm.com (Postfix) with ESMTP id D2B3E1258054 for ;
 Tue, 17 Dec 2013 11:30:22 +0530 (IST)
Received: from d28av05.in.ibm.com (d28av05.in.ibm.com [9.184.220.67])
 by d28relay02.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
 rBH5x68C46727234 for ; Tue, 17 Dec 2013 11:29:06 +0530
Received: from d28av05.in.ibm.com (localhost [127.0.0.1])
 by d28av05.in.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id
 rBH5x7Ot020616 for ; Tue, 17 Dec 2013 11:29:09 +0530
In-Reply-To:
Sender: kvm-owner@vger.kernel.org
List-ID:

CCed KVM guys.

On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
> On Fri, May 10, 2013 at 4:28 AM, wenchao wrote:
>> On 2013-5-9 22:13, Mel Gorman wrote:
>>
>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
>>>>
>>>> From: Wenchao Xia
>>>>
>>>> This serial try to enable mremap syscall to cow some private memory
>>>> region,
>>>> just like what fork() did. As a result, user space application would got
>>>> a
>>>> mirror of those region, and it can be used as a snapshot for further
>>>> processing.
>>>>
>>>
>>> What not just fork()? Even if the application was threaded it should be
>>> managable to handle fork just for processing the private memory region
>>> in question. I'm having trouble figuring out what sort of application
>>> would require an interface like this.
>>>
>> It have some troubles: parent - child communication, sometimes
>> page copy.
>> I'd like to snapshot qemu guest's RAM, currently solution is:
>> 1) fork()
>> 2) pipe guest RAM data from child to parent.
>> 3) parent write down the contents.
>>
>> To avoid complex communication for data control, and file content
>> protecting, So let parent instead of child handling the data with
>> a pipe, but this brings additional copy(). I think an explicit API
>> cow mapping an memory region inside one process, could avoid it,
>> and faster and cow less pages, also make user space code nicer.
>
> A new Linux-specific API is not portable and not available on existing
> hosts. Since QEMU supports non-Linux host operating systems the
> fork() approach is preferable.
>
> If you're worried about the memory copy - which should be benchmarked
> - then vmsplice(2) can be used in the child process and splice(2) can
> be used in the parent. It probably doesn't help though since QEMU
> scans RAM pages to find all-zero pages before sending them over the
> socket, and at that point the memory copy might not make much
> difference.
>
> Perhaps other applications can use this new flag better, but for QEMU
> I think fork()'s portability is more important than the convenience of
> accessing the CoW pages in the same process.

Yup, I agree with you that the new syscall is sometimes not a good
solution.

Currently, we're working on live-update[1], which will be enabled on
QEMU first. This feature lets the guest keep running smoothly on the new
QEMU binary without a restart, which is good for applying security
updates.

In this case, we need to move the guest memory from the old QEMU
instance to the new one. fork() cannot help because we need to exec() a
new instance, after which all memory mappings are destroyed.

We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory
without a memory copy, but the performance isn't as good as we expected
due to some limitations: the page size, locking, the message-size limit
on pipes, etc. Of course, we will continue to improve this, but
wenchao's patch seems to be a new direction for us.

To coordinate with your fork() approach, maybe we can introduce a new
flag for VMAs, something like VM_KEEP_ONEXEC, to tell exec() not to
destroy such a VMA. How about this, or do you guys have a new idea? We'd
really appreciate your suggestions.

[1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
[2] https://lkml.org/lkml/2013/10/25/285
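
For readers following the thread, here is a rough sketch of the fork()-based
snapshot flow quoted above (fork, pipe the RAM from child to parent, parent
keeps the contents). It is only an illustration and not QEMU code: Python
stands in for C, a small anonymous private mapping stands in for guest RAM,
and the function name snapshot_while_mutating and the 64 KiB chunk size are
made up for the example. It assumes a Linux host.

```python
import mmap
import os


def snapshot_while_mutating(size=1 << 20):
    """Return (snapshot, current) for a private region dirtied after fork()."""
    # Hypothetical stand-in for guest RAM: an anonymous MAP_PRIVATE mapping.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region[:] = b"A" * size  # contents at snapshot time

    read_fd, write_fd = os.pipe()
    pid = os.fork()  # step 1: the child gets a copy-on-write view
    if pid == 0:
        # Child: stream the frozen view to the parent (step 2).
        os.close(read_fd)
        status = 1
        try:
            view = memoryview(region)
            sent = 0
            while sent < size:
                sent += os.write(write_fd, view[sent:sent + 65536])
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path

    # Parent: keep "running the guest" by dirtying every page.
    os.close(write_fd)
    region[:] = b"B" * size

    # Step 3: collect the snapshot. This is the extra copy discussed above.
    with os.fdopen(read_fd, "rb") as reader:
        snapshot = reader.read()
    os.waitpid(pid, 0)

    current = region[:]
    region.close()
    return snapshot, current


if __name__ == "__main__":
    snap, cur = snapshot_while_mutating()
    print(snap[:4], cur[:4])
```

The parent overwrites the region right after fork(), yet the bytes arriving
through the pipe still hold the pre-fork contents, because the kernel copies
each page on the parent's first write. The same property is why this approach
cannot carry memory across exec(): the private mappings belong to the old
address space, which exec() tears down.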