From: Marcelo Tosatti
Subject: Re: [RFC PATCH V1 0/6] mm: add a new option MREMAP_DUP to mmrep syscall
Date: Tue, 31 Dec 2013 16:53:29 -0200
Message-ID: <20131231185328.GA22414@amt.cnet>
References: <1368093011-4867-1-git-send-email-wenchaolinux@gmail.com> <20130509141329.GC11497@suse.de> <518C5B5E.4010706@gmail.com> <52AFE828.3010500@linux.vnet.ibm.com> <20131230202342.GA7973@amt.cnet> <943AC3BD-C4EB-4B6C-BE34-AB921938AAF0@linux.vnet.ibm.com>
In-Reply-To: <943AC3BD-C4EB-4B6C-BE34-AB921938AAF0@linux.vnet.ibm.com>
To: Xiao Guangrong
Cc: Stefan Hajnoczi, wenchao, Mel Gorman, linux-mm@kvack.org, Andrew Morton, hughd@google.com, walken@google.com, Alexander Viro, kirill.shutemov@linux.intel.com, Anthony Liguori, KVM

On Tue, Dec 31, 2013 at 08:06:51PM +0800, Xiao Guangrong wrote:
>
> On Dec 31, 2013, at 4:23 AM, Marcelo Tosatti wrote:
>
> > On Tue, Dec 17, 2013 at 01:59:04PM +0800, Xiao Guangrong wrote:
> >>
> >> CCed KVM guys.
> >>
> >> On 05/10/2013 01:11 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 10, 2013 at 4:28 AM, wenchao wrote:
> >>>> On 2013-5-9 22:13, Mel Gorman wrote:
> >>>>
> >>>>> On Thu, May 09, 2013 at 05:50:05PM +0800, wenchaolinux@gmail.com wrote:
> >>>>>>
> >>>>>> From: Wenchao Xia
> >>>>>>
> >>>>>> This series tries to enable the mremap syscall to COW a private memory
> >>>>>> region, just like fork() does. As a result, a user space application
> >>>>>> would get a mirror of that region, which can be used as a snapshot for
> >>>>>> further processing.
> >>>>>>
> >>>>>
> >>>>> Why not just fork()? Even if the application was threaded it should be
> >>>>> manageable to handle fork just for processing the private memory region
> >>>>> in question. I'm having trouble figuring out what sort of application
> >>>>> would require an interface like this.
> >>>>>
> >>>> It has some problems: parent-child communication and, sometimes,
> >>>> page copying.
> >>>> I'd like to snapshot a qemu guest's RAM; the current solution is:
> >>>> 1) fork()
> >>>> 2) pipe guest RAM data from child to parent.
> >>>> 3) parent writes out the contents.
> >>>>
> >>>> To avoid complex communication for data control and file content
> >>>> protection, the parent rather than the child handles the data, via
> >>>> a pipe, but this brings an additional copy. I think an explicit API
> >>>> that COW-maps a memory region inside one process could avoid it; it
> >>>> would be faster, COW fewer pages, and make the user space code nicer.
> >>>
> >>> A new Linux-specific API is not portable and not available on existing
> >>> hosts. Since QEMU supports non-Linux host operating systems the
> >>> fork() approach is preferable.
> >>>
> >>> If you're worried about the memory copy - which should be benchmarked
> >>> - then vmsplice(2) can be used in the child process and splice(2) can
> >>> be used in the parent. It probably doesn't help though since QEMU
> >>> scans RAM pages to find all-zero pages before sending them over the
> >>> socket, and at that point the memory copy might not make much
> >>> difference.
> >>>
> >>> Perhaps other applications can use this new flag better, but for QEMU
> >>> I think fork()'s portability is more important than the convenience of
> >>> accessing the CoW pages in the same process.
> >>
> >> Yup, I agree with you that the new syscall is sometimes not a good solution.
> >>
> >> Currently, we're working on live-update[1], which will be enabled on Qemu first.
> >> This feature lets the guest run on the new Qemu binary smoothly without
> >> a restart; it's good for us to do security updates.
> >>
> >> In this case, we need to move the guest memory from the old qemu instance to the
> >> new one. fork() cannot help because we need to exec() a new instance, after
> >> which all memory mappings are destroyed.
> >>
> >> We tried to enable SPLICE_F_MOVE[2] for vmsplice() to move the memory without
> >> a memory copy, but the performance isn't as good as we expected, due to
> >> some limitations: the page size, locking, the message-size limit on pipes, etc.
> >> Of course, we will continue to improve this, but wenchao's patch seems a new
> >> direction for us.
> >>
> >> To coordinate with your fork() approach, maybe we can introduce a new flag
> >> for VMA, something like VM_KEEP_ONEXEC, to tell exec() not to destroy
> >> this VMA. How about this, or do you guys have a new idea? We'd really
> >> appreciate your suggestions.
> >>
> >> [1] http://marc.info/?l=qemu-devel&m=138597598700844&w=2
> >> [2] https://lkml.org/lkml/2013/10/25/285
> >
> > Hi,
> >
>
> Hi Marcelo,
>
>
> > What is the purpose of snapshotting guest RAM here, in the context of
> > local migration?
>
> RAM snapshotting and local migration are on different tracks.
> The reason I asked for you guys' suggestions here is because I thought
> they need to do the same thing: move memory from one process
> to another in an efficient way. Your idea? :)

Another possibility is to use memory that is not anonymous for guest
RAM, such as hugetlbfs or tmpfs.

IIRC ksm and thp have limitations wrt tmpfs.

Still curious about RAM snapshotting.
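
To illustrate the non-anonymous-memory suggestion, here is a minimal sketch (not QEMU code) of why file-backed memory survives exec() where anonymous memory does not. It assumes a Linux-like host; the names `handover`, `CHILD`, and `RAM_SIZE` are hypothetical, and a small temporary file (on /dev/shm when that tmpfs mount is writable) stands in for guest RAM. The "old instance" writes into a MAP_SHARED mapping, then a freshly exec'ed interpreter inherits the file descriptor, maps it again, and reads the same bytes without any pipe copy.

```python
import mmap
import os
import subprocess
import sys
import tempfile

RAM_SIZE = 4 * mmap.PAGESIZE  # hypothetical stand-in for guest RAM

# Program run by the "new instance" after exec(): map the inherited
# descriptor and print the first n bytes found there.
CHILD = """
import mmap, sys
fd, size, n = (int(a) for a in sys.argv[1:4])
ram = mmap.mmap(fd, size, mmap.MAP_SHARED, mmap.PROT_READ)
sys.stdout.write(ram[:n].decode())
"""


def handover(payload: bytes) -> str:
    """Write payload into file-backed memory, exec a new process, read it back there."""
    # Assumption: /dev/shm is a tmpfs mount; otherwise use the default temp dir.
    shm = "/dev/shm"
    shm_dir = shm if os.path.isdir(shm) and os.access(shm, os.W_OK) else None
    with tempfile.TemporaryFile(dir=shm_dir) as backing:
        backing.truncate(RAM_SIZE)
        fd = backing.fileno()
        ram = mmap.mmap(fd, RAM_SIZE, mmap.MAP_SHARED,
                        mmap.PROT_READ | mmap.PROT_WRITE)
        ram[:len(payload)] = payload
        # The mapping itself is torn down by exec(), but the pages belong to
        # the file, so the new process only needs the descriptor to remap them.
        result = subprocess.run(
            [sys.executable, "-c", CHILD, str(fd), str(RAM_SIZE), str(len(payload))],
            pass_fds=(fd,), capture_output=True, check=True)
        ram.close()
        return result.stdout.decode()


if __name__ == "__main__":
    print(handover(b"guest-ram"))
```

This only shows the handover mechanism; it says nothing about the ksm/thp limitations mentioned above, and hugetlbfs would need a mounted hugetlbfs path and huge-page-aligned sizes instead of the temporary file used here.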