From: Wenchao Xia
Date: Thu, 15 Aug 2013 16:03:47 +0800
Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?
To: Stefan Hajnoczi
Cc: Anthony Liguori, kvm, Marcelo Tosatti, qemu-devel, Chijianchun,
    Paul Brook, Alex Bligh, fred.konrad@greensocs.com, Avi Kivity
Message-ID: <520C8B63.2060304@linux.vnet.ibm.com>
In-Reply-To: <20130815074919.GA22521@stefanha-thinkpad.redhat.com>

On 2013-8-15 15:49, Stefan Hajnoczi wrote:
> On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:
>> On 2013-8-14 15:53, Stefan Hajnoczi wrote:
>>> On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
>>>> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
>>>>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
>>>>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote:
>>>>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2)
>>>>>>>>> to capture the state of guest RAM and then send it back to the
>>>>>>>>> parent process. The guest is only paused for a brief instant
>>>>>>>>> during fork(2) and can continue to run afterwards.
>>>>>>>>
>>>>>>>> How would you capture the state of emulated hardware which might
>>>>>>>> not be in the guest RAM?
>>>>>>>
>>>>>>> Exactly the same way vmsave works today. It calls the device's save
>>>>>>> functions which serialize state to file.
>>>>>>>
>>>>>>> The difference between today's vmsave and the fork(2) approach is
>>>>>>> that QEMU does not need to wait for guest RAM to be written to file
>>>>>>> before resuming the guest.
>>>>>>>
>>>>>>> Stefan
>>>>>>
>>>>>> I have a worry about what glib says:
>>>>>>
>>>>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
>>>>>> using the mainloop must either exec() or exit() from the child
>>>>>> without returning to the mainloop."
>>>>>
>>>>> This is fine, the child just writes out the memory pages and exits.
>>>>> It never returns to the glib mainloop.
>>>>>
>>>>>> There is another way to do it: intercept the writes in kvm.ko (or
>>>>>> other kernel code). Since the key is intercepting the memory changes,
>>>>>> and we can already do that in userspace in TCG mode, this would add
>>>>>> the missing part for KVM mode. Another benefit of this approach is
>>>>>> that memory usage can be controlled: for example, with an ioctl(),
>>>>>> set up a fixed-size buffer in which the kernel keeps the intercepted
>>>>>> write data, to avoid switching back to the userspace QEMU code too
>>>>>> frequently. When the buffer is full, return to the userspace QEMU
>>>>>> code and let it save the data to disk. I haven't checked exactly how
>>>>>> Intel guest mode handles page faults, so I can't estimate the cost of
>>>>>> switching between guest mode and root mode, but it should not be
>>>>>> worse than fork().
>>>>>
>>>>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>>>>> require kernel changes. A kvm.ko kernel change also won't be supported
>>>>> on existing KVM hosts. These are big drawbacks and the kernel approach
>>>>> would need to be significantly better than plain old fork(2) to make
>>>>> it worthwhile.
>>>>>
>>>>> Stefan
>>>>
>>>> I think the advantage is that memory usage is predictable, so the peak
>>>> can be avoided by always saving the changed pages first; fork() does
>>>> not know which pages are changed. I am not sure whether this would be a
>>>> serious issue when the server's memory is heavily committed, for
>>>> example a 24G host running two 11G guests to provide a powerful virtual
>>>> server.
>>>
>>> Memory usage is predictable but guest uptime is unpredictable because
>>> it waits until memory is written out. This defeats the point of "live"
>>> savevm. The guest may be stalled arbitrarily.
>>>
>> I think it is adjustable. There is not much difference from fork(),
>> except more precise control over the changed pages. The kernel
>> intercepts the change and stores the changed page in another page,
>> similar to fork(). When the userspace QEMU code runs, it saves some
>> pages to disk. The buffer acts like a lubricant: when the buffer is at
>> its maximum it behaves like fork() and the guest runs more lively; when
>> the buffer is 0 the guest runs less lively. I think it lets the user
>> find a good balance point with a single parameter.
>> It is harder to implement; I just want to show the idea.
>
> You are right. You could set a bigger buffer size to increase guest
> uptime.
>
>>> The fork child can minimize the chance of out-of-memory by using
>>> madvise(MADV_DONTNEED) after pages have been written out.
>>
>> It seems there is no way to make sure the written-out page is a changed
>> page, so there is a good chance that the written one is unchanged and
>> still in use by the other QEMU process.
>
> The KVM dirty log tells you which pages were touched. The fork child
> process could give priority to the pages which have been touched by the
> guest. They must be written out and marked madvise(MADV_DONTNEED) as
> soon as possible.
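
To make sure I understand the proposal, here is a rough sketch of what
the forked child could do. This is not real QEMU code: ram, npages,
page_size, the dirty bitmap and write_page() below are placeholders for
the real RAMBlock/migration machinery.

/* Rough sketch, not QEMU code: child-side writeout after fork(2).
 * "ram" is the guest RAM mapping, "dirty" a bitmap with one bit per
 * page (e.g. taken from the KVM dirty log just before fork), and
 * "write_page" whatever serializes one page into the savevm stream. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static void child_save_ram(int fd, uint8_t *ram, size_t npages,
                           size_t page_size, const uint64_t *dirty,
                           void (*write_page)(int fd, size_t idx,
                                              const void *page))
{
    /* Pass 0 writes the pages the guest has touched, pass 1 the rest,
     * so the pages most likely to be copied-on-write by the running
     * parent are written and released first. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < npages; i++) {
            bool is_dirty = dirty[i / 64] & (1ULL << (i % 64));
            if (is_dirty != (pass == 0)) {
                continue;
            }
            uint8_t *page = ram + i * page_size;
            write_page(fd, i, page);
            /* give the COW copy back to the kernel right away */
            madvise(page, page_size, MADV_DONTNEED);
        }
    }
    _exit(0); /* never return to the glib mainloop, as the GLib docs require */
}

If that is what you mean, writing the touched pages first and dropping
them immediately should keep the extra copy-on-write memory close to the
working set the guest dirties while the writeout is in progress.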
Hmm, if the dirty log still works normally in the child process and
reflects the memory status of the parent rather than the child's, then
the problem could be solved like this: when there are too many dirty
pages, the child tells the parent to wait for some time. But I haven't
checked whether kvm.ko behaves like that.

> I haven't looked at the vmsave data format yet to see if memory pages
> can be saved in random order, but this might work. It reduces the
> likelihood of copy-on-write memory growth.
>
> Stefan

--
Best Regards

Wenchao Xia
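
P.S. For reference, this is roughly how userspace fetches the dirty log
from kvm.ko today (the memory slot must have been registered with
KVM_MEM_LOG_DIRTY_PAGES; vm_fd, slot_id and npages stand for whatever
the caller already has). Whether it still reflects the parent's memory
when issued from the forked child is exactly the part I haven't checked.

/* Sketch: fetch (and clear) the dirty bitmap for one memory slot. */
#include <linux/kvm.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static unsigned long *get_dirty_bitmap(int vm_fd, int slot_id, size_t npages)
{
    /* one bit per page, rounded up to 64-bit words as kvm.ko expects */
    size_t len = ((npages + 63) / 64) * 8;
    unsigned long *bitmap = calloc(1, len);
    struct kvm_dirty_log log;

    if (!bitmap)
        return NULL;
    memset(&log, 0, sizeof(log));
    log.slot = slot_id;
    log.dirty_bitmap = bitmap;
    if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
        free(bitmap);
        return NULL;
    }
    return bitmap;
}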