From: Wenchao Xia
Date: Thu, 15 Aug 2013 16:03:47 +0800
Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?
To: Stefan Hajnoczi
Cc: Anthony Liguori, kvm, Marcelo Tosatti, qemu-devel, Chijianchun,
    Paul Brook, Alex Bligh, fred.konrad@greensocs.com, Avi Kivity
Message-ID: <520C8B63.2060304@linux.vnet.ibm.com>
In-Reply-To: <20130815074919.GA22521@stefanha-thinkpad.redhat.com>

On 2013-8-15 15:49, Stefan Hajnoczi wrote:
> On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:
>> On 2013-8-14 15:53, Stefan Hajnoczi wrote:
>>> On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
>>>> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
>>>>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
>>>>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote:
>>>>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2)
>>>>>>>>> to capture the state of guest RAM and then send it back to the
>>>>>>>>> parent process. The guest is only paused for a brief instant
>>>>>>>>> during fork(2) and can continue to run afterwards.
>>>>>>>>
>>>>>>>> How would you capture the state of emulated hardware which might
>>>>>>>> not be in the guest RAM?
>>>>>>>
>>>>>>> Exactly the same way vmsave works today. It calls the device's save
>>>>>>> functions which serialize state to file.
>>>>>>>
>>>>>>> The difference between today's vmsave and the fork(2) approach is
>>>>>>> that QEMU does not need to wait for guest RAM to be written to file
>>>>>>> before resuming the guest.
>>>>>>>
>>>>>>> Stefan
>>>>>>
>>>>>> I have a worry about what glib says:
>>>>>>
>>>>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
>>>>>> using the mainloop must either exec() or exit() from the child
>>>>>> without returning to the mainloop."
>>>>>
>>>>> This is fine, the child just writes out the memory pages and exits.
>>>>> It never returns to the glib mainloop.
>>>>>
>>>>>> There is another way to do it: intercept the writes in kvm.ko (or
>>>>>> other kernel code). Since the key is intercepting the memory changes,
>>>>>> and we can already do that in userspace in TCG mode, this would add
>>>>>> the missing part for KVM mode. Another benefit of this approach is
>>>>>> that memory usage can be controlled: for example, with an ioctl(),
>>>>>> set up a fixed-size buffer in which the kernel keeps the intercepted
>>>>>> write data, to avoid switching back to the userspace QEMU code too
>>>>>> frequently. When the buffer is full, return to the userspace QEMU
>>>>>> code and let it save the data to disk. I haven't checked exactly how
>>>>>> Intel guest mode handles page faults, so I can't estimate the cost of
>>>>>> switching between guest mode and root mode, but it should not be
>>>>>> worse than fork().
>>>>>
>>>>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>>>>> require kernel changes. A kvm.ko kernel change also won't be supported
>>>>> on existing KVM hosts. These are big drawbacks and the kernel approach
>>>>> would need to be significantly better than plain old fork(2) to make
>>>>> it worthwhile.
>>>>>
>>>>> Stefan
>>>>
>>>> I think the advantage is that memory usage is predictable, so the peak
>>>> can be avoided by always saving the changed pages first; fork() does
>>>> not know which pages are changed. I am not sure whether this would be a
>>>> serious issue when the server's memory is heavily committed, for
>>>> example a 24G host running two 11G guests to provide a powerful virtual
>>>> server.
>>>
>>> Memory usage is predictable but guest uptime is unpredictable because
>>> it waits until memory is written out. This defeats the point of "live"
>>> savevm. The guest may be stalled arbitrarily.
>>>
>> I think it is adjustable. There is not much difference from fork(),
>> except more precise control over the changed pages. The kernel
>> intercepts the change and stores the changed page in another page,
>> similar to fork(). When the userspace QEMU code runs, it saves some
>> pages to disk. The buffer acts like a lubricant: when the buffer is at
>> its maximum it behaves like fork() and the guest runs more lively; when
>> the buffer is 0 the guest runs less lively. I think it lets the user
>> find a good balance point with a single parameter.
>> It is harder to implement; I just want to show the idea.
>
> You are right. You could set a bigger buffer size to increase guest
> uptime.
>
>>> The fork child can minimize the chance of out-of-memory by using
>>> madvise(MADV_DONTNEED) after pages have been written out.
>>
>> It seems there is no way to make sure the written-out page is a changed
>> page, so there is a good chance that the written one is unchanged and
>> still in use by the other QEMU process.
>
> The KVM dirty log tells you which pages were touched. The fork child
> process could give priority to the pages which have been touched by the
> guest. They must be written out and marked madvise(MADV_DONTNEED) as
> soon as possible.
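
To make sure I understand the proposal, here is a rough sketch of what
the forked child could do. This is not real QEMU code: ram, npages,
page_size, the dirty bitmap and write_page() below are placeholders for
the real RAMBlock/migration machinery.

/* Rough sketch, not QEMU code: child-side writeout after fork(2).
 * "ram" is the guest RAM mapping, "dirty" a bitmap with one bit per
 * page (e.g. taken from the KVM dirty log just before fork), and
 * "write_page" whatever serializes one page into the savevm stream. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static void child_save_ram(int fd, uint8_t *ram, size_t npages,
                           size_t page_size, const uint64_t *dirty,
                           void (*write_page)(int fd, size_t idx,
                                              const void *page))
{
    /* Pass 0 writes the pages the guest has touched, pass 1 the rest,
     * so the pages most likely to be copied-on-write by the running
     * parent are written and released first. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < npages; i++) {
            bool is_dirty = dirty[i / 64] & (1ULL << (i % 64));
            if (is_dirty != (pass == 0)) {
                continue;
            }
            uint8_t *page = ram + i * page_size;
            write_page(fd, i, page);
            /* give the COW copy back to the kernel right away */
            madvise(page, page_size, MADV_DONTNEED);
        }
    }
    _exit(0); /* never return to the glib mainloop, as the GLib docs require */
}

If that is what you mean, writing the touched pages first and dropping
them immediately should keep the extra copy-on-write memory close to the
working set the guest dirties while the writeout is in progress.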
Hmm, if the dirty log still works normally in the child process and
reflects the memory status of the parent rather than the child's, then
the problem could be solved like this: when there are too many dirty
pages, the child tells the parent to wait for some time. But I haven't
checked whether kvm.ko behaves like that.

> I haven't looked at the vmsave data format yet to see if memory pages
> can be saved in random order, but this might work. It reduces the
> likelihood of copy-on-write memory growth.
>
> Stefan

--
Best Regards

Wenchao Xia
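
P.S. For reference, this is roughly how userspace fetches the dirty log
from kvm.ko today (the memory slot must have been registered with
KVM_MEM_LOG_DIRTY_PAGES; vm_fd, slot_id and npages stand for whatever
the caller already has). Whether it still reflects the parent's memory
when issued from the forked child is exactly the part I haven't checked.

/* Sketch: fetch (and clear) the dirty bitmap for one memory slot. */
#include <linux/kvm.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static unsigned long *get_dirty_bitmap(int vm_fd, int slot_id, size_t npages)
{
    /* one bit per page, rounded up to 64-bit words as kvm.ko expects */
    size_t len = ((npages + 63) / 64) * 8;
    unsigned long *bitmap = calloc(1, len);
    struct kvm_dirty_log log;

    if (!bitmap)
        return NULL;
    memset(&log, 0, sizeof(log));
    log.slot = slot_id;
    log.dirty_bitmap = bitmap;
    if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
        free(bitmap);
        return NULL;
    }
    return bitmap;
}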