From: Stefan Hajnoczi
Date: Wed, 14 Aug 2013 09:53:56 +0200
Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?
To: Wenchao Xia
Cc: Anthony Liguori, kvm, Paul Brook, Marcelo Tosatti, qemu-devel, Chijianchun, Avi Kivity, Alex Bligh, fred.konrad@greensocs.com

On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
>
>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
>>>
>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
>>>
>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote:
>>>>>
>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
>>>>>
>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to
>>>>>> capture the state of guest RAM and then send it back to the parent
>>>>>> process. The guest is only paused for a brief instant during fork(2)
>>>>>> and can continue to run afterwards.
>>>>>
>>>>> How would you capture the state of emulated hardware which might not
>>>>> be in the guest RAM?
>>>>
>>>> Exactly the same way vmsave works today. It calls the device's save
>>>> functions, which serialize state to file.
>>>>
>>>> The difference between today's vmsave and the fork(2) approach is that
>>>> QEMU does not need to wait for guest RAM to be written to file before
>>>> resuming the guest.
>>>>
>>>> Stefan
>>>>
>>> I have a worry about what glib says:
>>>
>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
>>> using the mainloop must either exec() or exit() from the child without
>>> returning to the mainloop."
>>
>> This is fine; the child just writes out the memory pages and exits.
>> It never returns to the glib mainloop.
>>
>>> There is another way to do it: intercept the writes in kvm.ko (or other
>>> kernel code). Since the key is intercepting memory changes, which we
>>> can already do in userspace in TCG mode, we could add the missing part
>>> in KVM mode. Another benefit of this approach is that memory usage can
>>> be controlled: for example, use ioctl() to set up a fixed-size buffer
>>> in which the kernel code keeps the intercepted write data, which avoids
>>> frequently switching back to userspace QEMU code.
>>> When the buffer is full, control returns to userspace QEMU code, which
>>> saves the data to disk. I haven't checked exactly how Intel guest mode
>>> handles page faults, so I can't estimate the performance cost of
>>> switching between guest mode and root mode, but it should not be worse
>>> than fork().
>>
>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>> require kernel changes. A kvm.ko kernel change also won't be supported
>> on existing KVM hosts. These are big drawbacks, and the kernel approach
>> would need to be significantly better than plain old fork(2) to make it
>> worthwhile.
>>
>> Stefan
>>
> I think the advantage is that memory usage is predictable, so a memory
> usage peak can be avoided by always saving the changed pages first.
> fork() does not know which pages have changed. I am not sure whether
> this would be a serious issue when the server's memory is heavily
> consumed, for example, a 24G host running two 11G guests to provide
> powerful virtual servers.

Memory usage is predictable, but guest uptime is unpredictable because
the guest waits until memory has been written out. This defeats the
point of "live" savevm: the guest may be stalled arbitrarily.

The fork child can minimize the chance of out-of-memory by using
madvise(MADV_DONTNEED) after pages have been written out.

The way fork handles memory overcommit on Linux is configurable, but I
guess in a situation where memory runs out the Out-of-Memory Killer
will kill a process (probably QEMU, since it is hogging so much memory).

The risk of OOM can be avoided by running the traditional vmsave, which
stops the guest, instead of using "live" vmsave. The other option is to
live migrate to a file, but the disadvantage there is that you cannot
choose exactly when the state is saved; it happens sometime after live
migration is initiated.

There are trade-offs with all the approaches; it depends on what is
most important to you.

Stefan
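
Below is a minimal, self-contained sketch of the fork(2) plus
madvise(MADV_DONTNEED) approach discussed in this thread. It is not QEMU
code: the anonymous mapping stands in for guest RAM, and
save_ram_chunked() and the snapshot.bin path are made-up names for
illustration. The parent keeps running (dirtying memory) while the child
writes out its copy-on-write view and drops already-saved pages with
MADV_DONTNEED so that the parent's later writes no longer force page
copies; the child then exits without ever returning to a mainloop.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RAM_SIZE   (64UL * 1024 * 1024)   /* stand-in for guest RAM */
#define CHUNK_SIZE (4UL * 1024 * 1024)    /* write + discard granularity */

/* Child-side helper (hypothetical name): write RAM to a file in chunks,
 * discarding each chunk from the child's address space once saved. */
static int save_ram_chunked(void *ram, size_t size, const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    for (size_t off = 0; off < size; off += CHUNK_SIZE) {
        size_t len = size - off < CHUNK_SIZE ? size - off : CHUNK_SIZE;
        char *p = (char *)ram + off;
        size_t written = 0;
        while (written < len) {
            ssize_t n = write(fd, p + written, len - written);
            if (n < 0) {
                close(fd);
                return -1;
            }
            written += (size_t)n;
        }
        /* Drop the child's reference to pages already saved so the
         * parent's subsequent writes no longer trigger COW copies. */
        madvise(p, len, MADV_DONTNEED);
    }
    return close(fd);
}

int main(void)
{
    /* An anonymous private mapping plays the role of guest RAM. */
    void *ram = mmap(NULL, RAM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ram == MAP_FAILED) {
        return 1;
    }
    memset(ram, 0xab, RAM_SIZE);          /* pretend the guest dirtied RAM */

    pid_t pid = fork();                   /* the guest pauses only for this call */
    if (pid < 0) {
        return 1;
    }
    if (pid == 0) {
        /* Child: sees a consistent COW snapshot of RAM as of the fork(). */
        int ret = save_ram_chunked(ram, RAM_SIZE, "snapshot.bin");
        _exit(ret == 0 ? 0 : 1);          /* never return to the mainloop */
    }

    /* Parent: the "guest" keeps running and dirtying memory immediately. */
    memset(ram, 0xcd, RAM_SIZE / 2);
    waitpid(pid, NULL, 0);                /* a real implementation would not block here */
    munmap(ram, RAM_SIZE);
    return 0;
}

In QEMU itself the device save functions mentioned above would still
serialize device state through the usual savevm machinery, and the
child's output would feed that stream rather than a flat file; the
sketch only illustrates the RAM side of the idea.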