From mboxrd@z Thu Jan 1 00:00:00 1970 From: Chijianchun Subject: Are there plans to achieve ram live Snapshot feature? Date: Fri, 9 Aug 2013 10:20:49 +0000 Message-ID: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_" To: "aliguori@us.ibm.com" , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "avi@redhat.com" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" Return-path: Content-Language: zh-CN List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org Sender: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org List-Id: kvm.vger.kernel.org --_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. Are there plans to achieve ram live Snapshot feature? in my mind, Snapshots can not occupy additional too much memory, So when the memory needs to be changed, the old memory page is needed to flush to the file first. But flushing to file is too slower than memory, and when flushing, the vcpu or VM is need to be paused until finished flushing, so pause...resume...pause...resume............., more and more slower. Is this idea feasible? Are there any other thoughts? --_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_ Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable


--_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Paolo Bonzini Subject: Re: Are there plans to achieve ram live Snapshot feature? Date: Fri, 09 Aug 2013 17:38:12 +0200 Message-ID: <52050CE4.6000306@redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: "aliguori@us.ibm.com" , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "avi@redhat.com" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" To: Chijianchun Return-path: Received: from mail-ea0-f176.google.com ([209.85.215.176]:35568 "EHLO mail-ea0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964963Ab3HIPiq (ORCPT ); Fri, 9 Aug 2013 11:38:46 -0400 Received: by mail-ea0-f176.google.com with SMTP id q16so2120355ead.35 for ; Fri, 09 Aug 2013 08:38:45 -0700 (PDT) In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Sender: kvm-owner@vger.kernel.org List-ID: On 09/08/2013 12:20, Chijianchun wrote: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly > restrictions to users. > > Are there plans to achieve ram live Snapshot feature? > > in my mind, Snapshots can not occupy additional too much memory, So when > the memory needs to be changed, the old memory page is needed to flush > to the file first. But flushing to file is too slower than memory, and > when flushing, the vcpu or VM is need to be paused until finished > flushing, so pause...resume...pause...resume............., more and > more slower. > > Is this idea feasible? Are there any other thoughts? > This looks very similar to postcopy migration (you can Google it). The infrastructure for postcopy migration could be used for this as well. Paolo From mboxrd@z Thu Jan 1 00:00:00 1970 From: Anthony Liguori Subject: Re: Are there plans to achieve ram live Snapshot feature? 
Date: Fri, 09 Aug 2013 10:45:22 -0500 Message-ID: <877gfu3m25.fsf@codemonkey.ws> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii To: Chijianchun , "paul\@codesourcery.com" , "kvm\@vger.kernel.org" , "avi\@redhat.com" , "mtosatti\@redhat.com" , "qemu-devel\@nongnu.org" Return-path: Received: from e39.co.us.ibm.com ([32.97.110.160]:41693 "EHLO e39.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964975Ab3HIPpb (ORCPT ); Fri, 9 Aug 2013 11:45:31 -0400 Received: from /spool/local by e39.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Fri, 9 Aug 2013 09:45:31 -0600 Received: from d01relay06.pok.ibm.com (d01relay06.pok.ibm.com [9.56.227.116]) by d01dlp02.pok.ibm.com (Postfix) with ESMTP id C5FEC6E8040 for ; Fri, 9 Aug 2013 11:45:22 -0400 (EDT) Received: from d01av05.pok.ibm.com (d01av05.pok.ibm.com [9.56.224.195]) by d01relay06.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r79FjSXK32768028 for ; Fri, 9 Aug 2013 11:45:28 -0400 Received: from d01av05.pok.ibm.com (loopback [127.0.0.1]) by d01av05.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r79FjRrc027684 for ; Fri, 9 Aug 2013 11:45:27 -0400 In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Sender: kvm-owner@vger.kernel.org List-ID: Chijianchun writes: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. > > Are there plans to achieve ram live Snapshot feature? I think you mean a live version of the savevm command. You can approximate live migrating to a file, creating an external disk snapshot, then resuming the guest. Regards, Anthony Liguori > > in my mind, Snapshots can not occupy additional too much memory, So when the memory needs to be changed, the old memory page is needed to flush to the file first. 
But flushing to file is too slower than memory, and when flushing, the vcpu or VM is need to be paused until finished flushing, so pause...resume...pause...resume............., more and more slower. > > Is this idea feasible? Are there any other thoughts? From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Blake Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? Date: Fri, 09 Aug 2013 09:51:42 -0600 Message-ID: <5205100E.60007@redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <877gfu3m25.fsf@codemonkey.ws> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O" Cc: Chijianchun , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" To: Anthony Liguori Return-path: Received: from mx1.redhat.com ([209.132.183.28]:41928 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S967877Ab3HIPvt (ORCPT ); Fri, 9 Aug 2013 11:51:49 -0400 In-Reply-To: <877gfu3m25.fsf@codemonkey.ws> Sender: kvm-owner@vger.kernel.org List-ID: This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 08/09/2013 09:45 AM, Anthony Liguori wrote: > Chijianchun writes: > >> Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. >> >> Are there plans to achieve ram live Snapshot feature? > > I think you mean a live version of the savevm command. > > You can approximate live migrating to a file, creating an external disk > snapshot, then resuming the guest. And libvirt does just that, since libvirt 1.0.5, for its external RAM snapshots. The vcpu pause is a mere fraction of a second, so it is generally not noticeable as any guest downtime. 
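For readers who want to try the libvirt route Eric mentions, here is a rough sketch of the snapshot description libvirt takes for an external memory-plus-disk snapshot. It is in Python purely for illustration. The helper name external_snapshot_xml, the snapshot name and every file path are made up for this example, and the element layout is an assumption to check against the libvirt snapshot XML documentation for your version.

```python
import xml.etree.ElementTree as ET


def external_snapshot_xml(name, memory_file, disks):
    """Build a <domainsnapshot> description as a string.

    `disks` maps a guest disk target (for example "vda") to the new
    overlay file that should receive writes after the snapshot.
    All names and paths are hypothetical placeholders.
    """
    root = ET.Element("domainsnapshot")
    ET.SubElement(root, "name").text = name
    # Guest RAM and device state go to an external file.
    ET.SubElement(root, "memory", snapshot="external", file=memory_file)
    disks_el = ET.SubElement(root, "disks")
    for target, overlay in disks.items():
        # Each disk gets an external snapshot backed by a new overlay.
        disk = ET.SubElement(disks_el, "disk", name=target, snapshot="external")
        ET.SubElement(disk, "source", file=overlay)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(external_snapshot_xml(
        "snap1", "/tmp/snap1.mem", {"vda": "/tmp/snap1.qcow2"}))
```

The resulting XML would typically be passed to virsh snapshot-create with --live, or to the Python binding's snapshotCreateXML() with the VIR_DOMAIN_SNAPSHOT_CREATE_LIVE flag. Both are given here as pointers only; the sketch does not call libvirt.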
-- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Hajnoczi Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Mon, 12 Aug 2013 11:59:03 +0200 Message-ID: <20130812095903.GF29880@stefanha-thinkpad.redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "aliguori@us.ibm.com" , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "avi@redhat.com" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" , xiawenc@linux.vnet.ibm.com, fred.konrad@greensocs.com To: Chijianchun Return-path: Received: from mail-ea0-f176.google.com ([209.85.215.176]:59312 "EHLO mail-ea0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755920Ab3HLJ7H (ORCPT ); Mon, 12 Aug 2013 05:59:07 -0400 Received: by mail-ea0-f176.google.com with SMTP id q16so3268040ead.21 for ; Mon, 12 Aug 2013 02:59:05 -0700 (PDT) Content-Disposition: inline In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Sender: kvm-owner@vger.kernel.org List-ID: On Fri, Aug 09, 2013 at 10:20:49AM +0000, Chijianchun wrote: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. > > Are there plans to achieve ram live Snapshot feature? > > in my mind, Snapshots can not occupy additional too much memory, So when the memory needs to be changed, the old memory page is needed to flush to the file first. But flushing to file is too slower than memory, and when flushing, the vcpu or VM is need to be paused until finished flushing, so pause...resume...pause...resume............., more and more slower. > > Is this idea feasible? Are there any other thoughts? A few people have looked at live vmsave or guest RAM snapshots. The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to capture the state of guest RAM and then send it back to the parent process. The guest is only paused for a brief instant during fork(2) and can continue to run afterwards. 
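The fork(2) idea can be tried outside QEMU with a toy sketch like the one below. It is not QEMU code. It is Python, a bytearray stands in for guest RAM, and the names fork_snapshot and demo are invented for this illustration. It only shows the copy-on-write property the approach relies on: the child sees memory as it was at fork time, while the parent keeps modifying its own copy.

```python
import os
import tempfile


def fork_snapshot(ram, path):
    """Fork a child that writes `ram`, as it was at fork time, to `path`.

    Returns the child's PID so the caller can reap it later. The parent
    is only held up for the duration of the fork() call itself.
    """
    pid = os.fork()
    if pid == 0:
        # Child: owns a copy-on-write view of memory frozen at fork time.
        # It only writes the buffer out and exits; it never returns to
        # the caller (or to any event loop the parent may be running).
        status = 1
        try:
            with open(path, "wb") as f:
                f.write(ram)
            status = 0
        finally:
            os._exit(status)
    return pid


def demo():
    ram = bytearray(b"A" * 4096)  # hypothetical stand-in for guest RAM
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ram.img")
        pid = fork_snapshot(ram, path)
        ram[:] = b"B" * 4096  # the "guest" keeps dirtying memory
        _, status = os.waitpid(pid, 0)  # reap the writer
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            raise RuntimeError("snapshot writer failed")
        with open(path, "rb") as f:
            saved = f.read()
    return saved, bytes(ram)


if __name__ == "__main__":
    saved, current = demo()
    print(saved[:4], current[:4])
```

In this toy the file always holds the pre-fork contents, whatever the parent writes afterwards. A real implementation would also have to save device state and cope with memory pressure, neither of which the sketch attempts.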
The child process is a simple loop that sends the contents of guest RAM back to the parent process over a pipe or writes the memory pages to the save file on disk. It performs no logic besides writing out guest RAM. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alex Bligh Subject: Re: Are there plans to achieve ram live Snapshot feature? Date: Mon, 12 Aug 2013 11:26:59 +0100 Message-ID: <232DEBC1058FA4A5BD76D16A@Ximines.local> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> Reply-To: Alex Bligh Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Cc: aliguori@us.ibm.com, kvm@vger.kernel.org, mtosatti@redhat.com, qemu-devel@nongnu.org, avi@redhat.com, paul@codesourcery.com, Alex Bligh , xiawenc@linux.vnet.ibm.com, fred.konrad@greensocs.com To: Stefan Hajnoczi , Chijianchun Return-path: In-Reply-To: <20130812095903.GF29880@stefanha-thinkpad.redhat.com> Content-Disposition: inline List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org Sender: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org List-Id: kvm.vger.kernel.org --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote: > The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to > capture the state of guest RAM and then send it back to the parent > process. The guest is only paused for a brief instant during fork(2) > and can continue to run afterwards. How would you capture the state of emulated hardware which might not be in the guest RAM? -- Alex Bligh From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Hajnoczi Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Mon, 12 Aug 2013 13:33:35 +0200 Message-ID: References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Cc: Chijianchun , Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Paul Brook , fred.konrad@greensocs.com, Wayne Xia , Avi Kivity To: Alex Bligh Return-path: Received: from mail-qc0-f173.google.com ([209.85.216.173]:62692 "EHLO mail-qc0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755936Ab3HLLdg (ORCPT ); Mon, 12 Aug 2013 07:33:36 -0400 Received: by mail-qc0-f173.google.com with SMTP id z10so3294213qcx.4 for ; Mon, 12 Aug 2013 04:33:35 -0700 (PDT) In-Reply-To: <232DEBC1058FA4A5BD76D16A@Ximines.local> Sender: kvm-owner@vger.kernel.org List-ID: On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: > --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi > wrote: > >> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >> capture the state of guest RAM and then send it back to the parent >> process. The guest is only paused for a brief instant during fork(2) >> and can continue to run afterwards. > > > How would you capture the state of emulated hardware which might not > be in the guest RAM? Exactly the same way vmsave works today. It calls the device's save functions which serialize state to file. The difference between today's vmsave and the fork(2) approach is that QEMU does not need to wait for guest RAM to be written to file before resuming the guest. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wenchao Xia Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Tue, 13 Aug 2013 10:53:23 +0800 Message-ID: <52099FA3.6010207@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Alex Bligh , Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Paul Brook , fred.konrad@greensocs.com To: Stefan Hajnoczi Return-path: Received: from e28smtp09.in.ibm.com ([122.248.162.9]:35910 "EHLO e28smtp09.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757196Ab3HMCyp (ORCPT ); Mon, 12 Aug 2013 22:54:45 -0400 Received: from /spool/local by e28smtp09.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 13 Aug 2013 08:19:00 +0530 Received: from d28relay01.in.ibm.com (d28relay01.in.ibm.com [9.184.220.58]) by d28dlp03.in.ibm.com (Postfix) with ESMTP id A76A01258051 for ; Tue, 13 Aug 2013 08:24:19 +0530 (IST) Received: from d28av02.in.ibm.com (d28av02.in.ibm.com [9.184.220.64]) by d28relay01.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7D2txUh28704876 for ; Tue, 13 Aug 2013 08:25:59 +0530 Received: from d28av02.in.ibm.com (localhost [127.0.0.1]) by d28av02.in.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7D2sc6Q004019 for ; Tue, 13 Aug 2013 08:24:39 +0530 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On 2013-8-12 19:33, Stefan Hajnoczi wrote: > On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >> wrote: >> >>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>> capture the state of guest RAM and then send it back to the parent >>> process. The guest is only paused for a brief instant during fork(2) >>> and can continue to run afterwards. 
>> >> >> How would you capture the state of emulated hardware which might not >> be in the guest RAM? > > Exactly the same way vmsave works today. It calls the device's save > functions which serialize state to file. > > The difference between today's vmsave and the fork(2) approach is that > QEMU does not need to wait for guest RAM to be written to file before > resuming the guest. > > Stefan > I have a worry about what glib says: "On Unix, the GLib mainloop is incompatible with fork(). Any program using the mainloop must either exec() or exit() from the child without returning to the mainloop. " There is another way to do it: intercept the write in kvm.ko(or other kernel code). Since the key is intercept the memory change, we can do it in userspace in TCG mode, thus we can add the missing part in KVM mode. Another benefit of this way is: the used memory can be controlled. For example, with ioctl(), set a buffer of a fixed size which keeps the intercepted write data by kernel code, which can avoid frequently switch back to user space qemu code. when it is full always return back to userspace's qemu code, let qemu code save the data into disk. I haven't check the exactly behavior of Intel guest mode about how to handle page fault, so can't estimate the performance caused by switching of guest mode and root mode, but it should not be worse than fork(). -- Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Hajnoczi Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Tue, 13 Aug 2013 10:21:19 +0200 Message-ID: References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Alex Bligh , Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Paul Brook , fred.konrad@greensocs.com To: Wenchao Xia Return-path: Received: from mail-qa0-f53.google.com ([209.85.216.53]:36918 "EHLO mail-qa0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752584Ab3HMIVU convert rfc822-to-8bit (ORCPT ); Tue, 13 Aug 2013 04:21:20 -0400 Received: by mail-qa0-f53.google.com with SMTP id hu14so171444qab.12 for ; Tue, 13 Aug 2013 01:21:19 -0700 (PDT) In-Reply-To: <52099FA3.6010207@linux.vnet.ibm.com> Sender: kvm-owner@vger.kernel.org List-ID: On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote: > On 2013-8-12 19:33, Stefan Hajnoczi wrote: > >> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>> >>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>> wrote: >>> >>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>> capture the state of guest RAM and then send it back to the parent >>>> process. The guest is only paused for a brief instant during fork(2) >>>> and can continue to run afterwards. >>> >>> >>> >>> How would you capture the state of emulated hardware which might not >>> be in the guest RAM? >> >> >> Exactly the same way vmsave works today. It calls the device's save >> functions which serialize state to file. >> >> The difference between today's vmsave and the fork(2) approach is that >> QEMU does not need to wait for guest RAM to be written to file before >> resuming the guest. 
>> >> Stefan >> > I have a worry about what glib says: > > "On Unix, the GLib mainloop is incompatible with fork(). Any program > using the mainloop must either exec() or exit() from the child without > returning to the mainloop. " This is fine, the child just writes out the memory pages and exits. It never returns to the glib mainloop. > There is another way to do it: intercept the write in kvm.ko(or other > kernel code). Since the key is intercept the memory change, we can do > it in userspace in TCG mode, thus we can add the missing part in KVM > mode. Another benefit of this way is: the used memory can be > controlled. For example, with ioctl(), set a buffer of a fixed size > which keeps the intercepted write data by kernel code, which can avoid > frequently switch back to user space qemu code. when it is full always > return back to userspace's qemu code, let qemu code save the data into > disk. I haven't check the exactly behavior of Intel guest mode about > how to handle page fault, so can't estimate the performance caused by > switching of guest mode and root mode, but it should not be worse than > fork(). The fork(2) approach is portable, covers both KVM and TCG, and doesn't require kernel changes. A kvm.ko kernel change also won't be supported on existing KVM hosts. These are big drawbacks and the kernel approach would need to be significantly better than plain old fork(2) to make it worthwhile. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wenchao Xia Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Wed, 14 Aug 2013 09:54:21 +0800 Message-ID: <520AE34D.8000002@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com, Paul Brook To: Stefan Hajnoczi Return-path: Received: from e23smtp03.au.ibm.com ([202.81.31.145]:60906 "EHLO e23smtp03.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758814Ab3HNByk (ORCPT ); Tue, 13 Aug 2013 21:54:40 -0400 Received: from /spool/local by e23smtp03.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 14 Aug 2013 11:43:48 +1000 Received: from d23relay05.au.ibm.com (d23relay05.au.ibm.com [9.190.235.152]) by d23dlp03.au.ibm.com (Postfix) with ESMTP id F07E63578056 for ; Wed, 14 Aug 2013 11:54:32 +1000 (EST) Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96]) by d23relay05.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7E1cZWJ63701224 for ; Wed, 14 Aug 2013 11:38:35 +1000 Received: from d23av01.au.ibm.com (localhost [127.0.0.1]) by d23av01.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7E1sVjV026338 for ; Wed, 14 Aug 2013 11:54:32 +1000 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On 2013-8-13 16:21, Stefan Hajnoczi wrote: > On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote: >> On 2013-8-12 19:33, Stefan Hajnoczi wrote: >> >>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>>> >>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>>> wrote: >>>> >>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>>> capture the state 
of guest RAM and then send it back to the parent >>>>> process. The guest is only paused for a brief instant during fork(2) >>>>> and can continue to run afterwards. >>>> >>>> >>>> >>>> How would you capture the state of emulated hardware which might not >>>> be in the guest RAM? >>> >>> >>> Exactly the same way vmsave works today. It calls the device's save >>> functions which serialize state to file. >>> >>> The difference between today's vmsave and the fork(2) approach is that >>> QEMU does not need to wait for guest RAM to be written to file before >>> resuming the guest. >>> >>> Stefan >>> >> I have a worry about what glib says: >> >> "On Unix, the GLib mainloop is incompatible with fork(). Any program >> using the mainloop must either exec() or exit() from the child without >> returning to the mainloop. " > > This is fine, the child just writes out the memory pages and exits. > It never returns to the glib mainloop. > >> There is another way to do it: intercept the write in kvm.ko(or other >> kernel code). Since the key is intercept the memory change, we can do >> it in userspace in TCG mode, thus we can add the missing part in KVM >> mode. Another benefit of this way is: the used memory can be >> controlled. For example, with ioctl(), set a buffer of a fixed size >> which keeps the intercepted write data by kernel code, which can avoid >> frequently switch back to user space qemu code. when it is full always >> return back to userspace's qemu code, let qemu code save the data into >> disk. I haven't check the exactly behavior of Intel guest mode about >> how to handle page fault, so can't estimate the performance caused by >> switching of guest mode and root mode, but it should not be worse than >> fork(). > > The fork(2) approach is portable, covers both KVM and TCG, and doesn't > require kernel changes. A kvm.ko kernel change also won't be > supported on existing KVM hosts. 
These are big drawbacks and the > kernel approach would need to be significantly better than plain old > fork(2) to make it worthwhile. > > Stefan > I think advantage is memory usage is predictable, so memory usage peak can be avoided, by always save the changed pages first. fork() does not know which pages are changed. I am not sure if this would be a serious issue when server's memory is consumed much, for example, 24G host emulate 11G*2 guest to provide powerful virtual server. -- Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Hajnoczi Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? Date: Wed, 14 Aug 2013 09:53:56 +0200 Message-ID: References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com, Paul Brook To: Wenchao Xia Return-path: Received: from mail-qc0-f175.google.com ([209.85.216.175]:64388 "EHLO mail-qc0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757923Ab3HNHx5 convert rfc822-to-8bit (ORCPT ); Wed, 14 Aug 2013 03:53:57 -0400 Received: by mail-qc0-f175.google.com with SMTP id s11so4652415qcv.34 for ; Wed, 14 Aug 2013 00:53:56 -0700 (PDT) In-Reply-To: <520AE34D.8000002@linux.vnet.ibm.com> Sender: kvm-owner@vger.kernel.org List-ID: On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: > On 2013-8-13 16:21, Stefan Hajnoczi wrote: > >> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia >> wrote: >>> >>> On 2013-8-12 19:33, Stefan Hajnoczi wrote: >>> >>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>>>> >>>>> >>>>> --On 12 
August 2013 11:59:03 +0200 Stefan Hajnoczi >>>>> wrote: >>>>> >>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>>>> capture the state of guest RAM and then send it back to the parent >>>>>> process. The guest is only paused for a brief instant during fork(2) >>>>>> and can continue to run afterwards. >>>>> >>>>> >>>>> >>>>> >>>>> How would you capture the state of emulated hardware which might not >>>>> be in the guest RAM? >>>> >>>> >>>> >>>> Exactly the same way vmsave works today. It calls the device's save >>>> functions which serialize state to file. >>>> >>>> The difference between today's vmsave and the fork(2) approach is that >>>> QEMU does not need to wait for guest RAM to be written to file before >>>> resuming the guest. >>>> >>>> Stefan >>>> >>> I have a worry about what glib says: >>> >>> "On Unix, the GLib mainloop is incompatible with fork(). Any program >>> using the mainloop must either exec() or exit() from the child without >>> returning to the mainloop. " >> >> >> This is fine, the child just writes out the memory pages and exits. >> It never returns to the glib mainloop. >> >>> There is another way to do it: intercept the write in kvm.ko(or other >>> kernel code). Since the key is intercept the memory change, we can do >>> it in userspace in TCG mode, thus we can add the missing part in KVM >>> mode. Another benefit of this way is: the used memory can be >>> controlled. For example, with ioctl(), set a buffer of a fixed size >>> which keeps the intercepted write data by kernel code, which can avoid >>> frequently switch back to user space qemu code. when it is full always >>> return back to userspace's qemu code, let qemu code save the data into >>> disk. I haven't check the exactly behavior of Intel guest mode about >>> how to handle page fault, so can't estimate the performance caused by >>> switching of guest mode and root mode, but it should not be worse than >>> fork(). 
>> >> >> The fork(2) approach is portable, covers both KVM and TCG, and doesn't >> require kernel changes. A kvm.ko kernel change also won't be >> supported on existing KVM hosts. These are big drawbacks and the >> kernel approach would need to be significantly better than plain old >> fork(2) to make it worthwhile. >> >> Stefan >> > I think advantage is memory usage is predictable, so memory usage > peak can be avoided, by always save the changed pages first. fork() > does not know which pages are changed. I am not sure if this would > be a serious issue when server's memory is consumed much, for example, > 24G host emulate 11G*2 guest to provide powerful virtual server. Memory usage is predictable but guest uptime is unpredictable because it waits until memory is written out. This defeats the point of "live" savevm. The guest may be stalled arbitrarily. The fork child can minimize the chance of out-of-memory by using madvise(MADV_DONTNEED) after pages have been written out. The way fork handles memory overcommit on Linux is configurable, but I guess in a situation where memory runs out the Out-of-Memory Killer will kill a process (probably QEMU since it is hogging so much memory). The risk of OOM can be avoided by running the traditional savevm which stops the guest instead of using "live" savevm. The other option is to live migrate to file but the disadvantage there is that you cannot choose exactly when the state is saved, it happens sometime after live migration is initiated. There are trade-offs with all the approaches, it depends on what is most important to you. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alex Bligh Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Wed, 14 Aug 2013 09:13:35 +0100 Message-ID: <7BB8F666-B20F-4651-B0B9-C40DBB2282B5@alex.org.uk> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: Alex Bligh , Wenchao Xia , Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , fred.konrad@greensocs.com To: Stefan Hajnoczi Return-path: Received: from mail.avalus.com ([89.16.176.221]:37813 "EHLO mail.avalus.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759379Ab3HNINs (ORCPT ); Wed, 14 Aug 2013 04:13:48 -0400 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On 14 Aug 2013, at 08:53, Stefan Hajnoczi wrote: > The fork child can minimize the chance of out-of-memory by using > madvise(MADV_DONTNEED) after pages have been written out. This may also be helpful (last clause) before starting writing. MADV_SEQUENTIAL Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.) -- Alex Bligh From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wenchao Xia Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Thu, 15 Aug 2013 10:26:36 +0800 Message-ID: <520C3C5C.5000106@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com To: Stefan Hajnoczi Return-path: Received: from e28smtp06.in.ibm.com ([122.248.162.6]:43301 "EHLO e28smtp06.in.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1759861Ab3HOC1O (ORCPT ); Wed, 14 Aug 2013 22:27:14 -0400 Received: from /spool/local by e28smtp06.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 15 Aug 2013 07:47:53 +0530 Received: from d28relay03.in.ibm.com (d28relay03.in.ibm.com [9.184.220.60]) by d28dlp03.in.ibm.com (Postfix) with ESMTP id 7247C1258043 for ; Thu, 15 Aug 2013 07:56:47 +0530 (IST) Received: from d28av01.in.ibm.com (d28av01.in.ibm.com [9.184.220.63]) by d28relay03.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7F2SQkW44499124 for ; Thu, 15 Aug 2013 07:58:26 +0530 Received: from d28av01.in.ibm.com (localhost [127.0.0.1]) by d28av01.in.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7F2R47l008989 for ; Thu, 15 Aug 2013 07:57:05 +0530 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: =D3=DA 2013-8-14 15:53, Stefan Hajnoczi =D0=B4=B5=C0: > On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: >> =D3=DA 2013-8-13 16:21, Stefan Hajnoczi =D0=B4=B5=C0: >> >>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia >>> wrote: >>>> >>>> =D3=DA 2013-8-12 19:33, Stefan Hajnoczi =D0=B4=B5=C0: >>>> >>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh w= rote: >>>>>> >>>>>> >>>>>> --On 12 August 2013 
11:59:03 +0200 Stefan Hajnoczi >>>>>> wrote: >>>>>> >>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(= 2) to >>>>>>> capture the state of guest RAM and then send it back to the par= ent >>>>>>> process. The guest is only paused for a brief instant during f= ork(2) >>>>>>> and can continue to run afterwards. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> How would you capture the state of emulated hardware which might= not >>>>>> be in the guest RAM? >>>>> >>>>> >>>>> >>>>> Exactly the same way vmsave works today. It calls the device's s= ave >>>>> functions which serialize state to file. >>>>> >>>>> The difference between today's vmsave and the fork(2) approach is= that >>>>> QEMU does not need to wait for guest RAM to be written to file be= fore >>>>> resuming the guest. >>>>> >>>>> Stefan >>>>> >>>> I have a worry about what glib says: >>>> >>>> "On Unix, the GLib mainloop is incompatible with fork(). Any progr= am >>>> using the mainloop must either exec() or exit() from the child wit= hout >>>> returning to the mainloop. " >>> >>> >>> This is fine, the child just writes out the memory pages and exits. >>> It never returns to the glib mainloop. >>> >>>> There is another way to do it: intercept the write in kvm.ko(o= r other >>>> kernel code). Since the key is intercept the memory change, we can= do >>>> it in userspace in TCG mode, thus we can add the missing part in K= VM >>>> mode. Another benefit of this way is: the used memory can be >>>> controlled. For example, with ioctl(), set a buffer of a fixed siz= e >>>> which keeps the intercepted write data by kernel code, which can a= void >>>> frequently switch back to user space qemu code. when it is full al= ways >>>> return back to userspace's qemu code, let qemu code save the data = into >>>> disk. 
>>>> I haven't check the exactly behavior of Intel guest mode about
>>>> how to handle page fault, so can't estimate the performance caused by
>>>> switching of guest mode and root mode, but it should not be worse than
>>>> fork().
>>>
>>>
>>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>>> require kernel changes.  A kvm.ko kernel change also won't be
>>> supported on existing KVM hosts.  These are big drawbacks and the
>>> kernel approach would need to be significantly better than plain old
>>> fork(2) to make it worthwhile.
>>>
>>> Stefan
>>>
>>    I think advantage is memory usage is predictable, so memory usage
>> peak can be avoided, by always save the changed pages first. fork()
>> does not know which pages are changed. I am not sure if this would
>> be a serious issue when server's memory is consumed much, for example,
>> 24G host emulate 11G*2 guest to provide powerful virtual server.
>
> Memory usage is predictable but guest uptime is unpredictable because
> it waits until memory is written out.  This defeats the point of
> "live" savevm.  The guest may be stalled arbitrarily.
>
   I think it is adjustable. It does not differ much from fork(),
except that it gives more precise control over the changed pages.
   The kernel intercepts the change and stores the changed page in
another page, similar to fork(). When the userspace qemu code executes,
it saves some pages to disk. The buffer acts as a kind of lubricant:
when Buffer = MAX, it is equivalent to fork() and the guest runs more
lively; when Buffer = 0, the guest runs less lively. I think it allows
the user to find a good balance point with a parameter.
   It is harder to implement; I just want to show the idea.

> The fork child can minimize the chance of out-of-memory by using
> madvise(MADV_DONTNEED) after pages have been written out.
   There seems to be no way to make sure the written-out pages are the
changed pages, so there is a good chance that a written page is an
unchanged one that is still used by the other qemu process.
>=20 > The way fork handles memory overcommit on Linux is configurable, but = I > guess in a situation where memory runs out the Out-of-Memory Killer > will kill a process (probably QEMU since it is hogging so much > memory). >=20 > The risk of OOM can be avoided by running the traditional vmsave whic= h > stops the guest instead of using "live" vmsave. >=20 > The other option is to live migrate to file but the disadvantage ther= e > is that you cannot choose exactly when the state it saved, it happens > sometime after live migration is initiated. >=20 > There are trade-offs with all the approaches, it depends on what is > most important to you. >=20 > Stefan >=20 --=20 Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Hajnoczi Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? Date: Thu, 15 Aug 2013 09:49:19 +0200 Message-ID: <20130815074919.GA22521@stefanha-thinkpad.redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> <520C3C5C.5000106@linux.vnet.ibm.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com To: Wenchao Xia Return-path: Received: from mail-ee0-f54.google.com ([74.125.83.54]:41715 "EHLO mail-ee0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755469Ab3HOHtX (ORCPT ); Thu, 15 Aug 2013 03:49:23 -0400 Received: by mail-ee0-f54.google.com with SMTP id e53so196100eek.13 for ; Thu, 15 Aug 2013 00:49:22 -0700 (PDT) Content-Disposition: inline In-Reply-To: <520C3C5C.5000106@linux.vnet.ibm.com> Sender: kvm-owner@vger.kernel.org List-ID: On Thu, Aug 15, 2013 at 10:26:36AM +0800, 
Wenchao Xia wrote: > =E4=BA=8E 2013-8-14 15:53, Stefan Hajnoczi =E5=86=99=E9=81=93: > > On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: > >> =E4=BA=8E 2013-8-13 16:21, Stefan Hajnoczi =E5=86=99=E9=81=93: > >> > >>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia > >>> wrote: > >>>> > >>>> =E4=BA=8E 2013-8-12 19:33, Stefan Hajnoczi =E5=86=99=E9=81=93: > >>>> > >>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh = wrote: > >>>>>> > >>>>>> > >>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi > >>>>>> wrote: > >>>>>> > >>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses for= k(2) to > >>>>>>> capture the state of guest RAM and then send it back to the p= arent > >>>>>>> process. The guest is only paused for a brief instant during= fork(2) > >>>>>>> and can continue to run afterwards. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> How would you capture the state of emulated hardware which mig= ht not > >>>>>> be in the guest RAM? > >>>>> > >>>>> > >>>>> > >>>>> Exactly the same way vmsave works today. It calls the device's= save > >>>>> functions which serialize state to file. > >>>>> > >>>>> The difference between today's vmsave and the fork(2) approach = is that > >>>>> QEMU does not need to wait for guest RAM to be written to file = before > >>>>> resuming the guest. > >>>>> > >>>>> Stefan > >>>>> > >>>> I have a worry about what glib says: > >>>> > >>>> "On Unix, the GLib mainloop is incompatible with fork(). Any pro= gram > >>>> using the mainloop must either exec() or exit() from the child w= ithout > >>>> returning to the mainloop. " > >>> > >>> > >>> This is fine, the child just writes out the memory pages and exit= s. > >>> It never returns to the glib mainloop. > >>> > >>>> There is another way to do it: intercept the write in kvm.ko= (or other > >>>> kernel code). Since the key is intercept the memory change, we c= an do > >>>> it in userspace in TCG mode, thus we can add the missing part in= KVM > >>>> mode. 
Another benefit of this way is: the used memory can be > >>>> controlled. For example, with ioctl(), set a buffer of a fixed s= ize > >>>> which keeps the intercepted write data by kernel code, which can= avoid > >>>> frequently switch back to user space qemu code. when it is full = always > >>>> return back to userspace's qemu code, let qemu code save the dat= a into > >>>> disk. I haven't check the exactly behavior of Intel guest mode a= bout > >>>> how to handle page fault, so can't estimate the performance caus= ed by > >>>> switching of guest mode and root mode, but it should not be wors= e than > >>>> fork(). > >>> > >>> > >>> The fork(2) approach is portable, covers both KVM and TCG, and do= esn't > >>> require kernel changes. A kvm.ko kernel change also won't be > >>> supported on existing KVM hosts. These are big drawbacks and the > >>> kernel approach would need to be significantly better than plain = old > >>> fork(2) to make it worthwhile. > >>> > >>> Stefan > >>> > >> I think advantage is memory usage is predictable, so memory usa= ge > >> peak can be avoided, by always save the changed pages first. fork(= ) > >> does not know which pages are changed. I am not sure if this would > >> be a serious issue when server's memory is consumed much, for exam= ple, > >> 24G host emulate 11G*2 guest to provide powerful virtual server. > >=20 > > Memory usage is predictable but guest uptime is unpredictable becau= se > > it waits until memory is written out. This defeats the point of > > "live" savevm. The guest may be stalled arbitrarily. > >=20 > I think it is adjustable. There is no much difference with > fork(), except get more precise control about the changed pages. > Kernel intercept the change, and stores the changed page in another > page, similar to fork(). When userspace qemu code execute, save some > pages to disk. Buffer can be used like some lubricant. When Buffer =3D > MAX, it equals to fork(), guest runs more lively. 
When Buffer =3D 0, > guest runs less lively. I think it allows user to find a good balance > point with a parameter. > It is harder to implement, just want to show the idea. You are right. You could set a bigger buffer size to increase guest uptime. > > The fork child can minimize the chance of out-of-memory by using > > madvise(MADV_DONTNEED) after pages have been written out. > It seems no way to make sure the written out page is the changed > pages, so it have a good chance the written one is the unchanged and > still used by the other qemu process. The KVM dirty log tells you which pages were touched. The fork child process could give priority to the pages which have been touched by the guest. They must be written out and marked madvise(MADV_DONTNEED) as soon as possible. I haven't looked at the vmsave data format yet to see if memory pages can be saved in random order, but this might work. It reduces the likelihood of copy-on-write memory growth. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wenchao Xia Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
Date: Thu, 15 Aug 2013 16:03:47 +0800 Message-ID: <520C8B63.2060304@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> <520C3C5C.5000106@linux.vnet.ibm.com> <20130815074919.GA22521@stefanha-thinkpad.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com To: Stefan Hajnoczi Return-path: Received: from e23smtp06.au.ibm.com ([202.81.31.148]:36059 "EHLO e23smtp06.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752086Ab3HOIE1 (ORCPT ); Thu, 15 Aug 2013 04:04:27 -0400 Received: from /spool/local by e23smtp06.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Thu, 15 Aug 2013 17:56:04 +1000 Received: from d23relay03.au.ibm.com (d23relay03.au.ibm.com [9.190.235.21]) by d23dlp03.au.ibm.com (Postfix) with ESMTP id EBE663578052 for ; Thu, 15 Aug 2013 18:04:16 +1000 (EST) Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96]) by d23relay03.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7F846H963242344 for ; Thu, 15 Aug 2013 18:04:06 +1000 Received: from d23av01.au.ibm.com (localhost [127.0.0.1]) by d23av01.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7F84Fne016112 for ; Thu, 15 Aug 2013 18:04:16 +1000 In-Reply-To: <20130815074919.GA22521@stefanha-thinkpad.redhat.com> Sender: kvm-owner@vger.kernel.org List-ID: =E4=BA=8E 2013-8-15 15:49, Stefan Hajnoczi =E5=86=99=E9=81=93: > On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote: >> =E4=BA=8E 2013-8-14 15:53, Stefan Hajnoczi =E5=86=99=E9=81=93: >>> On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: >>>> =E4=BA=8E 2013-8-13 16:21, Stefan Hajnoczi =E5=86=99=E9=81=93: >>>> >>>>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia >>>>> wrote: >>>>>> >>>>>> =E4=BA=8E 2013-8-12 19:33, Stefan Hajnoczi =E5=86=99=E9=81=93: >>>>>> >>>>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh = wrote: >>>>>>>> >>>>>>>> >>>>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>>>>>>> wrote: >>>>>>>> >>>>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses for= k(2) to >>>>>>>>> capture the state of guest RAM and then send it back to the p= arent >>>>>>>>> process. The guest is only paused for a brief instant during= fork(2) >>>>>>>>> and can continue to run afterwards. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> How would you capture the state of emulated hardware which mig= ht not >>>>>>>> be in the guest RAM? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Exactly the same way vmsave works today. It calls the device's= save >>>>>>> functions which serialize state to file. 
>>>>>>> >>>>>>> The difference between today's vmsave and the fork(2) approach = is that >>>>>>> QEMU does not need to wait for guest RAM to be written to file = before >>>>>>> resuming the guest. >>>>>>> >>>>>>> Stefan >>>>>>> >>>>>> I have a worry about what glib says: >>>>>> >>>>>> "On Unix, the GLib mainloop is incompatible with fork(). Any pro= gram >>>>>> using the mainloop must either exec() or exit() from the child w= ithout >>>>>> returning to the mainloop. " >>>>> >>>>> >>>>> This is fine, the child just writes out the memory pages and exit= s. >>>>> It never returns to the glib mainloop. >>>>> >>>>>> There is another way to do it: intercept the write in kvm.k= o(or other >>>>>> kernel code). Since the key is intercept the memory change, we c= an do >>>>>> it in userspace in TCG mode, thus we can add the missing part in= KVM >>>>>> mode. Another benefit of this way is: the used memory can be >>>>>> controlled. For example, with ioctl(), set a buffer of a fixed s= ize >>>>>> which keeps the intercepted write data by kernel code, which can= avoid >>>>>> frequently switch back to user space qemu code. when it is full = always >>>>>> return back to userspace's qemu code, let qemu code save the dat= a into >>>>>> disk. I haven't check the exactly behavior of Intel guest mode a= bout >>>>>> how to handle page fault, so can't estimate the performance caus= ed by >>>>>> switching of guest mode and root mode, but it should not be wors= e than >>>>>> fork(). >>>>> >>>>> >>>>> The fork(2) approach is portable, covers both KVM and TCG, and do= esn't >>>>> require kernel changes. A kvm.ko kernel change also won't be >>>>> supported on existing KVM hosts. These are big drawbacks and the >>>>> kernel approach would need to be significantly better than plain = old >>>>> fork(2) to make it worthwhile. >>>>> >>>>> Stefan >>>>> >>>> I think advantage is memory usage is predictable, so memory us= age >>>> peak can be avoided, by always save the changed pages first. 
>>>> fork()
>>>> does not know which pages are changed. I am not sure if this would
>>>> be a serious issue when server's memory is consumed much, for example,
>>>> 24G host emulate 11G*2 guest to provide powerful virtual server.
>>>
>>> Memory usage is predictable but guest uptime is unpredictable because
>>> it waits until memory is written out.  This defeats the point of
>>> "live" savevm.  The guest may be stalled arbitrarily.
>>>
>>    I think it is adjustable. There is no much difference with
>> fork(), except get more precise control about the changed pages.
>>    Kernel intercept the change, and stores the changed page in another
>> page, similar to fork(). When userspace qemu code execute, save some
>> pages to disk. Buffer can be used like some lubricant. When Buffer =
>> MAX, it equals to fork(), guest runs more lively. When Buffer = 0,
>> guest runs less lively. I think it allows user to find a good balance
>> point with a parameter.
>>    It is harder to implement, just want to show the idea.
>
> You are right.  You could set a bigger buffer size to increase guest
> uptime.
>
>>> The fork child can minimize the chance of out-of-memory by using
>>> madvise(MADV_DONTNEED) after pages have been written out.
>>    It seems no way to make sure the written out page is the changed
>> pages, so it have a good chance the written one is the unchanged and
>> still used by the other qemu process.
>
> The KVM dirty log tells you which pages were touched.  The fork child
> process could give priority to the pages which have been touched by the
> guest.  They must be written out and marked madvise(MADV_DONTNEED) as
> soon as possible.
   Hmm, if the dirty log still works normally in the child process and
reflects the memory status of the parent rather than the child, then the
problem could be solved like this: when there are too many dirty pages,
the child tells the parent to wait for some time. But I haven't checked
whether kvm.ko behaves like that.
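
A rough sketch of the ordering idea quoted above: write out the pages the
guest has already touched first, and release each page with
madvise(MADV_DONTNEED) as soon as it has been written. It is Python
(3.8 or later, for mmap.madvise) for brevity and is not QEMU code. The
name write_dirty_first, the dirty list (a stand-in for whatever the KVM
dirty log would report) and the record layout are all assumptions, and it
presumes a save format that accepts pages in arbitrary order, which has
not been confirmed in this thread.

```python
import mmap
import struct

PAGE = mmap.PAGESIZE


def write_dirty_first(ram, dirty, out):
    """Write every page of `ram` to `out`, pages listed in `dirty` first.

    Meant to run in the fork child. `ram` is an mmap object standing in
    for guest RAM; `dirty` is a hypothetical list of page indexes already
    touched by the guest. Each record is an 8-byte little-endian page
    index followed by the page contents (a made-up format that tolerates
    any page order). Returns the order in which pages were written.
    """
    npages = len(ram) // PAGE
    dirty_set = {p for p in dirty if 0 <= p < npages}
    order = sorted(dirty_set) + [p for p in range(npages) if p not in dirty_set]
    for page in order:
        off = page * PAGE
        out.write(struct.pack("<Q", page))
        out.write(ram[off:off + PAGE])
        # This process no longer needs the page; let the kernel reclaim it.
        ram.madvise(mmap.MADV_DONTNEED, off, PAGE)
    return order
```

Writing touched pages first matters because those are the pages for which
parent and child already hold separate copies; dropping them early is
what limits copy-on-write memory growth.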
> > I haven't looked at the vmsave data format yet to see if memory pages > can be saved in random order, but this might work. It reduces the > likelihood of copy-on-write memory growth. > > Stefan > --=20 Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52397) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7juI-0005Zx-U0 for qemu-devel@nongnu.org; Fri, 09 Aug 2013 06:27:00 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V7juC-0000XU-AB for qemu-devel@nongnu.org; Fri, 09 Aug 2013 06:26:54 -0400 Received: from szxga01-in.huawei.com ([119.145.14.64]:36069) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7juB-0000WJ-Mj for qemu-devel@nongnu.org; Fri, 09 Aug 2013 06:26:48 -0400 From: Chijianchun Date: Fri, 9 Aug 2013 10:20:49 +0000 Message-ID: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Content-Language: zh-CN Content-Type: multipart/alternative; boundary="_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_" MIME-Version: 1.0 Subject: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "aliguori@us.ibm.com" , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "avi@redhat.com" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" --_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restri= ctions to users. Are there plans to achieve ram live Snapshot feature? in my mind, Snapshots can not occupy additional too much memory, So when th= e memory needs to be changed, the old memory page is needed to flush to the= file first. 
But flushing to a file is much slower than writing to memory, and while flushing, the vcpu or VM needs to be paused until the flush has finished, so it goes pause...resume...pause...resume..., getting slower and slower.

Is this idea feasible? Are there any other thoughts?

--_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable


--_000_33FB050264B7AD4DBD6583581F2E03104B764728nkgeml511mbxchi_-- From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:34669) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7omE-0000ar-QL for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:39:03 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V7om6-0001LU-CS for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:38:54 -0400 Received: from mail-ee0-x232.google.com ([2a00:1450:4013:c00::232]:34078) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7om6-0001LO-5m for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:38:46 -0400 Received: by mail-ee0-f50.google.com with SMTP id d51so2230858eek.9 for ; Fri, 09 Aug 2013 08:38:45 -0700 (PDT) Sender: Paolo Bonzini Message-ID: <52050CE4.6000306@redhat.com> Date: Fri, 09 Aug 2013 17:38:12 +0200 From: Paolo Bonzini MIME-Version: 1.0 References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Chijianchun Cc: "aliguori@us.ibm.com" , "kvm@vger.kernel.org" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" , "paul@codesourcery.com" , "avi@redhat.com" Il 09/08/2013 12:20, Chijianchun ha scritto: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly > restrictions to users. > > Are there plans to achieve ram live Snapshot feature? > > in my mind, Snapshots can not occupy additional too much memory, So when > the memory needs to be changed, the old memory page is needed to flush > to the file first. 
But flushing to file is too slower than memory, and > when flushing, the vcpu or VM is need to be paused until finished > flushing, so pause...resume...pause...resume............., more and > more slower. > > Is this idea feasible? Are there any other thoughts? > This looks very similar to postcopy migration (you can Google it). The infrastructure for postcopy migration could be used for this as well. Paolo From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36644) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7oso-0005nI-KZ for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:45:50 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V7osg-0004Uk-C4 for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:45:42 -0400 Received: from e7.ny.us.ibm.com ([32.97.182.137]:54518) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7osg-0004U5-8E for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:45:34 -0400 Received: from /spool/local by e7.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Fri, 9 Aug 2013 11:45:31 -0400 Received: from d01relay07.pok.ibm.com (d01relay07.pok.ibm.com [9.56.227.147]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id 0237D38C803B for ; Fri, 9 Aug 2013 11:45:28 -0400 (EDT) Received: from d01av05.pok.ibm.com (d01av05.pok.ibm.com [9.56.224.195]) by d01relay07.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r79FjRei31260772 for ; Fri, 9 Aug 2013 11:45:27 -0400 Received: from d01av05.pok.ibm.com (loopback [127.0.0.1]) by d01av05.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r79FjRrY027684 for ; Fri, 9 Aug 2013 11:45:27 -0400 From: Anthony Liguori In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Date: Fri, 09 Aug 2013 10:45:22 -0500 Message-ID: <877gfu3m25.fsf@codemonkey.ws> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Chijianchun , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "avi@redhat.com" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" Chijianchun writes: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. > > Are there plans to achieve ram live Snapshot feature? I think you mean a live version of the savevm command. You can approximate live migrating to a file, creating an external disk snapshot, then resuming the guest. Regards, Anthony Liguori > > in my mind, Snapshots can not occupy additional too much memory, So when the memory needs to be changed, the old memory page is needed to flush to the file first. But flushing to file is too slower than memory, and when flushing, the vcpu or VM is need to be paused until finished flushing, so pause...resume...pause...resume............., more and more slower. 
> > Is this idea feasible? Are there any other thoughts? From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38367) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7oym-0005A9-5d for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:51:56 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V7oyh-000709-S2 for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:51:52 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33138) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V7oyh-0006zj-G0 for qemu-devel@nongnu.org; Fri, 09 Aug 2013 11:51:47 -0400 Message-ID: <5205100E.60007@redhat.com> Date: Fri, 09 Aug 2013 09:51:42 -0600 From: Eric Blake MIME-Version: 1.0 References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <877gfu3m25.fsf@codemonkey.ws> In-Reply-To: <877gfu3m25.fsf@codemonkey.ws> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O" Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Anthony Liguori Cc: Chijianchun , "mtosatti@redhat.com" , "paul@codesourcery.com" , "kvm@vger.kernel.org" , "qemu-devel@nongnu.org" This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 08/09/2013 09:45 AM, Anthony Liguori wrote: > Chijianchun writes: >=20 >> Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly r= estrictions to users. >> >> Are there plans to achieve ram live Snapshot feature? >=20 > I think you mean a live version of the savevm command. >=20 > You can approximate live migrating to a file, creating an external disk= > snapshot, then resuming the guest. 
And libvirt does just that, since libvirt 1.0.5, for its external RAM
snapshots.  The vcpu pause is a mere fraction of a second, so it is
generally not noticeable as any guest downtime.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

--0C26arUE2NsreIrqwMeaQrTXkiqLo8J4O--

From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:44240) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8ou8-0003bQ-PU for qemu-devel@nongnu.org; Mon, 12 Aug 2013 05:59:18 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V8ou2-0001eB-VU for qemu-devel@nongnu.org; Mon, 12 Aug 2013 05:59:12 -0400
Received: from mail-ee0-x236.google.com ([2a00:1450:4013:c00::236]:53548) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8ou2-0001dw-OV for qemu-devel@nongnu.org; Mon, 12 Aug 2013 05:59:06 -0400
Received: by mail-ee0-f54.google.com with SMTP id e53so3399090eek.27 for ; Mon, 12 Aug 2013 02:59:05 -0700 (PDT)
Date: Mon, 12 Aug 2013 11:59:03 +0200
From: Stefan Hajnoczi
Message-ID: <20130812095903.GF29880@stefanha-thinkpad.redhat.com>
References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Chijianchun Cc: "aliguori@us.ibm.com" , "kvm@vger.kernel.org" , "mtosatti@redhat.com" , "qemu-devel@nongnu.org" , "paul@codesourcery.com" , fred.konrad@greensocs.com, xiawenc@linux.vnet.ibm.com, "avi@redhat.com" On Fri, Aug 09, 2013 at 10:20:49AM +0000, Chijianchun wrote: > Now in KVM, when RAM snapshot, vcpus needs stopped, it is Unfriendly restrictions to users. > > Are there plans to achieve ram live Snapshot feature? > > in my mind, Snapshots can not occupy additional too much memory, So when the memory needs to be changed, the old memory page is needed to flush to the file first. But flushing to file is too slower than memory, and when flushing, the vcpu or VM is need to be paused until finished flushing, so pause...resume...pause...resume............., more and more slower. > > Is this idea feasible? Are there any other thoughts? A few people have looked at live vmsave or guest RAM snapshots. The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to capture the state of guest RAM and then send it back to the parent process. The guest is only paused for a brief instant during fork(2) and can continue to run afterwards. The child process is a simple loop that sends the contents of guest RAM back to the parent process over a pipe or writes the memory pages to the save file on disk. It performs no logic besides writing out guest RAM. 
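
A minimal sketch of the fork(2) idea described above, in Python for
brevity rather than QEMU's C. It assumes a plain bytearray standing in
for guest RAM and a hypothetical helper name, snapshot_via_fork; none of
this is QEMU code. The child sees a copy-on-write view of memory as it
was at fork time, writes it to the save file, and exits without ever
returning to the parent's main loop.

```python
import mmap
import os

PAGE = mmap.PAGESIZE


def snapshot_via_fork(ram, path):
    """Fork a child that writes the fork-time contents of `ram` to `path`.

    `ram` is a bytearray standing in for guest RAM (an assumption for
    illustration). Returns the child's pid; the caller resumes at once
    and may keep modifying `ram` while the child is still writing.
    """
    pid = os.fork()
    if pid == 0:
        # Child: simple loop that writes out memory and nothing else.
        status = 1
        try:
            with open(path, "wb") as f:
                for off in range(0, len(ram), PAGE):
                    f.write(ram[off:off + PAGE])
            status = 0
        finally:
            # Exit directly; never fall back into the parent's event loop.
            os._exit(status)
    return pid
```

The parent would later call os.waitpid(pid, 0), or react to SIGCHLD, to
learn when the save file is complete. Pages the parent modifies in the
meantime are duplicated by the kernel, which is where the memory growth
discussed later in the thread comes from.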
Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:51862) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8pLM-0003VP-A7 for qemu-devel@nongnu.org; Mon, 12 Aug 2013 06:27:24 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V8pLH-0004V5-Lf for qemu-devel@nongnu.org; Mon, 12 Aug 2013 06:27:20 -0400 Received: from mail.avalus.com ([2001:41c8:10:1dd::10]:39442) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8pLH-0004Up-Fp for qemu-devel@nongnu.org; Mon, 12 Aug 2013 06:27:15 -0400 Date: Mon, 12 Aug 2013 11:26:59 +0100 From: Alex Bligh Message-ID: <232DEBC1058FA4A5BD76D16A@Ximines.local> In-Reply-To: <20130812095903.GF29880@stefanha-thinkpad.redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? Reply-To: Alex Bligh List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi , Chijianchun Cc: aliguori@us.ibm.com, kvm@vger.kernel.org, mtosatti@redhat.com, qemu-devel@nongnu.org, avi@redhat.com, paul@codesourcery.com, Alex Bligh , xiawenc@linux.vnet.ibm.com, fred.konrad@greensocs.com --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote: > The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to > capture the state of guest RAM and then send it back to the parent > process. The guest is only paused for a brief instant during fork(2) > and can continue to run afterwards. How would you capture the state of emulated hardware which might not be in the guest RAM? 
-- Alex Bligh From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36352) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8qNV-0002E4-Of for qemu-devel@nongnu.org; Mon, 12 Aug 2013 07:33:38 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V8qNU-00067z-7N for qemu-devel@nongnu.org; Mon, 12 Aug 2013 07:33:37 -0400 Received: from mail-qc0-x233.google.com ([2607:f8b0:400d:c01::233]:60108) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V8qNU-00067n-32 for qemu-devel@nongnu.org; Mon, 12 Aug 2013 07:33:36 -0400 Received: by mail-qc0-f179.google.com with SMTP id n10so3340720qcx.10 for ; Mon, 12 Aug 2013 04:33:35 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <232DEBC1058FA4A5BD76D16A@Ximines.local> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> Date: Mon, 12 Aug 2013 13:33:35 +0200 Message-ID: From: Stefan Hajnoczi Content-Type: text/plain; charset=ISO-8859-1 Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Bligh Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Paul Brook , Wayne Xia , fred.konrad@greensocs.com On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: > --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi > wrote: > >> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >> capture the state of guest RAM and then send it back to the parent >> process. The guest is only paused for a brief instant during fork(2) >> and can continue to run afterwards. > > > How would you capture the state of emulated hardware which might not > be in the guest RAM? Exactly the same way vmsave works today. It calls the device's save functions which serialize state to file. 
The difference between today's vmsave and the fork(2) approach is that QEMU does not need to wait for guest RAM to be written to file before resuming the guest. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36547) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V94l9-0006za-1G for qemu-devel@nongnu.org; Mon, 12 Aug 2013 22:55:06 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V94l1-00007H-OC for qemu-devel@nongnu.org; Mon, 12 Aug 2013 22:54:58 -0400 Received: from e28smtp03.in.ibm.com ([122.248.162.3]:53149) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V94l1-00006t-3s for qemu-devel@nongnu.org; Mon, 12 Aug 2013 22:54:51 -0400 Received: from /spool/local by e28smtp03.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 13 Aug 2013 08:17:15 +0530 Received: from d28relay05.in.ibm.com (d28relay05.in.ibm.com [9.184.220.62]) by d28dlp01.in.ibm.com (Postfix) with ESMTP id 89B40E0054 for ; Tue, 13 Aug 2013 08:25:00 +0530 (IST) Received: from d28av02.in.ibm.com (d28av02.in.ibm.com [9.184.220.64]) by d28relay05.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7D2saqM38273174 for ; Tue, 13 Aug 2013 08:24:36 +0530 Received: from d28av02.in.ibm.com (localhost [127.0.0.1]) by d28av02.in.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7D2sc6K004019 for ; Tue, 13 Aug 2013 08:24:39 +0530 Message-ID: <52099FA3.6010207@linux.vnet.ibm.com> Date: Tue, 13 Aug 2013 10:53:23 +0800 From: Wenchao Xia MIME-Version: 1.0 References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com, Paul Brook On 2013-8-12 19:33, Stefan Hajnoczi wrote: > On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >> wrote: >> >>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>> capture the state of guest RAM and then send it back to the parent >>> process. The guest is only paused for a brief instant during fork(2) >>> and can continue to run afterwards. >> >> >> How would you capture the state of emulated hardware which might not >> be in the guest RAM? > > Exactly the same way vmsave works today. It calls the device's save > functions which serialize state to file. > > The difference between today's vmsave and the fork(2) approach is that > QEMU does not need to wait for guest RAM to be written to file before > resuming the guest. > > Stefan > I am worried about what the GLib documentation says: "On Unix, the GLib mainloop is incompatible with fork(). Any program using the mainloop must either exec() or exit() from the child without returning to the mainloop." There is another way to do it: intercept the writes in kvm.ko (or other kernel code). Since the key is intercepting memory changes, this can already be done in userspace in TCG mode, so only the missing part for KVM mode needs to be added. Another benefit of this approach is that memory usage can be controlled. For example, an ioctl() could set up a fixed-size buffer in which the kernel keeps the intercepted write data, which avoids switching back to userspace QEMU code frequently. Whenever the buffer is full, control returns to userspace QEMU code, which saves the data to disk. 
I haven't checked exactly how Intel guest mode handles page faults, so I can't estimate the performance cost of switching between guest mode and root mode, but it should not be worse than fork(). -- Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:43409) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V99qz-0001em-Df for qemu-devel@nongnu.org; Tue, 13 Aug 2013 04:21:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V99qy-0006Tj-1b for qemu-devel@nongnu.org; Tue, 13 Aug 2013 04:21:21 -0400 Received: from mail-qe0-x22a.google.com ([2607:f8b0:400d:c02::22a]:56760) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V99qx-0006TY-Tm for qemu-devel@nongnu.org; Tue, 13 Aug 2013 04:21:19 -0400 Received: by mail-qe0-f42.google.com with SMTP id s14so4171312qeb.29 for ; Tue, 13 Aug 2013 01:21:19 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <52099FA3.6010207@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> Date: Tue, 13 Aug 2013 10:21:19 +0200 Message-ID: From: Stefan Hajnoczi Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Wenchao Xia Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com, Paul Brook On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote: > On 2013-8-12 19:33, Stefan Hajnoczi wrote: > >> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>> >>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>> wrote: >>> >>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>> capture the state of guest RAM and then send it back to the parent >>>> process. The guest is only paused for a brief instant during fork(2) >>>> and can continue to run afterwards. >>> >>> >>> >>> How would you capture the state of emulated hardware which might not >>> be in the guest RAM? >> >> >> Exactly the same way vmsave works today. It calls the device's save >> functions which serialize state to file. >> >> The difference between today's vmsave and the fork(2) approach is that >> QEMU does not need to wait for guest RAM to be written to file before >> resuming the guest. >> >> Stefan >> > I have a worry about what glib says: > > "On Unix, the GLib mainloop is incompatible with fork(). Any program > using the mainloop must either exec() or exit() from the child without > returning to the mainloop. " This is fine, the child just writes out the memory pages and exits. It never returns to the glib mainloop. > There is another way to do it: intercept the write in kvm.ko(or other > kernel code). Since the key is intercept the memory change, we can do > it in userspace in TCG mode, thus we can add the missing part in KVM > mode. Another benefit of this way is: the used memory can be > controlled. For example, with ioctl(), set a buffer of a fixed size > which keeps the intercepted write data by kernel code, which can avoid > frequently switch back to user space qemu code. 
when it is full always > return back to userspace's qemu code, let qemu code save the data into > disk. I haven't check the exactly behavior of Intel guest mode about > how to handle page fault, so can't estimate the performance caused by > switching of guest mode and root mode, but it should not be worse than > fork(). The fork(2) approach is portable, covers both KVM and TCG, and doesn't require kernel changes. A kvm.ko kernel change also won't be supported on existing KVM hosts. These are big drawbacks and the kernel approach would need to be significantly better than plain old fork(2) to make it worthwhile. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48688) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9QIm-0001d4-Cm for qemu-devel@nongnu.org; Tue, 13 Aug 2013 21:55:16 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V9QIf-0001m8-4w for qemu-devel@nongnu.org; Tue, 13 Aug 2013 21:55:08 -0400 Received: from e23smtp06.au.ibm.com ([202.81.31.148]:36459) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9QIe-0001lj-Fx for qemu-devel@nongnu.org; Tue, 13 Aug 2013 21:55:01 -0400 Received: from /spool/local by e23smtp06.au.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Wed, 14 Aug 2013 11:46:26 +1000 Received: from d23relay03.au.ibm.com (d23relay03.au.ibm.com [9.190.235.21]) by d23dlp01.au.ibm.com (Postfix) with ESMTP id C81F22CE8052 for ; Wed, 14 Aug 2013 11:54:34 +1000 (EST) Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96]) by d23relay03.au.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7E1sLt410813910 for ; Wed, 14 Aug 2013 11:54:24 +1000 Received: from d23av01.au.ibm.com (localhost [127.0.0.1]) by d23av01.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7E1sVjP026338 for ; Wed, 14 Aug 2013 11:54:32 +1000 Message-ID: <520AE34D.8000002@linux.vnet.ibm.com> Date: Wed, 14 Aug 2013 09:54:21 +0800 From: Wenchao Xia MIME-Version: 1.0 References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com On 2013-8-13 16:21, Stefan Hajnoczi wrote: > On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote: >> On 2013-8-12 19:33, Stefan Hajnoczi wrote: >> >>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>>> >>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>>> wrote: >>>> >>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>>> capture the state of guest RAM and then send it back to the parent >>>>> process. The guest is only paused for a brief instant during fork(2) >>>>> and can continue to run afterwards. 
>>>> >>>> >>>> >>>> How would you capture the state of emulated hardware which might not >>>> be in the guest RAM? >>> >>> >>> Exactly the same way vmsave works today. It calls the device's save >>> functions which serialize state to file. >>> >>> The difference between today's vmsave and the fork(2) approach is that >>> QEMU does not need to wait for guest RAM to be written to file before >>> resuming the guest. >>> >>> Stefan >>> >> I have a worry about what glib says: >> >> "On Unix, the GLib mainloop is incompatible with fork(). Any program >> using the mainloop must either exec() or exit() from the child without >> returning to the mainloop. " > > This is fine, the child just writes out the memory pages and exits. > It never returns to the glib mainloop. > >> There is another way to do it: intercept the write in kvm.ko(or other >> kernel code). Since the key is intercept the memory change, we can do >> it in userspace in TCG mode, thus we can add the missing part in KVM >> mode. Another benefit of this way is: the used memory can be >> controlled. For example, with ioctl(), set a buffer of a fixed size >> which keeps the intercepted write data by kernel code, which can avoid >> frequently switch back to user space qemu code. when it is full always >> return back to userspace's qemu code, let qemu code save the data into >> disk. I haven't check the exactly behavior of Intel guest mode about >> how to handle page fault, so can't estimate the performance caused by >> switching of guest mode and root mode, but it should not be worse than >> fork(). > > The fork(2) approach is portable, covers both KVM and TCG, and doesn't > require kernel changes. A kvm.ko kernel change also won't be > supported on existing KVM hosts. These are big drawbacks and the > kernel approach would need to be significantly better than plain old > fork(2) to make it worthwhile. 
> > Stefan > I think the advantage is that memory usage is predictable, so a peak in memory usage can be avoided by always saving the changed pages first. fork() does not know which pages have changed. I am not sure whether this would be a serious issue when most of the server's memory is in use, for example when a 24G host runs two 11G guests to provide powerful virtual servers. -- Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57003) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9W4L-00050m-Cl for qemu-devel@nongnu.org; Wed, 14 Aug 2013 04:04:38 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V9W4J-0004rP-S1 for qemu-devel@nongnu.org; Wed, 14 Aug 2013 04:04:37 -0400 Received: from mail-qc0-x233.google.com ([2607:f8b0:400d:c01::233]:53109) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9Vu1-0001YO-6A for qemu-devel@nongnu.org; Wed, 14 Aug 2013 03:53:57 -0400 Received: by mail-qc0-f179.google.com with SMTP id n10so4596682qcx.24 for ; Wed, 14 Aug 2013 00:53:56 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <520AE34D.8000002@linux.vnet.ibm.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> Date: Wed, 14 Aug 2013 09:53:56 +0200 Message-ID: From: Stefan Hajnoczi Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Wenchao Xia Cc: Anthony Liguori , kvm , Paul Brook , Marcelo Tosatti , qemu-devel , Chijianchun , Avi Kivity , Alex Bligh , fred.konrad@greensocs.com On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: > On 2013-8-13 16:21, Stefan Hajnoczi wrote: > >> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia >> wrote: >>> >>> On 2013-8-12 19:33, Stefan Hajnoczi wrote: >>> >>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>>>> >>>>> >>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>>>> wrote: >>>>> >>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>>>> capture the state of guest RAM and then send it back to the parent >>>>>> process. The guest is only paused for a brief instant during fork(2) >>>>>> and can continue to run afterwards. >>>>> >>>>> >>>>> >>>>> >>>>> How would you capture the state of emulated hardware which might not >>>>> be in the guest RAM? >>>> >>>> >>>> >>>> Exactly the same way vmsave works today. It calls the device's save >>>> functions which serialize state to file. >>>> >>>> The difference between today's vmsave and the fork(2) approach is that >>>> QEMU does not need to wait for guest RAM to be written to file before >>>> resuming the guest. >>>> >>>> Stefan >>>> >>> I have a worry about what glib says: >>> >>> "On Unix, the GLib mainloop is incompatible with fork(). Any program >>> using the mainloop must either exec() or exit() from the child without >>> returning to the mainloop. " >> >> >> This is fine, the child just writes out the memory pages and exits. >> It never returns to the glib mainloop. >> >>> There is another way to do it: intercept the write in kvm.ko(or other >>> kernel code). Since the key is intercept the memory change, we can do >>> it in userspace in TCG mode, thus we can add the missing part in KVM >>> mode. 
Another benefit of this way is: the used memory can be >>> controlled. For example, with ioctl(), set a buffer of a fixed size >>> which keeps the intercepted write data by kernel code, which can avoid >>> frequently switch back to user space qemu code. when it is full always >>> return back to userspace's qemu code, let qemu code save the data into >>> disk. I haven't check the exactly behavior of Intel guest mode about >>> how to handle page fault, so can't estimate the performance caused by >>> switching of guest mode and root mode, but it should not be worse than >>> fork(). >> >> >> The fork(2) approach is portable, covers both KVM and TCG, and doesn't >> require kernel changes. A kvm.ko kernel change also won't be >> supported on existing KVM hosts. These are big drawbacks and the >> kernel approach would need to be significantly better than plain old >> fork(2) to make it worthwhile. >> >> Stefan >> > I think advantage is memory usage is predictable, so memory usage > peak can be avoided, by always save the changed pages first. fork() > does not know which pages are changed. I am not sure if this would > be a serious issue when server's memory is consumed much, for example, > 24G host emulate 11G*2 guest to provide powerful virtual server. Memory usage is predictable but guest uptime is unpredictable because it waits until memory is written out. This defeats the point of "live" savevm. The guest may be stalled arbitrarily. The fork child can minimize the chance of out-of-memory by using madvise(MADV_DONTNEED) after pages have been written out. The way fork handles memory overcommit on Linux is configurable, but I guess in a situation where memory runs out the Out-of-Memory Killer will kill a process (probably QEMU since it is hogging so much memory). The risk of OOM can be avoided by running the traditional vmsave which stops the guest instead of using "live" vmsave. 
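The madvise(MADV_DONTNEED) suggestion for the fork child can be sketched as follows. This is only an illustration under stated assumptions: it targets Linux with Python 3.8 or later (where mmap objects have a madvise() method), a private anonymous mapping stands in for guest RAM, and the names write_and_release and live_save are invented for the example. Once a chunk is on disk the child drops its own reference to those pages, so a later write by the parent does not leave two copies of the page resident.

```python
import mmap
import os


def write_and_release(ram: mmap.mmap, path: str,
                      chunk: int = mmap.PAGESIZE) -> None:
    """Write `ram` to `path` chunk by chunk, dropping each chunk afterwards.

    `chunk` must be a multiple of mmap.PAGESIZE so that every madvise()
    call starts on a page boundary.
    """
    with open(path, "wb") as f:
        for off in range(0, len(ram), chunk):
            length = min(chunk, len(ram) - off)
            f.write(ram[off:off + length])
            # The data is handed to the file; this process no longer
            # needs its copy-on-write reference to these pages.
            ram.madvise(mmap.MADV_DONTNEED, off, length)


def live_save(ram: mmap.mmap, path: str) -> int:
    """Fork a writer child for `ram` and return its pid to the caller."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            write_and_release(ram, path)
            status = 0
        finally:
            os._exit(status)  # the child only writes pages, then exits
    return pid
```

The mapping must be private (for example mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)); Python's default for anonymous mappings is MAP_SHARED, in which case the child would see the parent's later writes and the snapshot would not be consistent.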
The other option is to live migrate to file, but the disadvantage there is that you cannot choose exactly when the state is saved; it happens sometime after live migration is initiated. There are trade-offs with all the approaches; it depends on what is most important to you. Stefan From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:58683) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9WDF-0000wB-HA for qemu-devel@nongnu.org; Wed, 14 Aug 2013 04:13:50 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V9WDE-0007x6-AL for qemu-devel@nongnu.org; Wed, 14 Aug 2013 04:13:49 -0400 Received: from mail.avalus.com ([2001:41c8:10:1dd::10]:39579) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9WDE-0007vP-2a for qemu-devel@nongnu.org; Wed, 14 Aug 2013 04:13:48 -0400 Mime-Version: 1.0 (Apple Message framework v1085) Content-Type: text/plain; charset=us-ascii From: Alex Bligh In-Reply-To: Date: Wed, 14 Aug 2013 09:13:35 +0100 Content-Transfer-Encoding: 7bit Message-Id: <7BB8F666-B20F-4651-B0B9-C40DBB2282B5@alex.org.uk> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Paul Brook , Alex Bligh , fred.konrad@greensocs.com, Wenchao Xia , Avi Kivity On 14 Aug 2013, at 08:53, Stefan Hajnoczi wrote: > The fork child can minimize the chance of out-of-memory by using > madvise(MADV_DONTNEED) after pages have been written out. This may also be helpful (last clause) before starting writing. 
MADV_SEQUENTIAL Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.) -- Alex Bligh From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:33864) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9nHe-000525-EK for qemu-devel@nongnu.org; Wed, 14 Aug 2013 22:27:37 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V9nHU-0001qp-Qg for qemu-devel@nongnu.org; Wed, 14 Aug 2013 22:27:30 -0400 Received: from e28smtp06.in.ibm.com ([122.248.162.6]:51414) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9nHU-0001qA-5W for qemu-devel@nongnu.org; Wed, 14 Aug 2013 22:27:20 -0400 Received: from /spool/local by e28smtp06.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 15 Aug 2013 07:47:53 +0530 Received: from d28relay03.in.ibm.com (d28relay03.in.ibm.com [9.184.220.60]) by d28dlp02.in.ibm.com (Postfix) with ESMTP id E988B3940053 for ; Thu, 15 Aug 2013 07:56:57 +0530 (IST) Received: from d28av01.in.ibm.com (d28av01.in.ibm.com [9.184.220.63]) by d28relay03.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r7F2SQPF46006390 for ; Thu, 15 Aug 2013 07:58:27 +0530 Received: from d28av01.in.ibm.com (localhost [127.0.0.1]) by d28av01.in.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id r7F2R47f008989 for ; Thu, 15 Aug 2013 07:57:05 +0530 Message-ID: <520C3C5C.5000106@linux.vnet.ibm.com> Date: Thu, 15 Aug 2013 10:26:36 +0800 From: Wenchao Xia MIME-Version: 1.0 References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> In-Reply-To: Content-Type: text/plain; charset=GB2312 Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] 
Are there plans to achieve ram live Snapshot feature? List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: Anthony Liguori , kvm , Marcelo Tosatti , qemu-devel , Chijianchun , Paul Brook , Alex Bligh , fred.konrad@greensocs.com, Avi Kivity On 2013-8-14 15:53, Stefan Hajnoczi wrote: > On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote: >> On 2013-8-13 16:21, Stefan Hajnoczi wrote: >> >>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia >>> wrote: >>>> >>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote: >>>> >>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote: >>>>>> >>>>>> >>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi >>>>>> wrote: >>>>>> >>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to >>>>>>> capture the state of guest RAM and then send it back to the parent >>>>>>> process. The guest is only paused for a brief instant during fork(2) >>>>>>> and can continue to run afterwards. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> How would you capture the state of emulated hardware which might not >>>>>> be in the guest RAM? >>>>> >>>>> >>>>> >>>>> Exactly the same way vmsave works today. It calls the device's save >>>>> functions which serialize state to file. >>>>> >>>>> The difference between today's vmsave and the fork(2) approach is that >>>>> QEMU does not need to wait for guest RAM to be written to file before >>>>> resuming the guest. >>>>> >>>>> Stefan >>>>> >>>> I have a worry about what glib says: >>>> >>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program >>>> using the mainloop must either exec() or exit() from the child without >>>> returning to the mainloop. " >>> >>> >>> This is fine, the child just writes out the memory pages and exits. >>> It never returns to the glib mainloop. >>> >>>> There is another way to do it: intercept the write in kvm.ko(or other >>>> kernel code). 
Since the key is intercept the memory change, we can do >>>> it in userspace in TCG mode, thus we can add the missing part in KVM >>>> mode. Another benefit of this way is: the used memory can be >>>> controlled. For example, with ioctl(), set a buffer of a fixed size >>>> which keeps the intercepted write data by kernel code, which can avoid >>>> frequently switch back to user space qemu code. when it is full always >>>> return back to userspace's qemu code, let qemu code save the data into >>>> disk. I haven't check the exactly behavior of Intel guest mode about >>>> how to handle page fault, so can't estimate the performance caused by >>>> switching of guest mode and root mode, but it should not be worse than >>>> fork(). >>> >>> >>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't >>> require kernel changes. A kvm.ko kernel change also won't be >>> supported on existing KVM hosts. These are big drawbacks and the >>> kernel approach would need to be significantly better than plain old >>> fork(2) to make it worthwhile. >>> >>> Stefan >>> >> I think advantage is memory usage is predictable, so memory usage >> peak can be avoided, by always save the changed pages first. fork() >> does not know which pages are changed. I am not sure if this would >> be a serious issue when server's memory is consumed much, for example, >> 24G host emulate 11G*2 guest to provide powerful virtual server. > > Memory usage is predictable but guest uptime is unpredictable because > it waits until memory is written out. This defeats the point of > "live" savevm. The guest may be stalled arbitrarily. > I think it is adjustable. There is not much difference from fork(), except that it gives more precise control over the changed pages. The kernel intercepts the change and stores the changed page in another page, similar to fork(). When userspace QEMU code executes, it saves some pages to disk. The buffer works like a lubricant. 
When Buffer = MAX, it is equivalent to fork() and the guest runs more lively. When Buffer = 0, the guest runs less lively. I think this lets the user find a good balance point with a single parameter. It is harder to implement; I just want to show the idea. > The fork child can minimize the chance of out-of-memory by using > madvise(MADV_DONTNEED) after pages have been written out. There seems to be no way to make sure that the pages written out are the changed ones, so there is a good chance that a written page is unchanged and still used by the other QEMU process. > > The way fork handles memory overcommit on Linux is configurable, but I > guess in a situation where memory runs out the Out-of-Memory Killer > will kill a process (probably QEMU since it is hogging so much > memory). > > The risk of OOM can be avoided by running the traditional vmsave which > stops the guest instead of using "live" vmsave. > > The other option is to live migrate to file but the disadvantage there > is that you cannot choose exactly when the state it saved, it happens > sometime after live migration is initiated. > > There are trade-offs with all the approaches, it depends on what is > most important to you. 
> > Stefan > -- Best Regards Wenchao Xia From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:54210) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9sJE-00048W-7F for qemu-devel@nongnu.org; Thu, 15 Aug 2013 03:49:33 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1V9sJ9-0002yb-8S for qemu-devel@nongnu.org; Thu, 15 Aug 2013 03:49:28 -0400 Received: from mail-ee0-x229.google.com ([2a00:1450:4013:c00::229]:51321) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1V9sJ8-0002yQ-VH for qemu-devel@nongnu.org; Thu, 15 Aug 2013 03:49:23 -0400 Received: by mail-ee0-f41.google.com with SMTP id d17so196537eek.14 for ; Thu, 15 Aug 2013 00:49:22 -0700 (PDT) Date: Thu, 15 Aug 2013 09:49:19 +0200 From: Stefan Hajnoczi Message-ID: <20130815074919.GA22521@stefanha-thinkpad.redhat.com> References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com> <20130812095903.GF29880@stefanha-thinkpad.redhat.com> <232DEBC1058FA4A5BD76D16A@Ximines.local> <52099FA3.6010207@linux.vnet.ibm.com> <520AE34D.8000002@linux.vnet.ibm.com> <520C3C5C.5000106@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <520C3C5C.5000106@linux.vnet.ibm.com> Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature? 
To: Wenchao Xia
Cc: Anthony Liguori, kvm, Marcelo Tosatti, qemu-devel, Chijianchun,
 Paul Brook, Alex Bligh, fred.konrad@greensocs.com, Avi Kivity

On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:
> On 2013-8-14 15:53, Stefan Hajnoczi wrote:
> > On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
> >> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
> >>
> >>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
> >>>>
> >>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
> >>>>
> >>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote:
> >>>>>>
> >>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
> >>>>>>
> >>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to
> >>>>>>> capture the state of guest RAM and then send it back to the parent
> >>>>>>> process. The guest is only paused for a brief instant during fork(2)
> >>>>>>> and can continue to run afterwards.
> >>>>>>
> >>>>>> How would you capture the state of emulated hardware which might not
> >>>>>> be in the guest RAM?
> >>>>>
> >>>>> Exactly the same way vmsave works today. It calls the device's save
> >>>>> functions which serialize state to file.
> >>>>>
> >>>>> The difference between today's vmsave and the fork(2) approach is that
> >>>>> QEMU does not need to wait for guest RAM to be written to file before
> >>>>> resuming the guest.
> >>>>>
> >>>>> Stefan
> >>>>>
> >>>> I have a worry about what glib says:
> >>>>
> >>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
> >>>> using the mainloop must either exec() or exit() from the child without
> >>>> returning to the mainloop."
> >>>
> >>> This is fine, the child just writes out the memory pages and exits.
> >>> It never returns to the glib mainloop.
> >>>
> >>>> There is another way to do it: intercept the write in kvm.ko (or other
> >>>> kernel code). Since the key is to intercept the memory change, we can
> >>>> do it in userspace in TCG mode, and then add the missing part for KVM
> >>>> mode. Another benefit of this approach is that the memory used can be
> >>>> controlled. For example, with ioctl(), set a buffer of a fixed size
> >>>> in which kernel code keeps the intercepted write data, which avoids
> >>>> frequently switching back to the userspace qemu code. When it is full,
> >>>> always return to the userspace qemu code and let it save the data to
> >>>> disk. I haven't checked exactly how Intel guest mode handles page
> >>>> faults, so I can't estimate the performance cost of switching between
> >>>> guest mode and root mode, but it should not be worse than fork().
> >>>
> >>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
> >>> require kernel changes. A kvm.ko kernel change also won't be
> >>> supported on existing KVM hosts. These are big drawbacks and the
> >>> kernel approach would need to be significantly better than plain old
> >>> fork(2) to make it worthwhile.
> >>>
> >>> Stefan
> >>>
> >> I think the advantage is that memory usage is predictable, so a memory
> >> usage peak can be avoided by always saving the changed pages first.
> >> fork() does not know which pages are changed. I am not sure if this
> >> would be a serious issue when much of the server's memory is consumed,
> >> for example, a 24G host emulating 11G*2 guests to provide powerful
> >> virtual servers.
> >
> > Memory usage is predictable but guest uptime is unpredictable because
> > it waits until memory is written out. This defeats the point of
> > "live" savevm. The guest may be stalled arbitrarily.
> >
> I think it is adjustable. There is not much difference from fork(),
> except that you get more precise control over the changed pages.
> The kernel intercepts the change and stores the changed page in another
> page, similar to fork(). When the userspace qemu code executes, it saves
> some pages to disk. The buffer can be used like a lubricant. When
> Buffer = MAX, it is equivalent to fork() and the guest runs more lively.
> When Buffer = 0, the guest runs less lively. I think it allows the user
> to find a good balance point with a parameter.
> It is harder to implement; I just want to show the idea.

You are right. You could set a bigger buffer size to increase guest
uptime.

> > The fork child can minimize the chance of out-of-memory by using
> > madvise(MADV_DONTNEED) after pages have been written out.
> There seems to be no way to make sure the written-out pages are the
> changed pages, so there is a good chance that a written page is unchanged
> and still used by the other qemu process.

The KVM dirty log tells you which pages were touched. The fork child
process could give priority to the pages which have been touched by the
guest. They must be written out and marked madvise(MADV_DONTNEED) as
soon as possible.

I haven't looked at the vmsave data format yet to see if memory pages
can be saved in random order, but this might work. It reduces the
likelihood of copy-on-write memory growth.

Stefan

From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <520C8B63.2060304@linux.vnet.ibm.com>
Date: Thu, 15 Aug 2013 16:03:47 +0800
From: Wenchao Xia
MIME-Version: 1.0
References: <33FB050264B7AD4DBD6583581F2E03104B764728@nkgeml511-mbx.china.huawei.com>
 <20130812095903.GF29880@stefanha-thinkpad.redhat.com>
 <232DEBC1058FA4A5BD76D16A@Ximines.local>
 <52099FA3.6010207@linux.vnet.ibm.com>
 <520AE34D.8000002@linux.vnet.ibm.com>
 <520C3C5C.5000106@linux.vnet.ibm.com>
 <20130815074919.GA22521@stefanha-thinkpad.redhat.com>
In-Reply-To: <20130815074919.GA22521@stefanha-thinkpad.redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?
To: Stefan Hajnoczi
Cc: Anthony Liguori, kvm, Marcelo Tosatti, qemu-devel, Chijianchun,
 Paul Brook, Alex Bligh, fred.konrad@greensocs.com, Avi Kivity

On 2013-8-15 15:49, Stefan Hajnoczi wrote:
> On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:
>> On 2013-8-14 15:53, Stefan Hajnoczi wrote:
>>> On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
>>>> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
>>>>
>>>>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
>>>>>>
>>>>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
>>>>>>
>>>>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh wrote:
>>>>>>>>
>>>>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>>> The idea that was discussed on qemu-devel@nongnu.org uses fork(2) to
>>>>>>>>> capture the state of guest RAM and then send it back to the parent
>>>>>>>>> process. The guest is only paused for a brief instant during fork(2)
>>>>>>>>> and can continue to run afterwards.
>>>>>>>>
>>>>>>>> How would you capture the state of emulated hardware which might not
>>>>>>>> be in the guest RAM?
>>>>>>>
>>>>>>> Exactly the same way vmsave works today. It calls the device's save
>>>>>>> functions which serialize state to file.
>>>>>>>
>>>>>>> The difference between today's vmsave and the fork(2) approach is that
>>>>>>> QEMU does not need to wait for guest RAM to be written to file before
>>>>>>> resuming the guest.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>> I have a worry about what glib says:
>>>>>>
>>>>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
>>>>>> using the mainloop must either exec() or exit() from the child without
>>>>>> returning to the mainloop."
>>>>>
>>>>> This is fine, the child just writes out the memory pages and exits.
>>>>> It never returns to the glib mainloop.
>>>>>
>>>>>> There is another way to do it: intercept the write in kvm.ko (or other
>>>>>> kernel code). Since the key is to intercept the memory change, we can
>>>>>> do it in userspace in TCG mode, and then add the missing part for KVM
>>>>>> mode. Another benefit of this approach is that the memory used can be
>>>>>> controlled. For example, with ioctl(), set a buffer of a fixed size
>>>>>> in which kernel code keeps the intercepted write data, which avoids
>>>>>> frequently switching back to the userspace qemu code. When it is full,
>>>>>> always return to the userspace qemu code and let it save the data to
>>>>>> disk. I haven't checked exactly how Intel guest mode handles page
>>>>>> faults, so I can't estimate the performance cost of switching between
>>>>>> guest mode and root mode, but it should not be worse than fork().
>>>>>
>>>>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>>>>> require kernel changes. A kvm.ko kernel change also won't be
>>>>> supported on existing KVM hosts. These are big drawbacks and the
>>>>> kernel approach would need to be significantly better than plain old
>>>>> fork(2) to make it worthwhile.
>>>>>
>>>>> Stefan
>>>>>
>>>> I think the advantage is that memory usage is predictable, so a memory
>>>> usage peak can be avoided by always saving the changed pages first.
>>>> fork() does not know which pages are changed. I am not sure if this
>>>> would be a serious issue when much of the server's memory is consumed,
>>>> for example, a 24G host emulating 11G*2 guests to provide powerful
>>>> virtual servers.
>>>
>>> Memory usage is predictable but guest uptime is unpredictable because
>>> it waits until memory is written out. This defeats the point of
>>> "live" savevm. The guest may be stalled arbitrarily.
>>>
>> I think it is adjustable. There is not much difference from fork(),
>> except that you get more precise control over the changed pages.
>> The kernel intercepts the change and stores the changed page in another
>> page, similar to fork(). When the userspace qemu code executes, it saves
>> some pages to disk. The buffer can be used like a lubricant. When
>> Buffer = MAX, it is equivalent to fork() and the guest runs more lively.
>> When Buffer = 0, the guest runs less lively. I think it allows the user
>> to find a good balance point with a parameter.
>> It is harder to implement; I just want to show the idea.
>
> You are right. You could set a bigger buffer size to increase guest
> uptime.
>
>>> The fork child can minimize the chance of out-of-memory by using
>>> madvise(MADV_DONTNEED) after pages have been written out.
>> There seems to be no way to make sure the written-out pages are the
>> changed pages, so there is a good chance that a written page is unchanged
>> and still used by the other qemu process.
>
> The KVM dirty log tells you which pages were touched. The fork child
> process could give priority to the pages which have been touched by the
> guest. They must be written out and marked madvise(MADV_DONTNEED) as
> soon as possible.

Hmm, if the dirty log still works normally in the child process and
reflects the memory status of the parent rather than the child, then the
problem could be solved this way: when there are too many dirty pages, the
child tells the parent to wait for some time. But I haven't checked whether
kvm.ko behaves like that.

>
> I haven't looked at the vmsave data format yet to see if memory pages
> can be saved in random order, but this might work. It reduces the
> likelihood of copy-on-write memory growth.
>
> Stefan
>

--
Best Regards

Wenchao Xia
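
The fork(2) plus madvise(MADV_DONTNEED) idea discussed above can be
illustrated with a small standalone C sketch. It is not QEMU code and makes
several assumptions that are not from the thread: the names
live_snapshot_demo(), save_page() and NPAGES are hypothetical, a private
anonymous mapping stands in for guest RAM, a shared byte-per-page array
stands in for the KVM dirty log, a pipe tells the child when that array is
ready, and the snapshot is a raw file written with pwrite() so that page
order does not matter (whether the real vmsave format allows this is left
open in the thread). The child saves pages the parent has already dirtied
first, then the rest, and drops each page after writing it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPAGES 64 /* size of the stand-in "guest RAM", in pages */

/* Hypothetical helper: write one page at its own file offset, then drop the
 * child's reference to it so it stops pinning the pre-fork copy. */
static int save_page(int fd, unsigned char *ram, size_t page, size_t psz)
{
    unsigned char *addr = ram + page * psz;

    if (pwrite(fd, addr, psz, (off_t)(page * psz)) != (ssize_t)psz)
        return -1;
    return madvise(addr, psz, MADV_DONTNEED);
}

/* Snapshot a buffer to `path` while the parent keeps modifying it.
 * Returns 0 on success, -1 on failure. */
int live_snapshot_demo(const char *path)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *ram = mmap(NULL, NPAGES * psz, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Shared byte-per-page map; an assumed stand-in for the KVM dirty log. */
    unsigned char *dirty = mmap(NULL, NPAGES, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int go[2], status;
    pid_t pid;

    if (ram == MAP_FAILED || dirty == MAP_FAILED || pipe(go) != 0)
        return -1;
    memset(ram, 'A', NPAGES * psz); /* state at snapshot time */

    pid = fork(); /* the parent is blocked only while fork() runs */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: write out pages and _exit(); never return to the caller. */
        char c;
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

        close(go[1]);
        if (fd < 0 || read(go[0], &c, 1) != 1)
            _exit(1);
        /* Pass 0 saves pages the parent has dirtied, pass 1 the rest. */
        for (int pass = 0; pass < 2; pass++)
            for (size_t p = 0; p < NPAGES; p++)
                if ((dirty[p] != 0) == (pass == 0) &&
                    save_page(fd, ram, p, psz) != 0)
                    _exit(1);
        _exit(close(fd) == 0 ? 0 : 1);
    }

    /* Parent ("guest") keeps running: each write triggers copy-on-write. */
    close(go[0]);
    for (size_t p = 0; p < NPAGES; p += 4) {
        ram[p * psz] = 'B';
        dirty[p] = 1;
    }
    if (write(go[1], "x", 1) != 1)
        return -1;
    close(go[1]);
    /* Demo only: a real caller would not block waiting for the child. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    munmap(ram, NPAGES * psz);
    munmap(dirty, NPAGES);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling live_snapshot_demo("/tmp/snap.bin") should leave a file holding the
buffer as it was at fork() time, even though the parent modified every
fourth page afterwards. Writing the dirtied pages first and dropping them
right away is what limits copy-on-write growth in this sketch; for pages the
parent has not touched yet, dropping them in the child means a later parent
write no longer needs a copy.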