From: Hailiang Zhang
Message-ID: <56B1E3C0.1030906@huawei.com>
Date: Wed, 3 Feb 2016 19:25:52 +0800
In-Reply-To: <56B1CE80.90109@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
To: Wen Congyang, Stefan Hajnoczi
Cc: Kevin Wolf, Changlong Xie, Fam Zheng, fnstml-hwcolo@cn.fujitsu.com,
 qemu devel, peter.huangpeng@huawei.com, Max Reitz, Gonglei, Paolo Bonzini

On 2016/2/3 17:55, Wen Congyang wrote:
> On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
>> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>>> time if the disk is slow/failing. bdrv_drain_all() blocks until all
>>>>>>>> in-flight I/O requests have completed. What does the Primary do if the
>>>>>>>> Secondary becomes unresponsive?
>>>>>>>
>>>>>>> Actually, we knew about this problem, but currently there seems to be
>>>>>>> no better way to resolve it. Do you have any ideas?
>>>>>>
>>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>>> Secondary guest activity to complete?
>>>>>
>>>>> There is no way to know that the Secondary has become unresponsive.
>>>>
>>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>>
>>> I don't understand this.
>>> Here is the COLO checkpoint flow:
>>>
>>> Primary                                        Secondary
>>>      new checkpoint notice --->
>>> vm_stop()                                      vm_stop()
>>>      vm state (device state, memory, cpu) --->
>>>                                                load state
>>>      <--- done
>>> vm_start()                                     vm_start()
>>
>> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
>>
>> I was wondering whether the Secondary can do:
>>
>>      <--- done
>> vm_stop()
>> load state
>>
>> It simply receives the checkpoint data into a buffer and immediately
>> replies with "done". vm_stop() and load state are only performed after
>> sending "done".
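
If I understand the proposal correctly, the Secondary side would look
roughly like the sketch below. This is only my reading of Stefan's idea;
the checkpoint_* helpers are hypothetical stand-ins, not existing
QEMU/COLO functions:

typedef struct CheckpointBuffer CheckpointBuffer;

/* Hypothetical helpers, standing in for the real migration code. */
CheckpointBuffer *checkpoint_recv(int fd);   /* buffer the full checkpoint */
void checkpoint_send_done(int fd);           /* ack "done" to the Primary  */
void checkpoint_load(CheckpointBuffer *buf); /* apply the buffered state   */
void vm_stop(void);
void vm_start(void);

static void secondary_handle_checkpoint(int fd)
{
    /* 1. Receive the checkpoint without touching the running VM. */
    CheckpointBuffer *buf = checkpoint_recv(fd);

    /* 2. Ack immediately; the Primary can resume right away. */
    checkpoint_send_done(fd);

    /* 3. Only now stop the Secondary VM and apply the state. */
    vm_stop();              /* may still block in bdrv_drain_all(), */
    checkpoint_load(buf);   /* but the Primary is no longer waiting */
    vm_start();
}

Wen's concern below about the pages dirtied by the Secondary still
applies, of course.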
> The Secondary vm is running, so we should also get the pages that are
> dirtied by the Secondary vm but not dirtied by the Primary vm.
> We have two ways to do it:
> 1. Cache all of the original memory in the Secondary qemu.
> 2. Send the dirty pfn list to the Primary qemu, and get those pages
>    from it.
>
> If we ack the checkpoint and then call vm_stop(), we can only choose
> option 1. That means the Secondary qemu costs more memory.
> In COLO mode, we compare the output sockets and do a checkpoint if the
> application-level data differs. If we ack the checkpoint and then call
> vm_stop(), the client cannot get any more data until the Secondary vm
> is running again, so we still end up waiting for the Secondary vm.
>
>> The advantage is that the Primary will not be delayed by the Secondary.
>> It's an approach that doesn't block.
>>
>> But perhaps it's a problem if the Secondary is slower than the Primary,
>> since the Secondary still needs to complete vm_stop() and load state
>> before it can resume execution?
>>
>>>>>> I think this really means falling back to microcheckpointing until
>>>>>> the Secondary guest can checkpoint. Instead of a blocking vm_stop()
>>>>>> we would prevent vcpus from running, and when the last pending I/O
>>>>>> finishes the Secondary could apply the last checkpoint. This
>>>>>> approach does not block QEMU (the monitor, etc).
>>>>>
>>>>> If the Secondary host becomes unresponsive, it means that we cannot
>>>>> do microcheckpointing. We should do failover in this case.
>>>>
>>>> This is dangerous because it means that a delay/failure in the
>>>> Secondary would cause the Primary to fail over to the broken
>>>> Secondary. All the more reason not to perform blocking operations on
>>>> the Secondary in the checkpoint code path.
>>>
>>> If the Secondary is broken, the Primary qemu will take over.
>>
>> Does the Primary use a timeout between "new checkpoint notice" and the
>> Secondary's "done" so it can move on if the Secondary is unresponsive?
>>
>> Stefan
>
> To hailiang:
> IIRC, we don't use a timeout, but I think we can add one. In our design,
> there is an external heartbeat to check the primary and secondary status
> and decide when to do a checkpoint.
>
> Thanks
> Wen Congyang

Yes, we may need a timeout to help detect the unresponsive case that
cannot be caught by the external heartbeat module. I will investigate it.
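
For the timeout, a simple option on the Primary side might be a receive
timeout on the checkpoint socket, so that waiting for the Secondary's
"done" fails instead of blocking forever. A minimal sketch with plain
sockets (the function name is just illustrative; in QEMU this would go
through the migration channel code rather than a raw setsockopt):

#include <sys/socket.h>
#include <sys/time.h>

/* Arm a receive timeout on the checkpoint socket: once it expires, a
 * blocked recv() of the "done" message returns -1 with errno set to
 * EAGAIN/EWOULDBLOCK, and the Primary can treat the Secondary as
 * unresponsive and start failover. */
static int checkpoint_set_recv_timeout(int sockfd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    return setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

How to choose the timeout value relative to the heartbeat interval is
the part I still need to investigate.

Thanks,
Hailiang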