From: Hailiang Zhang
Message-ID: <56B1E3C0.1030906@huawei.com>
Date: Wed, 3 Feb 2016 19:25:52 +0800
In-Reply-To: <56B1CE80.90109@cn.fujitsu.com>
Subject: Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication
To: Wen Congyang, Stefan Hajnoczi
Cc: Kevin Wolf, Changlong Xie, Fam Zheng, fnstml-hwcolo@cn.fujitsu.com,
 qemu devel, peter.huangpeng@huawei.com, Max Reitz, Gonglei, Paolo Bonzini

On 2016/2/3 17:55, Wen Congyang wrote:
> On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
>> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>>> time if the disk is slow/failing. bdrv_drain_all() blocks until all
>>>>>>>> in-flight I/O requests have completed. What does the Primary do if the
>>>>>>>> Secondary becomes unresponsive?
>>>>>>>
>>>>>>> Actually, we knew about this problem, but currently there seems to be
>>>>>>> no better way to resolve it. Do you have any ideas?
>>>>>>
>>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>>> Secondary guest activity to complete?
>>>>>
>>>>> There is no way to know that the Secondary has become unresponsive.
>>>>
>>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>>
>>> I don't understand this.
>>> Here is the COLO checkpoint flow:
>>>
>>> Primary                                        Secondary
>>>      new checkpoint notice --->
>>> vm_stop()                                      vm_stop()
>>>      vm state (device state, memory, cpu) --->
>>>                                                load state
>>>      <--- done
>>> vm_start()                                     vm_start()
>>
>> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
>>
>> I was wondering whether the Secondary can do:
>>
>>      <--- done
>> vm_stop()
>> load state
>>
>> It simply receives the checkpoint data into a buffer and immediately
>> replies with "done". vm_stop() and load state are only performed after
>> sending "done".
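
If I understand the proposal correctly, the Secondary side would look
roughly like the sketch below. This is only my reading of Stefan's idea;
the checkpoint_* helpers are hypothetical stand-ins, not existing
QEMU/COLO functions:

typedef struct CheckpointBuffer CheckpointBuffer;

/* Hypothetical helpers, standing in for the real migration code. */
CheckpointBuffer *checkpoint_recv(int fd);   /* buffer the full checkpoint */
void checkpoint_send_done(int fd);           /* ack "done" to the Primary  */
void checkpoint_load(CheckpointBuffer *buf); /* apply the buffered state   */
void vm_stop(void);
void vm_start(void);

static void secondary_handle_checkpoint(int fd)
{
    /* 1. Receive the checkpoint without touching the running VM. */
    CheckpointBuffer *buf = checkpoint_recv(fd);

    /* 2. Ack immediately; the Primary can resume right away. */
    checkpoint_send_done(fd);

    /* 3. Only now stop the Secondary VM and apply the state. */
    vm_stop();              /* may still block in bdrv_drain_all(), */
    checkpoint_load(buf);   /* but the Primary is no longer waiting */
    vm_start();
}

Wen's concern below about the pages dirtied by the Secondary still
applies, of course.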
> The Secondary vm is running, so we should also get the pages that are
> dirtied by the Secondary vm but not dirtied by the Primary vm.
> We have two ways to do it:
> 1. Cache all of the original memory in the Secondary qemu.
> 2. Send the dirty pfn list to the Primary qemu, and get those pages
>    from it.
>
> If we ack the checkpoint and then call vm_stop(), we can only choose
> option 1. That means the Secondary qemu costs more memory.
> In COLO mode, we compare the output sockets and do a checkpoint if the
> application-level data differs. If we ack the checkpoint and then call
> vm_stop(), the client cannot get any more data until the Secondary vm
> is running again, so we still end up waiting for the Secondary vm.
>
>> The advantage is that the Primary will not be delayed by the Secondary.
>> It's an approach that doesn't block.
>>
>> But perhaps it's a problem if the Secondary is slower than the Primary,
>> since the Secondary still needs to complete vm_stop() and load state
>> before it can resume execution?
>>
>>>>>> I think this really means falling back to microcheckpointing until
>>>>>> the Secondary guest can checkpoint. Instead of a blocking vm_stop()
>>>>>> we would prevent vcpus from running, and when the last pending I/O
>>>>>> finishes the Secondary could apply the last checkpoint. This
>>>>>> approach does not block QEMU (the monitor, etc).
>>>>>
>>>>> If the Secondary host becomes unresponsive, it means that we cannot
>>>>> do microcheckpointing. We should do failover in this case.
>>>>
>>>> This is dangerous because it means that a delay/failure in the
>>>> Secondary would cause the Primary to fail over to the broken
>>>> Secondary. All the more reason not to perform blocking operations on
>>>> the Secondary in the checkpoint code path.
>>>
>>> If the Secondary is broken, the Primary qemu will take over.
>>
>> Does the Primary use a timeout between "new checkpoint notice" and the
>> Secondary's "done" so it can move on if the Secondary is unresponsive?
>>
>> Stefan
>
> To hailiang:
> IIRC, we don't use a timeout, but I think we can add one. In our design,
> there is an external heartbeat to check the primary and secondary status
> and decide when to do a checkpoint.
>
> Thanks
> Wen Congyang

Yes, we may need a timeout to help detect the unresponsive case that
cannot be caught by the external heartbeat module. I will investigate it.
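
For the timeout, a simple option on the Primary side might be a receive
timeout on the checkpoint socket, so that waiting for the Secondary's
"done" fails instead of blocking forever. A minimal sketch with plain
sockets (the function name is just illustrative; in QEMU this would go
through the migration channel code rather than a raw setsockopt):

#include <sys/socket.h>
#include <sys/time.h>

/* Arm a receive timeout on the checkpoint socket: once it expires, a
 * blocked recv() of the "done" message returns -1 with errno set to
 * EAGAIN/EWOULDBLOCK, and the Primary can treat the Secondary as
 * unresponsive and start failover. */
static int checkpoint_set_recv_timeout(int sockfd, int seconds)
{
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

    return setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

How to choose the timeout value relative to the heartbeat interval is
the part I still need to investigate.

Thanks,
Hailiang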