From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yang Hongyang <yanghy@cn.fujitsu.com>
Subject: Re: [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq
 page only one time
Date: Thu, 11 Jun 2015 20:54:00 +0800
Message-ID: <557984E8.6070304@cn.fujitsu.com>
References: <1433734997-26570-1-git-send-email-yanghy@cn.fujitsu.com>	<1433734997-26570-4-git-send-email-yanghy@cn.fujitsu.com>	<55756468.4090500@citrix.com>
	<55756757.7020900@cn.fujitsu.com>	<55756B45.8020708@citrix.com>
	<55763A8F.6040608@cn.fujitsu.com>	<5576960D.5090407@citrix.com>
	<5577CA69.6090103@cn.fujitsu.com>	<5577EAE9.1020309@citrix.com>
	<5577FE03.1010703@cn.fujitsu.com>	<9AAE0902D5BC7E449B7C8E4E778ABCD0259432A8@AMSPEX01CL01.citrite.net>	<55781782.3060502@cn.fujitsu.com>	<9AAE0902D5BC7E449B7C8E4E778ABCD02594332D@AMSPEX01CL01.citrite.net>	<55782188.8090306@cn.fujitsu.com>	<9AAE0902D5BC7E449B7C8E4E778ABCD02594346F@AMSPEX01CL01.citrite.net>	<5578E0C3.8090606@cn.fujitsu.com>	<9AAE0902D5BC7E449B7C8E4E778ABCD025944B83@AMSPEX01CL01.citrite.net>	<55794B5C.8010705@cn.fujitsu.com>
	<9AAE0902D5BC7E449B7C8E4E778ABCD025944F59@AMSPEX01CL01.citrite.net>
	<55796D81.8050000@cn.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <55796D81.8050000@cn.fujitsu.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wen Congyang <wency@cn.fujitsu.com>, Paul Durrant <Paul.Durrant@citrix.com>, Andrew Cooper <Andrew.Cooper3@citrix.com>, "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>, Ian Campbell <Ian.Campbell@citrix.com>, "guijianfeng@cn.fujitsu.com" <guijianfeng@cn.fujitsu.com>, "yunhong.jiang@intel.com" <yunhong.jiang@intel.com>, Eddie Dong <eddie.dong@intel.com>, "rshriram@cs.ubc.ca" <rshriram@cs.ubc.ca>, Ian Jackson <Ian.Jackson@citrix.com>
List-Id: xen-devel@lists.xenproject.org


On 06/11/2015 07:14 PM, Wen Congyang wrote:
> On 06/11/2015 06:20 PM, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com]
>>> Sent: 11 June 2015 09:48
>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-devel@lists.xen.org
>>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com;
>>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson
>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq
>>> page only one time
>>>
>>> On 06/11/2015 04:32 PM, Paul Durrant wrote:
>>>>> -----Original Message-----
>>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com]
>>>>> Sent: 11 June 2015 02:14
>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>>> devel@lists.xen.org
>>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com;
>>>>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson
>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero
>>> ioreq
>>>>> page only one time
>>>>>
>>>>> On 06/10/2015 07:47 PM, Paul Durrant wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel-
>>>>>>> bounces@lists.xen.org] On Behalf Of Wen Congyang
>>>>>>> Sent: 10 June 2015 12:38
>>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>>>>> devel@lists.xen.org
>>>>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com;
>>>>>>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson
>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero
>>>>> ioreq
>>>>>>> page only one time
>>>>>>>
>>>>>>> On 06/10/2015 06:58 PM, Paul Durrant wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com]
>>>>>>>>> Sent: 10 June 2015 11:55
>>>>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-
>>>>>>> devel@lists.xen.org
>>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@intel.com; Eddie Dong;
>>>>>>>>> guijianfeng@cn.fujitsu.com; rshriram@cs.ubc.ca; Ian Jackson
>>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore:
>>> zero
>>>>>>> ioreq
>>>>>>>>> page only one time
>>>>>>>>>
>>>>>>>>> On 06/10/2015 06:40 PM, Paul Durrant wrote:
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com]
>>>>>>>>>>> Sent: 10 June 2015 10:06
>>>>>>>>>>> To: Andrew Cooper; Yang Hongyang; xen-devel@lists.xen.org;
>>> Paul
>>>>>>>>> Durrant
>>>>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@intel.com; Eddie Dong;
>>>>>>>>>>> guijianfeng@cn.fujitsu.com; rshriram@cs.ubc.ca; Ian Jackson
>>>>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore:
>>>>> zero
>>>>>>>>> ioreq
>>>>>>>>>>> page only one time
>>>>>>>>>>>
>>>>>>>>>>> Cc: Paul Durrant
>>>>>>>>>>>
>>>>>>>>>>> On 06/10/2015 03:44 PM, Andrew Cooper wrote:
>>>>>>>>>>>> On 10/06/2015 06:26, Yang Hongyang wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 06/09/2015 03:30 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>> On 09/06/2015 01:59, Yang Hongyang wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 06/08/2015 06:15 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>>> On 08/06/15 10:58, Yang Hongyang wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 06/08/2015 05:46 PM, Andrew Cooper wrote:
>>>>>>>>>>>>>>>>>> On 08/06/15 04:43, Yang Hongyang wrote:
>>>>>>>>>>>>>>>>>>> ioreq page contains evtchn which will be set when we
>>>>>>> resume
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> secondary vm the first time. The hypervisor will check if
>>>>> the
>>>>>>>>>>>>>>>>>>> evtchn is corrupted, so we cannot zero the ioreq page
>>>>> more
>>>>>>>>>>>>>>>>>>> than one time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The ioreq->state is always STATE_IOREQ_NONE after
>>> the
>>>>> vm
>>>>>>> is
>>>>>>>>>>>>>>>>>>> suspended, so it is OK if we only zero it one time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Yang Hongyang
>>> <yanghy@cn.fujitsu.com>
>>>>>>>>>>>>>>>>>>> Signed-off-by: Wen congyang
>>> <wency@cn.fujitsu.com>
>>>>>>>>>>>>>>>>>>> CC: Andrew Cooper <andrew.cooper3@citrix.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The issue here is that we are running the restore
>>> algorithm
>>>>>>> over
>>>>>>>>> a
>>>>>>>>>>>>>>>>>> domain which has already been running in Xen for a
>>> while.
>>>>>>> This
>>>>>>>>> is a
>>>>>>>>>>>>>>>>>> brand new usecase, as far as I am aware.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Exactly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Does the qemu process associated with this domain get
>>>>>>> frozen
>>>>>>>>>>>>>>>>>> while the
>>>>>>>>>>>>>>>>>> secondary is being reset, or does the process get
>>> destroyed
>>>>>>> and
>>>>>>>>>>>>>>>>>> recreated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What do you mean by reset? do you mean secondary is
>>>>>>>>> suspended
>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> checkpoint?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Well - at the point that the buffered records are being
>>>>>>> processed,
>>>>>>>>> we
>>>>>>>>>>>>>>>> are in the process of resetting the state of the secondary to
>>>>>>> match
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> primary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, at this point, the qemu process associated with this
>>>>> domain is
>>>>>>>>>>>>>>> frozen.
>>>>>>>>>>>>>>> the suspend callback will call libxl__qmp_stop(vm_stop() in
>>>>>>> qemu)
>>>>>>>>> to
>>>>>>>>>>>>>>> pause
>>>>>>>>>>>>>>> qemu. After we processed all records, qemu will be restored
>>>>> with
>>>>>>>>> the
>>>>>>>>>>>>>>> received
>>>>>>>>>>>>>>> state, that's why we add a
>>>>>>>>> libxl__qmp_restore(qemu_load_vmstate()
>>>>>>>>>>> in
>>>>>>>>>>>>>>> qemu)
>>>>>>>>>>>>>>> api to restore qemu with received state. Currently in libxl,
>>>>> qemu
>>>>>>> only
>>>>>>>>>>>>>>> start
>>>>>>>>>>>>>>> with the received state, there's no api to load received state
>>>>> while
>>>>>>>>>>>>>>> qemu is
>>>>>>>>>>>>>>> running for a while.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now I consider this more, it is absolutely wrong to not zero
>>> the
>>>>>>> page
>>>>>>>>>>>>>> here.  The event channel in the page is not guaranteed to be
>>> the
>>>>>>>>> same
>>>>>>>>>>>>>> between the primary and secondary,
>>>>>>>>>>>>>
>>>>>>>>>>>>> That's why we don't zero it on secondary.
>>>>>>>>>>>>
>>>>>>>>>>>> I think you missed my point.  Apologies for the double negative.
>>> It
>>>>>>>>>>>> must, under all circumstances, be zeroed at this point, for safety
>>>>>>>>> reasons.
>>>>>>>>>>>>
>>>>>>>>>>>> The page in question is subject to logdirty just like any other
>>> guest
>>>>>>>>>>>> pages, which means that if the guest writes to it naturally (i.e.
>>> not a
>>>>>>>>>>>> Xen or Qemu write, both of whom have magic mappings which
>>> are
>>>>>>> not
>>>>>>>>>>>> subject to logdirty), it will be transmitted in the stream.  As the
>>>>>>>>>>>> event channel could be different, the lack of zeroing it at this
>>> point
>>>>>>>>>>>> means that the event channel would be wrong as opposed to
>>>>> simply
>>>>>>>>>>>> missing.  This is a worse position to be in.
>>>>>>>>>>>
>>>>>>>>>>> The guest should not access this page. I am not sure if the guest
>>> can
>>>>>>>>>>> access the ioreq page.
>>>>>>>>>>>
>>>>>>>>>>> But in the exceptional case, the ioreq page is dirtied, and is copied
>>> to
>>>>>>>>>>> the secondary vm. The ioreq page will contain a wrong event
>>>>> channel,
>>>>>>> the
>>>>>>>>>>> hypervisor will check it: if the event channel is wrong, the guest
>>> will
>>>>>>>>>>> be crashed.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and we don't want to unexpectedly
>>>>>>>>>>>>>> find a pending/in-flight ioreq.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ioreq->state is always STATE_IOREQ_NONE after the vm is
>>>>>>> suspended,
>>>>>>>>>>> there
>>>>>>>>>>>>> should be no pending/in-flight ioreq at checkpoint.
>>>>>>>>>>>>
>>>>>>>>>>>> In the common case perhaps, but we must consider the
>>>>> exceptional
>>>>>>>>> case.
>>>>>>>>>>>> The exceptional case here is some corruption which happens to
>>>>>>> appear
>>>>>>>>> as
>>>>>>>>>>>> an in-flight ioreq.
>>>>>>>>>>>
>>>>>>>>>>> If the state is STATE_IOREQ_NONE, it may be hypervisor's bug. If
>>> the
>>>>>>>>>>> hypervisor
>>>>>>>>>>> has a bug, anything can happen. I think we should trust the
>>>>> hypervisor.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Either qemu needs to take care of re-initialising the event
>>>>> channels
>>>>>>>>>>>>>> back to appropriate values, or Xen should tolerate the
>>> channels
>>>>>>>>>>>>>> disappearing.
>>>>>>>>>>>>
>>>>>>>>>>>> I still stand by this statement.  I believe it is the only safe way of
>>>>>>>>>>>> solving the issue you have discovered.
>>>>>>>>>>>
>>>>>>>>>>> Add a new qemu monitor command to update ioreq page?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If you're attaching to a 'new' VM (i.e one with an updated image)
>>> then
>>>>> I
>>>>>>>>> suspect you're going to have to destroy and re-create the ioreq
>>> server
>>>>> so
>>>>>>>>> that the shared page gets re-populated with the correct event
>>>>> channels.
>>>>>>>>> Either that or you're going to have to ensure that the page is not part
>>> of
>>>>>>>>> restored image and sample the new one that Xen should have set
>>> up.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I agree with it. I will try to add a new qemu monitor command(or do
>>> it
>>>>>>> when
>>>>>>>>> updating qemu's state) to destroy and re-create it.
>>>>>>>>
>>>>>>>> The slightly tricky part of that is that you're going to have to cache and
>>>>>>> replay all the registrations that were done on the old instance, but you
>>>>> need
>>>>>>> to do that in any case as it's not state that is transferred in the VM save
>>>>>>> record.
>>>>>>>
>>>>>>> Why do we have to cache and replay all the registrations that were
>>> done
>>>>> on
>>>>>>> the old instance?
>>>>>>
>>>>>> Do you not have device models that you need to continue to function?
>>>>> When the ioreq server is torn down then all MMIO, port IO and PCI config
>>>>> ranges that were mapped to it will disappear.
>>>>>
>>>>> Yes, I don't known which should be done unless I implement and test it.
>>>>>
>>>>> I have some questions about it:
>>>>> 1. Can guest access the ioreq page? If the page is modified by the guest
>>>>> unexpectedly,
>>>>>     what will happen?
>>>>
>>>> No, the guest cannot modify the pages once a non-default ioreq server is
>>> active. The pages are removed from the guest P2M when it is activated,
>>> which is one of the reasons for modifying QEMU to not behave as a legacy
>>> default server.
>>>>
>>>>> 2. If the ioreq page is dirtied by the guest, it will be transfered from
>>> primary
>>>>>     to secondary during checkpoint. The evtchn is invalid, I think the best
>>>>> behavior
>>>>>     is that: make the guest crashed, not continue to run.
>>>>
>>>> As I said, the pages are not in the P2M if the server is active so they will not
>>> be transferred as part of the VM state. However, this presents a problem; at
>>> the far end, the emulator will not be able to hook into the guest. So, when
>>> the source domain is paused, the ioreq server needs to be torn down (so
>>> that its pages are re-inserted into the P2M and marked dirty for transfer).
>>> This is what happens in a normal migration. One extra problem you have is
>>> that the source domain is not then killed, it is resumed along with the
>>> emulator. Thus, on resume, the emulator needs to create a new ioreq server
>>> and re-register all its device models with that new server.
>>>> I don't know the detail of what you do at the far end, but if you always start
>>> a new emulator instance using the QEMU save record then you should be
>>> fine (just like with a normal migration).
>>>
>>> I don't find the codes where the ioreq server is torn down when the source
>>> domain is paused. Which function?
>>
>> Sorry, I overstated that. By 'torn down' I meant disabled. The function that does it is:
>>
>> static void xen_hvm_change_state_handler(void *opaque, int running,
>>                                           RunState rstate)
>> {
>>      XenIOState *state = opaque;
>>
>>      if (running) {
>>          xen_main_loop_prepare(state);
>>      }
>>
>>      xen_set_ioreq_server_state(xen_xc, xen_domid,
>>                                 state->ioservid,
>>                                 (rstate == RUN_STATE_RUNNING));
>> }
>>
>
> Yes, I see it now.
>
> If the ioreq page is cleared, can xen_main_loop_prepare() put the correct
> evtchn in the ioreq page? If so, I think COLO can work without this patch.
>
> In the hypervisor, the ioreq page is cleared before re-inserted into the P2M.
> So the far end always gets the zeroed ioreq page.
>
> IIRC, there is only default ioreq server when I wrote this patch. In that case
> we don't have xen_main_loop_prepare(). It is OK that COLO cannot work with
> an older version qemu.
>
> Yang, can you test COLO without this patch? I guess it can work, and we can drop
> this patch now.

Ok, this really is a historical patch...

>
> Thanks
> Wen Congyang
>
>>>
>>> In our implementation, we don't start a new emulator. The codes can work,
>>> but some bugs may be not triggered.
>>>
>>
>> How do you reconcile the incoming QEMU save record with the running emulator state?
>>
>>    Paul
>>
>>> Thanks
>>> Wen Congyang
>>>
>>>>
>>>>    Paul
>>>>
>>>>>
>>>>> Thanks
>>>>> Wen Congyang
>>>>>
>>>>>>
>>>>>>    Paul
>>>>>>
>>>>>>> We will set to the guest to a new state, the old state should be
>>> dropped.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wen Congyang
>>>>>>>
>>>>>>>>
>>>>>>>>    Paul
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wen Congyang
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    Paul
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Wen Congyang
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ~Andrew
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xen.org
>>>>>>> http://lists.xen.org/xen-devel
>>>>>>
>>>>>> _______________________________________________
>>>>>> Xen-devel mailing list
>>>>>> Xen-devel@lists.xen.org
>>>>>> http://lists.xen.org/xen-devel
>>>>>> .
>>>>>>
>>>>
>>>> .
>>>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>> .
>>
>
> .
>

-- 
Thanks,
Yang.