From mboxrd@z Thu Jan 1 00:00:00 1970 From: Yang Hongyang Subject: Re: [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq page only one time Date: Thu, 11 Jun 2015 20:54:00 +0800 Message-ID: <557984E8.6070304@cn.fujitsu.com> References: <1433734997-26570-1-git-send-email-yanghy@cn.fujitsu.com> <1433734997-26570-4-git-send-email-yanghy@cn.fujitsu.com> <55756468.4090500@citrix.com> <55756757.7020900@cn.fujitsu.com> <55756B45.8020708@citrix.com> <55763A8F.6040608@cn.fujitsu.com> <5576960D.5090407@citrix.com> <5577CA69.6090103@cn.fujitsu.com> <5577EAE9.1020309@citrix.com> <5577FE03.1010703@cn.fujitsu.com> <9AAE0902D5BC7E449B7C8E4E778ABCD0259432A8@AMSPEX01CL01.citrite.net> <55781782.3060502@cn.fujitsu.com> <9AAE0902D5BC7E449B7C8E4E778ABCD02594332D@AMSPEX01CL01.citrite.net> <55782188.8090306@cn.fujitsu.com> <9AAE0902D5BC7E449B7C8E4E778ABCD02594346F@AMSPEX01CL01.citrite.net> <5578E0C3.8090606@cn.fujitsu.com> <9AAE0902D5BC7E449B7C8E4E778ABCD025944B83@AMSPEX01CL01.citrite.net> <55794B5C.8010705@cn.fujitsu.com> <9AAE0902D5BC7E449B7C8E4E778ABCD025944F59@AMSPEX01CL01.citrite.net> <55796D81.8050000@cn.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; Format="flowed" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <55796D81.8050000@cn.fujitsu.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Wen Congyang , Paul Durrant , Andrew Cooper , "xen-devel@lists.xen.org" Cc: Wei Liu , Ian Campbell , "guijianfeng@cn.fujitsu.com" , "yunhong.jiang@intel.com" , Eddie Dong , "rshriram@cs.ubc.ca" , Ian Jackson List-Id: xen-devel@lists.xenproject.org On 06/11/2015 07:14 PM, Wen Congyang wrote: > On 06/11/2015 06:20 PM, Paul Durrant wrote: >>> -----Original Message----- >>> From: Wen Congyang [mailto:wency@cn.fujitsu.com] >>> Sent: 11 June 2015 09:48 >>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen-devel@lists.xen.org >>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com; >>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson >>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero ioreq >>> page only one time >>> >>> On 06/11/2015 04:32 PM, Paul Durrant wrote: >>>>> -----Original Message----- >>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com] >>>>> Sent: 11 June 2015 02:14 >>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen- >>> devel@lists.xen.org >>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com; >>>>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson >>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero >>> ioreq >>>>> page only one time >>>>> >>>>> On 06/10/2015 07:47 PM, Paul Durrant wrote: >>>>>>> -----Original Message----- >>>>>>> From: xen-devel-bounces@lists.xen.org [mailto:xen-devel- >>>>>>> bounces@lists.xen.org] On Behalf Of Wen Congyang >>>>>>> Sent: 10 June 2015 12:38 >>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen- >>>>> devel@lists.xen.org >>>>>>> Cc: Wei Liu; Ian Campbell; guijianfeng@cn.fujitsu.com; >>>>>>> yunhong.jiang@intel.com; Eddie Dong; rshriram@cs.ubc.ca; Ian Jackson >>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: zero >>>>> ioreq >>>>>>> page only one time >>>>>>> >>>>>>> On 06/10/2015 06:58 PM, Paul Durrant wrote: >>>>>>>>> -----Original Message----- >>>>>>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com] >>>>>>>>> Sent: 10 June 2015 11:55 >>>>>>>>> To: Paul Durrant; Andrew Cooper; Yang Hongyang; xen- >>>>>>> devel@lists.xen.org >>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@intel.com; Eddie Dong; >>>>>>>>> guijianfeng@cn.fujitsu.com; rshriram@cs.ubc.ca; Ian Jackson >>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: >>> zero >>>>>>> ioreq >>>>>>>>> page only one time >>>>>>>>> >>>>>>>>> On 06/10/2015 06:40 PM, Paul Durrant wrote: >>>>>>>>>>> -----Original Message----- >>>>>>>>>>> From: Wen Congyang [mailto:wency@cn.fujitsu.com] >>>>>>>>>>> Sent: 10 June 2015 10:06 >>>>>>>>>>> To: Andrew Cooper; Yang Hongyang; xen-devel@lists.xen.org; >>> Paul >>>>>>>>> Durrant >>>>>>>>>>> Cc: Wei Liu; Ian Campbell; yunhong.jiang@intel.com; Eddie Dong; >>>>>>>>>>> guijianfeng@cn.fujitsu.com; rshriram@cs.ubc.ca; Ian Jackson >>>>>>>>>>> Subject: Re: [Xen-devel] [PATCH v2 COLOPre 03/13] libxc/restore: >>>>> zero >>>>>>>>> ioreq >>>>>>>>>>> page only one time >>>>>>>>>>> >>>>>>>>>>> Cc: Paul Durrant >>>>>>>>>>> >>>>>>>>>>> On 06/10/2015 03:44 PM, Andrew Cooper wrote: >>>>>>>>>>>> On 10/06/2015 06:26, Yang Hongyang wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 06/09/2015 03:30 PM, Andrew Cooper wrote: >>>>>>>>>>>>>> On 09/06/2015 01:59, Yang Hongyang wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 06/08/2015 06:15 PM, Andrew Cooper wrote: >>>>>>>>>>>>>>>> On 08/06/15 10:58, Yang Hongyang wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 06/08/2015 05:46 PM, Andrew Cooper wrote: >>>>>>>>>>>>>>>>>> On 08/06/15 04:43, Yang Hongyang wrote: >>>>>>>>>>>>>>>>>>> ioreq page contains evtchn which will be set when we >>>>>>> resume >>>>>>>>> the >>>>>>>>>>>>>>>>>>> secondary vm the first time. The hypervisor will check if >>>>> the >>>>>>>>>>>>>>>>>>> evtchn is corrupted, so we cannot zero the ioreq page >>>>> more >>>>>>>>>>>>>>>>>>> than one time. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The ioreq->state is always STATE_IOREQ_NONE after >>> the >>>>> vm >>>>>>> is >>>>>>>>>>>>>>>>>>> suspended, so it is OK if we only zero it one time. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Signed-off-by: Yang Hongyang >>> >>>>>>>>>>>>>>>>>>> Signed-off-by: Wen congyang >>> >>>>>>>>>>>>>>>>>>> CC: Andrew Cooper >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The issue here is that we are running the restore >>> algorithm >>>>>>> over >>>>>>>>> a >>>>>>>>>>>>>>>>>> domain which has already been running in Xen for a >>> while. >>>>>>> This >>>>>>>>> is a >>>>>>>>>>>>>>>>>> brand new usecase, as far as I am aware. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Exactly. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Does the qemu process associated with this domain get >>>>>>> frozen >>>>>>>>>>>>>>>>>> while the >>>>>>>>>>>>>>>>>> secondary is being reset, or does the process get >>> destroyed >>>>>>> and >>>>>>>>>>>>>>>>>> recreated. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> What do you mean by reset? do you mean secondary is >>>>>>>>> suspended >>>>>>>>>>> at >>>>>>>>>>>>>>>>> checkpoint? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Well - at the point that the buffered records are being >>>>>>> processed, >>>>>>>>> we >>>>>>>>>>>>>>>> are in the process of resetting the state of the secondary to >>>>>>> match >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> primary. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes, at this point, the qemu process associated with this >>>>> domain is >>>>>>>>>>>>>>> frozen. >>>>>>>>>>>>>>> the suspend callback will call libxl__qmp_stop(vm_stop() in >>>>>>> qemu) >>>>>>>>> to >>>>>>>>>>>>>>> pause >>>>>>>>>>>>>>> qemu. After we processed all records, qemu will be restored >>>>> with >>>>>>>>> the >>>>>>>>>>>>>>> received >>>>>>>>>>>>>>> state, that's why we add a >>>>>>>>> libxl__qmp_restore(qemu_load_vmstate() >>>>>>>>>>> in >>>>>>>>>>>>>>> qemu) >>>>>>>>>>>>>>> api to restore qemu with received state. Currently in libxl, >>>>> qemu >>>>>>> only >>>>>>>>>>>>>>> start >>>>>>>>>>>>>>> with the received state, there's no api to load received state >>>>> while >>>>>>>>>>>>>>> qemu is >>>>>>>>>>>>>>> running for a while. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Now I consider this more, it is absolutely wrong to not zero >>> the >>>>>>> page >>>>>>>>>>>>>> here. The event channel in the page is not guaranteed to be >>> the >>>>>>>>> same >>>>>>>>>>>>>> between the primary and secondary, >>>>>>>>>>>>> >>>>>>>>>>>>> That's why we don't zero it on secondary. >>>>>>>>>>>> >>>>>>>>>>>> I think you missed my point. Apologies for the double negative. >>> It >>>>>>>>>>>> must, under all circumstances, be zeroed at this point, for safety >>>>>>>>> reasons. >>>>>>>>>>>> >>>>>>>>>>>> The page in question is subject to logdirty just like any other >>> guest >>>>>>>>>>>> pages, which means that if the guest writes to it naturally (i.e. >>> not a >>>>>>>>>>>> Xen or Qemu write, both of whom have magic mappings which >>> are >>>>>>> not >>>>>>>>>>>> subject to logdirty), it will be transmitted in the stream. As the >>>>>>>>>>>> event channel could be different, the lack of zeroing it at this >>> point >>>>>>>>>>>> means that the event channel would be wrong as opposed to >>>>> simply >>>>>>>>>>>> missing. This is a worse position to be in. >>>>>>>>>>> >>>>>>>>>>> The guest should not access this page. I am not sure if the guest >>> can >>>>>>>>>>> access the ioreq page. >>>>>>>>>>> >>>>>>>>>>> But in the exceptional case, the ioreq page is dirtied, and is copied >>> to >>>>>>>>>>> the secondary vm. The ioreq page will contain a wrong event >>>>> channel, >>>>>>> the >>>>>>>>>>> hypervisor will check it: if the event channel is wrong, the guest >>> will >>>>>>>>>>> be crashed. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> and we don't want to unexpectedly >>>>>>>>>>>>>> find a pending/in-flight ioreq. >>>>>>>>>>>>> >>>>>>>>>>>>> ioreq->state is always STATE_IOREQ_NONE after the vm is >>>>>>> suspended, >>>>>>>>>>> there >>>>>>>>>>>>> should be no pending/in-flight ioreq at checkpoint. >>>>>>>>>>>> >>>>>>>>>>>> In the common case perhaps, but we must consider the >>>>> exceptional >>>>>>>>> case. >>>>>>>>>>>> The exceptional case here is some corruption which happens to >>>>>>> appear >>>>>>>>> as >>>>>>>>>>>> an in-flight ioreq. >>>>>>>>>>> >>>>>>>>>>> If the state is STATE_IOREQ_NONE, it may be hypervisor's bug. If >>> the >>>>>>>>>>> hypervisor >>>>>>>>>>> has a bug, anything can happen. I think we should trust the >>>>> hypervisor. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Either qemu needs to take care of re-initialising the event >>>>> channels >>>>>>>>>>>>>> back to appropriate values, or Xen should tolerate the >>> channels >>>>>>>>>>>>>> disappearing. >>>>>>>>>>>> >>>>>>>>>>>> I still stand by this statement. I believe it is the only safe way of >>>>>>>>>>>> solving the issue you have discovered. >>>>>>>>>>> >>>>>>>>>>> Add a new qemu monitor command to update ioreq page? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> If you're attaching to a 'new' VM (i.e one with an updated image) >>> then >>>>> I >>>>>>>>> suspect you're going to have to destroy and re-create the ioreq >>> server >>>>> so >>>>>>>>> that the shared page gets re-populated with the correct event >>>>> channels. >>>>>>>>> Either that or you're going to have to ensure that the page is not part >>> of >>>>>>>>> restored image and sample the new one that Xen should have set >>> up. >>>>>>>>> >>>>>>>>> >>>>>>>>> I agree with it. I will try to add a new qemu monitor command(or do >>> it >>>>>>> when >>>>>>>>> updating qemu's state) to destroy and re-create it. >>>>>>>> >>>>>>>> The slightly tricky part of that is that you're going to have to cache and >>>>>>> replay all the registrations that were done on the old instance, but you >>>>> need >>>>>>> to do that in any case as it's not state that is transferred in the VM save >>>>>>> record. >>>>>>> >>>>>>> Why do we have to cache and replay all the registrations that were >>> done >>>>> on >>>>>>> the old instance? >>>>>> >>>>>> Do you not have device models that you need to continue to function? >>>>> When the ioreq server is torn down then all MMIO, port IO and PCI config >>>>> ranges that were mapped to it will disappear. >>>>> >>>>> Yes, I don't known which should be done unless I implement and test it. >>>>> >>>>> I have some questions about it: >>>>> 1. Can guest access the ioreq page? If the page is modified by the guest >>>>> unexpectedly, >>>>> what will happen? >>>> >>>> No, the guest cannot modify the pages once a non-default ioreq server is >>> active. The pages are removed from the guest P2M when it is activated, >>> which is one of the reasons for modifying QEMU to not behave as a legacy >>> default server. >>>> >>>>> 2. If the ioreq page is dirtied by the guest, it will be transfered from >>> primary >>>>> to secondary during checkpoint. The evtchn is invalid, I think the best >>>>> behavior >>>>> is that: make the guest crashed, not continue to run. >>>> >>>> As I said, the pages are not in the P2M if the server is active so they will not >>> be transferred as part of the VM state. However, this presents a problem; at >>> the far end, the emulator will not be able to hook into the guest. So, when >>> the source domain is paused, the ioreq server needs to be torn down (so >>> that its pages are re-inserted into the P2M and marked dirty for transfer). >>> This is what happens in a normal migration. One extra problem you have is >>> that the source domain is not then killed, it is resumed along with the >>> emulator. Thus, on resume, the emulator needs to create a new ioreq server >>> and re-register all its device models with that new server. >>>> I don't know the detail of what you do at the far end, but if you always start >>> a new emulator instance using the QEMU save record then you should be >>> fine (just like with a normal migration). >>> >>> I don't find the codes where the ioreq server is torn down when the source >>> domain is paused. Which function? >> >> Sorry, I overstated that. By 'torn down' I meant disabled. The function that does it is: >> >> static void xen_hvm_change_state_handler(void *opaque, int running, >> RunState rstate) >> { >> XenIOState *state = opaque; >> >> if (running) { >> xen_main_loop_prepare(state); >> } >> >> xen_set_ioreq_server_state(xen_xc, xen_domid, >> state->ioservid, >> (rstate == RUN_STATE_RUNNING)); >> } >> > > Yes, I see it now. > > If the ioreq page is cleared, can xen_main_loop_prepare() put the correct > evtchn in the ioreq page? If so, I think COLO can work without this patch. > > In the hypervisor, the ioreq page is cleared before re-inserted into the P2M. > So the far end always gets the zeroed ioreq page. > > IIRC, there is only default ioreq server when I wrote this patch. In that case > we don't have xen_main_loop_prepare(). It is OK that COLO cannot work with > an older version qemu. > > Yang, can you test COLO without this patch? I guess it can work, and we can drop > this patch now. Ok, this really is a historical patch... > > Thanks > Wen Congyang > >>> >>> In our implementation, we don't start a new emulator. The codes can work, >>> but some bugs may be not triggered. >>> >> >> How do you reconcile the incoming QEMU save record with the running emulator state? >> >> Paul >> >>> Thanks >>> Wen Congyang >>> >>>> >>>> Paul >>>> >>>>> >>>>> Thanks >>>>> Wen Congyang >>>>> >>>>>> >>>>>> Paul >>>>>> >>>>>>> We will set to the guest to a new state, the old state should be >>> dropped. >>>>>>> >>>>>>> Thanks >>>>>>> Wen Congyang >>>>>>> >>>>>>>> >>>>>>>> Paul >>>>>>>> >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Wen Congyang >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Paul >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Wen Congyang >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ~Andrew >>>>>>>>>>>> . >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> . >>>>>>>>>> >>>>>>>> >>>>>>>> . >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Xen-devel mailing list >>>>>>> Xen-devel@lists.xen.org >>>>>>> http://lists.xen.org/xen-devel >>>>>> >>>>>> _______________________________________________ >>>>>> Xen-devel mailing list >>>>>> Xen-devel@lists.xen.org >>>>>> http://lists.xen.org/xen-devel >>>>>> . >>>>>> >>>> >>>> . >>>> >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel >> . >> > > . > -- Thanks, Yang.