From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Egger, Christoph" Subject: Re: [PATCH 5/5] x86/HVM: cache emulated instruction for retry processing Date: Tue, 7 Jan 2014 10:43:43 +0100 Message-ID: <52CBCC4F.8080500@amazon.de> References: <52498FFC02000078000F8068@nat28.tlf.novell.com> <524991D902000078000F8087@nat28.tlf.novell.com> <52B16F67020000780010E8D8@nat28.tlf.novell.com> <52B18CBC020000780010EA26@nat28.tlf.novell.com> <52CBC8C10200007800110EFA@nat28.tlf.novell.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail6.bemta3.messagelabs.com ([195.245.230.39]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1W0TD5-00017j-MU for xen-devel@lists.xenproject.org; Tue, 07 Jan 2014 09:44:32 +0000 In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: "Zhang, Yang Z" , Jan Beulich Cc: Andrew Cooper , Keir Fraser , "Dong, Eddie" , "Nakajima, Jun" , xen-devel List-Id: xen-devel@lists.xenproject.org On 07.01.14 09:54, Zhang, Yang Z wrote: > Jan Beulich wrote on 2014-01-07: >>>>> On 24.12.13 at 12:29, "Zhang, Yang Z" wrote: >>> Jan Beulich wrote on 2013-12-18: >>>>>>> On 18.12.13 at 10:40, "Zhang, Yang Z" >> wrote: >>>>> Jan Beulich wrote on 2013-12-18: >>>>>>>>> On 18.12.13 at 09:36, "Zhang, Yang Z" >>>> wrote: >>>>>>> Jan Beulich wrote on 2013-09-30: >>>>>>>> Rather than re-reading the instruction bytes upon retry >>>>>>>> processing, stash away and re-use tha what we already read. >>>>>>>> That way we can be certain that the retry won't do something >>>>>>>> different from what requested the retry, getting once again >>>>>>>> closer to real hardware behavior (where what we use retries for >>>>>>>> is simply a bus operation, not involving redundant decoding of instructions). >>>>>>>> >>>>>>> >>>>>>> This patch doesn't consider the nested case. >>>>>>> For example, if the buffer saved the L2's instruction, then >>>>>>> vmexit to >>>>>>> L1 and >>>>>>> L1 may use the wrong instruction. >>>>>> >>>>>> I'm having difficulty seeing how the two could get intermixed: >>>>>> There should be, at any given point in time, at most one >>>>>> instruction being emulated. Can you please give a more elaborate >>>>>> explanation of the situation where you see a (theoretical? >>>>>> practical?) >>> problem? >>>>> >>>>> I saw this issue when booting L1 hyper-v. I added some debug info >>>>> and saw the strange phenomenon: >>>>> >>>>> (XEN) write to buffer: eip 0xfffff8800430bc80, size 16, >>>>> content:f7420c1f608488b 44000000011442c7 >>>>> (XEN) read from buffer: eip 0xfffff800002f6138, size 16, >>>>> content:f7420c1f608488b 44000000011442c7 >>>>> >>>>> From the log, we can see different eip but using the same buffer. >>>>> Since I don't know how hyper-v working, so I cannot give more >>>>> information why this happens. And I only saw it with L1 hyper-v >>>>> (Xen on Xen and KVM on Xen don't have this issue) . >>>> >>>> But in order to validate the fix is (a) needed and (b) correct, a >>>> proper understanding of the issue is necessary as the very first step. >>>> Which doesn't require knowing internals of Hyper-V, all you need is >>>> tracking of emulator state changes in Xen (which I realize is said >>>> easier than done, since you want to make sure you don't generate >>>> huge amounts of output before actually hitting the issue, making it >>>> close to >>> impossible to analyze). >>> >>> Ok. It should be an old issue and just exposed by your patch only: >>> Sometimes, L0 need to decode the L2's instruction to handle IO >>> access directly. For example, if L1 pass through (not VT-d) a device to L2 directly. >>> And L0 may get X86EMUL_RETRY when handling this IO request. At same >>> time, if there is a virtual vmexit pending (for example, an >>> interrupt pending to inject to L1) and hypervisor will switch the >>> VCPU context from L2 to L1. Now we already in L1's context, but >>> since we got a X86EMUL_RETRY just now and this means hyprevisor will >>> retry to handle the IO request later and unfortunately, the retry will happen in L1's context. >>> Without your patch, L0 just emulate an L1's instruction and >>> everything goes on. With your patch, L0 will get the instruction >>> from the buffer which is belonging to L2, and problem arises. >>> >>> So the fixing is that if there is a pending IO request, no virtual >>> vmexit/vmentry is allowed which following hardware's behavior. >>> >>> diff --git a/xen/arch/x86/hvm/vmx/vvmx.c >>> b/xen/arch/x86/hvm/vmx/vvmx.c index 41db52b..c5446a9 100644 >>> --- a/xen/arch/x86/hvm/vmx/vvmx.c >>> +++ b/xen/arch/x86/hvm/vmx/vvmx.c >>> @@ -1394,7 +1394,10 @@ void nvmx_switch_guest(void) >>> struct vcpu *v = current; >>> struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v); >>> struct cpu_user_regs *regs = guest_cpu_user_regs(); >>> + ioreq_t *p = get_ioreq(v); >>> >>> + if ( p->state != STATE_IOREQ_NONE ) >>> + return; >>> /* >>> * a softirq may interrupt us between a virtual vmentry is >>> * just handled and the true vmentry. If during this window, >> >> This change looks much more sensible; question is whether simply >> ignoring the switch request is acceptable (and I don't know the nested >> HVM code well enough to tell). Furthermore I wonder whether that's really a VMX-only issue. > > From hardware's point, an IO operation is handled atomically. So the problem > will never happen. But in Xen, an IO operation is divided into several steps. > Without nested virtualization, the VCPU is paused or looped until the IO emulation > is finished. But in nested environment, the VCPU will continue running even the > IO emulation is not finished. So my patch will check this and allow the VCPU to > continue running only there is no pending IO request. This is matching hardware's behavior. > > I guess SVM also has this problem. But since I don't know nested SVM well, > so perhaps Christoph can help to have a double check. For SVM this issue was fixed with commit d740d811925385c09553cbe6dee8e77c1d43b198 And there is a followup cleanup commit ac97fa6a21ccd395cca43890bbd0bf32e3255ebb The change in nestedsvm.c in commit d740d811925385c09553cbe6dee8e77c1d43b198 is actually not SVM specific. Move that into nhvm_interrupt_blocked() in hvm.c right before return hvm_funcs.nhvm_intr_blocked(v); and the fix applies for both SVM and VMX. Christoph