From mboxrd@z Thu Jan 1 00:00:00 1970 From: Paolo Bonzini Subject: Re: [PATCH] KVM: SVM: fix cr8 intercept window Date: Wed, 12 Mar 2014 18:20:01 +0100 Message-ID: <53209741.3060204@redhat.com> References: <1394561478-8815-1-git-send-email-rkrcmar@redhat.com> <20140312010513.GA8131@amt.cnet> <20140312104047.GA7488@potion.brq.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, gleb@kernel.org, stable@vger.kernel.org To: =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , Marcelo Tosatti Return-path: In-Reply-To: <20140312104047.GA7488@potion.brq.redhat.com> Sender: stable-owner@vger.kernel.org List-Id: kvm.vger.kernel.org Il 12/03/2014 11:40, Radim Kr=C4=8Dm=C3=A1=C5=99 ha scritto: > 2014-03-11 22:05-0300, Marcelo Tosatti: >> On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Kr=C4=8Dm=C3=A1=C5=99= wrote: >>> We always disable cr8 intercept in its handler, but only re-enable = it >>> if handling KVM_REQ_EVENT, so there can be a window where we do not >>> intercept cr8 writes, which allows an interrupt to disrupt a higher >>> priority task. >>> >>> Fix this by disabling intercepts in the same function that re-enabl= es >>> them when needed. This fixes BSOD in Windows 2008. >>> >>> Cc: >>> Signed-off-by: Radim Kr=C4=8Dm=C3=A1=C5=99 >>> --- >>> arch/x86/kvm/svm.c | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c >>> index 64d9bb9..f676c18 100644 >>> --- a/arch/x86/kvm/svm.c >>> +++ b/arch/x86/kvm/svm.c >>> @@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcp= u_svm *svm) >>> u8 cr8_prev =3D kvm_get_cr8(&svm->vcpu); >>> /* instruction emulation calls kvm_set_cr8() */ >>> r =3D cr_interception(svm); >>> - if (irqchip_in_kernel(svm->vcpu.kvm)) { >>> - clr_cr_intercept(svm, INTERCEPT_CR8_WRITE); >>> + if (irqchip_in_kernel(svm->vcpu.kvm)) >>> return r; >>> - } I think that the old code here makes little sense, and for two reasons: 1) There are other ways to change the TPR, and the condition for=20 setting/clearing the CR8 intercept should be the same for all of them.=20 Now CR8 is the really optimized one, but there is no reason to treat=20 them differently and it just complicates understanding the code. So it is a good thing that your patch moves the clearing of the CR8=20 write intercept in a generic place (the setting of the intercept is=20 already generic). Doesn't say much about the correctness of the patch;= =20 it would be just an optimization. But at least the old code is the=20 "smelly" one. 2) Unconditionally disabling the CR8 intercept is definitely wrong.=20 What matters is the change in the PPR; if the processor priority is the= =20 same as before, or higher, absolutely nothing has changed from the poin= t=20 of view of interrupt delivery; if we had an interrupt in the IRR,=20 waiting to be delivered, we still have it, and we should keep the CR8=20 intercept enabled. Your patch does the right thing by virtue of apic_update_ppr setting=20 KVM_REQ_EVENT (which ultimately calls update_cr8_intercept) exactly if=20 the PPR has been lowered. The call chain is=20 kvm_set_cr8->kvm_lapic_set_tpr->apic_set_tpr->apic_update_ppr. At the end of this email I'll show an example of why this actually is=20 relatively common on Windows guests. So, IMO there is no doubt that the change is semantically correct. The= =20 next question then is whether it undoes the V_TPR optimization. You ca= n=20 prove it by a sort of induction; consider a sequences of events that=20 start and end with the same IRR, assume CR8 is not intercepted at the=20 beginning, and prove that CR8 is still not intercepted afterwards. We can assume that all changes to the TPR are balanced and properly=20 nested (except you can go low->med->high->low). The simple sequences are: 1) changes in TPR with no interrupts in the middle; remember that=20 Windows doesn't really ever disable/enable preemption or interrupt flag= s=20 like Linux does. It only modifies the TPR ("raise/lower the IRQL", the= y=20 call it). We're assuming that the CR8 intercept is initially disabled,= =20 so a raised-IRQL section of the code that doesn't cause other vmexits=20 will obviously run at full speed. Not much to see here. 2) delivery of an unmasked interrupt (with priority P) and subsequent=20 EOI. Changes to TPR don't really matter until EOI, because they are=20 always to priority >=3D P and they are balanced. So we ignore them. To summarize: an interrupt with priority P is going to be delivered, th= e=20 VCPU is running at TPR <=3D P, interrupts are allowed, and the CR8=20 intercept is disabled. The interrupt is injected via apic_set_irr/kvm_make_request, and this=20 causes a call to update_cr8_intercept. If TPR < P, the intercept will=20 remain disabled. When the EOI is sent, we get another event and anothe= r=20 call to update_cr8_intercept; again, the intercept stays cleared becaus= e=20 IRR =3D=3D -1. If TPR =3D=3D P, the intercept is set while the interrupt handler runs,= but=20 it is still disabled at the end of the interrupt. Looks like another=20 bugfix; before your patch, it would remain enabled, which is useless.=20 The TPR =3D=3D P case is actually interesting for Windows, more below. The complicated sequences are: 3) a change in the TPR, where an interrupt is masked while the=20 high-priority task runs. The interrupt is what will cause the intercep= t=20 to be set. As soon as the TPR is restored, the interrupt will be=20 delivered and the intercept cleared (TPR < IRR). We fall back to case = 2=20 above. 4) interrupts that are received while interrupts are not allowed. Here= =20 KVM does not inject the interrupt; it requests the interrupt window and= =20 clears the intercept. Clearing the intercept is okay, because no=20 interrupt can be delivered anyway and TPR changes are moot until the=20 interrupt window opens. Once it opens, sync_cr8_to_lapic will call=20 kvm_set_cr8 and interrupt_window_interception will set KVM_REQ_EVENT.=20 Any intervening change to the TPR is handled fine: if it got high again= ,=20 update_cr8_intercept will set the CR8 intercept and the "loop" restarts= ;=20 if it is still low, inject_pending_event will queue the interrupt as in= =20 case 3. With all these covered, is the base case true? That is, will the CR8=20 intercept _ever_ be disabled? Yes :) because kvm_vcpu_reset sets=20 KVM_REQ_EVENT and this calls update_cr8_intercept. Do you agree with the above analysis? If so, let's look how it can be=20 used to reply to Marcelo's question. >> Shouldnt IRR be injected if TPR < IRR ? (via KVM_REQ_EVENT). This is "the complicated case" above. Say initially TPR =3D 15. >> 1) IRR has interrupt 10. Interrupt 10 is received and the CR8 intercept is set. Because PPR =3D= =20 15, the interrupt is not yet in the ISR, only in the IRR. >> 2) TPR now 9 due to CR8 write. When the TPR becomes 9, the PPR is lowered to 9 too, thus=20 apic_update_ppr sets KVM_REQ_EVENT. >> 3) 10 should be injected. Event processing calls inject_pending_event, which either queues the=20 interrupt or requests the interrupt window. In my writing above, this=20 corresponds to cases 3 and 4 respectively. As usual, correct me if I'm wrong. >> Also not clearing the intercept can cause continuous CR8 writes to >> exit until KVM_REQ_EVENT ? > > It is intended, I suppose this is because we run with V_INTR_MASKING,= so > writes to CR8 only affect V_TPR register; guest then raises it once m= ore > and APIC incorrectly gives us low priority interrupt. Yes, but the extra exits should not be a problem. Let's see what could happen with Windows. Windows uses "deferred=20 procedure calls" (also known as DPCs; basically they're non-preemptible= =20 "bottom halves") extensively. Almost all ISRs will do the bulk of the=20 work in a DPC, including the timer interrupt. DPC processing is=20 requested with a priority-2 IPI, hence Windows raises the TPR to 2 in=20 order to disable preemption. This is the gory detail of why, as mentioned above, Windows writes to=20 CR8 in order to disable preemption. Since spinlocks are not=20 preemptible, taking a spinlock does the same. In both cases the old CR= 8=20 value is saved and then restored (when exiting the non-preemptible=20 section, or when releasing the spinlock). So, if you have N nested spinlocks and a timer interrupt fires, you wil= l=20 have up to 2N-1 CR8 write exits. Example with 2 nested spinlocks: IRR TPR PPR intercept initial {} 0 0 disabled take spinlock #1 {} 2 2 disabled >>> timer interrupt injected {13} 2 13 disabled >>> timer ISR schedules DPCs {2,13} 2 13 disabled >>> timer interrupt EOI {2} 2 2 enabled take spinlock #2 (vmexit!) {2} 2 2 enabled release spinlock #2 (vmexit!) {2} 2 2 enabled release spinlock #1 (vmexit!) {2} 0 0 enabled >>> DPC interrupt injected {2} 0 2 disabled >>> set CR8 =3D 2 {2} 2 2 enabled >>> DPC interrupt EOI {} 2 2 disabled >>> DPC queue processed {} 2 2 disabled >>> set CR8 =3D 0 {} 0 0 disabled Here N=3D2 so you have 3 extra vmexits. Note that: * line "take spinlock #2" writes 2 to CR8, but is really a no-op. But=20 without this patch, the CR8 intercept would have been disabled! * line "release spinlock #1" writes 0 to CR8. Without this patch, the=20 PPR would not have been updated. The DPC interrupt would have been del= ayed. * line "DPC interrupt EOI" is the other bugfix/optimization I mentioned= =20 above for TPR=3D=3DP. Without this patch, the CR8 interrupt would have= =20 stayed enabled, and KVM would have taken an exit when setting CR8 to 0. So in this scenario with N nested spinlocks there will be indeed 2N-2=20 more exits than before. Windows has an extra pair of lock/unlock primitives that do not touch=20 CR8, that you can use when you know you'r already at TPR=3D2, so we can= =20 expect the common case to be N=3D1. And N=3D1 adds up to 0 extra vmexi= ts. I'm applying the patch to kvm/next. As a follow up, I would consider the following changes: 1) stop calling update_cr8_intercept directly. There is a call in=20 kvm_vcpu_ioctl_set_lapic, which is useless because=20 kvm_apic_post_state_restore sets KVM_REQ_EVENT. Another in=20 kvm_arch_vcpu_ioctl_set_sregs, which is how userspace modifies CR8.=20 Also useless, because the function calls kvm_set_cr8 and ultimately=20 apic_update_ppr. 2) stop setting the CR8_WRITE intercept in init_vmcb (called by=20 svm_vcpu_reset, called by kvm_vcpu_reset), since it's always undone=20 immediately afterwards by the processing of KVM_REQ_EVENT. Paolo