From mboxrd@z Thu Jan 1 00:00:00 1970 From: Liran Alon Subject: Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending Date: Tue, 28 Nov 2017 21:45:35 +0200 Message-ID: <5A1DBCDF.80406@ORACLE.COM> References: <1511278211-12257-1-git-send-email-liran.alon@oracle.com> <1511278211-12257-8-git-send-email-liran.alon@oracle.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: Paolo Bonzini , =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= , kvm list , Wanpeng Li , Idan Brown , Konrad Rzeszutek Wilk To: Jim Mattson Return-path: Received: from userp1040.oracle.com ([156.151.31.81]:49706 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752338AbdK1Tpr (ORCPT ); Tue, 28 Nov 2017 14:45:47 -0500 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On 28/11/17 20:26, Jim Mattson wrote: > One other thing to note is that there is no place to put the > interrupted IDT vectoring information for L2 virtual interrupt 210 > when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt > 252. I see that vmcs12_save_pending_event() shoves this information > into the VMCS12 IDT-vectoring information field, but it should not do > so when the VMCS12 exit reason is "exception or non-maskable > interrupt." > > From the SDM, volume 3, section 27.2.3: > > Section 24.9.3 defined fields containing information for VM exits that > occur while delivering an event through the IDT and as a result of any > of the following cases: > o A fault occurs during event delivery and causes a VM exit (because > the bit associated with the fault is set to 1 in the exception > bitmap). > o A task switch is invoked through a task gate in the IDT. The VM > exit occurs due to the task switch only after the initial checks of > the task switch pass (see Section 25.4.2). > o Event delivery causes an APIC-access VM exit (see Section 29.4). > o An EPT violation, EPT misconfiguration, or page-modification > log-full event that occurs during event delivery. > > For all other exit reasons (including "exception or non-maskable > interrupt"), the valid bit of the IDT-vectoring info field should be > clear. Interesting. In the case described in the commit-message of this patch this is not relevant as because kvm_event_needs_reinjection()==true, vmx_check_nested_events() won't really do an exit from L2 to L1 on EXTERNAL_INTERRUPT in this case. But rather it will return -EBUSY which will cause vcpu_enter_guest() to request an "immediate-exit" (via self-IPI. See code in vcpu_enter_guest() regarding req_immediate_exit variable for more info). This will make guest complete the event-injection of interrupt 210 but immediately afterwards exit to L0 which will then exit from L2 to L1 on EXTERNAL_INTERRUPT because of L1 interrupt 252. However, you made me notice there is another issue here in vmx_check_nested_events() handling. If the event-delivery would have caused a fault that needs to cause exit from L2 to L1, then currently vmx_check_nested_events() will see that kvm_event_needs_reinjection()==true and therefore not exit from L2 to L1 even though it should. So this is just another other bug to fix I think... Nice catch. -Liran > > On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson wrote: >> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon wrote: >>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with >>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to >>> reinject before next resume of L2. >>> >>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU >>> to the L1 vCPU which currently runs L2 and exited to L0. >>> >>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(), >>> it will note that a previous event was re-injected to L2 (by >>> IDT-vectoring-info) and therefore won't check if there are pending L1 >>> events which require exit from L2 to L1. Thus, L0 enters L2 without >>> immediate VMExit even though there are pending L1 events! >>> >>> This commit fixes the issue by making sure to check for L1 pending >>> events even if a previous event was reinjected to L2 and bailing out >>> from inject_pending_event() before evaluating a new pending event in >>> case an event was already reinjected. >>> >>> The bug was observed by the following setup: >>> * L0 is a 64CPU machine which runs KVM. >>> * L1 is a 16CPU machine which runs KVM. >>> * L0 & L1 runs with APICv disabled. >>> (Also reproduced with APICv enabled but easier to analyze below info >>> with APICv disabled) >>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. >>> During L2 boot, L1 hangs completely and analyzing the hang reveals that >>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI >>> that he has sent for another L1 vCPU. And all other L1 vCPUs are >>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck >>> forever (as L1 runs with kernel-preemption disabled). >>> >>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following >>> series of events: >>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 >>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 >> >> This looks like an EPT violation for writing a stack page that wasn't >> present in the vmcs02 shadow EPT tables. I assume that L0 filled in >> the missing shadow EPT table entry (or entries) and dismissed this EPT >> violation itself rather than forwarding it to L1, because L1's EPT >> tables have a valid entry for this L2 guest-physical address. >> >>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f >>> vec 252 (Fixed|edge) >> >> This is the source of the bug, right here. An interrupt can only be >> accepted at an instruction boundary, and the virtual CPU is in the >> midst of IDT-vectoring, which is not at an instruction boundary. The >> EPT violation from L2 to L0 is an L0 artifact that should be >> completely invisible to L1/L2. Interrupt 252 should remain pending >> until we are at the first instruction of the IDT handler for interrupt >> 210. >> >> To be more precise, if interrupt 210 was delivered via VM-entry event >> injection on L1 VM-entry to L2, then we need to: >> a) complete the event injection (i.e. vector through the L2 IDT for vector 210), >> b) handle virtual VMX-preemption timer expiration during VM-entry (if any), >> c) handle NMI-window exiting events (if any), >> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the >> L2 IDT, depending on the NMI-exiting control), >> e) handle interrupt-window exiting events, >> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by >> vectoring through the L2 IDT, depending on the external-iinterrupt >> exiting control). >> ... >> >> Only when we get to step (f) can we accept a new IRQ from L0's local APIC. >> >>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 >>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION >>> rip 0xffffe00069202690 info 83 0 >>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: >>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 >>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: >>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 >>> ext_int: 0x00000000 ext_int_err: 0x00000000 >>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 >>> >>> Which can be analyzed as follows: >>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. >>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject >>> a pending-interrupt of 0xd2. >>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination >>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. >>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which >>> notes that interrupt 0xd2 was reinjected and therefore calls >>> vmx_inject_irq() and returns. Without checking for pending L1 events! >>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() >>> before calling inject_pending_event(). >>> (4) L0 resumes L2 without immediate-exit even though there is a pending >>> L1 event (The IPI pending in LAPIC's IRR). >>> >>> We have already reached the buggy scenario but events could be >>> furthered analyzed: >>> (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during >>> event-delivery. >>> (7) L0 decides to forward the VMExit to L1 for further handling. >>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the >>> LAPIC's IRR is not examined and therefore the IPI is still not delivered >>> into L1! >>> >>> Signed-off-by: Liran Alon >>> Reviewed-by: Nikita Leshenko >>> Signed-off-by: Konrad Rzeszutek Wilk >>> --- >>> arch/x86/kvm/x86.c | 33 ++++++++++++++++++++------------- >>> 1 file changed, 20 insertions(+), 13 deletions(-) >>> >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >>> index a4b5a013064b..63359ab0e798 100644 >>> --- a/arch/x86/kvm/x86.c >>> +++ b/arch/x86/kvm/x86.c >>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> int r; >>> >>> /* try to reinject previous events if any */ >>> - if (vcpu->arch.exception.injected) { >>> - kvm_x86_ops->queue_exception(vcpu); >>> - return 0; >>> - } >>> >>> + if (vcpu->arch.exception.injected) >>> + kvm_x86_ops->queue_exception(vcpu); >>> /* >>> * NMI/interrupt must not be injected if an exception is >>> * pending, because the latter may have been queued by >>> * handling exit due to NMI/interrupt delivery. >>> */ >>> - if (!vcpu->arch.exception.pending) { >>> - if (vcpu->arch.nmi_injected) { >>> + else if (!vcpu->arch.exception.pending) { >>> + if (vcpu->arch.nmi_injected) >>> kvm_x86_ops->set_nmi(vcpu); >>> - return 0; >>> - } >>> - >>> - if (vcpu->arch.interrupt.injected) { >>> + else if (vcpu->arch.interrupt.injected) >>> kvm_x86_ops->set_irq(vcpu); >>> - return 0; >>> - } >>> } >>> >>> + /* >>> + * Call check_nested_events() even if we reinjected a previous event >>> + * in order for caller to determine if it should require immediate-exit >>> + * from L2 to L1 due to pending L1 events which require exit >>> + * from L2 to L1. >>> + */ >>> if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) { >>> r = kvm_x86_ops->check_nested_events(vcpu, req_int_win); >>> if (r != 0) >>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> vcpu->arch.exception.has_error_code, >>> vcpu->arch.exception.error_code); >>> >>> + WARN_ON_ONCE(vcpu->arch.exception.injected); >>> vcpu->arch.exception.pending = false; >>> vcpu->arch.exception.injected = true; >>> >>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win) >>> } >>> >>> kvm_x86_ops->queue_exception(vcpu); >>> - } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) { >>> + } >>> + >>> + /* Don't consider new event if we re-injected an event */ >>> + if (kvm_event_needs_reinjection(vcpu)) >>> + return 0; >>> + >>> + if (vcpu->arch.smi_pending && !is_smm(vcpu) && >>> + kvm_x86_ops->smi_allowed(vcpu)) { >>> vcpu->arch.smi_pending = false; >>> enter_smm(vcpu); >>> } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) { >>> -- >>> 1.9.1 >>>