From mboxrd@z Thu Jan  1 00:00:00 1970
From: Liran Alon <LIRAN.ALON@ORACLE.COM>
Subject: Re: [PATCH v2 7/8] KVM: nVMX: Require immediate-exit when event reinjected
 to L2 and L1 event pending
Date: Tue, 28 Nov 2017 21:45:35 +0200
Message-ID: <5A1DBCDF.80406@ORACLE.COM>
References: <1511278211-12257-1-git-send-email-liran.alon@oracle.com> <1511278211-12257-8-git-send-email-liran.alon@oracle.com> <CALMp9eRReKrJYbEKg0N1su8tW4TTB547710eMDjWospDmtwWJQ@mail.gmail.com> <CALMp9eQinvogSsa7TowFaSh_OJ3P=+dQfOTvdR=OafyncyDGWQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Paolo Bonzini <pbonzini@redhat.com>,
        =?UTF-8?B?UmFkaW0gS3LEjW3DocWZ?= <rkrcmar@redhat.com>,
        kvm list <kvm@vger.kernel.org>,
        Wanpeng Li <wanpeng.li@hotmail.com>,
        Idan Brown <idan.brown@ORACLE.COM>,
        Konrad Rzeszutek Wilk <konrad.wilk@ORACLE.COM>
To: Jim Mattson <jmattson@google.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from userp1040.oracle.com ([156.151.31.81]:49706 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752338AbdK1Tpr (ORCPT <rfc822;kvm@vger.kernel.org>);
        Tue, 28 Nov 2017 14:45:47 -0500
In-Reply-To: <CALMp9eQinvogSsa7TowFaSh_OJ3P=+dQfOTvdR=OafyncyDGWQ@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 28/11/17 20:26, Jim Mattson wrote:
> One other thing to note is that there is no place to put the
> interrupted IDT vectoring information for L2 virtual interrupt 210
> when we synthesize a VM-exit from L2 to L1 for L1 "physical" interrupt
> 252. I see that vmcs12_save_pending_event() shoves this information
> into the VMCS12 IDT-vectoring information field, but it should not do
> so when the VMCS12 exit reason is "exception or non-maskable
> interrupt."
>
>  From the SDM, volume 3, section 27.2.3:
>
> Section 24.9.3 defined fields containing information for VM exits that
> occur while delivering an event through the IDT and as a result of any
> of the following cases:
> o  A fault occurs during event delivery and causes a VM exit (because
> the bit associated with the fault is set to 1 in the exception
> bitmap).
> o  A task switch is invoked through a task gate in the IDT. The VM
> exit occurs due to the task switch only after the initial checks of
> the task switch pass (see Section 25.4.2).
> o  Event delivery causes an APIC-access VM exit (see Section 29.4).
> o  An EPT violation, EPT misconfiguration, or page-modification
> log-full event that occurs during event delivery.
>
> For all other exit reasons (including "exception or non-maskable
> interrupt"), the valid bit of the IDT-vectoring info field should be
> clear.

Interesting.

In the case described in the commit-message of this patch this is not 
relevant as because kvm_event_needs_reinjection()==true, 
vmx_check_nested_events() won't really do an exit from L2 to L1 on 
EXTERNAL_INTERRUPT in this case. But rather it will return -EBUSY which 
will cause vcpu_enter_guest() to request an "immediate-exit" (via 
self-IPI. See code in vcpu_enter_guest() regarding req_immediate_exit 
variable for more info). This will make guest complete the 
event-injection of interrupt 210 but immediately afterwards exit to L0 
which will then exit from L2 to L1 on EXTERNAL_INTERRUPT because of L1 
interrupt 252.

However, you made me notice there is another issue here in 
vmx_check_nested_events() handling. If the event-delivery would have 
caused a fault that needs to cause exit from L2 to L1, then currently 
vmx_check_nested_events() will see that 
kvm_event_needs_reinjection()==true and therefore not exit from L2 to L1 
even though it should.
So this is just another other bug to fix I think...

Nice catch.
-Liran

>
> On Mon, Nov 27, 2017 at 10:39 PM, Jim Mattson <jmattson@google.com> wrote:
>> On Tue, Nov 21, 2017 at 7:30 AM, Liran Alon <liran.alon@oracle.com> wrote:
>>> In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
>>> IDT-vectoring-info which vmx_complete_interrupts() makes sure to
>>> reinject before next resume of L2.
>>>
>>> While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
>>> to the L1 vCPU which currently runs L2 and exited to L0.
>>>
>>> When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
>>> it will note that a previous event was re-injected to L2 (by
>>> IDT-vectoring-info) and therefore won't check if there are pending L1
>>> events which require exit from L2 to L1. Thus, L0 enters L2 without
>>> immediate VMExit even though there are pending L1 events!
>>>
>>> This commit fixes the issue by making sure to check for L1 pending
>>> events even if a previous event was reinjected to L2 and bailing out
>>> from inject_pending_event() before evaluating a new pending event in
>>> case an event was already reinjected.
>>>
>>> The bug was observed by the following setup:
>>> * L0 is a 64CPU machine which runs KVM.
>>> * L1 is a 16CPU machine which runs KVM.
>>> * L0 & L1 runs with APICv disabled.
>>> (Also reproduced with APICv enabled but easier to analyze below info
>>> with APICv disabled)
>>> * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
>>> During L2 boot, L1 hangs completely and analyzing the hang reveals that
>>> one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
>>> that he has sent for another L1 vCPU. And all other L1 vCPUs are
>>> currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
>>> forever (as L1 runs with kernel-preemption disabled).
>>>
>>> Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
>>> series of events:
>>> (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>> 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
>>> ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
>>
>> This looks like an EPT violation for writing a stack page that wasn't
>> present in the vmcs02 shadow EPT tables. I assume that L0 filled in
>> the missing shadow EPT table entry (or entries) and dismissed this EPT
>> violation itself rather than forwarding it to L1, because L1's EPT
>> tables have a valid entry for this L2 guest-physical address.
>>
>>> (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
>>> vec 252 (Fixed|edge)
>>
>> This is the source of the bug, right here. An interrupt can only be
>> accepted at an instruction boundary, and the virtual CPU is in the
>> midst of IDT-vectoring, which is not at an instruction boundary. The
>> EPT violation from L2 to L0 is an L0 artifact that should be
>> completely invisible to L1/L2. Interrupt 252 should remain pending
>> until we are at the first instruction of the IDT handler for interrupt
>> 210.
>>
>> To be more precise, if interrupt 210 was delivered via VM-entry event
>> injection on L1 VM-entry to L2, then we need to:
>> a) complete the event injection (i.e. vector through the L2 IDT for vector 210),
>> b) handle virtual VMX-preemption timer expiration during VM-entry (if any),
>> c) handle NMI-window exiting events (if any),
>> d) handle L0 NMI (as an L2 to L1 VM-exit or by vectoring through the
>> L2 IDT, depending on the NMI-exiting control),
>> e) handle interrupt-window exiting events,
>> f) handle L0 maskable interrupts (as an L2 to L1 VM-exit or by
>> vectoring through the L2 IDT, depending on the external-iinterrupt
>> exiting control).
>> ...
>>
>> Only when we get to step (f) can we accept a new IRQ from L0's local APIC.
>>
>>> (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
>>> (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>> (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
>>> rip 0xffffe00069202690 info 83 0
>>> (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
>>> 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
>>> ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
>>> (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
>>> EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
>>> ext_int: 0x00000000 ext_int_err: 0x00000000
>>> (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
>>>
>>> Which can be analyzed as follows:
>>> (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
>>> Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
>>> a pending-interrupt of 0xd2.
>>> (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
>>> vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
>>> (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
>>> notes that interrupt 0xd2 was reinjected and therefore calls
>>> vmx_inject_irq() and returns. Without checking for pending L1 events!
>>> Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
>>> before calling inject_pending_event().
>>> (4) L0 resumes L2 without immediate-exit even though there is a pending
>>> L1 event (The IPI pending in LAPIC's IRR).
>>>
>>> We have already reached the buggy scenario but events could be
>>> furthered analyzed:
>>> (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
>>> event-delivery.
>>> (7) L0 decides to forward the VMExit to L1 for further handling.
>>> (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
>>> LAPIC's IRR is not examined and therefore the IPI is still not delivered
>>> into L1!
>>>
>>> Signed-off-by: Liran Alon <liran.alon@oracle.com>
>>> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
>>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>>> ---
>>>   arch/x86/kvm/x86.c | 33 ++++++++++++++++++++-------------
>>>   1 file changed, 20 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index a4b5a013064b..63359ab0e798 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6398,28 +6398,27 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>          int r;
>>>
>>>          /* try to reinject previous events if any */
>>> -       if (vcpu->arch.exception.injected) {
>>> -               kvm_x86_ops->queue_exception(vcpu);
>>> -               return 0;
>>> -       }
>>>
>>> +       if (vcpu->arch.exception.injected)
>>> +               kvm_x86_ops->queue_exception(vcpu);
>>>          /*
>>>           * NMI/interrupt must not be injected if an exception is
>>>           * pending, because the latter may have been queued by
>>>           * handling exit due to NMI/interrupt delivery.
>>>           */
>>> -       if (!vcpu->arch.exception.pending) {
>>> -               if (vcpu->arch.nmi_injected) {
>>> +       else if (!vcpu->arch.exception.pending) {
>>> +               if (vcpu->arch.nmi_injected)
>>>                          kvm_x86_ops->set_nmi(vcpu);
>>> -                       return 0;
>>> -               }
>>> -
>>> -               if (vcpu->arch.interrupt.injected) {
>>> +               else if (vcpu->arch.interrupt.injected)
>>>                          kvm_x86_ops->set_irq(vcpu);
>>> -                       return 0;
>>> -               }
>>>          }
>>>
>>> +       /*
>>> +        * Call check_nested_events() even if we reinjected a previous event
>>> +        * in order for caller to determine if it should require immediate-exit
>>> +        * from L2 to L1 due to pending L1 events which require exit
>>> +        * from L2 to L1.
>>> +        */
>>>          if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
>>>                  r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
>>>                  if (r != 0)
>>> @@ -6438,6 +6437,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>                                          vcpu->arch.exception.has_error_code,
>>>                                          vcpu->arch.exception.error_code);
>>>
>>> +               WARN_ON_ONCE(vcpu->arch.exception.injected);
>>>                  vcpu->arch.exception.pending = false;
>>>                  vcpu->arch.exception.injected = true;
>>>
>>> @@ -6452,7 +6452,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
>>>                  }
>>>
>>>                  kvm_x86_ops->queue_exception(vcpu);
>>> -       } else if (vcpu->arch.smi_pending && !is_smm(vcpu) && kvm_x86_ops->smi_allowed(vcpu)) {
>>> +       }
>>> +
>>> +       /* Don't consider new event if we re-injected an event */
>>> +       if (kvm_event_needs_reinjection(vcpu))
>>> +               return 0;
>>> +
>>> +       if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
>>> +           kvm_x86_ops->smi_allowed(vcpu)) {
>>>                  vcpu->arch.smi_pending = false;
>>>                  enter_smm(vcpu);
>>>          } else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
>>> --
>>> 1.9.1
>>>