From mboxrd@z Thu Jan  1 00:00:00 1970
From: Liran Alon <LIRAN.ALON@ORACLE.COM>
Subject: Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt
 pending
Date: Wed, 27 Dec 2017 17:15:29 +0200
Message-ID: <5A43B911.3090500@ORACLE.COM>
References: <1514131983-24305-1-git-send-email-liran.alon@oracle.com> <1514131983-24305-12-git-send-email-liran.alon@oracle.com> <f906d4ca-3a85-89ac-32e0-6d8b35a2455b@redhat.com> <5A437B2D.30805@ORACLE.COM> <c9b4ded1-200d-a6f7-6e9d-54a40505317b@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: jmattson@google.com, wanpeng.li@hotmail.com, idan.brown@ORACLE.COM,
        Liam Merwick <liam.merwick@ORACLE.COM>
To: Paolo Bonzini <pbonzini@redhat.com>, rkrcmar@redhat.com,
        kvm@vger.kernel.org
Return-path: <kvm-owner@vger.kernel.org>
Received: from userp1040.oracle.com ([156.151.31.81]:23453 "EHLO
        userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752232AbdL0PPq (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 27 Dec 2017 10:15:46 -0500
In-Reply-To: <c9b4ded1-200d-a6f7-6e9d-54a40505317b@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 27/12/17 14:55, Paolo Bonzini wrote:
> On 27/12/2017 11:51, Liran Alon wrote:
>> Maybe I misunderstand you, but I don't think this is true.
>> complete_nested_posted_interrupt() relies on being called at a very
>> specific call-site: Right before VMEntry and after interrupts are
>> disabled. It works by issue a self-IPI to cause CPU to actually process
>> the nested posted-interrupts.
>>
>> In addition, I don't see how we can utilize the fact that
>> kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr() as
>> vmx.nested.pi_desc->pir should never be synced into L0 vLAPIC IRR.
>> In contrast to vmx.pi_desc->pir which does after sync_pir_to_irr().
>
> You're right, the callbacks must stay.  What confused me is the
> reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt.  While
> your patches are an improvement, there is still some confusion on what
> leads to injecting a nested posted interrupt, and on the split between
> "deliver" and "complete":
>
> - checking pi_test_on should only be done when consuming the L2 PI
> descriptor (which in your case is done by the processor via the
> self-IPI).  So you don't need to test it before sending the self-IPI.

I agree we could have not check it. It is done as an optimization to 
avoid sending a self-IPI when it won't do anything because the 
vmx.nested->pi_desc.control ON bit is off.

This optimization is even more important when this code runs in L1 
(think about triple-virtualization ^_^) to avoid a #VMExit on write to 
LAPIC ICR.

>
> - vmx->nested.pi_pending should not be set in
> vmx_deliver_nested_posted_interrupt after patch 9.  pi_pending should
> only be set by the nested PI handler, when it is invoked outside guest
> mode.  It is then consumed before doing the self-IPI as in the above sketch.

I disagree.
vmx->nested.pi_pending is used as a signal to know L1 has indeed sent a 
vmcs12->posted_intr_nv IPI. If vmx_deliver_nested_posted_interrupt() 
will see target vCPU thread is OUTSIDE_GUEST_MODE then it won't send a 
physical IPI but instead only kick vCPU. In this case, the only way the 
target vCPU thread knows it should trigger self-IPI is by the fact 
pi_pending was set by vmx_deliver_nested_posted_interrupt().
(As in this case, the nested PI handler in host was never invoked).

(It cannot only rely on the vmx.nested->pi_desc.control ON bit as it may 
be set by L1 without L1 triggering a vmcs12->posted_intr_nv IPI)

>
> - likewise, vmx_cpu_has_nested_posted_interrupt should only check
> vmx->nested.pi_pending.  I don't think it should even check
> is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of
> KVM_REQ_NMI, it doesn't test nmi_allowed.  This leads me to the next point.

There is an interesting question on what should happen when
(is_guest_mode(vcpu) && vmx->nested.pi_pending && 
!pi_test_on(vmx->nested.pi_desc)). I would expect L2 guest to not be 
waken from it's HLT as virtual-interrupt-delivery should not see any 
pending interrupt to inject. But I would expect that to also happen when 
PIR is empty (even if ON bit is set) which is not checked in this 
condition...

So I think I agree but we may have a small semantic issue here of waking 
the L2 vCPU unnecessarily. But this should be a very rare case so maybe 
handle it later (Write TODO comment).

>
> - vmx_complete_nested_posted_interrupt *must* find a valid descriptor.
> Even after a vmexit, because vmx_complete_nested_posted_interrupt is the
> very first thing that runs and L1 has had no time to run a vmwrite.  So

I agree. It is better to WARN in case descriptor is not valid.

> the above could become:
>
> 	if (is_guest_mode(vcpu) && vmx->nested.pi_pending) {
> 		vmx->nested.pi_pending = false;
> 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
> 				    POSTED_INTR_NESTED_VECTOR);
> 	}
>
> and then what about that is_guest_mode check?  When L1 sent the PI
> notification, it thought that the destination VCPU was in guest mode.
> Upon vmexit, L1 can expect to either see ON=0 or get the notification
> interrupt itself.  So either we deliver the interrupt to L1, or we must
> do L2 virtual interrupt delivery before L1 even runs.  Hence IMO the
> is_guest_mode is wrong, which unfortunately throws a wrench somewhat in
> the self-IPI idea quite a bit.
>
> I agree with moving vmx_complete_nested_posted_interrupt out of
> check_nested_events, and also with the simplification of
> vmx_hwapic_irr_update.  However, in my opinion the self-IPI is actually
> complicating things.  Once the handling of vmx->nested.pi_pending is
> cleaned up (the nested PI interrupt handler in patch 9 is a great
> improvement in that respect), we can move
> vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request
> that replaces vmx->nested.pi_pending.  The GUEST_INTR_STATUS write can
> be done in the vmcs12, and will then be copied by prepare_vmcs02 to the
> vmcs02.
>
> I may be wrong, but based on my experience cleaning up the bare-metal
> APICv, I suspect that both the current code and this version are overly
> complicated, and there is a nice and simple core waiting to be
> extracted.  Let's fix the race condition in the simple way, and
> separately focus on cleaning up pi_pending and everything that follows
> from there.

Interesting.

I think I now follow what you mean regarding cleaning logic around 
pi_pending. This is how I understand it:

1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a 
vmcs12->posted_intr_nv IPI". That's it.

2. Currently code is a bit weird in the sense that instead of signal the 
pending IPI in virtual LAPIC IRR, we set it in a special variable.
If we would have set it in virtual LAPIC IRR, we could in theory behave 
very similar to a standard CPU. At interrupt injection point, we could:
(a) If vCPU is in root-mode: Just inject the pending interrupt normally.
(b) If vCPU is in non-root-mode and posted-interrupts feature is active, 
then instead of injecting the pending interrupt, we should simulate 
processing of posted-interrupts.

3. The processing of the nested posted-interrupts itself can still be 
done in self-IPI mechanism.

4. Because not doing (2), there is still currently an issue that L1 
doesn't receive a vmcs12->posted_intr_nv interrupt when target vCPU 
thread has exited from L2 to L1 and pi_pending=true.

Do we agree on the above? Or am I still misunderstanding something?
If we agree, I will attempt to clean code to handle it like described above.

Regards,
-Liran

>
> Paolo
>