From mboxrd@z Thu Jan 1 00:00:00 1970 From: Liran Alon Subject: Re: [PATCH v3 11/11] KVM: nVMX: Wake L2 from HLT when nested posted-interrupt pending Date: Wed, 27 Dec 2017 17:15:29 +0200 Message-ID: <5A43B911.3090500@ORACLE.COM> References: <1514131983-24305-1-git-send-email-liran.alon@oracle.com> <1514131983-24305-12-git-send-email-liran.alon@oracle.com> <5A437B2D.30805@ORACLE.COM> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: jmattson@google.com, wanpeng.li@hotmail.com, idan.brown@ORACLE.COM, Liam Merwick To: Paolo Bonzini , rkrcmar@redhat.com, kvm@vger.kernel.org Return-path: Received: from userp1040.oracle.com ([156.151.31.81]:23453 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752232AbdL0PPq (ORCPT ); Wed, 27 Dec 2017 10:15:46 -0500 In-Reply-To: Sender: kvm-owner@vger.kernel.org List-ID: On 27/12/17 14:55, Paolo Bonzini wrote: > On 27/12/2017 11:51, Liran Alon wrote: >> Maybe I misunderstand you, but I don't think this is true. >> complete_nested_posted_interrupt() relies on being called at a very >> specific call-site: Right before VMEntry and after interrupts are >> disabled. It works by issue a self-IPI to cause CPU to actually process >> the nested posted-interrupts. >> >> In addition, I don't see how we can utilize the fact that >> kvm_cpu_has_interrupt() calls apic_has_interrupt_for_ppr() as >> vmx.nested.pi_desc->pir should never be synced into L0 vLAPIC IRR. >> In contrast to vmx.pi_desc->pir which does after sync_pir_to_irr(). > > You're right, the callbacks must stay. What confused me is the > reference to pi_test_on in vmx_cpu_has_nested_posted_interrupt. While > your patches are an improvement, there is still some confusion on what > leads to injecting a nested posted interrupt, and on the split between > "deliver" and "complete": > > - checking pi_test_on should only be done when consuming the L2 PI > descriptor (which in your case is done by the processor via the > self-IPI). So you don't need to test it before sending the self-IPI. I agree we could have not check it. It is done as an optimization to avoid sending a self-IPI when it won't do anything because the vmx.nested->pi_desc.control ON bit is off. This optimization is even more important when this code runs in L1 (think about triple-virtualization ^_^) to avoid a #VMExit on write to LAPIC ICR. > > - vmx->nested.pi_pending should not be set in > vmx_deliver_nested_posted_interrupt after patch 9. pi_pending should > only be set by the nested PI handler, when it is invoked outside guest > mode. It is then consumed before doing the self-IPI as in the above sketch. I disagree. vmx->nested.pi_pending is used as a signal to know L1 has indeed sent a vmcs12->posted_intr_nv IPI. If vmx_deliver_nested_posted_interrupt() will see target vCPU thread is OUTSIDE_GUEST_MODE then it won't send a physical IPI but instead only kick vCPU. In this case, the only way the target vCPU thread knows it should trigger self-IPI is by the fact pi_pending was set by vmx_deliver_nested_posted_interrupt(). (As in this case, the nested PI handler in host was never invoked). (It cannot only rely on the vmx.nested->pi_desc.control ON bit as it may be set by L1 without L1 triggering a vmcs12->posted_intr_nv IPI) > > - likewise, vmx_cpu_has_nested_posted_interrupt should only check > vmx->nested.pi_pending. I don't think it should even check > is_guest_mode(vcpu) or vmx->nested.pi_desc; compare with the handling of > KVM_REQ_NMI, it doesn't test nmi_allowed. This leads me to the next point. There is an interesting question on what should happen when (is_guest_mode(vcpu) && vmx->nested.pi_pending && !pi_test_on(vmx->nested.pi_desc)). I would expect L2 guest to not be waken from it's HLT as virtual-interrupt-delivery should not see any pending interrupt to inject. But I would expect that to also happen when PIR is empty (even if ON bit is set) which is not checked in this condition... So I think I agree but we may have a small semantic issue here of waking the L2 vCPU unnecessarily. But this should be a very rare case so maybe handle it later (Write TODO comment). > > - vmx_complete_nested_posted_interrupt *must* find a valid descriptor. > Even after a vmexit, because vmx_complete_nested_posted_interrupt is the > very first thing that runs and L1 has had no time to run a vmwrite. So I agree. It is better to WARN in case descriptor is not valid. > the above could become: > > if (is_guest_mode(vcpu) && vmx->nested.pi_pending) { > vmx->nested.pi_pending = false; > apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), > POSTED_INTR_NESTED_VECTOR); > } > > and then what about that is_guest_mode check? When L1 sent the PI > notification, it thought that the destination VCPU was in guest mode. > Upon vmexit, L1 can expect to either see ON=0 or get the notification > interrupt itself. So either we deliver the interrupt to L1, or we must > do L2 virtual interrupt delivery before L1 even runs. Hence IMO the > is_guest_mode is wrong, which unfortunately throws a wrench somewhat in > the self-IPI idea quite a bit. > > I agree with moving vmx_complete_nested_posted_interrupt out of > check_nested_events, and also with the simplification of > vmx_hwapic_irr_update. However, in my opinion the self-IPI is actually > complicating things. Once the handling of vmx->nested.pi_pending is > cleaned up (the nested PI interrupt handler in patch 9 is a great > improvement in that respect), we can move > vmx_complete_nested_posted_interrupt to a new KVM_REQ_NESTED_PI request > that replaces vmx->nested.pi_pending. The GUEST_INTR_STATUS write can > be done in the vmcs12, and will then be copied by prepare_vmcs02 to the > vmcs02. > > I may be wrong, but based on my experience cleaning up the bare-metal > APICv, I suspect that both the current code and this version are overly > complicated, and there is a nice and simple core waiting to be > extracted. Let's fix the race condition in the simple way, and > separately focus on cleaning up pi_pending and everything that follows > from there. Interesting. I think I now follow what you mean regarding cleaning logic around pi_pending. This is how I understand it: 1. "vmx->nested.pi_pending" is a flag used to indicate "L1 has sent a vmcs12->posted_intr_nv IPI". That's it. 2. Currently code is a bit weird in the sense that instead of signal the pending IPI in virtual LAPIC IRR, we set it in a special variable. If we would have set it in virtual LAPIC IRR, we could in theory behave very similar to a standard CPU. At interrupt injection point, we could: (a) If vCPU is in root-mode: Just inject the pending interrupt normally. (b) If vCPU is in non-root-mode and posted-interrupts feature is active, then instead of injecting the pending interrupt, we should simulate processing of posted-interrupts. 3. The processing of the nested posted-interrupts itself can still be done in self-IPI mechanism. 4. Because not doing (2), there is still currently an issue that L1 doesn't receive a vmcs12->posted_intr_nv interrupt when target vCPU thread has exited from L2 to L1 and pi_pending=true. Do we agree on the above? Or am I still misunderstanding something? If we agree, I will attempt to clean code to handle it like described above. Regards, -Liran > > Paolo >