[PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support
@ 2012-11-21  8:09 Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC Yang Zhang
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang

APIC virtualization is a new feature which can eliminate most of VM exit
when vcpu handle a interrupt:

APIC register virtualization: 
	APIC read access doesn't cause APIC-access VM exits.
	APIC write becomes trap-like.

Virtual interrupt delivery:
	Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
	manually, which is fully taken care of by the hardware.

Posted Interrupt:
	Posted Interrupt allows vAPICV interrupts to inject into guest directly
	without cause VM exit

Please refer to Intel SDM volume 3, chapter 29 for more details.

Changes v1 to v2:
 * Add Posted Interrupt support in this series patch.
 * Since there is a notifer hook in vAPIC EOI for PIT interrupt. So always Set PIT
   interrupt in eoi exit bitmap to force vmexit when EOI to interrupt.
 * Rebased on top of KVM upstream

Yang Zhang (6):
  x86: PIT connects to pin 2 of IOAPIC
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual interrupt delivery support
  x86, apicv: add virtual x2apic support
  x86: Enable ack interrupt on vmexit
  x86, apicv: Add Posted Interrupt supporting

 arch/x86/include/asm/kvm_host.h |    7 +
 arch/x86/include/asm/vmx.h      |   17 ++
 arch/x86/kernel/apic/io_apic.c  |  138 +++++++++++++
 arch/x86/kvm/irq.c              |   44 ++++
 arch/x86/kvm/lapic.c            |   91 ++++++++-
 arch/x86/kvm/lapic.h            |   23 ++
 arch/x86/kvm/svm.c              |    6 +
 arch/x86/kvm/vmx.c              |  420 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   18 ++-
 include/linux/kvm_host.h        |    1 +
 virt/kvm/ioapic.c               |    3 +-
 virt/kvm/kvm_main.c             |    2 +
 12 files changed, 749 insertions(+), 21 deletions(-)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-28 10:50   ` Gleb Natapov
  2012-11-21  8:09 ` [PATCH v2 2/6] x86, apicv: add APICv register virtualization support Yang Zhang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang

When PIT connects to IOAPIC, it route to pin 2 not pin 0.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 virt/kvm/ioapic.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index cfb7e4d..166c450 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -181,7 +181,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 
 #ifdef CONFIG_X86
 	/* Always delivery PIT interrupt to vcpu 0 */
-	if (irq == 0) {
+	if (irq == 2) {
 		irqe.dest_mode = 0; /* Physical mode. */
 		/* need to read apic_id from apic regiest since
 		 * it can be rewritten */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/6] x86, apicv: add APICv register virtualization support
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support Yang Zhang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang, Kevin Tian

- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
---
 arch/x86/include/asm/vmx.h |    2 ++
 arch/x86/kvm/lapic.c       |   16 ++++++++++++++++
 arch/x86/kvm/lapic.h       |    2 ++
 arch/x86/kvm/vmx.c         |   32 +++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 36ec21c..21101b6 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
+#define EXIT_REASON_APIC_WRITE          56
 #define EXIT_REASON_INVPCID             58
 
 #define VMX_EXIT_REASONS \
@@ -141,6 +142,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 43e9fad..a63ffdc 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1212,6 +1212,22 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 
+/* emulate APIC access in a trap manner */
+int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+	u32 val = 0;
+
+	/* hw has done the conditional check and inst decode */
+	offset &= 0xff0;
+	if ((offset != APIC_EOI) &&
+	     apic_reg_read(vcpu->arch.apic, offset, 4, &val))
+		return 1;
+
+	/* TODO: optimize to just emulate side effect w/o one more write */
+	return apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e5ebf9f..c42f111 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..e9287aa 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -83,6 +83,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg = 0;
+module_param(enable_apicv_reg, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -761,6 +764,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2470,7 +2479,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_UNRESTRICTED_GUEST |
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
-			SECONDARY_EXEC_ENABLE_INVPCID;
+			SECONDARY_EXEC_ENABLE_INVPCID |
+			SECONDARY_EXEC_APIC_REGISTER_VIRT;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2481,6 +2491,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 		_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+		_cpu_based_2nd_exec_control &= ~(
+				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
 		   enabled */
@@ -2678,6 +2693,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
+	if (!cpu_has_vmx_apic_register_virt())
+		enable_apicv_reg = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3791,6 +3809,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
 	if (!ple_gap)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	if (!enable_apicv_reg)
+		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
 	return exec_control;
 }
 
@@ -4750,6 +4770,15 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_write(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 offset = exit_qualification & 0xfff;
+
+	/* APIC-write VM exit is trap-like and thus no need to adjust IP */
+	return kvm_apic_write_nodecode(vcpu, offset) == 0;
+}
+
 static int handle_task_switch(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5689,6 +5718,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
+	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 2/6] x86, apicv: add APICv register virtualization support Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-22 13:57   ` Gleb Natapov
  2012-11-21  8:09 ` [PATCH v2 4/6] x86, apicv: add virtual x2apic support Yang Zhang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang, Kevin Tian

Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    4 +
 arch/x86/include/asm/vmx.h      |   11 ++++
 arch/x86/kvm/irq.c              |   44 ++++++++++++++
 arch/x86/kvm/lapic.c            |   44 +++++++++++++-
 arch/x86/kvm/lapic.h            |   13 ++++
 arch/x86/kvm/svm.c              |    6 ++
 arch/x86/kvm/vmx.c              |  125 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   16 +++++-
 virt/kvm/ioapic.c               |    1 +
 9 files changed, 260 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..8e07a86 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -682,6 +682,10 @@ struct kvm_x86_ops {
 	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+	void (*update_irq)(struct kvm_vcpu *vcpu);
+	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
+			int need_eoi, int global);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 21101b6..1003341 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -62,6 +62,7 @@
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
+#define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
@@ -143,6 +144,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
@@ -180,6 +182,7 @@ enum vmcs_field {
 	GUEST_GS_SELECTOR               = 0x0000080a,
 	GUEST_LDTR_SELECTOR             = 0x0000080c,
 	GUEST_TR_SELECTOR               = 0x0000080e,
+	GUEST_INTR_STATUS               = 0x00000810,
 	HOST_ES_SELECTOR                = 0x00000c00,
 	HOST_CS_SELECTOR                = 0x00000c02,
 	HOST_SS_SELECTOR                = 0x00000c04,
@@ -207,6 +210,14 @@ enum vmcs_field {
 	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
 	EPT_POINTER                     = 0x0000201a,
 	EPT_POINTER_HIGH                = 0x0000201b,
+	EOI_EXIT_BITMAP0                = 0x0000201c,
+	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
+	EOI_EXIT_BITMAP1                = 0x0000201e,
+	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
+	EOI_EXIT_BITMAP2                = 0x00002020,
+	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
+	EOI_EXIT_BITMAP3                = 0x00002022,
+	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..c7356a3 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -60,6 +60,29 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
 
 /*
+ * check if there is pending interrupt without
+ * intack. This _apicv version is used when hardware
+ * supports APIC virtualization with virtual interrupt
+ * delivery support. In such case, KVM is not required
+ * to poll pending APIC interrupt, and thus this
+ * interface is used to poll pending interupts from
+ * non-APIC source.
+ */
+int kvm_cpu_has_extint(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s;
+
+	if (!irqchip_in_kernel(v->kvm))
+		return v->arch.interrupt.pending;
+
+	if (kvm_apic_accept_pic_intr(v)) {
+		s = pic_irqchip(v->kvm);	/* PIC */
+		return s->output;
+	} else
+		return 0;
+}
+
+/*
  * Read pending interrupt vector and intack.
  */
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
@@ -82,6 +105,27 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
+/*
+ * Read pending interrupt vector and intack.
+ * Similar to kvm_cpu_has_interrupt_apicv, to get
+ * interrupts from non-APIC sources.
+ */
+int kvm_cpu_get_extint(struct kvm_vcpu *v)
+{
+	struct kvm_pic *s;
+	int vector = -1;
+
+	if (!irqchip_in_kernel(v->kvm))
+		return v->arch.interrupt.nr;
+
+	if (kvm_apic_accept_pic_intr(v)) {
+		s = pic_irqchip(v->kvm);
+		s->output = 0;		/* PIC */
+		vector = kvm_pic_read_irq(v->kvm);
+	}
+	return vector;
+}
+
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
 	kvm_inject_apic_timer_irqs(vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a63ffdc..af48361 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -643,6 +643,12 @@ out:
 	return ret;
 }
 
+void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
+		int need_eoi, int global)
+{
+	kvm_x86_ops->set_eoi_exitmap(vcpu, vector, need_eoi, global);
+}
+
 /*
  * Add a pending IRQ into lapic.
  * Return 1 if successfully added and 0 if discarded.
@@ -664,8 +670,11 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		if (trig_mode) {
 			apic_debug("level trig mode for vector %d", vector);
 			apic_set_vector(vector, apic->regs + APIC_TMR);
-		} else
+			kvm_set_eoi_exitmap(vcpu, vector, 1, 0);
+		} else {
 			apic_clear_vector(vector, apic->regs + APIC_TMR);
+			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
+		}
 
 		result = !apic_test_and_set_irr(vector, apic);
 		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
@@ -769,6 +778,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 	return vector;
 }
 
+/*
+ * this interface assumes a trap-like exit, which has already finished
+ * desired side effect including vISR and vPPR update.
+ */
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	int trigger_mode;
+
+	if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
+		trigger_mode = IOAPIC_LEVEL_TRIG;
+	else
+		trigger_mode = IOAPIC_EDGE_TRIG;
+
+	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
+		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
+
 static void apic_send_ipi(struct kvm_lapic *apic)
 {
 	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
@@ -1510,6 +1539,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	kvm_lapic_reset(vcpu);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
 
+	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
+		apic->vid_enabled = true;
 	return 0;
 nomem_free_apic:
 	kfree(apic);
@@ -1533,6 +1564,17 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return highest_irr;
 }
 
+int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (!apic || !apic_enabled(apic))
+		return -1;
+
+	return apic_find_highest_irr(apic);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
+
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
 {
 	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c42f111..2503a64 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -20,6 +20,7 @@ struct kvm_lapic {
 	u32 divide_count;
 	struct kvm_vcpu *vcpu;
 	bool irr_pending;
+	bool vid_enabled;
 	/* Number of bits set in ISR. */
 	s16 isr_count;
 	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
@@ -39,6 +40,9 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu);
 int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
 int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
+int kvm_cpu_has_extint(struct kvm_vcpu *v);
+int kvm_cpu_get_extint(struct kvm_vcpu *v);
+int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
@@ -50,6 +54,8 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
 int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
+		int need_eoi, int global);
 int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
 
 bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
@@ -65,6 +71,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
 int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
 
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
@@ -81,6 +88,12 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
 	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
 }
 
+static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	return apic->vid_enabled;
+}
+
 int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
 void kvm_lapic_init(void);
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d017df3..b290aba 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3564,6 +3564,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
 }
 
+static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4283,6 +4288,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
 
 	.set_tss_addr = svm_set_tss_addr,
 	.get_tdp_level = get_npt_level,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e9287aa..c0d74ce 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -86,6 +86,9 @@ module_param(fasteoi, bool, S_IRUGO);
 static bool __read_mostly enable_apicv_reg = 0;
 module_param(enable_apicv_reg, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_vid = 0;
+module_param(enable_apicv_vid, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -432,6 +435,10 @@ struct vcpu_vmx {
 
 	bool rdtscp_enabled;
 
+	u8 eoi_exitmap_changed;
+	u64 eoi_exit_bitmap[4];
+	u64 eoi_exit_bitmap_global[4];
+
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 };
@@ -770,6 +777,12 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
 		SECONDARY_EXEC_APIC_REGISTER_VIRT;
 }
 
+static inline bool cpu_has_vmx_virtual_intr_delivery(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2480,7 +2493,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
 			SECONDARY_EXEC_ENABLE_INVPCID |
-			SECONDARY_EXEC_APIC_REGISTER_VIRT;
+			SECONDARY_EXEC_APIC_REGISTER_VIRT |
+			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2494,7 +2508,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 
 	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
 		_cpu_based_2nd_exec_control &= ~(
-				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+				SECONDARY_EXEC_APIC_REGISTER_VIRT |
+				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
@@ -2696,6 +2711,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_apic_register_virt())
 		enable_apicv_reg = 0;
 
+	if (!cpu_has_vmx_virtual_intr_delivery())
+		enable_apicv_vid = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3811,6 +3829,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 	if (!enable_apicv_reg)
 		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
+	if (!enable_apicv_vid)
+		exec_control &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 	return exec_control;
 }
 
@@ -3855,6 +3875,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 				vmx_secondary_exec_control(vmx));
 	}
 
+	if (enable_apicv_vid) {
+		vmcs_write64(EOI_EXIT_BITMAP0, 0);
+		vmcs_write64(EOI_EXIT_BITMAP1, 0);
+		vmcs_write64(EOI_EXIT_BITMAP2, 0);
+		vmcs_write64(EOI_EXIT_BITMAP3, 0);
+
+		vmcs_write16(GUEST_INTR_STATUS, 0);
+	}
+
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
@@ -4770,6 +4799,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int vector = exit_qualification & 0xff;
+
+	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
+	kvm_apic_set_eoi_accelerated(vcpu, vector);
+	return 1;
+}
+
 static int handle_apic_write(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -5719,6 +5758,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
+	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
@@ -6049,6 +6089,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
+    /* no need for tpr_threshold update if APIC virtual
+     * interrupt delivery is enabled */
+	if (!enable_apicv_vid)
+		return ;
+
 	if (irr == -1 || tpr < irr) {
 		vmcs_write32(TPR_THRESHOLD, 0);
 		return;
@@ -6057,6 +6102,79 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_vid;
+}
+
+static void vmx_update_irq(struct kvm_vcpu *vcpu)
+{
+	u16 status;
+	u8 old;
+	int vector;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!enable_apicv_vid)
+		return ;
+
+	vector = kvm_apic_get_highest_irr(vcpu);
+	if (vector == -1)
+		return;
+
+	status = vmcs_read16(GUEST_INTR_STATUS);
+	old = (u8)status & 0xff;
+	if ((u8)vector != old) {
+		status &= ~0xff;
+		status |= (u8)vector;
+		vmcs_write16(GUEST_INTR_STATUS, status);
+	}
+
+	if (vmx->eoi_exitmap_changed) {
+#define UPDATE_EOI_EXITMAP(v, e) {				\
+	if ((v)->eoi_exitmap_changed & (1 << (e)))	\
+		vmcs_write64(EOI_EXIT_BITMAP##e,		\
+		(v)->eoi_exit_bitmap[e] | (v)->eoi_exit_bitmap_global[e]); }
+
+		UPDATE_EOI_EXITMAP(vmx, 0);
+		UPDATE_EOI_EXITMAP(vmx, 1);
+		UPDATE_EOI_EXITMAP(vmx, 2);
+		UPDATE_EOI_EXITMAP(vmx, 3);
+		vmx->eoi_exitmap_changed = 0;
+	}
+}
+
+static void vmx_set_eoi_exitmap(struct kvm_vcpu *vcpu,
+				int vector,
+				int need_eoi, int global)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int index, offset, changed;
+	unsigned long *eoi_exitmap;
+
+	if (!enable_apicv_vid)
+		return ;
+
+	if (WARN_ONCE((vector < 0) || (vector > 255),
+		"KVM VMX: vector (%d) out of range\n", vector))
+		return;
+
+	index = vector >> 6;
+	offset = vector & 63;
+	if (global)
+		eoi_exitmap =
+		    (unsigned long *)&vmx->eoi_exit_bitmap_global[index];
+	else
+		eoi_exitmap = (unsigned long *)&vmx->eoi_exit_bitmap[index];
+
+	if (need_eoi)
+		changed = !test_and_set_bit(offset, eoi_exitmap);
+	else
+		changed = test_and_clear_bit(offset, eoi_exitmap);
+
+	if (changed)
+		vmx->eoi_exitmap_changed |= 1 << index;
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -7320,6 +7438,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
+	.update_irq = vmx_update_irq,
+	.set_eoi_exitmap = vmx_set_eoi_exitmap,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f76417..8b8de3b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5190,6 +5190,13 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
 			vcpu->arch.nmi_injected = true;
 			kvm_x86_ops->set_nmi(vcpu);
 		}
+	} else if (kvm_apic_vid_enabled(vcpu)) {
+		if (kvm_cpu_has_extint(vcpu) &&
+		    kvm_x86_ops->interrupt_allowed(vcpu)) {
+			kvm_queue_interrupt(vcpu,
+				kvm_cpu_get_extint(vcpu), false);
+			kvm_x86_ops->set_irq(vcpu);
+		}
 	} else if (kvm_cpu_has_interrupt(vcpu)) {
 		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
 			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
@@ -5289,12 +5296,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
+		/* update archtecture specific hints for APIC
+		 * virtual interrupt delivery */
+		kvm_x86_ops->update_irq(vcpu);
+
 		inject_pending_event(vcpu);
 
 		/* enable NMI/IRQ window open exits if needed */
 		if (vcpu->arch.nmi_pending)
 			kvm_x86_ops->enable_nmi_window(vcpu);
-		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
+		else if (kvm_apic_vid_enabled(vcpu)) {
+			if (kvm_cpu_has_extint(vcpu))
+				kvm_x86_ops->enable_irq_window(vcpu);
+		} else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 166c450..898aa62 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -186,6 +186,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 		/* need to read apic_id from apic regiest since
 		 * it can be rewritten */
 		irqe.dest_id = ioapic->kvm->bsp_vcpu_id;
+		kvm_set_eoi_exitmap(ioapic->kvm->vcpus[0], irqe.vector, 1, 1);
 	}
 #endif
 	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 4/6] x86, apicv: add virtual x2apic support
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
                   ` (2 preceding siblings ...)
  2012-11-21  8:09 ` [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 5/6] x86: Enable ack interrupt on vmexit Yang Zhang
  2012-11-21  8:09 ` [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting Yang Zhang
  5 siblings, 0 replies; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang

basically to benefit from apicv, we need clear MSR bitmap for
corresponding x2apic MSRs:
    0x800 - 0x8ff: no read intercept for apicv register virtualization
    TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/x86/kvm/vmx.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c0d74ce..7949d21 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3705,7 +3705,10 @@ static void free_vpid(struct vcpu_vmx *vmx)
 	spin_unlock(&vmx_vpid_lock);
 }
 
-static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
+#define MSR_TYPE_R	1
+#define MSR_TYPE_W	2
+static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
 {
 	int f = sizeof(unsigned long);
 
@@ -3718,20 +3721,52 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
 	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
 	 */
 	if (msr <= 0x1fff) {
-		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
-		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__clear_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__clear_bit(msr, msr_bitmap + 0x800 / f);
+
 	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
 		msr &= 0x1fff;
-		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
-		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__clear_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__clear_bit(msr, msr_bitmap + 0xc00 / f);
+
 	}
 }
 
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
-		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
-	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+}
+
+static void vmx_disable_intercept_for_msr_read(u32 msr, bool longmode_only)
+{
+	if (!longmode_only)
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_R);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+					msr, MSR_TYPE_R);
+}
+
+static void vmx_disable_intercept_for_msr_write(u32 msr, bool longmode_only)
+{
+	if (!longmode_only)
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_W);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+					msr, MSR_TYPE_W);
 }
 
 /*
@@ -7525,6 +7560,21 @@ static int __init vmx_init(void)
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_ESP, false);
 	vmx_disable_intercept_for_msr(MSR_IA32_SYSENTER_EIP, false);
 
+	if (enable_apicv_reg) {
+		int msr;
+		for (msr = 0x800; msr <= 0x8ff; msr++)
+			vmx_disable_intercept_for_msr_read(msr, false);
+	}
+
+	if (enable_apicv_vid) {
+		/* TPR */
+		vmx_disable_intercept_for_msr_write(0x808, false);
+		/* EOI */
+		vmx_disable_intercept_for_msr_write(0x80b, false);
+		/* SELF-IPI */
+		vmx_disable_intercept_for_msr_write(0x83f, false);
+	}
+
 	if (enable_ept) {
 		kvm_mmu_set_mask_ptes(0ull,
 			(enable_ept_ad_bits) ? VMX_EPT_ACCESS_BIT : 0ull,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
                   ` (3 preceding siblings ...)
  2012-11-21  8:09 ` [PATCH v2 4/6] x86, apicv: add virtual x2apic support Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-22 15:22   ` Gleb Natapov
  2012-11-21  8:09 ` [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting Yang Zhang
  5 siblings, 1 reply; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang

Ack interrupt on vmexit is required by Posted Interrupt. With it,
when external interrupt caused vmexit, the cpu will acknowledge the
interrupt controller and save the interrupt's vector in vmcs.

There are several approaches to enable it. This patch uses a simply
way: re-generate an interrupt via self ipi.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/x86/kvm/vmx.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7949d21..f6ef090 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
-	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
+	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
+		VM_EXIT_ACK_INTR_ON_EXIT;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
 				&_vmexit_control) < 0)
 		return -EIO;
@@ -4457,6 +4458,14 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 
 static int handle_external_interrupt(struct kvm_vcpu *vcpu)
 {
+	unsigned int vector;
+
+	vector = vmcs_read32(VM_EXIT_INTR_INFO);
+	vector &= INTR_INFO_VECTOR_MASK;
+
+	apic_eoi();
+	apic->send_IPI_self(vector);
+
 	++vcpu->stat.irq_exits;
 	return 1;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
                   ` (4 preceding siblings ...)
  2012-11-21  8:09 ` [PATCH v2 5/6] x86: Enable ack interrupt on vmexit Yang Zhang
@ 2012-11-21  8:09 ` Yang Zhang
  2012-11-25 12:39   ` Gleb Natapov
  5 siblings, 1 reply; 29+ messages in thread
From: Yang Zhang @ 2012-11-21  8:09 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti, gleb, Yang Zhang

Posted Interrupt allows vAPICV interrupts to inject into guest directly
without any vmexit.

- When delivering a interrupt to guest, if target vcpu is running,
  update Posted-interrupt requests bitmap and send a notification event
  to the vcpu. Then the vcpu will handle this interrupt automatically,
  without any software involvemnt.

- If target vcpu is not running or there already a notification event
  pending in the vcpu, do nothing. The interrupt will be handled by old
  way.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |    3 +
 arch/x86/include/asm/vmx.h      |    4 +
 arch/x86/kernel/apic/io_apic.c  |  138 ++++++++++++++++++++++++++++
 arch/x86/kvm/lapic.c            |   31 ++++++-
 arch/x86/kvm/lapic.h            |    8 ++
 arch/x86/kvm/vmx.c              |  192 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |    2 +
 include/linux/kvm_host.h        |    1 +
 virt/kvm/kvm_main.c             |    2 +
 9 files changed, 372 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e07a86..1145894 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -683,9 +683,12 @@ struct kvm_x86_ops {
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
 	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
 	void (*update_irq)(struct kvm_vcpu *vcpu);
 	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
 			int need_eoi, int global);
+	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
+	void (*pi_migrate)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 1003341..7b9e1d0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -152,6 +152,7 @@
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
 #define PIN_BASED_NMI_EXITING                   0x00000008
 #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
+#define PIN_BASED_POSTED_INTR                   0x00000080
 
 #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002
 #define VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200
@@ -174,6 +175,7 @@
 /* VMCS Encodings */
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID            = 0x00000000,
+	POSTED_INTR_NV                  = 0x00000002,
 	GUEST_ES_SELECTOR               = 0x00000800,
 	GUEST_CS_SELECTOR               = 0x00000802,
 	GUEST_SS_SELECTOR               = 0x00000804,
@@ -208,6 +210,8 @@ enum vmcs_field {
 	VIRTUAL_APIC_PAGE_ADDR_HIGH     = 0x00002013,
 	APIC_ACCESS_ADDR		= 0x00002014,
 	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
+	POSTED_INTR_DESC_ADDR           = 0x00002016,
+	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
 	EPT_POINTER                     = 0x0000201a,
 	EPT_POINTER_HIGH                = 0x0000201b,
 	EOI_EXIT_BITMAP0                = 0x0000201c,
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 1817fa9..97cb8ee 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
 }
 #endif
 
+static int
+pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
+		      bool force)
+{
+	unsigned int dest;
+	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
+	if (cpumask_equal(cfg->domain, mask))
+		return IRQ_SET_MASK_OK;
+
+	if (__ioapic_set_affinity(data, mask, &dest))
+		return -1;
+
+	return IRQ_SET_MASK_OK;
+}
+
+static void pi_mask(struct irq_data *data)
+{
+	;
+}
+
+static void pi_unmask(struct irq_data *data)
+{
+	;
+}
+
+static struct irq_chip pi_chip = {
+	.name       = "POSTED-INTR",
+	.irq_ack    = ack_apic_edge,
+	.irq_unmask = pi_unmask,
+	.irq_mask   = pi_mask,
+	.irq_set_affinity   = pi_set_affinity,
+};
+
+int arch_pi_migrate(int irq, int cpu)
+{
+	struct irq_data *data = irq_get_irq_data(irq);
+	struct irq_cfg *cfg;
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	if (!desc)
+		return -EINVAL;
+
+	cfg = irq_cfg(irq);
+	if (cpumask_equal(cfg->domain, cpumask_of(cpu)))
+		return cfg->vector;
+
+	irq_set_affinity(irq, cpumask_of(cpu));
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	irq_move_irq(data);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	if (cfg->move_in_progress)
+		send_cleanup_vector(cfg);
+	return cfg->vector;
+}
+EXPORT_SYMBOL_GPL(arch_pi_migrate);
+
+static int arch_pi_create_irq(const struct cpumask *mask)
+{
+	int node = cpu_to_node(0);
+	unsigned int irq_want;
+	struct irq_cfg *cfg;
+	unsigned long flags;
+	unsigned int ret = 0;
+	int irq;
+
+	irq_want = nr_irqs_gsi;
+
+	irq = alloc_irq_from(irq_want, node);
+	if (irq < 0)
+		return 0;
+	cfg = alloc_irq_cfg(irq_want, node);
+	if (!cfg) {
+		free_irq_at(irq, NULL);
+		return 0;
+	}
+
+	raw_spin_lock_irqsave(&vector_lock, flags);
+	if (!__assign_irq_vector(irq, cfg, mask))
+		ret = irq;
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
+
+	if (ret) {
+		irq_set_chip_data(irq, cfg);
+		irq_clear_status_flags(irq, IRQ_NOREQUEST);
+	} else {
+		free_irq_at(irq, cfg);
+	}
+	return ret;
+}
+
+int arch_pi_alloc_irq(void *vmx)
+{
+	int irq, cpu = smp_processor_id();
+	struct irq_cfg *cfg;
+
+	irq = arch_pi_create_irq(cpumask_of(cpu));
+	if (!irq) {
+		pr_err("Posted Interrupt: no free irq\n");
+		return -EINVAL;
+	}
+	irq_set_handler_data(irq, vmx);
+	irq_set_chip_and_handler_name(irq, &pi_chip, handle_edge_irq, "edge");
+	irq_set_status_flags(irq, IRQ_MOVE_PCNTXT);
+	irq_set_affinity(irq, cpumask_of(cpu));
+
+	cfg = irq_cfg(irq);
+	if (cfg->move_in_progress)
+		send_cleanup_vector(cfg);
+
+	return irq;
+}
+EXPORT_SYMBOL_GPL(arch_pi_alloc_irq);
+
+void arch_pi_free_irq(unsigned int irq, void *vmx)
+{
+	if (irq) {
+		irq_set_handler_data(irq, NULL);
+		/* This will mask the irq */
+		free_irq(irq, vmx);
+		destroy_irq(irq);
+	}
+}
+EXPORT_SYMBOL_GPL(arch_pi_free_irq);
+
+int arch_pi_get_vector(unsigned int irq)
+{
+	struct irq_cfg *cfg;
+
+	if (!irq)
+		return -EINVAL;
+
+	cfg = irq_cfg(irq);
+	return cfg->vector;
+}
+EXPORT_SYMBOL_GPL(arch_pi_get_vector);
+
 #ifdef CONFIG_HPET_TIMER
 
 static int hpet_msi_set_affinity(struct irq_data *data,
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index af48361..04220de 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -656,7 +656,7 @@ void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
 static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 			     int vector, int level, int trig_mode)
 {
-	int result = 0;
+	int result = 0, send;
 	struct kvm_vcpu *vcpu = apic->vcpu;
 
 	switch (delivery_mode) {
@@ -674,6 +674,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
 		} else {
 			apic_clear_vector(vector, apic->regs + APIC_TMR);
 			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
+			if (kvm_apic_pi_enabled(vcpu)) {
+				send = kvm_x86_ops->send_nv(vcpu, vector);
+				if (send) {
+					result = 1;
+					break;
+				}
+			}
 		}
 
 		result = !apic_test_and_set_irr(vector, apic);
@@ -1541,6 +1548,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 
 	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
 		apic->vid_enabled = true;
+
+	if (kvm_x86_ops->has_posted_interrupt(vcpu))
+		apic->pi_enabled = true;
+
 	return 0;
 nomem_free_apic:
 	kfree(apic);
@@ -1575,6 +1586,24 @@ int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
 
+void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	unsigned int *reg;
+	unsigned int i;
+
+	if (!apic || !apic_enabled(apic))
+		return;
+
+	for (i = 0; i <= 7; i++) {
+		reg = apic->regs + APIC_IRR + i * 0x10;
+		*reg |= pir[i];
+		pir[i] = 0;
+	}
+	return;
+}
+EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
+
 int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
 {
 	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2503a64..ad35868 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -21,6 +21,7 @@ struct kvm_lapic {
 	struct kvm_vcpu *vcpu;
 	bool irr_pending;
 	bool vid_enabled;
+	bool pi_enabled;
 	/* Number of bits set in ISR. */
 	s16 isr_count;
 	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
@@ -43,6 +44,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_cpu_get_extint(struct kvm_vcpu *v);
 int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
+void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
@@ -94,6 +96,12 @@ static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
 	return apic->vid_enabled;
 }
 
+static inline bool kvm_apic_pi_enabled(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+	return apic->pi_enabled;
+}
+
 int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
 void kvm_lapic_init(void);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f6ef090..6448b96 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -31,6 +31,7 @@
 #include <linux/ftrace_event.h>
 #include <linux/slab.h>
 #include <linux/tboot.h>
+#include <linux/interrupt.h>
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -89,6 +90,8 @@ module_param(enable_apicv_reg, bool, S_IRUGO);
 static bool __read_mostly enable_apicv_vid = 0;
 module_param(enable_apicv_vid, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_pi = 0;
+module_param(enable_apicv_pi, bool, S_IRUGO);
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -372,6 +375,44 @@ struct nested_vmx {
 	struct page *apic_access_page;
 };
 
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+	u32 pir[8];     /* Posted interrupt requested */
+	union {
+		struct {
+			u8  on:1,
+			    rsvd:7;
+		} control;
+		u32 rsvd[8];
+	} u;
+} __aligned(64);
+
+#define POSTED_INTR_ON  0
+u8 pi_test_on(struct pi_desc *pi_desc)
+{
+	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
+}
+void pi_set_on(struct pi_desc *pi_desc)
+{
+	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
+}
+
+void pi_clear_on(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
+}
+
+u8 pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+	return test_and_set_bit(POSTED_INTR_ON,
+			(unsigned long *)&pi_desc->u.control);
+}
+
+void pi_set_pir(int vector, struct pi_desc *pi_desc)
+{
+	set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
 	unsigned long         host_rsp;
@@ -439,6 +480,11 @@ struct vcpu_vmx {
 	u64 eoi_exit_bitmap[4];
 	u64 eoi_exit_bitmap_global[4];
 
+	/* Posted interrupt descriptor */
+	struct pi_desc *pi;
+	u32 irq;
+	u32 vector;
+
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 };
@@ -698,6 +744,11 @@ static u64 host_efer;
 
 static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
 
+int arch_pi_get_vector(unsigned int irq);
+int arch_pi_alloc_irq(struct vcpu_vmx *vmx);
+void arch_pi_free_irq(unsigned int irq, struct vcpu_vmx *vmx);
+int arch_pi_migrate(int irq, int cpu);
+
 /*
  * Keep MSR_STAR at the end, as setup_msrs() will try to optimize it
  * away by decrementing the array size.
@@ -783,6 +834,11 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void)
 		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 }
 
+static inline bool cpu_has_vmx_posted_intr(void)
+{
+	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
 		unsigned long sysenter_esp;
 
+		if (enable_apicv_pi && to_vmx(vcpu)->pi)
+			pi_set_on(to_vmx(vcpu)->pi);
+
+		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
+
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
@@ -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 		vcpu->cpu = -1;
 		kvm_cpu_vmxoff();
 	}
+	if (enable_apicv_pi && to_vmx(vcpu)->pi)
+		pi_set_on(to_vmx(vcpu)->pi);
 }
 
 static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
@@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 
-	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-	opt = PIN_BASED_VIRTUAL_NMIS;
-	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
-				&_pin_based_exec_control) < 0)
-		return -EIO;
-
 	min = CPU_BASED_HLT_EXITING |
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
@@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_vmexit_control) < 0)
 		return -EIO;
 
+	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
+	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
+	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
+				&_pin_based_exec_control) < 0)
+		return -EIO;
+
+	if (!(_cpu_based_2nd_exec_control &
+		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
+		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
+		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
+
 	min = 0;
 	opt = VM_ENTRY_LOAD_IA32_PAT;
 	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
@@ -2715,6 +2783,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_virtual_intr_delivery())
 		enable_apicv_vid = 0;
 
+	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
+		enable_apicv_pi = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3881,6 +3952,93 @@ static void ept_set_mmio_spte_mask(void)
 	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
 }
 
+irqreturn_t pi_handler(int irq, void *data)
+{
+	struct vcpu_vmx *vmx = data;
+
+	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
+	kvm_vcpu_kick(&vmx->vcpu);
+
+	return IRQ_HANDLED;
+}
+
+static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
+}
+
+static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
+{
+	int ret = 0;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!enable_apicv_pi)
+		return ;
+
+	preempt_disable();
+	local_irq_disable();
+	if (!vmx->irq) {
+		ret = arch_pi_alloc_irq(vmx);
+		if (ret < 0) {
+			vmx->irq = -1;
+			goto out;
+		}
+		vmx->irq = ret;
+
+		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
+					"Posted Interrupt", vmx);
+		if (ret) {
+			vmx->irq = -1;
+			goto out;
+		}
+
+		ret = arch_pi_get_vector(vmx->irq);
+	} else
+		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
+
+	if (ret < 0) {
+		vmx->irq = -1;
+		goto out;
+	} else {
+		vmx->vector = ret;
+		vmcs_write16(POSTED_INTR_NV, vmx->vector);
+		pi_clear_on(vmx->pi);
+	}
+out:
+	local_irq_enable();
+	preempt_enable();
+	return ;
+}
+
+static int vmx_send_nv(struct kvm_vcpu *vcpu,
+		int vector)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (unlikely(vmx->irq == -1))
+		return 0;
+
+	if (vcpu->cpu == smp_processor_id()) {
+		pi_set_on(vmx->pi);
+		return 0;
+	}
+
+	pi_set_pir(vector, vmx->pi);
+	if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
+		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
+		return 1;
+	}
+	return 0;
+}
+
+static void free_pi(struct vcpu_vmx *vmx)
+{
+	if (enable_apicv_pi) {
+		kfree(vmx->pi);
+		arch_pi_free_irq(vmx->irq, vmx);
+	}
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	unsigned long a;
 #endif
 	int i;
+	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
 
 	/* I/O */
 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
 
 	/* Control */
-	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
-		vmcs_config.pin_based_exec_ctrl);
+	if (!enable_apicv_pi)
+		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
+
+	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
 
 	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx));
 
@@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		vmcs_write16(GUEST_INTR_STATUS, 0);
 	}
 
+	if (enable_apicv_pi) {
+		vmx->pi = kmalloc(sizeof(struct pi_desc),
+				GFP_KERNEL | __GFP_ZERO);
+		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
+	}
+
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
@@ -6161,6 +6328,11 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu)
 	if (!enable_apicv_vid)
 		return ;
 
+	if (enable_apicv_pi) {
+		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
+		pi_clear_on(vmx->pi);
+	}
+
 	vector = kvm_apic_get_highest_irr(vcpu);
 	if (vector == -1)
 		return;
@@ -6586,6 +6758,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 
 	free_vpid(vmx);
 	free_nested(vmx);
+	free_pi(vmx);
 	free_loaded_vmcs(vmx->loaded_vmcs);
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
@@ -7483,8 +7656,11 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
+	.has_posted_interrupt = vmx_has_posted_interrupt,
 	.update_irq = vmx_update_irq,
 	.set_eoi_exitmap = vmx_set_eoi_exitmap,
+	.send_nv = vmx_send_nv,
+	.pi_migrate = vmx_pi_migrate,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b8de3b..f035267 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5250,6 +5250,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	bool req_immediate_exit = 0;
 
 	if (vcpu->requests) {
+		if (kvm_check_request(KVM_REQ_POSTED_INTR, vcpu))
+			kvm_x86_ops->pi_migrate(vcpu);
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
 			kvm_mmu_unload(vcpu);
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ecc5543..f8d8d34 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -107,6 +107,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_IMMEDIATE_EXIT    15
 #define KVM_REQ_PMU               16
 #define KVM_REQ_PMI               17
+#define KVM_REQ_POSTED_INTR       18
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..05baf1c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1625,6 +1625,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 			smp_send_reschedule(cpu);
 	put_cpu();
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
+
 #endif /* !CONFIG_S390 */
 
 void kvm_resched(struct kvm_vcpu *vcpu)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support
  2012-11-21  8:09 ` [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support Yang Zhang
@ 2012-11-22 13:57   ` Gleb Natapov
  2012-11-23 11:46     ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-22 13:57 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, mtosatti, Kevin Tian

On Wed, Nov 21, 2012 at 04:09:36PM +0800, Yang Zhang wrote:
> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> manually, which is fully taken care of by the hardware. This needs
> some special awareness into existing interrupr injection path:
> 
> - for pending interrupt, instead of direct injection, we may need
>   update architecture specific indicators before resuming to guest.
> 
> - A pending interrupt, which is masked by ISR, should be also
>   considered in above update action, since hardware will decide
>   when to inject it at right time. Current has_interrupt and
>   get_interrupt only returns a valid vector from injection p.o.v.
> 
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    4 +
>  arch/x86/include/asm/vmx.h      |   11 ++++
>  arch/x86/kvm/irq.c              |   44 ++++++++++++++
>  arch/x86/kvm/lapic.c            |   44 +++++++++++++-
>  arch/x86/kvm/lapic.h            |   13 ++++
>  arch/x86/kvm/svm.c              |    6 ++
>  arch/x86/kvm/vmx.c              |  125 ++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   16 +++++-
>  virt/kvm/ioapic.c               |    1 +
>  9 files changed, 260 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b2e11f4..8e07a86 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -682,6 +682,10 @@ struct kvm_x86_ops {
>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> +	void (*update_irq)(struct kvm_vcpu *vcpu);
> +	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
> +			int need_eoi, int global);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 21101b6..1003341 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -62,6 +62,7 @@
>  #define EXIT_REASON_MCE_DURING_VMENTRY  41
>  #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
>  #define EXIT_REASON_APIC_ACCESS         44
> +#define EXIT_REASON_EOI_INDUCED         45
>  #define EXIT_REASON_EPT_VIOLATION       48
>  #define EXIT_REASON_EPT_MISCONFIG       49
>  #define EXIT_REASON_WBINVD              54
> @@ -143,6 +144,7 @@
>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>  #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
> +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
>  #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>  
> @@ -180,6 +182,7 @@ enum vmcs_field {
>  	GUEST_GS_SELECTOR               = 0x0000080a,
>  	GUEST_LDTR_SELECTOR             = 0x0000080c,
>  	GUEST_TR_SELECTOR               = 0x0000080e,
> +	GUEST_INTR_STATUS               = 0x00000810,
>  	HOST_ES_SELECTOR                = 0x00000c00,
>  	HOST_CS_SELECTOR                = 0x00000c02,
>  	HOST_SS_SELECTOR                = 0x00000c04,
> @@ -207,6 +210,14 @@ enum vmcs_field {
>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
>  	EPT_POINTER                     = 0x0000201a,
>  	EPT_POINTER_HIGH                = 0x0000201b,
> +	EOI_EXIT_BITMAP0                = 0x0000201c,
> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
> +	EOI_EXIT_BITMAP1                = 0x0000201e,
> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
> +	EOI_EXIT_BITMAP2                = 0x00002020,
> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
> +	EOI_EXIT_BITMAP3                = 0x00002022,
> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 7e06ba1..c7356a3 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -60,6 +60,29 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
>  
>  /*
> + * check if there is pending interrupt without
> + * intack. This _apicv version is used when hardware
> + * supports APIC virtualization with virtual interrupt
> + * delivery support. In such case, KVM is not required
> + * to poll pending APIC interrupt, and thus this
> + * interface is used to poll pending interupts from
> + * non-APIC source.
> + */
> +int kvm_cpu_has_extint(struct kvm_vcpu *v)
> +{
> +	struct kvm_pic *s;
> +
> +	if (!irqchip_in_kernel(v->kvm))
> +		return v->arch.interrupt.pending;
> +
This does not belong here. If !irqchip_in_kernel() the function will not
be called. Hmm actually with !irqchip_in_kernel() kernel will oops in
kvm_apic_vid_enabled() since it dereference vcpu->arch.apic without
checking if it is NULL.


> +	if (kvm_apic_accept_pic_intr(v)) {
> +		s = pic_irqchip(v->kvm);	/* PIC */
> +		return s->output;
> +	} else
> +		return 0;
This is code duplication from kvm_cpu_has_interrupt(). Write common
function and call it from kvm_cpu_has_interrupt(), but even that is
not needed, see below.

> +}
> +
> +/*
>   * Read pending interrupt vector and intack.
>   */
>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
> @@ -82,6 +105,27 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
>  }
>  EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
>  
> +/*
> + * Read pending interrupt vector and intack.
> + * Similar to kvm_cpu_has_interrupt_apicv, to get
> + * interrupts from non-APIC sources.
> + */
> +int kvm_cpu_get_extint(struct kvm_vcpu *v)
> +{
> +	struct kvm_pic *s;
> +	int vector = -1;
> +
> +	if (!irqchip_in_kernel(v->kvm))
> +		return v->arch.interrupt.nr;
Same as above.

> +
> +	if (kvm_apic_accept_pic_intr(v)) {
> +		s = pic_irqchip(v->kvm);
> +		s->output = 0;		/* PIC */
> +		vector = kvm_pic_read_irq(v->kvm);
Ditto about code duplication.
 
> +	}
> +	return vector;
> +}
> +
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>  {
>  	kvm_inject_apic_timer_irqs(vcpu);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a63ffdc..af48361 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -643,6 +643,12 @@ out:
>  	return ret;
>  }
>  
> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
> +		int need_eoi, int global)
> +{
> +	kvm_x86_ops->set_eoi_exitmap(vcpu, vector, need_eoi, global);
> +}
> +
>  /*
>   * Add a pending IRQ into lapic.
>   * Return 1 if successfully added and 0 if discarded.
> @@ -664,8 +670,11 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  		if (trig_mode) {
>  			apic_debug("level trig mode for vector %d", vector);
>  			apic_set_vector(vector, apic->regs + APIC_TMR);
> -		} else
> +			kvm_set_eoi_exitmap(vcpu, vector, 1, 0);
> +		} else {
>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
> +			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
Why not use APIC_TMR directly instead of kvm_set_eoi_exitmap() logic?

> +		}
>  
>  		result = !apic_test_and_set_irr(vector, apic);
>  		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
> @@ -769,6 +778,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
>  	return vector;
>  }
>  
> +/*
> + * this interface assumes a trap-like exit, which has already finished
> + * desired side effect including vISR and vPPR update.
> + */
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	int trigger_mode;
> +
> +	if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
> +		trigger_mode = IOAPIC_LEVEL_TRIG;
> +	else
> +		trigger_mode = IOAPIC_EDGE_TRIG;
> +
> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
More code duplication. Why not call apic_set_eoi() and skip isr/ppr
logic there if vid is enabled, or put the logic in common function and
call from both places.

> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
> +
>  static void apic_send_ipi(struct kvm_lapic *apic)
>  {
>  	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
> @@ -1510,6 +1539,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
>  	kvm_lapic_reset(vcpu);
>  	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
>  
> +	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> +		apic->vid_enabled = true;
What do you have vid_enabled for. This is global, not per apic, state.

>  	return 0;
>  nomem_free_apic:
>  	kfree(apic);
> @@ -1533,6 +1564,17 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
>  	return highest_irr;
>  }
>  
> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	if (!apic || !apic_enabled(apic))
> +		return -1;
> +
> +	return apic_find_highest_irr(apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
> +
>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>  {
>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index c42f111..2503a64 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -20,6 +20,7 @@ struct kvm_lapic {
>  	u32 divide_count;
>  	struct kvm_vcpu *vcpu;
>  	bool irr_pending;
> +	bool vid_enabled;
>  	/* Number of bits set in ISR. */
>  	s16 isr_count;
>  	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
> @@ -39,6 +40,9 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu);
>  int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
>  int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
> +int kvm_cpu_has_extint(struct kvm_vcpu *v);
> +int kvm_cpu_get_extint(struct kvm_vcpu *v);
> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
>  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> @@ -50,6 +54,8 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>  int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
>  int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
>  int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
> +		int need_eoi, int global);
>  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
>  
>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> @@ -65,6 +71,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>  
>  int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
>  
>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
> @@ -81,6 +88,12 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
>  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
>  }
>  
> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	return apic->vid_enabled;
> +}
vcpu->arch.apic can be NULL from where this is called.

> +
>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
>  void kvm_lapic_init(void);
>  
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d017df3..b290aba 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3564,6 +3564,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>  }
>  
> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return 0;
> +}
> +
>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4283,6 +4288,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.enable_nmi_window = enable_nmi_window,
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
>  
>  	.set_tss_addr = svm_set_tss_addr,
>  	.get_tdp_level = get_npt_level,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e9287aa..c0d74ce 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -86,6 +86,9 @@ module_param(fasteoi, bool, S_IRUGO);
>  static bool __read_mostly enable_apicv_reg = 0;
>  module_param(enable_apicv_reg, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_apicv_vid = 0;
> +module_param(enable_apicv_vid, bool, S_IRUGO);
> +
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -432,6 +435,10 @@ struct vcpu_vmx {
>  
>  	bool rdtscp_enabled;
>  
> +	u8 eoi_exitmap_changed;
> +	u64 eoi_exit_bitmap[4];
> +	u64 eoi_exit_bitmap_global[4];
> +
>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
>  };
> @@ -770,6 +777,12 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>  }
>  
> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> +{
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> @@ -2480,7 +2493,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>  			SECONDARY_EXEC_RDTSCP |
>  			SECONDARY_EXEC_ENABLE_INVPCID |
> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  		if (adjust_vmx_controls(min2, opt2,
>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>  					&_cpu_based_2nd_exec_control) < 0)
> @@ -2494,7 +2508,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  
>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>  		_cpu_based_2nd_exec_control &= ~(
> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>  
>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
>  		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
> @@ -2696,6 +2711,9 @@ static __init int hardware_setup(void)
>  	if (!cpu_has_vmx_apic_register_virt())
>  		enable_apicv_reg = 0;
>  
> +	if (!cpu_has_vmx_virtual_intr_delivery())
> +		enable_apicv_vid = 0;
> +
>  	if (nested)
>  		nested_vmx_setup_ctls_msrs();
>  
> @@ -3811,6 +3829,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>  	if (!enable_apicv_reg)
>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +	if (!enable_apicv_vid)
> +		exec_control &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  	return exec_control;
>  }
>  
> @@ -3855,6 +3875,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  				vmx_secondary_exec_control(vmx));
>  	}
>  
> +	if (enable_apicv_vid) {
> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> +
> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> +	}
> +
>  	if (ple_gap) {
>  		vmcs_write32(PLE_GAP, ple_gap);
>  		vmcs_write32(PLE_WINDOW, ple_window);
> @@ -4770,6 +4799,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>  }
>  
> +static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int vector = exit_qualification & 0xff;
> +
> +	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
> +	kvm_apic_set_eoi_accelerated(vcpu, vector);
> +	return 1;
> +}
> +
>  static int handle_apic_write(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -5719,6 +5758,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
>  	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
> +	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
>  	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
>  	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
> @@ -6049,6 +6089,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>  
>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  {
> +    /* no need for tpr_threshold update if APIC virtual
> +     * interrupt delivery is enabled */
> +	if (!enable_apicv_vid)
> +		return ;

Just set kvm_x86_ops->update_cr8_intercept to NULL if !enable_apicv_vid
and the function will not be called.

> +
>  	if (irr == -1 || tpr < irr) {
>  		vmcs_write32(TPR_THRESHOLD, 0);
>  		return;
> @@ -6057,6 +6102,79 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  	vmcs_write32(TPR_THRESHOLD, irr);
>  }
>  
> +static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_vid;
> +}
> +
> +static void vmx_update_irq(struct kvm_vcpu *vcpu)
> +{
> +	u16 status;
> +	u8 old;
> +	int vector;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!enable_apicv_vid)
> +		return ;
Ditto. Set kvm_x86_ops->update_irq to a function that does nothing if
!enable_apicv_vid. BTW you do not set this callback in SVM code and call
it unconditionally.

> +
> +	vector = kvm_apic_get_highest_irr(vcpu);
> +	if (vector == -1)
> +		return;
> +
> +	status = vmcs_read16(GUEST_INTR_STATUS);
> +	old = (u8)status & 0xff;
> +	if ((u8)vector != old) {
> +		status &= ~0xff;
> +		status |= (u8)vector;
> +		vmcs_write16(GUEST_INTR_STATUS, status);
> +	}
Please write RVI assessor functions.

> +
> +	if (vmx->eoi_exitmap_changed) {
> +#define UPDATE_EOI_EXITMAP(v, e) {				\
> +	if ((v)->eoi_exitmap_changed & (1 << (e)))	\
> +		vmcs_write64(EOI_EXIT_BITMAP##e,		\
> +		(v)->eoi_exit_bitmap[e] | (v)->eoi_exit_bitmap_global[e]); }
Inline function would do. But why calculate this on each entry? We want
EOI exits only for level IOAPIC interrupts and edge IOAPIC interrupt
with registered notifiers. This configuration rarely changes.


> +
> +		UPDATE_EOI_EXITMAP(vmx, 0);
> +		UPDATE_EOI_EXITMAP(vmx, 1);
> +		UPDATE_EOI_EXITMAP(vmx, 2);
> +		UPDATE_EOI_EXITMAP(vmx, 3);
> +		vmx->eoi_exitmap_changed = 0;
> +	}
> +}
> +
> +static void vmx_set_eoi_exitmap(struct kvm_vcpu *vcpu,
> +				int vector,
> +				int need_eoi, int global)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	int index, offset, changed;
> +	unsigned long *eoi_exitmap;
> +
> +	if (!enable_apicv_vid)
> +		return ;
> +
> +	if (WARN_ONCE((vector < 0) || (vector > 255),
> +		"KVM VMX: vector (%d) out of range\n", vector))
> +		return;
> +
> +	index = vector >> 6;
> +	offset = vector & 63;
> +	if (global)
> +		eoi_exitmap =
> +		    (unsigned long *)&vmx->eoi_exit_bitmap_global[index];
> +	else
> +		eoi_exitmap = (unsigned long *)&vmx->eoi_exit_bitmap[index];
> +
> +	if (need_eoi)
> +		changed = !test_and_set_bit(offset, eoi_exitmap);
> +	else
> +		changed = test_and_clear_bit(offset, eoi_exitmap);
> +
> +	if (changed)
> +		vmx->eoi_exitmap_changed |= 1 << index;
> +}
> +
>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>  {
>  	u32 exit_intr_info;
> @@ -7320,6 +7438,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.enable_nmi_window = enable_nmi_window,
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> +	.update_irq = vmx_update_irq,
You need to initialize this one in svm.c too.

> +	.set_eoi_exitmap = vmx_set_eoi_exitmap,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f76417..8b8de3b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5190,6 +5190,13 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
>  			vcpu->arch.nmi_injected = true;
>  			kvm_x86_ops->set_nmi(vcpu);
>  		}
> +	} else if (kvm_apic_vid_enabled(vcpu)) {
> +		if (kvm_cpu_has_extint(vcpu) &&
> +		    kvm_x86_ops->interrupt_allowed(vcpu)) {
> +			kvm_queue_interrupt(vcpu,
> +				kvm_cpu_get_extint(vcpu), false);
> +			kvm_x86_ops->set_irq(vcpu);
> +		}
Drop all this and modify kvm_cpu_has_interrupt()/kvm_cpu_get_interrupt()
to consider apic interrupts only if vid is enabled then the if below
will just work.

>  	} else if (kvm_cpu_has_interrupt(vcpu)) {
>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
> @@ -5289,12 +5296,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> +		/* update archtecture specific hints for APIC
> +		 * virtual interrupt delivery */
> +		kvm_x86_ops->update_irq(vcpu);
> +
>  		inject_pending_event(vcpu);
>  
>  		/* enable NMI/IRQ window open exits if needed */
>  		if (vcpu->arch.nmi_pending)
>  			kvm_x86_ops->enable_nmi_window(vcpu);
> -		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> +		else if (kvm_apic_vid_enabled(vcpu)) {
> +			if (kvm_cpu_has_extint(vcpu))
> +				kvm_x86_ops->enable_irq_window(vcpu);
Same as above. With proper kvm_cpu_has_interrupt() implementation this
id is not needed.

> +		} else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>  			kvm_x86_ops->enable_irq_window(vcpu);
>  
>  		if (kvm_lapic_enabled(vcpu)) {
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 166c450..898aa62 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -186,6 +186,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
>  		/* need to read apic_id from apic regiest since
>  		 * it can be rewritten */
>  		irqe.dest_id = ioapic->kvm->bsp_vcpu_id;
> +		kvm_set_eoi_exitmap(ioapic->kvm->vcpus[0], irqe.vector, 1, 1);
>  	}
>  #endif
>  	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-21  8:09 ` [PATCH v2 5/6] x86: Enable ack interrupt on vmexit Yang Zhang
@ 2012-11-22 15:22   ` Gleb Natapov
  2012-11-23  5:41     ` Zhang, Yang Z
  2012-11-25 12:55     ` Avi Kivity
  0 siblings, 2 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-22 15:22 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, mtosatti

On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
> Ack interrupt on vmexit is required by Posted Interrupt. With it,
> when external interrupt caused vmexit, the cpu will acknowledge the
> interrupt controller and save the interrupt's vector in vmcs.
> 
> There are several approaches to enable it. This patch uses a simply
> way: re-generate an interrupt via self ipi.
> 
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  arch/x86/kvm/vmx.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7949d21..f6ef090 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  #ifdef CONFIG_X86_64
>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
> +		VM_EXIT_ACK_INTR_ON_EXIT;
Always? Do it only if posted interrupts are actually available
and going to be used.

>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
>  				&_vmexit_control) < 0)
>  		return -EIO;
> @@ -4457,6 +4458,14 @@ static int handle_exception(struct kvm_vcpu *vcpu)
>  
>  static int handle_external_interrupt(struct kvm_vcpu *vcpu)
>  {
> +	unsigned int vector;
> +
> +	vector = vmcs_read32(VM_EXIT_INTR_INFO);
> +	vector &= INTR_INFO_VECTOR_MASK;
Valid bit is guarantied to be set here?

> +
> +	apic_eoi();
This is way to late. handle_external_interrupt() is called longs after
preemption and local irqs are enabled. vcpu process may be scheduled out
and apic_eoi() will not be called for a long time leaving interrupt
stuck in ISR and blocking other interrupts.

> +	apic->send_IPI_self(vector);
For level interrupt this is not needed, no?

> +
>  	++vcpu->stat.irq_exits;
>  	return 1;
>  }
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-22 15:22   ` Gleb Natapov
@ 2012-11-23  5:41     ` Zhang, Yang Z
  2012-11-25 13:30       ` Gleb Natapov
  2012-11-25 12:55     ` Avi Kivity
  1 sibling, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-23  5:41 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

Gleb Natapov wrote on 2012-11-22:
> On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
>> Ack interrupt on vmexit is required by Posted Interrupt. With it,
>> when external interrupt caused vmexit, the cpu will acknowledge the
>> interrupt controller and save the interrupt's vector in vmcs.
>> 
>> There are several approaches to enable it. This patch uses a simply
>> way: re-generate an interrupt via self ipi.
>> 
>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>> ---
>>  arch/x86/kvm/vmx.c |   11 ++++++++++-
>>  1 files changed, 10 insertions(+), 1 deletions(-)
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 7949d21..f6ef090 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>  #ifdef CONFIG_X86_64
>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>  #endif
>> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
>> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
>> +		VM_EXIT_ACK_INTR_ON_EXIT;
> Always? Do it only if posted interrupts are actually available
> and going to be used.

Right. 
But the currently interrupt handler path is too long:
vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable interrupt -> cpu ack the interrupt and interrupt deliver through the host IDT
This will bring extra cost for interrupt belongs to guest. After enable "acknowledge interrupt on exit", we can inject the interrupt right way after vm exit if the interrupt is for the guest(This patch doesn't do this).

Since we only want to enable "acknowledge interrupt on exit" for Posted Interrupt, probably, we can enable it when PI is available.

>>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
>>  				&_vmexit_control) < 0)
>>  		return -EIO;
>> @@ -4457,6 +4458,14 @@ static int handle_exception(struct kvm_vcpu *vcpu)
>> 
>>  static int handle_external_interrupt(struct kvm_vcpu *vcpu)
>>  {
>> +	unsigned int vector;
>> +
>> +	vector = vmcs_read32(VM_EXIT_INTR_INFO);
>> +	vector &= INTR_INFO_VECTOR_MASK;
> Valid bit is guarantied to be set here?
> 
>> +
>> +	apic_eoi();
> This is way to late. handle_external_interrupt() is called longs after
> preemption and local irqs are enabled. vcpu process may be scheduled out
> and apic_eoi() will not be called for a long time leaving interrupt
> stuck in ISR and blocking other interrupts.

I will move it to vmx_complete_atomic_exit().

>> +	apic->send_IPI_self(vector);
> For level interrupt this is not needed, no?

If we enable "ack interrupt on exit" only when apicv is available, then all interrupt is edge(interrupt remapping will setup all remap entries to deliver edge interrupt. interrupt remapping is required by x2apic, x2apic is required by PI)

/*
         * Trigger mode in the IRTE will always be edge, and for IO-APIC,
the
         * actual level or edge trigger will be setup in the IO-APIC
         * RTE. This will help simplify level triggered irq migration.
         * For more details, see the comments (in io_apic.c) explainig
IO-APIC
         * irq migration in the presence of interrupt-remapping.
        */

>> +
>>  	++vcpu->stat.irq_exits;
>>  	return 1;
>>  }
>> --
>> 1.7.1
> 
> --
> 			Gleb.


Best regards,
Yang


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support
  2012-11-22 13:57   ` Gleb Natapov
@ 2012-11-23 11:46     ` Zhang, Yang Z
  2012-11-25  8:53       ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-23 11:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com, Tian, Kevin

Gleb Natapov wrote on 2012-11-22:
> On Wed, Nov 21, 2012 at 04:09:36PM +0800, Yang Zhang wrote:
>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
>> manually, which is fully taken care of by the hardware. This needs
>> some special awareness into existing interrupr injection path:
>> 
>> - for pending interrupt, instead of direct injection, we may need
>>   update architecture specific indicators before resuming to guest.
>> - A pending interrupt, which is masked by ISR, should be also
>>   considered in above update action, since hardware will decide
>>   when to inject it at right time. Current has_interrupt and
>>   get_interrupt only returns a valid vector from injection p.o.v.
>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |    4 + arch/x86/include/asm/vmx.h   
>>    |   11 ++++ arch/x86/kvm/irq.c              |   44 ++++++++++++++
>>  arch/x86/kvm/lapic.c            |   44 +++++++++++++-
>>  arch/x86/kvm/lapic.h            |   13 ++++ arch/x86/kvm/svm.c        
>>       |    6 ++ arch/x86/kvm/vmx.c              |  125
>>  ++++++++++++++++++++++++++++++++++++++- arch/x86/kvm/x86.c            
>>   |   16 +++++- virt/kvm/ioapic.c               |    1 + 9 files
>>  changed, 260 insertions(+), 4 deletions(-)
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h index b2e11f4..8e07a86 100644 ---
>> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -682,6 +682,10 @@ struct kvm_x86_ops {
>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>> +	void (*update_irq)(struct kvm_vcpu *vcpu);
>> +	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
>> +			int need_eoi, int global);
>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>  	int (*get_tdp_level)(void);
>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 21101b6..1003341 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -62,6 +62,7 @@
>>  #define EXIT_REASON_MCE_DURING_VMENTRY  41 #define
>>  EXIT_REASON_TPR_BELOW_THRESHOLD 43 #define EXIT_REASON_APIC_ACCESS    
>>      44 +#define EXIT_REASON_EOI_INDUCED         45 #define
>>  EXIT_REASON_EPT_VIOLATION       48 #define EXIT_REASON_EPT_MISCONFIG  
>>      49 #define EXIT_REASON_WBINVD              54 @@ -143,6 +144,7 @@
>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040 #define
>>  SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080 #define
>>  SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100 +#define
>>  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200 #define
>>  SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400 #define
>>  SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>> @@ -180,6 +182,7 @@ enum vmcs_field {
>>  	GUEST_GS_SELECTOR               = 0x0000080a, 	GUEST_LDTR_SELECTOR   
>>           = 0x0000080c, 	GUEST_TR_SELECTOR               = 0x0000080e,
>>  +	GUEST_INTR_STATUS               = 0x00000810, 	HOST_ES_SELECTOR     
>>            = 0x00000c00, 	HOST_CS_SELECTOR                = 0x00000c02,
>>  	HOST_SS_SELECTOR                = 0x00000c04, @@ -207,6 +210,14 @@
>>  enum vmcs_field { 	APIC_ACCESS_ADDR_HIGH		= 0x00002015, 	EPT_POINTER  
>>                    = 0x0000201a, 	EPT_POINTER_HIGH                =
>>  0x0000201b,
>> +	EOI_EXIT_BITMAP0                = 0x0000201c,
>> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
>> +	EOI_EXIT_BITMAP1                = 0x0000201e,
>> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
>> +	EOI_EXIT_BITMAP2                = 0x00002020,
>> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
>> +	EOI_EXIT_BITMAP3                = 0x00002022,
>> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
>>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>>  	VMCS_LINK_POINTER               = 0x00002800,
>> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
>> index 7e06ba1..c7356a3 100644
>> --- a/arch/x86/kvm/irq.c
>> +++ b/arch/x86/kvm/irq.c
>> @@ -60,6 +60,29 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
>>  
>>  /*
>> + * check if there is pending interrupt without
>> + * intack. This _apicv version is used when hardware
>> + * supports APIC virtualization with virtual interrupt
>> + * delivery support. In such case, KVM is not required
>> + * to poll pending APIC interrupt, and thus this
>> + * interface is used to poll pending interupts from
>> + * non-APIC source.
>> + */
>> +int kvm_cpu_has_extint(struct kvm_vcpu *v)
>> +{
>> +	struct kvm_pic *s;
>> +
>> +	if (!irqchip_in_kernel(v->kvm))
>> +		return v->arch.interrupt.pending;
>> +
> This does not belong here. If !irqchip_in_kernel() the function will not
> be called. Hmm actually with !irqchip_in_kernel() kernel will oops in
> kvm_apic_vid_enabled() since it dereference vcpu->arch.apic without
> checking if it is NULL.

Right. Will remove it in next version and add the check in kvm_apic_vid_enabled.
 
> 
>> +	if (kvm_apic_accept_pic_intr(v)) {
>> +		s = pic_irqchip(v->kvm);	/* PIC */
>> +		return s->output;
>> +	} else
>> +		return 0;
> This is code duplication from kvm_cpu_has_interrupt(). Write common
> function and call it from kvm_cpu_has_interrupt(), but even that is
> not needed, see below.

Why it is not needed? 
 
>> +}
>> +
>> +/*
>>   * Read pending interrupt vector and intack.
>>   */
>>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v) @@ -82,6 +105,27 @@ int
>>  kvm_cpu_get_interrupt(struct kvm_vcpu *v) }
>>  EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
>> +/*
>> + * Read pending interrupt vector and intack.
>> + * Similar to kvm_cpu_has_interrupt_apicv, to get
>> + * interrupts from non-APIC sources.
>> + */
>> +int kvm_cpu_get_extint(struct kvm_vcpu *v)
>> +{
>> +	struct kvm_pic *s;
>> +	int vector = -1;
>> +
>> +	if (!irqchip_in_kernel(v->kvm))
>> +		return v->arch.interrupt.nr;
> Same as above.
> 
>> +
>> +	if (kvm_apic_accept_pic_intr(v)) {
>> +		s = pic_irqchip(v->kvm);
>> +		s->output = 0;		/* PIC */
>> +		vector = kvm_pic_read_irq(v->kvm);
> Ditto about code duplication.
> 
>> +	}
>> +	return vector;
>> +}
>> +
>>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>>  {
>>  	kvm_inject_apic_timer_irqs(vcpu);
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index a63ffdc..af48361 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -643,6 +643,12 @@ out:
>>  	return ret;
>>  }
>> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
>> +		int need_eoi, int global)
>> +{
>> +	kvm_x86_ops->set_eoi_exitmap(vcpu, vector, need_eoi, global);
>> +}
>> +
>>  /*
>>   * Add a pending IRQ into lapic.
>>   * Return 1 if successfully added and 0 if discarded.
>> @@ -664,8 +670,11 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int
> delivery_mode,
>>  		if (trig_mode) {
>>  			apic_debug("level trig mode for vector %d", vector);
>>  			apic_set_vector(vector, apic->regs + APIC_TMR);
>> -		} else
>> +			kvm_set_eoi_exitmap(vcpu, vector, 1, 0);
>> +		} else {
>>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
>> +			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
> Why not use APIC_TMR directly instead of kvm_set_eoi_exitmap() logic?

Good idea. It seems more reasonable. 

>> +		}
>> 
>>  		result = !apic_test_and_set_irr(vector, apic);
>>  		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, @@ -769,6
>>  +778,26 @@ static int apic_set_eoi(struct kvm_lapic *apic) 	return
>>  vector; }
>> +/*
>> + * this interface assumes a trap-like exit, which has already finished
>> + * desired side effect including vISR and vPPR update.
>> + */
>> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	int trigger_mode;
>> +
>> +	if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
>> +		trigger_mode = IOAPIC_LEVEL_TRIG;
>> +	else
>> +		trigger_mode = IOAPIC_EDGE_TRIG;
>> +
>> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
>> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
>> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> More code duplication. Why not call apic_set_eoi() and skip isr/ppr
> logic there if vid is enabled, or put the logic in common function and
> call from both places.

Ok, will change it in next patch.

>> +}
>> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>> +
>>  static void apic_send_ipi(struct kvm_lapic *apic) { 	u32 icr_low =
>>  kvm_apic_get_reg(apic, APIC_ICR); @@ -1510,6 +1539,8 @@ int
>>  kvm_create_lapic(struct kvm_vcpu *vcpu) 	kvm_lapic_reset(vcpu);
>>  	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
>> +	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>> +		apic->vid_enabled = true;
> What do you have vid_enabled for. This is global, not per apic, state.
When inject interrupt to guest, we need this to check whether vid is enabled. If not, use old way to handle the interrupt.
I thing put it in apic is reasonable. Though all vcpu use same configuration, APICv feature is per vcpu too.

> 
>>  	return 0; nomem_free_apic: 	kfree(apic); @@ -1533,6 +1564,17 @@ int
>>  kvm_apic_has_interrupt(struct kvm_vcpu *vcpu) 	return highest_irr; }
>> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +
>> +	if (!apic || !apic_enabled(apic))
>> +		return -1;
>> +
>> +	return apic_find_highest_irr(apic);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
>> +
>>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>>  {
>>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>> index c42f111..2503a64 100644
>> --- a/arch/x86/kvm/lapic.h
>> +++ b/arch/x86/kvm/lapic.h
>> @@ -20,6 +20,7 @@ struct kvm_lapic {
>>  	u32 divide_count; 	struct kvm_vcpu *vcpu; 	bool irr_pending; +	bool
>>  vid_enabled; 	/* Number of bits set in ISR. */ 	s16 isr_count; 	/* The
>>  highest vector set in ISR; if -1 - invalid, must scan ISR. */ @@ -39,6
>>  +40,9 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu); int
>>  kvm_apic_has_interrupt(struct kvm_vcpu *vcpu); int
>>  kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu); int
>>  kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
>> +int kvm_cpu_has_extint(struct kvm_vcpu *v);
>> +int kvm_cpu_get_extint(struct kvm_vcpu *v);
>> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
>>  void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64
>>  kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void
>>  kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); @@ -50,6
>>  +54,8 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu); int
>>  kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest); int
>>  kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda); int
>>  kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
>> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
>> +		int need_eoi, int global);
>>  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
>>  
>>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>> @@ -65,6 +71,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu
> *vcpu);
>>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>>  
>>  int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
>> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
>> 
>>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
>> @@ -81,6 +88,12 @@ static inline bool
> kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
>>  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
>>  }
>> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	return apic->vid_enabled;
>> +}
> vcpu->arch.apic can be NULL from where this is called.
> 
>> +
>>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
>>  void kvm_lapic_init(void);
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index d017df3..b290aba 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -3564,6 +3564,11 @@ static void update_cr8_intercept(struct kvm_vcpu
> *vcpu, int tpr, int irr)
>>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>>  }
>> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
>> +{
>> +	return 0;
>> +}
>> +
>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { 	struct vcpu_svm
>>  *svm = to_svm(vcpu); @@ -4283,6 +4288,7 @@ static struct kvm_x86_ops
>>  svm_x86_ops = { 	.enable_nmi_window = enable_nmi_window,
>>  	.enable_irq_window = enable_irq_window, 	.update_cr8_intercept =
>>  update_cr8_intercept,
>> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
>> 
>>  	.set_tss_addr = svm_set_tss_addr,
>>  	.get_tdp_level = get_npt_level,
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index e9287aa..c0d74ce 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -86,6 +86,9 @@ module_param(fasteoi, bool, S_IRUGO);
>>  static bool __read_mostly enable_apicv_reg = 0;
>>  module_param(enable_apicv_reg, bool, S_IRUGO);
>> +static bool __read_mostly enable_apicv_vid = 0;
>> +module_param(enable_apicv_vid, bool, S_IRUGO);
>> +
>>  /*
>>   * If nested=1, nested virtualization is supported, i.e., guests may use
>>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>> @@ -432,6 +435,10 @@ struct vcpu_vmx {
>> 
>>  	bool rdtscp_enabled;
>> +	u8 eoi_exitmap_changed;
>> +	u64 eoi_exit_bitmap[4];
>> +	u64 eoi_exit_bitmap_global[4];
>> +
>>  	/* Support for a guest hypervisor (nested VMX) */
>>  	struct nested_vmx nested;
>>  };
>> @@ -770,6 +777,12 @@ static inline bool
> cpu_has_vmx_apic_register_virt(void)
>>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>  }
>> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>> +{
>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>> +}
>> +
>>  static inline bool cpu_has_vmx_flexpriority(void)
>>  {
>>  	return cpu_has_vmx_tpr_shadow() &&
>> @@ -2480,7 +2493,8 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>>  			SECONDARY_EXEC_RDTSCP |
>>  			SECONDARY_EXEC_ENABLE_INVPCID |
>> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>  		if (adjust_vmx_controls(min2, opt2,
>>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>>  					&_cpu_based_2nd_exec_control) < 0)
>> @@ -2494,7 +2508,8 @@ static __init int setup_vmcs_config(struct
>> vmcs_config *vmcs_conf)
>> 
>>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>>  		_cpu_based_2nd_exec_control &= ~(
>> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
>> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> 
>>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) { 		/*
>>  CR3 accesses and invlpg don't need to cause VM Exits when EPT @@
>>  -2696,6 +2711,9 @@ static __init int hardware_setup(void) 	if
>>  (!cpu_has_vmx_apic_register_virt()) 		enable_apicv_reg = 0;
>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>> +		enable_apicv_vid = 0;
>> +
>>  	if (nested)
>>  		nested_vmx_setup_ctls_msrs();
>> @@ -3811,6 +3829,8 @@ static u32 vmx_secondary_exec_control(struct
> vcpu_vmx *vmx)
>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>  	if (!enable_apicv_reg)
>>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +	if (!enable_apicv_vid)
>> +		exec_control &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>  	return exec_control;
>>  }
>> @@ -3855,6 +3875,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  				vmx_secondary_exec_control(vmx));
>>  	}
>> +	if (enable_apicv_vid) {
>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
>> +
>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
>> +	}
>> +
>>  	if (ple_gap) {
>>  		vmcs_write32(PLE_GAP, ple_gap);
>>  		vmcs_write32(PLE_WINDOW, ple_window);
>> @@ -4770,6 +4799,16 @@ static int handle_apic_access(struct kvm_vcpu
> *vcpu)
>>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>>  }
>> +static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> +	int vector = exit_qualification & 0xff;
>> +
>> +	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
>> +	kvm_apic_set_eoi_accelerated(vcpu, vector);
>> +	return 1;
>> +}
>> +
>>  static int handle_apic_write(struct kvm_vcpu *vcpu)
>>  {
>>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> @@ -5719,6 +5758,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct
> kvm_vcpu *vcpu) = {
>>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
>>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
>>  	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
>>  +	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
>>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
>>  	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
>>  	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
>> @@ -6049,6 +6089,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
>> 
>>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>  {
>> +    /* no need for tpr_threshold update if APIC virtual
>> +     * interrupt delivery is enabled */
>> +	if (!enable_apicv_vid)
>> +		return ;
> 
> Just set kvm_x86_ops->update_cr8_intercept to NULL if !enable_apicv_vid
> and the function will not be called.

Sure.
 
>> +
>>  	if (irr == -1 || tpr < irr) {
>>  		vmcs_write32(TPR_THRESHOLD, 0);
>>  		return;
>> @@ -6057,6 +6102,79 @@ static void update_cr8_intercept(struct kvm_vcpu
> *vcpu, int tpr, int irr)
>>  	vmcs_write32(TPR_THRESHOLD, irr);
>>  }
>> +static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
>> +{
>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_vid;
>> +}
>> +
>> +static void vmx_update_irq(struct kvm_vcpu *vcpu)
>> +{
>> +	u16 status;
>> +	u8 old;
>> +	int vector;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (!enable_apicv_vid)
>> +		return ;
> Ditto. Set kvm_x86_ops->update_irq to a function that does nothing if
> !enable_apicv_vid. BTW you do not set this callback in SVM code and call
> it unconditionally.
> 
>> +
>> +	vector = kvm_apic_get_highest_irr(vcpu);
>> +	if (vector == -1)
>> +		return;
>> +
>> +	status = vmcs_read16(GUEST_INTR_STATUS);
>> +	old = (u8)status & 0xff;
>> +	if ((u8)vector != old) {
>> +		status &= ~0xff;
>> +		status |= (u8)vector;
>> +		vmcs_write16(GUEST_INTR_STATUS, status);
>> +	}
> Please write RVI assessor functions.
Sure.

>> +
>> +	if (vmx->eoi_exitmap_changed) {
>> +#define UPDATE_EOI_EXITMAP(v, e) {				\
>> +	if ((v)->eoi_exitmap_changed & (1 << (e)))	\
>> +		vmcs_write64(EOI_EXIT_BITMAP##e,		\
>> +		(v)->eoi_exit_bitmap[e] | (v)->eoi_exit_bitmap_global[e]); }
> Inline function would do. But why calculate this on each entry? We want
> EOI exits only for level IOAPIC interrupts and edge IOAPIC interrupt
> with registered notifiers. This configuration rarely changes.

eoi_exitmap_changed is used to track whether the trig mode is changed. As you said, it changes rarely, so this codes seldom will be executed.

> 
> 
>> +
>> +		UPDATE_EOI_EXITMAP(vmx, 0);
>> +		UPDATE_EOI_EXITMAP(vmx, 1);
>> +		UPDATE_EOI_EXITMAP(vmx, 2);
>> +		UPDATE_EOI_EXITMAP(vmx, 3);
>> +		vmx->eoi_exitmap_changed = 0;
>> +	}
>> +}
>> +
>> +static void vmx_set_eoi_exitmap(struct kvm_vcpu *vcpu,
>> +				int vector,
>> +				int need_eoi, int global)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +	int index, offset, changed;
>> +	unsigned long *eoi_exitmap;
>> +
>> +	if (!enable_apicv_vid)
>> +		return ;
>> +
>> +	if (WARN_ONCE((vector < 0) || (vector > 255),
>> +		"KVM VMX: vector (%d) out of range\n", vector))
>> +		return;
>> +
>> +	index = vector >> 6;
>> +	offset = vector & 63;
>> +	if (global)
>> +		eoi_exitmap =
>> +		    (unsigned long *)&vmx->eoi_exit_bitmap_global[index];
>> +	else
>> +		eoi_exitmap = (unsigned long *)&vmx->eoi_exit_bitmap[index];
>> +
>> +	if (need_eoi)
>> +		changed = !test_and_set_bit(offset, eoi_exitmap);
>> +	else
>> +		changed = test_and_clear_bit(offset, eoi_exitmap);
>> +
>> +	if (changed)
>> +		vmx->eoi_exitmap_changed |= 1 << index;
>> +}
>> +
>>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx) { 	u32
>>  exit_intr_info; @@ -7320,6 +7438,9 @@ static struct kvm_x86_ops
>>  vmx_x86_ops = { 	.enable_nmi_window = enable_nmi_window,
>>  	.enable_irq_window = enable_irq_window, 	.update_cr8_intercept =
>>  update_cr8_intercept,
>> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
>> +	.update_irq = vmx_update_irq,
> You need to initialize this one in svm.c too.
> 
>> +	.set_eoi_exitmap = vmx_set_eoi_exitmap,
>> 
>>  	.set_tss_addr = vmx_set_tss_addr,
>>  	.get_tdp_level = get_ept_level,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 4f76417..8b8de3b 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5190,6 +5190,13 @@ static void inject_pending_event(struct kvm_vcpu
> *vcpu)
>>  			vcpu->arch.nmi_injected = true;
>>  			kvm_x86_ops->set_nmi(vcpu);
>>  		}
>> +	} else if (kvm_apic_vid_enabled(vcpu)) {
>> +		if (kvm_cpu_has_extint(vcpu) &&
>> +		    kvm_x86_ops->interrupt_allowed(vcpu)) {
>> +			kvm_queue_interrupt(vcpu,
>> +				kvm_cpu_get_extint(vcpu), false);
>> +			kvm_x86_ops->set_irq(vcpu);
>> +		}
> Drop all this and modify kvm_cpu_has_interrupt()/kvm_cpu_get_interrupt()
> to consider apic interrupts only if vid is enabled then the if below
> will just work.
Ok.

> 
>>  	} else if (kvm_cpu_has_interrupt(vcpu)) {
>>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
>>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
>> @@ -5289,12 +5296,19 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
>>  	}
>>  
>>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>> +		/* update archtecture specific hints for APIC
>> +		 * virtual interrupt delivery */
>> +		kvm_x86_ops->update_irq(vcpu);
>> +
>>  		inject_pending_event(vcpu);
>>  
>>  		/* enable NMI/IRQ window open exits if needed */
>>  		if (vcpu->arch.nmi_pending)
>>  			kvm_x86_ops->enable_nmi_window(vcpu);
>> -		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>> +		else if (kvm_apic_vid_enabled(vcpu)) {
>> +			if (kvm_cpu_has_extint(vcpu))
>> +				kvm_x86_ops->enable_irq_window(vcpu);
> Same as above. With proper kvm_cpu_has_interrupt() implementation this
> id is not needed.
> 
>> +		} else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>>  			kvm_x86_ops->enable_irq_window(vcpu);
>>  
>>  		if (kvm_lapic_enabled(vcpu)) {
>> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
>> index 166c450..898aa62 100644
>> --- a/virt/kvm/ioapic.c
>> +++ b/virt/kvm/ioapic.c
>> @@ -186,6 +186,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int
> irq)
>>  		/* need to read apic_id from apic regiest since 		 * it can be
>>  rewritten */ 		irqe.dest_id = ioapic->kvm->bsp_vcpu_id;
>>  +		kvm_set_eoi_exitmap(ioapic->kvm->vcpus[0], irqe.vector, 1, 1); 	}
>>  #endif 	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
>> --
>> 1.7.1
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support
  2012-11-23 11:46     ` Zhang, Yang Z
@ 2012-11-25  8:53       ` Gleb Natapov
  0 siblings, 0 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-25  8:53 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com, Tian, Kevin

On Fri, Nov 23, 2012 at 11:46:30AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-22:
> > On Wed, Nov 21, 2012 at 04:09:36PM +0800, Yang Zhang wrote:
> >> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> >> manually, which is fully taken care of by the hardware. This needs
> >> some special awareness into existing interrupr injection path:
> >> 
> >> - for pending interrupt, instead of direct injection, we may need
> >>   update architecture specific indicators before resuming to guest.
> >> - A pending interrupt, which is masked by ISR, should be also
> >>   considered in above update action, since hardware will decide
> >>   when to inject it at right time. Current has_interrupt and
> >>   get_interrupt only returns a valid vector from injection p.o.v.
> >> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |    4 + arch/x86/include/asm/vmx.h   
> >>    |   11 ++++ arch/x86/kvm/irq.c              |   44 ++++++++++++++
> >>  arch/x86/kvm/lapic.c            |   44 +++++++++++++-
> >>  arch/x86/kvm/lapic.h            |   13 ++++ arch/x86/kvm/svm.c        
> >>       |    6 ++ arch/x86/kvm/vmx.c              |  125
> >>  ++++++++++++++++++++++++++++++++++++++- arch/x86/kvm/x86.c            
> >>   |   16 +++++- virt/kvm/ioapic.c               |    1 + 9 files
> >>  changed, 260 insertions(+), 4 deletions(-)
> >> diff --git a/arch/x86/include/asm/kvm_host.h
> >> b/arch/x86/include/asm/kvm_host.h index b2e11f4..8e07a86 100644 ---
> >> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -682,6 +682,10 @@ struct kvm_x86_ops {
> >>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> >> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> >> +	void (*update_irq)(struct kvm_vcpu *vcpu);
> >> +	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
> >> +			int need_eoi, int global);
> >>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>  	int (*get_tdp_level)(void);
> >>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >> index 21101b6..1003341 100644
> >> --- a/arch/x86/include/asm/vmx.h
> >> +++ b/arch/x86/include/asm/vmx.h
> >> @@ -62,6 +62,7 @@
> >>  #define EXIT_REASON_MCE_DURING_VMENTRY  41 #define
> >>  EXIT_REASON_TPR_BELOW_THRESHOLD 43 #define EXIT_REASON_APIC_ACCESS    
> >>      44 +#define EXIT_REASON_EOI_INDUCED         45 #define
> >>  EXIT_REASON_EPT_VIOLATION       48 #define EXIT_REASON_EPT_MISCONFIG  
> >>      49 #define EXIT_REASON_WBINVD              54 @@ -143,6 +144,7 @@
> >>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040 #define
> >>  SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080 #define
> >>  SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100 +#define
> >>  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200 #define
> >>  SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400 #define
> >>  SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
> >> @@ -180,6 +182,7 @@ enum vmcs_field {
> >>  	GUEST_GS_SELECTOR               = 0x0000080a, 	GUEST_LDTR_SELECTOR   
> >>           = 0x0000080c, 	GUEST_TR_SELECTOR               = 0x0000080e,
> >>  +	GUEST_INTR_STATUS               = 0x00000810, 	HOST_ES_SELECTOR     
> >>            = 0x00000c00, 	HOST_CS_SELECTOR                = 0x00000c02,
> >>  	HOST_SS_SELECTOR                = 0x00000c04, @@ -207,6 +210,14 @@
> >>  enum vmcs_field { 	APIC_ACCESS_ADDR_HIGH		= 0x00002015, 	EPT_POINTER  
> >>                    = 0x0000201a, 	EPT_POINTER_HIGH                =
> >>  0x0000201b,
> >> +	EOI_EXIT_BITMAP0                = 0x0000201c,
> >> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
> >> +	EOI_EXIT_BITMAP1                = 0x0000201e,
> >> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
> >> +	EOI_EXIT_BITMAP2                = 0x00002020,
> >> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
> >> +	EOI_EXIT_BITMAP3                = 0x00002022,
> >> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
> >>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
> >>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
> >>  	VMCS_LINK_POINTER               = 0x00002800,
> >> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> >> index 7e06ba1..c7356a3 100644
> >> --- a/arch/x86/kvm/irq.c
> >> +++ b/arch/x86/kvm/irq.c
> >> @@ -60,6 +60,29 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
> >>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
> >>  
> >>  /*
> >> + * check if there is pending interrupt without
> >> + * intack. This _apicv version is used when hardware
> >> + * supports APIC virtualization with virtual interrupt
> >> + * delivery support. In such case, KVM is not required
> >> + * to poll pending APIC interrupt, and thus this
> >> + * interface is used to poll pending interupts from
> >> + * non-APIC source.
> >> + */
> >> +int kvm_cpu_has_extint(struct kvm_vcpu *v)
> >> +{
> >> +	struct kvm_pic *s;
> >> +
> >> +	if (!irqchip_in_kernel(v->kvm))
> >> +		return v->arch.interrupt.pending;
> >> +
> > This does not belong here. If !irqchip_in_kernel() the function will not
> > be called. Hmm actually with !irqchip_in_kernel() kernel will oops in
> > kvm_apic_vid_enabled() since it dereference vcpu->arch.apic without
> > checking if it is NULL.
> 
> Right. Will remove it in next version and add the check in kvm_apic_vid_enabled.
>  
> > 
> >> +	if (kvm_apic_accept_pic_intr(v)) {
> >> +		s = pic_irqchip(v->kvm);	/* PIC */
> >> +		return s->output;
> >> +	} else
> >> +		return 0;
> > This is code duplication from kvm_cpu_has_interrupt(). Write common
> > function and call it from kvm_cpu_has_interrupt(), but even that is
> > not needed, see below.
> 
> Why it is not needed? 
Because it you change kvm_cpu_has_interrupt() like I described below the
code path that uses this function will not be needed.

>  
> >> +}
> >> +
> >> +/*
> >>   * Read pending interrupt vector and intack.
> >>   */
> >>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v) @@ -82,6 +105,27 @@ int
> >>  kvm_cpu_get_interrupt(struct kvm_vcpu *v) }
> >>  EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
> >> +/*
> >> + * Read pending interrupt vector and intack.
> >> + * Similar to kvm_cpu_has_interrupt_apicv, to get
> >> + * interrupts from non-APIC sources.
> >> + */
> >> +int kvm_cpu_get_extint(struct kvm_vcpu *v)
> >> +{
> >> +	struct kvm_pic *s;
> >> +	int vector = -1;
> >> +
> >> +	if (!irqchip_in_kernel(v->kvm))
> >> +		return v->arch.interrupt.nr;
> > Same as above.
> > 
> >> +
> >> +	if (kvm_apic_accept_pic_intr(v)) {
> >> +		s = pic_irqchip(v->kvm);
> >> +		s->output = 0;		/* PIC */
> >> +		vector = kvm_pic_read_irq(v->kvm);
> > Ditto about code duplication.
> > 
> >> +	}
> >> +	return vector;
> >> +}
> >> +
> >>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
> >>  {
> >>  	kvm_inject_apic_timer_irqs(vcpu);
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index a63ffdc..af48361 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -643,6 +643,12 @@ out:
> >>  	return ret;
> >>  }
> >> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
> >> +		int need_eoi, int global)
> >> +{
> >> +	kvm_x86_ops->set_eoi_exitmap(vcpu, vector, need_eoi, global);
> >> +}
> >> +
> >>  /*
> >>   * Add a pending IRQ into lapic.
> >>   * Return 1 if successfully added and 0 if discarded.
> >> @@ -664,8 +670,11 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int
> > delivery_mode,
> >>  		if (trig_mode) {
> >>  			apic_debug("level trig mode for vector %d", vector);
> >>  			apic_set_vector(vector, apic->regs + APIC_TMR);
> >> -		} else
> >> +			kvm_set_eoi_exitmap(vcpu, vector, 1, 0);
> >> +		} else {
> >>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
> >> +			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
> > Why not use APIC_TMR directly instead of kvm_set_eoi_exitmap() logic?
> 
> Good idea. It seems more reasonable. 
> 
> >> +		}
> >> 
> >>  		result = !apic_test_and_set_irr(vector, apic);
> >>  		trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode, @@ -769,6
> >>  +778,26 @@ static int apic_set_eoi(struct kvm_lapic *apic) 	return
> >>  vector; }
> >> +/*
> >> + * this interface assumes a trap-like exit, which has already finished
> >> + * desired side effect including vISR and vPPR update.
> >> + */
> >> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> >> +{
> >> +	struct kvm_lapic *apic = vcpu->arch.apic;
> >> +	int trigger_mode;
> >> +
> >> +	if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
> >> +		trigger_mode = IOAPIC_LEVEL_TRIG;
> >> +	else
> >> +		trigger_mode = IOAPIC_EDGE_TRIG;
> >> +
> >> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
> >> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> >> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> > More code duplication. Why not call apic_set_eoi() and skip isr/ppr
> > logic there if vid is enabled, or put the logic in common function and
> > call from both places.
> 
> Ok, will change it in next patch.
> 
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
> >> +
> >>  static void apic_send_ipi(struct kvm_lapic *apic) { 	u32 icr_low =
> >>  kvm_apic_get_reg(apic, APIC_ICR); @@ -1510,6 +1539,8 @@ int
> >>  kvm_create_lapic(struct kvm_vcpu *vcpu) 	kvm_lapic_reset(vcpu);
> >>  	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
> >> +	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >> +		apic->vid_enabled = true;
> > What do you have vid_enabled for. This is global, not per apic, state.
> When inject interrupt to guest, we need this to check whether vid is enabled. If not, use old way to handle the interrupt.
> I thing put it in apic is reasonable. Though all vcpu use same configuration, APICv feature is per vcpu too.
> 
How APICv is per vcpu? It is global. Just call has_virtual_interrupt_delivery(vcpu)
instead of vid_enabled thing.

> > 
> >>  	return 0; nomem_free_apic: 	kfree(apic); @@ -1533,6 +1564,17 @@ int
> >>  kvm_apic_has_interrupt(struct kvm_vcpu *vcpu) 	return highest_irr; }
> >> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_lapic *apic = vcpu->arch.apic;
> >> +
> >> +	if (!apic || !apic_enabled(apic))
> >> +		return -1;
> >> +
> >> +	return apic_find_highest_irr(apic);
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
> >> +
> >>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
> >>  {
> >>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
> >> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> >> index c42f111..2503a64 100644
> >> --- a/arch/x86/kvm/lapic.h
> >> +++ b/arch/x86/kvm/lapic.h
> >> @@ -20,6 +20,7 @@ struct kvm_lapic {
> >>  	u32 divide_count; 	struct kvm_vcpu *vcpu; 	bool irr_pending; +	bool
> >>  vid_enabled; 	/* Number of bits set in ISR. */ 	s16 isr_count; 	/* The
> >>  highest vector set in ISR; if -1 - invalid, must scan ISR. */ @@ -39,6
> >>  +40,9 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu); int
> >>  kvm_apic_has_interrupt(struct kvm_vcpu *vcpu); int
> >>  kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu); int
> >>  kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
> >> +int kvm_cpu_has_extint(struct kvm_vcpu *v);
> >> +int kvm_cpu_get_extint(struct kvm_vcpu *v);
> >> +int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
> >>  void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64
> >>  kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void
> >>  kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); @@ -50,6
> >>  +54,8 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu); int
> >>  kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest); int
> >>  kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda); int
> >>  kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
> >> +void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
> >> +		int need_eoi, int global);
> >>  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> >>  
> >>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> >> @@ -65,6 +71,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu
> > *vcpu);
> >>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
> >>  
> >>  int kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
> >> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
> >> 
> >>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
> >>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
> >> @@ -81,6 +88,12 @@ static inline bool
> > kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
> >>  	return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
> >>  }
> >> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_lapic *apic = vcpu->arch.apic;
> >> +	return apic->vid_enabled;
> >> +}
> > vcpu->arch.apic can be NULL from where this is called.
> > 
> >> +
> >>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
> >>  void kvm_lapic_init(void);
> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> index d017df3..b290aba 100644
> >> --- a/arch/x86/kvm/svm.c
> >> +++ b/arch/x86/kvm/svm.c
> >> @@ -3564,6 +3564,11 @@ static void update_cr8_intercept(struct kvm_vcpu
> > *vcpu, int tpr, int irr)
> >>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
> >>  }
> >> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { 	struct vcpu_svm
> >>  *svm = to_svm(vcpu); @@ -4283,6 +4288,7 @@ static struct kvm_x86_ops
> >>  svm_x86_ops = { 	.enable_nmi_window = enable_nmi_window,
> >>  	.enable_irq_window = enable_irq_window, 	.update_cr8_intercept =
> >>  update_cr8_intercept,
> >> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
> >> 
> >>  	.set_tss_addr = svm_set_tss_addr,
> >>  	.get_tdp_level = get_npt_level,
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index e9287aa..c0d74ce 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -86,6 +86,9 @@ module_param(fasteoi, bool, S_IRUGO);
> >>  static bool __read_mostly enable_apicv_reg = 0;
> >>  module_param(enable_apicv_reg, bool, S_IRUGO);
> >> +static bool __read_mostly enable_apicv_vid = 0;
> >> +module_param(enable_apicv_vid, bool, S_IRUGO);
> >> +
> >>  /*
> >>   * If nested=1, nested virtualization is supported, i.e., guests may use
> >>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> >> @@ -432,6 +435,10 @@ struct vcpu_vmx {
> >> 
> >>  	bool rdtscp_enabled;
> >> +	u8 eoi_exitmap_changed;
> >> +	u64 eoi_exit_bitmap[4];
> >> +	u64 eoi_exit_bitmap_global[4];
> >> +
> >>  	/* Support for a guest hypervisor (nested VMX) */
> >>  	struct nested_vmx nested;
> >>  };
> >> @@ -770,6 +777,12 @@ static inline bool
> > cpu_has_vmx_apic_register_virt(void)
> >>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >>  }
> >> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> >> +{
> >> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> >> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >> +}
> >> +
> >>  static inline bool cpu_has_vmx_flexpriority(void)
> >>  {
> >>  	return cpu_has_vmx_tpr_shadow() &&
> >> @@ -2480,7 +2493,8 @@ static __init int setup_vmcs_config(struct
> > vmcs_config *vmcs_conf)
> >>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
> >>  			SECONDARY_EXEC_RDTSCP |
> >>  			SECONDARY_EXEC_ENABLE_INVPCID |
> >> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> >> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >>  		if (adjust_vmx_controls(min2, opt2,
> >>  					MSR_IA32_VMX_PROCBASED_CTLS2,
> >>  					&_cpu_based_2nd_exec_control) < 0)
> >> @@ -2494,7 +2508,8 @@ static __init int setup_vmcs_config(struct
> >> vmcs_config *vmcs_conf)
> >> 
> >>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
> >>  		_cpu_based_2nd_exec_control &= ~(
> >> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> >> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> >> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> >> 
> >>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) { 		/*
> >>  CR3 accesses and invlpg don't need to cause VM Exits when EPT @@
> >>  -2696,6 +2711,9 @@ static __init int hardware_setup(void) 	if
> >>  (!cpu_has_vmx_apic_register_virt()) 		enable_apicv_reg = 0;
> >> +	if (!cpu_has_vmx_virtual_intr_delivery())
> >> +		enable_apicv_vid = 0;
> >> +
> >>  	if (nested)
> >>  		nested_vmx_setup_ctls_msrs();
> >> @@ -3811,6 +3829,8 @@ static u32 vmx_secondary_exec_control(struct
> > vcpu_vmx *vmx)
> >>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> >>  	if (!enable_apicv_reg)
> >>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >> +	if (!enable_apicv_vid)
> >> +		exec_control &= ~SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >>  	return exec_control;
> >>  }
> >> @@ -3855,6 +3875,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>  				vmx_secondary_exec_control(vmx));
> >>  	}
> >> +	if (enable_apicv_vid) {
> >> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> >> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> >> +
> >> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> >> +	}
> >> +
> >>  	if (ple_gap) {
> >>  		vmcs_write32(PLE_GAP, ple_gap);
> >>  		vmcs_write32(PLE_WINDOW, ple_window);
> >> @@ -4770,6 +4799,16 @@ static int handle_apic_access(struct kvm_vcpu
> > *vcpu)
> >>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
> >>  }
> >> +static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
> >> +{
> >> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> >> +	int vector = exit_qualification & 0xff;
> >> +
> >> +	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
> >> +	kvm_apic_set_eoi_accelerated(vcpu, vector);
> >> +	return 1;
> >> +}
> >> +
> >>  static int handle_apic_write(struct kvm_vcpu *vcpu)
> >>  {
> >>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> >> @@ -5719,6 +5758,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct
> > kvm_vcpu *vcpu) = {
> >>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
> >>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
> >>  	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
> >>  +	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
> >>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
> >>  	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
> >>  	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
> >> @@ -6049,6 +6089,11 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
> >> 
> >>  static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> >>  {
> >> +    /* no need for tpr_threshold update if APIC virtual
> >> +     * interrupt delivery is enabled */
> >> +	if (!enable_apicv_vid)
> >> +		return ;
> > 
> > Just set kvm_x86_ops->update_cr8_intercept to NULL if !enable_apicv_vid
> > and the function will not be called.
> 
> Sure.
>  
> >> +
> >>  	if (irr == -1 || tpr < irr) {
> >>  		vmcs_write32(TPR_THRESHOLD, 0);
> >>  		return;
> >> @@ -6057,6 +6102,79 @@ static void update_cr8_intercept(struct kvm_vcpu
> > *vcpu, int tpr, int irr)
> >>  	vmcs_write32(TPR_THRESHOLD, irr);
> >>  }
> >> +static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_vid;
> >> +}
> >> +
> >> +static void vmx_update_irq(struct kvm_vcpu *vcpu)
> >> +{
> >> +	u16 status;
> >> +	u8 old;
> >> +	int vector;
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	if (!enable_apicv_vid)
> >> +		return ;
> > Ditto. Set kvm_x86_ops->update_irq to a function that does nothing if
> > !enable_apicv_vid. BTW you do not set this callback in SVM code and call
> > it unconditionally.
> > 
> >> +
> >> +	vector = kvm_apic_get_highest_irr(vcpu);
> >> +	if (vector == -1)
> >> +		return;
> >> +
> >> +	status = vmcs_read16(GUEST_INTR_STATUS);
> >> +	old = (u8)status & 0xff;
> >> +	if ((u8)vector != old) {
> >> +		status &= ~0xff;
> >> +		status |= (u8)vector;
> >> +		vmcs_write16(GUEST_INTR_STATUS, status);
> >> +	}
> > Please write RVI assessor functions.
> Sure.
> 
> >> +
> >> +	if (vmx->eoi_exitmap_changed) {
> >> +#define UPDATE_EOI_EXITMAP(v, e) {				\
> >> +	if ((v)->eoi_exitmap_changed & (1 << (e)))	\
> >> +		vmcs_write64(EOI_EXIT_BITMAP##e,		\
> >> +		(v)->eoi_exit_bitmap[e] | (v)->eoi_exit_bitmap_global[e]); }
> > Inline function would do. But why calculate this on each entry? We want
> > EOI exits only for level IOAPIC interrupts and edge IOAPIC interrupt
> > with registered notifiers. This configuration rarely changes.
> 
> eoi_exitmap_changed is used to track whether the trig mode is changed. As you said, it changes rarely, so this codes seldom will be executed.
> 
But code still checks whether bitmap was changed during each interrupt
injection. Recalculate bitmap when notifier is added/removed or ioapic
configuration changes. Use request bit to reload new bitmap.

> > 
> > 
> >> +
> >> +		UPDATE_EOI_EXITMAP(vmx, 0);
> >> +		UPDATE_EOI_EXITMAP(vmx, 1);
> >> +		UPDATE_EOI_EXITMAP(vmx, 2);
> >> +		UPDATE_EOI_EXITMAP(vmx, 3);
> >> +		vmx->eoi_exitmap_changed = 0;
> >> +	}
> >> +}
> >> +
> >> +static void vmx_set_eoi_exitmap(struct kvm_vcpu *vcpu,
> >> +				int vector,
> >> +				int need_eoi, int global)
> >> +{
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +	int index, offset, changed;
> >> +	unsigned long *eoi_exitmap;
> >> +
> >> +	if (!enable_apicv_vid)
> >> +		return ;
> >> +
> >> +	if (WARN_ONCE((vector < 0) || (vector > 255),
> >> +		"KVM VMX: vector (%d) out of range\n", vector))
> >> +		return;
> >> +
> >> +	index = vector >> 6;
> >> +	offset = vector & 63;
> >> +	if (global)
> >> +		eoi_exitmap =
> >> +		    (unsigned long *)&vmx->eoi_exit_bitmap_global[index];
> >> +	else
> >> +		eoi_exitmap = (unsigned long *)&vmx->eoi_exit_bitmap[index];
> >> +
> >> +	if (need_eoi)
> >> +		changed = !test_and_set_bit(offset, eoi_exitmap);
> >> +	else
> >> +		changed = test_and_clear_bit(offset, eoi_exitmap);
> >> +
> >> +	if (changed)
> >> +		vmx->eoi_exitmap_changed |= 1 << index;
> >> +}
> >> +
> >>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx) { 	u32
> >>  exit_intr_info; @@ -7320,6 +7438,9 @@ static struct kvm_x86_ops
> >>  vmx_x86_ops = { 	.enable_nmi_window = enable_nmi_window,
> >>  	.enable_irq_window = enable_irq_window, 	.update_cr8_intercept =
> >>  update_cr8_intercept,
> >> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> >> +	.update_irq = vmx_update_irq,
> > You need to initialize this one in svm.c too.
> > 
> >> +	.set_eoi_exitmap = vmx_set_eoi_exitmap,
> >> 
> >>  	.set_tss_addr = vmx_set_tss_addr,
> >>  	.get_tdp_level = get_ept_level,
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 4f76417..8b8de3b 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -5190,6 +5190,13 @@ static void inject_pending_event(struct kvm_vcpu
> > *vcpu)
> >>  			vcpu->arch.nmi_injected = true;
> >>  			kvm_x86_ops->set_nmi(vcpu);
> >>  		}
> >> +	} else if (kvm_apic_vid_enabled(vcpu)) {
> >> +		if (kvm_cpu_has_extint(vcpu) &&
> >> +		    kvm_x86_ops->interrupt_allowed(vcpu)) {
> >> +			kvm_queue_interrupt(vcpu,
> >> +				kvm_cpu_get_extint(vcpu), false);
> >> +			kvm_x86_ops->set_irq(vcpu);
> >> +		}
> > Drop all this and modify kvm_cpu_has_interrupt()/kvm_cpu_get_interrupt()
> > to consider apic interrupts only if vid is enabled then the if below
> > will just work.
> Ok.
> 
> > 
> >>  	} else if (kvm_cpu_has_interrupt(vcpu)) {
> >>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
> >>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
> >> @@ -5289,12 +5296,19 @@ static int vcpu_enter_guest(struct kvm_vcpu
> > *vcpu)
> >>  	}
> >>  
> >>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> >> +		/* update archtecture specific hints for APIC
> >> +		 * virtual interrupt delivery */
> >> +		kvm_x86_ops->update_irq(vcpu);
> >> +
> >>  		inject_pending_event(vcpu);
> >>  
> >>  		/* enable NMI/IRQ window open exits if needed */
> >>  		if (vcpu->arch.nmi_pending)
> >>  			kvm_x86_ops->enable_nmi_window(vcpu);
> >> -		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> >> +		else if (kvm_apic_vid_enabled(vcpu)) {
> >> +			if (kvm_cpu_has_extint(vcpu))
> >> +				kvm_x86_ops->enable_irq_window(vcpu);
> > Same as above. With proper kvm_cpu_has_interrupt() implementation this
> > id is not needed.
> > 
> >> +		} else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> >>  			kvm_x86_ops->enable_irq_window(vcpu);
> >>  
> >>  		if (kvm_lapic_enabled(vcpu)) {
> >> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> >> index 166c450..898aa62 100644
> >> --- a/virt/kvm/ioapic.c
> >> +++ b/virt/kvm/ioapic.c
> >> @@ -186,6 +186,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int
> > irq)
> >>  		/* need to read apic_id from apic regiest since 		 * it can be
> >>  rewritten */ 		irqe.dest_id = ioapic->kvm->bsp_vcpu_id;
> >>  +		kvm_set_eoi_exitmap(ioapic->kvm->vcpus[0], irqe.vector, 1, 1); 	}
> >>  #endif 	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
> >> --
> >> 1.7.1
> > 
> > --
> > 			Gleb.
> 
> 
> Best regards,
> Yang
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-21  8:09 ` [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting Yang Zhang
@ 2012-11-25 12:39   ` Gleb Natapov
  2012-11-26  3:51     ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-25 12:39 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, mtosatti

On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> without any vmexit.
> 
> - When delivering a interrupt to guest, if target vcpu is running,
>   update Posted-interrupt requests bitmap and send a notification event
>   to the vcpu. Then the vcpu will handle this interrupt automatically,
>   without any software involvemnt.
> 
Looks like you allocating one irq vector per vcpu per pcpu and then
migrate it or reallocate when vcpu move from one pcpu to another.
This is not scalable and migrating irq migration slows things down.
What's wrong with allocating one global vector for posted interrupt
during vmx initialization and use it for all vcpus?

> - If target vcpu is not running or there already a notification event
>   pending in the vcpu, do nothing. The interrupt will be handled by old
>   way.
> 
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    3 +
>  arch/x86/include/asm/vmx.h      |    4 +
>  arch/x86/kernel/apic/io_apic.c  |  138 ++++++++++++++++++++++++++++
>  arch/x86/kvm/lapic.c            |   31 ++++++-
>  arch/x86/kvm/lapic.h            |    8 ++
>  arch/x86/kvm/vmx.c              |  192 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/x86.c              |    2 +
>  include/linux/kvm_host.h        |    1 +
>  virt/kvm/kvm_main.c             |    2 +
>  9 files changed, 372 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8e07a86..1145894 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>  	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> +	int (*has_posted_interrupt)(struct kvm_vcpu *vcpu);
>  	void (*update_irq)(struct kvm_vcpu *vcpu);
>  	void (*set_eoi_exitmap)(struct kvm_vcpu *vcpu, int vector,
>  			int need_eoi, int global);
> +	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
> +	void (*pi_migrate)(struct kvm_vcpu *vcpu);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 1003341..7b9e1d0 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -152,6 +152,7 @@
>  #define PIN_BASED_EXT_INTR_MASK                 0x00000001
>  #define PIN_BASED_NMI_EXITING                   0x00000008
>  #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
> +#define PIN_BASED_POSTED_INTR                   0x00000080
>  
>  #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002
>  #define VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200
> @@ -174,6 +175,7 @@
>  /* VMCS Encodings */
>  enum vmcs_field {
>  	VIRTUAL_PROCESSOR_ID            = 0x00000000,
> +	POSTED_INTR_NV                  = 0x00000002,
>  	GUEST_ES_SELECTOR               = 0x00000800,
>  	GUEST_CS_SELECTOR               = 0x00000802,
>  	GUEST_SS_SELECTOR               = 0x00000804,
> @@ -208,6 +210,8 @@ enum vmcs_field {
>  	VIRTUAL_APIC_PAGE_ADDR_HIGH     = 0x00002013,
>  	APIC_ACCESS_ADDR		= 0x00002014,
>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
> +	POSTED_INTR_DESC_ADDR           = 0x00002016,
> +	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
>  	EPT_POINTER                     = 0x0000201a,
>  	EPT_POINTER_HIGH                = 0x0000201b,
>  	EOI_EXIT_BITMAP0                = 0x0000201c,
> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> index 1817fa9..97cb8ee 100644
> --- a/arch/x86/kernel/apic/io_apic.c
> +++ b/arch/x86/kernel/apic/io_apic.c
> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
>  }
>  #endif
>  
> +static int
> +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
> +		      bool force)
> +{
> +	unsigned int dest;
> +	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
> +	if (cpumask_equal(cfg->domain, mask))
> +		return IRQ_SET_MASK_OK;
> +
> +	if (__ioapic_set_affinity(data, mask, &dest))
> +		return -1;
> +
> +	return IRQ_SET_MASK_OK;
> +}
> +
> +static void pi_mask(struct irq_data *data)
> +{
> +	;
> +}
> +
> +static void pi_unmask(struct irq_data *data)
> +{
> +	;
> +}
> +
> +static struct irq_chip pi_chip = {
> +	.name       = "POSTED-INTR",
> +	.irq_ack    = ack_apic_edge,
> +	.irq_unmask = pi_unmask,
> +	.irq_mask   = pi_mask,
> +	.irq_set_affinity   = pi_set_affinity,
> +};
> +
> +int arch_pi_migrate(int irq, int cpu)
> +{
> +	struct irq_data *data = irq_get_irq_data(irq);
> +	struct irq_cfg *cfg;
> +	struct irq_desc *desc = irq_to_desc(irq);
> +	unsigned long flags;
> +
> +	if (!desc)
> +		return -EINVAL;
> +
> +	cfg = irq_cfg(irq);
> +	if (cpumask_equal(cfg->domain, cpumask_of(cpu)))
> +		return cfg->vector;
> +
> +	irq_set_affinity(irq, cpumask_of(cpu));
> +	raw_spin_lock_irqsave(&desc->lock, flags);
> +	irq_move_irq(data);
> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> +
> +	if (cfg->move_in_progress)
> +		send_cleanup_vector(cfg);
> +	return cfg->vector;
> +}
> +EXPORT_SYMBOL_GPL(arch_pi_migrate);
> +
> +static int arch_pi_create_irq(const struct cpumask *mask)
> +{
> +	int node = cpu_to_node(0);
> +	unsigned int irq_want;
> +	struct irq_cfg *cfg;
> +	unsigned long flags;
> +	unsigned int ret = 0;
> +	int irq;
> +
> +	irq_want = nr_irqs_gsi;
> +
> +	irq = alloc_irq_from(irq_want, node);
> +	if (irq < 0)
> +		return 0;
> +	cfg = alloc_irq_cfg(irq_want, node);
s/irq_want/irq.

> +	if (!cfg) {
> +		free_irq_at(irq, NULL);
> +		return 0;
> +	}
> +
> +	raw_spin_lock_irqsave(&vector_lock, flags);
> +	if (!__assign_irq_vector(irq, cfg, mask))
> +		ret = irq;
> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> +
> +	if (ret) {
> +		irq_set_chip_data(irq, cfg);
> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
> +	} else {
> +		free_irq_at(irq, cfg);
> +	}
> +	return ret;
> +}

This function is mostly cut&paste of create_irq_nr().


> +
> +int arch_pi_alloc_irq(void *vmx)
> +{
> +	int irq, cpu = smp_processor_id();
> +	struct irq_cfg *cfg;
> +
> +	irq = arch_pi_create_irq(cpumask_of(cpu));
> +	if (!irq) {
> +		pr_err("Posted Interrupt: no free irq\n");
> +		return -EINVAL;
> +	}
> +	irq_set_handler_data(irq, vmx);
> +	irq_set_chip_and_handler_name(irq, &pi_chip, handle_edge_irq, "edge");
> +	irq_set_status_flags(irq, IRQ_MOVE_PCNTXT);
> +	irq_set_affinity(irq, cpumask_of(cpu));
> +
> +	cfg = irq_cfg(irq);
> +	if (cfg->move_in_progress)
> +		send_cleanup_vector(cfg);
> +
> +	return irq;
> +}
> +EXPORT_SYMBOL_GPL(arch_pi_alloc_irq);
> +
> +void arch_pi_free_irq(unsigned int irq, void *vmx)
> +{
> +	if (irq) {
> +		irq_set_handler_data(irq, NULL);
> +		/* This will mask the irq */
> +		free_irq(irq, vmx);
> +		destroy_irq(irq);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(arch_pi_free_irq);
> +
> +int arch_pi_get_vector(unsigned int irq)
> +{
> +	struct irq_cfg *cfg;
> +
> +	if (!irq)
> +		return -EINVAL;
> +
> +	cfg = irq_cfg(irq);
> +	return cfg->vector;
> +}
> +EXPORT_SYMBOL_GPL(arch_pi_get_vector);
> +
>  #ifdef CONFIG_HPET_TIMER
>  
>  static int hpet_msi_set_affinity(struct irq_data *data,
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index af48361..04220de 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -656,7 +656,7 @@ void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int vector,
>  static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  			     int vector, int level, int trig_mode)
>  {
> -	int result = 0;
> +	int result = 0, send;
>  	struct kvm_vcpu *vcpu = apic->vcpu;
>  
>  	switch (delivery_mode) {
> @@ -674,6 +674,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>  		} else {
>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
>  			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
> +			if (kvm_apic_pi_enabled(vcpu)) {
Provide send_nv() that returns 0 if pi is disabled.

> +				send = kvm_x86_ops->send_nv(vcpu, vector);
> +				if (send) {
No need "send" variable here.

> +					result = 1;
> +					break;
> +				}
> +			}
>  		}
>  
>  		result = !apic_test_and_set_irr(vector, apic);
> @@ -1541,6 +1548,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
>  
>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>  		apic->vid_enabled = true;
> +
> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
> +		apic->pi_enabled = true;
> +
This is global state, no need per apic variable.

>  	return 0;
>  nomem_free_apic:
>  	kfree(apic);
> @@ -1575,6 +1586,24 @@ int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
>  
> +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	unsigned int *reg;
> +	unsigned int i;
> +
> +	if (!apic || !apic_enabled(apic))
Use kvm_vcpu_has_lapic() instead of !apic.

> +		return;
> +
> +	for (i = 0; i <= 7; i++) {
> +		reg = apic->regs + APIC_IRR + i * 0x10;
> +		*reg |= pir[i];
Non atomic access to IRR. Other threads may set bit there concurrently.

> +		pir[i] = 0;
> +	}
Should set apic->irr_pending to true when setting irr bit.

> +	return;
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
> +
>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>  {
>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 2503a64..ad35868 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -21,6 +21,7 @@ struct kvm_lapic {
>  	struct kvm_vcpu *vcpu;
>  	bool irr_pending;
>  	bool vid_enabled;
> +	bool pi_enabled;
>  	/* Number of bits set in ISR. */
>  	s16 isr_count;
>  	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
> @@ -43,6 +44,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_cpu_has_extint(struct kvm_vcpu *v);
>  int kvm_cpu_get_extint(struct kvm_vcpu *v);
>  int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
> +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir);
>  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> @@ -94,6 +96,12 @@ static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
>  	return apic->vid_enabled;
>  }
>  
> +static inline bool kvm_apic_pi_enabled(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +	return apic->pi_enabled;
> +}
> +
>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
>  void kvm_lapic_init(void);
>  
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f6ef090..6448b96 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -31,6 +31,7 @@
>  #include <linux/ftrace_event.h>
>  #include <linux/slab.h>
>  #include <linux/tboot.h>
> +#include <linux/interrupt.h>
>  #include "kvm_cache_regs.h"
>  #include "x86.h"
>  
> @@ -89,6 +90,8 @@ module_param(enable_apicv_reg, bool, S_IRUGO);
>  static bool __read_mostly enable_apicv_vid = 0;
>  module_param(enable_apicv_vid, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_apicv_pi = 0;
> +module_param(enable_apicv_pi, bool, S_IRUGO);
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -372,6 +375,44 @@ struct nested_vmx {
>  	struct page *apic_access_page;
>  };
>  
> +/* Posted-Interrupt Descriptor */
> +struct pi_desc {
> +	u32 pir[8];     /* Posted interrupt requested */
> +	union {
> +		struct {
> +			u8  on:1,
> +			    rsvd:7;
> +		} control;
> +		u32 rsvd[8];
> +	} u;
> +} __aligned(64);
> +
> +#define POSTED_INTR_ON  0
> +u8 pi_test_on(struct pi_desc *pi_desc)
> +{
> +	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> +}
> +void pi_set_on(struct pi_desc *pi_desc)
> +{
> +	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> +}
> +
> +void pi_clear_on(struct pi_desc *pi_desc)
> +{
> +	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> +}
> +
> +u8 pi_test_and_set_on(struct pi_desc *pi_desc)
> +{
> +	return test_and_set_bit(POSTED_INTR_ON,
> +			(unsigned long *)&pi_desc->u.control);
> +}
> +
> +void pi_set_pir(int vector, struct pi_desc *pi_desc)
> +{
> +	set_bit(vector, (unsigned long *)pi_desc->pir);
> +}
> +
>  struct vcpu_vmx {
>  	struct kvm_vcpu       vcpu;
>  	unsigned long         host_rsp;
> @@ -439,6 +480,11 @@ struct vcpu_vmx {
>  	u64 eoi_exit_bitmap[4];
>  	u64 eoi_exit_bitmap_global[4];
>  
> +	/* Posted interrupt descriptor */
> +	struct pi_desc *pi;
> +	u32 irq;
> +	u32 vector;
> +
>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
>  };
> @@ -698,6 +744,11 @@ static u64 host_efer;
>  
>  static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
>  
> +int arch_pi_get_vector(unsigned int irq);
> +int arch_pi_alloc_irq(struct vcpu_vmx *vmx);
> +void arch_pi_free_irq(unsigned int irq, struct vcpu_vmx *vmx);
> +int arch_pi_migrate(int irq, int cpu);
> +
>  /*
>   * Keep MSR_STAR at the end, as setup_msrs() will try to optimize it
>   * away by decrementing the array size.
> @@ -783,6 +834,11 @@ static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>  		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  }
>  
> +static inline bool cpu_has_vmx_posted_intr(void)
> +{
> +	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>  		unsigned long sysenter_esp;
>  
> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
> +			pi_set_on(to_vmx(vcpu)->pi);
> +
Why?

> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> +
>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>  		local_irq_disable();
>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
> @@ -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>  		vcpu->cpu = -1;
>  		kvm_cpu_vmxoff();
>  	}
> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> +		pi_set_on(to_vmx(vcpu)->pi);
Why?

>  }
>  
>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  	u32 _vmexit_control = 0;
>  	u32 _vmentry_control = 0;
>  
> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> -	opt = PIN_BASED_VIRTUAL_NMIS;
> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> -				&_pin_based_exec_control) < 0)
> -		return -EIO;
> -
>  	min = CPU_BASED_HLT_EXITING |
>  #ifdef CONFIG_X86_64
>  	      CPU_BASED_CR8_LOAD_EXITING |
> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  				&_vmexit_control) < 0)
>  		return -EIO;
>  
> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> +				&_pin_based_exec_control) < 0)
> +		return -EIO;
> +
> +	if (!(_cpu_based_2nd_exec_control &
> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
> +
>  	min = 0;
>  	opt = VM_ENTRY_LOAD_IA32_PAT;
>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
> @@ -2715,6 +2783,9 @@ static __init int hardware_setup(void)
>  	if (!cpu_has_vmx_virtual_intr_delivery())
>  		enable_apicv_vid = 0;
>  
> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
In nested guest x2apic may be enabled without irq remapping. Check for
irq remapping here.

> +		enable_apicv_pi = 0;
> +
>  	if (nested)
>  		nested_vmx_setup_ctls_msrs();
>  
> @@ -3881,6 +3952,93 @@ static void ept_set_mmio_spte_mask(void)
>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
>  }
>  
> +irqreturn_t pi_handler(int irq, void *data)
> +{
> +	struct vcpu_vmx *vmx = data;
> +
> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> +	kvm_vcpu_kick(&vmx->vcpu);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> +}
> +
> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> +{
> +	int ret = 0;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!enable_apicv_pi)
> +		return ;
> +
> +	preempt_disable();
> +	local_irq_disable();
> +	if (!vmx->irq) {
> +		ret = arch_pi_alloc_irq(vmx);
> +		if (ret < 0) {
> +			vmx->irq = -1;
> +			goto out;
> +		}
> +		vmx->irq = ret;
> +
> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> +					"Posted Interrupt", vmx);
> +		if (ret) {
> +			vmx->irq = -1;
> +			goto out;
> +		}
> +
> +		ret = arch_pi_get_vector(vmx->irq);
> +	} else
> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> +
> +	if (ret < 0) {
> +		vmx->irq = -1;
> +		goto out;
> +	} else {
> +		vmx->vector = ret;
> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
> +		pi_clear_on(vmx->pi);
> +	}
> +out:
> +	local_irq_enable();
> +	preempt_enable();
> +	return ;
> +}
> +
> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> +		int vector)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (unlikely(vmx->irq == -1))
> +		return 0;
> +
> +	if (vcpu->cpu == smp_processor_id()) {
> +		pi_set_on(vmx->pi);
Why? You clear this bit anyway in vmx_update_irq() during guest entry.

> +		return 0;
> +	}
> +
> +	pi_set_pir(vector, vmx->pi);
> +	if (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static void free_pi(struct vcpu_vmx *vmx)
> +{
> +	if (enable_apicv_pi) {
> +		kfree(vmx->pi);
> +		arch_pi_free_irq(vmx->irq, vmx);
> +	}
> +}
> +
>  /*
>   * Sets up the vmcs for emulated real mode.
>   */
> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  	unsigned long a;
>  #endif
>  	int i;
> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
>  
>  	/* I/O */
>  	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
> @@ -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
>  
>  	/* Control */
> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> -		vmcs_config.pin_based_exec_ctrl);
> +	if (!enable_apicv_pi)
> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
> +
> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
>  
>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx));
>  
> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  		vmcs_write16(GUEST_INTR_STATUS, 0);
>  	}
>  
> +	if (enable_apicv_pi) {
> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
> +				GFP_KERNEL | __GFP_ZERO);
> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
> +	}
> +
>  	if (ple_gap) {
>  		vmcs_write32(PLE_GAP, ple_gap);
>  		vmcs_write32(PLE_WINDOW, ple_window);
> @@ -6161,6 +6328,11 @@ static void vmx_update_irq(struct kvm_vcpu *vcpu)
>  	if (!enable_apicv_vid)
>  		return ;
>  
> +	if (enable_apicv_pi) {
> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> +		pi_clear_on(vmx->pi);
Why do you do that? Isn't VMX process posted interrupts on vmentry if "on" bit
is set?

> +	}
> +
>  	vector = kvm_apic_get_highest_irr(vcpu);
>  	if (vector == -1)
>  		return;
> @@ -6586,6 +6758,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>  
>  	free_vpid(vmx);
>  	free_nested(vmx);
> +	free_pi(vmx);
>  	free_loaded_vmcs(vmx->loaded_vmcs);
>  	kfree(vmx->guest_msrs);
>  	kvm_vcpu_uninit(vcpu);
> @@ -7483,8 +7656,11 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
>  	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> +	.has_posted_interrupt = vmx_has_posted_interrupt,
>  	.update_irq = vmx_update_irq,
>  	.set_eoi_exitmap = vmx_set_eoi_exitmap,
> +	.send_nv = vmx_send_nv,
> +	.pi_migrate = vmx_pi_migrate,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8b8de3b..f035267 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5250,6 +5250,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	bool req_immediate_exit = 0;
>  
>  	if (vcpu->requests) {
> +		if (kvm_check_request(KVM_REQ_POSTED_INTR, vcpu))
> +			kvm_x86_ops->pi_migrate(vcpu);
>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
>  			kvm_mmu_unload(vcpu);
>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ecc5543..f8d8d34 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -107,6 +107,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_IMMEDIATE_EXIT    15
>  #define KVM_REQ_PMU               16
>  #define KVM_REQ_PMI               17
> +#define KVM_REQ_POSTED_INTR       18
>  
>  #define KVM_USERSPACE_IRQ_SOURCE_ID		0
>  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index be70035..05baf1c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1625,6 +1625,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
>  			smp_send_reschedule(cpu);
>  	put_cpu();
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
> +
>  #endif /* !CONFIG_S390 */
>  
>  void kvm_resched(struct kvm_vcpu *vcpu)
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-22 15:22   ` Gleb Natapov
  2012-11-23  5:41     ` Zhang, Yang Z
@ 2012-11-25 12:55     ` Avi Kivity
  2012-11-25 13:03       ` Gleb Natapov
  1 sibling, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-11-25 12:55 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, mtosatti

On 11/22/2012 05:22 PM, Gleb Natapov wrote:
> On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
>> Ack interrupt on vmexit is required by Posted Interrupt. With it,
>> when external interrupt caused vmexit, the cpu will acknowledge the
>> interrupt controller and save the interrupt's vector in vmcs.
>> 
>> There are several approaches to enable it. This patch uses a simply
>> way: re-generate an interrupt via self ipi.
>> 
>> 
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 7949d21..f6ef090 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>  #ifdef CONFIG_X86_64
>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>  #endif
>> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
>> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
>> +		VM_EXIT_ACK_INTR_ON_EXIT;
> Always? Do it only if posted interrupts are actually available
> and going to be used.

Why not always?  Better to have a single code path for host interrupts
(and as Yang notes, the new path is faster as well).



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-25 12:55     ` Avi Kivity
@ 2012-11-25 13:03       ` Gleb Natapov
  2012-11-25 13:11         ` Avi Kivity
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-25 13:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Yang Zhang, kvm, mtosatti

On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
> On 11/22/2012 05:22 PM, Gleb Natapov wrote:
> > On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
> >> Ack interrupt on vmexit is required by Posted Interrupt. With it,
> >> when external interrupt caused vmexit, the cpu will acknowledge the
> >> interrupt controller and save the interrupt's vector in vmcs.
> >> 
> >> There are several approaches to enable it. This patch uses a simply
> >> way: re-generate an interrupt via self ipi.
> >> 
> >> 
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 7949d21..f6ef090 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >>  #ifdef CONFIG_X86_64
> >>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
> >>  #endif
> >> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
> >> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
> >> +		VM_EXIT_ACK_INTR_ON_EXIT;
> > Always? Do it only if posted interrupts are actually available
> > and going to be used.
> 
> Why not always?  Better to have a single code path for host interrupts
> (and as Yang notes, the new path is faster as well).
> 
Is it? The current path is:

vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable
interrupt -> cpu ack the interrupt and interrupt deliver through the
host IDT.

The proposed path is:

CPU acks interrupt -> vm exit -> KVM vmexit handler(interrupt disabled)
-> eoi -> self IPI -> KVM re-enable interrupt -> cpu ack the interrupt
and interrupt deliver through the host IDT.

Am I missing something?

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-25 13:03       ` Gleb Natapov
@ 2012-11-25 13:11         ` Avi Kivity
  2012-11-26  5:44           ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Avi Kivity @ 2012-11-25 13:11 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, mtosatti

On 11/25/2012 03:03 PM, Gleb Natapov wrote:
> On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
>> On 11/22/2012 05:22 PM, Gleb Natapov wrote:
>> > On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
>> >> Ack interrupt on vmexit is required by Posted Interrupt. With it,
>> >> when external interrupt caused vmexit, the cpu will acknowledge the
>> >> interrupt controller and save the interrupt's vector in vmcs.
>> >> 
>> >> There are several approaches to enable it. This patch uses a simply
>> >> way: re-generate an interrupt via self ipi.
>> >> 
>> >> 
>> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> index 7949d21..f6ef090 100644
>> >> --- a/arch/x86/kvm/vmx.c
>> >> +++ b/arch/x86/kvm/vmx.c
>> >> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>> >>  #ifdef CONFIG_X86_64
>> >>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>> >>  #endif
>> >> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
>> >> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
>> >> +		VM_EXIT_ACK_INTR_ON_EXIT;
>> > Always? Do it only if posted interrupts are actually available
>> > and going to be used.
>> 
>> Why not always?  Better to have a single code path for host interrupts
>> (and as Yang notes, the new path is faster as well).
>> 
> Is it? The current path is:
> 
> vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable
> interrupt -> cpu ack the interrupt and interrupt deliver through the
> host IDT.
> 
> The proposed path is:
> 
> CPU acks interrupt -> vm exit -> KVM vmexit handler(interrupt disabled)
> -> eoi -> self IPI -> KVM re-enable interrupt -> cpu ack the interrupt
> and interrupt deliver through the host IDT.
> 
> Am I missing something?

Yes, you're missing the part where I didn't write that the new path
should avoid the IDT and dispatch the interrupt directly, by emulating
an interrupt frame directly.  Can be as simple as pushf; push cs; call
interrupt_table[vector * 8].  Of course we need to verify that no
interrupt uses the IST or a task gate.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-23  5:41     ` Zhang, Yang Z
@ 2012-11-25 13:30       ` Gleb Natapov
  0 siblings, 0 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-25 13:30 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

On Fri, Nov 23, 2012 at 05:41:49AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-22:
> > On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
> >> Ack interrupt on vmexit is required by Posted Interrupt. With it,
> >> when external interrupt caused vmexit, the cpu will acknowledge the
> >> interrupt controller and save the interrupt's vector in vmcs.
> >> 
> >> There are several approaches to enable it. This patch uses a simply
> >> way: re-generate an interrupt via self ipi.
> >> 
> >> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >> ---
> >>  arch/x86/kvm/vmx.c |   11 ++++++++++-
> >>  1 files changed, 10 insertions(+), 1 deletions(-)
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 7949d21..f6ef090 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct
> > vmcs_config *vmcs_conf)
> >>  #ifdef CONFIG_X86_64
> >>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
> >>  #endif
> >> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
> >> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
> >> +		VM_EXIT_ACK_INTR_ON_EXIT;
> > Always? Do it only if posted interrupts are actually available
> > and going to be used.
> 
> Right. 
> But the currently interrupt handler path is too long:
> vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable interrupt -> cpu ack the interrupt and interrupt deliver through the host IDT
> This will bring extra cost for interrupt belongs to guest. After enable "acknowledge interrupt on exit", we can inject the interrupt right way after vm exit if the interrupt is for the guest(This patch doesn't do this).
> 
> Since we only want to enable "acknowledge interrupt on exit" for Posted Interrupt, probably, we can enable it when PI is available.
> 
> >>  	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
> >>  				&_vmexit_control) < 0)
> >>  		return -EIO;
> >> @@ -4457,6 +4458,14 @@ static int handle_exception(struct kvm_vcpu *vcpu)
> >> 
> >>  static int handle_external_interrupt(struct kvm_vcpu *vcpu)
> >>  {
> >> +	unsigned int vector;
> >> +
> >> +	vector = vmcs_read32(VM_EXIT_INTR_INFO);
> >> +	vector &= INTR_INFO_VECTOR_MASK;
> > Valid bit is guarantied to be set here?
> > 
> >> +
> >> +	apic_eoi();
> > This is way to late. handle_external_interrupt() is called longs after
> > preemption and local irqs are enabled. vcpu process may be scheduled out
> > and apic_eoi() will not be called for a long time leaving interrupt
> > stuck in ISR and blocking other interrupts.
> 
> I will move it to vmx_complete_atomic_exit().
> 
> >> +	apic->send_IPI_self(vector);
> > For level interrupt this is not needed, no?
> 
> If we enable "ack interrupt on exit" only when apicv is available, then all interrupt is edge(interrupt remapping will setup all remap entries to deliver edge interrupt. interrupt remapping is required by x2apic, x2apic is required by PI)
Why x2apic is required by PI? Is this architectural? Cannot find it in SDM.

If interrupts will be delivered without self ipi like Avi suggests then
level will not be the issue.

> 
> /*
>          * Trigger mode in the IRTE will always be edge, and for IO-APIC,
> the
>          * actual level or edge trigger will be setup in the IO-APIC
>          * RTE. This will help simplify level triggered irq migration.
>          * For more details, see the comments (in io_apic.c) explainig
> IO-APIC
>          * irq migration in the presence of interrupt-remapping.
>         */
> 
> >> +
> >>  	++vcpu->stat.irq_exits;
> >>  	return 1;
> >>  }
> >> --
> >> 1.7.1
> > 
> > --
> > 			Gleb.
> 
> 
> Best regards,
> Yang
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-25 12:39   ` Gleb Natapov
@ 2012-11-26  3:51     ` Zhang, Yang Z
  2012-11-26 10:01       ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-26  3:51 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

Gleb Natapov wrote on 2012-11-25:
> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>> without any vmexit.
>> 
>> - When delivering a interrupt to guest, if target vcpu is running,
>>   update Posted-interrupt requests bitmap and send a notification event
>>   to the vcpu. Then the vcpu will handle this interrupt automatically,
>>   without any software involvemnt.
> Looks like you allocating one irq vector per vcpu per pcpu and then
> migrate it or reallocate when vcpu move from one pcpu to another.
> This is not scalable and migrating irq migration slows things down.
> What's wrong with allocating one global vector for posted interrupt
> during vmx initialization and use it for all vcpus?

Consider the following situation: 
If vcpu A is running when notification event which belong to vcpu B is arrived, since the vector match the vcpu A's notification vector, then this event will be consumed by vcpu A(even it do nothing) and the interrupt cannot be handled in time.

>> - If target vcpu is not running or there already a notification event
>>   pending in the vcpu, do nothing. The interrupt will be handled by old
>>   way.
>> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |    3 + arch/x86/include/asm/vmx.h   
>>    |    4 + arch/x86/kernel/apic/io_apic.c  |  138
>>  ++++++++++++++++++++++++++++ arch/x86/kvm/lapic.c            |   31
>>  ++++++- arch/x86/kvm/lapic.h            |    8 ++ arch/x86/kvm/vmx.c  
>>             |  192 +++++++++++++++++++++++++++++++++++++--
>>  arch/x86/kvm/x86.c              |    2 + include/linux/kvm_host.h     
>>    |    1 + virt/kvm/kvm_main.c             |    2 + 9 files changed,
>>  372 insertions(+), 9 deletions(-)
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h index 8e07a86..1145894 100644 ---
>> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu); 	void
>>  (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); 	int
>>  (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); +	int
>>  (*has_posted_interrupt)(struct kvm_vcpu *vcpu); 	void
>>  (*update_irq)(struct kvm_vcpu *vcpu); 	void (*set_eoi_exitmap)(struct
>>  kvm_vcpu *vcpu, int vector, 			int need_eoi, int global);
>> +	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
>> +	void (*pi_migrate)(struct kvm_vcpu *vcpu);
>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>  	int (*get_tdp_level)(void);
>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 1003341..7b9e1d0 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -152,6 +152,7 @@
>>  #define PIN_BASED_EXT_INTR_MASK                 0x00000001
>>  #define PIN_BASED_NMI_EXITING                   0x00000008
>>  #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
>> +#define PIN_BASED_POSTED_INTR                   0x00000080
>> 
>>  #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002 #define
>>  VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200 @@ -174,6 +175,7 @@
>>  /* VMCS Encodings */ enum vmcs_field { 	VIRTUAL_PROCESSOR_ID          
>>   = 0x00000000, +	POSTED_INTR_NV                  = 0x00000002,
>>  	GUEST_ES_SELECTOR               = 0x00000800, 	GUEST_CS_SELECTOR     
>>           = 0x00000802, 	GUEST_SS_SELECTOR               = 0x00000804,
>>  @@ -208,6 +210,8 @@ enum vmcs_field { 	VIRTUAL_APIC_PAGE_ADDR_HIGH    
>>  = 0x00002013, 	APIC_ACCESS_ADDR		= 0x00002014,
>>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
>> +	POSTED_INTR_DESC_ADDR           = 0x00002016,
>> +	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
>>  	EPT_POINTER                     = 0x0000201a,
>>  	EPT_POINTER_HIGH                = 0x0000201b,
>>  	EOI_EXIT_BITMAP0                = 0x0000201c,
>> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
>> index 1817fa9..97cb8ee 100644
>> --- a/arch/x86/kernel/apic/io_apic.c
>> +++ b/arch/x86/kernel/apic/io_apic.c
>> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
>>  }
>>  #endif
>> +static int
>> +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
>> +		      bool force)
>> +{
>> +	unsigned int dest;
>> +	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
>> +	if (cpumask_equal(cfg->domain, mask))
>> +		return IRQ_SET_MASK_OK;
>> +
>> +	if (__ioapic_set_affinity(data, mask, &dest))
>> +		return -1;
>> +
>> +	return IRQ_SET_MASK_OK;
>> +}
>> +
>> +static void pi_mask(struct irq_data *data)
>> +{
>> +	;
>> +}
>> +
>> +static void pi_unmask(struct irq_data *data)
>> +{
>> +	;
>> +}
>> +
>> +static struct irq_chip pi_chip = {
>> +	.name       = "POSTED-INTR",
>> +	.irq_ack    = ack_apic_edge,
>> +	.irq_unmask = pi_unmask,
>> +	.irq_mask   = pi_mask,
>> +	.irq_set_affinity   = pi_set_affinity,
>> +};
>> +
>> +int arch_pi_migrate(int irq, int cpu)
>> +{
>> +	struct irq_data *data = irq_get_irq_data(irq);
>> +	struct irq_cfg *cfg;
>> +	struct irq_desc *desc = irq_to_desc(irq);
>> +	unsigned long flags;
>> +
>> +	if (!desc)
>> +		return -EINVAL;
>> +
>> +	cfg = irq_cfg(irq);
>> +	if (cpumask_equal(cfg->domain, cpumask_of(cpu)))
>> +		return cfg->vector;
>> +
>> +	irq_set_affinity(irq, cpumask_of(cpu));
>> +	raw_spin_lock_irqsave(&desc->lock, flags);
>> +	irq_move_irq(data);
>> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
>> +
>> +	if (cfg->move_in_progress)
>> +		send_cleanup_vector(cfg);
>> +	return cfg->vector;
>> +}
>> +EXPORT_SYMBOL_GPL(arch_pi_migrate);
>> +
>> +static int arch_pi_create_irq(const struct cpumask *mask)
>> +{
>> +	int node = cpu_to_node(0);
>> +	unsigned int irq_want;
>> +	struct irq_cfg *cfg;
>> +	unsigned long flags;
>> +	unsigned int ret = 0;
>> +	int irq;
>> +
>> +	irq_want = nr_irqs_gsi;
>> +
>> +	irq = alloc_irq_from(irq_want, node);
>> +	if (irq < 0)
>> +		return 0;
>> +	cfg = alloc_irq_cfg(irq_want, node);
> s/irq_want/irq.
> 
>> +	if (!cfg) {
>> +		free_irq_at(irq, NULL);
>> +		return 0;
>> +	}
>> +
>> +	raw_spin_lock_irqsave(&vector_lock, flags);
>> +	if (!__assign_irq_vector(irq, cfg, mask))
>> +		ret = irq;
>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
>> +
>> +	if (ret) {
>> +		irq_set_chip_data(irq, cfg);
>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
>> +	} else {
>> +		free_irq_at(irq, cfg);
>> +	}
>> +	return ret;
>> +}
> 
> This function is mostly cut&paste of create_irq_nr().

Yes, this function allow to allocate vector from specified cpu.
 
>> +
>> +int arch_pi_alloc_irq(void *vmx)
>> +{
>> +	int irq, cpu = smp_processor_id();
>> +	struct irq_cfg *cfg;
>> +
>> +	irq = arch_pi_create_irq(cpumask_of(cpu));
>> +	if (!irq) {
>> +		pr_err("Posted Interrupt: no free irq\n");
>> +		return -EINVAL;
>> +	}
>> +	irq_set_handler_data(irq, vmx);
>> +	irq_set_chip_and_handler_name(irq, &pi_chip, handle_edge_irq, "edge");
>> +	irq_set_status_flags(irq, IRQ_MOVE_PCNTXT);
>> +	irq_set_affinity(irq, cpumask_of(cpu));
>> +
>> +	cfg = irq_cfg(irq);
>> +	if (cfg->move_in_progress)
>> +		send_cleanup_vector(cfg);
>> +
>> +	return irq;
>> +}
>> +EXPORT_SYMBOL_GPL(arch_pi_alloc_irq);
>> +
>> +void arch_pi_free_irq(unsigned int irq, void *vmx)
>> +{
>> +	if (irq) {
>> +		irq_set_handler_data(irq, NULL);
>> +		/* This will mask the irq */
>> +		free_irq(irq, vmx);
>> +		destroy_irq(irq);
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(arch_pi_free_irq);
>> +
>> +int arch_pi_get_vector(unsigned int irq)
>> +{
>> +	struct irq_cfg *cfg;
>> +
>> +	if (!irq)
>> +		return -EINVAL;
>> +
>> +	cfg = irq_cfg(irq);
>> +	return cfg->vector;
>> +}
>> +EXPORT_SYMBOL_GPL(arch_pi_get_vector);
>> +
>>  #ifdef CONFIG_HPET_TIMER
>>  
>>  static int hpet_msi_set_affinity(struct irq_data *data,
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index af48361..04220de 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -656,7 +656,7 @@ void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int
> vector,
>>  static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
>>  			     int vector, int level, int trig_mode)
>>  {
>> -	int result = 0;
>> +	int result = 0, send;
>>  	struct kvm_vcpu *vcpu = apic->vcpu;
>>  
>>  	switch (delivery_mode) {
>> @@ -674,6 +674,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int
> delivery_mode,
>>  		} else {
>>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
>>  			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
>> +			if (kvm_apic_pi_enabled(vcpu)) {
> Provide send_nv() that returns 0 if pi is disabled.
> 
>> +				send = kvm_x86_ops->send_nv(vcpu, vector);
>> +				if (send) {
> No need "send" variable here.

ok.
 
>> +					result = 1;
>> +					break;
>> +				}
>> +			}
>>  		}
>>  
>>  		result = !apic_test_and_set_irr(vector, apic);
>> @@ -1541,6 +1548,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
>> 
>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>>  		apic->vid_enabled = true;
>> +
>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
>> +		apic->pi_enabled = true;
>> +
> This is global state, no need per apic variable.
> 
>>  	return 0;
>>  nomem_free_apic:
>>  	kfree(apic);
>> @@ -1575,6 +1586,24 @@ int kvm_apic_get_highest_irr(struct kvm_vcpu
> *vcpu)
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
>> +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	unsigned int *reg;
>> +	unsigned int i;
>> +
>> +	if (!apic || !apic_enabled(apic))
> Use kvm_vcpu_has_lapic() instead of !apic.

ok.
 
>> +		return;
>> +
>> +	for (i = 0; i <= 7; i++) {
>> +		reg = apic->regs + APIC_IRR + i * 0x10;
>> +		*reg |= pir[i];
> Non atomic access to IRR. Other threads may set bit there concurrently.
Ok.
 
>> +		pir[i] = 0;
>> +	}
> Should set apic->irr_pending to true when setting irr bit.
Right. Will add it in next version.

>> +	return;
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
>> +
>>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
>>  {
>>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>> index 2503a64..ad35868 100644
>> --- a/arch/x86/kvm/lapic.h
>> +++ b/arch/x86/kvm/lapic.h
>> @@ -21,6 +21,7 @@ struct kvm_lapic {
>>  	struct kvm_vcpu *vcpu; 	bool irr_pending; 	bool vid_enabled; +	bool
>>  pi_enabled; 	/* Number of bits set in ISR. */ 	s16 isr_count; 	/* The
>>  highest vector set in ISR; if -1 - invalid, must scan ISR. */ @@ -43,6
>>  +44,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); int
>>  kvm_cpu_has_extint(struct kvm_vcpu *v); int kvm_cpu_get_extint(struct
>>  kvm_vcpu *v); int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
>>  +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir);
>>  void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64
>>  kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void
>>  kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>> @@ -94,6 +96,12 @@ static inline bool kvm_apic_vid_enabled(struct kvm_vcpu
> *vcpu)
>>  	return apic->vid_enabled;
>>  }
>> +static inline bool kvm_apic_pi_enabled(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +	return apic->pi_enabled;
>> +}
>> +
>>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
>>  void kvm_lapic_init(void);
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index f6ef090..6448b96 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -31,6 +31,7 @@
>>  #include <linux/ftrace_event.h> #include <linux/slab.h> #include
>>  <linux/tboot.h> +#include <linux/interrupt.h> #include
>>  "kvm_cache_regs.h" #include "x86.h"
>> @@ -89,6 +90,8 @@ module_param(enable_apicv_reg, bool, S_IRUGO);
>>  static bool __read_mostly enable_apicv_vid = 0;
>>  module_param(enable_apicv_vid, bool, S_IRUGO);
>> +static bool __read_mostly enable_apicv_pi = 0;
>> +module_param(enable_apicv_pi, bool, S_IRUGO);
>>  /*
>>   * If nested=1, nested virtualization is supported, i.e., guests may use
>>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
>> @@ -372,6 +375,44 @@ struct nested_vmx {
>>  	struct page *apic_access_page;
>>  };
>> +/* Posted-Interrupt Descriptor */
>> +struct pi_desc {
>> +	u32 pir[8];     /* Posted interrupt requested */
>> +	union {
>> +		struct {
>> +			u8  on:1,
>> +			    rsvd:7;
>> +		} control;
>> +		u32 rsvd[8];
>> +	} u;
>> +} __aligned(64);
>> +
>> +#define POSTED_INTR_ON  0
>> +u8 pi_test_on(struct pi_desc *pi_desc)
>> +{
>> +	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
>> +}
>> +void pi_set_on(struct pi_desc *pi_desc)
>> +{
>> +	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
>> +}
>> +
>> +void pi_clear_on(struct pi_desc *pi_desc)
>> +{
>> +	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
>> +}
>> +
>> +u8 pi_test_and_set_on(struct pi_desc *pi_desc)
>> +{
>> +	return test_and_set_bit(POSTED_INTR_ON,
>> +			(unsigned long *)&pi_desc->u.control);
>> +}
>> +
>> +void pi_set_pir(int vector, struct pi_desc *pi_desc)
>> +{
>> +	set_bit(vector, (unsigned long *)pi_desc->pir);
>> +}
>> +
>>  struct vcpu_vmx { 	struct kvm_vcpu       vcpu; 	unsigned long        
>>  host_rsp; @@ -439,6 +480,11 @@ struct vcpu_vmx { 	u64
>>  eoi_exit_bitmap[4]; 	u64 eoi_exit_bitmap_global[4];
>> +	/* Posted interrupt descriptor */
>> +	struct pi_desc *pi;
>> +	u32 irq;
>> +	u32 vector;
>> +
>>  	/* Support for a guest hypervisor (nested VMX) */
>>  	struct nested_vmx nested;
>>  };
>> @@ -698,6 +744,11 @@ static u64 host_efer;
>> 
>>  static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
>> +int arch_pi_get_vector(unsigned int irq);
>> +int arch_pi_alloc_irq(struct vcpu_vmx *vmx);
>> +void arch_pi_free_irq(unsigned int irq, struct vcpu_vmx *vmx);
>> +int arch_pi_migrate(int irq, int cpu);
>> +
>>  /*
>>   * Keep MSR_STAR at the end, as setup_msrs() will try to optimize it
>>   * away by decrementing the array size.
>> @@ -783,6 +834,11 @@ static inline bool
> cpu_has_vmx_virtual_intr_delivery(void)
>>  		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>  }
>> +static inline bool cpu_has_vmx_posted_intr(void)
>> +{
>> +	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR;
>> +}
>> +
>>  static inline bool cpu_has_vmx_flexpriority(void)
>>  {
>>  	return cpu_has_vmx_tpr_shadow() &&
>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu,
> int cpu)
>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>>  		unsigned long sysenter_esp;
>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
>> +			pi_set_on(to_vmx(vcpu)->pi);
>> +
> Why?

Here means the vcpu start migration. So we should prevent the notification event until migration end.

>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
>> +
>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 		local_irq_disable();
>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu) 		vcpu->cpu
>>  = -1; 		kvm_cpu_vmxoff(); 	}
>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>> +		pi_set_on(to_vmx(vcpu)->pi);
> Why?

When vcpu schedule out, no need to send notification event to it, just set the PIR and wakeup it is enough.

>>  }
>>  
>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>  	u32 _vmexit_control = 0;
>>  	u32 _vmentry_control = 0;
>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>> -	opt = PIN_BASED_VIRTUAL_NMIS;
>> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>> -				&_pin_based_exec_control) < 0)
>> -		return -EIO;
>> -
>>  	min = CPU_BASED_HLT_EXITING |
>>  #ifdef CONFIG_X86_64
>>  	      CPU_BASED_CR8_LOAD_EXITING |
>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>  				&_vmexit_control) < 0)
>>  		return -EIO;
>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
>> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>> +				&_pin_based_exec_control) < 0)
>> +		return -EIO;
>> +
>> +	if (!(_cpu_based_2nd_exec_control &
>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
>> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
>> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
>> +
>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if (adjust_vmx_controls(min,
>>  opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6 +2783,9 @@ static __init int
>>  hardware_setup(void) 	if (!cpu_has_vmx_virtual_intr_delivery())
>>  		enable_apicv_vid = 0;
>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> In nested guest x2apic may be enabled without irq remapping. Check for
> irq remapping here.

There are no posted interrupt available in nested case. We don't need to check IR here.

> 
>> +		enable_apicv_pi = 0;
>> +
>>  	if (nested)
>>  		nested_vmx_setup_ctls_msrs();
>> @@ -3881,6 +3952,93 @@ static void ept_set_mmio_spte_mask(void)
>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
>>  }
>> +irqreturn_t pi_handler(int irq, void *data)
>> +{
>> +	struct vcpu_vmx *vmx = data;
>> +
>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
>> +	kvm_vcpu_kick(&vmx->vcpu);
>> +
>> +	return IRQ_HANDLED;
>> +}
>> +
>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
>> +{
>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
>> +}
>> +
>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
>> +{
>> +	int ret = 0;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (!enable_apicv_pi)
>> +		return ;
>> +
>> +	preempt_disable();
>> +	local_irq_disable();
>> +	if (!vmx->irq) {
>> +		ret = arch_pi_alloc_irq(vmx);
>> +		if (ret < 0) {
>> +			vmx->irq = -1;
>> +			goto out;
>> +		}
>> +		vmx->irq = ret;
>> +
>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
>> +					"Posted Interrupt", vmx);
>> +		if (ret) {
>> +			vmx->irq = -1;
>> +			goto out;
>> +		}
>> +
>> +		ret = arch_pi_get_vector(vmx->irq);
>> +	} else
>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
>> +
>> +	if (ret < 0) {
>> +		vmx->irq = -1;
>> +		goto out;
>> +	} else {
>> +		vmx->vector = ret;
>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
>> +		pi_clear_on(vmx->pi);
>> +	}
>> +out:
>> +	local_irq_enable();
>> +	preempt_enable();
>> +	return ;
>> +}
>> +
>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
>> +		int vector)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (unlikely(vmx->irq == -1))
>> +		return 0;
>> +
>> +	if (vcpu->cpu == smp_processor_id()) {
>> +		pi_set_on(vmx->pi);
> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
Here means the target vcpu already in vmx non-root mode. Then it will consume the interrupt on next vm entry and we don't need to send the notification event from other cpu, just update PIR is enough.

>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector); +		return
>> 1; +	} +	return 0; +} + +static void free_pi(struct vcpu_vmx *vmx) +{
>> +	if (enable_apicv_pi) { +		kfree(vmx->pi);
>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
>>  /*
>>   * Sets up the vmcs for emulated real mode.
>>   */
>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  	unsigned long a;
>>  #endif
>>  	int i;
>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
>> 
>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
>>  
>>  	/* Control */
>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
>> -		vmcs_config.pin_based_exec_ctrl);
>> +	if (!enable_apicv_pi)
>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
>> +
>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
>> 
>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> vmx_exec_control(vmx));
>> 
>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
>>  	}
>> +	if (enable_apicv_pi) {
>> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
>> +				GFP_KERNEL | __GFP_ZERO);
>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
>> +	}
>> +
>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@ static
>>  void vmx_update_irq(struct kvm_vcpu *vcpu) 	if (!enable_apicv_vid)
>>  		return ;
>> +	if (enable_apicv_pi) {
>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
>> +		pi_clear_on(vmx->pi);
> Why do you do that? Isn't VMX process posted interrupts on vmentry if "on" bit
> is set?
> 
>> +	}
>> +
>>  	vector = kvm_apic_get_highest_irr(vcpu);
>>  	if (vector == -1)
>>  		return;
>> @@ -6586,6 +6758,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>> 
>>  	free_vpid(vmx); 	free_nested(vmx); +	free_pi(vmx);
>>  	free_loaded_vmcs(vmx->loaded_vmcs); 	kfree(vmx->guest_msrs);
>>  	kvm_vcpu_uninit(vcpu); @@ -7483,8 +7656,11 @@ static struct
>>  kvm_x86_ops vmx_x86_ops = { 	.enable_irq_window = enable_irq_window,
>>  	.update_cr8_intercept = update_cr8_intercept,
>>  	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
>>  +	.has_posted_interrupt = vmx_has_posted_interrupt, 	.update_irq =
>>  vmx_update_irq, 	.set_eoi_exitmap = vmx_set_eoi_exitmap,
>> +	.send_nv = vmx_send_nv,
>> +	.pi_migrate = vmx_pi_migrate,
>> 
>>  	.set_tss_addr = vmx_set_tss_addr,
>>  	.get_tdp_level = get_ept_level,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 8b8de3b..f035267 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5250,6 +5250,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>  	bool req_immediate_exit = 0;
>>  
>>  	if (vcpu->requests) {
>> +		if (kvm_check_request(KVM_REQ_POSTED_INTR, vcpu))
>> +			kvm_x86_ops->pi_migrate(vcpu);
>>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
>>  			kvm_mmu_unload(vcpu);
>>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index ecc5543..f8d8d34 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -107,6 +107,7 @@ static inline bool is_error_page(struct page *page)
>>  #define KVM_REQ_IMMEDIATE_EXIT    15
>>  #define KVM_REQ_PMU               16
>>  #define KVM_REQ_PMI               17
>> +#define KVM_REQ_POSTED_INTR       18
>> 
>>  #define KVM_USERSPACE_IRQ_SOURCE_ID		0
>>  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index be70035..05baf1c 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1625,6 +1625,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
>>  			smp_send_reschedule(cpu);
>>  	put_cpu();
>>  }
>> +EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
>> +
>>  #endif /* !CONFIG_S390 */
>>  
>>  void kvm_resched(struct kvm_vcpu *vcpu)
>> --
>> 1.7.1
> 
> --
> 			Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-25 13:11         ` Avi Kivity
@ 2012-11-26  5:44           ` Zhang, Yang Z
  2012-11-26  9:17             ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-26  5:44 UTC (permalink / raw)
  To: Avi Kivity, Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

Avi Kivity wrote on 2012-11-25:
> On 11/25/2012 03:03 PM, Gleb Natapov wrote:
>> On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
>>> On 11/22/2012 05:22 PM, Gleb Natapov wrote:
>>>> On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
>>>>> Ack interrupt on vmexit is required by Posted Interrupt. With it,
>>>>> when external interrupt caused vmexit, the cpu will acknowledge the
>>>>> interrupt controller and save the interrupt's vector in vmcs.
>>>>> 
>>>>> There are several approaches to enable it. This patch uses a simply
>>>>> way: re-generate an interrupt via self ipi.
>>>>> 
>>>>> 
>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>> index 7949d21..f6ef090 100644
>>>>> --- a/arch/x86/kvm/vmx.c
>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>>>>  #ifdef CONFIG_X86_64
>>>>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>>>>  #endif
>>>>> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
>>>>> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
>>>>> +		VM_EXIT_ACK_INTR_ON_EXIT;
>>>> Always? Do it only if posted interrupts are actually available
>>>> and going to be used.
>>> 
>>> Why not always?  Better to have a single code path for host interrupts
>>> (and as Yang notes, the new path is faster as well).
>>> 
>> Is it? The current path is:
>> 
>> vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable
>> interrupt -> cpu ack the interrupt and interrupt deliver through the
>> host IDT.
>> 
>> The proposed path is:
>> 
>> CPU acks interrupt -> vm exit -> KVM vmexit handler(interrupt disabled)
>> -> eoi -> self IPI -> KVM re-enable interrupt -> cpu ack the interrupt
>> and interrupt deliver through the host IDT.
>> 
>> Am I missing something?
> 
> Yes, you're missing the part where I didn't write that the new path
> should avoid the IDT and dispatch the interrupt directly, by emulating
> an interrupt frame directly.  Can be as simple as pushf; push cs; call
> interrupt_table[vector * 8].  Of course we need to verify that no
> interrupt uses the IST or a task gate.

How can we call interrupt table directly? I don't think we can expose the idt_table to a module.
Anyway, to simply the implementation, I will follow gleb's suggestion: only enable "ack intr on exit" when PI is enabled and self ipi should be enough. Any comments?

Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 5/6] x86: Enable ack interrupt on vmexit
  2012-11-26  5:44           ` Zhang, Yang Z
@ 2012-11-26  9:17             ` Gleb Natapov
  0 siblings, 0 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-26  9:17 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: Avi Kivity, kvm@vger.kernel.org, mtosatti@redhat.com

On Mon, Nov 26, 2012 at 05:44:29AM +0000, Zhang, Yang Z wrote:
> Avi Kivity wrote on 2012-11-25:
> > On 11/25/2012 03:03 PM, Gleb Natapov wrote:
> >> On Sun, Nov 25, 2012 at 02:55:26PM +0200, Avi Kivity wrote:
> >>> On 11/22/2012 05:22 PM, Gleb Natapov wrote:
> >>>> On Wed, Nov 21, 2012 at 04:09:38PM +0800, Yang Zhang wrote:
> >>>>> Ack interrupt on vmexit is required by Posted Interrupt. With it,
> >>>>> when external interrupt caused vmexit, the cpu will acknowledge the
> >>>>> interrupt controller and save the interrupt's vector in vmcs.
> >>>>> 
> >>>>> There are several approaches to enable it. This patch uses a simply
> >>>>> way: re-generate an interrupt via self ipi.
> >>>>> 
> >>>>> 
> >>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>> index 7949d21..f6ef090 100644
> >>>>> --- a/arch/x86/kvm/vmx.c
> >>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>> @@ -2525,7 +2525,8 @@ static __init int setup_vmcs_config(struct
> > vmcs_config *vmcs_conf)
> >>>>>  #ifdef CONFIG_X86_64
> >>>>>  	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
> >>>>>  #endif
> >>>>> -	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
> >>>>> +	opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
> >>>>> +		VM_EXIT_ACK_INTR_ON_EXIT;
> >>>> Always? Do it only if posted interrupts are actually available
> >>>> and going to be used.
> >>> 
> >>> Why not always?  Better to have a single code path for host interrupts
> >>> (and as Yang notes, the new path is faster as well).
> >>> 
> >> Is it? The current path is:
> >> 
> >> vm exit -> KVM vmexit handler(interrupt disabled) -> KVM re-enable
> >> interrupt -> cpu ack the interrupt and interrupt deliver through the
> >> host IDT.
> >> 
> >> The proposed path is:
> >> 
> >> CPU acks interrupt -> vm exit -> KVM vmexit handler(interrupt disabled)
> >> -> eoi -> self IPI -> KVM re-enable interrupt -> cpu ack the interrupt
> >> and interrupt deliver through the host IDT.
> >> 
> >> Am I missing something?
> > 
> > Yes, you're missing the part where I didn't write that the new path
> > should avoid the IDT and dispatch the interrupt directly, by emulating
> > an interrupt frame directly.  Can be as simple as pushf; push cs; call
> > interrupt_table[vector * 8].  Of course we need to verify that no
> > interrupt uses the IST or a task gate.
> 
> How can we call interrupt table directly? I don't think we can expose the idt_table to a module.
No, but we can add function to entry_(64|32).S that despatch via
idt_table and expose it. Avi's idea is worth to explore before going
self IPI way.

> Anyway, to simply the implementation, I will follow gleb's suggestion: only enable "ack intr on exit" when PI is enabled and self ipi should be enough. Any comments?
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-26  3:51     ` Zhang, Yang Z
@ 2012-11-26 10:01       ` Gleb Natapov
  2012-11-26 12:29         ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-26 10:01 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-25:
> > On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >> without any vmexit.
> >> 
> >> - When delivering a interrupt to guest, if target vcpu is running,
> >>   update Posted-interrupt requests bitmap and send a notification event
> >>   to the vcpu. Then the vcpu will handle this interrupt automatically,
> >>   without any software involvemnt.
> > Looks like you allocating one irq vector per vcpu per pcpu and then
> > migrate it or reallocate when vcpu move from one pcpu to another.
> > This is not scalable and migrating irq migration slows things down.
> > What's wrong with allocating one global vector for posted interrupt
> > during vmx initialization and use it for all vcpus?
> 
> Consider the following situation: 
> If vcpu A is running when notification event which belong to vcpu B is arrived, since the vector match the vcpu A's notification vector, then this event will be consumed by vcpu A(even it do nothing) and the interrupt cannot be handled in time.
The exact same situation is possible with your code. vcpu B can be
migrated from pcpu and vcpu A will take its place and will be assigned
the same vector as vcpu B. But I fail to see why is this a problem. vcpu
A will ignore PI since pir will be empty and vcpu B should detect new
event during next vmentry. 

> 
> >> - If target vcpu is not running or there already a notification event
> >>   pending in the vcpu, do nothing. The interrupt will be handled by old
> >>   way.
> >> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |    3 + arch/x86/include/asm/vmx.h   
> >>    |    4 + arch/x86/kernel/apic/io_apic.c  |  138
> >>  ++++++++++++++++++++++++++++ arch/x86/kvm/lapic.c            |   31
> >>  ++++++- arch/x86/kvm/lapic.h            |    8 ++ arch/x86/kvm/vmx.c  
> >>             |  192 +++++++++++++++++++++++++++++++++++++--
> >>  arch/x86/kvm/x86.c              |    2 + include/linux/kvm_host.h     
> >>    |    1 + virt/kvm/kvm_main.c             |    2 + 9 files changed,
> >>  372 insertions(+), 9 deletions(-)
> >> diff --git a/arch/x86/include/asm/kvm_host.h
> >> b/arch/x86/include/asm/kvm_host.h index 8e07a86..1145894 100644 ---
> >> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -683,9 +683,12 @@ struct kvm_x86_ops {
> >>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu); 	void
> >>  (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); 	int
> >>  (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu); +	int
> >>  (*has_posted_interrupt)(struct kvm_vcpu *vcpu); 	void
> >>  (*update_irq)(struct kvm_vcpu *vcpu); 	void (*set_eoi_exitmap)(struct
> >>  kvm_vcpu *vcpu, int vector, 			int need_eoi, int global);
> >> +	int (*send_nv)(struct kvm_vcpu *vcpu, int vector);
> >> +	void (*pi_migrate)(struct kvm_vcpu *vcpu);
> >>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>  	int (*get_tdp_level)(void);
> >>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >> index 1003341..7b9e1d0 100644
> >> --- a/arch/x86/include/asm/vmx.h
> >> +++ b/arch/x86/include/asm/vmx.h
> >> @@ -152,6 +152,7 @@
> >>  #define PIN_BASED_EXT_INTR_MASK                 0x00000001
> >>  #define PIN_BASED_NMI_EXITING                   0x00000008
> >>  #define PIN_BASED_VIRTUAL_NMIS                  0x00000020
> >> +#define PIN_BASED_POSTED_INTR                   0x00000080
> >> 
> >>  #define VM_EXIT_SAVE_DEBUG_CONTROLS             0x00000002 #define
> >>  VM_EXIT_HOST_ADDR_SPACE_SIZE            0x00000200 @@ -174,6 +175,7 @@
> >>  /* VMCS Encodings */ enum vmcs_field { 	VIRTUAL_PROCESSOR_ID          
> >>   = 0x00000000, +	POSTED_INTR_NV                  = 0x00000002,
> >>  	GUEST_ES_SELECTOR               = 0x00000800, 	GUEST_CS_SELECTOR     
> >>           = 0x00000802, 	GUEST_SS_SELECTOR               = 0x00000804,
> >>  @@ -208,6 +210,8 @@ enum vmcs_field { 	VIRTUAL_APIC_PAGE_ADDR_HIGH    
> >>  = 0x00002013, 	APIC_ACCESS_ADDR		= 0x00002014,
> >>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
> >> +	POSTED_INTR_DESC_ADDR           = 0x00002016,
> >> +	POSTED_INTR_DESC_ADDR_HIGH      = 0x00002017,
> >>  	EPT_POINTER                     = 0x0000201a,
> >>  	EPT_POINTER_HIGH                = 0x0000201b,
> >>  	EOI_EXIT_BITMAP0                = 0x0000201c,
> >> diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
> >> index 1817fa9..97cb8ee 100644
> >> --- a/arch/x86/kernel/apic/io_apic.c
> >> +++ b/arch/x86/kernel/apic/io_apic.c
> >> @@ -3277,6 +3277,144 @@ int arch_setup_dmar_msi(unsigned int irq)
> >>  }
> >>  #endif
> >> +static int
> >> +pi_set_affinity(struct irq_data *data, const struct cpumask *mask,
> >> +		      bool force)
> >> +{
> >> +	unsigned int dest;
> >> +	struct irq_cfg *cfg = (struct irq_cfg *)data->chip_data;
> >> +	if (cpumask_equal(cfg->domain, mask))
> >> +		return IRQ_SET_MASK_OK;
> >> +
> >> +	if (__ioapic_set_affinity(data, mask, &dest))
> >> +		return -1;
> >> +
> >> +	return IRQ_SET_MASK_OK;
> >> +}
> >> +
> >> +static void pi_mask(struct irq_data *data)
> >> +{
> >> +	;
> >> +}
> >> +
> >> +static void pi_unmask(struct irq_data *data)
> >> +{
> >> +	;
> >> +}
> >> +
> >> +static struct irq_chip pi_chip = {
> >> +	.name       = "POSTED-INTR",
> >> +	.irq_ack    = ack_apic_edge,
> >> +	.irq_unmask = pi_unmask,
> >> +	.irq_mask   = pi_mask,
> >> +	.irq_set_affinity   = pi_set_affinity,
> >> +};
> >> +
> >> +int arch_pi_migrate(int irq, int cpu)
> >> +{
> >> +	struct irq_data *data = irq_get_irq_data(irq);
> >> +	struct irq_cfg *cfg;
> >> +	struct irq_desc *desc = irq_to_desc(irq);
> >> +	unsigned long flags;
> >> +
> >> +	if (!desc)
> >> +		return -EINVAL;
> >> +
> >> +	cfg = irq_cfg(irq);
> >> +	if (cpumask_equal(cfg->domain, cpumask_of(cpu)))
> >> +		return cfg->vector;
> >> +
> >> +	irq_set_affinity(irq, cpumask_of(cpu));
> >> +	raw_spin_lock_irqsave(&desc->lock, flags);
> >> +	irq_move_irq(data);
> >> +	raw_spin_unlock_irqrestore(&desc->lock, flags);
> >> +
> >> +	if (cfg->move_in_progress)
> >> +		send_cleanup_vector(cfg);
> >> +	return cfg->vector;
> >> +}
> >> +EXPORT_SYMBOL_GPL(arch_pi_migrate);
> >> +
> >> +static int arch_pi_create_irq(const struct cpumask *mask)
> >> +{
> >> +	int node = cpu_to_node(0);
> >> +	unsigned int irq_want;
> >> +	struct irq_cfg *cfg;
> >> +	unsigned long flags;
> >> +	unsigned int ret = 0;
> >> +	int irq;
> >> +
> >> +	irq_want = nr_irqs_gsi;
> >> +
> >> +	irq = alloc_irq_from(irq_want, node);
> >> +	if (irq < 0)
> >> +		return 0;
> >> +	cfg = alloc_irq_cfg(irq_want, node);
> > s/irq_want/irq.
> > 
> >> +	if (!cfg) {
> >> +		free_irq_at(irq, NULL);
> >> +		return 0;
> >> +	}
> >> +
> >> +	raw_spin_lock_irqsave(&vector_lock, flags);
> >> +	if (!__assign_irq_vector(irq, cfg, mask))
> >> +		ret = irq;
> >> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> >> +
> >> +	if (ret) {
> >> +		irq_set_chip_data(irq, cfg);
> >> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >> +	} else {
> >> +		free_irq_at(irq, cfg);
> >> +	}
> >> +	return ret;
> >> +}
> > 
> > This function is mostly cut&paste of create_irq_nr().
> 
> Yes, this function allow to allocate vector from specified cpu.
>  
Does not justify code duplication.

> >> +
> >> +int arch_pi_alloc_irq(void *vmx)
> >> +{
> >> +	int irq, cpu = smp_processor_id();
> >> +	struct irq_cfg *cfg;
> >> +
> >> +	irq = arch_pi_create_irq(cpumask_of(cpu));
> >> +	if (!irq) {
> >> +		pr_err("Posted Interrupt: no free irq\n");
> >> +		return -EINVAL;
> >> +	}
> >> +	irq_set_handler_data(irq, vmx);
> >> +	irq_set_chip_and_handler_name(irq, &pi_chip, handle_edge_irq, "edge");
> >> +	irq_set_status_flags(irq, IRQ_MOVE_PCNTXT);
> >> +	irq_set_affinity(irq, cpumask_of(cpu));
> >> +
> >> +	cfg = irq_cfg(irq);
> >> +	if (cfg->move_in_progress)
> >> +		send_cleanup_vector(cfg);
> >> +
> >> +	return irq;
> >> +}
> >> +EXPORT_SYMBOL_GPL(arch_pi_alloc_irq);
> >> +
> >> +void arch_pi_free_irq(unsigned int irq, void *vmx)
> >> +{
> >> +	if (irq) {
> >> +		irq_set_handler_data(irq, NULL);
> >> +		/* This will mask the irq */
> >> +		free_irq(irq, vmx);
> >> +		destroy_irq(irq);
> >> +	}
> >> +}
> >> +EXPORT_SYMBOL_GPL(arch_pi_free_irq);
> >> +
> >> +int arch_pi_get_vector(unsigned int irq)
> >> +{
> >> +	struct irq_cfg *cfg;
> >> +
> >> +	if (!irq)
> >> +		return -EINVAL;
> >> +
> >> +	cfg = irq_cfg(irq);
> >> +	return cfg->vector;
> >> +}
> >> +EXPORT_SYMBOL_GPL(arch_pi_get_vector);
> >> +
> >>  #ifdef CONFIG_HPET_TIMER
> >>  
> >>  static int hpet_msi_set_affinity(struct irq_data *data,
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index af48361..04220de 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -656,7 +656,7 @@ void kvm_set_eoi_exitmap(struct kvm_vcpu *vcpu, int
> > vector,
> >>  static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
> >>  			     int vector, int level, int trig_mode)
> >>  {
> >> -	int result = 0;
> >> +	int result = 0, send;
> >>  	struct kvm_vcpu *vcpu = apic->vcpu;
> >>  
> >>  	switch (delivery_mode) {
> >> @@ -674,6 +674,13 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int
> > delivery_mode,
> >>  		} else {
> >>  			apic_clear_vector(vector, apic->regs + APIC_TMR);
> >>  			kvm_set_eoi_exitmap(vcpu, vector, 0, 0);
> >> +			if (kvm_apic_pi_enabled(vcpu)) {
> > Provide send_nv() that returns 0 if pi is disabled.
> > 
> >> +				send = kvm_x86_ops->send_nv(vcpu, vector);
> >> +				if (send) {
> > No need "send" variable here.
> 
> ok.
>  
> >> +					result = 1;
> >> +					break;
> >> +				}
> >> +			}
> >>  		}
> >>  
> >>  		result = !apic_test_and_set_irr(vector, apic);
> >> @@ -1541,6 +1548,10 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
> >> 
> >>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>  		apic->vid_enabled = true;
> >> +
> >> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >> +		apic->pi_enabled = true;
> >> +
> > This is global state, no need per apic variable.
> > 
> >>  	return 0;
> >>  nomem_free_apic:
> >>  	kfree(apic);
> >> @@ -1575,6 +1586,24 @@ int kvm_apic_get_highest_irr(struct kvm_vcpu
> > *vcpu)
> >>  }
> >>  EXPORT_SYMBOL_GPL(kvm_apic_get_highest_irr);
> >> +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir)
> >> +{
> >> +	struct kvm_lapic *apic = vcpu->arch.apic;
> >> +	unsigned int *reg;
> >> +	unsigned int i;
> >> +
> >> +	if (!apic || !apic_enabled(apic))
> > Use kvm_vcpu_has_lapic() instead of !apic.
> 
> ok.
>  
> >> +		return;
> >> +
> >> +	for (i = 0; i <= 7; i++) {
> >> +		reg = apic->regs + APIC_IRR + i * 0x10;
> >> +		*reg |= pir[i];
> > Non atomic access to IRR. Other threads may set bit there concurrently.
> Ok.
>  
> >> +		pir[i] = 0;
> >> +	}
> > Should set apic->irr_pending to true when setting irr bit.
> Right. Will add it in next version.
> 
> >> +	return;
> >> +}
> >> +EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
> >> +
> >>  int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
> >>  {
> >>  	u32 lvt0 = kvm_apic_get_reg(vcpu->arch.apic, APIC_LVT0);
> >> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> >> index 2503a64..ad35868 100644
> >> --- a/arch/x86/kvm/lapic.h
> >> +++ b/arch/x86/kvm/lapic.h
> >> @@ -21,6 +21,7 @@ struct kvm_lapic {
> >>  	struct kvm_vcpu *vcpu; 	bool irr_pending; 	bool vid_enabled; +	bool
> >>  pi_enabled; 	/* Number of bits set in ISR. */ 	s16 isr_count; 	/* The
> >>  highest vector set in ISR; if -1 - invalid, must scan ISR. */ @@ -43,6
> >>  +44,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); int
> >>  kvm_cpu_has_extint(struct kvm_vcpu *v); int kvm_cpu_get_extint(struct
> >>  kvm_vcpu *v); int kvm_apic_get_highest_irr(struct kvm_vcpu *vcpu);
> >>  +void kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned int *pir);
> >>  void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64
> >>  kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void
> >>  kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> >> @@ -94,6 +96,12 @@ static inline bool kvm_apic_vid_enabled(struct kvm_vcpu
> > *vcpu)
> >>  	return apic->vid_enabled;
> >>  }
> >> +static inline bool kvm_apic_pi_enabled(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvm_lapic *apic = vcpu->arch.apic;
> >> +	return apic->pi_enabled;
> >> +}
> >> +
> >>  int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data);
> >>  void kvm_lapic_init(void);
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index f6ef090..6448b96 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -31,6 +31,7 @@
> >>  #include <linux/ftrace_event.h> #include <linux/slab.h> #include
> >>  <linux/tboot.h> +#include <linux/interrupt.h> #include
> >>  "kvm_cache_regs.h" #include "x86.h"
> >> @@ -89,6 +90,8 @@ module_param(enable_apicv_reg, bool, S_IRUGO);
> >>  static bool __read_mostly enable_apicv_vid = 0;
> >>  module_param(enable_apicv_vid, bool, S_IRUGO);
> >> +static bool __read_mostly enable_apicv_pi = 0;
> >> +module_param(enable_apicv_pi, bool, S_IRUGO);
> >>  /*
> >>   * If nested=1, nested virtualization is supported, i.e., guests may use
> >>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> >> @@ -372,6 +375,44 @@ struct nested_vmx {
> >>  	struct page *apic_access_page;
> >>  };
> >> +/* Posted-Interrupt Descriptor */
> >> +struct pi_desc {
> >> +	u32 pir[8];     /* Posted interrupt requested */
> >> +	union {
> >> +		struct {
> >> +			u8  on:1,
> >> +			    rsvd:7;
> >> +		} control;
> >> +		u32 rsvd[8];
> >> +	} u;
> >> +} __aligned(64);
> >> +
> >> +#define POSTED_INTR_ON  0
> >> +u8 pi_test_on(struct pi_desc *pi_desc)
> >> +{
> >> +	return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> >> +}
> >> +void pi_set_on(struct pi_desc *pi_desc)
> >> +{
> >> +	set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> >> +}
> >> +
> >> +void pi_clear_on(struct pi_desc *pi_desc)
> >> +{
> >> +	clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->u.control);
> >> +}
> >> +
> >> +u8 pi_test_and_set_on(struct pi_desc *pi_desc)
> >> +{
> >> +	return test_and_set_bit(POSTED_INTR_ON,
> >> +			(unsigned long *)&pi_desc->u.control);
> >> +}
> >> +
> >> +void pi_set_pir(int vector, struct pi_desc *pi_desc)
> >> +{
> >> +	set_bit(vector, (unsigned long *)pi_desc->pir);
> >> +}
> >> +
> >>  struct vcpu_vmx { 	struct kvm_vcpu       vcpu; 	unsigned long        
> >>  host_rsp; @@ -439,6 +480,11 @@ struct vcpu_vmx { 	u64
> >>  eoi_exit_bitmap[4]; 	u64 eoi_exit_bitmap_global[4];
> >> +	/* Posted interrupt descriptor */
> >> +	struct pi_desc *pi;
> >> +	u32 irq;
> >> +	u32 vector;
> >> +
> >>  	/* Support for a guest hypervisor (nested VMX) */
> >>  	struct nested_vmx nested;
> >>  };
> >> @@ -698,6 +744,11 @@ static u64 host_efer;
> >> 
> >>  static void ept_save_pdptrs(struct kvm_vcpu *vcpu);
> >> +int arch_pi_get_vector(unsigned int irq);
> >> +int arch_pi_alloc_irq(struct vcpu_vmx *vmx);
> >> +void arch_pi_free_irq(unsigned int irq, struct vcpu_vmx *vmx);
> >> +int arch_pi_migrate(int irq, int cpu);
> >> +
> >>  /*
> >>   * Keep MSR_STAR at the end, as setup_msrs() will try to optimize it
> >>   * away by decrementing the array size.
> >> @@ -783,6 +834,11 @@ static inline bool
> > cpu_has_vmx_virtual_intr_delivery(void)
> >>  		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >>  }
> >> +static inline bool cpu_has_vmx_posted_intr(void)
> >> +{
> >> +	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_POSTED_INTR;
> >> +}
> >> +
> >>  static inline bool cpu_has_vmx_flexpriority(void)
> >>  {
> >>  	return cpu_has_vmx_tpr_shadow() &&
> >> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu,
> > int cpu)
> >>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>  		unsigned long sysenter_esp;
> >> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >> +			pi_set_on(to_vmx(vcpu)->pi);
> >> +
> > Why?
> 
> Here means the vcpu start migration. So we should prevent the notification event until migration end.
> 
You check for IN_GUEST_MODE while sending notification. Why is this not
enough? Also why vmx_vcpu_load() call means that vcpu start migration?

> >> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >> +
> >>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 		local_irq_disable();
> >>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
> >>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu) 		vcpu->cpu
> >>  = -1; 		kvm_cpu_vmxoff(); 	}
> >> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >> +		pi_set_on(to_vmx(vcpu)->pi);
> > Why?
> 
> When vcpu schedule out, no need to send notification event to it, just set the PIR and wakeup it is enough.
Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
mode. Also in this case we probably should set bit directly in IRR and leave
PIR alone.

> 
> >>  }
> >>  
> >>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> > vmcs_config *vmcs_conf)
> >>  	u32 _vmexit_control = 0;
> >>  	u32 _vmentry_control = 0;
> >> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >> -	opt = PIN_BASED_VIRTUAL_NMIS;
> >> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >> -				&_pin_based_exec_control) < 0)
> >> -		return -EIO;
> >> -
> >>  	min = CPU_BASED_HLT_EXITING |
> >>  #ifdef CONFIG_X86_64
> >>  	      CPU_BASED_CR8_LOAD_EXITING |
> >> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> > vmcs_config *vmcs_conf)
> >>  				&_vmexit_control) < 0)
> >>  		return -EIO;
> >> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
> >> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >> +				&_pin_based_exec_control) < 0)
> >> +		return -EIO;
> >> +
> >> +	if (!(_cpu_based_2nd_exec_control &
> >> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
> >> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
> >> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
> >> +
> >>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if (adjust_vmx_controls(min,
> >>  opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6 +2783,9 @@ static __init int
> >>  hardware_setup(void) 	if (!cpu_has_vmx_virtual_intr_delivery())
> >>  		enable_apicv_vid = 0;
> >> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> > In nested guest x2apic may be enabled without irq remapping. Check for
> > irq remapping here.
> 
> There are no posted interrupt available in nested case. We don't need to check IR here.
> 
One day emulation will be added. If pre-request for PI is IR check
for IR.

BTW why IR is needed for PI. To deliver assigned devices interrupts
directly into a guest sure, but why is it required for delivering
interrupts from emulated devices or IPIs?

> > 
> >> +		enable_apicv_pi = 0;
> >> +
> >>  	if (nested)
> >>  		nested_vmx_setup_ctls_msrs();
> >> @@ -3881,6 +3952,93 @@ static void ept_set_mmio_spte_mask(void)
> >>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull);
> >>  }
> >> +irqreturn_t pi_handler(int irq, void *data)
> >> +{
> >> +	struct vcpu_vmx *vmx = data;
> >> +
> >> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >> +	kvm_vcpu_kick(&vmx->vcpu);
> >> +
> >> +	return IRQ_HANDLED;
> >> +}
> >> +
> >> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >> +}
> >> +
> >> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >> +{
> >> +	int ret = 0;
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	if (!enable_apicv_pi)
> >> +		return ;
> >> +
> >> +	preempt_disable();
> >> +	local_irq_disable();
> >> +	if (!vmx->irq) {
> >> +		ret = arch_pi_alloc_irq(vmx);
> >> +		if (ret < 0) {
> >> +			vmx->irq = -1;
> >> +			goto out;
> >> +		}
> >> +		vmx->irq = ret;
> >> +
> >> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >> +					"Posted Interrupt", vmx);
> >> +		if (ret) {
> >> +			vmx->irq = -1;
> >> +			goto out;
> >> +		}
> >> +
> >> +		ret = arch_pi_get_vector(vmx->irq);
> >> +	} else
> >> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >> +
> >> +	if (ret < 0) {
> >> +		vmx->irq = -1;
> >> +		goto out;
> >> +	} else {
> >> +		vmx->vector = ret;
> >> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >> +		pi_clear_on(vmx->pi);
> >> +	}
> >> +out:
> >> +	local_irq_enable();
> >> +	preempt_enable();
> >> +	return ;
> >> +}
> >> +
> >> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >> +		int vector)
> >> +{
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	if (unlikely(vmx->irq == -1))
> >> +		return 0;
> >> +
> >> +	if (vcpu->cpu == smp_processor_id()) {
> >> +		pi_set_on(vmx->pi);
> > Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> Here means the target vcpu already in vmx non-root mode. Then it will consume the interrupt on next vm entry and we don't need to send the notification event from other cpu, just update PIR is enough.
I understand why you avoid sending PI IPI here, but you do not update
pir in this case either. You only set "on" bit here and set vector directly
in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
Interrupt is delivered from IRR on the next entry.

> 
> >> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
> >> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector); +		return
> >> 1; +	} +	return 0; +} + +static void free_pi(struct vcpu_vmx *vmx) +{
> >> +	if (enable_apicv_pi) { +		kfree(vmx->pi);
> >> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
> >>  /*
> >>   * Sets up the vmcs for emulated real mode.
> >>   */
> >> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>  	unsigned long a;
> >>  #endif
> >>  	int i;
> >> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >> 
> >>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
> >>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>  	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>  
> >>  	/* Control */
> >> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >> -		vmcs_config.pin_based_exec_ctrl);
> >> +	if (!enable_apicv_pi)
> >> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR;
> >> +
> >> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >> 
> >>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> > vmx_exec_control(vmx));
> >> 
> >> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>  		vmcs_write16(GUEST_INTR_STATUS, 0);
> >>  	}
> >> +	if (enable_apicv_pi) {
> >> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
> >> +				GFP_KERNEL | __GFP_ZERO);
> >> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
> >> +	}
> >> +
> >>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
> >>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@ static
> >>  void vmx_update_irq(struct kvm_vcpu *vcpu) 	if (!enable_apicv_vid)
> >>  		return ;
> >> +	if (enable_apicv_pi) {
> >> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >> +		pi_clear_on(vmx->pi);
> > Why do you do that? Isn't VMX process posted interrupts on vmentry if "on" bit
> > is set?
Can you answer this question?

> > 
> >> +	}
> >> +
> >>  	vector = kvm_apic_get_highest_irr(vcpu);
> >>  	if (vector == -1)
> >>  		return;
> >> @@ -6586,6 +6758,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
> >> 
> >>  	free_vpid(vmx); 	free_nested(vmx); +	free_pi(vmx);
> >>  	free_loaded_vmcs(vmx->loaded_vmcs); 	kfree(vmx->guest_msrs);
> >>  	kvm_vcpu_uninit(vcpu); @@ -7483,8 +7656,11 @@ static struct
> >>  kvm_x86_ops vmx_x86_ops = { 	.enable_irq_window = enable_irq_window,
> >>  	.update_cr8_intercept = update_cr8_intercept,
> >>  	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> >>  +	.has_posted_interrupt = vmx_has_posted_interrupt, 	.update_irq =
> >>  vmx_update_irq, 	.set_eoi_exitmap = vmx_set_eoi_exitmap,
> >> +	.send_nv = vmx_send_nv,
> >> +	.pi_migrate = vmx_pi_migrate,
> >> 
> >>  	.set_tss_addr = vmx_set_tss_addr,
> >>  	.get_tdp_level = get_ept_level,
> >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >> index 8b8de3b..f035267 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -5250,6 +5250,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>  	bool req_immediate_exit = 0;
> >>  
> >>  	if (vcpu->requests) {
> >> +		if (kvm_check_request(KVM_REQ_POSTED_INTR, vcpu))
> >> +			kvm_x86_ops->pi_migrate(vcpu);
> >>  		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
> >>  			kvm_mmu_unload(vcpu);
> >>  		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index ecc5543..f8d8d34 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -107,6 +107,7 @@ static inline bool is_error_page(struct page *page)
> >>  #define KVM_REQ_IMMEDIATE_EXIT    15
> >>  #define KVM_REQ_PMU               16
> >>  #define KVM_REQ_PMI               17
> >> +#define KVM_REQ_POSTED_INTR       18
> >> 
> >>  #define KVM_USERSPACE_IRQ_SOURCE_ID		0
> >>  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> >> index be70035..05baf1c 100644
> >> --- a/virt/kvm/kvm_main.c
> >> +++ b/virt/kvm/kvm_main.c
> >> @@ -1625,6 +1625,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
> >>  			smp_send_reschedule(cpu);
> >>  	put_cpu();
> >>  }
> >> +EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
> >> +
> >>  #endif /* !CONFIG_S390 */
> >>  
> >>  void kvm_resched(struct kvm_vcpu *vcpu)
> >> --
> >> 1.7.1
> > 
> > --
> > 			Gleb.
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-26 10:01       ` Gleb Natapov
@ 2012-11-26 12:29         ` Zhang, Yang Z
  2012-11-26 13:48           ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-26 12:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

Gleb Natapov wrote on 2012-11-26:
> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-25:
>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>>>> without any vmexit.
>>>> 
>>>> - When delivering a interrupt to guest, if target vcpu is running,
>>>>   update Posted-interrupt requests bitmap and send a notification event
>>>>   to the vcpu. Then the vcpu will handle this interrupt automatically,
>>>>   without any software involvemnt.
>>> Looks like you allocating one irq vector per vcpu per pcpu and then
>>> migrate it or reallocate when vcpu move from one pcpu to another.
>>> This is not scalable and migrating irq migration slows things down.
>>> What's wrong with allocating one global vector for posted interrupt
>>> during vmx initialization and use it for all vcpus?
>> 
>> Consider the following situation:
>> If vcpu A is running when notification event which belong to vcpu B is arrived,
> since the vector match the vcpu A's notification vector, then this event
> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> be handled in time. The exact same situation is possible with your code.
> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> be assigned the same vector as vcpu B. But I fail to see why is this a
No, the on bit will be set to prevent notification event when vcpu B start migration. And it only free the vector before it going to run in another pcpu. 

> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> detect new event during next vmentry.
Yes, but the next vmentry may happen long time later and interrupt cannot be serviced until next vmentry. In current way, it will cause vmexit and re-schedule the vcpu B.
 
>> 
>>> 
>>>> +	if (!cfg) {
>>>> +		free_irq_at(irq, NULL);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
>>>> +	if (!__assign_irq_vector(irq, cfg, mask))
>>>> +		ret = irq;
>>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
>>>> +
>>>> +	if (ret) {
>>>> +		irq_set_chip_data(irq, cfg);
>>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
>>>> +	} else {
>>>> +		free_irq_at(irq, cfg);
>>>> +	}
>>>> +	return ret;
>>>> +}
>>> 
>>> This function is mostly cut&paste of create_irq_nr().
>> 
>> Yes, this function allow to allocate vector from specified cpu.
>> 
> Does not justify code duplication.
ok. will change it in next version.
 
>>>> 
>>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>>>>  		apic->vid_enabled = true;
>>>> +
>>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
>>>> +		apic->pi_enabled = true;
>>>> +
>>> This is global state, no need per apic variable.
Even all vcpus use the same setting, but according to SDM, apicv really is a per apic variable.
Anyway, if you think we should not put it here, where is the best place?

>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> *vcpu,
>>> int cpu)
>>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>>>>  		unsigned long sysenter_esp;
>>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>> +			pi_set_on(to_vmx(vcpu)->pi);
>>>> +
>>> Why?
>> 
>> Here means the vcpu start migration. So we should prevent the
>> notification event until migration end.
>> 
> You check for IN_GUEST_MODE while sending notification. Why is this not
For interrupt from emulated device, it enough. But VT-d device doesn't know the vcpu is migrating, so set the on bit to prevent the notification event when target vcpu is migrating.

> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
I think the follow check can ensure the vcpu is in migration, am I wrong?
if (vmx->loaded_vmcs->cpu != cpu)
{
	if (enable_apicv_pi && to_vmx(vcpu)->pi)
	pi_set_on(to_vmx(vcpu)->pi);
}

>>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
>>>> +
>>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 	local_irq_disable();
>>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
>>>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>>>>  	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
>>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>> +		pi_set_on(to_vmx(vcpu)->pi);
>>> Why?
>> 
>> When vcpu schedule out, no need to send notification event to it, just set the
> PIR and wakeup it is enough.
> Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
Right.

> mode. Also in this case we probably should set bit directly in IRR and leave
> PIR alone.
>From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI is enabled, the IRR equal to (IRR | PIR). So there is no difference to set IRR or PIR if target vcpu is not running.

> 
>> 
>>>>  }
>>>>  
>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
>>> vmcs_config *vmcs_conf)
>>>>  	u32 _vmexit_control = 0;
>>>>  	u32 _vmentry_control = 0;
>>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>>>> -	opt = PIN_BASED_VIRTUAL_NMIS;
>>>> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>>>> -				&_pin_based_exec_control) < 0)
>>>> -		return -EIO;
>>>> -
>>>>  	min = CPU_BASED_HLT_EXITING |
>>>>  #ifdef CONFIG_X86_64
>>>>  	      CPU_BASED_CR8_LOAD_EXITING |
>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
>>> vmcs_config *vmcs_conf)
>>>>  				&_vmexit_control) < 0)
>>>>  		return -EIO;
>>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>>>> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
>>>> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>>>> +				&_pin_based_exec_control) < 0)
>>>> +		return -EIO;
>>>> +
>>>> +	if (!(_cpu_based_2nd_exec_control &
>>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
>>>> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
>>>> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
>>>> +
>>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6
>>>>  +2783,9 @@ static __init int hardware_setup(void) 	if
>>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid = 0;
>>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
>>> In nested guest x2apic may be enabled without irq remapping. Check for
>>> irq remapping here.
>> 
>> There are no posted interrupt available in nested case. We don't need
>> to check IR here.
>> 
> One day emulation will be added. If pre-request for PI is IR check
> for IR.

 
> BTW why IR is needed for PI. To deliver assigned devices interrupts
> directly into a guest sure, but why is it required for delivering
> interrupts from emulated devices or IPIs?
Posted Interrupt support is Xeon only and these platforms will have x2APIC. So, Linux will enable x2APIC on these platforms. So we only want to enable PI when x2apic is enabled and IR is required for x2apic.

>>> 
>>>> +		enable_apicv_pi = 0;
>>>> +
>>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93 @@
>>>>  static void ept_set_mmio_spte_mask(void)
>>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
>>>> +irqreturn_t pi_handler(int irq, void *data)
>>>> +{
>>>> +	struct vcpu_vmx *vmx = data;
>>>> +
>>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
>>>> +	kvm_vcpu_kick(&vmx->vcpu);
>>>> +
>>>> +	return IRQ_HANDLED;
>>>> +}
>>>> +
>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
>>>> +}
>>>> +
>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	int ret = 0;
>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +
>>>> +	if (!enable_apicv_pi)
>>>> +		return ;
>>>> +
>>>> +	preempt_disable();
>>>> +	local_irq_disable();
>>>> +	if (!vmx->irq) {
>>>> +		ret = arch_pi_alloc_irq(vmx);
>>>> +		if (ret < 0) {
>>>> +			vmx->irq = -1;
>>>> +			goto out;
>>>> +		}
>>>> +		vmx->irq = ret;
>>>> +
>>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
>>>> +					"Posted Interrupt", vmx);
>>>> +		if (ret) {
>>>> +			vmx->irq = -1;
>>>> +			goto out;
>>>> +		}
>>>> +
>>>> +		ret = arch_pi_get_vector(vmx->irq);
>>>> +	} else
>>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
>>>> +
>>>> +	if (ret < 0) {
>>>> +		vmx->irq = -1;
>>>> +		goto out;
>>>> +	} else {
>>>> +		vmx->vector = ret;
>>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
>>>> +		pi_clear_on(vmx->pi);
>>>> +	}
>>>> +out:
>>>> +	local_irq_enable();
>>>> +	preempt_enable();
>>>> +	return ;
>>>> +}
>>>> +
>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
>>>> +		int vector)
>>>> +{
>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +
>>>> +	if (unlikely(vmx->irq == -1))
>>>> +		return 0;
>>>> +
>>>> +	if (vcpu->cpu == smp_processor_id()) {
>>>> +		pi_set_on(vmx->pi);
>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
>> Here means the target vcpu already in vmx non-root mode. Then it will
> consume the interrupt on next vm entry and we don't need to send the
> notification event from other cpu, just update PIR is enough.
> I understand why you avoid sending PI IPI here, but you do not update
> pir in this case either. You only set "on" bit here and set vector directly
> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> Interrupt is delivered from IRR on the next entry.
As I mentioned, IRR is basically same as PIR.

>> 
>>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
>>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
>>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
>>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { +		kfree(vmx->pi);
>>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
>>>>  /*
>>>>   * Sets up the vmcs for emulated real mode.
>>>>   */
>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>>>  	unsigned long a;
>>>>  #endif
>>>>  	int i;
>>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
>>>> 
>>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>>>  	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
>>>>  
>>>>  	/* Control */
>>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
>>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
>>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
>>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
>>>> 
>>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>>> vmx_exec_control(vmx));
>>>> 
>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
>>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
>>>>  	}
>>>> +	if (enable_apicv_pi) {
>>>> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
>>>> +				GFP_KERNEL | __GFP_ZERO);
>>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
>>>> +	}
>>>> +
>>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
>>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
>>>>  (!enable_apicv_vid) 		return ;
>>>> +	if (enable_apicv_pi) {
>>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
>>>> +		pi_clear_on(vmx->pi);
>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
>>> "on" bit is set?
> Can you answer this question?
No, vmentry do nothing for PI. Posted interrupt only happens when an unmasked external interrupt arrived and the target vcpu is running. Beyond that, cpu follow the old way.

>> Best regards,
>> Yang
>> 
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-26 12:29         ` Zhang, Yang Z
@ 2012-11-26 13:48           ` Gleb Natapov
  2012-11-27  3:38             ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-26 13:48 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-25:
> >>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >>>> without any vmexit.
> >>>> 
> >>>> - When delivering a interrupt to guest, if target vcpu is running,
> >>>>   update Posted-interrupt requests bitmap and send a notification event
> >>>>   to the vcpu. Then the vcpu will handle this interrupt automatically,
> >>>>   without any software involvemnt.
> >>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>> This is not scalable and migrating irq migration slows things down.
> >>> What's wrong with allocating one global vector for posted interrupt
> >>> during vmx initialization and use it for all vcpus?
> >> 
> >> Consider the following situation:
> >> If vcpu A is running when notification event which belong to vcpu B is arrived,
> > since the vector match the vcpu A's notification vector, then this event
> > will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> > be handled in time. The exact same situation is possible with your code.
> > vcpu B can be migrated from pcpu and vcpu A will take its place and will
> > be assigned the same vector as vcpu B. But I fail to see why is this a
> No, the on bit will be set to prevent notification event when vcpu B start migration. And it only free the vector before it going to run in another pcpu. 
There is a race. Sender check on bit, vcpu B migrate to another pcpu and
starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
A gets it.


> 
> > problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> > detect new event during next vmentry.
> Yes, but the next vmentry may happen long time later and interrupt cannot be serviced until next vmentry. In current way, it will cause vmexit and re-schedule the vcpu B.
Vmentry will happen when scheduler will decide that vcpu can run. There
is no problem here. What you probably want to say is that vcpu may not be
aware of interrupt happening since it was migrated to different vcpu
just after PI IPI was sent and thus missed it. But than PIR interrupts
should be processed during vmentry on another pcpu:

Sender:                             Guest:

set pir
set on
if (vcpu in guest mode on pcpu1)
                                   vmexit on pcpu1
                                   vmentry on pcpu2
                                   process pir, deliver interrupt
	send PI IPI to pcpu1


>  
> >> 
> >>> 
> >>>> +	if (!cfg) {
> >>>> +		free_irq_at(irq, NULL);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
> >>>> +	if (!__assign_irq_vector(irq, cfg, mask))
> >>>> +		ret = irq;
> >>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> >>>> +
> >>>> +	if (ret) {
> >>>> +		irq_set_chip_data(irq, cfg);
> >>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >>>> +	} else {
> >>>> +		free_irq_at(irq, cfg);
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>> 
> >>> This function is mostly cut&paste of create_irq_nr().
> >> 
> >> Yes, this function allow to allocate vector from specified cpu.
> >> 
> > Does not justify code duplication.
> ok. will change it in next version.
>  
Please use single global vector in the next version.

> >>>> 
> >>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>>>  		apic->vid_enabled = true;
> >>>> +
> >>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >>>> +		apic->pi_enabled = true;
> >>>> +
> >>> This is global state, no need per apic variable.
> Even all vcpus use the same setting, but according to SDM, apicv really is a per apic variable.
It is not per vapic in our implementation and this is what is
important here.

> Anyway, if you think we should not put it here, where is the best place?
It is not needed, just use has_posted_interrupt(vcpu) instead.

> 
> >>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> > *vcpu,
> >>> int cpu)
> >>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>>>  		unsigned long sysenter_esp;
> >>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>> +			pi_set_on(to_vmx(vcpu)->pi);
> >>>> +
> >>> Why?
> >> 
> >> Here means the vcpu start migration. So we should prevent the
> >> notification event until migration end.
> >> 
> > You check for IN_GUEST_MODE while sending notification. Why is this not
> For interrupt from emulated device, it enough. But VT-d device doesn't know the vcpu is migrating, so set the on bit to prevent the notification event when target vcpu is migrating.
Why should VT-d device care about that? It sets bits in pir and sends
IPI. If vcpu is running it process pir immediately, if not it will do it
during next vmentry.

> 
> > enough? Also why vmx_vcpu_load() call means that vcpu start migration?
> I think the follow check can ensure the vcpu is in migration, am I wrong?
> if (vmx->loaded_vmcs->cpu != cpu)
This code checks that this vcpu ran on that pcpu last time.

> {
> 	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> 	pi_set_on(to_vmx(vcpu)->pi);
> }
> 
> >>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >>>> +
> >>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 	local_irq_disable();
> >>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
> >>>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>  	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
> >>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>> +		pi_set_on(to_vmx(vcpu)->pi);
> >>> Why?
> >> 
> >> When vcpu schedule out, no need to send notification event to it, just set the
> > PIR and wakeup it is enough.
> > Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
> Right.
> 
> > mode. Also in this case we probably should set bit directly in IRR and leave
> > PIR alone.
> >From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI is enabled, the IRR equal to (IRR | PIR). So there is no difference to set IRR or PIR if target vcpu is not running.
But there is a difference for KVM code. For instance
kvm_arch_vcpu_runnable() checks for interrupts in IRR, but not PIR.
Migration code does the same.

> 
> > 
> >> 
> >>>>  }
> >>>>  
> >>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> >>> vmcs_config *vmcs_conf)
> >>>>  	u32 _vmexit_control = 0;
> >>>>  	u32 _vmentry_control = 0;
> >>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>> -	opt = PIN_BASED_VIRTUAL_NMIS;
> >>>> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>> -				&_pin_based_exec_control) < 0)
> >>>> -		return -EIO;
> >>>> -
> >>>>  	min = CPU_BASED_HLT_EXITING |
> >>>>  #ifdef CONFIG_X86_64
> >>>>  	      CPU_BASED_CR8_LOAD_EXITING |
> >>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> >>> vmcs_config *vmcs_conf)
> >>>>  				&_vmexit_control) < 0)
> >>>>  		return -EIO;
> >>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
> >>>> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>> +				&_pin_based_exec_control) < 0)
> >>>> +		return -EIO;
> >>>> +
> >>>> +	if (!(_cpu_based_2nd_exec_control &
> >>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
> >>>> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
> >>>> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
> >>>> +
> >>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
> >>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6
> >>>>  +2783,9 @@ static __init int hardware_setup(void) 	if
> >>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid = 0;
> >>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> >>> In nested guest x2apic may be enabled without irq remapping. Check for
> >>> irq remapping here.
> >> 
> >> There are no posted interrupt available in nested case. We don't need
> >> to check IR here.
> >> 
> > One day emulation will be added. If pre-request for PI is IR check
> > for IR.
> 
>  
> > BTW why IR is needed for PI. To deliver assigned devices interrupts
> > directly into a guest sure, but why is it required for delivering
> > interrupts from emulated devices or IPIs?
> Posted Interrupt support is Xeon only and these platforms will have x2APIC. So, Linux will enable x2APIC on these platforms. So we only want to enable PI when x2apic is enabled and IR is required for x2apic.
The fact that x2APIC is available on all platform that support PI is
irrelevant. If one is not strictly required by the other by architecture
do not couple them.

> 
> >>> 
> >>>> +		enable_apicv_pi = 0;
> >>>> +
> >>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93 @@
> >>>>  static void ept_set_mmio_spte_mask(void)
> >>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
> >>>> +irqreturn_t pi_handler(int irq, void *data)
> >>>> +{
> >>>> +	struct vcpu_vmx *vmx = data;
> >>>> +
> >>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >>>> +	kvm_vcpu_kick(&vmx->vcpu);
> >>>> +
> >>>> +	return IRQ_HANDLED;
> >>>> +}
> >>>> +
> >>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >>>> +}
> >>>> +
> >>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	int ret = 0;
> >>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>> +
> >>>> +	if (!enable_apicv_pi)
> >>>> +		return ;
> >>>> +
> >>>> +	preempt_disable();
> >>>> +	local_irq_disable();
> >>>> +	if (!vmx->irq) {
> >>>> +		ret = arch_pi_alloc_irq(vmx);
> >>>> +		if (ret < 0) {
> >>>> +			vmx->irq = -1;
> >>>> +			goto out;
> >>>> +		}
> >>>> +		vmx->irq = ret;
> >>>> +
> >>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >>>> +					"Posted Interrupt", vmx);
> >>>> +		if (ret) {
> >>>> +			vmx->irq = -1;
> >>>> +			goto out;
> >>>> +		}
> >>>> +
> >>>> +		ret = arch_pi_get_vector(vmx->irq);
> >>>> +	} else
> >>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >>>> +
> >>>> +	if (ret < 0) {
> >>>> +		vmx->irq = -1;
> >>>> +		goto out;
> >>>> +	} else {
> >>>> +		vmx->vector = ret;
> >>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >>>> +		pi_clear_on(vmx->pi);
> >>>> +	}
> >>>> +out:
> >>>> +	local_irq_enable();
> >>>> +	preempt_enable();
> >>>> +	return ;
> >>>> +}
> >>>> +
> >>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >>>> +		int vector)
> >>>> +{
> >>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>> +
> >>>> +	if (unlikely(vmx->irq == -1))
> >>>> +		return 0;
> >>>> +
> >>>> +	if (vcpu->cpu == smp_processor_id()) {
> >>>> +		pi_set_on(vmx->pi);
> >>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> >> Here means the target vcpu already in vmx non-root mode. Then it will
> > consume the interrupt on next vm entry and we don't need to send the
> > notification event from other cpu, just update PIR is enough.
> > I understand why you avoid sending PI IPI here, but you do not update
> > pir in this case either. You only set "on" bit here and set vector directly
> > in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> > Interrupt is delivered from IRR on the next entry.
> As I mentioned, IRR is basically same as PIR.
> 
That does not explain why are you setting "on" without setting bit in pir.

> >> 
> >>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
> >>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
> >>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
> >>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { +		kfree(vmx->pi);
> >>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
> >>>>  /*
> >>>>   * Sets up the vmcs for emulated real mode.
> >>>>   */
> >>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>>>  	unsigned long a;
> >>>>  #endif
> >>>>  	int i;
> >>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >>>> 
> >>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
> >>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >>>>  	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>>>  
> >>>>  	/* Control */
> >>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
> >>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
> >>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >>>> 
> >>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> >>> vmx_exec_control(vmx));
> >>>> 
> >>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> > *vmx)
> >>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>>  	}
> >>>> +	if (enable_apicv_pi) {
> >>>> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
> >>>> +				GFP_KERNEL | __GFP_ZERO);
> >>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
> >>>> +	}
> >>>> +
> >>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
> >>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
> >>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
> >>>>  (!enable_apicv_vid) 		return ;
> >>>> +	if (enable_apicv_pi) {
> >>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >>>> +		pi_clear_on(vmx->pi);
> >>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
> >>> "on" bit is set?
> > Can you answer this question?
> No, vmentry do nothing for PI. Posted interrupt only happens when an unmasked external interrupt arrived and the target vcpu is running. Beyond that, cpu follow the old way.
> 
Now that totally contradicts what you wrote above! (unless I
misunderstood you, in which case clarify please)

  From the view of hypervisor, IRR and PIR are same. For each vmentry, if
  PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
  to set IRR or PIR if target vcpu is not running.

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-26 13:48           ` Gleb Natapov
@ 2012-11-27  3:38             ` Zhang, Yang Z
  2012-11-27  9:16               ` Gleb Natapov
  0 siblings, 1 reply; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-27  3:38 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

Gleb Natapov wrote on 2012-11-26:
> On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-26:
>>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2012-11-25:
>>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>>>>>> without any vmexit.
>>>>>> 
>>>>>> - When delivering a interrupt to guest, if target vcpu is running,
>>>>>>   update Posted-interrupt requests bitmap and send a notification
>>>>>>   event to the vcpu. Then the vcpu will handle this interrupt
>>>>>>   automatically, without any software involvemnt.
>>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
>>>>> migrate it or reallocate when vcpu move from one pcpu to another.
>>>>> This is not scalable and migrating irq migration slows things down.
>>>>> What's wrong with allocating one global vector for posted interrupt
>>>>> during vmx initialization and use it for all vcpus?
>>>> 
>>>> Consider the following situation:
>>>> If vcpu A is running when notification event which belong to vcpu B is
> arrived,
>>> since the vector match the vcpu A's notification vector, then this event
>>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
>>> be handled in time. The exact same situation is possible with your code.
>>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
>>> be assigned the same vector as vcpu B. But I fail to see why is this a
>> No, the on bit will be set to prevent notification event when vcpu B start
> migration. And it only free the vector before it going to run in another pcpu.
> There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> A gets it.
Yes, it do exist. But I think it should be ok even this happens.

>> 
>>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
>>> detect new event during next vmentry.
>> Yes, but the next vmentry may happen long time later and interrupt cannot be
> serviced until next vmentry. In current way, it will cause vmexit and re-schedule
> the vcpu B.
> Vmentry will happen when scheduler will decide that vcpu can run. There
I don't know how scheduler can know the vcpu can run in this case, can you elaborate it? 
I thought it may have problems with global vector in some cases(maybe I am wrong, since I am not familiar with KVM scheduler):
If target VCPU is in idle, and this notification event is consumed by other VCPU, then how can scheduler know the vcpu is ready to run? Even there is a way for scheduler to know, then when? Isn't it too late?
If notification event arrived in hypervisor, then how the handler know which VCPU the notification event belong to?

> is no problem here. What you probably want to say is that vcpu may not be
> aware of interrupt happening since it was migrated to different vcpu
> just after PI IPI was sent and thus missed it. But than PIR interrupts
> should be processed during vmentry on another pcpu:
> 
> Sender:                             Guest:
> 
> set pir
> set on
> if (vcpu in guest mode on pcpu1)
>                                    vmexit on pcpu1
>                                    vmentry on pcpu2
>                                    process pir, deliver interrupt
> 	send PI IPI to pcpu1

>> 
>>>> 
>>>>> 
>>>>>> +	if (!cfg) {
>>>>>> +		free_irq_at(irq, NULL);
>>>>>> +		return 0;
>>>>>> +	}
>>>>>> +
>>>>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
>>>>>> +	if (!__assign_irq_vector(irq, cfg, mask))
>>>>>> +		ret = irq;
>>>>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
>>>>>> +
>>>>>> +	if (ret) {
>>>>>> +		irq_set_chip_data(irq, cfg);
>>>>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
>>>>>> +	} else {
>>>>>> +		free_irq_at(irq, cfg);
>>>>>> +	}
>>>>>> +	return ret;
>>>>>> +}
>>>>> 
>>>>> This function is mostly cut&paste of create_irq_nr().
>>>> 
>>>> Yes, this function allow to allocate vector from specified cpu.
>>>> 
>>> Does not justify code duplication.
>> ok. will change it in next version.
>> 
> Please use single global vector in the next version.
> 
>>>>>> 
>>>>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>>>>>>  		apic->vid_enabled = true;
>>>>>> +
>>>>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
>>>>>> +		apic->pi_enabled = true;
>>>>>> +
>>>>> This is global state, no need per apic variable.
>> Even all vcpus use the same setting, but according to SDM, apicv really is a per
> apic variable.
> It is not per vapic in our implementation and this is what is
> important here.
> 
>> Anyway, if you think we should not put it here, where is the best place?
> It is not needed, just use has_posted_interrupt(vcpu) instead.
ok
 
>> 
>>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
>>> *vcpu,
>>>>> int cpu)
>>>>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>>>>>>  		unsigned long sysenter_esp;
>>>>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>>>> +			pi_set_on(to_vmx(vcpu)->pi);
>>>>>> +
>>>>> Why?
>>>> 
>>>> Here means the vcpu start migration. So we should prevent the
>>>> notification event until migration end.
>>>> 
>>> You check for IN_GUEST_MODE while sending notification. Why is this not
>> For interrupt from emulated device, it enough. But VT-d device doesn't know
> the vcpu is migrating, so set the on bit to prevent the notification event when
> target vcpu is migrating.
> Why should VT-d device care about that? It sets bits in pir and sends
> IPI. If vcpu is running it process pir immediately, if not it will do it
> during next vmentry.
We already know the vcpu is not running(it will run soon), we can set this bit to prevent the unnecessary IPI.

>> 
>>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
>> I think the follow check can ensure the vcpu is in migration, am I wrong?
>> if (vmx->loaded_vmcs->cpu != cpu)
> This code checks that this vcpu ran on that pcpu last time.
Yes, migration starts more earlier than here. But I think it should be ok to set ON bit here. Do you have any better idea?

>> {
>> 	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>> 	pi_set_on(to_vmx(vcpu)->pi);
>> }
>> 
>>>>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
>>>>>> +
>>>>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 	local_irq_disable();
>>>>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
>>>>>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>>>>>>  	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
>>>>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>>>> +		pi_set_on(to_vmx(vcpu)->pi);
>>>>> Why?
>>>> 
>>>> When vcpu schedule out, no need to send notification event to it, just set
> the
>>> PIR and wakeup it is enough.
>>> Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
>> Right.
>> 
>>> mode. Also in this case we probably should set bit directly in IRR and leave
>>> PIR alone.
>>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI is
> enabled, the IRR equal to (IRR | PIR). So there is no difference to set IRR or PIR if
> target vcpu is not running.
> But there is a difference for KVM code. For instance
> kvm_arch_vcpu_runnable() checks for interrupts in IRR, but not PIR.
> Migration code does the same.
Right. With PI, we need check the PIR too.
 
>> 
>>> 
>>>> 
>>>>>>  }
>>>>>>  
>>>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
>>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
>>>>> vmcs_config *vmcs_conf)
>>>>>>  	u32 _vmexit_control = 0;
>>>>>>  	u32 _vmentry_control = 0;
>>>>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>>>>>> -	opt = PIN_BASED_VIRTUAL_NMIS;
>>>>>> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>>>>>> -				&_pin_based_exec_control) < 0)
>>>>>> -		return -EIO;
>>>>>> -
>>>>>>  	min = CPU_BASED_HLT_EXITING |
>>>>>>  #ifdef CONFIG_X86_64
>>>>>>  	      CPU_BASED_CR8_LOAD_EXITING |
>>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
>>>>> vmcs_config *vmcs_conf)
>>>>>>  				&_vmexit_control) < 0)
>>>>>>  		return -EIO;
>>>>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
>>>>>> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
>>>>>> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>>>>>> +				&_pin_based_exec_control) < 0)
>>>>>> +		return -EIO;
>>>>>> +
>>>>>> +	if (!(_cpu_based_2nd_exec_control &
>>>>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
>>>>>> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
>>>>>> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
>>>>>> +
>>>>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
>>>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6
>>>>>>  +2783,9 @@ static __init int hardware_setup(void) 	if
>>>>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid = 0;
>>>>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
>>>>> In nested guest x2apic may be enabled without irq remapping. Check for
>>>>> irq remapping here.
>>>> 
>>>> There are no posted interrupt available in nested case. We don't need
>>>> to check IR here.
>>>> 
>>> One day emulation will be added. If pre-request for PI is IR check
>>> for IR.
>> 
>> 
>>> BTW why IR is needed for PI. To deliver assigned devices interrupts
>>> directly into a guest sure, but why is it required for delivering
>>> interrupts from emulated devices or IPIs?
>> Posted Interrupt support is Xeon only and these platforms will have x2APIC. So,
> Linux will enable x2APIC on these platforms. So we only want to enable PI when
> x2apic is enabled and IR is required for x2apic.
> The fact that x2APIC is available on all platform that support PI is
> irrelevant. If one is not strictly required by the other by architecture
> do not couple them.
Right. We only want to simply the implementation of enable "ack intr on exit". If IR enabled, then don't need to check the trig mode(all interrupts are edge) when using self IPI to regenerate the interrupt.
 
>> 
>>>>> 
>>>>>> +		enable_apicv_pi = 0;
>>>>>> +
>>>>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93
>>>>>>  @@ static void ept_set_mmio_spte_mask(void)
>>>>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
>>>>>> +irqreturn_t pi_handler(int irq, void *data)
>>>>>> +{
>>>>>> +	struct vcpu_vmx *vmx = data;
>>>>>> +
>>>>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
>>>>>> +	kvm_vcpu_kick(&vmx->vcpu);
>>>>>> +
>>>>>> +	return IRQ_HANDLED;
>>>>>> +}
>>>>>> +
>>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
>>>>>> +}
>>>>>> +
>>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> +	int ret = 0;
>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>>>> +
>>>>>> +	if (!enable_apicv_pi)
>>>>>> +		return ;
>>>>>> +
>>>>>> +	preempt_disable();
>>>>>> +	local_irq_disable();
>>>>>> +	if (!vmx->irq) {
>>>>>> +		ret = arch_pi_alloc_irq(vmx);
>>>>>> +		if (ret < 0) {
>>>>>> +			vmx->irq = -1;
>>>>>> +			goto out;
>>>>>> +		}
>>>>>> +		vmx->irq = ret;
>>>>>> +
>>>>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
>>>>>> +					"Posted Interrupt", vmx);
>>>>>> +		if (ret) {
>>>>>> +			vmx->irq = -1;
>>>>>> +			goto out;
>>>>>> +		}
>>>>>> +
>>>>>> +		ret = arch_pi_get_vector(vmx->irq);
>>>>>> +	} else
>>>>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
>>>>>> +
>>>>>> +	if (ret < 0) {
>>>>>> +		vmx->irq = -1;
>>>>>> +		goto out;
>>>>>> +	} else {
>>>>>> +		vmx->vector = ret;
>>>>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
>>>>>> +		pi_clear_on(vmx->pi);
>>>>>> +	}
>>>>>> +out:
>>>>>> +	local_irq_enable();
>>>>>> +	preempt_enable();
>>>>>> +	return ;
>>>>>> +}
>>>>>> +
>>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
>>>>>> +		int vector)
>>>>>> +{
>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>>>> +
>>>>>> +	if (unlikely(vmx->irq == -1))
>>>>>> +		return 0;
>>>>>> +
>>>>>> +	if (vcpu->cpu == smp_processor_id()) {
>>>>>> +		pi_set_on(vmx->pi);
>>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
>>>> Here means the target vcpu already in vmx non-root mode. Then it will
>>> consume the interrupt on next vm entry and we don't need to send the
>>> notification event from other cpu, just update PIR is enough.
>>> I understand why you avoid sending PI IPI here, but you do not update
>>> pir in this case either. You only set "on" bit here and set vector directly
>>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
>>> Interrupt is delivered from IRR on the next entry.
>> As I mentioned, IRR is basically same as PIR.
>> 
> That does not explain why are you setting "on" without setting bit in pir.
Right. Just set the PIR and return 1 is enough.

>>>> 
>>>>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
>>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
>>>>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
>>>>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
>>>>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { +		kfree(vmx->pi);
>>>>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
>>>>>>  /*
>>>>>>   * Sets up the vmcs for emulated real mode.
>>>>>>   */
>>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> *vmx)
>>>>>>  	unsigned long a;
>>>>>>  #endif
>>>>>>  	int i;
>>>>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
>>>>>> 
>>>>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
>>>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>>>>>  *vmx) 	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
>>>>>>  
>>>>>>  	/* Control */
>>>>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
>>>>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
>>>>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
>>>>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
>>>>>> 
>>>>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>>>>> vmx_exec_control(vmx));
>>>>>> 
>>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>> *vmx)
>>>>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
>>>>>>  	}
>>>>>> +	if (enable_apicv_pi) {
>>>>>> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
>>>>>> +				GFP_KERNEL | __GFP_ZERO);
>>>>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
>>>>>> +	}
>>>>>> +
>>>>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
>>>>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
>>>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
>>>>>>  (!enable_apicv_vid) 		return ;
>>>>>> +	if (enable_apicv_pi) {
>>>>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
>>>>>> +		pi_clear_on(vmx->pi);
>>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
>>>>> "on" bit is set?
>>> Can you answer this question?
>> No, vmentry do nothing for PI. Posted interrupt only happens when an
> unmasked external interrupt arrived and the target vcpu is running. Beyond that,
> cpu follow the old way.
>> 
> Now that totally contradicts what you wrote above! (unless I
> misunderstood you, in which case clarify please)
> 
>   From the view of hypervisor, IRR and PIR are same. For each vmentry, if
>   PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
>   to set IRR or PIR if target vcpu is not running.
> --
> 			Gleb.
Sorry, maybe I mislead you. VMentry have nothing to do PI.
What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR before each vmentry. So I think this two should be same in some levels. But according your comments, it may wrong.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-27  3:38             ` Zhang, Yang Z
@ 2012-11-27  9:16               ` Gleb Natapov
  2012-11-27 11:10                 ` Zhang, Yang Z
  0 siblings, 1 reply; 29+ messages in thread
From: Gleb Natapov @ 2012-11-27  9:16 UTC (permalink / raw)
  To: Zhang, Yang Z; +Cc: kvm@vger.kernel.org, mtosatti@redhat.com

On Tue, Nov 27, 2012 at 03:38:05AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-26:
> > On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-11-25:
> >>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >>>>>> without any vmexit.
> >>>>>> 
> >>>>>> - When delivering a interrupt to guest, if target vcpu is running,
> >>>>>>   update Posted-interrupt requests bitmap and send a notification
> >>>>>>   event to the vcpu. Then the vcpu will handle this interrupt
> >>>>>>   automatically, without any software involvemnt.
> >>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>>>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>>>> This is not scalable and migrating irq migration slows things down.
> >>>>> What's wrong with allocating one global vector for posted interrupt
> >>>>> during vmx initialization and use it for all vcpus?
> >>>> 
> >>>> Consider the following situation:
> >>>> If vcpu A is running when notification event which belong to vcpu B is
> > arrived,
> >>> since the vector match the vcpu A's notification vector, then this event
> >>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> >>> be handled in time. The exact same situation is possible with your code.
> >>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> >>> be assigned the same vector as vcpu B. But I fail to see why is this a
> >> No, the on bit will be set to prevent notification event when vcpu B start
> > migration. And it only free the vector before it going to run in another pcpu.
> > There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> > starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> > A gets it.
> Yes, it do exist. But I think it should be ok even this happens.
> 
Then it is OK to use global PI vector. The race should be dealt with
anyway.

> >> 
> >>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> >>> detect new event during next vmentry.
> >> Yes, but the next vmentry may happen long time later and interrupt cannot be
> > serviced until next vmentry. In current way, it will cause vmexit and re-schedule
> > the vcpu B.
> > Vmentry will happen when scheduler will decide that vcpu can run. There
> I don't know how scheduler can know the vcpu can run in this case, can you elaborate it? 
> I thought it may have problems with global vector in some cases(maybe I am wrong, since I am not familiar with KVM scheduler):
> If target VCPU is in idle, and this notification event is consumed by other VCPU, then how can scheduler know the vcpu is ready to run? Even there is a way for scheduler to know, then when? Isn't it too late?
> If notification event arrived in hypervisor, then how the handler know which VCPU the notification event belong to?
When vcpu is idle its thread sleeps inside host kernel (see
virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
you should call kvm_vcpu_kick(), but only after changing vcpu
state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
there which checks apic IRR, but not PIR, so it is not enough to set
bit in PIR and call kvm_vcpu_kick() to wake up vcpu.

> 
> > is no problem here. What you probably want to say is that vcpu may not be
> > aware of interrupt happening since it was migrated to different vcpu
> > just after PI IPI was sent and thus missed it. But than PIR interrupts
> > should be processed during vmentry on another pcpu:
> > 
> > Sender:                             Guest:
> > 
> > set pir
> > set on
> > if (vcpu in guest mode on pcpu1)
> >                                    vmexit on pcpu1
> >                                    vmentry on pcpu2
> >                                    process pir, deliver interrupt
> > 	send PI IPI to pcpu1
> 
> >> 
> >>>> 
> >>>>> 
> >>>>>> +	if (!cfg) {
> >>>>>> +		free_irq_at(irq, NULL);
> >>>>>> +		return 0;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
> >>>>>> +	if (!__assign_irq_vector(irq, cfg, mask))
> >>>>>> +		ret = irq;
> >>>>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> >>>>>> +
> >>>>>> +	if (ret) {
> >>>>>> +		irq_set_chip_data(irq, cfg);
> >>>>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >>>>>> +	} else {
> >>>>>> +		free_irq_at(irq, cfg);
> >>>>>> +	}
> >>>>>> +	return ret;
> >>>>>> +}
> >>>>> 
> >>>>> This function is mostly cut&paste of create_irq_nr().
> >>>> 
> >>>> Yes, this function allow to allocate vector from specified cpu.
> >>>> 
> >>> Does not justify code duplication.
> >> ok. will change it in next version.
> >> 
> > Please use single global vector in the next version.
> > 
> >>>>>> 
> >>>>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>>>>>  		apic->vid_enabled = true;
> >>>>>> +
> >>>>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >>>>>> +		apic->pi_enabled = true;
> >>>>>> +
> >>>>> This is global state, no need per apic variable.
> >> Even all vcpus use the same setting, but according to SDM, apicv really is a per
> > apic variable.
> > It is not per vapic in our implementation and this is what is
> > important here.
> > 
> >> Anyway, if you think we should not put it here, where is the best place?
> > It is not needed, just use has_posted_interrupt(vcpu) instead.
> ok
>  
> >> 
> >>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> >>> *vcpu,
> >>>>> int cpu)
> >>>>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>>>>>  		unsigned long sysenter_esp;
> >>>>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>> +			pi_set_on(to_vmx(vcpu)->pi);
> >>>>>> +
> >>>>> Why?
> >>>> 
> >>>> Here means the vcpu start migration. So we should prevent the
> >>>> notification event until migration end.
> >>>> 
> >>> You check for IN_GUEST_MODE while sending notification. Why is this not
> >> For interrupt from emulated device, it enough. But VT-d device doesn't know
> > the vcpu is migrating, so set the on bit to prevent the notification event when
> > target vcpu is migrating.
> > Why should VT-d device care about that? It sets bits in pir and sends
> > IPI. If vcpu is running it process pir immediately, if not it will do it
> > during next vmentry.
> We already know the vcpu is not running(it will run soon), we can set this bit to prevent the unnecessary IPI.
We have IN_GUEST_MODE for that. And this is the wrong place to indicate
that vcpu is not running anyway. vcpu is not running immediately after
vmexit.

> 
> >> 
> >>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
> >> I think the follow check can ensure the vcpu is in migration, am I wrong?
> >> if (vmx->loaded_vmcs->cpu != cpu)
> > This code checks that this vcpu ran on that pcpu last time.
> Yes, migration starts more earlier than here. But I think it should be ok to set ON bit here. Do you have any better idea?
>
If you want to prevent assigned device from sending IPI to non running
vcpu you should set the bit immediately after vmexit. For emulated
devices vcpu->mode should be used.

> >> {
> >> 	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >> 	pi_set_on(to_vmx(vcpu)->pi);
> >> }
> >> 
> >>>>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >>>>>> +
> >>>>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu); 	local_irq_disable();
> >>>>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@ -1582,6
> >>>>>>  +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>>>  	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
> >>>>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>> +		pi_set_on(to_vmx(vcpu)->pi);
> >>>>> Why?
> >>>> 
> >>>> When vcpu schedule out, no need to send notification event to it, just set
> > the
> >>> PIR and wakeup it is enough.
> >>> Same as above. When vcpu is scheduled out it will no be in IN_GUEST_MODE
> >> Right.
> >> 
> >>> mode. Also in this case we probably should set bit directly in IRR and leave
> >>> PIR alone.
> >>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI is
> > enabled, the IRR equal to (IRR | PIR). So there is no difference to set IRR or PIR if
> > target vcpu is not running.
> > But there is a difference for KVM code. For instance
> > kvm_arch_vcpu_runnable() checks for interrupts in IRR, but not PIR.
> > Migration code does the same.
> Right. With PI, we need check the PIR too.
>  
> >> 
> >>> 
> >>>> 
> >>>>>>  }
> >>>>>>  
> >>>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> >>>>> vmcs_config *vmcs_conf)
> >>>>>>  	u32 _vmexit_control = 0;
> >>>>>>  	u32 _vmentry_control = 0;
> >>>>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>>>> -	opt = PIN_BASED_VIRTUAL_NMIS;
> >>>>>> -	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>> -				&_pin_based_exec_control) < 0)
> >>>>>> -		return -EIO;
> >>>>>> -
> >>>>>>  	min = CPU_BASED_HLT_EXITING |
> >>>>>>  #ifdef CONFIG_X86_64
> >>>>>>  	      CPU_BASED_CR8_LOAD_EXITING |
> >>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> >>>>> vmcs_config *vmcs_conf)
> >>>>>>  				&_vmexit_control) < 0)
> >>>>>>  		return -EIO;
> >>>>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
> >>>>>> +	opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR;
> >>>>>> +	if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>> +				&_pin_based_exec_control) < 0)
> >>>>>> +		return -EIO;
> >>>>>> +
> >>>>>> +	if (!(_cpu_based_2nd_exec_control &
> >>>>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) ||
> >>>>>> +		!(_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT))
> >>>>>> +		_pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;
> >>>>>> +
> >>>>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
> >>>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@ -2715,6
> >>>>>>  +2783,9 @@ static __init int hardware_setup(void) 	if
> >>>>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid = 0;
> >>>>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> >>>>> In nested guest x2apic may be enabled without irq remapping. Check for
> >>>>> irq remapping here.
> >>>> 
> >>>> There are no posted interrupt available in nested case. We don't need
> >>>> to check IR here.
> >>>> 
> >>> One day emulation will be added. If pre-request for PI is IR check
> >>> for IR.
> >> 
> >> 
> >>> BTW why IR is needed for PI. To deliver assigned devices interrupts
> >>> directly into a guest sure, but why is it required for delivering
> >>> interrupts from emulated devices or IPIs?
> >> Posted Interrupt support is Xeon only and these platforms will have x2APIC. So,
> > Linux will enable x2APIC on these platforms. So we only want to enable PI when
> > x2apic is enabled and IR is required for x2apic.
> > The fact that x2APIC is available on all platform that support PI is
> > irrelevant. If one is not strictly required by the other by architecture
> > do not couple them.
> Right. We only want to simply the implementation of enable "ack intr on exit". If IR enabled, then don't need to check the trig mode(all interrupts are edge) when using self IPI to regenerate the interrupt.
With Avi's suggestion self IPI is not needed. Drop this dependency if it
is not architectural.

>  
> >> 
> >>>>> 
> >>>>>> +		enable_apicv_pi = 0;
> >>>>>> +
> >>>>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93
> >>>>>>  @@ static void ept_set_mmio_spte_mask(void)
> >>>>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
> >>>>>> +irqreturn_t pi_handler(int irq, void *data)
> >>>>>> +{
> >>>>>> +	struct vcpu_vmx *vmx = data;
> >>>>>> +
> >>>>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >>>>>> +	kvm_vcpu_kick(&vmx->vcpu);
> >>>>>> +
> >>>>>> +	return IRQ_HANDLED;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >>>>>> +{
> >>>>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >>>>>> +{
> >>>>>> +	int ret = 0;
> >>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>> +
> >>>>>> +	if (!enable_apicv_pi)
> >>>>>> +		return ;
> >>>>>> +
> >>>>>> +	preempt_disable();
> >>>>>> +	local_irq_disable();
> >>>>>> +	if (!vmx->irq) {
> >>>>>> +		ret = arch_pi_alloc_irq(vmx);
> >>>>>> +		if (ret < 0) {
> >>>>>> +			vmx->irq = -1;
> >>>>>> +			goto out;
> >>>>>> +		}
> >>>>>> +		vmx->irq = ret;
> >>>>>> +
> >>>>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >>>>>> +					"Posted Interrupt", vmx);
> >>>>>> +		if (ret) {
> >>>>>> +			vmx->irq = -1;
> >>>>>> +			goto out;
> >>>>>> +		}
> >>>>>> +
> >>>>>> +		ret = arch_pi_get_vector(vmx->irq);
> >>>>>> +	} else
> >>>>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >>>>>> +
> >>>>>> +	if (ret < 0) {
> >>>>>> +		vmx->irq = -1;
> >>>>>> +		goto out;
> >>>>>> +	} else {
> >>>>>> +		vmx->vector = ret;
> >>>>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >>>>>> +		pi_clear_on(vmx->pi);
> >>>>>> +	}
> >>>>>> +out:
> >>>>>> +	local_irq_enable();
> >>>>>> +	preempt_enable();
> >>>>>> +	return ;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >>>>>> +		int vector)
> >>>>>> +{
> >>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>> +
> >>>>>> +	if (unlikely(vmx->irq == -1))
> >>>>>> +		return 0;
> >>>>>> +
> >>>>>> +	if (vcpu->cpu == smp_processor_id()) {
> >>>>>> +		pi_set_on(vmx->pi);
> >>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> >>>> Here means the target vcpu already in vmx non-root mode. Then it will
> >>> consume the interrupt on next vm entry and we don't need to send the
> >>> notification event from other cpu, just update PIR is enough.
> >>> I understand why you avoid sending PI IPI here, but you do not update
> >>> pir in this case either. You only set "on" bit here and set vector directly
> >>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> >>> Interrupt is delivered from IRR on the next entry.
> >> As I mentioned, IRR is basically same as PIR.
> >> 
> > That does not explain why are you setting "on" without setting bit in pir.
> Right. Just set the PIR and return 1 is enough.
> 
> >>>> 
> >>>>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
> >>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >>>>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
> >>>>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
> >>>>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { +		kfree(vmx->pi);
> >>>>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
> >>>>>>  /*
> >>>>>>   * Sets up the vmcs for emulated real mode.
> >>>>>>   */
> >>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> > *vmx)
> >>>>>>  	unsigned long a;
> >>>>>>  #endif
> >>>>>>  	int i;
> >>>>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >>>>>> 
> >>>>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
> >>>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>>>  *vmx) 	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>>>>>  
> >>>>>>  	/* Control */
> >>>>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >>>>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
> >>>>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
> >>>>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >>>>>> 
> >>>>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> >>>>> vmx_exec_control(vmx));
> >>>>>> 
> >>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>> *vmx)
> >>>>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>>>>  	}
> >>>>>> +	if (enable_apicv_pi) {
> >>>>>> +		vmx->pi = kmalloc(sizeof(struct pi_desc),
> >>>>>> +				GFP_KERNEL | __GFP_ZERO);
> >>>>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi)));
> >>>>>> +	}
> >>>>>> +
> >>>>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
> >>>>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
> >>>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
> >>>>>>  (!enable_apicv_vid) 		return ;
> >>>>>> +	if (enable_apicv_pi) {
> >>>>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >>>>>> +		pi_clear_on(vmx->pi);
> >>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
> >>>>> "on" bit is set?
> >>> Can you answer this question?
> >> No, vmentry do nothing for PI. Posted interrupt only happens when an
> > unmasked external interrupt arrived and the target vcpu is running. Beyond that,
> > cpu follow the old way.
> >> 
> > Now that totally contradicts what you wrote above! (unless I
> > misunderstood you, in which case clarify please)
> > 
> >   From the view of hypervisor, IRR and PIR are same. For each vmentry, if
> >   PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
> >   to set IRR or PIR if target vcpu is not running.
> > --
> > 			Gleb.
> Sorry, maybe I mislead you. VMentry have nothing to do PI.
> What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR before each vmentry. So I think this two should be same in some levels. But according your comments, it may wrong.
> 
OK. Thanks for clarification.

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-27  9:16               ` Gleb Natapov
@ 2012-11-27 11:10                 ` Zhang, Yang Z
  2012-11-27 11:31                   ` Veruca Salt
  2012-11-27 11:46                   ` Gleb Natapov
  0 siblings, 2 replies; 29+ messages in thread
From: Zhang, Yang Z @ 2012-11-27 11:10 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, mtosatti@redhat.com, Shan, Haitao,
	Zhang, Xiantao

Gleb Natapov wrote on 2012-11-27:
> On Tue, Nov 27, 2012 at 03:38:05AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2012-11-26:
>>> On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2012-11-26:
>>>>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
>>>>>> Gleb Natapov wrote on 2012-11-25:
>>>>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
>>>>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
>>>>>>>> without any vmexit.
>>>>>>>> 
>>>>>>>> - When delivering a interrupt to guest, if target vcpu is running,
>>>>>>>>   update Posted-interrupt requests bitmap and send a notification
>>>>>>>>   event to the vcpu. Then the vcpu will handle this interrupt
>>>>>>>>   automatically, without any software involvemnt.
>>>>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
>>>>>>> migrate it or reallocate when vcpu move from one pcpu to another.
>>>>>>> This is not scalable and migrating irq migration slows things down.
>>>>>>> What's wrong with allocating one global vector for posted interrupt
>>>>>>> during vmx initialization and use it for all vcpus?
>>>>>> 
>>>>>> Consider the following situation:
>>>>>> If vcpu A is running when notification event which belong to vcpu B is
>>> arrived,
>>>>> since the vector match the vcpu A's notification vector, then this event
>>>>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
>>>>> be handled in time. The exact same situation is possible with your code.
>>>>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
>>>>> be assigned the same vector as vcpu B. But I fail to see why is this a
>>>> No, the on bit will be set to prevent notification event when vcpu B start
>>> migration. And it only free the vector before it going to run in another pcpu.
>>> There is a race. Sender check on bit, vcpu B migrate to another pcpu and
>>> starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
>>> A gets it.
>> Yes, it do exist. But I think it should be ok even this happens.
>> 
> Then it is OK to use global PI vector. The race should be dealt with
> anyway.
Or using lock can deal with it too.
 
>>>> 
>>>>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
>>>>> detect new event during next vmentry.
>>>> Yes, but the next vmentry may happen long time later and interrupt cannot
> be
>>> serviced until next vmentry. In current way, it will cause vmexit and
>>> re-schedule the vcpu B. Vmentry will happen when scheduler will decide
>>> that vcpu can run. There
>> I don't know how scheduler can know the vcpu can run in this case, can
>> you elaborate it? I thought it may have problems with global vector in
>> some cases(maybe I am wrong, since I am not familiar with KVM
>> scheduler): If target VCPU is in idle, and this notification event is
>> consumed by other VCPU,
> then how can scheduler know the vcpu is ready to run? Even there is a way for
> scheduler to know, then when? Isn't it too late?
>> If notification event arrived in hypervisor, then how the handler know which
> VCPU the notification event belong to?
> When vcpu is idle its thread sleeps inside host kernel (see
> virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
> you should call kvm_vcpu_kick(), but only after changing vcpu
> state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
> checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
> there which checks apic IRR, but not PIR, so it is not enough to set
> bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick when the waiting event happened to wake up the blocked vcpu, but this event is consumed by other vcpu without any chance for us to kick it. Then how it will move out from blocked list to run queue.
BTW, what I am talking is for the interrupt from VT-d case. For virtual interrupt, I think global vector is ok.
Also, the second problem is also about the VT-d case.
When cpu is running in VM root mode, and then an notification event arrives, since all VCPU use the same notification vector, we cannot distinguish which VCPU the notification event want to deliver to. And we cannot put the right VCPU to run queue.
 
>> 
>>> is no problem here. What you probably want to say is that vcpu may not be
>>> aware of interrupt happening since it was migrated to different vcpu
>>> just after PI IPI was sent and thus missed it. But than PIR interrupts
>>> should be processed during vmentry on another pcpu:
>>> 
>>> Sender:                             Guest:
>>> 
>>> set pir
>>> set on
>>> if (vcpu in guest mode on pcpu1)
>>>                                    vmexit on pcpu1
>>>                                    vmentry on pcpu2
>>>                                    process pir, deliver interrupt
>>> 	send PI IPI to pcpu1
>> 
>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> +	if (!cfg) {
>>>>>>>> +		free_irq_at(irq, NULL);
>>>>>>>> +		return 0;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
>>>>>>>> +	if (!__assign_irq_vector(irq, cfg, mask))
>>>>>>>> +		ret = irq;
>>>>>>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
>>>>>>>> +
>>>>>>>> +	if (ret) {
>>>>>>>> +		irq_set_chip_data(irq, cfg);
>>>>>>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
>>>>>>>> +	} else {
>>>>>>>> +		free_irq_at(irq, cfg);
>>>>>>>> +	}
>>>>>>>> +	return ret;
>>>>>>>> +}
>>>>>>> 
>>>>>>> This function is mostly cut&paste of create_irq_nr().
>>>>>> 
>>>>>> Yes, this function allow to allocate vector from specified cpu.
>>>>>> 
>>>>> Does not justify code duplication.
>>>> ok. will change it in next version.
>>>> 
>>> Please use single global vector in the next version.
>>> 
>>>>>>>> 
>>>>>>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
>>>>>>>>  		apic->vid_enabled = true;
>>>>>>>> +
>>>>>>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
>>>>>>>> +		apic->pi_enabled = true;
>>>>>>>> +
>>>>>>> This is global state, no need per apic variable.
>>>> Even all vcpus use the same setting, but according to SDM, apicv really is a
> per
>>> apic variable.
>>> It is not per vapic in our implementation and this is what is
>>> important here.
>>> 
>>>> Anyway, if you think we should not put it here, where is the best place?
>>> It is not needed, just use has_posted_interrupt(vcpu) instead.
>> ok
>> 
>>>> 
>>>>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
>>>>> *vcpu,
>>>>>>> int cpu)
>>>>>>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
>>>>>>>>  		unsigned long sysenter_esp;
>>>>>>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>>>>>> +			pi_set_on(to_vmx(vcpu)->pi);
>>>>>>>> +
>>>>>>> Why?
>>>>>> 
>>>>>> Here means the vcpu start migration. So we should prevent the
>>>>>> notification event until migration end.
>>>>>> 
>>>>> You check for IN_GUEST_MODE while sending notification. Why is this not
>>>> For interrupt from emulated device, it enough. But VT-d device doesn't
> know
>>> the vcpu is migrating, so set the on bit to prevent the notification event when
>>> target vcpu is migrating.
>>> Why should VT-d device care about that? It sets bits in pir and sends
>>> IPI. If vcpu is running it process pir immediately, if not it will do it
>>> during next vmentry.
>> We already know the vcpu is not running(it will run soon), we can set this bit to
> prevent the unnecessary IPI. We have IN_GUEST_MODE for that. And this is
> the wrong place to indicate that vcpu is not running anyway. vcpu is not
> running immediately after vmexit.
But the VT-d chipset doesn't know. We need to set this bit to tell it.
 
>> 
>>>> 
>>>>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
>>>> I think the follow check can ensure the vcpu is in migration, am I wrong?
>>>> if (vmx->loaded_vmcs->cpu != cpu)
>>> This code checks that this vcpu ran on that pcpu last time.
>> Yes, migration starts more earlier than here. But I think it should be
>> ok to set ON bit here. Do you have any better idea?
>> 
> If you want to prevent assigned device from sending IPI to non running
> vcpu you should set the bit immediately after vmexit. For emulated
> devices vcpu->mode should be used.
No, if the reason of vmexit is waiting for interrupt from assigned device , then it will never have the chance to get this interrupt.
 
>>>> {
>>>> 	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>> 	pi_set_on(to_vmx(vcpu)->pi);
>>>> }
>>>> 
>>>>>>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
>>>>>>>> +
>>>>>>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
>>>>>>>>  	local_irq_disable();
>>>>>>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@
>>>>>>>>  -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu
>>>>>>>>  *vcpu) 	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
>>>>>>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
>>>>>>>> +		pi_set_on(to_vmx(vcpu)->pi);
>>>>>>> Why?
>>>>>> 
>>>>>> When vcpu schedule out, no need to send notification event to it, just set
>>> the
>>>>> PIR and wakeup it is enough.
>>>>> Same as above. When vcpu is scheduled out it will no be in
> IN_GUEST_MODE
>>>> Right.
>>>> 
>>>>> mode. Also in this case we probably should set bit directly in IRR and leave
>>>>> PIR alone.
>>>>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI
> is
>>> enabled, the IRR equal to (IRR | PIR). So there is no difference to
>>> set IRR or PIR if target vcpu is not running. But there is a
>>> difference for KVM code. For instance kvm_arch_vcpu_runnable() checks
>>> for interrupts in IRR, but not PIR. Migration code does the same.
>> Right. With PI, we need check the PIR too.
>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
>>>>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
>>>>>>> vmcs_config *vmcs_conf)
>>>>>>>>  	u32 _vmexit_control = 0;
>>>>>>>>  	u32 _vmentry_control = 0;
>>>>>>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; -	opt =
>>>>>>>> PIN_BASED_VIRTUAL_NMIS; -	if (adjust_vmx_controls(min, opt,
>>>>>>>> MSR_IA32_VMX_PINBASED_CTLS, -				&_pin_based_exec_control) < 0)
>>>>>>>> -		return -EIO; -
>>>>>>>>  	min = CPU_BASED_HLT_EXITING |
>>>>>>>>  #ifdef CONFIG_X86_64
>>>>>>>>  	      CPU_BASED_CR8_LOAD_EXITING |
>>>>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
>>>>>>> vmcs_config *vmcs_conf)
>>>>>>>>  				&_vmexit_control) < 0)
>>>>>>>>  		return -EIO;
>>>>>>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; +	opt =
>>>>>>>> PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; +	if
>>>>>>>> (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
>>>>>>>> +				&_pin_based_exec_control) < 0) +		return -EIO; + +	if
>>>>>>>> (!(_cpu_based_2nd_exec_control &
>>>>>>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || +		!(_vmexit_control
>>>>>>>> & VM_EXIT_ACK_INTR_ON_EXIT)) +		_pin_based_exec_control &=
>>>>>>>> ~PIN_BASED_POSTED_INTR; +
>>>>>>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
>>>>>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@
>>>>>>>>  -2715,6 +2783,9 @@ static __init int hardware_setup(void) 	if
>>>>>>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid =
> 0;
>>>>>>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
>>>>>>> In nested guest x2apic may be enabled without irq remapping. Check for
>>>>>>> irq remapping here.
>>>>>> 
>>>>>> There are no posted interrupt available in nested case. We don't need
>>>>>> to check IR here.
>>>>>> 
>>>>> One day emulation will be added. If pre-request for PI is IR check
>>>>> for IR.
>>>> 
>>>> 
>>>>> BTW why IR is needed for PI. To deliver assigned devices interrupts
>>>>> directly into a guest sure, but why is it required for delivering
>>>>> interrupts from emulated devices or IPIs?
>>>> Posted Interrupt support is Xeon only and these platforms will have x2APIC.
> So,
>>> Linux will enable x2APIC on these platforms. So we only want to enable
>>> PI when x2apic is enabled and IR is required for x2apic. The fact that
>>> x2APIC is available on all platform that support PI is irrelevant. If
>>> one is not strictly required by the other by architecture do not
>>> couple them.
>> Right. We only want to simply the implementation of enable "ack intr on exit".
> If IR enabled, then don't need to check the trig mode(all interrupts are edge)
> when using self IPI to regenerate the interrupt.
> With Avi's suggestion self IPI is not needed. Drop this dependency if it
> is not architectural.
Ok, then we need to read the TMR and only send self IPI for edge interrupt.  
 
>> 
>>>> 
>>>>>>> 
>>>>>>>> +		enable_apicv_pi = 0;
>>>>>>>> +
>>>>>>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93
>>>>>>>>  @@ static void ept_set_mmio_spte_mask(void)
>>>>>>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
>>>>>>>> +irqreturn_t pi_handler(int irq, void *data)
>>>>>>>> +{
>>>>>>>> +	struct vcpu_vmx *vmx = data;
>>>>>>>> +
>>>>>>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
>>>>>>>> +	kvm_vcpu_kick(&vmx->vcpu);
>>>>>>>> +
>>>>>>>> +	return IRQ_HANDLED;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
>>>>>>>> +{
>>>>>>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
>>>>>>>> +{
>>>>>>>> +	int ret = 0;
>>>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>>>>>> +
>>>>>>>> +	if (!enable_apicv_pi)
>>>>>>>> +		return ;
>>>>>>>> +
>>>>>>>> +	preempt_disable();
>>>>>>>> +	local_irq_disable();
>>>>>>>> +	if (!vmx->irq) {
>>>>>>>> +		ret = arch_pi_alloc_irq(vmx);
>>>>>>>> +		if (ret < 0) {
>>>>>>>> +			vmx->irq = -1;
>>>>>>>> +			goto out;
>>>>>>>> +		}
>>>>>>>> +		vmx->irq = ret;
>>>>>>>> +
>>>>>>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
>>>>>>>> +					"Posted Interrupt", vmx);
>>>>>>>> +		if (ret) {
>>>>>>>> +			vmx->irq = -1;
>>>>>>>> +			goto out;
>>>>>>>> +		}
>>>>>>>> +
>>>>>>>> +		ret = arch_pi_get_vector(vmx->irq);
>>>>>>>> +	} else
>>>>>>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
>>>>>>>> +
>>>>>>>> +	if (ret < 0) {
>>>>>>>> +		vmx->irq = -1;
>>>>>>>> +		goto out;
>>>>>>>> +	} else {
>>>>>>>> +		vmx->vector = ret;
>>>>>>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
>>>>>>>> +		pi_clear_on(vmx->pi);
>>>>>>>> +	}
>>>>>>>> +out:
>>>>>>>> +	local_irq_enable();
>>>>>>>> +	preempt_enable();
>>>>>>>> +	return ;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
>>>>>>>> +		int vector)
>>>>>>>> +{
>>>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>>>>>> +
>>>>>>>> +	if (unlikely(vmx->irq == -1))
>>>>>>>> +		return 0;
>>>>>>>> +
>>>>>>>> +	if (vcpu->cpu == smp_processor_id()) {
>>>>>>>> +		pi_set_on(vmx->pi);
>>>>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
>>>>>> Here means the target vcpu already in vmx non-root mode. Then it will
>>>>> consume the interrupt on next vm entry and we don't need to send the
>>>>> notification event from other cpu, just update PIR is enough.
>>>>> I understand why you avoid sending PI IPI here, but you do not update
>>>>> pir in this case either. You only set "on" bit here and set vector directly
>>>>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
>>>>> Interrupt is delivered from IRR on the next entry.
>>>> As I mentioned, IRR is basically same as PIR.
>>>> 
>>> That does not explain why are you setting "on" without setting bit in pir.
>> Right. Just set the PIR and return 1 is enough.
>> 
>>>>>> 
>>>>>>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
>>>>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
>>>>>>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
>>>>>>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
>>>>>>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { + 	kfree(vmx->pi);
>>>>>>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
>>>>>>>>  /*
>>>>>>>>   * Sets up the vmcs for emulated real mode.
>>>>>>>>   */
>>>>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>> *vmx)
>>>>>>>>  	unsigned long a;
>>>>>>>>  #endif
>>>>>>>>  	int i;
>>>>>>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
>>>>>>>> 
>>>>>>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
>>>>>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>>>>>>>  *vmx) 	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
>>>>>>>>  
>>>>>>>>  	/* Control */
>>>>>>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
>>>>>>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
>>>>>>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
>>>>>>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
>>>>>>>> 
>>>>>>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
>>>>>>> vmx_exec_control(vmx));
>>>>>>>> 
>>>>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
>>>>> *vmx)
>>>>>>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
>>>>>>>>  	}
>>>>>>>> +	if (enable_apicv_pi) { +		vmx->pi = kmalloc(sizeof(struct
>>>>>>>> pi_desc), +				GFP_KERNEL | __GFP_ZERO);
>>>>>>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); +	} +
>>>>>>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
>>>>>>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
>>>>>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
>>>>>>>>  (!enable_apicv_vid) 		return ;
>>>>>>>> +	if (enable_apicv_pi) {
>>>>>>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
>>>>>>>> +		pi_clear_on(vmx->pi);
>>>>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
>>>>>>> "on" bit is set?
>>>>> Can you answer this question?
>>>> No, vmentry do nothing for PI. Posted interrupt only happens when an
>>> unmasked external interrupt arrived and the target vcpu is running.
>>> Beyond that, cpu follow the old way.
>>>> 
>>> Now that totally contradicts what you wrote above! (unless I
>>> misunderstood you, in which case clarify please)
>>> 
>>>   From the view of hypervisor, IRR and PIR are same. For each vmentry, if
>>>   PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
>>>   to set IRR or PIR if target vcpu is not running.
>>> --
>>> 			Gleb.
>> Sorry, maybe I mislead you. VMentry have nothing to do PI.
>> What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR
> before each vmentry. So I think this two should be same in some levels. But
> according your comments, it may wrong.
>> 
> OK. Thanks for clarification.
> 
> --
> 			Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Best regards,
Yang



^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-27 11:10                 ` Zhang, Yang Z
@ 2012-11-27 11:31                   ` Veruca Salt
  2012-11-27 11:46                   ` Gleb Natapov
  1 sibling, 0 replies; 29+ messages in thread
From: Veruca Salt @ 2012-11-27 11:31 UTC (permalink / raw)
  To: yang.z.zhang, gleb
  Cc: QEMU-KVM Mailing List, mtosatti, haitao.shan, xiantao.zhang




----------------------------------------
> From: yang.z.zhang@intel.com
> To: gleb@redhat.com
> CC: kvm@vger.kernel.org; mtosatti@redhat.com; haitao.shan@intel.com; xiantao.zhang@intel.com
> Subject: RE: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
> Date: Tue, 27 Nov 2012 11:10:20 +0000
>
> Gleb Natapov wrote on 2012-11-27:
> > On Tue, Nov 27, 2012 at 03:38:05AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-11-26:
> >>>>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> >>>>>> Gleb Natapov wrote on 2012-11-25:
> >>>>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >>>>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >>>>>>>> without any vmexit.
> >>>>>>>>
> >>>>>>>> - When delivering a interrupt to guest, if target vcpu is running,
> >>>>>>>> update Posted-interrupt requests bitmap and send a notification
> >>>>>>>> event to the vcpu. Then the vcpu will handle this interrupt
> >>>>>>>> automatically, without any software involvemnt.
> >>>>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>>>>>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>>>>>> This is not scalable and migrating irq migration slows things down.
> >>>>>>> What's wrong with allocating one global vector for posted interrupt
> >>>>>>> during vmx initialization and use it for all vcpus?
> >>>>>>
> >>>>>> Consider the following situation:
> >>>>>> If vcpu A is running when notification event which belong to vcpu B is
> >>> arrived,
> >>>>> since the vector match the vcpu A's notification vector, then this event
> >>>>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> >>>>> be handled in time. The exact same situation is possible with your code.
> >>>>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> >>>>> be assigned the same vector as vcpu B. But I fail to see why is this a
> >>>> No, the on bit will be set to prevent notification event when vcpu B start
> >>> migration. And it only free the vector before it going to run in another pcpu.
> >>> There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> >>> starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> >>> A gets it.
> >> Yes, it do exist. But I think it should be ok even this happens.
> >>
> > Then it is OK to use global PI vector. The race should be dealt with
> > anyway.
> Or using lock can deal with it too.
>
> >>>>
> >>>>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> >>>>> detect new event during next vmentry.
> >>>> Yes, but the next vmentry may happen long time later and interrupt cannot
> > be
> >>> serviced until next vmentry. In current way, it will cause vmexit and
> >>> re-schedule the vcpu B. Vmentry will happen when scheduler will decide
> >>> that vcpu can run. There
> >> I don't know how scheduler can know the vcpu can run in this case, can
> >> you elaborate it? I thought it may have problems with global vector in
> >> some cases(maybe I am wrong, since I am not familiar with KVM
> >> scheduler): If target VCPU is in idle, and this notification event is
> >> consumed by other VCPU,
> > then how can scheduler know the vcpu is ready to run? Even there is a way for
> > scheduler to know, then when? Isn't it too late?
> >> If notification event arrived in hypervisor, then how the handler know which
> > VCPU the notification event belong to?
> > When vcpu is idle its thread sleeps inside host kernel (see
> > virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
> > you should call kvm_vcpu_kick(), but only after changing vcpu
> > state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
> > checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
> > there which checks apic IRR, but not PIR, so it is not enough to set
> > bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
> Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick when the waiting event happened to wake up the blocked vcpu, but this event is consumed by other vcpu without any chance for us to kick it. Then how it will move out from blocked list to run queue.
> BTW, what I am talking is for the interrupt from VT-d case. For virtual interrupt, I think global vector is ok.
> Also, the second problem is also about the VT-d case.
> When cpu is running in VM root mode, and then an notification event arrives, since all VCPU use the same notification vector, we cannot distinguish which VCPU the notification event want to deliver to. And we cannot put the right VCPU to run queue.

If 'kick is non-destructive of VCPU state, why not cycle to 'kick all the VCPU's so that you are guaranteed to get the one you want?
(Tell me I'm so, so wrong. :))
>
> >>
> >>> is no problem here. What you probably want to say is that vcpu may not be
> >>> aware of interrupt happening since it was migrated to different vcpu
> >>> just after PI IPI was sent and thus missed it. But than PIR interrupts
> >>> should be processed during vmentry on another pcpu:
> >>>
> >>> Sender: Guest:
> >>>
> >>> set pir
> >>> set on
> >>> if (vcpu in guest mode on pcpu1)
> >>> vmexit on pcpu1
> >>> vmentry on pcpu2
> >>> process pir, deliver interrupt
> >>> send PI IPI to pcpu1
> >>
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> + if (!cfg) {
> >>>>>>>> + free_irq_at(irq, NULL);
> >>>>>>>> + return 0;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + raw_spin_lock_irqsave(&vector_lock, flags);
> >>>>>>>> + if (!__assign_irq_vector(irq, cfg, mask))
> >>>>>>>> + ret = irq;
> >>>>>>>> + raw_spin_unlock_irqrestore(&vector_lock, flags);
> >>>>>>>> +
> >>>>>>>> + if (ret) {
> >>>>>>>> + irq_set_chip_data(irq, cfg);
> >>>>>>>> + irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >>>>>>>> + } else {
> >>>>>>>> + free_irq_at(irq, cfg);
> >>>>>>>> + }
> >>>>>>>> + return ret;
> >>>>>>>> +}
> >>>>>>>
> >>>>>>> This function is mostly cut&paste of create_irq_nr().
> >>>>>>
> >>>>>> Yes, this function allow to allocate vector from specified cpu.
> >>>>>>
> >>>>> Does not justify code duplication.
> >>>> ok. will change it in next version.
> >>>>
> >>> Please use single global vector in the next version.
> >>>
> >>>>>>>>
> >>>>>>>> if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>>>>>>> apic->vid_enabled = true;
> >>>>>>>> +
> >>>>>>>> + if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >>>>>>>> + apic->pi_enabled = true;
> >>>>>>>> +
> >>>>>>> This is global state, no need per apic variable.
> >>>> Even all vcpus use the same setting, but according to SDM, apicv really is a
> > per
> >>> apic variable.
> >>> It is not per vapic in our implementation and this is what is
> >>> important here.
> >>>
> >>>> Anyway, if you think we should not put it here, where is the best place?
> >>> It is not needed, just use has_posted_interrupt(vcpu) instead.
> >> ok
> >>
> >>>>
> >>>>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> >>>>> *vcpu,
> >>>>>>> int cpu)
> >>>>>>>> struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>>>>>>> unsigned long sysenter_esp;
> >>>>>>>> + if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>>>> + pi_set_on(to_vmx(vcpu)->pi);
> >>>>>>>> +
> >>>>>>> Why?
> >>>>>>
> >>>>>> Here means the vcpu start migration. So we should prevent the
> >>>>>> notification event until migration end.
> >>>>>>
> >>>>> You check for IN_GUEST_MODE while sending notification. Why is this not
> >>>> For interrupt from emulated device, it enough. But VT-d device doesn't
> > know
> >>> the vcpu is migrating, so set the on bit to prevent the notification event when
> >>> target vcpu is migrating.
> >>> Why should VT-d device care about that? It sets bits in pir and sends
> >>> IPI. If vcpu is running it process pir immediately, if not it will do it
> >>> during next vmentry.
> >> We already know the vcpu is not running(it will run soon), we can set this bit to
> > prevent the unnecessary IPI. We have IN_GUEST_MODE for that. And this is
> > the wrong place to indicate that vcpu is not running anyway. vcpu is not
> > running immediately after vmexit.
> But the VT-d chipset doesn't know. We need to set this bit to tell it.
>
> >>
> >>>>
> >>>>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
> >>>> I think the follow check can ensure the vcpu is in migration, am I wrong?
> >>>> if (vmx->loaded_vmcs->cpu != cpu)
> >>> This code checks that this vcpu ran on that pcpu last time.
> >> Yes, migration starts more earlier than here. But I think it should be
> >> ok to set ON bit here. Do you have any better idea?
> >>
> > If you want to prevent assigned device from sending IPI to non running
> > vcpu you should set the bit immediately after vmexit. For emulated
> > devices vcpu->mode should be used.
> No, if the reason of vmexit is waiting for interrupt from assigned device , then it will never have the chance to get this interrupt.
>
> >>>> {
> >>>> if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>> pi_set_on(to_vmx(vcpu)->pi);
> >>>> }
> >>>>
> >>>>>>>> + kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >>>>>>>> +
> >>>>>>>> kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> >>>>>>>> local_irq_disable();
> >>>>>>>> list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@
> >>>>>>>> -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu
> >>>>>>>> *vcpu) vcpu->cpu = -1; kvm_cpu_vmxoff(); }
> >>>>>>>> + if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>>>> + pi_set_on(to_vmx(vcpu)->pi);
> >>>>>>> Why?
> >>>>>>
> >>>>>> When vcpu schedule out, no need to send notification event to it, just set
> >>> the
> >>>>> PIR and wakeup it is enough.
> >>>>> Same as above. When vcpu is scheduled out it will no be in
> > IN_GUEST_MODE
> >>>> Right.
> >>>>
> >>>>> mode. Also in this case we probably should set bit directly in IRR and leave
> >>>>> PIR alone.
> >>>>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI
> > is
> >>> enabled, the IRR equal to (IRR | PIR). So there is no difference to
> >>> set IRR or PIR if target vcpu is not running. But there is a
> >>> difference for KVM code. For instance kvm_arch_vcpu_runnable() checks
> >>> for interrupts in IRR, but not PIR. Migration code does the same.
> >> Right. With PI, we need check the PIR too.
> >>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >>>>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> >>>>>>> vmcs_config *vmcs_conf)
> >>>>>>>> u32 _vmexit_control = 0;
> >>>>>>>> u32 _vmentry_control = 0;
> >>>>>>>> - min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; - opt =
> >>>>>>>> PIN_BASED_VIRTUAL_NMIS; - if (adjust_vmx_controls(min, opt,
> >>>>>>>> MSR_IA32_VMX_PINBASED_CTLS, - &_pin_based_exec_control) < 0)
> >>>>>>>> - return -EIO; -
> >>>>>>>> min = CPU_BASED_HLT_EXITING |
> >>>>>>>> #ifdef CONFIG_X86_64
> >>>>>>>> CPU_BASED_CR8_LOAD_EXITING |
> >>>>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> >>>>>>> vmcs_config *vmcs_conf)
> >>>>>>>> &_vmexit_control) < 0)
> >>>>>>>> return -EIO;
> >>>>>>>> + min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; + opt =
> >>>>>>>> PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; + if
> >>>>>>>> (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>>>> + &_pin_based_exec_control) < 0) + return -EIO; + + if
> >>>>>>>> (!(_cpu_based_2nd_exec_control &
> >>>>>>>> + SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || + !(_vmexit_control
> >>>>>>>> & VM_EXIT_ACK_INTR_ON_EXIT)) + _pin_based_exec_control &=
> >>>>>>>> ~PIN_BASED_POSTED_INTR; +
> >>>>>>>> min = 0; opt = VM_ENTRY_LOAD_IA32_PAT; if
> >>>>>>>> (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@
> >>>>>>>> -2715,6 +2783,9 @@ static __init int hardware_setup(void) if
> >>>>>>>> (!cpu_has_vmx_virtual_intr_delivery()) enable_apicv_vid =
> > 0;
> >>>>>>>> + if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> >>>>>>> In nested guest x2apic may be enabled without irq remapping. Check for
> >>>>>>> irq remapping here.
> >>>>>>
> >>>>>> There are no posted interrupt available in nested case. We don't need
> >>>>>> to check IR here.
> >>>>>>
> >>>>> One day emulation will be added. If pre-request for PI is IR check
> >>>>> for IR.
> >>>>
> >>>>
> >>>>> BTW why IR is needed for PI. To deliver assigned devices interrupts
> >>>>> directly into a guest sure, but why is it required for delivering
> >>>>> interrupts from emulated devices or IPIs?
> >>>> Posted Interrupt support is Xeon only and these platforms will have x2APIC.
> > So,
> >>> Linux will enable x2APIC on these platforms. So we only want to enable
> >>> PI when x2apic is enabled and IR is required for x2apic. The fact that
> >>> x2APIC is available on all platform that support PI is irrelevant. If
> >>> one is not strictly required by the other by architecture do not
> >>> couple them.
> >> Right. We only want to simply the implementation of enable "ack intr on exit".
> > If IR enabled, then don't need to check the trig mode(all interrupts are edge)
> > when using self IPI to regenerate the interrupt.
> > With Avi's suggestion self IPI is not needed. Drop this dependency if it
> > is not architectural.
> Ok, then we need to read the TMR and only send self IPI for edge interrupt.
>
> >>
> >>>>
> >>>>>>>
> >>>>>>>> + enable_apicv_pi = 0;
> >>>>>>>> +
> >>>>>>>> if (nested) nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93
> >>>>>>>> @@ static void ept_set_mmio_spte_mask(void)
> >>>>>>>> kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
> >>>>>>>> +irqreturn_t pi_handler(int irq, void *data)
> >>>>>>>> +{
> >>>>>>>> + struct vcpu_vmx *vmx = data;
> >>>>>>>> +
> >>>>>>>> + kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >>>>>>>> + kvm_vcpu_kick(&vmx->vcpu);
> >>>>>>>> +
> >>>>>>>> + return IRQ_HANDLED;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >>>>>>>> +{
> >>>>>>>> + return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >>>>>>>> +{
> >>>>>>>> + int ret = 0;
> >>>>>>>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>>>> +
> >>>>>>>> + if (!enable_apicv_pi)
> >>>>>>>> + return ;
> >>>>>>>> +
> >>>>>>>> + preempt_disable();
> >>>>>>>> + local_irq_disable();
> >>>>>>>> + if (!vmx->irq) {
> >>>>>>>> + ret = arch_pi_alloc_irq(vmx);
> >>>>>>>> + if (ret < 0) {
> >>>>>>>> + vmx->irq = -1;
> >>>>>>>> + goto out;
> >>>>>>>> + }
> >>>>>>>> + vmx->irq = ret;
> >>>>>>>> +
> >>>>>>>> + ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >>>>>>>> + "Posted Interrupt", vmx);
> >>>>>>>> + if (ret) {
> >>>>>>>> + vmx->irq = -1;
> >>>>>>>> + goto out;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + ret = arch_pi_get_vector(vmx->irq);
> >>>>>>>> + } else
> >>>>>>>> + ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >>>>>>>> +
> >>>>>>>> + if (ret < 0) {
> >>>>>>>> + vmx->irq = -1;
> >>>>>>>> + goto out;
> >>>>>>>> + } else {
> >>>>>>>> + vmx->vector = ret;
> >>>>>>>> + vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >>>>>>>> + pi_clear_on(vmx->pi);
> >>>>>>>> + }
> >>>>>>>> +out:
> >>>>>>>> + local_irq_enable();
> >>>>>>>> + preempt_enable();
> >>>>>>>> + return ;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >>>>>>>> + int vector)
> >>>>>>>> +{
> >>>>>>>> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>>>> +
> >>>>>>>> + if (unlikely(vmx->irq == -1))
> >>>>>>>> + return 0;
> >>>>>>>> +
> >>>>>>>> + if (vcpu->cpu == smp_processor_id()) {
> >>>>>>>> + pi_set_on(vmx->pi);
> >>>>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> >>>>>> Here means the target vcpu already in vmx non-root mode. Then it will
> >>>>> consume the interrupt on next vm entry and we don't need to send the
> >>>>> notification event from other cpu, just update PIR is enough.
> >>>>> I understand why you avoid sending PI IPI here, but you do not update
> >>>>> pir in this case either. You only set "on" bit here and set vector directly
> >>>>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> >>>>> Interrupt is delivered from IRR on the next entry.
> >>>> As I mentioned, IRR is basically same as PIR.
> >>>>
> >>> That does not explain why are you setting "on" without setting bit in pir.
> >> Right. Just set the PIR and return 1 is enough.
> >>
> >>>>>>
> >>>>>>>> + return 0; + } + + pi_set_pir(vector, vmx->pi); + if
> >>>>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >>>>>>>> + apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
> >>>>>>>> + return 1; + } + return 0; +} + +static void free_pi(struct
> >>>>>>>> vcpu_vmx *vmx) +{ + if (enable_apicv_pi) { + kfree(vmx->pi);
> >>>>>>>> + arch_pi_free_irq(vmx->irq, vmx); + } +} +
> >>>>>>>> /*
> >>>>>>>> * Sets up the vmcs for emulated real mode.
> >>>>>>>> */
> >>>>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>> *vmx)
> >>>>>>>> unsigned long a;
> >>>>>>>> #endif
> >>>>>>>> int i;
> >>>>>>>> + u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >>>>>>>>
> >>>>>>>> /* I/O */ vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
> >>>>>>>> -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>>>>> *vmx) vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>>>>>>>
> >>>>>>>> /* Control */
> >>>>>>>> - vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >>>>>>>> - vmcs_config.pin_based_exec_ctrl); + if (!enable_apicv_pi)
> >>>>>>>> + pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
> >>>>>>>> + vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >>>>>>>>
> >>>>>>>> vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> >>>>>>> vmx_exec_control(vmx));
> >>>>>>>>
> >>>>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>> *vmx)
> >>>>>>>> vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>>>>>> }
> >>>>>>>> + if (enable_apicv_pi) { + vmx->pi = kmalloc(sizeof(struct
> >>>>>>>> pi_desc), + GFP_KERNEL | __GFP_ZERO);
> >>>>>>>> + vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); + } +
> >>>>>>>> if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap);
> >>>>>>>> vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
> >>>>>>>> static void vmx_update_irq(struct kvm_vcpu *vcpu) if
> >>>>>>>> (!enable_apicv_vid) return ;
> >>>>>>>> + if (enable_apicv_pi) {
> >>>>>>>> + kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >>>>>>>> + pi_clear_on(vmx->pi);
> >>>>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
> >>>>>>> "on" bit is set?
> >>>>> Can you answer this question?
> >>>> No, vmentry do nothing for PI. Posted interrupt only happens when an
> >>> unmasked external interrupt arrived and the target vcpu is running.
> >>> Beyond that, cpu follow the old way.
> >>>>
> >>> Now that totally contradicts what you wrote above! (unless I
> >>> misunderstood you, in which case clarify please)
> >>>
> >>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if
> >>> PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
> >>> to set IRR or PIR if target vcpu is not running.
> >>> --
> >>> Gleb.
> >> Sorry, maybe I mislead you. VMentry have nothing to do PI.
> >> What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR
> > before each vmentry. So I think this two should be same in some levels. But
> > according your comments, it may wrong.
> >>
> > OK. Thanks for clarification.
> >
> > --
> > Gleb.
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> Best regards,
> Yang
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
 		 	   		  

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting
  2012-11-27 11:10                 ` Zhang, Yang Z
  2012-11-27 11:31                   ` Veruca Salt
@ 2012-11-27 11:46                   ` Gleb Natapov
  1 sibling, 0 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-27 11:46 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, mtosatti@redhat.com, Shan, Haitao,
	Zhang, Xiantao

On Tue, Nov 27, 2012 at 11:10:20AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2012-11-27:
> > On Tue, Nov 27, 2012 at 03:38:05AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2012-11-26:
> >>> On Mon, Nov 26, 2012 at 12:29:54PM +0000, Zhang, Yang Z wrote:
> >>>> Gleb Natapov wrote on 2012-11-26:
> >>>>> On Mon, Nov 26, 2012 at 03:51:04AM +0000, Zhang, Yang Z wrote:
> >>>>>> Gleb Natapov wrote on 2012-11-25:
> >>>>>>> On Wed, Nov 21, 2012 at 04:09:39PM +0800, Yang Zhang wrote:
> >>>>>>>> Posted Interrupt allows vAPICV interrupts to inject into guest directly
> >>>>>>>> without any vmexit.
> >>>>>>>> 
> >>>>>>>> - When delivering a interrupt to guest, if target vcpu is running,
> >>>>>>>>   update Posted-interrupt requests bitmap and send a notification
> >>>>>>>>   event to the vcpu. Then the vcpu will handle this interrupt
> >>>>>>>>   automatically, without any software involvemnt.
> >>>>>>> Looks like you allocating one irq vector per vcpu per pcpu and then
> >>>>>>> migrate it or reallocate when vcpu move from one pcpu to another.
> >>>>>>> This is not scalable and migrating irq migration slows things down.
> >>>>>>> What's wrong with allocating one global vector for posted interrupt
> >>>>>>> during vmx initialization and use it for all vcpus?
> >>>>>> 
> >>>>>> Consider the following situation:
> >>>>>> If vcpu A is running when notification event which belong to vcpu B is
> >>> arrived,
> >>>>> since the vector match the vcpu A's notification vector, then this event
> >>>>> will be consumed by vcpu A(even it do nothing) and the interrupt cannot
> >>>>> be handled in time. The exact same situation is possible with your code.
> >>>>> vcpu B can be migrated from pcpu and vcpu A will take its place and will
> >>>>> be assigned the same vector as vcpu B. But I fail to see why is this a
> >>>> No, the on bit will be set to prevent notification event when vcpu B start
> >>> migration. And it only free the vector before it going to run in another pcpu.
> >>> There is a race. Sender check on bit, vcpu B migrate to another pcpu and
> >>> starts running there, vcpu A takes vpuc's B vector, sender send PI vcpu
> >>> A gets it.
> >> Yes, it do exist. But I think it should be ok even this happens.
> >> 
> > Then it is OK to use global PI vector. The race should be dealt with
> > anyway.
> Or using lock can deal with it too.
>  
Last thing we want is to hold lock while injecting interrupt. Also Vt-d
HW will not be able to use the lock.

> >>>> 
> >>>>> problem. vcpu A will ignore PI since pir will be empty and vcpu B should
> >>>>> detect new event during next vmentry.
> >>>> Yes, but the next vmentry may happen long time later and interrupt cannot
> > be
> >>> serviced until next vmentry. In current way, it will cause vmexit and
> >>> re-schedule the vcpu B. Vmentry will happen when scheduler will decide
> >>> that vcpu can run. There
> >> I don't know how scheduler can know the vcpu can run in this case, can
> >> you elaborate it? I thought it may have problems with global vector in
> >> some cases(maybe I am wrong, since I am not familiar with KVM
> >> scheduler): If target VCPU is in idle, and this notification event is
> >> consumed by other VCPU,
> > then how can scheduler know the vcpu is ready to run? Even there is a way for
> > scheduler to know, then when? Isn't it too late?
> >> If notification event arrived in hypervisor, then how the handler know which
> > VCPU the notification event belong to?
> > When vcpu is idle its thread sleeps inside host kernel (see
> > virt/kvm/kvm_main.c:kvm_vcpu_block()). To get it out of sleep
> > you should call kvm_vcpu_kick(), but only after changing vcpu
> > state to make it runnable. arch/x86/kvm/x86.c:kvm_arch_vcpu_runnable()
> > checks if vcpu is runnable. Notice that we call kvm_cpu_has_interrupt()
> > there which checks apic IRR, but not PIR, so it is not enough to set
> > bit in PIR and call kvm_vcpu_kick() to wake up vcpu.
> Sorry, I cannot understand it. As you said, we need to call kvm_vcpu_kick when the waiting event happened to wake up the blocked vcpu, but this event is consumed by other vcpu without any chance for us to kick it. Then how it will move out from blocked list to run queue.
> BTW, what I am talking is for the interrupt from VT-d case. For virtual interrupt, I think global vector is ok.
> Also, the second problem is also about the VT-d case.
> When cpu is running in VM root mode, and then an notification event arrives, since all VCPU use the same notification vector, we cannot distinguish which VCPU the notification event want to deliver to. And we cannot put the right VCPU to run queue.
>  
There is not VT-d code in proposed patches (is there spec available about
how VT-d integrates with PI?), so discussion is purely theoretical. VT-d
device will have to be reprogrammed to generate regular interrupt when
vcpu thread goes to sleep. This interrupt will be injected by VFIO via
irqfd just like assigned devices do now. When vcpu becomes runnable VT-d
device is reprogrammed back to generate PI.

> >> 
> >>> is no problem here. What you probably want to say is that vcpu may not be
> >>> aware of interrupt happening since it was migrated to different vcpu
> >>> just after PI IPI was sent and thus missed it. But than PIR interrupts
> >>> should be processed during vmentry on another pcpu:
> >>> 
> >>> Sender:                             Guest:
> >>> 
> >>> set pir
> >>> set on
> >>> if (vcpu in guest mode on pcpu1)
> >>>                                    vmexit on pcpu1
> >>>                                    vmentry on pcpu2
> >>>                                    process pir, deliver interrupt
> >>> 	send PI IPI to pcpu1
> >> 
> >>>> 
> >>>>>> 
> >>>>>>> 
> >>>>>>>> +	if (!cfg) {
> >>>>>>>> +		free_irq_at(irq, NULL);
> >>>>>>>> +		return 0;
> >>>>>>>> +	}
> >>>>>>>> +
> >>>>>>>> +	raw_spin_lock_irqsave(&vector_lock, flags);
> >>>>>>>> +	if (!__assign_irq_vector(irq, cfg, mask))
> >>>>>>>> +		ret = irq;
> >>>>>>>> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> >>>>>>>> +
> >>>>>>>> +	if (ret) {
> >>>>>>>> +		irq_set_chip_data(irq, cfg);
> >>>>>>>> +		irq_clear_status_flags(irq, IRQ_NOREQUEST);
> >>>>>>>> +	} else {
> >>>>>>>> +		free_irq_at(irq, cfg);
> >>>>>>>> +	}
> >>>>>>>> +	return ret;
> >>>>>>>> +}
> >>>>>>> 
> >>>>>>> This function is mostly cut&paste of create_irq_nr().
> >>>>>> 
> >>>>>> Yes, this function allow to allocate vector from specified cpu.
> >>>>>> 
> >>>>> Does not justify code duplication.
> >>>> ok. will change it in next version.
> >>>> 
> >>> Please use single global vector in the next version.
> >>> 
> >>>>>>>> 
> >>>>>>>>  	if (kvm_x86_ops->has_virtual_interrupt_delivery(vcpu))
> >>>>>>>>  		apic->vid_enabled = true;
> >>>>>>>> +
> >>>>>>>> +	if (kvm_x86_ops->has_posted_interrupt(vcpu))
> >>>>>>>> +		apic->pi_enabled = true;
> >>>>>>>> +
> >>>>>>> This is global state, no need per apic variable.
> >>>> Even all vcpus use the same setting, but according to SDM, apicv really is a
> > per
> >>> apic variable.
> >>> It is not per vapic in our implementation and this is what is
> >>> important here.
> >>> 
> >>>> Anyway, if you think we should not put it here, where is the best place?
> >>> It is not needed, just use has_posted_interrupt(vcpu) instead.
> >> ok
> >> 
> >>>> 
> >>>>>>>> @@ -1555,6 +1611,11 @@ static void vmx_vcpu_load(struct kvm_vcpu
> >>>>> *vcpu,
> >>>>>>> int cpu)
> >>>>>>>>  		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> >>>>>>>>  		unsigned long sysenter_esp;
> >>>>>>>> +		if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>>>> +			pi_set_on(to_vmx(vcpu)->pi);
> >>>>>>>> +
> >>>>>>> Why?
> >>>>>> 
> >>>>>> Here means the vcpu start migration. So we should prevent the
> >>>>>> notification event until migration end.
> >>>>>> 
> >>>>> You check for IN_GUEST_MODE while sending notification. Why is this not
> >>>> For interrupt from emulated device, it enough. But VT-d device doesn't
> > know
> >>> the vcpu is migrating, so set the on bit to prevent the notification event when
> >>> target vcpu is migrating.
> >>> Why should VT-d device care about that? It sets bits in pir and sends
> >>> IPI. If vcpu is running it process pir immediately, if not it will do it
> >>> during next vmentry.
> >> We already know the vcpu is not running(it will run soon), we can set this bit to
> > prevent the unnecessary IPI. We have IN_GUEST_MODE for that. And this is
> > the wrong place to indicate that vcpu is not running anyway. vcpu is not
> > running immediately after vmexit.
> But the VT-d chipset doesn't know. We need to set this bit to tell it.
>  
You are trying to optimize here. Let it generate interrupt that will be
ignored and optimize later. Setting bit here does not make sense for
optimization you are trying to do because call to vcpu_load() means that
vcpu will likely enter guest mode in the near feature, but it was not
running before that point, so setting it here is kinda later.


> >> 
> >>>> 
> >>>>> enough? Also why vmx_vcpu_load() call means that vcpu start migration?
> >>>> I think the follow check can ensure the vcpu is in migration, am I wrong?
> >>>> if (vmx->loaded_vmcs->cpu != cpu)
> >>> This code checks that this vcpu ran on that pcpu last time.
> >> Yes, migration starts more earlier than here. But I think it should be
> >> ok to set ON bit here. Do you have any better idea?
> >> 
> > If you want to prevent assigned device from sending IPI to non running
> > vcpu you should set the bit immediately after vmexit. For emulated
> > devices vcpu->mode should be used.
> No, if the reason of vmexit is waiting for interrupt from assigned device , then it will never have the chance to get this interrupt.
>  
Exactly what will happen with your code. Consider vcpu that executed HLT
instruction. If, for some reason, it does exit to userspace while
halted, on the next ioctl(VM_RUN) "on" will be set by vmx_vcpu_load()
vcpu thread will return to kvm_vcpu_block() and VT-d interrupt will
never be triggered.

As I said above solution may be to reprogram VT-d device to regular interrupt
while vcpu is halted.

> >>>> {
> >>>> 	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>> 	pi_set_on(to_vmx(vcpu)->pi);
> >>>> }
> >>>> 
> >>>>>>>> +		kvm_make_request(KVM_REQ_POSTED_INTR, vcpu);
> >>>>>>>> +
> >>>>>>>>  		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> >>>>>>>>  	local_irq_disable();
> >>>>>>>>  		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link, @@
> >>>>>>>>  -1582,6 +1643,8 @@ static void vmx_vcpu_put(struct kvm_vcpu
> >>>>>>>>  *vcpu) 	vcpu->cpu = -1; 		kvm_cpu_vmxoff(); 	}
> >>>>>>>> +	if (enable_apicv_pi && to_vmx(vcpu)->pi)
> >>>>>>>> +		pi_set_on(to_vmx(vcpu)->pi);
> >>>>>>> Why?
> >>>>>> 
> >>>>>> When vcpu schedule out, no need to send notification event to it, just set
> >>> the
> >>>>> PIR and wakeup it is enough.
> >>>>> Same as above. When vcpu is scheduled out it will no be in
> > IN_GUEST_MODE
> >>>> Right.
> >>>> 
> >>>>> mode. Also in this case we probably should set bit directly in IRR and leave
> >>>>> PIR alone.
> >>>>> From the view of hypervisor, IRR and PIR are same. For each vmentry, if PI
> > is
> >>> enabled, the IRR equal to (IRR | PIR). So there is no difference to
> >>> set IRR or PIR if target vcpu is not running. But there is a
> >>> difference for KVM code. For instance kvm_arch_vcpu_runnable() checks
> >>> for interrupts in IRR, but not PIR. Migration code does the same.
> >> Right. With PI, we need check the PIR too.
> >> 
> >>>> 
> >>>>> 
> >>>>>> 
> >>>>>>>>  }
> >>>>>>>>  
> >>>>>>>>  static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
> >>>>>>>> @@ -2451,12 +2514,6 @@ static __init int setup_vmcs_config(struct
> >>>>>>> vmcs_config *vmcs_conf)
> >>>>>>>>  	u32 _vmexit_control = 0;
> >>>>>>>>  	u32 _vmentry_control = 0;
> >>>>>>>> -	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; -	opt =
> >>>>>>>> PIN_BASED_VIRTUAL_NMIS; -	if (adjust_vmx_controls(min, opt,
> >>>>>>>> MSR_IA32_VMX_PINBASED_CTLS, -				&_pin_based_exec_control) < 0)
> >>>>>>>> -		return -EIO; -
> >>>>>>>>  	min = CPU_BASED_HLT_EXITING |
> >>>>>>>>  #ifdef CONFIG_X86_64
> >>>>>>>>  	      CPU_BASED_CR8_LOAD_EXITING |
> >>>>>>>> @@ -2531,6 +2588,17 @@ static __init int setup_vmcs_config(struct
> >>>>>>> vmcs_config *vmcs_conf)
> >>>>>>>>  				&_vmexit_control) < 0)
> >>>>>>>>  		return -EIO;
> >>>>>>>> +	min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING; +	opt =
> >>>>>>>> PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR; +	if
> >>>>>>>> (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
> >>>>>>>> +				&_pin_based_exec_control) < 0) +		return -EIO; + +	if
> >>>>>>>> (!(_cpu_based_2nd_exec_control &
> >>>>>>>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) || +		!(_vmexit_control
> >>>>>>>> & VM_EXIT_ACK_INTR_ON_EXIT)) +		_pin_based_exec_control &=
> >>>>>>>> ~PIN_BASED_POSTED_INTR; +
> >>>>>>>>  	min = 0; 	opt = VM_ENTRY_LOAD_IA32_PAT; 	if
> >>>>>>>>  (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS, @@
> >>>>>>>>  -2715,6 +2783,9 @@ static __init int hardware_setup(void) 	if
> >>>>>>>>  (!cpu_has_vmx_virtual_intr_delivery()) 		enable_apicv_vid =
> > 0;
> >>>>>>>> +	if (!cpu_has_vmx_posted_intr() || !x2apic_enabled())
> >>>>>>> In nested guest x2apic may be enabled without irq remapping. Check for
> >>>>>>> irq remapping here.
> >>>>>> 
> >>>>>> There are no posted interrupt available in nested case. We don't need
> >>>>>> to check IR here.
> >>>>>> 
> >>>>> One day emulation will be added. If pre-request for PI is IR check
> >>>>> for IR.
> >>>> 
> >>>> 
> >>>>> BTW why IR is needed for PI. To deliver assigned devices interrupts
> >>>>> directly into a guest sure, but why is it required for delivering
> >>>>> interrupts from emulated devices or IPIs?
> >>>> Posted Interrupt support is Xeon only and these platforms will have x2APIC.
> > So,
> >>> Linux will enable x2APIC on these platforms. So we only want to enable
> >>> PI when x2apic is enabled and IR is required for x2apic. The fact that
> >>> x2APIC is available on all platform that support PI is irrelevant. If
> >>> one is not strictly required by the other by architecture do not
> >>> couple them.
> >> Right. We only want to simply the implementation of enable "ack intr on exit".
> > If IR enabled, then don't need to check the trig mode(all interrupts are edge)
> > when using self IPI to regenerate the interrupt.
> > With Avi's suggestion self IPI is not needed. Drop this dependency if it
> > is not architectural.
> Ok, then we need to read the TMR and only send self IPI for edge interrupt.  
>  
Sounds OK. It can be optimized later by checking with ioapic code. The
configuration rarely changes, so it is possible to avoid reading TMR
register on each interrupt.

> >> 
> >>>> 
> >>>>>>> 
> >>>>>>>> +		enable_apicv_pi = 0;
> >>>>>>>> +
> >>>>>>>>  	if (nested) 		nested_vmx_setup_ctls_msrs(); @@ -3881,6 +3952,93
> >>>>>>>>  @@ static void ept_set_mmio_spte_mask(void)
> >>>>>>>>  	kvm_mmu_set_mmio_spte_mask(0xffull << 49 | 0x6ull); }
> >>>>>>>> +irqreturn_t pi_handler(int irq, void *data)
> >>>>>>>> +{
> >>>>>>>> +	struct vcpu_vmx *vmx = data;
> >>>>>>>> +
> >>>>>>>> +	kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
> >>>>>>>> +	kvm_vcpu_kick(&vmx->vcpu);
> >>>>>>>> +
> >>>>>>>> +	return IRQ_HANDLED;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int vmx_has_posted_interrupt(struct kvm_vcpu *vcpu)
> >>>>>>>> +{
> >>>>>>>> +	return irqchip_in_kernel(vcpu->kvm) && enable_apicv_pi;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static void vmx_pi_migrate(struct kvm_vcpu *vcpu)
> >>>>>>>> +{
> >>>>>>>> +	int ret = 0;
> >>>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>>>> +
> >>>>>>>> +	if (!enable_apicv_pi)
> >>>>>>>> +		return ;
> >>>>>>>> +
> >>>>>>>> +	preempt_disable();
> >>>>>>>> +	local_irq_disable();
> >>>>>>>> +	if (!vmx->irq) {
> >>>>>>>> +		ret = arch_pi_alloc_irq(vmx);
> >>>>>>>> +		if (ret < 0) {
> >>>>>>>> +			vmx->irq = -1;
> >>>>>>>> +			goto out;
> >>>>>>>> +		}
> >>>>>>>> +		vmx->irq = ret;
> >>>>>>>> +
> >>>>>>>> +		ret = request_irq(vmx->irq, pi_handler, IRQF_NO_THREAD,
> >>>>>>>> +					"Posted Interrupt", vmx);
> >>>>>>>> +		if (ret) {
> >>>>>>>> +			vmx->irq = -1;
> >>>>>>>> +			goto out;
> >>>>>>>> +		}
> >>>>>>>> +
> >>>>>>>> +		ret = arch_pi_get_vector(vmx->irq);
> >>>>>>>> +	} else
> >>>>>>>> +		ret = arch_pi_migrate(vmx->irq, smp_processor_id());
> >>>>>>>> +
> >>>>>>>> +	if (ret < 0) {
> >>>>>>>> +		vmx->irq = -1;
> >>>>>>>> +		goto out;
> >>>>>>>> +	} else {
> >>>>>>>> +		vmx->vector = ret;
> >>>>>>>> +		vmcs_write16(POSTED_INTR_NV, vmx->vector);
> >>>>>>>> +		pi_clear_on(vmx->pi);
> >>>>>>>> +	}
> >>>>>>>> +out:
> >>>>>>>> +	local_irq_enable();
> >>>>>>>> +	preempt_enable();
> >>>>>>>> +	return ;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int vmx_send_nv(struct kvm_vcpu *vcpu,
> >>>>>>>> +		int vector)
> >>>>>>>> +{
> >>>>>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>>>>> +
> >>>>>>>> +	if (unlikely(vmx->irq == -1))
> >>>>>>>> +		return 0;
> >>>>>>>> +
> >>>>>>>> +	if (vcpu->cpu == smp_processor_id()) {
> >>>>>>>> +		pi_set_on(vmx->pi);
> >>>>>>> Why? You clear this bit anyway in vmx_update_irq() during guest entry.
> >>>>>> Here means the target vcpu already in vmx non-root mode. Then it will
> >>>>> consume the interrupt on next vm entry and we don't need to send the
> >>>>> notification event from other cpu, just update PIR is enough.
> >>>>> I understand why you avoid sending PI IPI here, but you do not update
> >>>>> pir in this case either. You only set "on" bit here and set vector directly
> >>>>> in IRR in __apic_accept_irq() since vmx_send_nv() returns 0 in this case.
> >>>>> Interrupt is delivered from IRR on the next entry.
> >>>> As I mentioned, IRR is basically same as PIR.
> >>>> 
> >>> That does not explain why are you setting "on" without setting bit in pir.
> >> Right. Just set the PIR and return 1 is enough.
> >> 
> >>>>>> 
> >>>>>>>> +		return 0; +	} + +	pi_set_pir(vector, vmx->pi); +	if
> >>>>>>>> (!pi_test_and_set_on(vmx->pi) && (vcpu->mode == IN_GUEST_MODE)) {
> >>>>>>>> +		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), vmx->vector);
> >>>>>>>> +		return 1; +	} +	return 0; +} + +static void free_pi(struct
> >>>>>>>> vcpu_vmx *vmx) +{ +	if (enable_apicv_pi) { + 	kfree(vmx->pi);
> >>>>>>>> +		arch_pi_free_irq(vmx->irq, vmx); +	} +} +
> >>>>>>>>  /*
> >>>>>>>>   * Sets up the vmcs for emulated real mode.
> >>>>>>>>   */
> >>>>>>>> @@ -3890,6 +4048,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>> *vmx)
> >>>>>>>>  	unsigned long a;
> >>>>>>>>  #endif
> >>>>>>>>  	int i;
> >>>>>>>> +	u32 pin_based_exec_ctrl = vmcs_config.pin_based_exec_ctrl;
> >>>>>>>> 
> >>>>>>>>  	/* I/O */ 	vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); @@
> >>>>>>>>  -3901,8 +4060,10 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>>>>>  *vmx) 	vmcs_write64(VMCS_LINK_POINTER, -1ull); /* 22.3.1.5 */
> >>>>>>>>  
> >>>>>>>>  	/* Control */
> >>>>>>>> -	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
> >>>>>>>> -		vmcs_config.pin_based_exec_ctrl); +	if (!enable_apicv_pi)
> >>>>>>>> +		pin_based_exec_ctrl &= ~PIN_BASED_POSTED_INTR; +
> >>>>>>>> +	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, pin_based_exec_ctrl);
> >>>>>>>> 
> >>>>>>>>  	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL,
> >>>>>>> vmx_exec_control(vmx));
> >>>>>>>> 
> >>>>>>>> @@ -3920,6 +4081,12 @@ static int vmx_vcpu_setup(struct vcpu_vmx
> >>>>> *vmx)
> >>>>>>>>  		vmcs_write16(GUEST_INTR_STATUS, 0);
> >>>>>>>>  	}
> >>>>>>>> +	if (enable_apicv_pi) { +		vmx->pi = kmalloc(sizeof(struct
> >>>>>>>> pi_desc), +				GFP_KERNEL | __GFP_ZERO);
> >>>>>>>> +		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((vmx->pi))); +	} +
> >>>>>>>>  	if (ple_gap) { 		vmcs_write32(PLE_GAP, ple_gap);
> >>>>>>>>  		vmcs_write32(PLE_WINDOW, ple_window); @@ -6161,6 +6328,11 @@
> >>>>>>>>  static void vmx_update_irq(struct kvm_vcpu *vcpu) 	if
> >>>>>>>>  (!enable_apicv_vid) 		return ;
> >>>>>>>> +	if (enable_apicv_pi) {
> >>>>>>>> +		kvm_apic_update_irr(vcpu, (unsigned int *)vmx->pi->pir);
> >>>>>>>> +		pi_clear_on(vmx->pi);
> >>>>>>> Why do you do that? Isn't VMX process posted interrupts on vmentry if
> >>>>>>> "on" bit is set?
> >>>>> Can you answer this question?
> >>>> No, vmentry do nothing for PI. Posted interrupt only happens when an
> >>> unmasked external interrupt arrived and the target vcpu is running.
> >>> Beyond that, cpu follow the old way.
> >>>> 
> >>> Now that totally contradicts what you wrote above! (unless I
> >>> misunderstood you, in which case clarify please)
> >>> 
> >>>   From the view of hypervisor, IRR and PIR are same. For each vmentry, if
> >>>   PI is enabled, the IRR equal to (IRR | PIR). So there is no difference
> >>>   to set IRR or PIR if target vcpu is not running.
> >>> --
> >>> 			Gleb.
> >> Sorry, maybe I mislead you. VMentry have nothing to do PI.
> >> What I mean" IRR and PIR are same" is because this patch will copy PIR to IRR
> > before each vmentry. So I think this two should be same in some levels. But
> > according your comments, it may wrong.
> >> 
> > OK. Thanks for clarification.
> > 
> > --
> > 			Gleb.
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> Best regards,
> Yang
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC
  2012-11-21  8:09 ` [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC Yang Zhang
@ 2012-11-28 10:50   ` Gleb Natapov
  0 siblings, 0 replies; 29+ messages in thread
From: Gleb Natapov @ 2012-11-28 10:50 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, mtosatti

On Wed, Nov 21, 2012 at 04:09:34PM +0800, Yang Zhang wrote:
> When PIT connects to IOAPIC, it route to pin 2 not pin 0.
> 
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  virt/kvm/ioapic.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index cfb7e4d..166c450 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -181,7 +181,7 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
>  
>  #ifdef CONFIG_X86
>  	/* Always delivery PIT interrupt to vcpu 0 */
> -	if (irq == 0) {
> +	if (irq == 2) {
Hmm, this means all this time the code didn't work correctly which makes
me wonder if we need this hack at all.

>  		irqe.dest_mode = 0; /* Physical mode. */
>  		/* need to read apic_id from apic regiest since
>  		 * it can be rewritten */
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2012-11-28 11:32 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-21  8:09 [PATCH v2 0/6] x86, apicv: Add APIC virtualizatin support Yang Zhang
2012-11-21  8:09 ` [PATCH v2 1/6] x86: PIT connects to pin 2 of IOAPIC Yang Zhang
2012-11-28 10:50   ` Gleb Natapov
2012-11-21  8:09 ` [PATCH v2 2/6] x86, apicv: add APICv register virtualization support Yang Zhang
2012-11-21  8:09 ` [PATCH v2 3/6] x86, apicv: add virtual interrupt delivery support Yang Zhang
2012-11-22 13:57   ` Gleb Natapov
2012-11-23 11:46     ` Zhang, Yang Z
2012-11-25  8:53       ` Gleb Natapov
2012-11-21  8:09 ` [PATCH v2 4/6] x86, apicv: add virtual x2apic support Yang Zhang
2012-11-21  8:09 ` [PATCH v2 5/6] x86: Enable ack interrupt on vmexit Yang Zhang
2012-11-22 15:22   ` Gleb Natapov
2012-11-23  5:41     ` Zhang, Yang Z
2012-11-25 13:30       ` Gleb Natapov
2012-11-25 12:55     ` Avi Kivity
2012-11-25 13:03       ` Gleb Natapov
2012-11-25 13:11         ` Avi Kivity
2012-11-26  5:44           ` Zhang, Yang Z
2012-11-26  9:17             ` Gleb Natapov
2012-11-21  8:09 ` [PATCH v2 6/6] x86, apicv: Add Posted Interrupt supporting Yang Zhang
2012-11-25 12:39   ` Gleb Natapov
2012-11-26  3:51     ` Zhang, Yang Z
2012-11-26 10:01       ` Gleb Natapov
2012-11-26 12:29         ` Zhang, Yang Z
2012-11-26 13:48           ` Gleb Natapov
2012-11-27  3:38             ` Zhang, Yang Z
2012-11-27  9:16               ` Gleb Natapov
2012-11-27 11:10                 ` Zhang, Yang Z
2012-11-27 11:31                   ` Veruca Salt
2012-11-27 11:46                   ` Gleb Natapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).