[PATCH v9 0/3] x86, apicv: Add APIC virtualization support

public inbox for kvm@vger.kernel.org

* [PATCH v9 0/3] x86, apicv: Add APIC virtualization support
@ 2013-01-10  7:26 Yang Zhang
  2013-01-10  7:26 ` [PATCH v9 1/3] x86, apicv: add APICv register " Yang Zhang
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Yang Zhang @ 2013-01-10  7:26 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, mtosatti, Yang Zhang

From: Yang Zhang <yang.z.zhang@Intel.com>

APIC virtualization is a new feature that can eliminate most of the VM exits
that occur when a vcpu handles an interrupt:

APIC register virtualization:
        APIC read access doesn't cause APIC-access VM exits.
        APIC write becomes trap-like.

Virtual interrupt delivery:
        Virtual interrupt delivery removes the need for KVM to inject vAPIC
        interrupts manually; injection is handled entirely by the hardware.
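
As a rough illustration of what "trap-like" means for APIC writes, the sketch
below models a write exit outside the kernel. It assumes the exit reports the
register offset and that the guest's store has already landed in the
virtual-APIC page, so the handler only reads the value back and replays the
side effects. `struct vapic`, `vapic_emulate_side_effects()` and
`vapic_write_trap()` are hypothetical names for this example, not KVM code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define APIC_PAGE_SIZE 4096u
#define APIC_REG_MASK  0xff0u	/* APIC registers are 16-byte aligned */

/* Hypothetical stand-in for the virtual-APIC page plus a side-effect log. */
struct vapic {
	uint8_t  page[APIC_PAGE_SIZE];
	uint32_t last_offset;	/* last register whose side effects ran */
	uint32_t last_value;	/* value that was replayed */
	unsigned replay_count;
};

/* Hypothetical hook: a hypervisor would emulate the register side effects here. */
static void vapic_emulate_side_effects(struct vapic *v, uint32_t offset,
				       uint32_t val)
{
	v->last_offset = offset;
	v->last_value = val;
	v->replay_count++;
}

/*
 * Trap-like handling: the store has already completed, so the value is
 * read back from the page instead of decoding the guest instruction.
 */
static void vapic_write_trap(struct vapic *v, uint64_t exit_qualification)
{
	uint32_t offset = (uint32_t)(exit_qualification & 0xfff) & APIC_REG_MASK;
	uint32_t val;

	assert(offset + sizeof(val) <= APIC_PAGE_SIZE);
	memcpy(&val, &v->page[offset], sizeof(val));
	vapic_emulate_side_effects(v, offset, val);
}
```

Because the exit is taken after the write completes, the handler does not need
to advance the guest instruction pointer.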

Please refer to Intel SDM volume 3, chapter 29 for more details.

Changes v8 to v9:
 * Let each vcpu update its own eoi exit bitmap.
 * Enable virtualized x2apic mode when the guest is using x2apic.
 * Rebased on top of KVM upstream.

Changes v7 to v8:
 * Following Marcelo's suggestion, add comments for irr_pending and isr_count,
   since the two variables have different meanings when using apicv.
 * Set SVI to the highest bit set in vISR after migration.
 * Use a spinlock to access the eoi exit bitmap synchronously.
 * Enable virtualized x2apic mode when the guest is using x2apic.
 * Rebased on top of KVM upstream.

Changes v6 to v7:
 * Fix a bug when setting the exit bitmap.
 * Rebased on top of KVM upstream.

Changes v5 to v6:
 * Minor adjustments according to Gleb's comments.
 * Rebased on top of KVM upstream.

Changes v4 to v5:
 * Set eoi exit bitmap when an interrupt has notifier registered.
 * Use request to track ioapic entry's modification.
 * Rebased on top of KVM upstream.

Changes v3 to v4:
 * Use one option to control both register virtualization and virtual interrupt
   delivery.
 * Update the eoi exit bitmap when programming the ioapic or the apic's id/dfr/ldr.
 * Rebased on top of KVM upstream.

Changes v2 to v3:
 * Drop the Posted Interrupt patch from v3.
   Following Gleb's suggestion, we will use a global vector for all VCPUs as the
   notification event vector, so the Posted Interrupt patch will be rewritten and
   resent later.
 * Use TMR to set the eoi exiting bitmap. We only want to set the eoi exiting
   bitmap for interrupts that are level-triggered or have a notifier in the EOI
   write path, so TMR is enough to distinguish the interrupt trigger mode.
 * Simplify some code according to Gleb's comments.
 * Rebased on top of KVM upstream.

Changes v1 to v2:
 * Add Posted Interrupt support to this patch series.
 * Since there is a notifier hook in vAPIC EOI for the PIT interrupt, always set
   the PIT interrupt in the eoi exit bitmap to force a VM exit on EOI.
 * Rebased on top of KVM upstream.

Yang Zhang (3):
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual x2apic support
  x86, apicv: add virtual interrupt delivery support

 arch/x86/include/asm/kvm_host.h |    7 +
 arch/x86/include/asm/vmx.h      |   14 ++
 arch/x86/kvm/irq.c              |   56 +++++-
 arch/x86/kvm/lapic.c            |   92 ++++++---
 arch/x86/kvm/lapic.h            |   25 +++
 arch/x86/kvm/svm.c              |   24 +++
 arch/x86/kvm/vmx.c              |  404 ++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   14 ++-
 include/linux/kvm_host.h        |    3 +
 virt/kvm/ioapic.c               |   18 ++
 virt/kvm/ioapic.h               |    4 +
 virt/kvm/irq_comm.c             |   22 ++
 virt/kvm/kvm_main.c             |    5 +
 13 files changed, 643 insertions(+), 45 deletions(-)


* [PATCH v9 1/3] x86, apicv: add APICv register virtualization support
  2013-01-10  7:26 [PATCH v9 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
@ 2013-01-10  7:26 ` Yang Zhang
  2013-01-10 20:25   ` Marcelo Tosatti
  2013-01-10  7:26 ` [PATCH v9 2/3] x86, apicv: add virtual x2apic support Yang Zhang
  2013-01-10  7:26 ` [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2 siblings, 1 reply; 26+ messages in thread
From: Yang Zhang @ 2013-01-10  7:26 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, mtosatti, Yang Zhang, Kevin Tian

- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
---
 arch/x86/include/asm/vmx.h |    2 ++
 arch/x86/kvm/lapic.c       |   15 +++++++++++++++
 arch/x86/kvm/lapic.h       |    2 ++
 arch/x86/kvm/vmx.c         |   33 ++++++++++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index e385df9..44c3f7e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
+#define EXIT_REASON_APIC_WRITE          56
 #define EXIT_REASON_INVPCID             58
 
 #define VMX_EXIT_REASONS \
@@ -141,6 +142,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9392f52..0664c13 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1212,6 +1212,21 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 
+/* emulate APIC access in a trap manner */
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+	u32 val = 0;
+
+	/* hw has done the conditional check and inst decode */
+	offset &= 0xff0;
+
+	apic_reg_read(vcpu->arch.apic, offset, 4, &val);
+
+	/* TODO: optimize to just emulate side effect w/o one more write */
+	apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e5ebf9f..9a8ee22 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 55dfc37..688f43f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -84,6 +84,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg_vid = 1;
+module_param(enable_apicv_reg_vid, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -764,6 +767,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2541,7 +2550,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_UNRESTRICTED_GUEST |
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
-			SECONDARY_EXEC_ENABLE_INVPCID;
+			SECONDARY_EXEC_ENABLE_INVPCID |
+			SECONDARY_EXEC_APIC_REGISTER_VIRT;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2552,6 +2562,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
 		_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+		_cpu_based_2nd_exec_control &= ~(
+				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
 		   enabled */
@@ -2749,6 +2764,9 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
+	if (!cpu_has_vmx_apic_register_virt())
+		enable_apicv_reg_vid = 0;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3835,6 +3853,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
 	if (!ple_gap)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+	if (!enable_apicv_reg_vid)
+		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
 	return exec_control;
 }
 
@@ -4796,6 +4816,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_write(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	u32 offset = exit_qualification & 0xfff;
+
+	/* APIC-write VM exit is trap-like and thus no need to adjust IP */
+	kvm_apic_write_nodecode(vcpu, offset);
+	return 1;
+}
+
 static int handle_task_switch(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5730,6 +5760,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_VMON]                    = handle_vmon,
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
+	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
-- 
1.7.1



* [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  7:26 [PATCH v9 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
  2013-01-10  7:26 ` [PATCH v9 1/3] x86, apicv: add APICv register " Yang Zhang
@ 2013-01-10  7:26 ` Yang Zhang
  2013-01-10  7:55   ` Gleb Natapov
  2013-01-10  8:25   ` Gleb Natapov
  2013-01-10  7:26 ` [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2 siblings, 2 replies; 26+ messages in thread
From: Yang Zhang @ 2013-01-10  7:26 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, mtosatti, Yang Zhang, Kevin Tian

From: Yang Zhang <yang.z.zhang@Intel.com>

To benefit from apicv, we need to enable virtualized x2apic mode.
Currently, we only enable it when the guest is actually using x2apic.

Also, clear the MSR bitmap for the corresponding x2apic MSRs when the guest
enables x2apic:
    0x800 - 0x8ff: no read intercept for apicv register virtualization,
    		   except APIC ID and TMCCT.
    APIC ID and TMCCT: need software's assistance to get the right value.
    TPR, EOI, SELF-IPI: no write intercept for virtual interrupt delivery.
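
The intercept rules above can be sketched as a small user-space model of a
4 KiB MSR bitmap. It assumes the layout used by the patch below: read-low at
byte offset 0x000, read-high at 0x400, write-low at 0x800 and write-high at
0xc00, with a set bit meaning "intercept". The helper names
(`msr_set_intercept()`, `msr_intercepted()`, `x2apic_passthrough()`) are
hypothetical and only illustrate the bit arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MSR_TYPE_R 1
#define MSR_TYPE_W 2

/* Byte offsets of the four 1 KiB regions inside the bitmap page. */
#define READ_LOW   0x000u
#define READ_HIGH  0x400u
#define WRITE_LOW  0x800u
#define WRITE_HIGH 0xc00u

static void bitmap_assign(uint8_t *region, uint32_t bit, bool set)
{
	assert(bit <= 0x1fff);
	if (set)
		region[bit / 8] |= (uint8_t)(1u << (bit % 8));
	else
		region[bit / 8] &= (uint8_t)~(1u << (bit % 8));
}

/* A set bit means "intercept"; MSRs outside both ranges are not covered. */
static void msr_set_intercept(uint8_t *bitmap, uint32_t msr, int type, bool set)
{
	uint32_t rd = READ_LOW, wr = WRITE_LOW;

	if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
		rd = READ_HIGH;
		wr = WRITE_HIGH;
		msr &= 0x1fff;
	} else if (msr > 0x1fff) {
		return;
	}
	if (type & MSR_TYPE_R)
		bitmap_assign(bitmap + rd, msr, set);
	if (type & MSR_TYPE_W)
		bitmap_assign(bitmap + wr, msr, set);
}

/* Query one access type (MSR_TYPE_R or MSR_TYPE_W) for one MSR. */
static bool msr_intercepted(const uint8_t *bitmap, uint32_t msr, int type)
{
	uint32_t base = (type == MSR_TYPE_W) ? WRITE_LOW : READ_LOW;

	if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
		base += 0x400u;
		msr &= 0x1fff;
	} else if (msr > 0x1fff) {
		return true;	/* not covered by the bitmap */
	}
	return (bitmap[base + msr / 8] & (1u << (msr % 8))) != 0;
}

/* Apply the x2apic rules listed in the commit message. */
static void x2apic_passthrough(uint8_t *bitmap)
{
	uint32_t msr;

	for (msr = 0x800; msr <= 0x8ff; msr++)
		msr_set_intercept(bitmap, msr, MSR_TYPE_R, false);
	msr_set_intercept(bitmap, 0x802, MSR_TYPE_R, true);	/* APIC ID */
	msr_set_intercept(bitmap, 0x839, MSR_TYPE_R, true);	/* TMCCT */
	msr_set_intercept(bitmap, 0x808, MSR_TYPE_W, false);	/* TPR */
	msr_set_intercept(bitmap, 0x80b, MSR_TYPE_W, false);	/* EOI */
	msr_set_intercept(bitmap, 0x83f, MSR_TYPE_W, false);	/* SELF-IPI */
}
```

Starting from a bitmap filled with 0xff (everything intercepted), calling
`x2apic_passthrough()` leaves writes to all x2apic MSRs other than TPR, EOI
and SELF-IPI intercepted.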

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/include/asm/vmx.h      |    1 +
 arch/x86/kvm/lapic.c            |    5 +-
 arch/x86/kvm/svm.c              |    6 +
 arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
 5 files changed, 200 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c431b33..572a562 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,8 @@ struct kvm_x86_ops {
 	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
+	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 44c3f7e..0a54df0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -139,6 +139,7 @@
 #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
 #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
 #define SECONDARY_EXEC_RDTSCP			0x00000008
+#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0664c13..ec38906 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		u32 id = kvm_apic_id(apic);
 		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
 		kvm_apic_set_ldr(apic, ldr);
-	}
+		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
+	} else
+		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
+
 	apic->base_address = apic->vcpu->arch.apic_base &
 			     MSR_IA32_APICBASE_BASE;
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..0b82cb1 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
 }
 
+static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
+{
+	return;
+}
+
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
 
 	.set_tss_addr = svm_set_tss_addr,
 	.get_tdp_level = get_npt_level,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 688f43f..b203ce7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -433,6 +433,8 @@ struct vcpu_vmx {
 
 	bool rdtscp_enabled;
 
+	bool virtual_x2apic_enabled;
+
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 };
@@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
+}
+
 static inline bool cpu_has_vmx_apic_register_virt(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
 		SECONDARY_EXEC_APIC_REGISTER_VIRT;
 }
 
+static inline bool cpu_has_vmx_virtual_intr_delivery(void)
+{
+	return false;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
 	return cpu_has_vmx_tpr_shadow() &&
@@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
 		min2 = 0;
 		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
 			SECONDARY_EXEC_WBINVD_EXITING |
 			SECONDARY_EXEC_ENABLE_VPID |
 			SECONDARY_EXEC_ENABLE_EPT |
@@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
 	spin_unlock(&vmx_vpid_lock);
 }
 
-static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
+#define MSR_TYPE_R	1
+#define MSR_TYPE_W	2
+static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
+{
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap())
+		return;
+
+	/*
+	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
+	 * have the write-low and read-high bitmap offsets the wrong way round.
+	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
+	 */
+	if (msr <= 0x1fff) {
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__clear_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__clear_bit(msr, msr_bitmap + 0x800 / f);
+
+	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+		msr &= 0x1fff;
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__clear_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__clear_bit(msr, msr_bitmap + 0xc00 / f);
+
+	}
+}
+
+static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
 {
 	int f = sizeof(unsigned long);
 
@@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
 	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
 	 */
 	if (msr <= 0x1fff) {
-		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
-		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__set_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__set_bit(msr, msr_bitmap + 0x800 / f);
+
 	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
 		msr &= 0x1fff;
-		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
-		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__set_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__set_bit(msr, msr_bitmap + 0xc00 / f);
+
 	}
 }
 
+
 static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
 {
 	if (!longmode_only)
-		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
-	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+						msr, MSR_TYPE_R | MSR_TYPE_W);
+}
+
+static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
+					bool set)
+{
+	if (!longmode_only) {
+		if (set)
+			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
+					msr, MSR_TYPE_R);
+		else
+			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+					msr, MSR_TYPE_R);
+
+	}
+	if (set)
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
+				msr, MSR_TYPE_R);
+	else
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+				msr, MSR_TYPE_R);
+}
+
+static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
+					bool set)
+{
+	if (!longmode_only) {
+		if (set)
+			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
+					msr, MSR_TYPE_W);
+		else
+			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
+					msr, MSR_TYPE_W);
+
+	}
+	if (set)
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
+				msr, MSR_TYPE_W);
+	else
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
+				msr, MSR_TYPE_W);
 }
 
 /*
@@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 	if (!enable_apicv_reg_vid)
 		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
+	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
 	return exec_control;
 }
 
@@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
+{
+	u32 exec_control;
+	int msr;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!cpu_has_vmx_virtualize_x2apic_mode())
+		return;
+
+	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	/* virtualize x2apic mode relies on tpr shadow */
+	if (!(exec_control & CPU_BASED_TPR_SHADOW))
+		return;
+
+	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+	vmx->virtual_x2apic_enabled = true;
+
+	if (!cpu_has_vmx_virtual_intr_delivery())
+		return;
+
+	for (msr = 0x800; msr <= 0x8ff; msr++)
+		vmx_intercept_for_msr_read(msr, false, false);
+
+	/* APIC ID */
+	vmx_intercept_for_msr_read(0x802, false, true);
+	/* TMCCT */
+	vmx_intercept_for_msr_read(0x839, false, true);
+	/* TPR */
+	vmx_intercept_for_msr_write(0x808, false, false);
+	/* EOI */
+	vmx_intercept_for_msr_write(0x80b, false, false);
+	/* SELF-IPI */
+	vmx_intercept_for_msr_write(0x83f, false, false);
+
+}
+
+static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
+{
+	u32 second_exec_control;
+	int msr;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* If virtual x2apic was not enabled before, do nothing */
+	if (!vmx->virtual_x2apic_enabled)
+		return;
+
+	second_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	/* disable virtual x2apic */
+	second_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
+	second_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
+	vmx->virtual_x2apic_enabled = false;
+
+	if (!cpu_has_vmx_virtual_intr_delivery())
+		return;
+
+	for (msr = 0x800; msr <= 0x8ff; msr++)
+		vmx_intercept_for_msr_read(msr, false, true);
+
+	/* TPR */
+	vmx_intercept_for_msr_write(0x808, false, true);
+	/* EOI */
+	vmx_intercept_for_msr_write(0x80b, false, true);
+	/* SELF-IPI */
+	vmx_intercept_for_msr_write(0x83f, false, true);
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -7373,6 +7551,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_nmi_window = enable_nmi_window,
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
+	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
+	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
-- 
1.7.1



* [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-10  7:26 [PATCH v9 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
  2013-01-10  7:26 ` [PATCH v9 1/3] x86, apicv: add APICv register " Yang Zhang
  2013-01-10  7:26 ` [PATCH v9 2/3] x86, apicv: add virtual x2apic support Yang Zhang
@ 2013-01-10  7:26 ` Yang Zhang
  2013-01-10  8:23   ` Gleb Natapov
  2013-01-10 21:36   ` Marcelo Tosatti
  2 siblings, 2 replies; 26+ messages in thread
From: Yang Zhang @ 2013-01-10  7:26 UTC (permalink / raw)
  To: kvm; +Cc: gleb, haitao.shan, mtosatti, Yang Zhang, Kevin Tian

From: Yang Zhang <yang.z.zhang@Intel.com>

Virtual interrupt delivery removes the need for KVM to inject vAPIC
interrupts manually; injection is handled entirely by the hardware. This
requires some special handling in the existing interrupt injection path:

- For a pending interrupt, instead of injecting it directly, we may need
  to update architecture-specific indicators before resuming the guest.

- A pending interrupt that is masked by ISR should also be considered
  in the above update, since hardware will decide when to inject it at
  the right time. The current has_interrupt and get_interrupt only
  return a valid vector from the injection point of view.
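
A minimal sketch of the "architecture specific indicators" mentioned above,
written as plain C rather than VMCS accesses. It assumes the 16-bit guest
interrupt status keeps RVI (requesting virtual interrupt) in its low byte and
SVI (servicing virtual interrupt) in its high byte, and that the EOI exit
bitmap is 256 bits held in four 64-bit words, one bit per vector. All helper
names are hypothetical; the real update path is in the vmx.c part of this
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Replace RVI (low byte); a negative max_irr means nothing is pending. */
static uint16_t intr_status_set_rvi(uint16_t status, int max_irr)
{
	uint8_t rvi = (max_irr < 0) ? 0 : (uint8_t)max_irr;

	return (uint16_t)((status & 0xff00u) | rvi);
}

/* Replace SVI (high byte); a negative isr means nothing is in service. */
static uint16_t intr_status_set_svi(uint16_t status, int isr)
{
	uint8_t svi = (isr < 0) ? 0 : (uint8_t)isr;

	return (uint16_t)((status & 0x00ffu) | ((uint16_t)svi << 8));
}

/* 256-bit EOI exit bitmap: word[n] covers vectors 64*n .. 64*n + 63. */
struct eoi_exit_bitmap {
	uint64_t word[4];
};

static void eoi_exitmap_set(struct eoi_exit_bitmap *map, unsigned vector)
{
	assert(vector < 256);
	map->word[vector / 64] |= 1ULL << (vector % 64);
}

static bool eoi_exitmap_test(const struct eoi_exit_bitmap *map, unsigned vector)
{
	assert(vector < 256);
	return (map->word[vector / 64] >> (vector % 64)) & 1;
}
```

In this model, an interrupt masked by ISR still raises RVI; deciding when to
deliver it is left to the hardware, which is why the update considers every
pending vector and not only the injectable one.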

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
---
 arch/x86/include/asm/kvm_host.h |    5 +
 arch/x86/include/asm/vmx.h      |   11 +++
 arch/x86/kvm/irq.c              |   56 +++++++++++-
 arch/x86/kvm/lapic.c            |   72 +++++++++------
 arch/x86/kvm/lapic.h            |   23 +++++
 arch/x86/kvm/svm.c              |   18 ++++
 arch/x86/kvm/vmx.c              |  191 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   14 +++-
 include/linux/kvm_host.h        |    3 +
 virt/kvm/ioapic.c               |   18 ++++
 virt/kvm/ioapic.h               |    4 +
 virt/kvm/irq_comm.c             |   22 +++++
 virt/kvm/kvm_main.c             |    5 +
 13 files changed, 399 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 572a562..f471856 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,10 @@ struct kvm_x86_ops {
 	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
+	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
+	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu);
+	void (*set_svi)(int isr);
 	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
 	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
@@ -993,6 +997,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0a54df0..694586c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -62,6 +62,7 @@
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
+#define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
 #define EXIT_REASON_WBINVD              54
@@ -144,6 +145,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 
@@ -181,6 +183,7 @@ enum vmcs_field {
 	GUEST_GS_SELECTOR               = 0x0000080a,
 	GUEST_LDTR_SELECTOR             = 0x0000080c,
 	GUEST_TR_SELECTOR               = 0x0000080e,
+	GUEST_INTR_STATUS               = 0x00000810,
 	HOST_ES_SELECTOR                = 0x00000c00,
 	HOST_CS_SELECTOR                = 0x00000c02,
 	HOST_SS_SELECTOR                = 0x00000c04,
@@ -208,6 +211,14 @@ enum vmcs_field {
 	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
 	EPT_POINTER                     = 0x0000201a,
 	EPT_POINTER_HIGH                = 0x0000201b,
+	EOI_EXIT_BITMAP0                = 0x0000201c,
+	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
+	EOI_EXIT_BITMAP1                = 0x0000201e,
+	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
+	EOI_EXIT_BITMAP2                = 0x00002020,
+	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
+	EOI_EXIT_BITMAP3                = 0x00002022,
+	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index b111aee..e113440 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -38,6 +38,38 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
 EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
 
 /*
+ * check if there is pending interrupt from
+ * non-APIC source without intack.
+ */
+static int kvm_cpu_has_extint(struct kvm_vcpu *v)
+{
+	if (kvm_apic_accept_pic_intr(v))
+		return pic_irqchip(v->kvm)->output;	/* PIC */
+	else
+		return 0;
+}
+
+/*
+ * check if there is injectable interrupt:
+ * when virtual interrupt delivery enabled,
+ * interrupt from apic will be handled by hardware,
+ * we don't need to check it here.
+ */
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
+{
+	if (!irqchip_in_kernel(v->kvm))
+		return v->arch.interrupt.pending;
+
+	if (kvm_cpu_has_extint(v))
+		return 1;
+
+	if (kvm_apic_vid_enabled(v))
+		return 0;
+
+	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
+}
+
+/*
  * check if there is pending interrupt without
  * intack.
  */
@@ -46,27 +78,41 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (!irqchip_in_kernel(v->kvm))
 		return v->arch.interrupt.pending;
 
-	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
-		return pic_irqchip(v->kvm)->output;	/* PIC */
+	if (kvm_cpu_has_extint(v))
+		return 1;
 
 	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
 
 /*
+ * Read pending interrupt(from non-APIC source)
+ * vector and intack.
+ */
+static int kvm_cpu_get_extint(struct kvm_vcpu *v)
+{
+	if (kvm_cpu_has_extint(v))
+		return kvm_pic_read_irq(v->kvm); /* PIC */
+	return -1;
+}
+
+/*
  * Read pending interrupt vector and intack.
  */
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 {
+	int vector;
+
 	if (!irqchip_in_kernel(v->kvm))
 		return v->arch.interrupt.nr;
 
-	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
-		return kvm_pic_read_irq(v->kvm);	/* PIC */
+	vector = kvm_cpu_get_extint(v);
+
+	if (kvm_apic_vid_enabled(v) || vector != -1)
+		return vector;			/* PIC */
 
 	return kvm_get_apic_interrupt(v);	/* APIC */
 }
-EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ec38906..d219f41 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -150,23 +150,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
-static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
-{
-	u16 cid;
-	ldr >>= 32 - map->ldr_bits;
-	cid = (ldr >> map->cid_shift) & map->cid_mask;
-
-	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
-
-	return cid;
-}
-
-static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
-{
-	ldr >>= (32 - map->ldr_bits);
-	return ldr & map->lid_mask;
-}
-
 static void recalculate_apic_map(struct kvm *kvm)
 {
 	struct kvm_apic_map *new, *old = NULL;
@@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
 {
 	apic_set_reg(apic, APIC_ID, id << 24);
 	recalculate_apic_map(apic->vcpu->kvm);
+	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 }
 
 static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
 {
 	apic_set_reg(apic, APIC_LDR, id);
 	recalculate_apic_map(apic->vcpu->kvm);
+	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 }
 
 static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
@@ -345,6 +330,8 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
 {
 	int result;
 
+	/* Note that irr_pending is just a hint. It will be always
+	 * true with virtual interrupt delivery enabled. */
 	if (!apic->irr_pending)
 		return -1;
 
@@ -461,6 +448,8 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
 static inline int apic_find_highest_isr(struct kvm_lapic *apic)
 {
 	int result;
+
+	/* Note that isr_count is always 1 with vid enabled*/
 	if (!apic->isr_count)
 		return -1;
 	if (likely(apic->highest_isr_cache != -1))
@@ -740,6 +729,19 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
 	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
 }
 
+static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
+{
+	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
+	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
+		int trigger_mode;
+		if (apic_test_vector(vector, apic->regs + APIC_TMR))
+			trigger_mode = IOAPIC_LEVEL_TRIG;
+		else
+			trigger_mode = IOAPIC_EDGE_TRIG;
+		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+	}
+}
+
 static int apic_set_eoi(struct kvm_lapic *apic)
 {
 	int vector = apic_find_highest_isr(apic);
@@ -756,19 +758,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
 	apic_clear_isr(vector, apic);
 	apic_update_ppr(apic);
 
-	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
-	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
-		int trigger_mode;
-		if (apic_test_vector(vector, apic->regs + APIC_TMR))
-			trigger_mode = IOAPIC_LEVEL_TRIG;
-		else
-			trigger_mode = IOAPIC_EDGE_TRIG;
-		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
-	}
+	kvm_ioapic_send_eoi(apic, vector);
 	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 	return vector;
 }
 
+/*
+ * This interface assumes a trap-like exit that has already completed the
+ * desired side effects, including the vISR and vPPR updates.
+ */
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	trace_kvm_eoi(apic, vector);
+
+	kvm_ioapic_send_eoi(apic, vector);
+	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
+
 static void apic_send_ipi(struct kvm_lapic *apic)
 {
 	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
@@ -1071,6 +1080,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 		if (!apic_x2apic_mode(apic)) {
 			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
 			recalculate_apic_map(apic->vcpu->kvm);
+			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 		} else
 			ret = 1;
 		break;
@@ -1318,6 +1328,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		else
 			static_key_slow_inc(&apic_hw_disabled.key);
 		recalculate_apic_map(vcpu->kvm);
+		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
 	}
 
 	if (!kvm_vcpu_is_bsp(apic->vcpu))
@@ -1377,8 +1388,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 		apic_set_reg(apic, APIC_ISR + 0x10 * i, 0);
 		apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
 	}
-	apic->irr_pending = false;
-	apic->isr_count = 0;
+	apic->irr_pending = kvm_apic_vid_enabled(vcpu);
+	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
+				1 : 0;
 	apic->highest_isr_cache = -1;
 	update_divide_count(apic);
 	atomic_set(&apic->lapic_timer.pending, 0);
@@ -1593,8 +1605,10 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
 	update_divide_count(apic);
 	start_apic_timer(apic);
 	apic->irr_pending = true;
-	apic->isr_count = count_vectors(apic->regs + APIC_ISR);
+	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
+				1 : count_vectors(apic->regs + APIC_ISR);
 	apic->highest_isr_cache = -1;
+	kvm_x86_ops->set_svi(apic_find_highest_isr(apic));
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 9a8ee22..fed6538 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -65,6 +65,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
 
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
@@ -126,4 +127,26 @@ static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
 	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
 }
 
+static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
+{
+	return kvm_x86_ops->has_virtual_interrupt_delivery(vcpu);
+}
+
+static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
+{
+	u16 cid;
+	ldr >>= 32 - map->ldr_bits;
+	cid = (ldr >> map->cid_shift) & map->cid_mask;
+
+	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
+
+	return cid;
+}
+
+static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
+{
+	ldr >>= (32 - map->ldr_bits);
+	return ldr & map->lid_mask;
+}
+
 #endif
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0b82cb1..0ce6543 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3576,6 +3576,21 @@ static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
 	return;
 }
 
+static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+static void svm_update_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	return;
+}
+
+static void svm_set_svi(int isr)
+{
+	return;
+}
+
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4296,6 +4311,9 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
+	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
+	.update_eoi_exitmap = svm_update_eoi_exitmap,
+	.set_svi = svm_set_svi,
 
 	.set_tss_addr = svm_set_tss_addr,
 	.get_tdp_level = get_npt_level,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b203ce7..990409a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -434,6 +434,7 @@ struct vcpu_vmx {
 	bool rdtscp_enabled;
 
 	bool virtual_x2apic_enabled;
+	unsigned long eoi_exit_bitmap[4];
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
@@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
 
 static inline bool cpu_has_vmx_virtual_intr_delivery(void)
 {
-	return false;
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 }
 
 static inline bool cpu_has_vmx_flexpriority(void)
@@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
 			SECONDARY_EXEC_RDTSCP |
 			SECONDARY_EXEC_ENABLE_INVPCID |
-			SECONDARY_EXEC_APIC_REGISTER_VIRT;
+			SECONDARY_EXEC_APIC_REGISTER_VIRT |
+			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 
 	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
 		_cpu_based_2nd_exec_control &= ~(
-				SECONDARY_EXEC_APIC_REGISTER_VIRT);
+				SECONDARY_EXEC_APIC_REGISTER_VIRT |
+				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 
 	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
 		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
@@ -2778,9 +2782,15 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_ple())
 		ple_gap = 0;
 
-	if (!cpu_has_vmx_apic_register_virt())
+	if (!cpu_has_vmx_apic_register_virt() ||
+				!cpu_has_vmx_virtual_intr_delivery())
 		enable_apicv_reg_vid = 0;
 
+	if (enable_apicv_reg_vid)
+		kvm_x86_ops->update_cr8_intercept = NULL;
+	else
+		kvm_x86_ops->update_apic_irq = NULL;
+
 	if (nested)
 		nested_vmx_setup_ctls_msrs();
 
@@ -3961,7 +3971,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!ple_gap)
 		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 	if (!enable_apicv_reg_vid)
-		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
+		exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT |
+				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
 	return exec_control;
 }
@@ -4007,6 +4018,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 				vmx_secondary_exec_control(vmx));
 	}
 
+	if (enable_apicv_reg_vid) {
+		vmcs_write64(EOI_EXIT_BITMAP0, 0);
+		vmcs_write64(EOI_EXIT_BITMAP1, 0);
+		vmcs_write64(EOI_EXIT_BITMAP2, 0);
+		vmcs_write64(EOI_EXIT_BITMAP3, 0);
+
+		vmcs_write16(GUEST_INTR_STATUS, 0);
+	}
+
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
@@ -4924,6 +4944,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }
 
+static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int vector = exit_qualification & 0xff;
+
+	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
+	kvm_apic_set_eoi_accelerated(vcpu, vector);
+	return 1;
+}
+
 static int handle_apic_write(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -5869,6 +5899,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
 	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
 	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
+	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
 	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
 	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
 	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
@@ -6238,7 +6269,7 @@ static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
 	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	vmx->virtual_x2apic_enabled = true;
 
-	if (!cpu_has_vmx_virtual_intr_delivery())
+	if (!enable_apicv_reg_vid)
 		return;
 
 	for (msr = 0x800; msr <= 0x8ff; msr++)
@@ -6274,7 +6305,7 @@ static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
 	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
 	vmx->virtual_x2apic_enabled = false;
 
-	if (!cpu_has_vmx_virtual_intr_delivery())
+	if (!enable_apicv_reg_vid)
 		return;
 
 	for (msr = 0x800; msr <= 0x8ff; msr++)
@@ -6288,6 +6319,148 @@ static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
 	vmx_intercept_for_msr_write(0x83f, false, true);
 }
 
+static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
+{
+	return enable_apicv_reg_vid;
+}
+
+static void vmx_set_svi(int isr)
+{
+	u16 status;
+	u8 old;
+
+	if (!enable_apicv_reg_vid)
+		return;
+
+	if (isr == -1)
+		isr = 0;
+
+	status = vmcs_read16(GUEST_INTR_STATUS);
+	old = status >> 8;
+	if (isr != old) {
+		status &= 0xff;
+		status |= isr << 8;
+		vmcs_write16(GUEST_INTR_STATUS, status);
+	}
+}
+
+static void vmx_set_rvi(int vector)
+{
+	u16 status;
+	u8 old;
+
+	status = vmcs_read16(GUEST_INTR_STATUS);
+	old = (u8)status & 0xff;
+	if ((u8)vector != old) {
+		status &= ~0xff;
+		status |= (u8)vector;
+		vmcs_write16(GUEST_INTR_STATUS, status);
+	}
+}
+
+static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
+{
+	if (max_irr == -1)
+		return;
+
+	vmx_set_rvi(max_irr);
+}
+
+static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
+				u32 vector)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (WARN_ONCE((vector > 255),
+		"KVM VMX: vector (%d) out of range\n", vector))
+		return;
+
+	__set_bit(vector, vmx->eoi_exit_bitmap);
+}
+
+void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
+{
+	struct kvm_lapic **dst;
+	struct kvm_apic_map *map;
+	unsigned long bitmap = 1;
+	int i;
+
+	rcu_read_lock();
+	map = rcu_dereference(vcpu->kvm->arch.apic_map);
+
+	if (unlikely(!map)) {
+		set_eoi_exitmap_one(vcpu, irq->vector);
+		goto out;
+	}
+
+	if (irq->dest_mode == 0) { /* physical mode */
+		if (irq->delivery_mode == APIC_DM_LOWEST ||
+				irq->dest_id == 0xff) {
+			set_eoi_exitmap_one(vcpu, irq->vector);
+			goto out;
+		}
+		dst = &map->phys_map[irq->dest_id & 0xff];
+	} else {
+		u32 mda = irq->dest_id << (32 - map->ldr_bits);
+
+		dst = map->logical_map[apic_cluster_id(map, mda)];
+
+		bitmap = apic_logical_id(map, mda);
+	}
+
+	for_each_set_bit(i, &bitmap, 16) {
+		if (!dst[i])
+			continue;
+		if (dst[i]->vcpu == vcpu) {
+			set_eoi_exitmap_one(vcpu, irq->vector);
+			break;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+}
+
+static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
+	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
+	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
+	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
+}
+
+static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu)
+{
+	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
+	union kvm_ioapic_redirect_entry *e;
+	struct kvm_lapic_irq irqe;
+	int index;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	/* clear eoi exit bitmap */
+	memset(vmx->eoi_exit_bitmap, 0, 32);
+
+	/* traverse the ioapic entries to set the eoi exit bitmap */
+	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
+		e = &ioapic->redirtbl[index];
+		if (!e->fields.mask &&
+			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
+			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
+				 index))) {
+			irqe.dest_id = e->fields.dest_id;
+			irqe.vector = e->fields.vector;
+			irqe.dest_mode = e->fields.dest_mode;
+			irqe.delivery_mode = e->fields.delivery_mode << 8;
+			vmx_check_ioapic_entry(vcpu, &irqe);
+
+		}
+	}
+
+	vmx_load_eoi_exitmap(vcpu);
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -7553,6 +7726,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.update_cr8_intercept = update_cr8_intercept,
 	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
 	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
+	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
+	.update_apic_irq = vmx_update_apic_irq,
+	.update_eoi_exitmap = vmx_update_eoi_exitmap,
+	.set_svi = vmx_set_svi,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1c9c834..e6d8227 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
 			vcpu->arch.nmi_injected = true;
 			kvm_x86_ops->set_nmi(vcpu);
 		}
-	} else if (kvm_cpu_has_interrupt(vcpu)) {
+	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
 		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
 			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
 					    false);
@@ -5648,6 +5648,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_handle_pmu_event(vcpu);
 		if (kvm_check_request(KVM_REQ_PMI, vcpu))
 			kvm_deliver_pmi(vcpu);
+		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu)) {
+			mutex_lock(&vcpu->kvm->arch.vioapic->eoimap_lock);
+			kvm_x86_ops->update_eoi_exitmap(vcpu);
+			mutex_unlock(&vcpu->kvm->arch.vioapic->eoimap_lock);
+		}
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
@@ -5656,10 +5661,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		/* enable NMI/IRQ window open exits if needed */
 		if (vcpu->arch.nmi_pending)
 			kvm_x86_ops->enable_nmi_window(vcpu);
-		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
+		else if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
 			kvm_x86_ops->enable_irq_window(vcpu);
 
 		if (kvm_lapic_enabled(vcpu)) {
+			/* update architecture specific hints for APIC
+			 * virtual interrupt delivery */
+			if (kvm_x86_ops->update_apic_irq)
+				kvm_x86_ops->update_apic_irq(vcpu,
+					      kvm_lapic_find_highest_irr(vcpu));
 			update_cr8_intercept(vcpu);
 			kvm_lapic_sync_to_vapic(vcpu);
 		}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cbe0d68..bc0e261 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -122,6 +122,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_WATCHDOG          18
 #define KVM_REQ_MASTERCLOCK_UPDATE 19
 #define KVM_REQ_MCLOCK_INPROGRESS 20
+#define KVM_REQ_EOIBITMAP         21
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
@@ -537,6 +538,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 void kvm_make_mclock_inprogress_request(struct kvm *kvm);
+void kvm_make_update_eoibitmap_request(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
 			unsigned int ioctl, unsigned long arg);
@@ -690,6 +692,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level);
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
 		int irq_source_id, int level);
+bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index f3abbef..e5ccb8f 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -115,6 +115,20 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
 	smp_wmb();
 }
 
+void ioapic_update_eoi_exitmap(struct kvm *kvm)
+{
+#ifdef CONFIG_X86
+	struct kvm_vcpu *vcpu = kvm->vcpus[0];
+	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+
+	/* If vid is enabled on one vcpu, it is enabled on all
+	 * other vcpus as well. */
+	if (!kvm_apic_vid_enabled(vcpu) || !ioapic)
+		return;
+	kvm_make_update_eoibitmap_request(kvm);
+#endif
+}
+
 static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
 {
 	unsigned index;
@@ -156,6 +170,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
 		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
 		    && ioapic->irr & (1 << index))
 			ioapic_service(ioapic, index);
+		ioapic_update_eoi_exitmap(ioapic->kvm);
 		break;
 	}
 }
@@ -415,6 +430,9 @@ int kvm_ioapic_init(struct kvm *kvm)
 	ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
 				      IOAPIC_MEM_LENGTH, &ioapic->dev);
 	mutex_unlock(&kvm->slots_lock);
+#ifdef CONFIG_X86
+	mutex_init(&ioapic->eoimap_lock);
+#endif
 	if (ret < 0) {
 		kvm->arch.vioapic = NULL;
 		kfree(ioapic);
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index a30abfe..34544ce 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -47,6 +47,9 @@ struct kvm_ioapic {
 	void (*ack_notifier)(void *opaque, int irq);
 	spinlock_t lock;
 	DECLARE_BITMAP(handled_vectors, 256);
+#ifdef CONFIG_X86
+	struct mutex eoimap_lock;
+#endif
 };
 
 #ifdef DEBUG
@@ -82,5 +85,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq);
 int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
 int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+void ioapic_update_eoi_exitmap(struct kvm *kvm);
 
 #endif
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 656fa45..64aa1ab 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -22,6 +22,7 @@
 
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/export.h>
 #include <trace/events/kvm.h>
 
 #include <asm/msidef.h>
@@ -237,6 +238,25 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
+{
+	struct kvm_irq_ack_notifier *kian;
+	struct hlist_node *n;
+	bool found = false;
+	int gsi;
+
+	rcu_read_lock();
+	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	if (gsi != -1)
+		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
+					 link)
+			if (kian->gsi == gsi)
+				found = true;
+	rcu_read_unlock();
+	return found;
+}
+EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
@@ -261,6 +281,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
 	mutex_lock(&kvm->irq_lock);
 	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
 	mutex_unlock(&kvm->irq_lock);
+	ioapic_update_eoi_exitmap(kvm);
 }
 
 void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
@@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 	hlist_del_init_rcu(&kian->link);
 	mutex_unlock(&kvm->irq_lock);
 	synchronize_rcu();
+	ioapic_update_eoi_exitmap(kvm);
 }
 
 int kvm_request_irq_source_id(struct kvm *kvm)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e45c20c..cc465c6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -217,6 +217,11 @@ void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }
 
+void kvm_make_update_eoibitmap_request(struct kvm *kvm)
+{
+	make_all_cpus_request(kvm, KVM_REQ_EOIBITMAP);
+}
+
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 {
 	struct page *page;
-- 
1.7.1
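
For readers following vmx_set_svi() and vmx_set_rvi() above: the 16-bit
guest interrupt status field keeps RVI in the low byte and SVI in the high
byte. Below is a minimal stand-alone sketch of that packing. The helper
names (intr_status_set_rvi, intr_status_set_svi) are hypothetical and not
part of the patch, and a plain uint16_t stands in for the VMCS field.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the patch: a plain uint16_t stands in
 * for the GUEST_INTR_STATUS VMCS field.
 */

/* RVI (requesting virtual interrupt) occupies bits 7:0. */
static uint16_t intr_status_set_rvi(uint16_t status, uint8_t vector)
{
	return (uint16_t)((status & 0xff00u) | vector);
}

/* SVI (servicing virtual interrupt) occupies bits 15:8. */
static uint16_t intr_status_set_svi(uint16_t status, uint8_t vector)
{
	return (uint16_t)((status & 0x00ffu) | ((uint16_t)vector << 8));
}
```

Each helper rewrites only its own byte, which is why the patch masks the
status word before OR-ing in the new vector.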



* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  7:26 ` [PATCH v9 2/3] x86, apicv: add virtual x2apic support Yang Zhang
@ 2013-01-10  7:55   ` Gleb Natapov
  2013-01-10  8:32     ` Zhang, Yang Z
  2013-01-10  8:25   ` Gleb Natapov
  1 sibling, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10  7:55 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, haitao.shan, mtosatti, Kevin Tian

On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Basically, to benefit from apicv, we need to enable virtualized x2apic mode.
> Currently, we only enable it when the guest is actually using x2apic.
> 
> Also, clear the MSR bitmap for the corresponding x2apic MSRs when the guest enables x2apic:
>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
>     		   except APIC ID and TMCCT.
>     APIC ID and TMCCT: need software's assistance to get right value.
>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/include/asm/vmx.h      |    1 +
>  arch/x86/kvm/lapic.c            |    5 +-
>  arch/x86/kvm/svm.c              |    6 +
>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
>  5 files changed, 200 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c431b33..572a562 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
Make this one callback with an enable/disable parameter. And do not forget SVM.
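
A minimal stand-alone sketch of that shape, under stated assumptions: the
callback name set_virtual_x2apic_mode and all mock_* types and functions are
hypothetical, not the real kvm_x86_ops. The SVM side supplies a no-op so the
common code can call the hook unconditionally.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the vcpu and the per-vendor ops table. */
struct mock_vcpu {
	bool virtual_x2apic;	/* stand-in for the VMCS control bit */
};

struct mock_x86_ops {
	/* one hook instead of an enable/disable pair */
	void (*set_virtual_x2apic_mode)(struct mock_vcpu *vcpu, bool set);
};

static void mock_vmx_set_virtual_x2apic_mode(struct mock_vcpu *vcpu, bool set)
{
	vcpu->virtual_x2apic = set;
}

/* SVM has nothing to do here, but still provides the hook. */
static void mock_svm_set_virtual_x2apic_mode(struct mock_vcpu *vcpu, bool set)
{
	(void)vcpu;
	(void)set;
}

static const struct mock_x86_ops mock_vmx_ops = {
	.set_virtual_x2apic_mode = mock_vmx_set_virtual_x2apic_mode,
};

static const struct mock_x86_ops mock_svm_ops = {
	.set_virtual_x2apic_mode = mock_svm_set_virtual_x2apic_mode,
};

/* Common code: one call site covers both directions. */
static void mock_lapic_set_base(const struct mock_x86_ops *ops,
				struct mock_vcpu *vcpu, bool x2apic_enabled)
{
	ops->set_virtual_x2apic_mode(vcpu, x2apic_enabled);
}
```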


>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>  	int (*get_tdp_level)(void);
>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 44c3f7e..0a54df0 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -139,6 +139,7 @@
>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
>  #define SECONDARY_EXEC_RDTSCP			0x00000008
> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 0664c13..ec38906 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  		u32 id = kvm_apic_id(apic);
>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
>  		kvm_apic_set_ldr(apic, ldr);
> -	}
> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
> +	} else
> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
> +
You just broke SVM: svm_x86_ops does not set disable_virtual_x2apic_mode,
so this calls a NULL pointer.

>  	apic->base_address = apic->vcpu->arch.apic_base &
>  			     MSR_IA32_APICBASE_BASE;
>  
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d29d3cd..0b82cb1 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>  }
>  
> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	return;
> +}
> +
>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.enable_nmi_window = enable_nmi_window,
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
>  
>  	.set_tss_addr = svm_set_tss_addr,
>  	.get_tdp_level = get_npt_level,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 688f43f..b203ce7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -433,6 +433,8 @@ struct vcpu_vmx {
>  
>  	bool rdtscp_enabled;
>  
> +	bool virtual_x2apic_enabled;
> +
>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
>  };
> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>  }
>  
> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
> +{
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +}
> +
>  static inline bool cpu_has_vmx_apic_register_virt(void)
>  {
>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>  }
>  
> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> +{
> +	return false;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
>  		min2 = 0;
>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
>  			SECONDARY_EXEC_WBINVD_EXITING |
>  			SECONDARY_EXEC_ENABLE_VPID |
>  			SECONDARY_EXEC_ENABLE_EPT |
> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
>  	spin_unlock(&vmx_vpid_lock);
>  }
>  
> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> +#define MSR_TYPE_R	1
> +#define MSR_TYPE_W	2
> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> +						u32 msr, int type)
> +{
> +	int f = sizeof(unsigned long);
> +
> +	if (!cpu_has_vmx_msr_bitmap())
> +		return;
> +
> +	/*
> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
> +	 * have the write-low and read-high bitmap offsets the wrong way round.
> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
> +	 */
> +	if (msr <= 0x1fff) {
> +		if (type & MSR_TYPE_R)
> +			/* read-low */
> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-low */
> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
> +
> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> +		msr &= 0x1fff;
> +		if (type & MSR_TYPE_R)
> +			/* read-high */
> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-high */
> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
> +
> +	}
> +}
> +
> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
> +						u32 msr, int type)
>  {
>  	int f = sizeof(unsigned long);
>  
> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>  	 */
>  	if (msr <= 0x1fff) {
> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
> +		if (type & MSR_TYPE_R)
> +			/* read-low */
> +			__set_bit(msr, msr_bitmap + 0x000 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-low */
> +			__set_bit(msr, msr_bitmap + 0x800 / f);
> +
>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>  		msr &= 0x1fff;
> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
> +		if (type & MSR_TYPE_R)
> +			/* read-high */
> +			__set_bit(msr, msr_bitmap + 0x400 / f);
> +
> +		if (type & MSR_TYPE_W)
> +			/* write-high */
> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
> +
>  	}
>  }
>  
> +
>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>  {
>  	if (!longmode_only)
> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> +}
> +
> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
> +					bool set)
> +{
> +	if (!longmode_only) {
> +		if (set)
> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +					msr, MSR_TYPE_R);
> +		else
> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +					msr, MSR_TYPE_R);
> +
> +	}
> +	if (set)
> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +				msr, MSR_TYPE_R);
> +	else
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +				msr, MSR_TYPE_R);
> +}
> +
> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
> +					bool set)
> +{
> +	if (!longmode_only) {
> +		if (set)
> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +					msr, MSR_TYPE_W);
> +		else
> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> +					msr, MSR_TYPE_W);
> +
> +	}
> +	if (set)
> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +				msr, MSR_TYPE_W);
> +	else
> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> +				msr, MSR_TYPE_W);
>  }
>  
>  /*
> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>  	if (!enable_apicv_reg_vid)
>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>  	return exec_control;
>  }
>  
> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>  	vmcs_write32(TPR_THRESHOLD, irr);
>  }
>  
> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	u32 exec_control;
> +	int msr;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
> +		return;
> +
> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	/* virtualize x2apic mode relies on tpr shadow */
> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
> +		return;
> +
> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +	vmx->virtual_x2apic_enabled = true;
Why track it?

> +
> +	if (!cpu_has_vmx_virtual_intr_delivery())
> +		return;
> +
You need to test whether vid is enabled, not whether it can be enabled.
And you need to test it at the beginning of the function. If vid is
disabled, we do not want to set SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE.
Also, force-disable vid if !cpu_has_vmx_virtualize_x2apic_mode().
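
A rough stand-alone sketch of that ordering, assuming hypothetical mock_*
names and a plain u32 in place of the VMCS field (the two bit values match
the SECONDARY_EXEC_* defines quoted above):

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit positions as the SECONDARY_EXEC_* defines; names are mock-ups. */
#define MOCK_VIRT_APIC_ACCESSES	0x00000001u
#define MOCK_VIRT_X2APIC_MODE	0x00000010u

struct mock_caps {
	bool has_virt_x2apic;	/* CPU capability */
	bool vid_enabled;	/* module-wide setting */
};

/* Settle the setting once: vid without virtual x2apic is forced off. */
static void mock_hardware_setup(struct mock_caps *caps)
{
	if (!caps->has_virt_x2apic)
		caps->vid_enabled = false;
}

/* Returns the new secondary exec control value. */
static uint32_t mock_enable_virtual_x2apic(const struct mock_caps *caps,
					   uint32_t exec_control)
{
	/* Test the enabled state first; touch nothing if vid is off. */
	if (!caps->vid_enabled)
		return exec_control;

	exec_control &= ~MOCK_VIRT_APIC_ACCESSES;
	exec_control |= MOCK_VIRT_X2APIC_MODE;
	return exec_control;
}
```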

> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> +		vmx_intercept_for_msr_read(msr, false, false);
> +
> +	/* APIC ID */
> +	vmx_intercept_for_msr_read(0x802, false, true);
> +	/* TMCCT */
> +	vmx_intercept_for_msr_read(0x839, false, true);
> +	/* TPR */
> +	vmx_intercept_for_msr_write(0x808, false, false);
> +	/* EOI */
> +	vmx_intercept_for_msr_write(0x80b, false, false);
> +	/* SELF-IPI */
> +	vmx_intercept_for_msr_write(0x83f, false, false);
> +
> +}
> +
> +static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	u32 second_exec_control;
> +	int msr;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* If virtual x2apic was not enabled before, do nothing */
> +	if (!vmx->virtual_x2apic_enabled)
> +		return;
> +
> +	second_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	/* disable virtual x2apic */
> +	second_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +	second_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
> +	vmx->virtual_x2apic_enabled = false;
> +
> +	if (!cpu_has_vmx_virtual_intr_delivery())
> +		return;
> +
> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> +		vmx_intercept_for_msr_read(msr, false, true);
> +
> +	/* TPR */
> +	vmx_intercept_for_msr_write(0x808, false, true);
> +	/* EOI */
> +	vmx_intercept_for_msr_write(0x80b, false, true);
> +	/* SELF-IPI */
> +	vmx_intercept_for_msr_write(0x83f, false, true);
> +}
> +
>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>  {
>  	u32 exit_intr_info;
> @@ -7373,6 +7551,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.enable_nmi_window = enable_nmi_window,
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
> +	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
> +	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> -- 
> 1.7.1

--
			Gleb.


* Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-10  7:26 ` [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
@ 2013-01-10  8:23   ` Gleb Natapov
  2013-01-10 12:04     ` Zhang, Yang Z
  2013-01-10 21:36   ` Marcelo Tosatti
  1 sibling, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10  8:23 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, haitao.shan, mtosatti, Kevin Tian

On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery removes the need for KVM to inject vAPIC
> interrupts manually; injection is fully taken care of by the hardware.
> This needs some special awareness in the existing interrupt injection path:
> 
> - for a pending interrupt, instead of direct injection, we may need to
>   update architecture-specific indicators before resuming the guest.
> 
> - A pending interrupt that is masked by ISR should also be
>   considered in the above update action, since hardware will decide
>   when to inject it at the right time. Current has_interrupt and
>   get_interrupt only return a valid vector from injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    5 +
>  arch/x86/include/asm/vmx.h      |   11 +++
>  arch/x86/kvm/irq.c              |   56 +++++++++++-
>  arch/x86/kvm/lapic.c            |   72 +++++++++------
>  arch/x86/kvm/lapic.h            |   23 +++++
>  arch/x86/kvm/svm.c              |   18 ++++
>  arch/x86/kvm/vmx.c              |  191 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/x86.c              |   14 +++-
>  include/linux/kvm_host.h        |    3 +
>  virt/kvm/ioapic.c               |   18 ++++
>  virt/kvm/ioapic.h               |    4 +
>  virt/kvm/irq_comm.c             |   22 +++++
>  virt/kvm/kvm_main.c             |    5 +
>  13 files changed, 399 insertions(+), 43 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 572a562..f471856 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -697,6 +697,10 @@ struct kvm_x86_ops {
>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
> +	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu);
> +	void (*set_svi)(int isr);
>  	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>  	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> @@ -993,6 +997,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>  int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0a54df0..694586c 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -62,6 +62,7 @@
>  #define EXIT_REASON_MCE_DURING_VMENTRY  41
>  #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
>  #define EXIT_REASON_APIC_ACCESS         44
> +#define EXIT_REASON_EOI_INDUCED         45
>  #define EXIT_REASON_EPT_VIOLATION       48
>  #define EXIT_REASON_EPT_MISCONFIG       49
>  #define EXIT_REASON_WBINVD              54
> @@ -144,6 +145,7 @@
>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>  #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
> +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
>  #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>  
> @@ -181,6 +183,7 @@ enum vmcs_field {
>  	GUEST_GS_SELECTOR               = 0x0000080a,
>  	GUEST_LDTR_SELECTOR             = 0x0000080c,
>  	GUEST_TR_SELECTOR               = 0x0000080e,
> +	GUEST_INTR_STATUS               = 0x00000810,
>  	HOST_ES_SELECTOR                = 0x00000c00,
>  	HOST_CS_SELECTOR                = 0x00000c02,
>  	HOST_SS_SELECTOR                = 0x00000c04,
> @@ -208,6 +211,14 @@ enum vmcs_field {
>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
>  	EPT_POINTER                     = 0x0000201a,
>  	EPT_POINTER_HIGH                = 0x0000201b,
> +	EOI_EXIT_BITMAP0                = 0x0000201c,
> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
> +	EOI_EXIT_BITMAP1                = 0x0000201e,
> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
> +	EOI_EXIT_BITMAP2                = 0x00002020,
> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
> +	EOI_EXIT_BITMAP3                = 0x00002022,
> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>  	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index b111aee..e113440 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -38,6 +38,38 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
>  EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
>  
>  /*
> + * check if there is pending interrupt from
> + * non-APIC source without intack.
> + */
> +static int kvm_cpu_has_extint(struct kvm_vcpu *v)
> +{
> +	if (kvm_apic_accept_pic_intr(v))
> +		return pic_irqchip(v->kvm)->output;	/* PIC */
> +	else
> +		return 0;
> +}
> +
> +/*
> + * check if there is an injectable interrupt:
> + * when virtual interrupt delivery is enabled,
> + * interrupts from the apic will be handled by hardware,
> + * so we don't need to check them here.
> + */
> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
> +{
> +	if (!irqchip_in_kernel(v->kvm))
> +		return v->arch.interrupt.pending;
> +
> +	if (kvm_cpu_has_extint(v))
> +		return 1;
> +
> +	if (kvm_apic_vid_enabled(v))
> +		return 0;
> +
> +	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
> +}
> +
> +/*
>   * check if there is pending interrupt without
>   * intack.
>   */
> @@ -46,27 +78,41 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>  	if (!irqchip_in_kernel(v->kvm))
>  		return v->arch.interrupt.pending;
>  
> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
> -		return pic_irqchip(v->kvm)->output;	/* PIC */
> +	if (kvm_cpu_has_extint(v))
> +		return 1;
>  
>  	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
>  }
>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
>  
>  /*
> + * Read pending interrupt(from non-APIC source)
> + * vector and intack.
> + */
> +static int kvm_cpu_get_extint(struct kvm_vcpu *v)
> +{
> +	if (kvm_cpu_has_extint(v))
> +		return kvm_pic_read_irq(v->kvm); /* PIC */
> +	return -1;
> +}
> +
> +/*
>   * Read pending interrupt vector and intack.
>   */
>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
>  {
> +	int vector;
> +
>  	if (!irqchip_in_kernel(v->kvm))
>  		return v->arch.interrupt.nr;
>  
> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
> -		return kvm_pic_read_irq(v->kvm);	/* PIC */
> +	vector = kvm_cpu_get_extint(v);
> +
> +	if (kvm_apic_vid_enabled(v) || vector != -1)
> +		return vector;			/* PIC */
>  
>  	return kvm_get_apic_interrupt(v);	/* APIC */
>  }
> -EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
>  
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>  {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ec38906..d219f41 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -150,23 +150,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
>  
> -static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
> -{
> -	u16 cid;
> -	ldr >>= 32 - map->ldr_bits;
> -	cid = (ldr >> map->cid_shift) & map->cid_mask;
> -
> -	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
> -
> -	return cid;
> -}
> -
> -static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
> -{
> -	ldr >>= (32 - map->ldr_bits);
> -	return ldr & map->lid_mask;
> -}
> -
>  static void recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
>  {
>  	apic_set_reg(apic, APIC_ID, id << 24);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }
>  
>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>  {
>  	apic_set_reg(apic, APIC_LDR, id);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }
>  
>  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
> @@ -345,6 +330,8 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
>  {
>  	int result;
>  
> +	/* Note that irr_pending is just a hint. It will be always
> +	 * true with virtual interrupt delivery enabled. */
>  	if (!apic->irr_pending)
>  		return -1;
>  
> @@ -461,6 +448,8 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
>  static inline int apic_find_highest_isr(struct kvm_lapic *apic)
>  {
>  	int result;
> +
> +	/* Note that isr_count is always 1 with vid enabled */
>  	if (!apic->isr_count)
>  		return -1;
>  	if (likely(apic->highest_isr_cache != -1))
> @@ -740,6 +729,19 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
>  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
>  }
>  
> +static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
> +{
> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
> +	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> +		int trigger_mode;
> +		if (apic_test_vector(vector, apic->regs + APIC_TMR))
> +			trigger_mode = IOAPIC_LEVEL_TRIG;
> +		else
> +			trigger_mode = IOAPIC_EDGE_TRIG;
> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> +	}
> +}
> +
>  static int apic_set_eoi(struct kvm_lapic *apic)
>  {
>  	int vector = apic_find_highest_isr(apic);
> @@ -756,19 +758,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
>  	apic_clear_isr(vector, apic);
>  	apic_update_ppr(apic);
>  
> -	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
> -	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
> -		int trigger_mode;
> -		if (apic_test_vector(vector, apic->regs + APIC_TMR))
> -			trigger_mode = IOAPIC_LEVEL_TRIG;
> -		else
> -			trigger_mode = IOAPIC_EDGE_TRIG;
> -		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> -	}
> +	kvm_ioapic_send_eoi(apic, vector);
>  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
>  	return vector;
>  }
>  
> +/*
> + * this interface assumes a trap-like exit, which has already finished
> + * desired side effect including vISR and vPPR update.
> + */
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	trace_kvm_eoi(apic, vector);
> +
> +	kvm_ioapic_send_eoi(apic, vector);
> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
> +
>  static void apic_send_ipi(struct kvm_lapic *apic)
>  {
>  	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
> @@ -1071,6 +1080,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  		if (!apic_x2apic_mode(apic)) {
>  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
>  			recalculate_apic_map(apic->vcpu->kvm);
> +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  		} else
>  			ret = 1;
>  		break;
> @@ -1318,6 +1328,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>  		else
>  			static_key_slow_inc(&apic_hw_disabled.key);
>  		recalculate_apic_map(vcpu->kvm);
> +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  	}
>  
>  	if (!kvm_vcpu_is_bsp(apic->vcpu))
> @@ -1377,8 +1388,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
>  		apic_set_reg(apic, APIC_ISR + 0x10 * i, 0);
>  		apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
>  	}
> -	apic->irr_pending = false;
> -	apic->isr_count = 0;
> +	apic->irr_pending = kvm_apic_vid_enabled(vcpu);
> +	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
> +				1 : 0;
Why not just "apic->isr_count = kvm_apic_vid_enabled(vcpu)"?
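
A small userspace sketch of why the ternary is redundant, assuming the
helper returns bool (as kvm_apic_vid_enabled() does in the lapic.h hunk
of this patch). In C a bool converts to int as exactly 0 or 1, so the
plain assignment gives the same result. struct model_lapic and
model_lapic_reset() are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields of struct kvm_lapic. */
struct model_lapic {
	bool irr_pending;
	int isr_count;
};

/* Mirrors the two assignments in kvm_lapic_reset() without the ternary. */
static void model_lapic_reset(struct model_lapic *apic, bool vid_enabled)
{
	apic->irr_pending = vid_enabled;
	apic->isr_count = vid_enabled; /* bool -> int yields 0 or 1 */
}
```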

>  	apic->highest_isr_cache = -1;
>  	update_divide_count(apic);
>  	atomic_set(&apic->lapic_timer.pending, 0);
> @@ -1593,8 +1605,10 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
>  	update_divide_count(apic);
>  	start_apic_timer(apic);
>  	apic->irr_pending = true;
> -	apic->isr_count = count_vectors(apic->regs + APIC_ISR);
> +	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
> +				1 : count_vectors(apic->regs + APIC_ISR);
>  	apic->highest_isr_cache = -1;
> +	kvm_x86_ops->set_svi(apic_find_highest_isr(apic));
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  }
>  
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 9a8ee22..fed6538 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -65,6 +65,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>  
>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
>  
>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
> @@ -126,4 +127,26 @@ static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
>  	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
>  }
>  
> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_x86_ops->has_virtual_interrupt_delivery(vcpu);
> +}
> +
> +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
> +{
> +	u16 cid;
> +	ldr >>= 32 - map->ldr_bits;
> +	cid = (ldr >> map->cid_shift) & map->cid_mask;
> +
> +	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
> +
> +	return cid;
> +}
> +
> +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
> +{
> +	ldr >>= (32 - map->ldr_bits);
> +	return ldr & map->lid_mask;
> +}
> +
>  #endif
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0b82cb1..0ce6543 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3576,6 +3576,21 @@ static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>  	return;
>  }
>  
> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return 0;
> +}
> +
> +static void svm_update_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	return;
> +}
> +
> +static void svm_set_svi(int isr)
> +{
> +	return;
> +}
> +
>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4296,6 +4311,9 @@ static struct kvm_x86_ops svm_x86_ops = {
>  	.enable_irq_window = enable_irq_window,
>  	.update_cr8_intercept = update_cr8_intercept,
>  	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
> +	.update_eoi_exitmap = svm_update_eoi_exitmap,
> +	.set_svi = svm_set_svi,
>  
>  	.set_tss_addr = svm_set_tss_addr,
>  	.get_tdp_level = get_npt_level,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index b203ce7..990409a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -434,6 +434,7 @@ struct vcpu_vmx {
>  	bool rdtscp_enabled;
>  
>  	bool virtual_x2apic_enabled;
> +	unsigned long eoi_exit_bitmap[4];
>  
>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
> @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
>  
>  static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>  {
> -	return false;
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  }
>  
>  static inline bool cpu_has_vmx_flexpriority(void)
> @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>  			SECONDARY_EXEC_RDTSCP |
>  			SECONDARY_EXEC_ENABLE_INVPCID |
> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  		if (adjust_vmx_controls(min2, opt2,
>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>  					&_cpu_based_2nd_exec_control) < 0)
> @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  
>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>  		_cpu_based_2nd_exec_control &= ~(
> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>  
>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
>  		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
> @@ -2778,9 +2782,15 @@ static __init int hardware_setup(void)
>  	if (!cpu_has_vmx_ple())
>  		ple_gap = 0;
>  
> -	if (!cpu_has_vmx_apic_register_virt())
> +	if (!cpu_has_vmx_apic_register_virt() ||
> +				!cpu_has_vmx_virtual_intr_delivery())
>  		enable_apicv_reg_vid = 0;
>  
> +	if (enable_apicv_reg_vid)
> +		kvm_x86_ops->update_cr8_intercept = NULL;
> +	else
> +		kvm_x86_ops->update_apic_irq = NULL;
> +
>  	if (nested)
>  		nested_vmx_setup_ctls_msrs();
>  
> @@ -3961,7 +3971,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	if (!ple_gap)
>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>  	if (!enable_apicv_reg_vid)
> -		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +		exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>  	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>  	return exec_control;
>  }
> @@ -4007,6 +4018,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>  				vmx_secondary_exec_control(vmx));
>  	}
>  
> +	if (enable_apicv_reg_vid) {
> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
> +
> +		vmcs_write16(GUEST_INTR_STATUS, 0);
> +	}
> +
>  	if (ple_gap) {
>  		vmcs_write32(PLE_GAP, ple_gap);
>  		vmcs_write32(PLE_WINDOW, ple_window);
> @@ -4924,6 +4944,16 @@ static int handle_apic_access(struct kvm_vcpu *vcpu)
>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>  }
>  
> +static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int vector = exit_qualification & 0xff;
> +
> +	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
> +	kvm_apic_set_eoi_accelerated(vcpu, vector);
> +	return 1;
> +}
> +
>  static int handle_apic_write(struct kvm_vcpu *vcpu)
>  {
>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -5869,6 +5899,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
>  	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
> +	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
>  	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
>  	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
> @@ -6238,7 +6269,7 @@ static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>  	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>  	vmx->virtual_x2apic_enabled = true;
>  
> -	if (!cpu_has_vmx_virtual_intr_delivery())
> +	if (!enable_apicv_reg_vid)
>  		return;
>  
>  	for (msr = 0x800; msr <= 0x8ff; msr++)
> @@ -6274,7 +6305,7 @@ static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>  	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
>  	vmx->virtual_x2apic_enabled = false;
>  
> -	if (!cpu_has_vmx_virtual_intr_delivery())
> +	if (!enable_apicv_reg_vid)
>  		return;
>  
>  	for (msr = 0x800; msr <= 0x8ff; msr++)
> @@ -6288,6 +6319,148 @@ static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>  	vmx_intercept_for_msr_write(0x83f, false, true);
>  }
>  
> +static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
> +{
> +	return enable_apicv_reg_vid;
> +}
Why does it need a vcpu parameter if it does not use it? It gets you in
trouble later.

> +
> +static void vmx_set_svi(int isr)
> +{
> +	u16 status;
> +	u8 old;
> +
> +	if (!enable_apicv_reg_vid)
> +		return;
> +
> +	if (isr == -1)
> +		isr = 0;
> +
> +	status = vmcs_read16(GUEST_INTR_STATUS);
> +	old = status >> 8;
> +	if (isr != old) {
> +		status &= 0xff;
> +		status |= isr << 8;
> +		vmcs_write16(GUEST_INTR_STATUS, status);
> +	}
> +}
> +
> +static void vmx_set_rvi(int vector)
> +{
> +	u16 status;
> +	u8 old;
> +
> +	status = vmcs_read16(GUEST_INTR_STATUS);
> +	old = (u8)status & 0xff;
> +	if ((u8)vector != old) {
> +		status &= ~0xff;
> +		status |= (u8)vector;
> +		vmcs_write16(GUEST_INTR_STATUS, status);
> +	}
> +}
> +
> +static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
> +{
> +	if (max_irr == -1)
> +		return;
> +
> +	vmx_set_rvi(max_irr);
> +}
> +
> +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
> +				u32 vector)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (WARN_ONCE((vector > 255),
> +		"KVM VMX: vector (%d) out of range\n", vector))
> +		return;
> +
> +	__set_bit(vector, vmx->eoi_exit_bitmap);
> +}
> +
> +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> +{
> +	struct kvm_lapic **dst;
> +	struct kvm_apic_map *map;
> +	unsigned long bitmap = 1;
> +	int i;
> +
> +	rcu_read_lock();
> +	map = rcu_dereference(vcpu->kvm->arch.apic_map);
> +
> +	if (unlikely(!map)) {
> +		set_eoi_exitmap_one(vcpu, irq->vector);
> +		goto out;
> +	}
> +
> +	if (irq->dest_mode == 0) { /* physical mode */
> +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> +				irq->dest_id == 0xff) {
> +			set_eoi_exitmap_one(vcpu, irq->vector);
> +			goto out;
> +		}
> +		dst = &map->phys_map[irq->dest_id & 0xff];
> +	} else {
> +		u32 mda = irq->dest_id << (32 - map->ldr_bits);
> +
> +		dst = map->logical_map[apic_cluster_id(map, mda)];
> +
> +		bitmap = apic_logical_id(map, mda);
> +	}
> +
> +	for_each_set_bit(i, &bitmap, 16) {
> +		if (!dst[i])
> +			continue;
> +		if (dst[i]->vcpu == vcpu) {
> +			set_eoi_exitmap_one(vcpu, irq->vector);
> +			break;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +}
> +
> +static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
> +	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
> +	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
> +	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
> +}
> +
> +static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
> +	union kvm_ioapic_redirect_entry *e;
> +	struct kvm_lapic_irq irqe;
> +	int index;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* clear eoi exit bitmap */
> +	memset(vmx->eoi_exit_bitmap, 0, 32);
> +
> +	/* traverse ioapic entries to set the eoi exit bitmap */
> +	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
> +		e = &ioapic->redirtbl[index];
> +		if (!e->fields.mask &&
> +			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> +			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
> +				 index))) {
> +			irqe.dest_id = e->fields.dest_id;
> +			irqe.vector = e->fields.vector;
> +			irqe.dest_mode = e->fields.dest_mode;
> +			irqe.delivery_mode = e->fields.delivery_mode << 8;
> +			vmx_check_ioapic_entry(vcpu, &irqe);
> +
> +		}
> +	}
This logic should sit in ioapic.c, and you cannot access the ioapic without
holding the ioapic lock.

> +
> +	vmx_load_eoi_exitmap(vcpu);
> +}
> +
>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>  {
>  	u32 exit_intr_info;
> @@ -7553,6 +7726,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.update_cr8_intercept = update_cr8_intercept,
>  	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
>  	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> +	.update_apic_irq = vmx_update_apic_irq,
> +	.update_eoi_exitmap = vmx_update_eoi_exitmap,
> +	.set_svi = vmx_set_svi,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1c9c834..e6d8227 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
>  			vcpu->arch.nmi_injected = true;
>  			kvm_x86_ops->set_nmi(vcpu);
>  		}
> -	} else if (kvm_cpu_has_interrupt(vcpu)) {
> +	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
>  					    false);
> @@ -5648,6 +5648,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			kvm_handle_pmu_event(vcpu);
>  		if (kvm_check_request(KVM_REQ_PMI, vcpu))
>  			kvm_deliver_pmi(vcpu);
> +		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu)) {
> +			mutex_lock(&vcpu->kvm->arch.vioapic->eoimap_lock);
You need to hold the ioapic lock, not the useless eoimap_lock that protects
nothing. And no, do not take it here. Call a function in ioapic.c.
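
A rough userspace sketch of that shape, not the actual fix: the table
walk lives in one function on the ioapic side and takes the lock
itself, so the caller in vcpu_enter_guest() never touches ioapic
internals. Everything named model_* is hypothetical; the atomic_flag
spinlock stands in for ioapic->lock, 24 pins is assumed for
IOAPIC_NUM_PINS, and destination matching is left out to keep it short.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NUM_PINS 24 /* assumed value of IOAPIC_NUM_PINS */

/* Hypothetical, simplified redirection table entry. */
struct model_entry {
	uint8_t vector;
	bool masked;
	bool level_triggered;
	bool has_ack_notifier;
};

struct model_ioapic {
	atomic_flag lock; /* toy spinlock standing in for ioapic->lock */
	struct model_entry redirtbl[MODEL_NUM_PINS];
};

static void model_lock(struct model_ioapic *ioapic)
{
	while (atomic_flag_test_and_set_explicit(&ioapic->lock,
						 memory_order_acquire))
		; /* spin until the flag is released */
}

static void model_unlock(struct model_ioapic *ioapic)
{
	atomic_flag_clear_explicit(&ioapic->lock, memory_order_release);
}

/*
 * Fill a 256-bit EOI exit bitmap (four 64-bit words) from the table.
 * The lock is taken and dropped here, not by the caller.
 */
static void model_ioapic_calc_eoi_exitmap(struct model_ioapic *ioapic,
					  uint64_t bitmap[4])
{
	int pin;

	memset(bitmap, 0, 4 * sizeof(bitmap[0]));

	model_lock(ioapic);
	for (pin = 0; pin < MODEL_NUM_PINS; pin++) {
		const struct model_entry *e = &ioapic->redirtbl[pin];

		if (e->masked)
			continue;
		if (e->level_triggered || e->has_ack_notifier)
			bitmap[e->vector / 64] |=
				UINT64_C(1) << (e->vector % 64);
	}
	model_unlock(ioapic);
}
```

The arch callback would then only have to load the finished bitmap into
the VMCS fields.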

> +			kvm_x86_ops->update_eoi_exitmap(vcpu);
> +			mutex_unlock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> +		}
>  	}
>  
>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> @@ -5656,10 +5661,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		/* enable NMI/IRQ window open exits if needed */
>  		if (vcpu->arch.nmi_pending)
>  			kvm_x86_ops->enable_nmi_window(vcpu);
> -		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> +		else if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
>  			kvm_x86_ops->enable_irq_window(vcpu);
>  
>  		if (kvm_lapic_enabled(vcpu)) {
> +			/* update architecture specific hints for APIC
> +			 * virtual interrupt delivery */
> +			if (kvm_x86_ops->update_apic_irq)
> +				kvm_x86_ops->update_apic_irq(vcpu,
> +					      kvm_lapic_find_highest_irr(vcpu));
>  			update_cr8_intercept(vcpu);
>  			kvm_lapic_sync_to_vapic(vcpu);
>  		}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cbe0d68..bc0e261 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -122,6 +122,7 @@ static inline bool is_error_page(struct page *page)
>  #define KVM_REQ_WATCHDOG          18
>  #define KVM_REQ_MASTERCLOCK_UPDATE 19
>  #define KVM_REQ_MCLOCK_INPROGRESS 20
> +#define KVM_REQ_EOIBITMAP         21
>  
>  #define KVM_USERSPACE_IRQ_SOURCE_ID		0
>  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
> @@ -537,6 +538,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
>  void kvm_flush_remote_tlbs(struct kvm *kvm);
>  void kvm_reload_remote_mmus(struct kvm *kvm);
>  void kvm_make_mclock_inprogress_request(struct kvm *kvm);
> +void kvm_make_update_eoibitmap_request(struct kvm *kvm);
>  
>  long kvm_arch_dev_ioctl(struct file *filp,
>  			unsigned int ioctl, unsigned long arg);
> @@ -690,6 +692,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
>  int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level);
>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>  		int irq_source_id, int level);
> +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
>  void kvm_register_irq_ack_notifier(struct kvm *kvm,
>  				   struct kvm_irq_ack_notifier *kian);
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index f3abbef..e5ccb8f 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -115,6 +115,20 @@ static void update_handled_vectors(struct kvm_ioapic *ioapic)
>  	smp_wmb();
>  }
>  
> +void ioapic_update_eoi_exitmap(struct kvm *kvm)
> +{
> +#ifdef CONFIG_X86
Define kvm_apic_vid_enabled() in IA64 instead.

> +	struct kvm_vcpu *vcpu = kvm->vcpus[0];
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +
> +	/* If vid is enabled on one vcpu, it is enabled
> +	 * on all other vcpus too. */
Vid state is global for all VM instances.  kvm_apic_vid_enabled() should
not get vcpu as a parameter.

> +	if (!kvm_apic_vid_enabled(vcpu) || !ioapic)
> +		return;
> +	kvm_make_update_eoibitmap_request(kvm);
> +#endif
> +}
> +
>  static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>  {
>  	unsigned index;
> @@ -156,6 +170,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>  		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
>  		    && ioapic->irr & (1 << index))
>  			ioapic_service(ioapic, index);
> +		ioapic_update_eoi_exitmap(ioapic->kvm);
>  		break;
>  	}
>  }
> @@ -415,6 +430,9 @@ int kvm_ioapic_init(struct kvm *kvm)
>  	ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
>  				      IOAPIC_MEM_LENGTH, &ioapic->dev);
>  	mutex_unlock(&kvm->slots_lock);
> +#ifdef CONFIG_X86
> +	mutex_init(&ioapic->eoimap_lock);
> +#endif
>  	if (ret < 0) {
>  		kvm->arch.vioapic = NULL;
>  		kfree(ioapic);
> diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
> index a30abfe..34544ce 100644
> --- a/virt/kvm/ioapic.h
> +++ b/virt/kvm/ioapic.h
> @@ -47,6 +47,9 @@ struct kvm_ioapic {
>  	void (*ack_notifier)(void *opaque, int irq);
>  	spinlock_t lock;
>  	DECLARE_BITMAP(handled_vectors, 256);
> +#ifdef CONFIG_X86
> +	struct mutex eoimap_lock;
> +#endif
This lock protects nothing. Drop it.

>  };
>  
>  #ifdef DEBUG
> @@ -82,5 +85,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq);
>  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> +void ioapic_update_eoi_exitmap(struct kvm *kvm);
>  
>  #endif
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 656fa45..64aa1ab 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/kvm_host.h>
>  #include <linux/slab.h>
> +#include <linux/export.h>
>  #include <trace/events/kvm.h>
>  
>  #include <asm/msidef.h>
> @@ -237,6 +238,25 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
> +{
> +	struct kvm_irq_ack_notifier *kian;
> +	struct hlist_node *n;
> +	int gsi;
> +
> +	rcu_read_lock();
> +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> +	if (gsi != -1)
> +		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> +					 link)
> +			if (kian->gsi == gsi)
> +				return true;
> +	rcu_read_unlock();
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -261,6 +281,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
>  	mutex_lock(&kvm->irq_lock);
>  	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
>  	mutex_unlock(&kvm->irq_lock);
> +	ioapic_update_eoi_exitmap(kvm);
>  }
>  
>  void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>  	hlist_del_init_rcu(&kian->link);
>  	mutex_unlock(&kvm->irq_lock);
>  	synchronize_rcu();
> +	ioapic_update_eoi_exitmap(kvm);
>  }
>  
>  int kvm_request_irq_source_id(struct kvm *kvm)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e45c20c..cc465c6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -217,6 +217,11 @@ void kvm_make_mclock_inprogress_request(struct kvm *kvm)
>  	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
>  }
>  
> +void kvm_make_update_eoibitmap_request(struct kvm *kvm)
> +{
> +	make_all_cpus_request(kvm, KVM_REQ_EOIBITMAP);
> +}
> +
>  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  {
>  	struct page *page;
> -- 
> 1.7.1

--
			Gleb.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  7:26 ` [PATCH v9 2/3] x86, apicv: add virtual x2apic support Yang Zhang
  2013-01-10  7:55   ` Gleb Natapov
@ 2013-01-10  8:25   ` Gleb Natapov
  2013-01-10  8:31     ` Zhang, Yang Z
  1 sibling, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10  8:25 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, haitao.shan, mtosatti, Kevin Tian

On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	u32 exec_control;
> +	int msr;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
> +		return;
> +
> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> +	/* virtualize x2apic mode relies on tpr shadow */
> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
> +		return;
> +
> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +	vmx->virtual_x2apic_enabled = true;
> +
> +	if (!cpu_has_vmx_virtual_intr_delivery())
> +		return;
> +
> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> +		vmx_intercept_for_msr_read(msr, false, false);
> +
> +	/* APIC ID */
> +	vmx_intercept_for_msr_read(0x802, false, true);
Why are you enabling apic id read intercept?

> +	/* TMCCT */
> +	vmx_intercept_for_msr_read(0x839, false, true);
> +	/* TPR */
> +	vmx_intercept_for_msr_write(0x808, false, false);
> +	/* EOI */
> +	vmx_intercept_for_msr_write(0x80b, false, false);
> +	/* SELF-IPI */
> +	vmx_intercept_for_msr_write(0x83f, false, false);
> +
> +}
> +

--
			Gleb.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  8:25   ` Gleb Natapov
@ 2013-01-10  8:31     ` Zhang, Yang Z
  2013-01-10  8:53       ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-10  8:31 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
>> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>> +{
>> +	u32 exec_control;
>> +	int msr;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
>> +		return;
>> +
>> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>> +	/* virtualize x2apic mode relies on tpr shadow */
>> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
>> +		return;
>> +
>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>> +	vmx->virtual_x2apic_enabled = true;
>> +
>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>> +		return;
>> +
>> +	for (msr = 0x800; msr <= 0x8ff; msr++)
>> +		vmx_intercept_for_msr_read(msr, false, false);
>> +
>> +	/* APIC ID */
>> +	vmx_intercept_for_msr_read(0x802, false, true);
> Why are you enabling apic id read intercept?
The current code for reading the APIC ID in x2apic mode has some hacks:

if (apic_x2apic_mode(apic))
       val = kvm_apic_id(apic);
else
       val = kvm_apic_id(apic) << 24;

static inline int kvm_apic_id(struct kvm_lapic *apic)
{
        return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
}

According to the spec, in x2apic mode the whole ID register is used, but KVM only uses the highest eight bits.
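
To make the mismatch concrete, here is a minimal standalone sketch (not KVM code). It assumes the ID sits in bits 31:24 of the stored register, as kvm_apic_id() above implies, and that a non-intercepted read would return the stored register unchanged. The helper names stored_id_reg(), emulated_id_read() and raw_id_read() are hypothetical.

```c
#include <stdint.h>

/* Hypothetical: how an 8-bit ID lands in the APIC_ID slot (bits 31:24). */
static uint32_t stored_id_reg(uint8_t id)
{
	return (uint32_t)id << 24;
}

/*
 * Mirrors the read path quoted above: x2apic mode returns the bare ID,
 * xapic mode returns the ID shifted back into bits 31:24.
 */
static uint32_t emulated_id_read(uint32_t reg, int x2apic_mode)
{
	uint32_t id = (reg >> 24) & 0xff;

	return x2apic_mode ? id : id << 24;
}

/* Assumption: an unintercepted read hands the stored slot to the guest as-is. */
static uint32_t raw_id_read(uint32_t reg)
{
	return reg;
}
```

With id = 5, raw_id_read() yields 0x05000000 while emulated_id_read() in x2apic mode yields 5, so under these assumptions the guest would see the wrong value unless the read of MSR 0x802 stays intercepted.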

>> +	/* TMCCT */
>> +	vmx_intercept_for_msr_read(0x839, false, true);
>> +	/* TPR */
>> +	vmx_intercept_for_msr_write(0x808, false, false);
>> +	/* EOI */
>> +	vmx_intercept_for_msr_write(0x80b, false, false);
>> +	/* SELF-IPI */
>> +	vmx_intercept_for_msr_write(0x83f, false, false);
>> +
>> +}
>> +
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  7:55   ` Gleb Natapov
@ 2013-01-10  8:32     ` Zhang, Yang Z
  2013-01-10  8:52       ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-10  8:32 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
>> From: Yang Zhang <yang.z.zhang@Intel.com>
>> 
>> basically to benefit from apicv, we need to enable virtualized x2apic mode.
>> Currently, we only enable it when guest is really using x2apic.
>> 
>> Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
>>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
>>     		   except APIC ID and TMCCT.
>>     APIC ID and TMCCT: need software's assistance to get right value.
>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |    2 +
>>  arch/x86/include/asm/vmx.h      |    1 +
>>  arch/x86/kvm/lapic.c            |    5 +-
>>  arch/x86/kvm/svm.c              |    6 +
>>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
>>  5 files changed, 200 insertions(+), 8 deletions(-)
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index c431b33..572a562 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> Make one callback with enable/disable parameter. And do not forget SVM.
> 
>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>  	int (*get_tdp_level)(void);
>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 44c3f7e..0a54df0 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -139,6 +139,7 @@
>>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
>>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
>>  #define SECONDARY_EXEC_RDTSCP			0x00000008
>> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
>>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 0664c13..ec38906 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>>  		u32 id = kvm_apic_id(apic);
>>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
>>  		kvm_apic_set_ldr(apic, ldr);
>> -	}
>> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
>> +	} else
>> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
>> +
> You just broke SVM.
>>  	apic->base_address = apic->vcpu->arch.apic_base &
>>  			     MSR_IA32_APICBASE_BASE;
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index d29d3cd..0b82cb1 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>>  }
>> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>> +{
>> +	return;
>> +}
>> +
>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>  	.enable_nmi_window = enable_nmi_window,
>>  	.enable_irq_window = enable_irq_window,
>>  	.update_cr8_intercept = update_cr8_intercept,
>> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
>> 
>>  	.set_tss_addr = svm_set_tss_addr,
>>  	.get_tdp_level = get_npt_level,
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 688f43f..b203ce7 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -433,6 +433,8 @@ struct vcpu_vmx {
>> 
>>  	bool rdtscp_enabled;
>> +	bool virtual_x2apic_enabled;
>> +
>>  	/* Support for a guest hypervisor (nested VMX) */
>>  	struct nested_vmx nested;
>>  };
>> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>  }
>> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
>> +{
>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
>> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>> +}
>> +
>>  static inline bool cpu_has_vmx_apic_register_virt(void)
>>  {
>>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
>>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>  }
>> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>> +{
>> +	return false;
>> +}
>> +
>>  static inline bool cpu_has_vmx_flexpriority(void)
>>  {
>>  	return cpu_has_vmx_tpr_shadow() &&
>> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
>>  		min2 = 0;
>>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
>>  			SECONDARY_EXEC_WBINVD_EXITING |
>>  			SECONDARY_EXEC_ENABLE_VPID |
>>  			SECONDARY_EXEC_ENABLE_EPT |
>> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>  	spin_unlock(&vmx_vpid_lock);
>>  }
>> 
>> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>> +#define MSR_TYPE_R	1
>> +#define MSR_TYPE_W	2
>> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>> +						u32 msr, int type)
>> +{
>> +	int f = sizeof(unsigned long);
>> +
>> +	if (!cpu_has_vmx_msr_bitmap())
>> +		return;
>> +
>> +	/*
>> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
>> +	 * have the write-low and read-high bitmap offsets the wrong way round.
>> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>> +	 */
>> +	if (msr <= 0x1fff) {
>> +		if (type & MSR_TYPE_R)
>> +			/* read-low */
>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-low */
>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>> +
>> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>> +		msr &= 0x1fff;
>> +		if (type & MSR_TYPE_R)
>> +			/* read-high */
>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-high */
>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>> +
>> +	}
>> +}
>> +
>> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
>> +						u32 msr, int type)
>>  {
>>  	int f = sizeof(unsigned long);
>> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>  	 */
>>  	if (msr <= 0x1fff) {
>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>> +		if (type & MSR_TYPE_R)
>> +			/* read-low */
>> +			__set_bit(msr, msr_bitmap + 0x000 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-low */
>> +			__set_bit(msr, msr_bitmap + 0x800 / f);
>> +
>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>  		msr &= 0x1fff;
>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>> +		if (type & MSR_TYPE_R)
>> +			/* read-high */
>> +			__set_bit(msr, msr_bitmap + 0x400 / f);
>> +
>> +		if (type & MSR_TYPE_W)
>> +			/* write-high */
>> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
>> +
>>  	}
>>  }
>> +
>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>>  {
>>  	if (!longmode_only)
>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>> +}
>> +
>> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
>> +					bool set)
>> +{
>> +	if (!longmode_only) {
>> +		if (set)
>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>> +					msr, MSR_TYPE_R);
>> +		else
>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>> +					msr, MSR_TYPE_R);
>> +
>> +	}
>> +	if (set)
>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +				msr, MSR_TYPE_R);
>> +	else
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +				msr, MSR_TYPE_R);
>> +}
>> +
>> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
>> +					bool set)
>> +{
>> +	if (!longmode_only) {
>> +		if (set)
>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>> +					msr, MSR_TYPE_W);
>> +		else
>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>> +					msr, MSR_TYPE_W);
>> +
>> +	}
>> +	if (set)
>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +				msr, MSR_TYPE_W);
>> +	else
>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>> +				msr, MSR_TYPE_W);
>>  }
>>  
>>  /*
>> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>  	if (!enable_apicv_reg_vid)
>>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>  	return exec_control;
>>  }
>> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>  	vmcs_write32(TPR_THRESHOLD, irr);
>>  }
>> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>> +{
>> +	u32 exec_control;
>> +	int msr;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
>> +		return;
>> +
>> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>> +	/* virtualize x2apic mode relies on tpr shadow */
>> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
>> +		return;
>> +
>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>> +	vmx->virtual_x2apic_enabled = true;
> Why track it?
With this flag, we don't need to read the VMCS to check whether we enabled virtual x2apic before.

>> +
>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>> +		return;
>> +
> You need to test whether vid is enabled, not whether it can be enabled.
> And you need to test it at the beginning of the function. If vid is
> disabled we do not want to set SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE.
> Also force disable vid if !cpu_has_vmx_virtualize_x2apic_mode().
vid does not depend on virtualize x2apic.
The right logic should be:
When register virtualization is enabled, reads of all APIC MSRs (except the APIC ID register and TMCCT, which have hooks) do not cause a vmexit, and among writes only the TPR write does not cause a vmexit.
When vid is enabled, writes to TPR, EOI and SELF-IPI do not cause a vmexit.
It's better to put the patch that enables virtualize x2apic mode first.
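
The rule above can be sketched as a pair of standalone predicates. This is only an illustration of the logic as stated in this mail, not the patch itself; the function and macro names are hypothetical, and MSRs outside 0x800 - 0x8ff are simply treated as intercepted.

```c
#include <stdbool.h>
#include <stdint.h>

/* x2apic MSR numbers used in the patch; macro names are hypothetical. */
#define X2APIC_MSR_FIRST	0x800u
#define X2APIC_MSR_LAST		0x8ffu
#define X2APIC_MSR_ID		0x802u
#define X2APIC_MSR_TPR		0x808u
#define X2APIC_MSR_EOI		0x80bu
#define X2APIC_MSR_TMCCT	0x839u
#define X2APIC_MSR_SELF_IPI	0x83fu

static bool in_x2apic_range(uint32_t msr)
{
	return msr >= X2APIC_MSR_FIRST && msr <= X2APIC_MSR_LAST;
}

/* Returns true when a read of @msr should still cause a vmexit. */
static bool x2apic_read_intercepted(uint32_t msr, bool reg_virt)
{
	if (!in_x2apic_range(msr) || !reg_virt)
		return true;
	/* APIC ID and TMCCT need software assistance. */
	return msr == X2APIC_MSR_ID || msr == X2APIC_MSR_TMCCT;
}

/* Returns true when a write to @msr should still cause a vmexit. */
static bool x2apic_write_intercepted(uint32_t msr, bool reg_virt, bool vid)
{
	if (!in_x2apic_range(msr))
		return true;
	if (msr == X2APIC_MSR_TPR)
		return !(reg_virt || vid);
	if (msr == X2APIC_MSR_EOI || msr == X2APIC_MSR_SELF_IPI)
		return !vid;
	return true;
}
```

For example, with register virtualization on and vid off, a TPR write passes through but an EOI write still exits.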

>> +	for (msr = 0x800; msr <= 0x8ff; msr++)
>> +		vmx_intercept_for_msr_read(msr, false, false);
>> +
>> +	/* APIC ID */
>> +	vmx_intercept_for_msr_read(0x802, false, true);
>> +	/* TMCCT */
>> +	vmx_intercept_for_msr_read(0x839, false, true);
>> +	/* TPR */
>> +	vmx_intercept_for_msr_write(0x808, false, false);
>> +	/* EOI */
>> +	vmx_intercept_for_msr_write(0x80b, false, false);
>> +	/* SELF-IPI */
>> +	vmx_intercept_for_msr_write(0x83f, false, false);
>> +
>> +}
>> +
>> +static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>> +{
>> +	u32 second_exec_control;
>> +	int msr;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	/* If doesn't enable virtual x2apic before, do nothing*/
>> +	if (!vmx->virtual_x2apic_enabled)
>> +		return;
>> +
>> +	second_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>> +	/* disalbe virtual x2apic*/
>> +	second_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>> +	second_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
>> +	vmx->virtual_x2apic_enabled = false;
>> +
>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>> +		return;
>> +
>> +	for (msr = 0x800; msr <= 0x8ff; msr++)
>> +		vmx_intercept_for_msr_read(msr, false, true);
>> +
>> +	/* TPR */
>> +	vmx_intercept_for_msr_write(0x808, false, true);
>> +	/* EOI */
>> +	vmx_intercept_for_msr_write(0x80b, false, true);
>> +	/* SELF-IPI */
>> +	vmx_intercept_for_msr_write(0x83f, false, true);
>> +}
>> +
>>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>>  {
>>  	u32 exit_intr_info;
>> @@ -7373,6 +7551,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>  	.enable_nmi_window = enable_nmi_window,
>>  	.enable_irq_window = enable_irq_window,
>>  	.update_cr8_intercept = update_cr8_intercept,
>> +	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
>> +	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
>> 
>>  	.set_tss_addr = vmx_set_tss_addr,
>>  	.get_tdp_level = get_ept_level,
>> --
>> 1.7.1
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  8:32     ` Zhang, Yang Z
@ 2013-01-10  8:52       ` Gleb Natapov
  2013-01-10 11:54         ` Zhang, Yang Z
  2013-01-11  2:37         ` Zhang, Yang Z
  0 siblings, 2 replies; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10  8:52 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

On Thu, Jan 10, 2013 at 08:32:06AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-01-10:
> > On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
> >> From: Yang Zhang <yang.z.zhang@Intel.com>
> >> 
> >> basically to benefit from apicv, we need to enable virtualized x2apic mode.
> >> Currently, we only enable it when guest is really using x2apic.
> >> 
> >> Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
> >>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
> >>     		   except APIC ID and TMCCT.
> >>     APIC ID and TMCCT: need software's assistance to get right value.
> >>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
> >> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |    2 +
> >>  arch/x86/include/asm/vmx.h      |    1 +
> >>  arch/x86/kvm/lapic.c            |    5 +-
> >>  arch/x86/kvm/svm.c              |    6 +
> >>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
> >>  5 files changed, 200 insertions(+), 8 deletions(-)
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index c431b33..572a562 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
> >>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> >> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> >> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> > Make one callback with enable/disable parameter. And do not forget SVM.
> > 
> >>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>  	int (*get_tdp_level)(void);
> >>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> >> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >> index 44c3f7e..0a54df0 100644
> >> --- a/arch/x86/include/asm/vmx.h
> >> +++ b/arch/x86/include/asm/vmx.h
> >> @@ -139,6 +139,7 @@
> >>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
> >>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
> >>  #define SECONDARY_EXEC_RDTSCP			0x00000008
> >> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
> >>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
> >>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
> >>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index 0664c13..ec38906 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> >>  		u32 id = kvm_apic_id(apic);
> >>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
> >>  		kvm_apic_set_ldr(apic, ldr);
> >> -	}
> >> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
> >> +	} else
> >> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
> >> +
> > You just broke SVM.
> >>  	apic->base_address = apic->vcpu->arch.apic_base &
> >>  			     MSR_IA32_APICBASE_BASE;
> >> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> index d29d3cd..0b82cb1 100644
> >> --- a/arch/x86/kvm/svm.c
> >> +++ b/arch/x86/kvm/svm.c
> >> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> >>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
> >>  }
> >> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return;
> >> +}
> >> +
> >>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> >>  {
> >>  	struct vcpu_svm *svm = to_svm(vcpu);
> >> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> >>  	.enable_nmi_window = enable_nmi_window,
> >>  	.enable_irq_window = enable_irq_window,
> >>  	.update_cr8_intercept = update_cr8_intercept,
> >> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
> >> 
> >>  	.set_tss_addr = svm_set_tss_addr,
> >>  	.get_tdp_level = get_npt_level,
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index 688f43f..b203ce7 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -433,6 +433,8 @@ struct vcpu_vmx {
> >> 
> >>  	bool rdtscp_enabled;
> >> +	bool virtual_x2apic_enabled;
> >> +
> >>  	/* Support for a guest hypervisor (nested VMX) */
> >>  	struct nested_vmx nested;
> >>  };
> >> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
> >>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>  }
> >> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
> >> +{
> >> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> >> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >> +}
> >> +
> >>  static inline bool cpu_has_vmx_apic_register_virt(void)
> >>  {
> >>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
> >>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >>  }
> >> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> >> +{
> >> +	return false;
> >> +}
> >> +
> >>  static inline bool cpu_has_vmx_flexpriority(void)
> >>  {
> >>  	return cpu_has_vmx_tpr_shadow() &&
> >> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
> >>  		min2 = 0;
> >>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> >> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> >>  			SECONDARY_EXEC_WBINVD_EXITING |
> >>  			SECONDARY_EXEC_ENABLE_VPID |
> >>  			SECONDARY_EXEC_ENABLE_EPT |
> >> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
> >>  	spin_unlock(&vmx_vpid_lock);
> >>  }
> >> 
> >> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> >> +#define MSR_TYPE_R	1
> >> +#define MSR_TYPE_W	2
> >> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> >> +						u32 msr, int type)
> >> +{
> >> +	int f = sizeof(unsigned long);
> >> +
> >> +	if (!cpu_has_vmx_msr_bitmap())
> >> +		return;
> >> +
> >> +	/*
> >> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
> >> +	 * have the write-low and read-high bitmap offsets the wrong way round.
> >> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
> >> +	 */
> >> +	if (msr <= 0x1fff) {
> >> +		if (type & MSR_TYPE_R)
> >> +			/* read-low */
> >> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
> >> +
> >> +		if (type & MSR_TYPE_W)
> >> +			/* write-low */
> >> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
> >> +
> >> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> >> +		msr &= 0x1fff;
> >> +		if (type & MSR_TYPE_R)
> >> +			/* read-high */
> >> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
> >> +
> >> +		if (type & MSR_TYPE_W)
> >> +			/* write-high */
> >> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
> >> +
> >> +	}
> >> +}
> >> +
> >> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
> >> +						u32 msr, int type)
> >>  {
> >>  	int f = sizeof(unsigned long);
> >> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> >>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
> >>  	 */
> >>  	if (msr <= 0x1fff) {
> >> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
> >> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
> >> +		if (type & MSR_TYPE_R)
> >> +			/* read-low */
> >> +			__set_bit(msr, msr_bitmap + 0x000 / f);
> >> +
> >> +		if (type & MSR_TYPE_W)
> >> +			/* write-low */
> >> +			__set_bit(msr, msr_bitmap + 0x800 / f);
> >> +
> >>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> >>  		msr &= 0x1fff;
> >> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
> >> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
> >> +		if (type & MSR_TYPE_R)
> >> +			/* read-high */
> >> +			__set_bit(msr, msr_bitmap + 0x400 / f);
> >> +
> >> +		if (type & MSR_TYPE_W)
> >> +			/* write-high */
> >> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
> >> +
> >>  	}
> >>  }
> >> +
> >>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
> >>  {
> >>  	if (!longmode_only)
> >> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
> >> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
> >> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> >> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> >> +}
> >> +
> >> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
> >> +					bool set)
> >> +{
> >> +	if (!longmode_only) {
> >> +		if (set)
> >> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >> +					msr, MSR_TYPE_R);
> >> +		else
> >> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >> +					msr, MSR_TYPE_R);
> >> +
> >> +	}
> >> +	if (set)
> >> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >> +				msr, MSR_TYPE_R);
> >> +	else
> >> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >> +				msr, MSR_TYPE_R);
> >> +}
> >> +
> >> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
> >> +					bool set)
> >> +{
> >> +	if (!longmode_only) {
> >> +		if (set)
> >> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >> +					msr, MSR_TYPE_W);
> >> +		else
> >> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >> +					msr, MSR_TYPE_W);
> >> +
> >> +	}
> >> +	if (set)
> >> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >> +				msr, MSR_TYPE_W);
> >> +	else
> >> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >> +				msr, MSR_TYPE_W);
> >>  }
> >>  
> >>  /*
> >> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> >>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> >>  	if (!enable_apicv_reg_vid)
> >>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >>  	return exec_control;
> >>  }
> >> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> >>  	vmcs_write32(TPR_THRESHOLD, irr);
> >>  }
> >> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >> +{
> >> +	u32 exec_control;
> >> +	int msr;
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
> >> +		return;
> >> +
> >> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> >> +	/* virtualize x2apic mode relies on tpr shadow */
> >> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
> >> +		return;
> >> +
> >> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> >> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> >> +	vmx->virtual_x2apic_enabled = true;
> > Why track it?
> With this flag, we don't need to read the VMCS to check whether we enabled virtual x2apic before.
> 
Why do you care? Just disable it regardless.

> >> +
> >> +	if (!cpu_has_vmx_virtual_intr_delivery())
> >> +		return;
> >> +
> > You need to test whether vid is enabled, not whether it can be enabled.
> > And you need to test it at the beginning of the function. If vid is
> > disabled we do not want to set SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE.
> > Also force disable vid if !cpu_has_vmx_virtualize_x2apic_mode().
> vid does not depend on virtualize x2apic.
It does in our implementation. If vid is enabled and virtualize x2apic is
not, our code will not work if the guest uses x2apic.
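
A rough sketch of the dependency I am asking for, assuming a small snapshot struct in place of the real vmcs_config checks. The struct, its field names and both helper names are hypothetical and only model the decision, not the VMCS programming.

```c
#include <stdbool.h>

/* Hypothetical capability/config snapshot; field names are illustrative. */
struct apicv_state {
	bool cpu_has_vid;          /* CPU supports virtual interrupt delivery */
	bool cpu_has_virt_x2apic;  /* CPU supports virtualize x2apic mode */
	bool vid_requested;        /* e.g. the module option is on */
};

/* vid stays on only if virtualize x2apic mode is available as well. */
static bool vid_enabled(const struct apicv_state *s)
{
	return s->vid_requested && s->cpu_has_vid && s->cpu_has_virt_x2apic;
}

/* The x2apic control is set only when vid is actually enabled. */
static bool want_virt_x2apic(const struct apicv_state *s, bool guest_in_x2apic)
{
	return guest_in_x2apic && vid_enabled(s);
}
```

With this shape the check happens once, at the top of the enable function, and vid is forced off on hardware that lacks virtualize x2apic mode.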

> The right logic should be:
> When register virtualization is enabled, reads of all apic MSRs (except the apic id reg and tmcct, which have the hook) do not cause a vmexit, and only writes to TPR do not cause a vmexit.
> When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
> It's better to make the patch that enables virtualize x2apic mode the first patch.
> 
There is no point whatsoever in enabling virtualize x2apic without
allowing at least one non-intercepted access to the x2apic MSR range, and
this is what your patch does when run on a cpu without vid capability.

> >> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> >> +		vmx_intercept_for_msr_read(msr, false, false);
> >> +
> >> +	/* APIC ID */
> >> +	vmx_intercept_for_msr_read(0x802, false, true);
> >> +	/* TMCCT */
> >> +	vmx_intercept_for_msr_read(0x839, false, true);
> >> +	/* TPR */
> >> +	vmx_intercept_for_msr_write(0x808, false, false);
> >> +	/* EOI */
> >> +	vmx_intercept_for_msr_write(0x80b, false, false);
> >> +	/* SELF-IPI */
> >> +	vmx_intercept_for_msr_write(0x83f, false, false);
> >> +
> >> +}
> >> +
> >> +static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >> +{
> >> +	u32 second_exec_control;
> >> +	int msr;
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	/* If doesn't enable virtual x2apic before, do nothing*/
> >> +	if (!vmx->virtual_x2apic_enabled)
> >> +		return;
> >> +
> >> +	second_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> >> +	/* disalbe virtual x2apic*/
> >> +	second_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >> +	second_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
> >> +	vmx->virtual_x2apic_enabled = false;
> >> +
> >> +	if (!cpu_has_vmx_virtual_intr_delivery())
> >> +		return;
> >> +
> >> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> >> +		vmx_intercept_for_msr_read(msr, false, true);
> >> +
> >> +	/* TPR */
> >> +	vmx_intercept_for_msr_write(0x808, false, true);
> >> +	/* EOI */
> >> +	vmx_intercept_for_msr_write(0x80b, false, true);
> >> +	/* SELF-IPI */
> >> +	vmx_intercept_for_msr_write(0x83f, false, true);
> >> +}
> >> +
> >>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
> >>  {
> >>  	u32 exit_intr_info;
> >> @@ -7373,6 +7551,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
> >>  	.enable_nmi_window = enable_nmi_window,
> >>  	.enable_irq_window = enable_irq_window,
> >>  	.update_cr8_intercept = update_cr8_intercept,
> >> +	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
> >> +	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
> >> 
> >>  	.set_tss_addr = vmx_set_tss_addr,
> >>  	.get_tdp_level = get_ept_level,
> >> --
> >> 1.7.1
> > 
> > --
> > 			Gleb.
> 
> 
> Best regards,
> Yang
> 

--
			Gleb.


* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  8:31     ` Zhang, Yang Z
@ 2013-01-10  8:53       ` Gleb Natapov
  0 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10  8:53 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

On Thu, Jan 10, 2013 at 08:31:51AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-01-10:
> > On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
> >> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >> +{
> >> +	u32 exec_control;
> >> +	int msr;
> >> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >> +
> >> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
> >> +		return;
> >> +
> >> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> >> +	/* virtualize x2apic mode relies on tpr shadow */
> >> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
> >> +		return;
> >> +
> >> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> >> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> >> +	vmx->virtual_x2apic_enabled = true;
> >> +
> >> +	if (!cpu_has_vmx_virtual_intr_delivery())
> >> +		return;
> >> +
> >> +	for (msr = 0x800; msr <= 0x8ff; msr++)
> >> +		vmx_intercept_for_msr_read(msr, false, false);
> >> +
> >> +	/* APIC ID */
> >> +	vmx_intercept_for_msr_read(0x802, false, true);
> > Why are you enabling apic id read intercept?
> Current code to read apic id in x2apic mode has some hacks:
> 
> if (apic_x2apic_mode(apic))
>        val = kvm_apic_id(apic);
> else
>        val = kvm_apic_id(apic) << 24;
> 
> static inline int kvm_apic_id(struct kvm_lapic *apic)
> {
>         return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> }
> 
> According to the spec, in x2apic mode the whole id reg is used, but KVM only uses the highest eight bits.
> 
Correct. Put a comment above the apic id intercept saying that we do this
because in x2apic mode the apic id is not correctly reflected in the apic page.
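
To spell out the mismatch, here is a small sketch, assuming an 8-bit id as in the quoted kvm_apic_id(). The three helper names are hypothetical; the shift and mask mirror the quoted code.

```c
#include <stdint.h>

/* What the apic page holds today: the id lives in bits 31:24. */
static uint32_t apic_page_id_reg(uint8_t id)
{
	return (uint32_t)id << 24;
}

/* What a guest in x2apic mode expects from a read of MSR 0x802. */
static uint32_t x2apic_id_msr_value(uint8_t id)
{
	return id;
}

/* Same extraction as the quoted kvm_apic_id(). */
static uint32_t id_from_page(uint32_t reg)
{
	return (reg >> 24) & 0xff;
}
```

If the read were not intercepted, the value taken straight from the page would be the shifted form, which is why the intercept has to stay until the page layout follows the x2apic format.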

> >> +	/* TMCCT */
> >> +	vmx_intercept_for_msr_read(0x839, false, true);
> >> +	/* TPR */
> >> +	vmx_intercept_for_msr_write(0x808, false, false);
> >> +	/* EOI */
> >> +	vmx_intercept_for_msr_write(0x80b, false, false);
> >> +	/* SELF-IPI */
> >> +	vmx_intercept_for_msr_write(0x83f, false, false);
> >> +
> >> +}
> >> +
> > 
> > --
> > 			Gleb.
> 
> 
> Best regards,
> Yang
> 

--
			Gleb.


* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  8:52       ` Gleb Natapov
@ 2013-01-10 11:54         ` Zhang, Yang Z
  2013-01-10 12:16           ` Gleb Natapov
  2013-01-11  2:37         ` Zhang, Yang Z
  1 sibling, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-10 11:54 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 08:32:06AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-01-10:
>>> On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>> 
>>>> basically to benefit from apicv, we need to enable virtualized x2apic mode.
>>>> Currently, we only enable it when guest is really using x2apic.
>>>> 
>>>> Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled
>>> x2apic:
>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
>>>>     		   except APIC ID and TMCCT.
>>>>     APIC ID and TMCCT: need software's assistance to get right value.
>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |    2 +
>>>>  arch/x86/include/asm/vmx.h      |    1 +
>>>>  arch/x86/kvm/lapic.c            |    5 +-
>>>>  arch/x86/kvm/svm.c              |    6 +
>>>>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
>>>>  5 files changed, 200 insertions(+), 8 deletions(-)
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index c431b33..572a562 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>>>> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>>> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>> Make one callback with enable/disable parameter. And do not forget SVM.
>>> 
>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>>>  	int (*get_tdp_level)(void);
>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>>>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>>>> index 44c3f7e..0a54df0 100644
>>>> --- a/arch/x86/include/asm/vmx.h
>>>> +++ b/arch/x86/include/asm/vmx.h
>>>> @@ -139,6 +139,7 @@
>>>>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
>>>>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
>>>>  #define SECONDARY_EXEC_RDTSCP			0x00000008
>>>> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
>>>>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
>>>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>>>>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>> index 0664c13..ec38906 100644
>>>> --- a/arch/x86/kvm/lapic.c
>>>> +++ b/arch/x86/kvm/lapic.c
>>>> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>>>>  		u32 id = kvm_apic_id(apic);
>>>>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
>>>>  		kvm_apic_set_ldr(apic, ldr);
>>>> -	}
>>>> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
>>>> +	} else
>>>> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
>>>> +
>>> You just broke SVM.
>>>>  	apic->base_address = apic->vcpu->arch.apic_base &
>>>>  			     MSR_IA32_APICBASE_BASE;
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index d29d3cd..0b82cb1 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>>>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>>>>  }
>>>> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	return;
>>>> +}
>>>> +
>>>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	struct vcpu_svm *svm = to_svm(vcpu);
>>>> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>>>  	.enable_nmi_window = enable_nmi_window,
>>>>  	.enable_irq_window = enable_irq_window,
>>>>  	.update_cr8_intercept = update_cr8_intercept,
>>>> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
>>>> 
>>>>  	.set_tss_addr = svm_set_tss_addr,
>>>>  	.get_tdp_level = get_npt_level,
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 688f43f..b203ce7 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -433,6 +433,8 @@ struct vcpu_vmx {
>>>> 
>>>>  	bool rdtscp_enabled;
>>>> +	bool virtual_x2apic_enabled;
>>>> +
>>>>  	/* Support for a guest hypervisor (nested VMX) */
>>>>  	struct nested_vmx nested;
>>>>  };
>>>> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>>>>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>>>  }
>>>> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
>>>> +{
>>>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
>>>> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>> +}
>>>> +
>>>>  static inline bool cpu_has_vmx_apic_register_virt(void)
>>>>  {
>>>>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
>>>>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>>>  }
>>>> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>>>> +{
>>>> +	return false;
>>>> +}
>>>> +
>>>>  static inline bool cpu_has_vmx_flexpriority(void)
>>>>  {
>>>>  	return cpu_has_vmx_tpr_shadow() &&
>>>> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>>>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
>>>>  		min2 = 0;
>>>>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>>>> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
>>>>  			SECONDARY_EXEC_WBINVD_EXITING |
>>>>  			SECONDARY_EXEC_ENABLE_VPID |
>>>>  			SECONDARY_EXEC_ENABLE_EPT |
>>>> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>>>  	spin_unlock(&vmx_vpid_lock);
>>>>  }
>>>> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>> +#define MSR_TYPE_R	1
>>>> +#define MSR_TYPE_W	2
>>>> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>>>> +						u32 msr, int type)
>>>> +{
>>>> +	int f = sizeof(unsigned long);
>>>> +
>>>> +	if (!cpu_has_vmx_msr_bitmap())
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
>>>> +	 * have the write-low and read-high bitmap offsets the wrong way round.
>>>> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>>> +	 */
>>>> +	if (msr <= 0x1fff) {
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-low */
>>>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-low */
>>>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>>>> +
>>>> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>> +		msr &= 0x1fff;
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-high */
>>>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-high */
>>>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>>>> +
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
>>>> +						u32 msr, int type)
>>>>  {
>>>>  	int f = sizeof(unsigned long);
>>>> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>>>  	 */
>>>>  	if (msr <= 0x1fff) {
>>>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>>>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-low */
>>>> +			__set_bit(msr, msr_bitmap + 0x000 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-low */
>>>> +			__set_bit(msr, msr_bitmap + 0x800 / f);
>>>> +
>>>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>>  		msr &= 0x1fff;
>>>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>>>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-high */
>>>> +			__set_bit(msr, msr_bitmap + 0x400 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-high */
>>>> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
>>>> +
>>>>  	}
>>>>  }
>>>> +
>>>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>>>>  {
>>>>  	if (!longmode_only)
>>>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>>>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>>>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>>>> +}
>>>> +
>>>> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
>>>> +					bool set)
>>>> +{
>>>> +	if (!longmode_only) {
>>>> +		if (set)
>>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_R);
>>>> +		else
>>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_R);
>>>> +
>>>> +	}
>>>> +	if (set)
>>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_R);
>>>> +	else
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_R);
>>>> +}
>>>> +
>>>> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
>>>> +					bool set)
>>>> +{
>>>> +	if (!longmode_only) {
>>>> +		if (set)
>>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_W);
>>>> +		else
>>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_W);
>>>> +
>>>> +	}
>>>> +	if (set)
>>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_W);
>>>> +	else
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_W);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>>>  	if (!enable_apicv_reg_vid)
>>>>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>>  	return exec_control;
>>>>  }
>>>> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>>>  	vmcs_write32(TPR_THRESHOLD, irr);
>>>>  }
>>>> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u32 exec_control;
>>>> +	int msr;
>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +
>>>> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
>>>> +		return;
>>>> +
>>>> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>>>> +	/* virtualize x2apic mode relies on tpr shadow */
>>>> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
>>>> +		return;
>>>> +
>>>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>>> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>>>> +	vmx->virtual_x2apic_enabled = true;
>>> Why track it?
>> With this flag, we don't need to read the vmcs to check whether we enabled
>> virtual x2apic before.
>> 
> Why do you care? Just disable it regardless.
>>>> +
>>>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>>>> +		return;
>>>> +
>>> You need to test whether vid is enabled, not whether it can be
>>> enabled. And you need to test it at the beginning of the function. If
>>> vid is disabled we do not want to set
>>> SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE. Also force disable vid if
>>> !cpu_has_vmx_virtualize_x2apic_mode().
>> vid does not depend on virtualize x2apic.
> It does in our implementation. If vid is enabled and virtualize x2apic is
> not, our code will not work if the guest uses x2apic.
> 
>> The right logic should be:
>> When register virtualization is enabled, reads of all apic MSRs (except the
>> apic id reg and tmcct, which have the hook) do not cause a vmexit, and only
>> writes to TPR do not cause a vmexit.
>> When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
>> It's better to make the patch that enables virtualize x2apic mode the first patch.
>> 
> There is no point whatsoever in enabling virtualize x2apic without
> allowing at least one non-intercepted access to the x2apic MSR range, and
> this is what your patch does when run on a cpu without vid capability.
No, at least reading and writing TPR will not cause a vmexit if virtualize x2apic mode is enabled.
I am not sure whether I understood your comments in the previous discussion correctly. Here is my thinking:
1. Enable virtualize x2apic mode if the guest is really using x2apic, and clear TPR in the msr read and write bitmaps. This will benefit TPR reads.
2. If APIC register virtualization is enabled, clear all bits in the range 0x800-0x8ff (except the apic id reg and tmcct).
3. If vid is enabled, clear EOI and SELF IPI in the msr write bitmap.
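
The three steps above can be sketched roughly as follows. This is only a model, assuming two bool arrays in place of the real MSR bitmap pages and assuming that steps 2 and 3 apply only once virtualize x2apic mode is on. The struct, the macro names and x2apic_plan_intercepts() are hypothetical; the MSR numbers are the ones used in the patch.

```c
#include <stdbool.h>

#define X2APIC_MSR_BASE     0x800u
#define X2APIC_MSR_COUNT    0x100u
#define X2APIC_MSR_ID       0x802u
#define X2APIC_MSR_TPR      0x808u
#define X2APIC_MSR_EOI      0x80bu
#define X2APIC_MSR_TMCCT    0x839u
#define X2APIC_MSR_SELF_IPI 0x83fu

/* true means the access is intercepted (causes a vmexit). */
struct x2apic_intercepts {
	bool read[X2APIC_MSR_COUNT];
	bool write[X2APIC_MSR_COUNT];
};

static void x2apic_plan_intercepts(struct x2apic_intercepts *m,
				   bool virt_x2apic, bool reg_virt, bool vid)
{
	unsigned int i;

	/* Start with every x2apic MSR access intercepted. */
	for (i = 0; i < X2APIC_MSR_COUNT; i++) {
		m->read[i] = true;
		m->write[i] = true;
	}
	if (!virt_x2apic)
		return;

	/* Step 1: TPR reads and writes pass through. */
	m->read[X2APIC_MSR_TPR - X2APIC_MSR_BASE] = false;
	m->write[X2APIC_MSR_TPR - X2APIC_MSR_BASE] = false;

	/* Step 2: all reads pass through, except apic id and tmcct. */
	if (reg_virt) {
		for (i = 0; i < X2APIC_MSR_COUNT; i++)
			m->read[i] = false;
		m->read[X2APIC_MSR_ID - X2APIC_MSR_BASE] = true;
		m->read[X2APIC_MSR_TMCCT - X2APIC_MSR_BASE] = true;
	}

	/* Step 3: EOI and SELF IPI writes pass through. */
	if (vid) {
		m->write[X2APIC_MSR_EOI - X2APIC_MSR_BASE] = false;
		m->write[X2APIC_MSR_SELF_IPI - X2APIC_MSR_BASE] = false;
	}
}
```

In this model a cpu with virtualize x2apic mode but without vid still gets the TPR accesses without a vmexit, which is the case I am arguing for.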

One concern you mentioned is the case where vid is enabled and virtualize x2apic is disabled, but the guest still uses x2apic. In this case, we still use the MSR bitmap to intercept x2apic accesses. This means the vEOI update will never happen, but we can still benefit from the interrupt window.

>>>> +	for (msr = 0x800; msr <= 0x8ff; msr++)
>>>> +		vmx_intercept_for_msr_read(msr, false, false);
>>>> +
>>>> +	/* APIC ID */
>>>> +	vmx_intercept_for_msr_read(0x802, false, true);
>>>> +	/* TMCCT */
>>>> +	vmx_intercept_for_msr_read(0x839, false, true);
>>>> +	/* TPR */
>>>> +	vmx_intercept_for_msr_write(0x808, false, false);
>>>> +	/* EOI */
>>>> +	vmx_intercept_for_msr_write(0x80b, false, false);
>>>> +	/* SELF-IPI */
>>>> +	vmx_intercept_for_msr_write(0x83f, false, false);
>>>> +
>>>> +}
>>>> +
>>>> +static void vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u32 second_exec_control;
>>>> +	int msr;
>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +
>>>> +	/* If doesn't enable virtual x2apic before, do nothing*/
>>>> +	if (!vmx->virtual_x2apic_enabled)
>>>> +		return;
>>>> +
>>>> +	second_exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>>>> +	/* disalbe virtual x2apic*/
>>>> +	second_exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>> +	second_exec_control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
>>>> +	vmx->virtual_x2apic_enabled = false;
>>>> +
>>>> +	if (!cpu_has_vmx_virtual_intr_delivery())
>>>> +		return;
>>>> +
>>>> +	for (msr = 0x800; msr <= 0x8ff; msr++)
>>>> +		vmx_intercept_for_msr_read(msr, false, true);
>>>> +
>>>> +	/* TPR */
>>>> +	vmx_intercept_for_msr_write(0x808, false, true);
>>>> +	/* EOI */
>>>> +	vmx_intercept_for_msr_write(0x80b, false, true);
>>>> +	/* SELF-IPI */
>>>> +	vmx_intercept_for_msr_write(0x83f, false, true);
>>>> +}
>>>> +
>>>>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>>>>  {
>>>>  	u32 exit_intr_info;
>>>> @@ -7373,6 +7551,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>>>  	.enable_nmi_window = enable_nmi_window,
>>>>  	.enable_irq_window = enable_irq_window,
>>>>  	.update_cr8_intercept = update_cr8_intercept,
>>>> +	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
>>>> +	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
>>>> 
>>>>  	.set_tss_addr = vmx_set_tss_addr,
>>>>  	.get_tdp_level = get_ept_level,
>>>> --
>>>> 1.7.1
>>> 
>>> --
>>> 			Gleb.
>> 
>> 
>> Best regards,
>> Yang
>> 
> 
> --
> 			Gleb.


Best regards,
Yang



* RE: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-10  8:23   ` Gleb Natapov
@ 2013-01-10 12:04     ` Zhang, Yang Z
  0 siblings, 0 replies; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-10 12:04 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote:
>> From: Yang Zhang <yang.z.zhang@Intel.com>
>> 
>> Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
>> manually, which is fully taken care of by the hardware. This needs
>> some special awareness in the existing interrupt injection path:
>> 
>> - for pending interrupt, instead of direct injection, we may need
>>   update architecture specific indicators before resuming to guest.
>> - A pending interrupt, which is masked by ISR, should be also
>>   considered in above update action, since hardware will decide
>>   when to inject it at right time. Current has_interrupt and
>>   get_interrupt only returns a valid vector from injection p.o.v.
>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |    5 +
>>  arch/x86/include/asm/vmx.h      |   11 +++
>>  arch/x86/kvm/irq.c              |   56 +++++++++++-
>>  arch/x86/kvm/lapic.c            |   72 +++++++++------
>>  arch/x86/kvm/lapic.h            |   23 +++++
>>  arch/x86/kvm/svm.c              |   18 ++++
>>  arch/x86/kvm/vmx.c              |  191 +++++++++++++++++++++++++++++++++++++--
>>  arch/x86/kvm/x86.c              |   14 +++-
>>  include/linux/kvm_host.h        |    3 +
>>  virt/kvm/ioapic.c               |   18 ++++
>>  virt/kvm/ioapic.h               |    4 +
>>  virt/kvm/irq_comm.c             |   22 +++++
>>  virt/kvm/kvm_main.c             |    5 +
>>  13 files changed, 399 insertions(+), 43 deletions(-)
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 572a562..f471856 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -697,6 +697,10 @@ struct kvm_x86_ops {
>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>> +	int (*has_virtual_interrupt_delivery)(struct kvm_vcpu *vcpu);
>> +	void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
>> +	void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu);
>> +	void (*set_svi)(int isr);
>>  	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>  	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>> @@ -993,6 +997,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
>>  int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
>>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>>  int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
>> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
>>  int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
>>  int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
>>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 0a54df0..694586c 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -62,6 +62,7 @@
>>  #define EXIT_REASON_MCE_DURING_VMENTRY  41
>>  #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
>>  #define EXIT_REASON_APIC_ACCESS         44
>> +#define EXIT_REASON_EOI_INDUCED         45
>>  #define EXIT_REASON_EPT_VIOLATION       48
>>  #define EXIT_REASON_EPT_MISCONFIG       49
>>  #define EXIT_REASON_WBINVD              54
>> @@ -144,6 +145,7 @@
>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>>  #define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
>> +#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200
>>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
>>  #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>> @@ -181,6 +183,7 @@ enum vmcs_field {
>>  	GUEST_GS_SELECTOR               = 0x0000080a,
>>  	GUEST_LDTR_SELECTOR             = 0x0000080c,
>>  	GUEST_TR_SELECTOR               = 0x0000080e,
>> +	GUEST_INTR_STATUS               = 0x00000810,
>>  	HOST_ES_SELECTOR                = 0x00000c00,
>>  	HOST_CS_SELECTOR                = 0x00000c02,
>>  	HOST_SS_SELECTOR                = 0x00000c04,
>> @@ -208,6 +211,14 @@ enum vmcs_field {
>>  	APIC_ACCESS_ADDR_HIGH		= 0x00002015,
>>  	EPT_POINTER                     = 0x0000201a,
>>  	EPT_POINTER_HIGH                = 0x0000201b,
>> +	EOI_EXIT_BITMAP0                = 0x0000201c,
>> +	EOI_EXIT_BITMAP0_HIGH           = 0x0000201d,
>> +	EOI_EXIT_BITMAP1                = 0x0000201e,
>> +	EOI_EXIT_BITMAP1_HIGH           = 0x0000201f,
>> +	EOI_EXIT_BITMAP2                = 0x00002020,
>> +	EOI_EXIT_BITMAP2_HIGH           = 0x00002021,
>> +	EOI_EXIT_BITMAP3                = 0x00002022,
>> +	EOI_EXIT_BITMAP3_HIGH           = 0x00002023,
>>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>>  	VMCS_LINK_POINTER               = 0x00002800,
>> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
>> index b111aee..e113440 100644
>> --- a/arch/x86/kvm/irq.c
>> +++ b/arch/x86/kvm/irq.c
>> @@ -38,6 +38,38 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
>>  EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
>>  
>>  /*
>> + * check if there is pending interrupt from
>> + * non-APIC source without intack.
>> + */
>> +static int kvm_cpu_has_extint(struct kvm_vcpu *v)
>> +{
>> +	if (kvm_apic_accept_pic_intr(v))
>> +		return pic_irqchip(v->kvm)->output;	/* PIC */
>> +	else
>> +		return 0;
>> +}
>> +
>> +/*
>> + * check if there is injectable interrupt:
>> + * when virtual interrupt delivery enabled,
>> + * interrupt from apic will handled by hardware,
>> + * we don't need to check it here.
>> + */
>> +int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
>> +{
>> +	if (!irqchip_in_kernel(v->kvm))
>> +		return v->arch.interrupt.pending;
>> +
>> +	if (kvm_cpu_has_extint(v))
>> +		return 1;
>> +
>> +	if (kvm_apic_vid_enabled(v))
>> +		return 0;
>> +
>> +	return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
>> +}
>> +
>> +/*
>>   * check if there is pending interrupt without
>>   * intack.
>>   */
>> @@ -46,27 +78,41 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
>>  	if (!irqchip_in_kernel(v->kvm))
>>  		return v->arch.interrupt.pending;
>> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
>> -		return pic_irqchip(v->kvm)->output;	/* PIC */
>> +	if (kvm_cpu_has_extint(v))
>> +		return 1;
>> 
>>  	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
>>  
>>  /*
>> + * Read pending interrupt(from non-APIC source)
>> + * vector and intack.
>> + */
>> +static int kvm_cpu_get_extint(struct kvm_vcpu *v)
>> +{
>> +	if (kvm_cpu_has_extint(v))
>> +		return kvm_pic_read_irq(v->kvm); /* PIC */
>> +	return -1;
>> +}
>> +
>> +/*
>>   * Read pending interrupt vector and intack.
>>   */
>>  int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
>>  {
>> +	int vector;
>> +
>>  	if (!irqchip_in_kernel(v->kvm))
>>  		return v->arch.interrupt.nr;
>> -	if (kvm_apic_accept_pic_intr(v) && pic_irqchip(v->kvm)->output)
>> -		return kvm_pic_read_irq(v->kvm);	/* PIC */
>> +	vector = kvm_cpu_get_extint(v);
>> +
>> +	if (kvm_apic_vid_enabled(v) || vector != -1)
>> +		return vector;			/* PIC */
>> 
>>  	return kvm_get_apic_interrupt(v);	/* APIC */
>>  }
>> -EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
>> 
>>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
>>  {
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index ec38906..d219f41 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -150,23 +150,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
>>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>>  }
>> -static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
>> -{
>> -	u16 cid;
>> -	ldr >>= 32 - map->ldr_bits;
>> -	cid = (ldr >> map->cid_shift) & map->cid_mask;
>> -
>> -	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
>> -
>> -	return cid;
>> -}
>> -
>> -static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
>> -{
>> -	ldr >>= (32 - map->ldr_bits);
>> -	return ldr & map->lid_mask;
>> -}
>> -
>>  static void recalculate_apic_map(struct kvm *kvm)
>>  {
>>  	struct kvm_apic_map *new, *old = NULL;
>> @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
>>  {
>>  	apic_set_reg(apic, APIC_ID, id << 24);
>>  	recalculate_apic_map(apic->vcpu->kvm);
>> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>>  }
>>  
>>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>>  {
>>  	apic_set_reg(apic, APIC_LDR, id);
>>  	recalculate_apic_map(apic->vcpu->kvm);
>> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>>  }
>>  
>>  static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
>> @@ -345,6 +330,8 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
>>  {
>>  	int result;
>> +	/* Note that irr_pending is just a hint. It will be always
>> +	 * true with virtual interrupt delivery enabled. */
>>  	if (!apic->irr_pending)
>>  		return -1;
>> @@ -461,6 +448,8 @@ static void pv_eoi_clr_pending(struct kvm_vcpu *vcpu)
>>  static inline int apic_find_highest_isr(struct kvm_lapic *apic)
>>  {
>>  	int result;
>> +
>> +	/* Note that isr_count is always 1 with vid enabled*/
>>  	if (!apic->isr_count)
>>  		return -1;
>>  	if (likely(apic->highest_isr_cache != -1))
>> @@ -740,6 +729,19 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
>>  	return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
>>  }
>> +static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
>> +{
>> +	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
>> +	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
>> +		int trigger_mode;
>> +		if (apic_test_vector(vector, apic->regs + APIC_TMR))
>> +			trigger_mode = IOAPIC_LEVEL_TRIG;
>> +		else
>> +			trigger_mode = IOAPIC_EDGE_TRIG;
>> +		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
>> +	}
>> +}
>> +
>>  static int apic_set_eoi(struct kvm_lapic *apic)
>>  {
>>  	int vector = apic_find_highest_isr(apic);
>> @@ -756,19 +758,26 @@ static int apic_set_eoi(struct kvm_lapic *apic)
>>  	apic_clear_isr(vector, apic);
>>  	apic_update_ppr(apic);
>> -	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
>> -	    kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
>> -		int trigger_mode;
>> -		if (apic_test_vector(vector, apic->regs + APIC_TMR))
>> -			trigger_mode = IOAPIC_LEVEL_TRIG;
>> -		else
>> -			trigger_mode = IOAPIC_EDGE_TRIG;
>> -		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
>> -	}
>> +	kvm_ioapic_send_eoi(apic, vector);
>>  	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
>>  	return vector;
>>  }
>> +/*
>> + * this interface assumes a trap-like exit, which has already finished
>> + * desired side effect including vISR and vPPR update.
>> + */
>> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
>> +{
>> +	struct kvm_lapic *apic = vcpu->arch.apic;
>> +
>> +	trace_kvm_eoi(apic, vector);
>> +
>> +	kvm_ioapic_send_eoi(apic, vector);
>> +	kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>> +
>>  static void apic_send_ipi(struct kvm_lapic *apic)
>>  {
>>  	u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
>> @@ -1071,6 +1080,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32
> reg, u32 val)
>>  		if (!apic_x2apic_mode(apic)) {
>>  			apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
>>  			recalculate_apic_map(apic->vcpu->kvm);
>> +			ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>>  		} else
>>  			ret = 1;
>>  		break;
>> @@ -1318,6 +1328,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64
> value)
>>  		else
>>  			static_key_slow_inc(&apic_hw_disabled.key);
>>  		recalculate_apic_map(vcpu->kvm);
>> +		ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>>  	}
>>  
>>  	if (!kvm_vcpu_is_bsp(apic->vcpu))
>> @@ -1377,8 +1388,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
>>  		apic_set_reg(apic, APIC_ISR + 0x10 * i, 0);
>>  		apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
>>  	}
>> -	apic->irr_pending = false;
>> -	apic->isr_count = 0;
>> +	apic->irr_pending = kvm_apic_vid_enabled(vcpu);
>> +	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
>> +				1 : 0;
> Why not just "apic->isr_count = kvm_apic_vid_enabled(vcpu)"?
Ok. 

>>  	apic->highest_isr_cache = -1;
>>  	update_divide_count(apic);
>>  	atomic_set(&apic->lapic_timer.pending, 0);
>> @@ -1593,8 +1605,10 @@ void kvm_apic_post_state_restore(struct kvm_vcpu
> *vcpu,
>>  	update_divide_count(apic);
>>  	start_apic_timer(apic);
>>  	apic->irr_pending = true;
>> -	apic->isr_count = count_vectors(apic->regs + APIC_ISR);
>> +	apic->isr_count = kvm_apic_vid_enabled(vcpu) ?
>> +				1 : count_vectors(apic->regs + APIC_ISR);
>>  	apic->highest_isr_cache = -1;
>> +	kvm_x86_ops->set_svi(apic_find_highest_isr(apic));
>>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>>  }
>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>> index 9a8ee22..fed6538 100644
>> --- a/arch/x86/kvm/lapic.h
>> +++ b/arch/x86/kvm/lapic.h
>> @@ -65,6 +65,7 @@ u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu
> *vcpu);
>>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>>  
>>  void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
>> +void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
>> 
>>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
>> @@ -126,4 +127,26 @@ static inline int kvm_lapic_enabled(struct kvm_vcpu
> *vcpu)
>>  	return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
>>  }
>> +static inline bool kvm_apic_vid_enabled(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_x86_ops->has_virtual_interrupt_delivery(vcpu);
>> +}
>> +
>> +static inline u16 apic_cluster_id(struct kvm_apic_map *map, u32 ldr)
>> +{
>> +	u16 cid;
>> +	ldr >>= 32 - map->ldr_bits;
>> +	cid = (ldr >> map->cid_shift) & map->cid_mask;
>> +
>> +	BUG_ON(cid >= ARRAY_SIZE(map->logical_map));
>> +
>> +	return cid;
>> +}
>> +
>> +static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
>> +{
>> +	ldr >>= (32 - map->ldr_bits);
>> +	return ldr & map->lid_mask;
>> +}
>> +
>>  #endif
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 0b82cb1..0ce6543 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -3576,6 +3576,21 @@ static void svm_enable_virtual_x2apic_mode(struct
> kvm_vcpu *vcpu)
>>  	return;
>>  }
>> +static int svm_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void svm_update_eoi_exitmap(struct kvm_vcpu *vcpu)
>> +{
>> +	return;
>> +}
>> +
>> +static void svm_set_svi(int isr)
>> +{
>> +	return;
>> +}
>> +
>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -4296,6 +4311,9 @@ static struct kvm_x86_ops svm_x86_ops = {
>>  	.enable_irq_window = enable_irq_window,
>>  	.update_cr8_intercept = update_cr8_intercept,
>>  	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
>> +	.has_virtual_interrupt_delivery = svm_has_virtual_interrupt_delivery,
>> +	.update_eoi_exitmap = svm_update_eoi_exitmap,
>> +	.set_svi = svm_set_svi,
>> 
>>  	.set_tss_addr = svm_set_tss_addr,
>>  	.get_tdp_level = get_npt_level,
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index b203ce7..990409a 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -434,6 +434,7 @@ struct vcpu_vmx {
>>  	bool rdtscp_enabled;
>>  
>>  	bool virtual_x2apic_enabled;
>> +	unsigned long eoi_exit_bitmap[4];
>> 
>>  	/* Support for a guest hypervisor (nested VMX) */
>>  	struct nested_vmx nested;
>> @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
>> 
>>  static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>>  {
>> -	return false;
>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
>> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>  }
>>  
>>  static inline bool cpu_has_vmx_flexpriority(void)
>> @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
>>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>>  			SECONDARY_EXEC_RDTSCP |
>>  			SECONDARY_EXEC_ENABLE_INVPCID |
>> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>>  		if (adjust_vmx_controls(min2, opt2,
>>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>>  					&_cpu_based_2nd_exec_control) < 0)
>> @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct
>> vmcs_config *vmcs_conf)
>> 
>>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>>  		_cpu_based_2nd_exec_control &= ~(
>> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
>> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>> 
>>  	if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
>>  		/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
>> @@ -2778,9 +2782,15 @@ static __init int hardware_setup(void)
>>  	if (!cpu_has_vmx_ple())
>>  		ple_gap = 0;
>> -	if (!cpu_has_vmx_apic_register_virt())
>> +	if (!cpu_has_vmx_apic_register_virt() ||
>> +				!cpu_has_vmx_virtual_intr_delivery())
>>  		enable_apicv_reg_vid = 0;
>> +	if (enable_apicv_reg_vid)
>> +		kvm_x86_ops->update_cr8_intercept = NULL;
>> +	else
>> +		kvm_x86_ops->update_apic_irq = NULL;
>> +
>>  	if (nested)
>>  		nested_vmx_setup_ctls_msrs();
>> @@ -3961,7 +3971,8 @@ static u32 vmx_secondary_exec_control(struct
> vcpu_vmx *vmx)
>>  	if (!ple_gap)
>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>  	if (!enable_apicv_reg_vid)
>> -		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>> +		exec_control &= ~(SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
>>  	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>  	return exec_control;
>>  }
>> @@ -4007,6 +4018,15 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>>  				vmx_secondary_exec_control(vmx));
>>  	}
>> +	if (enable_apicv_reg_vid) {
>> +		vmcs_write64(EOI_EXIT_BITMAP0, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP1, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP2, 0);
>> +		vmcs_write64(EOI_EXIT_BITMAP3, 0);
>> +
>> +		vmcs_write16(GUEST_INTR_STATUS, 0);
>> +	}
>> +
>>  	if (ple_gap) {
>>  		vmcs_write32(PLE_GAP, ple_gap);
>>  		vmcs_write32(PLE_WINDOW, ple_window);
>> @@ -4924,6 +4944,16 @@ static int handle_apic_access(struct kvm_vcpu
> *vcpu)
>>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>>  }
>> +static int handle_apic_eoi_induced(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> +	int vector = exit_qualification & 0xff;
>> +
>> +	/* EOI-induced VM exit is trap-like and thus no need to adjust IP */
>> +	kvm_apic_set_eoi_accelerated(vcpu, vector);
>> +	return 1;
>> +}
>> +
>>  static int handle_apic_write(struct kvm_vcpu *vcpu)
>>  {
>>  	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> @@ -5869,6 +5899,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct
> kvm_vcpu *vcpu) = {
>>  	[EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
>>  	[EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
>>  	[EXIT_REASON_APIC_WRITE]              = handle_apic_write,
>> +	[EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
>>  	[EXIT_REASON_WBINVD]                  = handle_wbinvd,
>>  	[EXIT_REASON_XSETBV]                  = handle_xsetbv,
>>  	[EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
>> @@ -6238,7 +6269,7 @@ static void vmx_enable_virtual_x2apic_mode(struct
> kvm_vcpu *vcpu)
>>  	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>>  	vmx->virtual_x2apic_enabled = true;
>> -	if (!cpu_has_vmx_virtual_intr_delivery())
>> +	if (!enable_apicv_reg_vid)
>>  		return;
>>  
>>  	for (msr = 0x800; msr <= 0x8ff; msr++)
>> @@ -6274,7 +6305,7 @@ static void vmx_disable_virtual_x2apic_mode(struct
> kvm_vcpu *vcpu)
>>  	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, second_exec_control);
>>  	vmx->virtual_x2apic_enabled = false;
>> -	if (!cpu_has_vmx_virtual_intr_delivery())
>> +	if (!enable_apicv_reg_vid)
>>  		return;
>>  
>>  	for (msr = 0x800; msr <= 0x8ff; msr++)
>> @@ -6288,6 +6319,148 @@ static void
> vmx_disable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>  	vmx_intercept_for_msr_write(0x83f, false, true);
>>  }
>> +static int vmx_has_virtual_interrupt_delivery(struct kvm_vcpu *vcpu)
>> +{
>> +	return enable_apicv_reg_vid;
>> +}
> Why does it need a vcpu parameter if it does not use it? It will get you
> in trouble later.
Right. The vcpu parameter was used in an older version of the patch; in this version it is unnecessary.
 
>> +
>> +static void vmx_set_svi(int isr)
>> +{
>> +	u16 status;
>> +	u8 old;
>> +
>> +	if (!enable_apicv_reg_vid)
>> +		return;
>> +
>> +	if (isr == -1)
>> +		isr = 0;
>> +
>> +	status = vmcs_read16(GUEST_INTR_STATUS);
>> +	old = status >> 8;
>> +	if (isr != old) {
>> +		status &= 0xff;
>> +		status |= isr << 8;
>> +		vmcs_write16(GUEST_INTR_STATUS, status);
>> +	}
>> +}
>> +
>> +static void vmx_set_rvi(int vector)
>> +{
>> +	u16 status;
>> +	u8 old;
>> +
>> +	status = vmcs_read16(GUEST_INTR_STATUS);
>> +	old = (u8)status & 0xff;
>> +	if ((u8)vector != old) {
>> +		status &= ~0xff;
>> +		status |= (u8)vector;
>> +		vmcs_write16(GUEST_INTR_STATUS, status);
>> +	}
>> +}
>> +
>> +static void vmx_update_apic_irq(struct kvm_vcpu *vcpu, int max_irr)
>> +{
>> +	if (max_irr == -1)
>> +		return;
>> +
>> +	vmx_set_rvi(max_irr);
>> +}
>> +
>> +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
>> +				u32 vector)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	if (WARN_ONCE((vector > 255),
>> +		"KVM VMX: vector (%d) out of range\n", vector))
>> +		return;
>> +
>> +	__set_bit(vector, vmx->eoi_exit_bitmap);
>> +}
>> +
>> +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
>> +{
>> +	struct kvm_lapic **dst;
>> +	struct kvm_apic_map *map;
>> +	unsigned long bitmap = 1;
>> +	int i;
>> +
>> +	rcu_read_lock();
>> +	map = rcu_dereference(vcpu->kvm->arch.apic_map);
>> +
>> +	if (unlikely(!map)) {
>> +		set_eoi_exitmap_one(vcpu, irq->vector);
>> +		goto out;
>> +	}
>> +
>> +	if (irq->dest_mode == 0) { /* physical mode */
>> +		if (irq->delivery_mode == APIC_DM_LOWEST ||
>> +				irq->dest_id == 0xff) {
>> +			set_eoi_exitmap_one(vcpu, irq->vector);
>> +			goto out;
>> +		}
>> +		dst = &map->phys_map[irq->dest_id & 0xff];
>> +	} else {
>> +		u32 mda = irq->dest_id << (32 - map->ldr_bits);
>> +
>> +		dst = map->logical_map[apic_cluster_id(map, mda)];
>> +
>> +		bitmap = apic_logical_id(map, mda);
>> +	}
>> +
>> +	for_each_set_bit(i, &bitmap, 16) {
>> +		if (!dst[i])
>> +			continue;
>> +		if (dst[i]->vcpu == vcpu) {
>> +			set_eoi_exitmap_one(vcpu, irq->vector);
>> +			break;
>> +		}
>> +	}
>> +
>> +out:
>> +	rcu_read_unlock();
>> +}
>> +
>> +static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
>> +	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
>> +	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
>> +	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
>> +}
>> +
>> +static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
>> +	union kvm_ioapic_redirect_entry *e;
>> +	struct kvm_lapic_irq irqe;
>> +	int index;
>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +
>> +	/* clear eoi exit bitmap */
>> +	memset(vmx->eoi_exit_bitmap, 0, 32);
>> +
>> +	/* traverse ioapic entry to set eoi exit bitmap*/
>> +	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
>> +		e = &ioapic->redirtbl[index];
>> +		if (!e->fields.mask &&
>> +			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
>> +			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
>> +				 index))) {
>> +			irqe.dest_id = e->fields.dest_id;
>> +			irqe.vector = e->fields.vector;
>> +			irqe.dest_mode = e->fields.dest_mode;
>> +			irqe.delivery_mode = e->fields.delivery_mode << 8;
>> +			vmx_check_ioapic_entry(vcpu, &irqe);
>> +
>> +		}
>> +	}
> This logic should sit in ioapic.c and you cannot access ioapic without
> holding ioapic lock.
You are right. I have another version of the patch that uses eoimap_lock to protect access to the EOI exit bitmap. As you mention below, I mistakenly used eoimap_lock where ioapic->lock is needed.
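
To make the locking point concrete, here is a minimal user-space sketch of walking the redirection table under the ioapic lock and building the 256-bit EOI exit bitmap. It is not code from this patch: every sketch_* name is hypothetical, the lock is modelled as a plain depth counter, and the destination check done by vmx_check_ioapic_entry() is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_PINS   24	/* hypothetical pin count */
#define SKETCH_LEVEL_TRIG 1

/* Hypothetical, simplified redirection-table entry. */
struct sketch_redir_entry {
	uint8_t vector;
	bool masked;
	uint8_t trig_mode;	/* 1 = level, 0 = edge */
	bool has_ack_notifier;
};

/* Hypothetical stand-in for the ioapic; lock_depth models ioapic->lock. */
struct sketch_ioapic {
	int lock_depth;
	struct sketch_redir_entry redirtbl[SKETCH_NUM_PINS];
};

static void sketch_lock(struct sketch_ioapic *io)   { io->lock_depth++; }
static void sketch_unlock(struct sketch_ioapic *io) { io->lock_depth--; }

/*
 * Build the EOI exit bitmap (one bit per vector, four 64-bit words).
 * A vector is marked when its entry is unmasked and is either
 * level-triggered or has an ack notifier registered.
 */
static void sketch_calc_eoi_exitmap(struct sketch_ioapic *io, uint64_t bitmap[4])
{
	int pin;

	memset(bitmap, 0, 4 * sizeof(bitmap[0]));

	sketch_lock(io);
	for (pin = 0; pin < SKETCH_NUM_PINS; pin++) {
		const struct sketch_redir_entry *e = &io->redirtbl[pin];

		/* The table is only read while the lock is held. */
		assert(io->lock_depth > 0);

		if (e->masked)
			continue;
		if (e->trig_mode == SKETCH_LEVEL_TRIG || e->has_ack_notifier)
			bitmap[e->vector / 64] |= UINT64_C(1) << (e->vector % 64);
	}
	sketch_unlock(io);
}
```

Keeping the table walk and the lock in one helper is what lets the caller in the vcpu entry path stay free of ioapic internals.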
 
>> +
>> +	vmx_load_eoi_exitmap(vcpu);
>> +}
>> +
>>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>>  {
>>  	u32 exit_intr_info;
>> @@ -7553,6 +7726,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>  	.update_cr8_intercept = update_cr8_intercept,
>>  	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
>>  	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
>> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
>> +	.update_apic_irq = vmx_update_apic_irq,
>> +	.update_eoi_exitmap = vmx_update_eoi_exitmap,
>> +	.set_svi = vmx_set_svi,
>> 
>>  	.set_tss_addr = vmx_set_tss_addr,
>>  	.get_tdp_level = get_ept_level,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1c9c834..e6d8227 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu
> *vcpu)
>>  			vcpu->arch.nmi_injected = true;
>>  			kvm_x86_ops->set_nmi(vcpu);
>>  		}
>> -	} else if (kvm_cpu_has_interrupt(vcpu)) {
>> +	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
>>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
>>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
>>  					    false);
>> @@ -5648,6 +5648,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>  			kvm_handle_pmu_event(vcpu);
>>  		if (kvm_check_request(KVM_REQ_PMI, vcpu))
>>  			kvm_deliver_pmi(vcpu);
>> +		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu)) {
>> +			mutex_lock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> You need to hold the ioapic lock, not the useless eoimap_lock that protects
> nothing. And no, do not take it here. Call a function in ioapic.c.
Yes. eoimap_lock is useless in this patch.
 
>> +			kvm_x86_ops->update_eoi_exitmap(vcpu);
>> +			mutex_unlock(&vcpu->kvm->arch.vioapic->eoimap_lock);
>> +		}
>>  	}
>>  
>>  	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>> @@ -5656,10 +5661,15 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
>>  		/* enable NMI/IRQ window open exits if needed */
>>  		if (vcpu->arch.nmi_pending)
>>  			kvm_x86_ops->enable_nmi_window(vcpu);
>> -		else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>> +		else if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
>>  			kvm_x86_ops->enable_irq_window(vcpu);
>>  
>>  		if (kvm_lapic_enabled(vcpu)) {
>> +			/* update architecture specific hints for APIC
>> +			 * virtual interrupt delivery */
>> +			if (kvm_x86_ops->update_apic_irq)
>> +				kvm_x86_ops->update_apic_irq(vcpu,
>> +					      kvm_lapic_find_highest_irr(vcpu));
>>  			update_cr8_intercept(vcpu);
>>  			kvm_lapic_sync_to_vapic(vcpu);
>>  		}
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index cbe0d68..bc0e261 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -122,6 +122,7 @@ static inline bool is_error_page(struct page *page)
>>  #define KVM_REQ_WATCHDOG          18
>>  #define KVM_REQ_MASTERCLOCK_UPDATE 19
>>  #define KVM_REQ_MCLOCK_INPROGRESS 20
>> +#define KVM_REQ_EOIBITMAP         21
>> 
>>  #define KVM_USERSPACE_IRQ_SOURCE_ID		0
>>  #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
>> @@ -537,6 +538,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
>>  void kvm_flush_remote_tlbs(struct kvm *kvm);
>>  void kvm_reload_remote_mmus(struct kvm *kvm);
>>  void kvm_make_mclock_inprogress_request(struct kvm *kvm);
>> +void kvm_make_update_eoibitmap_request(struct kvm *kvm);
>> 
>>  long kvm_arch_dev_ioctl(struct file *filp,
>>  			unsigned int ioctl, unsigned long arg);
>> @@ -690,6 +692,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32
> irq, int level);
>>  int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level);
>>  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm *kvm,
>>  		int irq_source_id, int level);
>> +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
>>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
>>  void kvm_register_irq_ack_notifier(struct kvm *kvm,
>>  				   struct kvm_irq_ack_notifier *kian);
>> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
>> index f3abbef..e5ccb8f 100644
>> --- a/virt/kvm/ioapic.c
>> +++ b/virt/kvm/ioapic.c
>> @@ -115,6 +115,20 @@ static void update_handled_vectors(struct kvm_ioapic
> *ioapic)
>>  	smp_wmb();
>>  }
>> +void ioapic_update_eoi_exitmap(struct kvm *kvm)
>> +{
>> +#ifdef CONFIG_X86
> Define kvm_apic_vid_enabled() in IA64 instead.
sure.
 
>> +	struct kvm_vcpu *vcpu = kvm->vcpus[0];
>> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
>> +
>> +	/* If vid is enabled in one of vcpus, then other
>> +	 * vcpus also enabled it. */
> Vid state is global for all VM instances.  kvm_apic_vid_enabled() should
> not get vcpu as a parameter.
Agree.
 
>> +	if (!kvm_apic_vid_enabled(vcpu) || !ioapic)
>> +		return;
>> +	kvm_make_update_eoibitmap_request(kvm);
>> +#endif
>> +}
>> +
>>  static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>>  {
>>  	unsigned index;
>> @@ -156,6 +170,7 @@ static void ioapic_write_indirect(struct kvm_ioapic
> *ioapic, u32 val)
>>  		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
>>  		    && ioapic->irr & (1 << index))
>>  			ioapic_service(ioapic, index);
>> +		ioapic_update_eoi_exitmap(ioapic->kvm);
>>  		break;
>>  	}
>>  }
>> @@ -415,6 +430,9 @@ int kvm_ioapic_init(struct kvm *kvm)
>>  	ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
>>  				      IOAPIC_MEM_LENGTH, &ioapic->dev);
>>  	mutex_unlock(&kvm->slots_lock);
>> +#ifdef CONFIG_X86
>> +	mutex_init(&ioapic->eoimap_lock);
>> +#endif
>>  	if (ret < 0) {
>>  		kvm->arch.vioapic = NULL;
>>  		kfree(ioapic);
>> diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
>> index a30abfe..34544ce 100644
>> --- a/virt/kvm/ioapic.h
>> +++ b/virt/kvm/ioapic.h
>> @@ -47,6 +47,9 @@ struct kvm_ioapic {
>>  	void (*ack_notifier)(void *opaque, int irq);
>>  	spinlock_t lock;
>>  	DECLARE_BITMAP(handled_vectors, 256);
>> +#ifdef CONFIG_X86
>> +	struct mutex eoimap_lock;
>> +#endif
> This lock protects nothing. Drop it.
Right. This is a misuse in this patch.
 
>>  };
>>  
>>  #ifdef DEBUG
>> @@ -82,5 +85,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct
> kvm_lapic *src,
>>  		struct kvm_lapic_irq *irq);
>>  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>>  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>> +void ioapic_update_eoi_exitmap(struct kvm *kvm);
>> 
>>  #endif
>> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
>> index 656fa45..64aa1ab 100644
>> --- a/virt/kvm/irq_comm.c
>> +++ b/virt/kvm/irq_comm.c
>> @@ -22,6 +22,7 @@
>> 
>>  #include <linux/kvm_host.h>
>>  #include <linux/slab.h>
>> +#include <linux/export.h>
>>  #include <trace/events/kvm.h>
>>  
>>  #include <asm/msidef.h>
>> @@ -237,6 +238,25 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int
> irq_source_id, u32 irq, int level)
>>  	return ret;
>>  }
>> +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
>> +{
>> +	struct kvm_irq_ack_notifier *kian;
>> +	struct hlist_node *n;
>> +	int gsi;
>> +
>> +	rcu_read_lock();
>> +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
>> +	if (gsi != -1)
>> +		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
>> +					 link)
>> +			if (kian->gsi == gsi)
>> +				return true;
>> +	rcu_read_unlock();
>> +
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
>> +
>>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>>  {
>>  	struct kvm_irq_ack_notifier *kian;
>> @@ -261,6 +281,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
>>  	mutex_lock(&kvm->irq_lock);
>>  	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
>>  	mutex_unlock(&kvm->irq_lock);
>> +	ioapic_update_eoi_exitmap(kvm);
>>  }
>>  
>>  void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>> @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>>  	hlist_del_init_rcu(&kian->link);
>>  	mutex_unlock(&kvm->irq_lock);
>>  	synchronize_rcu();
>> +	ioapic_update_eoi_exitmap(kvm);
>>  }
>>  
>>  int kvm_request_irq_source_id(struct kvm *kvm)
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index e45c20c..cc465c6 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -217,6 +217,11 @@ void kvm_make_mclock_inprogress_request(struct
> kvm *kvm)
>>  	make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
>>  }
>> +void kvm_make_update_eoibitmap_request(struct kvm *kvm)
>> +{
>> +	make_all_cpus_request(kvm, KVM_REQ_EOIBITMAP);
>> +}
>> +
>>  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>>  {
>>  	struct page *page;
>> --
>> 1.7.1
> 
> --
> 			Gleb.


Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10 11:54         ` Zhang, Yang Z
@ 2013-01-10 12:16           ` Gleb Natapov
  2013-01-10 12:22             ` Zhang, Yang Z
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10 12:16 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
> >> The right logic should be:
> >> When register virtualization enabled, read all apic msr(except apic id reg and
> > tmcct which have the hook) not cause vmexit and only write TPR not cause
> > vmexit.
> >> When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
> >> It's better to put the patch of envirtualize x2apic mode as first patch.
> >> 
> > There is no point whatsoever to enable virtualize x2apic without
> > allowing at least one non intercepted access to x2apic MSR range and
> > this is what your patch is doing when run on cpu without vid capability.
> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is enabled.
For that you need to disable 808H MSR intercept, which your patch does not do.

> I am not sure whether I understood your comments in the previous discussion correctly; here is my thinking:
> 1. Enable virtualize x2apic mode if the guest is really using x2apic, and clear TPR in the MSR read and write bitmaps. This will benefit reading TPR.
> 2. If APIC register virtualization is enabled, clear all bits in the range 0x800-0x8ff (except the APIC ID register and TMCCT).
Clear all read bits in the range.

> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
> 
Looks OK.

> One concern you mentioned is the case where vid is enabled and virtualize x2apic is disabled, but the guest still uses x2apic. In this case, we still use the MSR bitmap to intercept x2apic accesses. This means the vEOI update will never happen, but we can still benefit from the interrupt window.
> 
What interrupt windows? Without virtualized x2apic TPR/EOI
virtualization will not happen and we rely on that in the code.
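
The three rules agreed on above can be written down as a small intercept policy. This is only a sketch for illustration, not code from the patch: the sketch_* names and the capability struct are hypothetical, and the MSR numbers follow the usual x2APIC mapping of 0x800 plus the xAPIC register offset shifted right by four.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MSR_BASE      0x800u
#define SKETCH_MSR_LAST      0x8ffu
#define SKETCH_MSR_APIC_ID   0x802u
#define SKETCH_MSR_TPR       0x808u
#define SKETCH_MSR_EOI       0x80bu
#define SKETCH_MSR_TMCCT     0x839u
#define SKETCH_MSR_SELF_IPI  0x83fu

/* Hypothetical capability flags, treated as independent for illustration. */
struct sketch_caps {
	bool virt_x2apic;	/* virtualize x2apic mode in use */
	bool reg_virt;		/* APIC register virtualization */
	bool vid;		/* virtual interrupt delivery */
};

/* Returns true when a RDMSR of @msr should still cause a VM exit. */
static bool sketch_intercept_read(const struct sketch_caps *c, uint32_t msr)
{
	assert(msr >= SKETCH_MSR_BASE && msr <= SKETCH_MSR_LAST);

	if (!c->virt_x2apic)
		return true;
	if (msr == SKETCH_MSR_TPR)		/* rule 1 */
		return false;
	if (!c->reg_virt)
		return true;
	/* rule 2: all reads pass through except APIC ID and TMCCT */
	return msr == SKETCH_MSR_APIC_ID || msr == SKETCH_MSR_TMCCT;
}

/* Returns true when a WRMSR to @msr should still cause a VM exit. */
static bool sketch_intercept_write(const struct sketch_caps *c, uint32_t msr)
{
	assert(msr >= SKETCH_MSR_BASE && msr <= SKETCH_MSR_LAST);

	if (!c->virt_x2apic)
		return true;
	if (msr == SKETCH_MSR_TPR)		/* rule 1 */
		return false;
	if (!c->vid)
		return true;
	/* rule 3: only EOI and SELF IPI writes pass through */
	return !(msr == SKETCH_MSR_EOI || msr == SKETCH_MSR_SELF_IPI);
}
```

Note that with virt_x2apic clear, everything stays intercepted regardless of the other two flags.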

--
			Gleb.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10 12:16           ` Gleb Natapov
@ 2013-01-10 12:22             ` Zhang, Yang Z
  2013-01-10 12:34               ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-10 12:22 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
>>>> The right logic should be:
>>>> When register virtualization enabled, read all apic msr(except apic id reg and
>>> tmcct which have the hook) not cause vmexit and only write TPR not cause
>>> vmexit.
>>>> When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
>>>> It's better to put the patch of envirtualize x2apic mode as first patch.
>>>> 
>>> There is no point whatsoever to enable virtualize x2apic without
>>> allowing at least one non intercepted access to x2apic MSR range and
>>> this is what your patch is doing when run on cpu without vid capability.
>> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is
> enabled.
> For that you need to disable 808H MSR intercept, which your patch does not do.
I want to do this in the next patch.
 
>> I am not sure whether I understand your comments right in previous
>> discussion, here is my thinking:
>> 1. enable virtualize x2apic mode if guest is really using x2apic and clear
>> TPR in msr read and write bitmap. This will benefit reading TPR.
>> 2. If APIC registers virtualization is enabled, clear all bits in range
>> 0x800-0x8ff (except apic id reg and tmcct).
> Clear all read bits in the range.
> 
>> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
>> 
> Looks OK.
> 
>> One concern you mentioned is that vid enabled and virtualize x2apic is disabled
>> but guest still use x2apic. In this case, we still use MSR bitmap to intercept x2apic.
>> This means the vEOI update will never happen. But we still can benefit from
>> interrupt window.
>> 
> What interrupt windows? Without virtualized x2apic TPR/EOI
> virtualization will not happen and we rely on that in the code.
Yes, but we can eliminate the interrupt-window vmexit. With vid enabled, an interrupt can still be delivered to the guest without a vmexit once the guest enables interrupts. See SDM vol 3, 29.2.2.
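
For readers without the SDM at hand, the following sketch models the evaluation and delivery steps as I read that chapter. It is a simplified user-space model, not an implementation: the sketch_* names are hypothetical, the "interrupt-window exiting" control and other blocking conditions are ignored, and guest_if stands in for the guest having interrupts enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the virtual-APIC state the CPU works on. */
struct sketch_vapic {
	uint8_t rvi;		/* requesting virtual interrupt */
	uint8_t svi;		/* servicing virtual interrupt */
	uint8_t vppr;		/* virtual processor priority */
	uint64_t virr[4];	/* 256-bit virtual IRR */
	uint64_t visr[4];	/* 256-bit virtual ISR */
};

/* Highest vector set in a 256-bit map, or -1 if the map is empty. */
static int sketch_highest_vector(const uint64_t bm[4])
{
	int v;

	for (v = 255; v >= 0; v--)
		if (bm[v / 64] & (UINT64_C(1) << (v % 64)))
			return v;
	return -1;
}

/* A virtual interrupt is pending when RVI[7:4] > VPPR[7:4]. */
static bool sketch_virq_pending(const struct sketch_vapic *v)
{
	return (v->rvi >> 4) > (v->vppr >> 4);
}

/*
 * Deliver the pending virtual interrupt without any exit to the
 * hypervisor. Returns the delivered vector, or -1 if nothing is
 * delivered.
 */
static int sketch_deliver_virq(struct sketch_vapic *v, bool guest_if)
{
	int vector, next;

	if (!guest_if || !sketch_virq_pending(v))
		return -1;

	vector = v->rvi;
	v->visr[vector / 64] |= UINT64_C(1) << (vector % 64);
	v->svi = (uint8_t)vector;
	v->vppr = (uint8_t)(vector & 0xf0);
	v->virr[vector / 64] &= ~(UINT64_C(1) << (vector % 64));

	/* RVI becomes the next highest pending vector, or 0 if none. */
	next = sketch_highest_vector(v->virr);
	v->rvi = (uint8_t)(next < 0 ? 0 : next);

	return vector;
}
```

The point for this thread is that the whole sequence runs in hardware once the guest sets IF, so the hypervisor does not need an interrupt-window exit to inject the vector.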

Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10 12:22             ` Zhang, Yang Z
@ 2013-01-10 12:34               ` Gleb Natapov
  2013-01-11  7:36                 ` Zhang, Yang Z
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-10 12:34 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

On Thu, Jan 10, 2013 at 12:22:51PM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-01-10:
> > On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
> >>>> The right logic should be:
> >>>> When register virtualization enabled, read all apic msr(except apic id reg and
> >>> tmcct which have the hook) not cause vmexit and only write TPR not cause
> >>> vmexit.
> >>>> When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
> >>>> It's better to put the patch of envirtualize x2apic mode as first patch.
> >>>> 
> >>> There is no point whatsoever to enable virtualize x2apic without
> >>> allowing at least one non intercepted access to x2apic MSR range and
> >>> this is what your patch is doing when run on cpu without vid capability.
> >> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is
> > enabled.
> > For that you need to disable 808H MSR intercept, which your patch does not do.
> I want to do this in next patch.
>  
Then don't. There is no point in the patch that enables virtualize
x2apic and does not allow at least one non-intercepted MSR access.

> >> I am not sure whether I understand your comments right in previous
> >> discussion, here is my thinking:
> >> 1. enable virtualize x2apic mode if guest is really using x2apic and clear
> >> TPR in msr read and write bitmap. This will benefit reading TPR.
> >> 2. If APIC registers virtualization is enabled, clear all bits in range
> >> 0x800-0x8ff (except apic id reg and tmcct).
> > Clear all read bits in the range.
> > 
> >> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
> >> 
> > Looks OK.
> > 
> >> One concern you mentioned is that vid enabled and virtualize x2apic is disabled
> >> but guest still use x2apic. In this case, we still use MSR bitmap to intercept x2apic.
> >> This means the vEOI update will never happen. But we still can benefit from
> >> interrupt window.
> >> 
> > What interrupt windows? Without virtualized x2apic TPR/EOI
> > virtualization will not happen and we rely on that in the code.
> Yes, but we can eliminate the interrupt-window vmexit. With vid enabled, an interrupt can still be delivered to the guest without a vmexit once the guest enables interrupts. See SDM vol 3, 29.2.2.
> 
What does it matter that you eliminated the interrupt-window vmexit if
you broke everything else? The vid patch assumes that apic emulation
happens either entirely in software when vid is disabled or entirely in
the CPU when vid is enabled. Mixed mode will not work (the
apic->isr_count = 1 trick will break things, for instance). And it is not
worth complicating the code to make it work.

--
			Gleb.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 1/3] x86, apicv: add APICv register virtualization support
  2013-01-10  7:26 ` [PATCH v9 1/3] x86, apicv: add APICv register " Yang Zhang
@ 2013-01-10 20:25   ` Marcelo Tosatti
  0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2013-01-10 20:25 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

On Thu, Jan 10, 2013 at 03:26:06PM +0800, Yang Zhang wrote:
> - APIC read doesn't cause VM-Exit
> - APIC write becomes trap-like
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
> ---
>  arch/x86/include/asm/vmx.h |    2 ++
>  arch/x86/kvm/lapic.c       |   15 +++++++++++++++
>  arch/x86/kvm/lapic.h       |    2 ++
>  arch/x86/kvm/vmx.c         |   33 ++++++++++++++++++++++++++++++++-
>  4 files changed, 51 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index e385df9..44c3f7e 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -66,6 +66,7 @@
>  #define EXIT_REASON_EPT_MISCONFIG       49
>  #define EXIT_REASON_WBINVD              54
>  #define EXIT_REASON_XSETBV              55
> +#define EXIT_REASON_APIC_WRITE          56
>  #define EXIT_REASON_INVPCID             58
>  
>  #define VMX_EXIT_REASONS \
> @@ -141,6 +142,7 @@
>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
> +#define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100
>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
>  #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9392f52..0664c13 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1212,6 +1212,21 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
>  
> +/* emulate APIC access in a trap manner */
> +void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
> +{
> +	u32 val = 0;
> +
> +	/* hw has done the conditional check and inst decode */
> +	offset &= 0xff0;
> +
> +	apic_reg_read(vcpu->arch.apic, offset, 4, &val);
> +
> +	/* TODO: optimize to just emulate side effect w/o one more write */
> +	apic_reg_write(vcpu->arch.apic, offset, val);
> +}
> +EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
> +
>  void kvm_free_lapic(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e5ebf9f..9a8ee22 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
>  u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
>  void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
>  
> +void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
> +
>  void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
>  void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
>  void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 55dfc37..688f43f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -84,6 +84,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
>  static bool __read_mostly fasteoi = 1;
>  module_param(fasteoi, bool, S_IRUGO);
>  
> +static bool __read_mostly enable_apicv_reg_vid = 1;
> +module_param(enable_apicv_reg_vid, bool, S_IRUGO);
> +
>  /*
>   * If nested=1, nested virtualization is supported, i.e., guests may use
>   * VMX and be a hypervisor for its own guests. If nested=0, guests may not
> @@ -764,6 +767,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>  }
>  
> +static inline bool cpu_has_vmx_apic_register_virt(void)
> +{
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +}
> +
>  static inline bool cpu_has_vmx_flexpriority(void)
>  {
>  	return cpu_has_vmx_tpr_shadow() &&
> @@ -2541,7 +2550,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  			SECONDARY_EXEC_UNRESTRICTED_GUEST |
>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>  			SECONDARY_EXEC_RDTSCP |
> -			SECONDARY_EXEC_ENABLE_INVPCID;
> +			SECONDARY_EXEC_ENABLE_INVPCID |
> +			SECONDARY_EXEC_APIC_REGISTER_VIRT;
>  		if (adjust_vmx_controls(min2, opt2,
>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>  					&_cpu_based_2nd_exec_control) < 0)
> @@ -2552,6 +2562,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
>  		_cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
>  #endif
> +
> +	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
> +		_cpu_based_2nd_exec_control &= ~(
> +				SECONDARY_EXEC_APIC_REGISTER_VIRT);

No need for () around SECONDARY_EXEC_APIC_REGISTER_VIRT.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-10  7:26 ` [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
  2013-01-10  8:23   ` Gleb Natapov
@ 2013-01-10 21:36   ` Marcelo Tosatti
  2013-01-11 14:09     ` Gleb Natapov
  1 sibling, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2013-01-10 21:36 UTC (permalink / raw)
  To: Yang Zhang; +Cc: kvm, gleb, haitao.shan, Kevin Tian

Hi,

Getting into good shape.

On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote:
> From: Yang Zhang <yang.z.zhang@Intel.com>
> 
> Virtual interrupt delivery frees KVM from injecting vAPIC interrupts
> manually; injection is fully taken care of by the hardware. This needs
> some special awareness in the existing interrupt injection path:
> 
> - For a pending interrupt, instead of direct injection, we may need to
>   update architecture-specific indicators before resuming the guest.
> 
> - A pending interrupt that is masked by ISR should also be
>   considered in the above update, since hardware will decide
>   when to inject it at the right time. The current has_interrupt and
>   get_interrupt only return a valid vector from the injection p.o.v.
> 
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    5 +
>  arch/x86/include/asm/vmx.h      |   11 +++
>  arch/x86/kvm/irq.c              |   56 +++++++++++-
>  arch/x86/kvm/lapic.c            |   72 +++++++++------
>  arch/x86/kvm/lapic.h            |   23 +++++
>  arch/x86/kvm/svm.c              |   18 ++++
>  arch/x86/kvm/vmx.c              |  191 +++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/x86.c              |   14 +++-
>  include/linux/kvm_host.h        |    3 +
>  virt/kvm/ioapic.c               |   18 ++++
>  virt/kvm/ioapic.h               |    4 +
>  virt/kvm/irq_comm.c             |   22 +++++
>  virt/kvm/kvm_main.c             |    5 +
>  13 files changed, 399 insertions(+), 43 deletions(-)
> 

>  static void recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
>  {
>  	apic_set_reg(apic, APIC_ID, id << 24);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }

Move ioapic_update_eoi_exitmap into recalculate_apic_map.

>  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
>  {
>  	apic_set_reg(apic, APIC_LDR, id);
>  	recalculate_apic_map(apic->vcpu->kvm);
> +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
>  }
>  

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index b203ce7..990409a 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -434,6 +434,7 @@ struct vcpu_vmx {
>  	bool rdtscp_enabled;
>  
>  	bool virtual_x2apic_enabled;
> +	unsigned long eoi_exit_bitmap[4];

Use DECLARE_BITMAP (unsigned long is 4 bytes on 32-bit host).
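
A minimal userspace sketch of the sizing problem, assuming stand-in macros
that mirror the kernel's BITS_TO_LONGS and DECLARE_BITMAP. The struct name
and the sketch_* helpers are hypothetical and not part of the patch. Declared
this way, a 256-bit bitmap has the right size on both 32-bit and 64-bit
hosts, and the 64-bit values for the VMCS fields can be assembled bit by bit
instead of indexing unsigned long words directly.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel macros of the same names. */
#define BITS_PER_LONG		(sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(nr)	(((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

/* Hypothetical stand-in for the relevant part of struct vcpu_vmx. */
struct vcpu_sketch {
	DECLARE_BITMAP(eoi_exit_bitmap, 256);	/* 8 longs on 32-bit, 4 on 64-bit */
};

static void sketch_clear(struct vcpu_sketch *v)
{
	/* sizeof tracks the real array size; no hard-coded byte count */
	memset(v->eoi_exit_bitmap, 0, sizeof(v->eoi_exit_bitmap));
}

static void sketch_set_bit(unsigned int nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int sketch_test_bit(unsigned int nr, const unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* Build the idx-th 64-bit chunk (vectors idx*64 .. idx*64+63). */
static uint64_t sketch_bitmap_u64(const unsigned long *addr, unsigned int idx)
{
	uint64_t v = 0;

	assert(idx < 4);	/* a 256-bit bitmap has four 64-bit chunks */
	for (unsigned int b = 0; b < 64; b++)
		if (sketch_test_bit(idx * 64 + b, addr))
			v |= (uint64_t)1 << b;
	return v;
}
```

With the array sized like this, a clear such as the one in
vmx_update_eoi_exitmap could use sizeof on the bitmap rather than a literal
byte count.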

>  	/* Support for a guest hypervisor (nested VMX) */
>  	struct nested_vmx nested;
> @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
>  
>  static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>  {
> -	return false;
> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  }
>  
>  static inline bool cpu_has_vmx_flexpriority(void)
> @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
>  			SECONDARY_EXEC_RDTSCP |
>  			SECONDARY_EXEC_ENABLE_INVPCID |
> -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
>  		if (adjust_vmx_controls(min2, opt2,
>  					MSR_IA32_VMX_PROCBASED_CTLS2,
>  					&_cpu_based_2nd_exec_control) < 0)
> @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>  
>  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
>  		_cpu_based_2nd_exec_control &= ~(
> -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);

Never mind the earlier comment about ().

> +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
> +				u32 vector)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	if (WARN_ONCE((vector > 255),
> +		"KVM VMX: vector (%d) out of range\n", vector))
> +		return;
> +	__set_bit(vector, vmx->eoi_exit_bitmap);
> +}
> +
> +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> +{
> +	struct kvm_lapic **dst;
> +	struct kvm_apic_map *map;
> +	unsigned long bitmap = 1;
> +	int i;
> +
> +	rcu_read_lock();
> +	map = rcu_dereference(vcpu->kvm->arch.apic_map);
> +
> +	if (unlikely(!map)) {
> +		set_eoi_exitmap_one(vcpu, irq->vector);
> +		goto out;
> +	}
> +
> +	if (irq->dest_mode == 0) { /* physical mode */
> +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> +				irq->dest_id == 0xff) {
> +			set_eoi_exitmap_one(vcpu, irq->vector);
> +			goto out;
> +		}
> +		dst = &map->phys_map[irq->dest_id & 0xff];
> +	} else {
> +		u32 mda = irq->dest_id << (32 - map->ldr_bits);
> +
> +		dst = map->logical_map[apic_cluster_id(map, mda)];
> +
> +		bitmap = apic_logical_id(map, mda);
> +	}
> +
> +	for_each_set_bit(i, &bitmap, 16) {
> +		if (!dst[i])
> +			continue;
> +		if (dst[i]->vcpu == vcpu) {
> +			set_eoi_exitmap_one(vcpu, irq->vector);
> +			break;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +}
> +
> +static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
> +	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
> +	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
> +	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
> +}
> +
> +static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
> +	union kvm_ioapic_redirect_entry *e;
> +	struct kvm_lapic_irq irqe;
> +	int index;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/* clear eoi exit bitmap */
> +	memset(vmx->eoi_exit_bitmap, 0, 32);
> +
> +	/* traverse ioapic entry to set eoi exit bitmap*/
> +	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
> +		e = &ioapic->redirtbl[index];
> +		if (!e->fields.mask &&
> +			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> +			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
> +				 index))) {
> +			irqe.dest_id = e->fields.dest_id;
> +			irqe.vector = e->fields.vector;
> +			irqe.dest_mode = e->fields.dest_mode;
> +			irqe.delivery_mode = e->fields.delivery_mode << 8;
> +			vmx_check_ioapic_entry(vcpu, &irqe);
> +
> +		}
> +	}
> +
> +	vmx_load_eoi_exitmap(vcpu);
> +}
> +
>  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
>  {
>  	u32 exit_intr_info;
> @@ -7553,6 +7726,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
>  	.update_cr8_intercept = update_cr8_intercept,
>  	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
>  	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
> +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> +	.update_apic_irq = vmx_update_apic_irq,
> +	.update_eoi_exitmap = vmx_update_eoi_exitmap,
> +	.set_svi = vmx_set_svi,
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.get_tdp_level = get_ept_level,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1c9c834..e6d8227 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
>  			vcpu->arch.nmi_injected = true;
>  			kvm_x86_ops->set_nmi(vcpu);
>  		}
> -	} else if (kvm_cpu_has_interrupt(vcpu)) {
> +	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
>  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
>  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
>  					    false);
> @@ -5648,6 +5648,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  			kvm_handle_pmu_event(vcpu);
>  		if (kvm_check_request(KVM_REQ_PMI, vcpu))
>  			kvm_deliver_pmi(vcpu);
> +		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu)) {
> +			mutex_lock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> +			kvm_x86_ops->update_eoi_exitmap(vcpu);
> +			mutex_unlock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> +		}

Take ioapic lock and irq_lock mutex.

> +void ioapic_update_eoi_exitmap(struct kvm *kvm)
> +{
> +#ifdef CONFIG_X86
> +	struct kvm_vcpu *vcpu = kvm->vcpus[0];
> +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +
> +	/* If vid is enabled in one of vcpus, then other
> +	 * vcpus also enabled it. */
> +	if (!kvm_apic_vid_enabled(vcpu) || !ioapic)
> +		return;

Is it even possible to call ioapic_update_eoi_exitmap() if 
kvm->arch.vioapic == NULL?

> +	kvm_make_update_eoibitmap_request(kvm);
> +#endif
> +}
> +
>  static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>  {
>  	unsigned index;
> @@ -156,6 +170,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>  		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
>  		    && ioapic->irr & (1 << index))
>  			ioapic_service(ioapic, index);
> +		ioapic_update_eoi_exitmap(ioapic->kvm);
>  		break;
>  	}
>  }
> @@ -415,6 +430,9 @@ int kvm_ioapic_init(struct kvm *kvm)
>  	ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
>  				      IOAPIC_MEM_LENGTH, &ioapic->dev);
>  	mutex_unlock(&kvm->slots_lock);
> +#ifdef CONFIG_X86
> +	mutex_init(&ioapic->eoimap_lock);
> +#endif
>  	if (ret < 0) {
>  		kvm->arch.vioapic = NULL;
>  		kfree(ioapic);
> diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
> index a30abfe..34544ce 100644
> --- a/virt/kvm/ioapic.h
> +++ b/virt/kvm/ioapic.h
> @@ -47,6 +47,9 @@ struct kvm_ioapic {
>  	void (*ack_notifier)(void *opaque, int irq);
>  	spinlock_t lock;
>  	DECLARE_BITMAP(handled_vectors, 256);
> +#ifdef CONFIG_X86
> +	struct mutex eoimap_lock;
> +#endif
>  };
>  
>  #ifdef DEBUG
> @@ -82,5 +85,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq);
>  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> +void ioapic_update_eoi_exitmap(struct kvm *kvm);
>  
>  #endif
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 656fa45..64aa1ab 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/kvm_host.h>
>  #include <linux/slab.h>
> +#include <linux/export.h>

What's this for?

>  #include <trace/events/kvm.h>
>  
>  #include <asm/msidef.h>
> @@ -237,6 +238,25 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
>  	return ret;
>  }
>  
> +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
> +{
> +	struct kvm_irq_ack_notifier *kian;
> +	struct hlist_node *n;
> +	int gsi;
> +
> +	rcu_read_lock();
> +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> +	if (gsi != -1)
> +		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> +					 link)
> +			if (kian->gsi == gsi)
> +				return true;

Forgot rcu_read_unlock();

> +	rcu_read_unlock();
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
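
One common way to avoid the missing unlock flagged above is a single exit
path. The following is a userspace sketch only: mock_read_lock() and
mock_read_unlock() stand in for rcu_read_lock() and rcu_read_unlock(), and
the list type and function name are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock read-side lock that only tracks nesting depth. */
static int read_lock_depth;

static void mock_read_lock(void)
{
	read_lock_depth++;
}

static void mock_read_unlock(void)
{
	assert(read_lock_depth > 0);	/* catch unbalanced unlocks */
	read_lock_depth--;
}

/* Hypothetical stand-in for an ack notifier list entry. */
struct notifier_sketch {
	int gsi;
	struct notifier_sketch *next;
};

static bool has_notifier_sketch(const struct notifier_sketch *head, int gsi)
{
	const struct notifier_sketch *n;
	bool found = false;

	mock_read_lock();
	if (gsi != -1) {
		for (n = head; n; n = n->next) {
			if (n->gsi == gsi) {
				found = true;	/* record the result, do not return here */
				break;
			}
		}
	}
	mock_read_unlock();	/* the only unlock, reached on every path */

	return found;
}
```

Because the result is carried in a local variable, the lock depth is back to
zero after every call, whether or not a match was found.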
> +
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
>  	struct kvm_irq_ack_notifier *kian;
> @@ -261,6 +281,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
>  	mutex_lock(&kvm->irq_lock);
>  	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
>  	mutex_unlock(&kvm->irq_lock);
> +	ioapic_update_eoi_exitmap(kvm);
>  }

Move inside irq_lock protection.

>  
>  void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
>  	hlist_del_init_rcu(&kian->link);
>  	mutex_unlock(&kvm->irq_lock);
>  	synchronize_rcu();
> +	ioapic_update_eoi_exitmap(kvm);

Move both synchronize_rcu and ioapic_update_eoi_exitmap inside irq_lock
protection.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10  8:52       ` Gleb Natapov
  2013-01-10 11:54         ` Zhang, Yang Z
@ 2013-01-11  2:37         ` Zhang, Yang Z
  2013-01-11 13:51           ` Gleb Natapov
  1 sibling, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-11  2:37 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 08:32:06AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-01-10:
>>> On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
>>>> From: Yang Zhang <yang.z.zhang@Intel.com>
>>>> 
>>>> basically to benefit from apicv, we need to enable virtualized x2apic mode.
>>>> Currently, we only enable it when guest is really using x2apic.
>>>> 
>>>> Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
>>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
>>>>     		   except APIC ID and TMCCT.
>>>>     APIC ID and TMCCT: need software's assistance to get right value.
>>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
>>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
>>>> ---
>>>>  arch/x86/include/asm/kvm_host.h |    2 +
>>>>  arch/x86/include/asm/vmx.h      |    1 +
>>>>  arch/x86/kvm/lapic.c            |    5 +-
>>>>  arch/x86/kvm/svm.c              |    6 +
>>>>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
>>>>  5 files changed, 200 insertions(+), 8 deletions(-)
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index c431b33..572a562 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
>>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
>>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
>>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
>>>> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>>> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
>>> Make one callback with enable/disable parameter. And do not forget SVM.
>>> 
>>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>>>>  	int (*get_tdp_level)(void);
>>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>>>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>>>> index 44c3f7e..0a54df0 100644
>>>> --- a/arch/x86/include/asm/vmx.h
>>>> +++ b/arch/x86/include/asm/vmx.h
>>>> @@ -139,6 +139,7 @@
>>>>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
>>>>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
>>>>  #define SECONDARY_EXEC_RDTSCP			0x00000008
>>>> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
>>>>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
>>>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
>>>>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>> index 0664c13..ec38906 100644
>>>> --- a/arch/x86/kvm/lapic.c
>>>> +++ b/arch/x86/kvm/lapic.c
>>>> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
>>>>  		u32 id = kvm_apic_id(apic);
>>>>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
>>>>  		kvm_apic_set_ldr(apic, ldr);
>>>> -	}
>>>> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
>>>> +	} else
>>>> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
>>>> +
>>> You just broke SVM.
>>>>  	apic->base_address = apic->vcpu->arch.apic_base &
>>>>  			     MSR_IA32_APICBASE_BASE;
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index d29d3cd..0b82cb1 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>>>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
>>>>  }
>>>> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	return;
>>>> +}
>>>> +
>>>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	struct vcpu_svm *svm = to_svm(vcpu);
>>>> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>>>  	.enable_nmi_window = enable_nmi_window,
>>>>  	.enable_irq_window = enable_irq_window,
>>>>  	.update_cr8_intercept = update_cr8_intercept,
>>>> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
>>>> 
>>>>  	.set_tss_addr = svm_set_tss_addr,
>>>>  	.get_tdp_level = get_npt_level,
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 688f43f..b203ce7 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -433,6 +433,8 @@ struct vcpu_vmx {
>>>> 
>>>>  	bool rdtscp_enabled;
>>>> +	bool virtual_x2apic_enabled;
>>>> +
>>>>  	/* Support for a guest hypervisor (nested VMX) */
>>>>  	struct nested_vmx nested;
>>>>  };
>>>> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
>>>>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>>>  }
>>>> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
>>>> +{
>>>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
>>>> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>> +}
>>>> +
>>>>  static inline bool cpu_has_vmx_apic_register_virt(void)
>>>>  {
>>>>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
>>>>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>>>  }
>>>> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
>>>> +{
>>>> +	return false;
>>>> +}
>>>> +
>>>>  static inline bool cpu_has_vmx_flexpriority(void)
>>>>  {
>>>>  	return cpu_has_vmx_tpr_shadow() &&
>>>> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>>>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
>>>>  		min2 = 0;
>>>>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>>>> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
>>>>  			SECONDARY_EXEC_WBINVD_EXITING |
>>>>  			SECONDARY_EXEC_ENABLE_VPID |
>>>>  			SECONDARY_EXEC_ENABLE_EPT |
>>>> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
>>>>  	spin_unlock(&vmx_vpid_lock);
>>>>  }
>>>> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>> +#define MSR_TYPE_R	1
>>>> +#define MSR_TYPE_W	2
>>>> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>>>> +						u32 msr, int type)
>>>> +{
>>>> +	int f = sizeof(unsigned long);
>>>> +
>>>> +	if (!cpu_has_vmx_msr_bitmap())
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
>>>> +	 * have the write-low and read-high bitmap offsets the wrong way round.
>>>> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>>> +	 */
>>>> +	if (msr <= 0x1fff) {
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-low */
>>>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-low */
>>>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
>>>> +
>>>> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>> +		msr &= 0x1fff;
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-high */
>>>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-high */
>>>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
>>>> +
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
>>>> +						u32 msr, int type)
>>>>  {
>>>>  	int f = sizeof(unsigned long);
>>>> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
>>>>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
>>>>  	 */
>>>>  	if (msr <= 0x1fff) {
>>>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
>>>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-low */
>>>> +			__set_bit(msr, msr_bitmap + 0x000 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-low */
>>>> +			__set_bit(msr, msr_bitmap + 0x800 / f);
>>>> +
>>>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
>>>>  		msr &= 0x1fff;
>>>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
>>>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
>>>> +		if (type & MSR_TYPE_R)
>>>> +			/* read-high */
>>>> +			__set_bit(msr, msr_bitmap + 0x400 / f);
>>>> +
>>>> +		if (type & MSR_TYPE_W)
>>>> +			/* write-high */
>>>> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
>>>> +
>>>>  	}
>>>>  }
>>>> +
>>>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
>>>>  {
>>>>  	if (!longmode_only)
>>>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
>>>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>>>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
>>>> +}
>>>> +
>>>> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
>>>> +					bool set)
>>>> +{
>>>> +	if (!longmode_only) {
>>>> +		if (set)
>>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_R);
>>>> +		else
>>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_R);
>>>> +
>>>> +	}
>>>> +	if (set)
>>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_R);
>>>> +	else
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_R);
>>>> +}
>>>> +
>>>> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
>>>> +					bool set)
>>>> +{
>>>> +	if (!longmode_only) {
>>>> +		if (set)
>>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_W);
>>>> +		else
>>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
>>>> +					msr, MSR_TYPE_W);
>>>> +
>>>> +	}
>>>> +	if (set)
>>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_W);
>>>> +	else
>>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
>>>> +				msr, MSR_TYPE_W);
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
>>>>  	if (!enable_apicv_reg_vid)
>>>>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
>>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>>  	return exec_control;
>>>>  }
>>>> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>>>>  	vmcs_write32(TPR_THRESHOLD, irr);
>>>>  }
>>>> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u32 exec_control;
>>>> +	int msr;
>>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>>> +
>>>> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
>>>> +		return;
>>>> +
>>>> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>>>> +	/* virtualize x2apic mode relies on tpr shadow */
>>>> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
>>>> +		return;
>>>> +
>>>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>>>> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
>>>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>>>> +	vmx->virtual_x2apic_enabled = true;
>>> Why track it?
>> With this flag, we don't need to read the vmcs to check whether we enabled
>> virtual x2apic before.
>> 
> Why do you care? Just disable it regardless.
kvm_lapic_set_base is called when the lapic is created. At that time the vcpu is not yet initialized, so reading or writing the vmcs in vmx_disable_virtual_x2apic_mode would cause an error.
With this flag, we only disable virtual x2apic mode if it was enabled before.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-10 12:34               ` Gleb Natapov
@ 2013-01-11  7:36                 ` Zhang, Yang Z
  2013-01-11 16:54                   ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-11  7:36 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin, Zhang, Xiantao

Gleb Natapov wrote on 2013-01-10:
> On Thu, Jan 10, 2013 at 12:22:51PM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-01-10:
>>> On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
>>>>>> The right logic should be:
>>>>>> When register virtualization is enabled, reads of all APIC MSRs (except
>>>>>> the APIC ID register and TMCCT, which have the hook) do not cause a
>>>>>> vmexit, and only writes to TPR do not cause a vmexit.
>>>>>> When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
>>>>>> It's better to put the patch that enables virtualize x2apic mode first.
>>>>>> 
>>>>> There is no point whatsoever to enable virtualize x2apic without
>>>>> allowing at least one non intercepted access to x2apic MSR range and
>>>>> this is what your patch is doing when run on cpu without vid capability.
>>>> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is enabled.
>>> For that you need to disable 808H MSR intercept, which your patch does not do.
>> I want to do this in next patch.
>> 
> Then don't. There is no point in the patch that enables virtualize
> x2apic and does not allow at least one non-intercepted MSR access.
As I said, read/write TPR will not cause vmexit if virtual x2apic is set.

>>>> I am not sure whether I understood your comments in the previous
>>>> discussion correctly; here is my thinking:
>>>> 1. Enable virtualize x2apic mode if the guest is really using x2apic and
>>>>    clear TPR in the MSR read and write bitmaps. This will benefit reading TPR.
>>>> 2. If APIC register virtualization is enabled, clear all bits in the range
>>>>    0x800-0x8ff (except the APIC ID register and TMCCT).
>>> Clear all read bits in the range.
>>> 
>>>> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
>>>> 
>>> Looks OK.
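
The three-step plan above can be sketched in userspace C. This is an
illustration under stated assumptions, not KVM code: the sketch_* helpers are
hypothetical, the bitmap is a plain byte array rather than a VMCS-referenced
page, and only the low MSR range is handled. It assumes the layout named in
the quoted patch (read-low at byte offset 0x000, write-low at 0x800, a set
bit meaning "intercept") and the architectural x2APIC MSR numbers (0x800 plus
the xAPIC register offset shifted right by 4, so TPR is 808H as noted
earlier in the thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MSR_BITMAP_BYTES	4096
#define MSR_TYPE_R		1
#define MSR_TYPE_W		2

/* x2APIC MSR numbers: 0x800 + (xAPIC MMIO offset >> 4). */
enum {
	X2APIC_MSR_BASE     = 0x800,
	X2APIC_MSR_ID       = 0x802,
	X2APIC_MSR_TPR      = 0x808,
	X2APIC_MSR_EOI      = 0x80b,
	X2APIC_MSR_TMCCT    = 0x839,
	X2APIC_MSR_SELF_IPI = 0x83f,
	X2APIC_MSR_LAST     = 0x8ff,
};

static void sketch_assign_bit(uint8_t *base, uint32_t nr, bool on)
{
	uint8_t mask = (uint8_t)(1u << (nr % 8));

	if (on)
		base[nr / 8] |= mask;
	else
		base[nr / 8] &= (uint8_t)~mask;
}

/* Set (on = true) or clear (on = false) the intercept for a low-range MSR. */
static void sketch_set_intercept(uint8_t *bitmap, uint32_t msr, int type, bool on)
{
	assert(msr <= 0x1fff);	/* this sketch covers the low MSR range only */

	if (type & MSR_TYPE_R)
		sketch_assign_bit(bitmap + 0x000, msr, on);	/* read-low */
	if (type & MSR_TYPE_W)
		sketch_assign_bit(bitmap + 0x800, msr, on);	/* write-low */
}

/* type must be exactly MSR_TYPE_R or MSR_TYPE_W here. */
static bool sketch_intercepted(const uint8_t *bitmap, uint32_t msr, int type)
{
	size_t off = (type == MSR_TYPE_W) ? 0x800 : 0x000;

	return (bitmap[off + msr / 8] >> (msr % 8)) & 1;
}

static void sketch_apply_x2apic_plan(uint8_t *bitmap, bool reg_virt, bool vid)
{
	uint32_t msr;

	/* Start from "intercept everything". */
	memset(bitmap, 0xff, MSR_BITMAP_BYTES);

	/* Step 1: TPR reads and writes go through without an exit. */
	sketch_set_intercept(bitmap, X2APIC_MSR_TPR, MSR_TYPE_R | MSR_TYPE_W, false);

	/* Step 2: with register virtualization, pass all reads in the range
	 * except APIC ID and TMCCT, which still need software assistance. */
	if (reg_virt) {
		for (msr = X2APIC_MSR_BASE; msr <= X2APIC_MSR_LAST; msr++)
			sketch_set_intercept(bitmap, msr, MSR_TYPE_R, false);
		sketch_set_intercept(bitmap, X2APIC_MSR_ID, MSR_TYPE_R, true);
		sketch_set_intercept(bitmap, X2APIC_MSR_TMCCT, MSR_TYPE_R, true);
	}

	/* Step 3: with vid, EOI and SELF IPI writes also pass through. */
	if (vid) {
		sketch_set_intercept(bitmap, X2APIC_MSR_EOI, MSR_TYPE_W, false);
		sketch_set_intercept(bitmap, X2APIC_MSR_SELF_IPI, MSR_TYPE_W, false);
	}
}
```

Calling sketch_apply_x2apic_plan() with both flags false leaves only the TPR
intercepts cleared, which matches step 1 taken on its own.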
>>> 
>>>> One concern you mentioned is that vid is enabled and virtualize x2apic is
>>>> disabled but the guest still uses x2apic. In this case, we still use the MSR
>>>> bitmap to intercept x2apic. This means the vEOI update will never happen.
>>>> But we can still benefit from the interrupt window.
>>>> 
>>> What interrupt windows? Without virtualized x2apic TPR/EOI
>>> virtualization will not happen and we rely on that in the code.
>> Yes, but we can eliminate the interrupt-window vmexit. Interrupts can still be
>> delivered to the guest without a vmexit when the guest enables interrupts, if
>> vid is enabled. See SDM vol 3, 29.2.2.
>> 
> What does it matter that you eliminated vmexit of interrupt window if
> you broke everything else? The vid patch assumes that apic emulation
> either entirely happens in a software when vid is disabled or in a CPU
> if vid is enabled. Mixed mode will not work (apic->isr_count = 1 trick
> will break things for instance). And it is not worth it to complicate
> the code to make it work.
Yes, you are right. It is too complicated.
Another question: why not hide the x2apic capability from the guest when vid is supported but virtual x2apic mode is not? That seems more reasonable than disabling vid when virtual x2apic mode is unavailable.

Best regards,
Yang



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-11  2:37         ` Zhang, Yang Z
@ 2013-01-11 13:51           ` Gleb Natapov
  0 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2013-01-11 13:51 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin

On Fri, Jan 11, 2013 at 02:37:15AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-01-10:
> > On Thu, Jan 10, 2013 at 08:32:06AM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2013-01-10:
> >>> On Thu, Jan 10, 2013 at 03:26:07PM +0800, Yang Zhang wrote:
> >>>> From: Yang Zhang <yang.z.zhang@Intel.com>
> >>>> 
> >>>> basically to benefit from apicv, we need to enable virtualized x2apic mode.
> >>>> Currently, we only enable it when guest is really using x2apic.
> >>>> 
> >>>> Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
> >>>>     0x800 - 0x8ff: no read intercept for apicv register virtualization,
> >>>>     		   except APIC ID and TMCCT.
> >>>>     APIC ID and TMCCT: need software's assistance to get right value.
> >>>>     TPR,EOI,SELF-IPI: no write intercept for virtual interrupt delivery.
> >>>> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> >>>> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> >>>> ---
> >>>>  arch/x86/include/asm/kvm_host.h |    2 +
> >>>>  arch/x86/include/asm/vmx.h      |    1 +
> >>>>  arch/x86/kvm/lapic.c            |    5 +-
> >>>>  arch/x86/kvm/svm.c              |    6 +
> >>>>  arch/x86/kvm/vmx.c              |  194 +++++++++++++++++++++++++++++++++++++--
> >>>>  5 files changed, 200 insertions(+), 8 deletions(-)
> >>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>>> index c431b33..572a562 100644
> >>>> --- a/arch/x86/include/asm/kvm_host.h
> >>>> +++ b/arch/x86/include/asm/kvm_host.h
> >>>> @@ -697,6 +697,8 @@ struct kvm_x86_ops {
> >>>>  	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
> >>>>  	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
> >>>>  	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
> >>>> +	void (*enable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> >>>> +	void (*disable_virtual_x2apic_mode)(struct kvm_vcpu *vcpu);
> >>> Make one callback with enable/disable parameter. And do not forget SVM.
> >>> 
> >>>>  	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> >>>>  	int (*get_tdp_level)(void);
> >>>>  	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> >>>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >>>> index 44c3f7e..0a54df0 100644
> >>>> --- a/arch/x86/include/asm/vmx.h
> >>>> +++ b/arch/x86/include/asm/vmx.h
> >>>> @@ -139,6 +139,7 @@
> >>>>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
> >>>>  #define SECONDARY_EXEC_ENABLE_EPT               0x00000002
> >>>>  #define SECONDARY_EXEC_RDTSCP			0x00000008
> >>>> +#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x00000010
> >>>>  #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
> >>>>  #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
> >>>>  #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
> >>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >>>> index 0664c13..ec38906 100644
> >>>> --- a/arch/x86/kvm/lapic.c
> >>>> +++ b/arch/x86/kvm/lapic.c
> >>>> @@ -1328,7 +1328,10 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> >>>>  		u32 id = kvm_apic_id(apic);
> >>>>  		u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
> >>>>  		kvm_apic_set_ldr(apic, ldr);
> >>>> -	}
> >>>> +		kvm_x86_ops->enable_virtual_x2apic_mode(vcpu);
> >>>> +	} else
> >>>> +		kvm_x86_ops->disable_virtual_x2apic_mode(vcpu);
> >>>> +
> >>> You just broke SVM.
> >>>>  	apic->base_address = apic->vcpu->arch.apic_base &
> >>>>  			     MSR_IA32_APICBASE_BASE;
> >>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >>>> index d29d3cd..0b82cb1 100644
> >>>> --- a/arch/x86/kvm/svm.c
> >>>> +++ b/arch/x86/kvm/svm.c
> >>>> @@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> >>>>  		set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
> >>>>  }
> >>>> +static void svm_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	return;
> >>>> +}
> >>>> +
> >>>>  static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> >>>>  {
> >>>>  	struct vcpu_svm *svm = to_svm(vcpu);
> >>>> @@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> >>>>  	.enable_nmi_window = enable_nmi_window,
> >>>>  	.enable_irq_window = enable_irq_window,
> >>>>  	.update_cr8_intercept = update_cr8_intercept,
> >>>> +	.enable_virtual_x2apic_mode = svm_enable_virtual_x2apic_mode,
> >>>> 
> >>>>  	.set_tss_addr = svm_set_tss_addr,
> >>>>  	.get_tdp_level = get_npt_level,
> >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>> index 688f43f..b203ce7 100644
> >>>> --- a/arch/x86/kvm/vmx.c
> >>>> +++ b/arch/x86/kvm/vmx.c
> >>>> @@ -433,6 +433,8 @@ struct vcpu_vmx {
> >>>> 
> >>>>  	bool rdtscp_enabled;
> >>>> +	bool virtual_x2apic_enabled;
> >>>> +
> >>>>  	/* Support for a guest hypervisor (nested VMX) */
> >>>>  	struct nested_vmx nested;
> >>>>  };
> >>>> @@ -767,12 +769,23 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
> >>>>  		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>>>  }
> >>>> +static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
> >>>> +{
> >>>> +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> >>>> +		SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >>>> +}
> >>>> +
> >>>>  static inline bool cpu_has_vmx_apic_register_virt(void)
> >>>>  {
> >>>>  	return vmcs_config.cpu_based_2nd_exec_ctrl &
> >>>>  		SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >>>>  }
> >>>> +static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> >>>> +{
> >>>> +	return false;
> >>>> +}
> >>>> +
> >>>>  static inline bool cpu_has_vmx_flexpriority(void)
> >>>>  {
> >>>>  	return cpu_has_vmx_tpr_shadow() &&
> >>>> @@ -2544,6 +2557,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >>>>  	if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
> >>>>  		min2 = 0;
> >>>>  		opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
> >>>> +			SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> >>>>  			SECONDARY_EXEC_WBINVD_EXITING |
> >>>>  			SECONDARY_EXEC_ENABLE_VPID |
> >>>>  			SECONDARY_EXEC_ENABLE_EPT |
> >>>> @@ -3731,7 +3745,45 @@ static void free_vpid(struct vcpu_vmx *vmx)
> >>>>  	spin_unlock(&vmx_vpid_lock);
> >>>>  }
> >>>>
> >>>> -static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> >>>> +#define MSR_TYPE_R	1
> >>>> +#define MSR_TYPE_W	2
> >>>> +static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> >>>> +						u32 msr, int type)
> >>>> +{
> >>>> +	int f = sizeof(unsigned long);
> >>>> +
> >>>> +	if (!cpu_has_vmx_msr_bitmap())
> >>>> +		return;
> >>>> +
> >>>> +	/*
> >>>> +	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
> >>>> +	 * have the write-low and read-high bitmap offsets the wrong way round.
> >>>> +	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
> >>>> +	 */
> >>>> +	if (msr <= 0x1fff) {
> >>>> +		if (type & MSR_TYPE_R)
> >>>> +			/* read-low */
> >>>> +			__clear_bit(msr, msr_bitmap + 0x000 / f);
> >>>> +
> >>>> +		if (type & MSR_TYPE_W)
> >>>> +			/* write-low */
> >>>> +			__clear_bit(msr, msr_bitmap + 0x800 / f);
> >>>> +
> >>>> +	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> >>>> +		msr &= 0x1fff;
> >>>> +		if (type & MSR_TYPE_R)
> >>>> +			/* read-high */
> >>>> +			__clear_bit(msr, msr_bitmap + 0x400 / f);
> >>>> +
> >>>> +		if (type & MSR_TYPE_W)
> >>>> +			/* write-high */
> >>>> +			__clear_bit(msr, msr_bitmap + 0xc00 / f);
> >>>> +
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
> >>>> +						u32 msr, int type)
> >>>>  {
> >>>>  	int f = sizeof(unsigned long);
> >>>> @@ -3744,20 +3796,75 @@ static void __vmx_disable_intercept_for_msr(unsigned long *msr_bitmap, u32 msr)
> >>>>  	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
> >>>>  	 */
> >>>>  	if (msr <= 0x1fff) {
> >>>> -		__clear_bit(msr, msr_bitmap + 0x000 / f); /* read-low */
> >>>> -		__clear_bit(msr, msr_bitmap + 0x800 / f); /* write-low */
> >>>> +		if (type & MSR_TYPE_R)
> >>>> +			/* read-low */
> >>>> +			__set_bit(msr, msr_bitmap + 0x000 / f);
> >>>> +
> >>>> +		if (type & MSR_TYPE_W)
> >>>> +			/* write-low */
> >>>> +			__set_bit(msr, msr_bitmap + 0x800 / f);
> >>>> +
> >>>>  	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
> >>>>  		msr &= 0x1fff;
> >>>> -		__clear_bit(msr, msr_bitmap + 0x400 / f); /* read-high */
> >>>> -		__clear_bit(msr, msr_bitmap + 0xc00 / f); /* write-high */
> >>>> +		if (type & MSR_TYPE_R)
> >>>> +			/* read-high */
> >>>> +			__set_bit(msr, msr_bitmap + 0x400 / f);
> >>>> +
> >>>> +		if (type & MSR_TYPE_W)
> >>>> +			/* write-high */
> >>>> +			__set_bit(msr, msr_bitmap + 0xc00 / f);
> >>>> +
> >>>>  	}
> >>>>  }
> >>>> +
> >>>>  static void vmx_disable_intercept_for_msr(u32 msr, bool longmode_only)
> >>>>  {
> >>>>  	if (!longmode_only)
> >>>> -		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr);
> >>>> -	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr);
> >>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> >>>> +	__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >>>> +						msr, MSR_TYPE_R | MSR_TYPE_W);
> >>>> +}
> >>>> +
> >>>> +static void vmx_intercept_for_msr_read(u32 msr, bool longmode_only,
> >>>> +					bool set)
> >>>> +{
> >>>> +	if (!longmode_only) {
> >>>> +		if (set)
> >>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >>>> +					msr, MSR_TYPE_R);
> >>>> +		else
> >>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >>>> +					msr, MSR_TYPE_R);
> >>>> +
> >>>> +	}
> >>>> +	if (set)
> >>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >>>> +				msr, MSR_TYPE_R);
> >>>> +	else
> >>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >>>> +				msr, MSR_TYPE_R);
> >>>> +}
> >>>> +
> >>>> +static void vmx_intercept_for_msr_write(u32 msr, bool longmode_only,
> >>>> +					bool set)
> >>>> +{
> >>>> +	if (!longmode_only) {
> >>>> +		if (set)
> >>>> +			__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >>>> +					msr, MSR_TYPE_W);
> >>>> +		else
> >>>> +			__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy,
> >>>> +					msr, MSR_TYPE_W);
> >>>> +
> >>>> +	}
> >>>> +	if (set)
> >>>> +		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >>>> +				msr, MSR_TYPE_W);
> >>>> +	else
> >>>> +		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode,
> >>>> +				msr, MSR_TYPE_W);
> >>>>  }
> >>>>  
> >>>>  /*
> >>>> @@ -3855,6 +3962,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> >>>>  		exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
> >>>>  	if (!enable_apicv_reg_vid)
> >>>>  		exec_control &= ~SECONDARY_EXEC_APIC_REGISTER_VIRT;
> >>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >>>>  	return exec_control;
> >>>>  }
> >>>> @@ -6110,6 +6218,76 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> >>>>  	vmcs_write32(TPR_THRESHOLD, irr);
> >>>>  }
> >>>> +static void vmx_enable_virtual_x2apic_mode(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	u32 exec_control;
> >>>> +	int msr;
> >>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>> +
> >>>> +	if (!cpu_has_vmx_virtualize_x2apic_mode())
> >>>> +		return;
> >>>> +
> >>>> +	exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> >>>> +	/* virtualize x2apic mode relies on tpr shadow */
> >>>> +	if (!(exec_control & CPU_BASED_TPR_SHADOW))
> >>>> +		return;
> >>>> +
> >>>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> >>>> +	exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> >>>> +	exec_control |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
> >>>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> >>>> +	vmx->virtual_x2apic_enabled = true;
> >>> Why track it?
> >> With this flag, we don't need to read vmcs to check whether we enabled
> >> virtua x2apic before.
> >> 
> > Why do you care? Just disabled it regardless.
> kvm_lapic_set_base will be called when creating lapic. At that time, vcpu didn't initialized. Then read/write vmcs in vmx_disable_virtual_x2apic_mode will cause error.
> With this flag, we only disable the virtual x2apic mode if it is enabled before. 
> 
Then call vmx_enable_virtual_x2apic_mode() only when the mode actually changes.
kvm_lapic_set_base() can track it the way it does for MSR_IA32_APICBASE_ENABLE.
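
For illustration only, a minimal userspace sketch of that idea, not the actual patch. It assumes the x2APIC enable flag is bit 10 and the global enable flag is bit 11 of the APIC base MSR. Every demo_* name is hypothetical. It also folds enable/disable into one hook with a parameter, as asked earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_APICBASE_ENABLE (1ULL << 11) /* xAPIC global enable (assumed bit) */
#define DEMO_X2APIC_ENABLE   (1ULL << 10) /* x2APIC mode enable (assumed bit) */

/* Hypothetical stand-in for the per-vcpu APIC state. */
struct demo_apic {
	uint64_t apic_base;   /* last value written to the base MSR */
	bool virt_x2apic_on;  /* what the vendor hook was last told */
	int hook_calls;       /* how often the vendor hook ran */
};

/* Hypothetical single vendor hook with an enable/disable parameter. */
static void demo_set_virtual_x2apic_mode(struct demo_apic *a, bool enable)
{
	a->virt_x2apic_on = enable;
	a->hook_calls++;
}

/* Call the hook only when the x2APIC bit differs from the old value. */
static void demo_set_base(struct demo_apic *a, uint64_t value)
{
	uint64_t changed;

	assert(a != NULL);
	changed = a->apic_base ^ value;
	a->apic_base = value;

	if (changed & DEMO_X2APIC_ENABLE)
		demo_set_virtual_x2apic_mode(a,
				(value & DEMO_X2APIC_ENABLE) != 0);
}
```

With this shape the first call at lapic creation (x2APIC bit still clear) never reaches the hook, so nothing touches the vmcs before the vcpu is initialized, and no extra flag in vcpu_vmx is needed.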

--
			Gleb.


* Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-10 21:36   ` Marcelo Tosatti
@ 2013-01-11 14:09     ` Gleb Natapov
  2013-01-11 17:58       ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2013-01-11 14:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

On Thu, Jan 10, 2013 at 07:36:28PM -0200, Marcelo Tosatti wrote:
> Hi,
> 
> Getting into good shape.
> 
> On Thu, Jan 10, 2013 at 03:26:08PM +0800, Yang Zhang wrote:
> > From: Yang Zhang <yang.z.zhang@Intel.com>
> > 
> > Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
> > manually, which is fully taken care of by the hardware. This needs
> > some special awareness into existing interrupr injection path:
> > 
> > - for pending interrupt, instead of direct injection, we may need
> >   update architecture specific indicators before resuming to guest.
> > 
> > - A pending interrupt, which is masked by ISR, should be also
> >   considered in above update action, since hardware will decide
> >   when to inject it at right time. Current has_interrupt and
> >   get_interrupt only returns a valid vector from injection p.o.v.
> > 
> > Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |    5 +
> >  arch/x86/include/asm/vmx.h      |   11 +++
> >  arch/x86/kvm/irq.c              |   56 +++++++++++-
> >  arch/x86/kvm/lapic.c            |   72 +++++++++------
> >  arch/x86/kvm/lapic.h            |   23 +++++
> >  arch/x86/kvm/svm.c              |   18 ++++
> >  arch/x86/kvm/vmx.c              |  191 +++++++++++++++++++++++++++++++++++++--
> >  arch/x86/kvm/x86.c              |   14 +++-
> >  include/linux/kvm_host.h        |    3 +
> >  virt/kvm/ioapic.c               |   18 ++++
> >  virt/kvm/ioapic.h               |    4 +
> >  virt/kvm/irq_comm.c             |   22 +++++
> >  virt/kvm/kvm_main.c             |    5 +
> >  13 files changed, 399 insertions(+), 43 deletions(-)
> > 
> 
> >  static void recalculate_apic_map(struct kvm *kvm)
> >  {
> >  	struct kvm_apic_map *new, *old = NULL;
> > @@ -236,12 +219,14 @@ static inline void kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> >  {
> >  	apic_set_reg(apic, APIC_ID, id << 24);
> >  	recalculate_apic_map(apic->vcpu->kvm);
> > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  }
> 
> Move ioapic_update_eoi_exitmap into recalculate_apic_map.
> 
> >  static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> >  {
> >  	apic_set_reg(apic, APIC_LDR, id);
> >  	recalculate_apic_map(apic->vcpu->kvm);
> > +	ioapic_update_eoi_exitmap(apic->vcpu->kvm);
> >  }
> >  
> 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index b203ce7..990409a 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -434,6 +434,7 @@ struct vcpu_vmx {
> >  	bool rdtscp_enabled;
> >  
> >  	bool virtual_x2apic_enabled;
> > +	unsigned long eoi_exit_bitmap[4];
> 
> Use DECLARE_BITMAP (unsigned long is 4 bytes on 32-bit host).
> 
> >  	/* Support for a guest hypervisor (nested VMX) */
> >  	struct nested_vmx nested;
> > @@ -783,7 +784,8 @@ static inline bool cpu_has_vmx_apic_register_virt(void)
> >  
> >  static inline bool cpu_has_vmx_virtual_intr_delivery(void)
> >  {
> > -	return false;
> > +	return vmcs_config.cpu_based_2nd_exec_ctrl &
> > +		SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >  }
> >  
> >  static inline bool cpu_has_vmx_flexpriority(void)
> > @@ -2565,7 +2567,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >  			SECONDARY_EXEC_PAUSE_LOOP_EXITING |
> >  			SECONDARY_EXEC_RDTSCP |
> >  			SECONDARY_EXEC_ENABLE_INVPCID |
> > -			SECONDARY_EXEC_APIC_REGISTER_VIRT;
> > +			SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > +			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY;
> >  		if (adjust_vmx_controls(min2, opt2,
> >  					MSR_IA32_VMX_PROCBASED_CTLS2,
> >  					&_cpu_based_2nd_exec_control) < 0)
> > @@ -2579,7 +2582,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> >  
> >  	if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
> >  		_cpu_based_2nd_exec_control &= ~(
> > -				SECONDARY_EXEC_APIC_REGISTER_VIRT);
> > +				SECONDARY_EXEC_APIC_REGISTER_VIRT |
> > +				SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
> 
> Nevermind the earlier comment about ().
> 
> > +static void set_eoi_exitmap_one(struct kvm_vcpu *vcpu,
> > +				u32 vector)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	if (WARN_ONCE((vector > 255),
> > +		"KVM VMX: vector (%d) out of range\n", vector))
> > +		return;
> > +	__set_bit(vector, vmx->eoi_exit_bitmap);
> > +}
> > +
> > +void vmx_check_ioapic_entry(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
> > +{
> > +	struct kvm_lapic **dst;
> > +	struct kvm_apic_map *map;
> > +	unsigned long bitmap = 1;
> > +	int i;
> > +
> > +	rcu_read_lock();
> > +	map = rcu_dereference(vcpu->kvm->arch.apic_map);
> > +
> > +	if (unlikely(!map)) {
> > +		set_eoi_exitmap_one(vcpu, irq->vector);
> > +		goto out;
> > +	}
> > +
> > +	if (irq->dest_mode == 0) { /* physical mode */
> > +		if (irq->delivery_mode == APIC_DM_LOWEST ||
> > +				irq->dest_id == 0xff) {
> > +			set_eoi_exitmap_one(vcpu, irq->vector);
> > +			goto out;
> > +		}
> > +		dst = &map->phys_map[irq->dest_id & 0xff];
> > +	} else {
> > +		u32 mda = irq->dest_id << (32 - map->ldr_bits);
> > +
> > +		dst = map->logical_map[apic_cluster_id(map, mda)];
> > +
> > +		bitmap = apic_logical_id(map, mda);
> > +	}
> > +
> > +	for_each_set_bit(i, &bitmap, 16) {
> > +		if (!dst[i])
> > +			continue;
> > +		if (dst[i]->vcpu == vcpu) {
> > +			set_eoi_exitmap_one(vcpu, irq->vector);
> > +			break;
> > +		}
> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +}
> > +
> > +static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	vmcs_write64(EOI_EXIT_BITMAP0, vmx->eoi_exit_bitmap[0]);
> > +	vmcs_write64(EOI_EXIT_BITMAP1, vmx->eoi_exit_bitmap[1]);
> > +	vmcs_write64(EOI_EXIT_BITMAP2, vmx->eoi_exit_bitmap[2]);
> > +	vmcs_write64(EOI_EXIT_BITMAP3, vmx->eoi_exit_bitmap[3]);
> > +}
> > +
> > +static void vmx_update_eoi_exitmap(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
> > +	union kvm_ioapic_redirect_entry *e;
> > +	struct kvm_lapic_irq irqe;
> > +	int index;
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	/* clear eoi exit bitmap */
> > +	memset(vmx->eoi_exit_bitmap, 0, 32);
> > +
> > +	/* traverse ioapic entry to set eoi exit bitmap*/
> > +	for (index = 0; index < IOAPIC_NUM_PINS; index++) {
> > +		e = &ioapic->redirtbl[index];
> > +		if (!e->fields.mask &&
> > +			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> > +			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
> > +				 index))) {
> > +			irqe.dest_id = e->fields.dest_id;
> > +			irqe.vector = e->fields.vector;
> > +			irqe.dest_mode = e->fields.dest_mode;
> > +			irqe.delivery_mode = e->fields.delivery_mode << 8;
> > +			vmx_check_ioapic_entry(vcpu, &irqe);
> > +
> > +		}
> > +	}
> > +
> > +	vmx_load_eoi_exitmap(vcpu);
> > +}
> > +
> >  static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
> >  {
> >  	u32 exit_intr_info;
> > @@ -7553,6 +7726,10 @@ static struct kvm_x86_ops vmx_x86_ops = {
> >  	.update_cr8_intercept = update_cr8_intercept,
> >  	.enable_virtual_x2apic_mode = vmx_enable_virtual_x2apic_mode,
> >  	.disable_virtual_x2apic_mode = vmx_disable_virtual_x2apic_mode,
> > +	.has_virtual_interrupt_delivery = vmx_has_virtual_interrupt_delivery,
> > +	.update_apic_irq = vmx_update_apic_irq,
> > +	.update_eoi_exitmap = vmx_update_eoi_exitmap,
> > +	.set_svi = vmx_set_svi,
> >  
> >  	.set_tss_addr = vmx_set_tss_addr,
> >  	.get_tdp_level = get_ept_level,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1c9c834..e6d8227 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -5527,7 +5527,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
> >  			vcpu->arch.nmi_injected = true;
> >  			kvm_x86_ops->set_nmi(vcpu);
> >  		}
> > -	} else if (kvm_cpu_has_interrupt(vcpu)) {
> > +	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
> >  		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
> >  			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
> >  					    false);
> > @@ -5648,6 +5648,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  			kvm_handle_pmu_event(vcpu);
> >  		if (kvm_check_request(KVM_REQ_PMI, vcpu))
> >  			kvm_deliver_pmi(vcpu);
> > +		if (kvm_check_request(KVM_REQ_EOIBITMAP, vcpu)) {
> > +			mutex_lock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> > +			kvm_x86_ops->update_eoi_exitmap(vcpu);
> > +			mutex_unlock(&vcpu->kvm->arch.vioapic->eoimap_lock);
> > +		}
> 
> Take ioapic lock and irq_lock mutex.
> 
Why irq_lock? Ack notifiers are RCU protected and they are only read
here.

> > +void ioapic_update_eoi_exitmap(struct kvm *kvm)
> > +{
> > +#ifdef CONFIG_X86
> > +	struct kvm_vcpu *vcpu = kvm->vcpus[0];
> > +	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> > +
> > +	/* If vid is enabled in one of vcpus, then other
> > +	 * vcpus also enabled it. */
> > +	if (!kvm_apic_vid_enabled(vcpu) || !ioapic)
> > +		return;
> 
> Is it even possible to call ioapic_update_eoi_exitmap() if 
> kvm->arch.vioapic == NULL?
> 
Maybe, if the irqchip is in userspace. Need to verify all callers.

> > +	kvm_make_update_eoibitmap_request(kvm);
> > +#endif
> > +}
> > +
> >  static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> >  {
> >  	unsigned index;
> > @@ -156,6 +170,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> >  		if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG
> >  		    && ioapic->irr & (1 << index))
> >  			ioapic_service(ioapic, index);
> > +		ioapic_update_eoi_exitmap(ioapic->kvm);
> >  		break;
> >  	}
> >  }
> > @@ -415,6 +430,9 @@ int kvm_ioapic_init(struct kvm *kvm)
> >  	ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
> >  				      IOAPIC_MEM_LENGTH, &ioapic->dev);
> >  	mutex_unlock(&kvm->slots_lock);
> > +#ifdef CONFIG_X86
> > +	mutex_init(&ioapic->eoimap_lock);
> > +#endif
> >  	if (ret < 0) {
> >  		kvm->arch.vioapic = NULL;
> >  		kfree(ioapic);
> > diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
> > index a30abfe..34544ce 100644
> > --- a/virt/kvm/ioapic.h
> > +++ b/virt/kvm/ioapic.h
> > @@ -47,6 +47,9 @@ struct kvm_ioapic {
> >  	void (*ack_notifier)(void *opaque, int irq);
> >  	spinlock_t lock;
> >  	DECLARE_BITMAP(handled_vectors, 256);
> > +#ifdef CONFIG_X86
> > +	struct mutex eoimap_lock;
> > +#endif
> >  };
> >  
> >  #ifdef DEBUG
> > @@ -82,5 +85,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  		struct kvm_lapic_irq *irq);
> >  int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> >  int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
> > +void ioapic_update_eoi_exitmap(struct kvm *kvm);
> >  
> >  #endif
> > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > index 656fa45..64aa1ab 100644
> > --- a/virt/kvm/irq_comm.c
> > +++ b/virt/kvm/irq_comm.c
> > @@ -22,6 +22,7 @@
> >  
> >  #include <linux/kvm_host.h>
> >  #include <linux/slab.h>
> > +#include <linux/export.h>
> 
> Whats this for?
> 
> >  #include <trace/events/kvm.h>
> >  
> >  #include <asm/msidef.h>
> > @@ -237,6 +238,25 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int level)
> >  	return ret;
> >  }
> >  
> > +bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
> > +{
> > +	struct kvm_irq_ack_notifier *kian;
> > +	struct hlist_node *n;
> > +	int gsi;
> > +
> > +	rcu_read_lock();
> > +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> > +	if (gsi != -1)
> > +		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
> > +					 link)
> > +			if (kian->gsi == gsi)
> > +				return true;
> 
> Forgot rcu_read_unlock();
> 
> > +	rcu_read_unlock();
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
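
A sketch of one way to fix the missing unlock, in plain userspace C: a counter stands in for the RCU read-side section and a simple linked list stands in for the hlist. All demo_* names are hypothetical. The point is the single exit path: record the result, break, and unlock once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the RCU read-side critical section nesting depth. */
static int demo_read_depth;

static void demo_read_lock(void)   { demo_read_depth++; }
static void demo_read_unlock(void) { demo_read_depth--; }

/* Hypothetical ack-notifier list node. */
struct demo_notifier {
	int gsi;
	struct demo_notifier *next;
};

/* Returns true if any notifier is registered for gsi. */
static bool demo_has_notifier(const struct demo_notifier *head, int gsi)
{
	const struct demo_notifier *n;
	bool found = false;

	demo_read_lock();
	if (gsi != -1) {
		for (n = head; n != NULL; n = n->next) {
			if (n->gsi == gsi) {
				found = true;
				break;	/* leave the loop, do not return here */
			}
		}
	}
	demo_read_unlock();	/* reached on every path */

	assert(demo_read_depth >= 0);
	return found;
}
```
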
> > +
> >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> >  {
> >  	struct kvm_irq_ack_notifier *kian;
> > @@ -261,6 +281,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
> >  	mutex_lock(&kvm->irq_lock);
> >  	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
> >  	mutex_unlock(&kvm->irq_lock);
> > +	ioapic_update_eoi_exitmap(kvm);
> >  }
> 
> Move inside irq_lock protection.
> 
> >  
> >  void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> > @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> >  	hlist_del_init_rcu(&kian->link);
> >  	mutex_unlock(&kvm->irq_lock);
> >  	synchronize_rcu();
> > +	ioapic_update_eoi_exitmap(kvm);
> 
> Move both synchronize_rcu and ioapic_update_eoi_exitmap inside irq_lock
> protection.
Why? (Here and in the one above.) If a vcpu uses stale data during the update, it
will find the recalculate request on guest entry and will recalculate again.
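
Roughly, as a userspace sketch with C11 atomics (all demo_* names are hypothetical, and this is not the KVM request code): the updater changes the source data first and then raises the request bit, so a vcpu that computed from stale data still sees a pending request at its next entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_REQ_EOIBITMAP 0u	/* hypothetical request bit number */

/* Hypothetical per-vcpu state. */
struct demo_req_vcpu {
	atomic_uint requests;	/* pending request bits */
	unsigned bitmap_copy;	/* value derived from the shared source */
	int recalcs;		/* how many times we recalculated */
};

/* Updater side: call after the source data has been modified. */
static void demo_make_request(struct demo_req_vcpu *v, unsigned bit)
{
	assert(bit < 32);
	atomic_fetch_or(&v->requests, 1u << bit);
}

/* Vcpu side: test and clear the bit in one atomic step. */
static bool demo_check_request(struct demo_req_vcpu *v, unsigned bit)
{
	unsigned mask = 1u << bit;

	assert(bit < 32);
	return (atomic_fetch_and(&v->requests, ~mask) & mask) != 0;
}

/* Guest entry: recalculate only if somebody asked for it. */
static void demo_enter_guest(struct demo_req_vcpu *v, const unsigned *source)
{
	if (demo_check_request(v, DEMO_REQ_EOIBITMAP)) {
		v->bitmap_copy = *source;
		v->recalcs++;
	}
}
```

Because the request is raised after the data changes, a racing recalculation can at worst be redundant; it cannot leave the vcpu permanently stale.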

--
			Gleb.


* Re: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-11  7:36                 ` Zhang, Yang Z
@ 2013-01-11 16:54                   ` Gleb Natapov
  2013-01-14  1:03                     ` Zhang, Yang Z
  2013-01-14  1:14                     ` Zhang, Yang Z
  0 siblings, 2 replies; 26+ messages in thread
From: Gleb Natapov @ 2013-01-11 16:54 UTC (permalink / raw)
  To: Zhang, Yang Z
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin, Zhang, Xiantao

On Fri, Jan 11, 2013 at 07:36:44AM +0000, Zhang, Yang Z wrote:
> Gleb Natapov wrote on 2013-01-10:
> > On Thu, Jan 10, 2013 at 12:22:51PM +0000, Zhang, Yang Z wrote:
> >> Gleb Natapov wrote on 2013-01-10:
> >>> On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
> >>>>>> The right logic should be:
> >>>>>> When register virtualization enabled, read all apic msr(except apic id reg
> > and
> >>>>> tmcct which have the hook) not cause vmexit and only write TPR not cause
> >>>>> vmexit.
> >>>>>> When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
> >>>>>> It's better to put the patch of envirtualize x2apic mode as first patch.
> >>>>>> 
> >>>>> There is no point whatsoever to enable virtualize x2apic without
> >>>>> allowing at least one non intercepted access to x2apic MSR range and
> >>>>> this is what your patch is doing when run on cpu without vid capability.
> >>>> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode is
> >>> enabled.
> >>> For that you need to disable 808H MSR intercept, which your patch does not
> > do.
> >> I want to do this in next patch.
> >> 
> > Then don't. There is no point in the patch that enables virtualize
> > x2apic and does not allow at least one non-intercepted MSR access.
> As I said, read/write TPR will not cause vmexit if virtual x2apic is set.
> 
From my reading of the spec you need to disable the 808H intercept for that.
Is this not the case?
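
To spell out what I mean, a userspace sketch of the bitmap manipulation, following the offsets in the quoted __vmx_disable_intercept_for_msr() (read-low at 0x000, write-low at 0x800). It works on a byte array instead of unsigned longs, which gives the same bit positions on a little-endian host. The demo_* names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MSR_TYPE_R 1
#define DEMO_MSR_TYPE_W 2
#define DEMO_X2APIC_TPR 0x808u	/* x2APIC TPR MSR */
#define DEMO_X2APIC_EOI 0x80bu	/* x2APIC EOI MSR */

/* Byte offset of the bitmap region for one access type (low MSRs only). */
static size_t demo_region(int type)
{
	return (type == DEMO_MSR_TYPE_W) ? 0x800 : 0x000;
}

/* Clear the intercept bit(s) for a low-range MSR in a 4 KiB bitmap. */
static void demo_disable_intercept(uint8_t *bitmap, uint32_t msr, int type)
{
	assert(bitmap != NULL);
	if (msr > 0x1fff)
		return;	/* high range left out of this sketch */

	if (type & DEMO_MSR_TYPE_R)
		bitmap[demo_region(DEMO_MSR_TYPE_R) + msr / 8] &=
			(uint8_t)~(1u << (msr % 8));
	if (type & DEMO_MSR_TYPE_W)
		bitmap[demo_region(DEMO_MSR_TYPE_W) + msr / 8] &=
			(uint8_t)~(1u << (msr % 8));
}

/* True if an access of exactly one type (R or W) would be intercepted. */
static bool demo_intercepted(const uint8_t *bitmap, uint32_t msr, int type)
{
	assert(bitmap != NULL && msr <= 0x1fff);
	return (bitmap[demo_region(type) + msr / 8] >> (msr % 8)) & 1u;
}
```

Starting from an all-ones bitmap, only a call like demo_disable_intercept(bm, DEMO_X2APIC_TPR, DEMO_MSR_TYPE_R | DEMO_MSR_TYPE_W) lets TPR accesses through; setting the virtualize-x2apic control alone leaves the bit set.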

> >>>> I am not sure whether I understand your comments right in previous
> >>>> discussion, here is my thinking: 1. enable virtualize x2apic mode if
> >>>> guest is really using x2apic and clear TPR in msr read  and write
> >>>> bitmap. This will benefit reading TPR. 2. If APIC registers
> >>>> virtualization is enabled, clear all bit in rang
> >>> 0x800-0x8ff(except apic id reg and tmcct).
> >>> Clear all read bits in the range.
> >>> 
> >>>> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
> >>>> 
> >>> Looks OK.
> >>> 
> >>>> One concern you mentioned is that vid enabled and virtualize x2apic is
> > disabled
> >>> but guest still use x2apic. In this case, we still use MSR bitmap to
> >>> intercept x2apic. This means the vEOI update will never happen. But we
> >>> still can benefit from interrupt window.
> >>>> 
> >>> What interrupt windows? Without virtualized x2apic TPR/EOI
> >>> virtualization will not happen and we rely on that in the code.
> >> Yes, but we can eliminate vmexit of interrupt window. Interrupt still can
> > delivery to guest without vmexit when guest enables interrupt if vid is enabled.
> > See SDM vol 3, 29.2.2.
> >> 
> > What does it matter that you eliminated vmexit of interrupt window if
> > you broke everything else? The vid patch assumes that apic emulation
> > either entirely happens in a software when vid is disabled or in a CPU
> > if vid is enabled. Mixed mode will not work (apic->isr_count = 1 trick
> > will break things for instance). And it is not worth it to complicate
> > the code to make it work.
> Yes, you are right. It too complicated.
> Another question? Why not to hide x2apic capability to guest when vid is supported and virtual x2apic mode is not supported? It should be more reasonable than disable vid when virtual x2apic mode is unavailable.
> 
Because I think there will be no such HW. If such HW is expected to be
common then we can go this route.

--
			Gleb.


* Re: [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support
  2013-01-11 14:09     ` Gleb Natapov
@ 2013-01-11 17:58       ` Marcelo Tosatti
  0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2013-01-11 17:58 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Yang Zhang, kvm, haitao.shan, Kevin Tian

> > 
> > Move inside irq_lock protection.
> > 
> > >  
> > >  void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> > > @@ -270,6 +291,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> > >  	hlist_del_init_rcu(&kian->link);
> > >  	mutex_unlock(&kvm->irq_lock);
> > >  	synchronize_rcu();
> > > +	ioapic_update_eoi_exitmap(kvm);
> > 
> > Move both synchronize_rcu and ioapic_update_eoi_exitmap inside irq_lock
> > protection.
> Why? (here and one above). If vcpu uses stale data during update it will
> find recalculate request during guest entry and will recalculate again.
> 
> --
> 			Gleb.

Indeed, it's not necessary, and it is also not necessary to grab irq_lock
during vcpu entry while processing the KVM_REQ_ bit.




* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-11 16:54                   ` Gleb Natapov
@ 2013-01-14  1:03                     ` Zhang, Yang Z
  2013-01-14  1:14                     ` Zhang, Yang Z
  1 sibling, 0 replies; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-14  1:03 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin, Zhang, Xiantao

Gleb Natapov wrote on 2013-01-12:
> On Fri, Jan 11, 2013 at 07:36:44AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-01-10:
>>> On Thu, Jan 10, 2013 at 12:22:51PM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2013-01-10:
>>>>> On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
>>>>>>>> The right logic should be:
>>>>>>>> When register virtualization enabled, read all apic msr(except apic id
> reg
>>> and
>>>>>>> tmcct which have the hook) not cause vmexit and only write TPR not
>>>>>>> cause vmexit.
>>>>>>>> When vid enabled, write to TPR, EOI and SELF IPI not cause vmexit.
>>>>>>>> It's better to put the patch of envirtualize x2apic mode as first patch.
>>>>>>>> 
>>>>>>> There is no point whatsoever to enable virtualize x2apic without
>>>>>>> allowing at least one non intercepted access to x2apic MSR range and
>>>>>>> this is what your patch is doing when run on cpu without vid capability.
>>>>>> No, at least read/write TPR will not cause vmexit if virtualize x2apic mode
> is
>>>>> enabled.
>>>>> For that you need to disable 808H MSR intercept, which your patch does
> not
>>> do.
>>>> I want to do this in next patch.
>>>> 
>>> Then don't. There is no point in the patch that enables virtualize
>>> x2apic and does not allow at least one non-intercepted MSR access.
>> As I said, read/write TPR will not cause vmexit if virtual x2apic is set.
>> 
> From my reading of the spec you need to disable 808H intercept for that.
> Is this not the case?
> 
>>>>>> I am not sure whether I understand your comments right in previous
>>>>>> discussion, here is my thinking: 1. enable virtualize x2apic mode if
>>>>>> guest is really using x2apic and clear TPR in msr read  and write
>>>>>> bitmap. This will benefit reading TPR. 2. If APIC registers
>>>>>> virtualization is enabled, clear all bit in rang
>>>>> 0x800-0x8ff(except apic id reg and tmcct).
>>>>> Clear all read bits in the range.
>>>>> 
>>>>>> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
>>>>>> 
>>>>> Looks OK.
>>>>> 
>>>>>> One concern you mentioned is the case where vid is enabled and
>>>>>> virtualize x2apic is disabled, but the guest still uses x2apic. In this
>>>>>> case, we still use the MSR bitmap to intercept x2apic. This means the
>>>>>> vEOI update will never happen. But we can still benefit from the
>>>>>> interrupt window.
>>>>>> 
>>>>> What interrupt windows? Without virtualized x2apic TPR/EOI
>>>>> virtualization will not happen and we rely on that in the code.
>>>> Yes, but we can eliminate the interrupt-window vmexit. Interrupts can
>>>> still be delivered to the guest without a vmexit when the guest enables
>>>> interrupts, if vid is enabled. See SDM vol 3, 29.2.2.
>>>> 
>>> What does it matter that you eliminated vmexit of interrupt window if
>>> you broke everything else? The vid patch assumes that apic emulation
>>> happens either entirely in software when vid is disabled or in the CPU
>>> if vid is enabled. Mixed mode will not work (apic->isr_count = 1 trick
>>> will break things for instance). And it is not worth it to complicate
>>> the code to make it work.
>> Yes, you are right. It is too complicated.
>> Another question: why not hide the x2apic capability from the guest when
>> vid is supported and virtual x2apic mode is not? That seems more
>> reasonable than disabling vid when virtual x2apic mode is unavailable.
>> 
> Because I think there will be no such HW. If such HW is expected to be
> common then we can go this route.
Yes. No HW will support vid without also supporting virtual x2apic mode, so we can just ignore this case.
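
For reference, the three-step MSR bitmap plan quoted above could be sketched
roughly as follows. This is only an illustrative, standalone model and not the
code from the patch: struct msr_bitmap, allow_read(), allow_write() and
build_x2apic_bitmap() are hypothetical names, and the three flags stand in for
the corresponding VMX execution controls. The MSR numbers are the
architectural x2APIC ones (808H is TPR), and a set bit means "intercept".
Following the conclusion of this thread, the sketch assumes nothing is passed
through unless virtualize x2apic mode is on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Architectural x2APIC MSR numbers. */
#define X2APIC_MSR_FIRST   0x800u
#define X2APIC_MSR_LAST    0x8ffu
#define X2APIC_MSR_ID      0x802u
#define X2APIC_MSR_TPR     0x808u
#define X2APIC_MSR_EOI     0x80bu
#define X2APIC_MSR_TMCCT   0x839u
#define X2APIC_MSR_SELFIPI 0x83fu

/*
 * Hypothetical model of the 4 KiB VMX MSR bitmap page. Only the "low"
 * halves (MSRs 0x0000-0x1fff) are touched here.
 */
struct msr_bitmap {
	uint8_t read_low[1024];
	uint8_t read_high[1024];
	uint8_t write_low[1024];
	uint8_t write_high[1024];
};

/* Clear the read-intercept bit for one low MSR. */
static void allow_read(struct msr_bitmap *bm, uint32_t msr)
{
	assert(msr <= 0x1fffu);
	bm->read_low[msr / 8] &= (uint8_t)~(1u << (msr % 8));
}

/* Clear the write-intercept bit for one low MSR. */
static void allow_write(struct msr_bitmap *bm, uint32_t msr)
{
	assert(msr <= 0x1fffu);
	bm->write_low[msr / 8] &= (uint8_t)~(1u << (msr % 8));
}

static bool read_intercepted(const struct msr_bitmap *bm, uint32_t msr)
{
	assert(msr <= 0x1fffu);
	return (bm->read_low[msr / 8] & (1u << (msr % 8))) != 0;
}

static bool write_intercepted(const struct msr_bitmap *bm, uint32_t msr)
{
	assert(msr <= 0x1fffu);
	return (bm->write_low[msr / 8] & (1u << (msr % 8))) != 0;
}

static void build_x2apic_bitmap(struct msr_bitmap *bm, bool virt_x2apic,
				bool apic_reg_virt, bool vid)
{
	uint32_t msr;

	/* Start with every MSR access intercepted. */
	memset(bm, 0xff, sizeof(*bm));

	/* Assumption: no pass-through without virtualize x2apic mode. */
	if (!virt_x2apic)
		return;

	/* Step 1: TPR (808H) reads and writes are not intercepted. */
	allow_read(bm, X2APIC_MSR_TPR);
	allow_write(bm, X2APIC_MSR_TPR);

	/* Step 2: all reads in 0x800-0x8ff except APIC ID and TMCCT. */
	if (apic_reg_virt) {
		for (msr = X2APIC_MSR_FIRST; msr <= X2APIC_MSR_LAST; msr++) {
			if (msr != X2APIC_MSR_ID && msr != X2APIC_MSR_TMCCT)
				allow_read(bm, msr);
		}
	}

	/* Step 3: EOI and SELF IPI writes are not intercepted with vid. */
	if (vid) {
		allow_write(bm, X2APIC_MSR_EOI);
		allow_write(bm, X2APIC_MSR_SELFIPI);
	}
}
```

With all three flags set, only reads of the APIC ID and TMCCT MSRs and writes
other than TPR, EOI and SELF IPI remain intercepted in the x2APIC range.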

Best regards,
Yang




* RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support
  2013-01-11 16:54                   ` Gleb Natapov
  2013-01-14  1:03                     ` Zhang, Yang Z
@ 2013-01-14  1:14                     ` Zhang, Yang Z
  1 sibling, 0 replies; 26+ messages in thread
From: Zhang, Yang Z @ 2013-01-14  1:14 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm@vger.kernel.org, Shan, Haitao, mtosatti@redhat.com,
	Tian, Kevin, Zhang, Xiantao

Gleb Natapov wrote on 2013-01-12:
> On Fri, Jan 11, 2013 at 07:36:44AM +0000, Zhang, Yang Z wrote:
>> Gleb Natapov wrote on 2013-01-10:
>>> On Thu, Jan 10, 2013 at 12:22:51PM +0000, Zhang, Yang Z wrote:
>>>> Gleb Natapov wrote on 2013-01-10:
>>>>> On Thu, Jan 10, 2013 at 11:54:21AM +0000, Zhang, Yang Z wrote:
>>>>>>>> The right logic should be:
>>>>>>>> When register virtualization is enabled, reads of all APIC MSRs
>>>>>>>> (except the APIC ID register and TMCCT, which have hooks) do not
>>>>>>>> cause a vmexit, and only writes to TPR do not cause a vmexit.
>>>>>>>> When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
>>>>>>>> It's better to put the patch that enables virtualize x2apic mode first.
>>>>>>>> 
>>>>>>> There is no point whatsoever to enable virtualize x2apic without
>>>>>>> allowing at least one non intercepted access to x2apic MSR range and
>>>>>>> this is what your patch is doing when run on cpu without vid capability.
>>>>>> No, at least read/write TPR will not cause vmexit if virtualize x2apic
>>>>>> mode is enabled.
>>>>> For that you need to disable 808H MSR intercept, which your patch does
>>>>> not do.
>>>> I want to do this in next patch.
>>>> 
>>> Then don't. There is no point in the patch that enables virtualize
>>> x2apic and does not allow at least one non-intercepted MSR access.
>> As I said, read/write TPR will not cause vmexit if virtual x2apic is set.
>> 
> From my reading of the spec you need to disable 808H intercept for that.
> Is this not the case?
After thinking about it more: since Linux doesn't use the TPR, there is no point in doing this. No real guest would benefit from it.

>>>>>> I am not sure whether I understood your comments in the previous
>>>>>> discussion correctly. Here is my thinking:
>>>>>> 1. Enable virtualize x2apic mode if the guest is really using x2apic,
>>>>>>    and clear TPR in the MSR read and write bitmaps. This will benefit
>>>>>>    reading TPR.
>>>>>> 2. If APIC register virtualization is enabled, clear all bits in the
>>>>>>    range 0x800-0x8ff (except the APIC ID register and TMCCT).
>>>>> Clear all read bits in the range.
>>>>> 
>>>>>> 3. If vid is enabled, clear EOI and SELF IPI in msr write map.
>>>>>> 
>>>>> Looks OK.
>>>>> 
>>>>>> One concern you mentioned is the case where vid is enabled and
>>>>>> virtualize x2apic is disabled, but the guest still uses x2apic. In this
>>>>>> case, we still use the MSR bitmap to intercept x2apic. This means the
>>>>>> vEOI update will never happen. But we can still benefit from the
>>>>>> interrupt window.
>>>>>> 
>>>>> What interrupt windows? Without virtualized x2apic TPR/EOI
>>>>> virtualization will not happen and we rely on that in the code.
>>>> Yes, but we can eliminate the interrupt-window vmexit. Interrupts can
>>>> still be delivered to the guest without a vmexit when the guest enables
>>>> interrupts, if vid is enabled. See SDM vol 3, 29.2.2.
>>>> 
>>> What does it matter that you eliminated vmexit of interrupt window if
>>> you broke everything else? The vid patch assumes that apic emulation
>>> happens either entirely in software when vid is disabled or in the CPU
>>> if vid is enabled. Mixed mode will not work (apic->isr_count = 1 trick
>>> will break things for instance). And it is not worth it to complicate
>>> the code to make it work.
>> Yes, you are right. It is too complicated.
>> Another question: why not hide the x2apic capability from the guest when
>> vid is supported and virtual x2apic mode is not? That seems more
>> reasonable than disabling vid when virtual x2apic mode is unavailable.
>> 
> Because I think there will be no such HW. If such HW is expected to be
> common then we can go this route.
> 
> --
> 			Gleb.


Best regards,
Yang




end of thread, other threads:[~2013-01-14  1:14 UTC | newest]

Thread overview: 26+ messages
2013-01-10  7:26 [PATCH v9 0/3] x86, apicv: Add APIC virtualization support Yang Zhang
2013-01-10  7:26 ` [PATCH v9 1/3] x86, apicv: add APICv register " Yang Zhang
2013-01-10 20:25   ` Marcelo Tosatti
2013-01-10  7:26 ` [PATCH v9 2/3] x86, apicv: add virtual x2apic support Yang Zhang
2013-01-10  7:55   ` Gleb Natapov
2013-01-10  8:32     ` Zhang, Yang Z
2013-01-10  8:52       ` Gleb Natapov
2013-01-10 11:54         ` Zhang, Yang Z
2013-01-10 12:16           ` Gleb Natapov
2013-01-10 12:22             ` Zhang, Yang Z
2013-01-10 12:34               ` Gleb Natapov
2013-01-11  7:36                 ` Zhang, Yang Z
2013-01-11 16:54                   ` Gleb Natapov
2013-01-14  1:03                     ` Zhang, Yang Z
2013-01-14  1:14                     ` Zhang, Yang Z
2013-01-11  2:37         ` Zhang, Yang Z
2013-01-11 13:51           ` Gleb Natapov
2013-01-10  8:25   ` Gleb Natapov
2013-01-10  8:31     ` Zhang, Yang Z
2013-01-10  8:53       ` Gleb Natapov
2013-01-10  7:26 ` [PATCH v9 3/3] x86, apicv: add virtual interrupt delivery support Yang Zhang
2013-01-10  8:23   ` Gleb Natapov
2013-01-10 12:04     ` Zhang, Yang Z
2013-01-10 21:36   ` Marcelo Tosatti
2013-01-11 14:09     ` Gleb Natapov
2013-01-11 17:58       ` Marcelo Tosatti
