kvm.vger.kernel.org archive mirror
* [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support
@ 2015-01-28  2:54 Kai Huang
  2015-01-28  2:54 ` [PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty Kai Huang
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch series adds Page Modification Logging (PML) support in VMX.

1) Introduction

PML is a new feature on Intel's Broadwell server platform, targeted at reducing
the overhead of the dirty logging mechanism.

The specification can be found at:

http://www.intel.com/content/www/us/en/processors/page-modification-logging-vmm-white-paper.html

Currently, dirty logging is done by write protection: guest memory is write
protected, and the dirty GFN is marked in dirty_bitmap on the subsequent write
fault. This works fine, except for the overhead of an additional write fault for
logging each dirty GFN. The overhead can be large if the write operations from
the guest are intensive.

PML is a hardware-assisted, efficient way to do dirty logging. PML logs dirty
GPAs automatically to a 4K PML memory buffer when the CPU changes an EPT entry's
D-bit from 0 to 1. To support this, a new 4K PML buffer base address and a PML
index were added to the VMCS. The buffer holds 512 entries of 8 bytes (one GPA)
each; the PML index initially points to the last entry, the CPU decrements it
after logging each GPA, and a PML buffer full VMEXIT eventually happens when the
buffer is fully logged.
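
For illustration only (this is not code from this series; the names below are
invented for the sketch), the hardware logging step can be modelled roughly as:

	#define PML_ENTRIES	512	/* 4K buffer / 8 bytes per GPA */

	/* Rough model of what the CPU does when it sets an EPT D-bit 0 -> 1.
	 * pml_base/pml_index stand in for the VMCS fields, and
	 * pml_full_vmexit() is a made-up placeholder for the PML buffer full
	 * VMEXIT.
	 */
	static void pml_log_dirty_gpa(u64 *pml_base, u16 *pml_index, u64 gpa)
	{
		if (*pml_index >= PML_ENTRIES) {
			/* index already wrapped below 0: buffer is full */
			pml_full_vmexit();
			return;
		}
		pml_base[*pml_index] = gpa & ~0xfffULL;	/* log 4K-aligned GPA */
		(*pml_index)--;		/* wraps to 0xffff after entry 0 is used */
	}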

With PML, we don't have to use write protection, so the intensive write-fault
EPT violations can be avoided, at the cost of an additional PML buffer full
VMEXIT per 512 dirty GPAs. Theoretically, this reduces hypervisor overhead while
the guest is in dirty logging mode, so more CPU cycles can be allocated to the
guest, and benchmarks in the guest are expected to perform better compared to
non-PML.
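
As a rough back-of-envelope example (made-up numbers): if a guest dirties
512,000 distinct 4K pages during one dirty-log interval, write protection costs
roughly 512,000 write-fault EPT violation VMEXITs, while PML costs roughly
512,000 / 512 = 1,000 PML buffer full VMEXITs plus the per-VMEXIT buffer flush
work.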

2) Design

a. Enable/Disable PML

PML is per-vcpu (per-VMCS), while the EPT tables can be shared by vcpus, so we
need to enable/disable PML for all vcpus of the guest. A dedicated 4K page is
allocated for each vcpu when PML is enabled for that vcpu.

Currently, we choose to always enable PML for the guest, which means we enable
PML when creating a VCPU and never disable it during the guest's lifetime. This
avoids the complicated logic of enabling PML on demand while the guest is
running. To eliminate potential unnecessary GPA logging in non-dirty-logging
mode, we set the D-bit manually for the slots with dirty logging disabled.
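
This per-slot decision is what kvm_mmu_slot_apply_flags() in patch 5
implements; condensed from that patch:

	if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
		if (kvm_x86_ops->slot_enable_log_dirty)
			/* PML: clear D-bits (and write protect superpages) */
			kvm_x86_ops->slot_enable_log_dirty(kvm, new);
		else
			/* no hardware support: fall back to write protection */
			kvm_mmu_slot_remove_write_access(kvm, new);
	} else {
		if (kvm_x86_ops->slot_disable_log_dirty)
			/* PML: set D-bits so nothing gets logged for this slot */
			kvm_x86_ops->slot_disable_log_dirty(kvm, new);
	}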

b. Flush PML buffer

When userspace queries dirty_bitmap, there may be GPAs logged in a vcpu's PML
buffer that have not caused a VMEXIT because the buffer is not yet full. In this
case, we need to manually flush the PML buffers of all vcpus and transfer the
dirty GPAs to dirty_bitmap.

We flush the PML buffer at the beginning of each VMEXIT. This keeps
dirty_bitmap more up to date, and also simplifies the logic of flushing the PML
buffers for all vcpus -- we only need to kick all vcpus out of guest mode, and
each vcpu's PML buffer is then flushed automatically.
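
See vmx_flush_pml_buffer() and kvm_flush_pml_buffers() in patch 6; the per-vcpu
flush, condensed:

	/* run at the beginning of every VMEXIT */
	pml_idx = vmcs_read16(GUEST_PML_INDEX);
	if (pml_idx == (PML_ENTITY_NUM - 1))
		return;				/* buffer empty, nothing to do */
	/* the index points to the next free entity; later entries are valid */
	pml_idx = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
	pml_buf = page_address(vmx->pml_pg);
	for (; pml_idx < PML_ENTITY_NUM; pml_idx++)
		mark_page_dirty(kvm, pml_buf[pml_idx] >> PAGE_SHIFT);
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);	/* reset index */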

3) Tests and benchmark results

I used the SPECjbb benchmark, which is memory intensive, to measure PML. All
tests were done with the configuration below:

Machine (Broadwell server): 16 CPUs (1.4GHz) + 4G memory
Host Kernel: KVM queue branch. Transparent Hugepage disabled. C-state, P-state,
	S-state disabled. Swap disabled.

Guest: Ubuntu 14.04 with kernel 3.13.0-36-generic
Guest: 4 vcpus + 1G memory. All vcpus are pinned.

a. Compare scores with and without PML enabled.

This is to make sure that always enabling PML for the guest does not introduce
any performance regression.

Booting guest with graphic window (no --nographic)

	NOPML		PML

	109755		109379
	108786		109300
	109234		109663
	109257		107471
	108514		108904
	109740		107623

avg:	109214		108723

performance regression: (109214 - 108723) / 109214 = 0.45%

Booting guest without graphic window (--nographic)

	NOPML		PML

	109090		109686
	109461		110533
	110523		108550
	109960		110775
	109090		109802
	110787		109192

avg:	109818		109756

performance regression: (109818 - 109756) / 109818 = 0.06%

So there's no noticeable performance regression from leaving PML always enabled.

b. Compare specjbb score between PML and Write Protection.

This is used to see how much performance gain PML brings when the guest is in
dirty logging mode.

I modified qemu by adding an additional "monitoring thread" that queries
dirty_bitmap periodically (once per second). With this thread, we can measure
the performance gain of PML by comparing the SPECjbb score on the PML code path
and on the write protection code path.
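
The qemu change itself is not part of this posting. Purely as a sketch of what
such a monitoring thread does against the KVM dirty log interface (the slot id,
bitmap size and vm_fd plumbing are assumptions made up for the sketch):

	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#define SLOT_PAGES	(1UL << 18)	/* assumed: 1G slot = 262144 4K pages */

	static void *dirty_log_monitor(void *arg)
	{
		int vm_fd = *(int *)arg;	/* assumed: fd from KVM_CREATE_VM */
		static uint64_t bitmap[SLOT_PAGES / 64];
		struct kvm_dirty_log log = { .slot = 0 };	/* assumed slot id */

		log.dirty_bitmap = bitmap;
		for (;;) {
			/* fills the bitmap from the slot's dirty_bitmap; with
			 * PML this is where the buffers get flushed first */
			if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
				break;
			/* ... consume bitmap ... */
			memset(bitmap, 0, sizeof(bitmap));
			sleep(1);		/* query once per second */
		}
		return NULL;
	}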

Again, I collected scores both with and without the guest's graphic window.

Booting guest with graphic window (no --nographic)

		PML		WP 		No monitoring thread

		104748		101358
		102934		99895
		103525		98832
		105331		100678
		106038		99476
		104776		99851

	avg:	104558		100015		108723 (== PML score in test a)

	percent: 96.17%		91.99%		100%

	performance gain:	96.17% - 91.99% = 4.18%

Booting guest without graphic window (--nographic)

		PML		WP		No monitoring thread
		
		104778		98967
		104856		99380
		103783		99406
		105210		100638
		106218		99763
		105475		99287
	
	avg:	105053		99573		109756 (== PML score in test a)

	percent: 95.72%		90.72%		100%

	performance gain:  95.72% - 90.72% = 5%

So there's a noticeable performance gain (around 4%~5%) from PML compared to
Write Protection.


Kai Huang (6):
  KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic
    for log dirty
  KVM: MMU: Add mmu help functions to support PML
  KVM: MMU: Explicitly set D-bit for writable spte.
  KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  KVM: x86: Add new dirty logging kvm_x86_ops for PML
  KVM: VMX: Add PML support in VMX

 arch/arm/kvm/mmu.c              |  18 ++-
 arch/x86/include/asm/kvm_host.h |  37 +++++-
 arch/x86/include/asm/vmx.h      |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/mmu.c              | 243 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/trace.h            |  18 +++
 arch/x86/kvm/vmx.c              | 195 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  78 +++++++++++--
 include/linux/kvm_host.h        |   2 +-
 virt/kvm/kvm_main.c             |   2 +-
 10 files changed, 577 insertions(+), 21 deletions(-)

-- 
2.1.0



* [PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-01-28  2:54 ` [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML Kai Huang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

We don't have to write protect guest memory for dirty logging if the
architecture supports hardware dirty logging, such as PML on VMX, so rename the
function to be more generic.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/arm/kvm/mmu.c       | 18 ++++++++++++++++--
 arch/x86/kvm/mmu.c       | 21 +++++++++++++++++++--
 include/linux/kvm_host.h |  2 +-
 virt/kvm/kvm_main.c      |  2 +-
 4 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 74aeaba..6034697 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1081,7 +1081,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 }
 
 /**
- * kvm_arch_mmu_write_protect_pt_masked() - write protect dirty pages
+ * kvm_mmu_write_protect_pt_masked() - write protect dirty pages
  * @kvm:	The KVM pointer
  * @slot:	The memory slot associated with mask
  * @gfn_offset:	The gfn offset in memory slot
@@ -1091,7 +1091,7 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
  * Walks bits set in mask write protects the associated pte's. Caller must
  * acquire kvm_mmu_lock.
  */
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 		struct kvm_memory_slot *slot,
 		gfn_t gfn_offset, unsigned long mask)
 {
@@ -1102,6 +1102,20 @@ void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
 	stage2_wp_range(kvm, start, end);
 }
 
+/*
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * dirty pages.
+ *
+ * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
+ * enable dirty logging for them.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+		struct kvm_memory_slot *slot,
+		gfn_t gfn_offset, unsigned long mask)
+{
+	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ed9f79..b18e65c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1216,7 +1216,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 }
 
 /**
- * kvm_arch_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1225,7 +1225,7 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 				     struct kvm_memory_slot *slot,
 				     gfn_t gfn_offset, unsigned long mask)
 {
@@ -1241,6 +1241,23 @@ void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
 	}
 }
 
+/**
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * PT level pages.
+ *
+ * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
+ * enable dirty logging for them.
+ *
+ * Used when we do not need to care about huge page mappings: e.g. during dirty
+ * logging we do not have any such mappings.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+				struct kvm_memory_slot *slot,
+				gfn_t gfn_offset, unsigned long mask)
+{
+	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+}
+
 static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	struct kvm_memory_slot *slot;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7d67195..32d0575 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -615,7 +615,7 @@ int kvm_get_dirty_log(struct kvm *kvm,
 int kvm_get_dirty_log_protect(struct kvm *kvm,
 			struct kvm_dirty_log *log, bool *is_dirty);
 
-void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 					struct kvm_memory_slot *slot,
 					gfn_t gfn_offset,
 					unsigned long mask);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a8490f0..0c28176 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1059,7 +1059,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
 		dirty_bitmap_buffer[i] = mask;
 
 		offset = i * BITS_PER_LONG;
-		kvm_arch_mmu_write_protect_pt_masked(kvm, memslot, offset,
+		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset,
 								mask);
 	}
 
-- 
2.1.0



* [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
  2015-01-28  2:54 ` [PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-02-03 17:34   ` Radim Krčmář
  2015-01-28  2:54 ` [PATCH 3/6] KVM: MMU: Explicitly set D-bit for writable spte Kai Huang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch adds new mmu layer functions to clear/set the D-bit for a memory
slot, and to write protect superpages for a memory slot.

With PML, the CPU logs the dirty GPA automatically to the PML buffer when it
updates a D-bit from 0 to 1, therefore we don't have to write protect 4K pages;
instead, we only need to clear the D-bit in order to log that GPA.

For superpages, we still write protect them and let the page fault code handle
dirty page logging, as we still need to split superpages into 4K pages for PML.

As PML is always enabled during the guest's lifetime, to eliminate unnecessary
PML GPA logging we set the D-bit manually for slots with dirty logging disabled.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |   9 ++
 arch/x86/kvm/mmu.c              | 195 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 204 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 843bea0..4f6369b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -835,6 +835,15 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+				   struct kvm_memory_slot *memslot);
+void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
+					struct kvm_memory_slot *memslot);
+void kvm_mmu_slot_set_dirty(struct kvm *kvm,
+			    struct kvm_memory_slot *memslot);
+void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+				   struct kvm_memory_slot *slot,
+				   gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b18e65c..c438224 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1215,6 +1215,60 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 	return flush;
 }
 
+static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
+
+	spte &= ~shadow_dirty_mask;
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		flush |= spte_clear_dirty(kvm, sptep);
+		sptep = rmap_get_next(&iter);
+	}
+
+	return flush;
+}
+
+static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
+{
+	u64 spte = *sptep;
+
+	rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
+
+	spte |= shadow_dirty_mask;
+
+	return mmu_spte_update(sptep, spte);
+}
+
+static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *sptep;
+	struct rmap_iterator iter;
+	bool flush = false;
+
+	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+		BUG_ON(!(*sptep & PT_PRESENT_MASK));
+
+		flush |= spte_set_dirty(kvm, sptep);
+		sptep = rmap_get_next(&iter);
+	}
+
+	return flush;
+}
+
 /**
  * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
@@ -1242,6 +1296,32 @@ static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
 }
 
 /**
+ * kvm_mmu_clear_dirty_pt_masked - clear MMU D-bit for PT level pages
+ * @kvm: kvm instance
+ * @slot: slot to clear D-bit
+ * @gfn_offset: start of the BITS_PER_LONG pages we care about
+ * @mask: indicates which pages we should clear D-bit
+ *
+ * Used for PML to re-log the dirty GPAs after userspace querying dirty_bitmap.
+ */
+void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+				     struct kvm_memory_slot *slot,
+				     gfn_t gfn_offset, unsigned long mask)
+{
+	unsigned long *rmapp;
+
+	while (mask) {
+		rmapp = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+				      PT_PAGE_TABLE_LEVEL, slot);
+		__rmap_clear_dirty(kvm, rmapp);
+
+		/* clear the first set bit */
+		mask &= mask - 1;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_clear_dirty_pt_masked);
+
+/**
  * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
  * PT level pages.
  *
@@ -4368,6 +4448,121 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 		kvm_flush_remote_tlbs(kvm);
 }
 
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+				   struct kvm_memory_slot *memslot)
+{
+	gfn_t last_gfn;
+	unsigned long *rmapp;
+	unsigned long last_index, index;
+	bool flush = false;
+
+	last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+	spin_lock(&kvm->mmu_lock);
+
+	rmapp = memslot->arch.rmap[PT_PAGE_TABLE_LEVEL - 1];
+	last_index = gfn_to_index(last_gfn, memslot->base_gfn,
+			PT_PAGE_TABLE_LEVEL);
+
+	for (index = 0; index <= last_index; ++index, ++rmapp) {
+		if (*rmapp)
+			flush |= __rmap_clear_dirty(kvm, rmapp);
+
+		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+			cond_resched_lock(&kvm->mmu_lock);
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/*
+	 * It's also safe to flush TLBs out of mmu lock here as currently this
+	 * function is only used for dirty logging, in which case flushing TLB
+	 * out of mmu lock also guarantees no dirty pages will be lost in
+	 * dirty_bitmap.
+	 */
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty);
+
+void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
+					struct kvm_memory_slot *memslot)
+{
+	gfn_t last_gfn;
+	int i;
+	bool flush = false;
+
+	last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+	spin_lock(&kvm->mmu_lock);
+
+	for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		unsigned long *rmapp;
+		unsigned long last_index, index;
+
+		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
+		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
+
+		for (index = 0; index <= last_index; ++index, ++rmapp) {
+			if (*rmapp)
+				flush |= __rmap_write_protect(kvm, rmapp,
+						false);
+
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	/* see kvm_mmu_slot_remove_write_access */
+	lockdep_assert_held(&kvm->slots_lock);
+
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_slot_largepage_remove_write_access);
+
+void kvm_mmu_slot_set_dirty(struct kvm *kvm,
+			    struct kvm_memory_slot *memslot)
+{
+	gfn_t last_gfn;
+	int i;
+	bool flush = false;
+
+	last_gfn = memslot->base_gfn + memslot->npages - 1;
+
+	spin_lock(&kvm->mmu_lock);
+
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		unsigned long *rmapp;
+		unsigned long last_index, index;
+
+		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
+		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
+
+		for (index = 0; index <= last_index; ++index, ++rmapp) {
+			if (*rmapp)
+				flush |= __rmap_set_dirty(kvm, rmapp);
+
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+	}
+
+	spin_unlock(&kvm->mmu_lock);
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	/* see kvm_mmu_slot_leaf_clear_dirty */
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
+
 #define BATCH_ZAP_PAGES	10
 static void kvm_zap_obsolete_pages(struct kvm *kvm)
 {
-- 
2.1.0



* [PATCH 3/6] KVM: MMU: Explicitly set D-bit for writable spte.
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
  2015-01-28  2:54 ` [PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty Kai Huang
  2015-01-28  2:54 ` [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-01-28  2:54 ` [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access Kai Huang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch avoids unnecessary dirty GPA logging to the PML buffer in the EPT
violation path by setting the D-bit manually before the write from the guest
happens.

We only set the D-bit manually in set_spte, and leave the fast_page_fault path
unchanged, as fast_page_fault is very unlikely to happen in the PML case.

For the hva <-> pa change case, the spte is either updated to read-only (host
pte is read-only) or dropped (host pte is writeable); both cases are handled by
the changes above, therefore no further change is necessary.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c438224..fb35535 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2597,8 +2597,14 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		}
 	}
 
-	if (pte_access & ACC_WRITE_MASK)
+	if (pte_access & ACC_WRITE_MASK) {
 		mark_page_dirty(vcpu->kvm, gfn);
+		/*
+		 * Explicitly set dirty bit. It is used to eliminate unnecessary
+		 * dirty GPA logging in case of PML is enabled on VMX.
+		 */
+		spte |= shadow_dirty_mask;
+	}
 
 set_pte:
 	if (mmu_spte_update(sptep, spte))
@@ -2914,6 +2920,16 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 */
 	gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
 
+	/*
+	 * Theoretically we could also set dirty bit (and flush TLB) here in
+	 * order to eliminate the unnecessary PML logging. See comments in
+	 * set_spte. But as in case of PML, fast_page_fault is very unlikely to
+	 * happen so we leave it unchanged. This might result in the same GPA
+	 * to be logged in PML buffer again when the write really happens, and
+	 * eventually to be called by mark_page_dirty twice. But it's also no
+	 * harm. This also avoids the TLB flush needed after setting dirty bit
+	 * so non-PML cases won't be impacted.
+	 */
 	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
 		mark_page_dirty(vcpu->kvm, gfn);
 
-- 
2.1.0



* [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
                   ` (2 preceding siblings ...)
  2015-01-28  2:54 ` [PATCH 3/6] KVM: MMU: Explicitly set D-bit for writable spte Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-02-03 16:28   ` Radim Krčmář
  2015-01-28  2:54 ` [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML Kai Huang
  2015-01-28  2:54 ` [PATCH 6/6] KVM: VMX: Add PML support in VMX Kai Huang
  5 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch changes the second parameter of kvm_mmu_slot_remove_write_access
from a slot id to a 'struct kvm_memory_slot *' to align with the kvm_x86_ops
dirty logging hooks, which will be introduced in a later patch.

A better way would be to change the second parameter of
kvm_arch_commit_memory_region from 'struct kvm_userspace_memory_region *' to
'struct kvm_memory_slot *new', but that requires changes to other non-x86
architectures too, so avoid it for now.
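
For reference, the alternative signature being referred to would look roughly
like this (hypothetical, not part of this series):

	/* hypothetical variant: pass the installed slot instead of the
	 * userspace region */
	void kvm_arch_commit_memory_region(struct kvm *kvm,
					   struct kvm_memory_slot *new,
					   const struct kvm_memory_slot *old,
					   enum kvm_mr_change change);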

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c              |  5 ++---
 arch/x86/kvm/x86.c              | 10 +++++++---
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4f6369b..67a98d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -834,7 +834,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+				      struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
 				   struct kvm_memory_slot *memslot);
 void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fb35535..6c24af3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4410,14 +4410,13 @@ void kvm_mmu_setup(struct kvm_vcpu *vcpu)
 	init_kvm_mmu(vcpu);
 }
 
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+				      struct kvm_memory_slot *memslot)
 {
-	struct kvm_memory_slot *memslot;
 	gfn_t last_gfn;
 	int i;
 	bool flush = false;
 
-	memslot = id_to_memslot(kvm->memslots, slot);
 	last_gfn = memslot->base_gfn + memslot->npages - 1;
 
 	spin_lock(&kvm->mmu_lock);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e10e3f..3a7fcff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7538,7 +7538,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 				const struct kvm_memory_slot *old,
 				enum kvm_mr_change change)
 {
-
+	struct kvm_memory_slot *new;
 	int nr_mmu_pages = 0;
 
 	if ((mem->slot >= KVM_USER_MEM_SLOTS) && (change == KVM_MR_DELETE)) {
@@ -7557,6 +7557,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 	if (nr_mmu_pages)
 		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
+
+	/* It's OK to get 'new' slot here as it has already been installed */
+	new = id_to_memslot(kvm->memslots, mem->slot);
+
 	/*
 	 * Write protect all pages for dirty logging.
 	 *
@@ -7566,8 +7570,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 *
 	 * See the comments in fast_page_fault().
 	 */
-	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
-		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
+	if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
+		kvm_mmu_slot_remove_write_access(kvm, new);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
2.1.0



* [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
                   ` (3 preceding siblings ...)
  2015-01-28  2:54 ` [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-02-03 15:53   ` Radim Krčmář
  2015-01-28  2:54 ` [PATCH 6/6] KVM: VMX: Add PML support in VMX Kai Huang
  5 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for a particular memory slot, and to flush potentially logged dirty GPAs
before reporting slot->dirty_bitmap to userspace.

KVM x86 common code calls these hooks when they are available, so the PML logic
can be kept VMX-specific. Other architectures won't be impacted as these hooks
are NULL for them.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h | 25 +++++++++++++++
 arch/x86/kvm/mmu.c              |  6 +++-
 arch/x86/kvm/x86.c              | 71 ++++++++++++++++++++++++++++++++++++-----
 3 files changed, 93 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 67a98d7..57916ec 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,6 +802,31 @@ struct kvm_x86_ops {
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+
+	/*
+	 * Arch-specific dirty logging hooks. These hooks are only supposed to
+	 * be valid if the specific arch has hardware-accelerated dirty logging
+	 * mechanism. Currently only for PML on VMX.
+	 *
+	 *  - slot_enable_log_dirty:
+	 *	called when enabling log dirty mode for the slot.
+	 *  - slot_disable_log_dirty:
+	 *	called when disabling log dirty mode for the slot.
+	 *	also called when slot is created with log dirty disabled.
+	 *  - flush_log_dirty:
+	 *	called before reporting dirty_bitmap to userspace.
+	 *  - enable_log_dirty_pt_masked:
+	 *	called when reenabling log dirty for the GFNs in the mask after
+	 *	corresponding bits are cleared in slot->dirty_bitmap.
+	 */
+	void (*slot_enable_log_dirty)(struct kvm *kvm,
+				      struct kvm_memory_slot *slot);
+	void (*slot_disable_log_dirty)(struct kvm *kvm,
+				       struct kvm_memory_slot *slot);
+	void (*flush_log_dirty)(struct kvm *kvm);
+	void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
+					   struct kvm_memory_slot *slot,
+					   gfn_t offset, unsigned long mask);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c24af3..c5833ca 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1335,7 +1335,11 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 				struct kvm_memory_slot *slot,
 				gfn_t gfn_offset, unsigned long mask)
 {
-	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+	if (kvm_x86_ops->enable_log_dirty_pt_masked)
+		kvm_x86_ops->enable_log_dirty_pt_masked(kvm, slot, gfn_offset,
+				mask);
+	else
+		kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
 static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3a7fcff..442ee7d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3780,6 +3780,12 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 
 	mutex_lock(&kvm->slots_lock);
 
+	/*
+	 * Flush potentially hardware-cached dirty pages to dirty_bitmap.
+	 */
+	if (kvm_x86_ops->flush_log_dirty)
+		kvm_x86_ops->flush_log_dirty(kvm);
+
 	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
 
 	/*
@@ -7533,6 +7539,56 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	return 0;
 }
 
+static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
+				     struct kvm_memory_slot *new)
+{
+	/* Still write protect RO slot */
+	if (new->flags & KVM_MEM_READONLY) {
+		kvm_mmu_slot_remove_write_access(kvm, new);
+		return;
+	}
+
+	/*
+	 * Call kvm_x86_ops dirty logging hooks when they are valid.
+	 *
+	 * kvm_x86_ops->slot_disable_log_dirty is called when:
+	 *
+	 *  - KVM_MR_CREATE with dirty logging is disabled
+	 *  - KVM_MR_FLAGS_ONLY with dirty logging is disabled in new flag
+	 *
+	 * The reason is, in case of PML, we need to set D-bit for any slots
+	 * with dirty logging disabled in order to eliminate unnecessary GPA
+	 * logging in PML buffer (and potential PML buffer full VMEXIT). This
+	 * guarantees leaving PML enabled during guest's lifetime won't have
+	 * any additional overhead from PML when guest is running with dirty
+	 * logging disabled for memory slots.
+	 *
+	 * kvm_x86_ops->slot_enable_log_dirty is called when switching new slot
+	 * to dirty logging mode.
+	 *
+	 * If kvm_x86_ops dirty logging hooks are invalid, use write protect.
+	 *
+	 * In case of write protect:
+	 *
+	 * Write protect all pages for dirty logging.
+	 *
+	 * All the sptes including the large sptes which point to this
+	 * slot are set to readonly. We can not create any new large
+	 * spte on this slot until the end of the logging.
+	 *
+	 * See the comments in fast_page_fault().
+	 */
+	if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+		if (kvm_x86_ops->slot_enable_log_dirty)
+			kvm_x86_ops->slot_enable_log_dirty(kvm, new);
+		else
+			kvm_mmu_slot_remove_write_access(kvm, new);
+	} else {
+		if (kvm_x86_ops->slot_disable_log_dirty)
+			kvm_x86_ops->slot_disable_log_dirty(kvm, new);
+	}
+}
+
 void kvm_arch_commit_memory_region(struct kvm *kvm,
 				struct kvm_userspace_memory_region *mem,
 				const struct kvm_memory_slot *old,
@@ -7562,16 +7618,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	new = id_to_memslot(kvm->memslots, mem->slot);
 
 	/*
-	 * Write protect all pages for dirty logging.
+	 * Set up write protection and/or dirty logging for the new slot.
 	 *
-	 * All the sptes including the large sptes which point to this
-	 * slot are set to readonly. We can not create any new large
-	 * spte on this slot until the end of the logging.
-	 *
-	 * See the comments in fast_page_fault().
+	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of old slot have
+	 * been zapped so no dirty logging work is needed for old slot. For
+	 * KVM_MR_FLAGS_ONLY, the old slot is essentially the same one as the
+	 * new and it's also covered when dealing with the new slot.
 	 */
-	if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
-		kvm_mmu_slot_remove_write_access(kvm, new);
+	if (change != KVM_MR_DELETE)
+		kvm_mmu_slot_apply_flags(kvm, new);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
2.1.0



* [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
                   ` (4 preceding siblings ...)
  2015-01-28  2:54 ` [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML Kai Huang
@ 2015-01-28  2:54 ` Kai Huang
  2015-02-03 15:18   ` Radim Krčmář
  5 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-01-28  2:54 UTC (permalink / raw)
  To: pbonzini, gleb, linux, kvm; +Cc: Kai Huang

This patch adds PML support in VMX. A new module parameter 'enable_pml' is
added to allow the user to enable/disable it manually.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
---
 arch/x86/include/asm/vmx.h      |   4 +
 arch/x86/include/uapi/asm/vmx.h |   1 +
 arch/x86/kvm/trace.h            |  18 ++++
 arch/x86/kvm/vmx.c              | 195 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   1 +
 5 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 45afaee..da772ed 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -69,6 +69,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 #define SECONDARY_EXEC_ENABLE_INVPCID		0x00001000
 #define SECONDARY_EXEC_SHADOW_VMCS              0x00004000
+#define SECONDARY_EXEC_ENABLE_PML               0x00020000
 #define SECONDARY_EXEC_XSAVES			0x00100000
 
 
@@ -121,6 +122,7 @@ enum vmcs_field {
 	GUEST_LDTR_SELECTOR             = 0x0000080c,
 	GUEST_TR_SELECTOR               = 0x0000080e,
 	GUEST_INTR_STATUS               = 0x00000810,
+	GUEST_PML_INDEX			= 0x00000812,
 	HOST_ES_SELECTOR                = 0x00000c00,
 	HOST_CS_SELECTOR                = 0x00000c02,
 	HOST_SS_SELECTOR                = 0x00000c04,
@@ -140,6 +142,8 @@ enum vmcs_field {
 	VM_EXIT_MSR_LOAD_ADDR_HIGH      = 0x00002009,
 	VM_ENTRY_MSR_LOAD_ADDR          = 0x0000200a,
 	VM_ENTRY_MSR_LOAD_ADDR_HIGH     = 0x0000200b,
+	PML_ADDRESS			= 0x0000200e,
+	PML_ADDRESS_HIGH		= 0x0000200f,
 	TSC_OFFSET                      = 0x00002010,
 	TSC_OFFSET_HIGH                 = 0x00002011,
 	VIRTUAL_APIC_PAGE_ADDR          = 0x00002012,
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index ff2b8e2..c5f1a1d 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -73,6 +73,7 @@
 #define EXIT_REASON_XSETBV              55
 #define EXIT_REASON_APIC_WRITE          56
 #define EXIT_REASON_INVPCID             58
+#define EXIT_REASON_PML_FULL            62
 #define EXIT_REASON_XSAVES              63
 #define EXIT_REASON_XRSTORS             64
 
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 587149b..a139977 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -846,6 +846,24 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+/*
+ * Tracepoint for PML full VMEXIT.
+ */
+TRACE_EVENT(kvm_pml_full,
+	TP_PROTO(unsigned int vcpu_id),
+	TP_ARGS(vcpu_id),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id			)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id		= vcpu_id;
+	),
+
+	TP_printk("vcpu %d: PML full", __entry->vcpu_id)
+);
+
 #endif /* CONFIG_X86_64 */
 
 TRACE_EVENT(kvm_ple_window,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c987374..de5ce82 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -101,6 +101,9 @@ module_param(nested, bool, S_IRUGO);
 
 static u64 __read_mostly host_xss;
 
+static bool __read_mostly enable_pml = 1;
+module_param_named(pml, enable_pml, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK (X86_CR0_NW | X86_CR0_CD)
 #define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST (X86_CR0_WP | X86_CR0_NE)
 #define KVM_VM_CR0_ALWAYS_ON						\
@@ -516,6 +519,10 @@ struct vcpu_vmx {
 	/* Dynamic PLE window. */
 	int ple_window;
 	bool ple_window_dirty;
+
+	/* Support for PML */
+#define PML_ENTITY_NUM		512
+	struct page *pml_pg;
 };
 
 enum segment_cache_field {
@@ -1068,6 +1075,11 @@ static inline bool cpu_has_vmx_shadow_vmcs(void)
 		SECONDARY_EXEC_SHADOW_VMCS;
 }
 
+static inline bool cpu_has_vmx_pml(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_ENABLE_PML;
+}
+
 static inline bool report_flexpriority(void)
 {
 	return flexpriority_enabled;
@@ -2924,7 +2936,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_APIC_REGISTER_VIRT |
 			SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
 			SECONDARY_EXEC_SHADOW_VMCS |
-			SECONDARY_EXEC_XSAVES;
+			SECONDARY_EXEC_XSAVES |
+			SECONDARY_EXEC_ENABLE_PML;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -4355,6 +4368,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	   a current VMCS12
 	*/
 	exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
+	/* PML is enabled/disabled in creating/destorying vcpu */
+	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+
 	return exec_control;
 }
 
@@ -5942,6 +5958,20 @@ static __init int hardware_setup(void)
 
 	update_ple_window_actual_max();
 
+	/*
+	 * Only enable PML when hardware supports PML feature, and both EPT
+	 * and EPT A/D bit features are enabled -- PML depends on them to work.
+	 */
+	if (!enable_ept || !enable_ept_ad_bits || !cpu_has_vmx_pml())
+		enable_pml = 0;
+
+	if (!enable_pml) {
+		kvm_x86_ops->slot_enable_log_dirty = NULL;
+		kvm_x86_ops->slot_disable_log_dirty = NULL;
+		kvm_x86_ops->flush_log_dirty = NULL;
+		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
+	}
+
 	return alloc_kvm_area();
 
 out7:
@@ -6971,6 +7001,31 @@ static bool vmx_test_pir(struct kvm_vcpu *vcpu, int vector)
 	return pi_test_pir(vector, &vmx->pi_desc);
 }
 
+static int handle_pml_full(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qualification;
+
+	trace_kvm_pml_full(vcpu->vcpu_id);
+
+	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	/*
+	 * PML buffer FULL happened while executing iret from NMI,
+	 * "blocked by NMI" bit has to be set before next VM entry.
+	 */
+	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
+			cpu_has_virtual_nmis() &&
+			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
+		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+				GUEST_INTR_STATE_NMI);
+
+	/*
+	 * PML buffer already flushed at beginning of VMEXIT. Nothing to do
+	 * here.., and there's no userspace involvement needed for PML.
+	 */
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7019,6 +7074,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVVPID]                 = handle_invvpid,
 	[EXIT_REASON_XSAVES]                  = handle_xsaves,
 	[EXIT_REASON_XRSTORS]                 = handle_xrstors,
+	[EXIT_REASON_PML_FULL]		      = handle_pml_full,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -7325,6 +7381,89 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
 	*info2 = vmcs_read32(VM_EXIT_INTR_INFO);
 }
 
+static int vmx_enable_pml(struct vcpu_vmx *vmx)
+{
+	struct page *pml_pg;
+	u32 exec_control;
+
+	pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!pml_pg)
+		return -ENOMEM;
+
+	vmx->pml_pg = pml_pg;
+
+	vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
+	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+
+	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	exec_control |= SECONDARY_EXEC_ENABLE_PML;
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+
+	return 0;
+}
+
+static void vmx_disable_pml(struct vcpu_vmx *vmx)
+{
+	u32 exec_control;
+
+	ASSERT(vmx->pml_pg);
+	__free_page(vmx->pml_pg);
+	vmx->pml_pg = NULL;
+
+	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
+	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
+	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+}
+
+static void vmx_flush_pml_buffer(struct vcpu_vmx *vmx)
+{
+	struct kvm *kvm = vmx->vcpu.kvm;
+	u64 *pml_buf;
+	u16 pml_idx;
+
+	pml_idx = vmcs_read16(GUEST_PML_INDEX);
+
+	/* Do nothing if PML buffer is empty */
+	if (pml_idx == (PML_ENTITY_NUM - 1))
+		return;
+
+	/* PML index always points to next available PML buffer entity */
+	if (pml_idx >= PML_ENTITY_NUM)
+		pml_idx = 0;
+	else
+		pml_idx++;
+
+	pml_buf = page_address(vmx->pml_pg);
+	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
+		u64 gpa;
+
+		gpa = pml_buf[pml_idx];
+		WARN_ON(gpa & (PAGE_SIZE - 1));
+		mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
+	}
+
+	/* reset PML index */
+	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+}
+
+/*
+ * Flush all vcpus' PML buffer and update logged GPAs to dirty_bitmap.
+ * Called before reporting dirty_bitmap to userspace.
+ */
+static void kvm_flush_pml_buffers(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	/*
+	 * We only need to kick vcpu out of guest mode here, as PML buffer
+	 * is flushed at beginning of all VMEXITs, and it's obvious that only
+	 * vcpus running in guest are possible to have unflushed GPAs in PML
+	 * buffer.
+	 */
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_vcpu_kick(vcpu);
+}
+
 /*
  * The guest has exited.  See if we can fix it or if we need userspace
  * assistance.
@@ -7335,6 +7474,16 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 	u32 exit_reason = vmx->exit_reason;
 	u32 vectoring_info = vmx->idt_vectoring_info;
 
+	/*
+	 * Flush logged GPAs from the PML buffer; this keeps dirty_bitmap more
+	 * up to date. Another benefit is that, in kvm_vm_ioctl_get_dirty_log,
+	 * before querying dirty_bitmap we only need to kick all vcpus out of
+	 * guest mode, as once a vcpu is in root mode its PML buffer must have
+	 * been flushed already.
+	 */
+	if (enable_pml)
+		vmx_flush_pml_buffer(vmx);
+
 	/* If guest state is invalid, start emulating */
 	if (vmx->emulation_required)
 		return handle_invalid_guest_state(vcpu);
@@ -7981,6 +8130,8 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (enable_pml)
+		vmx_disable_pml(vmx);
 	free_vpid(vmx);
 	leave_guest_mode(vcpu);
 	vmx_load_vmcs01(vcpu);
@@ -8051,6 +8202,18 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	vmx->nested.current_vmptr = -1ull;
 	vmx->nested.current_vmcs12 = NULL;
 
+	/*
+	 * If PML is turned on, failure on enabling PML just results in failure
+	 * of creating the vcpu, therefore we can simplify PML logic (by
+	 * avoiding dealing with cases, such as enabling PML partially on vcpus
+	 * for the guest, etc.).
+	 */
+	if (enable_pml) {
+		err = vmx_enable_pml(vmx);
+		if (err)
+			goto free_vmcs;
+	}
+
 	return &vmx->vcpu;
 
 free_vmcs:
@@ -9492,6 +9655,31 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 		shrink_ple_window(vcpu);
 }
 
+static void vmx_slot_enable_log_dirty(struct kvm *kvm,
+				     struct kvm_memory_slot *slot)
+{
+	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
+	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
+}
+
+static void vmx_slot_disable_log_dirty(struct kvm *kvm,
+				       struct kvm_memory_slot *slot)
+{
+	kvm_mmu_slot_set_dirty(kvm, slot);
+}
+
+static void vmx_flush_log_dirty(struct kvm *kvm)
+{
+	kvm_flush_pml_buffers(kvm);
+}
+
+static void vmx_enable_log_dirty_pt_masked(struct kvm *kvm,
+					   struct kvm_memory_slot *memslot,
+					   gfn_t offset, unsigned long mask)
+{
+	kvm_mmu_clear_dirty_pt_masked(kvm, memslot, offset, mask);
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -9601,6 +9789,11 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.check_nested_events = vmx_check_nested_events,
 
 	.sched_in = vmx_sched_in,
+
+	.slot_enable_log_dirty = vmx_slot_enable_log_dirty,
+	.slot_disable_log_dirty = vmx_slot_disable_log_dirty,
+	.flush_log_dirty = vmx_flush_log_dirty,
+	.enable_log_dirty_pt_masked = vmx_enable_log_dirty_pt_masked,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 442ee7d..1373e04 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7880,3 +7880,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
-- 
2.1.0



* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-01-28  2:54 ` [PATCH 6/6] KVM: VMX: Add PML support in VMX Kai Huang
@ 2015-02-03 15:18   ` Radim Krčmář
  2015-02-03 15:39     ` Paolo Bonzini
  2015-02-05  6:23     ` Kai Huang
  0 siblings, 2 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-03 15:18 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-01-28 10:54+0800, Kai Huang:
> This patch adds PML support in VMX. A new module parameter 'enable_pml' is added

(+module_param_named(pml, enable_pml, bool, S_IRUGO);)

> added to allow the user to enable/disable it manually.
> 
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> ---
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 587149b..a139977 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -846,6 +846,24 @@ TRACE_EVENT(kvm_track_tsc,
> +TRACE_EVENT(kvm_pml_full,
> +	TP_PROTO(unsigned int vcpu_id),
> +	TP_ARGS(vcpu_id),
> +
> +	TP_STRUCT__entry(
> +		__field(	unsigned int,	vcpu_id			)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->vcpu_id		= vcpu_id;
> +	),
> +
> +	TP_printk("vcpu %d: PML full", __entry->vcpu_id)
> +);
> +
>  #endif /* CONFIG_X86_64 */

You have it protected by CONFIG_X86_64, but use it unconditionally.

(Also, we can get all this information from kvm_exit tracepoint;
 I'd just drop it.)

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c987374..de5ce82 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -516,6 +519,10 @@ struct vcpu_vmx {
> +	/* Support for PML */
> +#define PML_ENTITY_NUM		512

(Readers of this struct likely aren't interested in this number, so it
 would be nicer to define it outside.  I thought it is here to hint the
 amount of allocated pages, but PML_ENTITY_NUM / 512 isn't obvious.)

> +	struct page *pml_pg;
> @@ -4355,6 +4368,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>  	   a current VMCS12
>  	*/
>  	exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
> +	/* PML is enabled/disabled in creating/destorying vcpu */
> +	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

What is the harm of enabling it here?

(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)

> +
>  	return exec_control;
>  }
> @@ -6971,6 +7001,31 @@ static bool vmx_test_pir(struct kvm_vcpu *vcpu, int vector)
> +static int handle_pml_full(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long exit_qualification;
> +
> +	trace_kvm_pml_full(vcpu->vcpu_id);
> +
> +	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +
> +	/*
> +	 * PML buffer FULL happened while executing iret from NMI,
> +	 * "blocked by NMI" bit has to be set before next VM entry.
> +	 */
> +	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
> +			cpu_has_virtual_nmis() &&
> +			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
> +		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
> +				GUEST_INTR_STATE_NMI);

Relevant part of the specification is pasted from 27.2.2 (Information
for VM Exits Due to Vectored Events), which makes me think that this bit
is mirrored to IDT-vectoring information field ...

Isn't vmx_recover_nmi_blocking() enough?

(I see the same code in handle_ept_violation(), but wasn't that needed
 just because of a hardware error?)

> +
> +	/*
> +	 * PML buffer already flushed at beginning of VMEXIT. Nothing to do
> +	 * here.., and there's no userspace involvement needed for PML.
> +	 */
> +	return 1;
> +}
> @@ -7325,6 +7381,89 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
> +static int vmx_enable_pml(struct vcpu_vmx *vmx)
> +{
> +	struct page *pml_pg;
> +	u32 exec_control;
> +
> +	pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!pml_pg)
> +		return -ENOMEM;
> +
> +	vmx->pml_pg = pml_pg;

(It's safe to use vmx->pml_pg directly.)

> +
> +	vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
> +	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
> +
> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> +	exec_control |= SECONDARY_EXEC_ENABLE_PML;
> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
> +
> +	return 0;
> +}
> +static void vmx_flush_pml_buffer(struct vcpu_vmx *vmx)
> +{
> +	struct kvm *kvm = vmx->vcpu.kvm;
> +	u64 *pml_buf;
> +	u16 pml_idx;
> +
> +	pml_idx = vmcs_read16(GUEST_PML_INDEX);
> +
> +	/* Do nothing if PML buffer is empty */
> +	if (pml_idx == (PML_ENTITY_NUM - 1))
> +		return;
> +
> +	/* PML index always points to next available PML buffer entity */
> +	if (pml_idx >= PML_ENTITY_NUM)
> +		pml_idx = 0;
> +	else
> +		pml_idx++;

If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
so it would be better to use just 'pml_idx++' and unsigned modulo.

(Could also 'assert(pml_idx < PML_ENTITY_NUM)' then.)
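
A sketch of the suggested variant, assuming pml_idx stays a u16 and the only
out-of-range value the CPU leaves is 0xffff ("buffer full"):

	/* 0xffff wraps to 0; any in-range value just advances by one */
	pml_idx = (pml_idx + 1) % PML_ENTITY_NUM;
	/* from here on, pml_idx < PML_ENTITY_NUM always holds */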

> +
> +	pml_buf = page_address(vmx->pml_pg);
> +	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
> +		u64 gpa;
> +
> +		gpa = pml_buf[pml_idx];
> +		WARN_ON(gpa & (PAGE_SIZE - 1));
> +		mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
> +	}
> +
> +	/* reset PML index */
> +	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
> +}
> @@ -9492,6 +9655,31 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> +				     struct kvm_memory_slot *slot)
> +{
> +	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);

(New slot contains dirty pages?)

> +	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
> +}

Thanks.


* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-03 15:18   ` Radim Krčmář
@ 2015-02-03 15:39     ` Paolo Bonzini
  2015-02-03 16:02       ` Radim Krčmář
  2015-02-05  6:23     ` Kai Huang
  1 sibling, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2015-02-03 15:39 UTC (permalink / raw)
  To: Radim Krčmář, Kai Huang; +Cc: gleb, linux, kvm



On 03/02/2015 16:18, Radim Krčmář wrote:
> (I see the same code in handle_ept_violation(), but wasn't that needed
>  just because of a hardware error?)

That was how I read it initially, but actually that means: "this
statement could be broken if the processor has that erratum".

>> +static void vmx_slot_enable_log_dirty(struct kvm *kvm,
>> +				     struct kvm_memory_slot *slot)
>> +{
>> +	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> 
> (New slot contains dirty pages?)

New slots contain clean pages as far as the KVM dirty log is concerned.

In the case of PML, note that D=1 does not mean the page is dirty.  It
only means that writes will not be logged by PML.  The page may thus
also have logging disabled.

Paolo


* Re: [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML
  2015-01-28  2:54 ` [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML Kai Huang
@ 2015-02-03 15:53   ` Radim Krčmář
  2015-02-05  6:29     ` Kai Huang
  0 siblings, 1 reply; 22+ messages in thread
From: Radim Krčmář @ 2015-02-03 15:53 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-01-28 10:54+0800, Kai Huang:
> This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
> logging for a particular memory slot, and to flush potentially logged dirty
> GPAs before reporting slot->dirty_bitmap to userspace.
> 
> KVM x86 common code calls these hooks when they are available, so the PML
> logic can be kept VMX-specific. Other architectures won't be impacted as
> these hooks are NULL for them.
> 
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> ---
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -802,6 +802,31 @@ struct kvm_x86_ops {
> +
> +	/*
> +	 * Arch-specific dirty logging hooks. These hooks are only supposed to
> +	 * be valid if the specific arch has hardware-accelerated dirty logging
> +	 * mechanism. Currently only for PML on VMX.
> +	 *
> +	 *  - slot_enable_log_dirty:
> +	 *	called when enabling log dirty mode for the slot.

(I guess that "log dirty mode" isn't the meaning that people will think
 after seeing 'log_dirty' ...
 I'd at least change 'log_dirty' to 'dirty_log' in these names.)

> +	 *  - slot_disable_log_dirty:
> +	 *	called when disabling log dirty mode for the slot.
> +	 *	also called when slot is created with log dirty disabled.
> +	 *  - flush_log_dirty:
> +	 *	called before reporting dirty_bitmap to userspace.
> +	 *  - enable_log_dirty_pt_masked:
> +	 *	called when reenabling log dirty for the GFNs in the mask after
> +	 *	corresponding bits are cleared in slot->dirty_bitmap.

This name is very confusing ... I think we should hint that this is
called after we learn that the page has been written to and would like
to monitor it again.

Using something like collected/refresh?  (I'd have to do horrible things
to come up with a good name, sorry.)

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3780,6 +3780,12 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  
>  	mutex_lock(&kvm->slots_lock);
>  
> +	/*
> +	 * Flush potentially hardware-cached dirty pages to dirty_bitmap.
> +	 */
> +	if (kvm_x86_ops->flush_log_dirty)
> +		kvm_x86_ops->flush_log_dirty(kvm);

(Flushing would make more sense in kvm_get_dirty_log_protect().)

> +
>  	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
>  
>  	/*
> @@ -7533,6 +7539,56 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	return 0;
>  }
>  
> +static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
> +				     struct kvm_memory_slot *new)
> +{
> +	/* Still write protect RO slot */
> +	if (new->flags & KVM_MEM_READONLY) {
> +		kvm_mmu_slot_remove_write_access(kvm, new);

We didn't write protect RO slots before, does this patch depend on it?

> @@ -7562,16 +7618,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> -	if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> -		kvm_mmu_slot_remove_write_access(kvm, new);
> +	if (change != KVM_MR_DELETE)
> +		kvm_mmu_slot_apply_flags(kvm, new);


* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-03 15:39     ` Paolo Bonzini
@ 2015-02-03 16:02       ` Radim Krčmář
  0 siblings, 0 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-03 16:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kai Huang, gleb, linux, kvm

2015-02-03 16:39+0100, Paolo Bonzini:
> 
> 
> On 03/02/2015 16:18, Radim Krčmář wrote:
> > (I see the same code in handle_ept_violation(), but wasn't that needed
> >  just because of a hardware error?)
> 
> That was how I read it initially, but actually that means: "this
> statement could be broken if the processor has that erratum".

Thanks, that was a nice ruse for the original bug :)

> >> +static void vmx_slot_enable_log_dirty(struct kvm *kvm,
> >> +				     struct kvm_memory_slot *slot)
> >> +{
> >> +	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> > 
> > (New slot contains dirty pages?)
> 
> New slots contain clean pages as far as the KVM dirty log is concerned.
> 
> In the case of PML, note that D=1 does not mean the page is dirty.  It
> only means that writes will not be logged by PML.  The page may thus
> also have logging disabled.

Yeah, it would be a problem if we had dirty pages at the beginning, but
I don't think it is possible, although I was too lazy to check.
(It's not important and I wanted to do this review today :)


* Re: [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
  2015-01-28  2:54 ` [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access Kai Huang
@ 2015-02-03 16:28   ` Radim Krčmář
  0 siblings, 0 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-03 16:28 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-01-28 10:54+0800, Kai Huang:
> This patch changes the second parameter of kvm_mmu_slot_remove_write_access from
> 'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging
> hooks, which will be introduced in a later patch.
> 
> A better way would be to change the second parameter of kvm_arch_commit_memory_region
> from 'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot *new', but that
> requires changes to other non-x86 ARCHs too, so avoid it for now.
> 
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> ---

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7557,6 +7557,10 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  
>  	if (nr_mmu_pages)
>  		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);
> +
> +	/* It's OK to get 'new' slot here as it has already been installed */
> +	new = id_to_memslot(kvm->memslots, mem->slot);

(Doing it early doesn't hurt.)

> @@ -7566,8 +7570,8 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  	 *
>  	 * See the comments in fast_page_fault().
>  	 */
> -	if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
> -		kvm_mmu_slot_remove_write_access(kvm, mem->slot);
> +	if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
> +		kvm_mmu_slot_remove_write_access(kvm, new);
>  }
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
> -- 
> 2.1.0
> 


* Re: [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML
  2015-01-28  2:54 ` [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML Kai Huang
@ 2015-02-03 17:34   ` Radim Krčmář
  2015-02-05  5:59     ` Kai Huang
  0 siblings, 1 reply; 22+ messages in thread
From: Radim Krčmář @ 2015-02-03 17:34 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-01-28 10:54+0800, Kai Huang:
> This patch adds new mmu layer functions to clear/set the D-bit for a memory slot,
> and to write protect superpages for a memory slot.
> 
> In case of PML, the CPU logs the dirty GPA automatically to the PML buffer when it
> updates a D-bit from 0 to 1; therefore we don't have to write protect 4K pages,
> we only need to clear the D-bit in order to log that GPA.
> 
> For superpages, we still write protect them and let the page fault code handle
> dirty page logging, as we still need to split superpages into 4K pages even with PML.
> 
> As PML is always enabled during the guest's lifetime, to eliminate unnecessary PML
> GPA logging, we set the D-bit manually for slots with dirty logging disabled.
> 
> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>

This message contains just a few stylistic tips; I wasn't able to
learn enough about large pages to review that part today.

> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index b18e65c..c438224 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4368,6 +4448,121 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> +void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
> +					struct kvm_memory_slot *memslot)
> +{
> +	gfn_t last_gfn;
> +	int i;
> +	bool flush = false;
> +
> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */

("+ 1" is the only difference from kvm_mmu_slot_remove_write_access();
 new argument, initial level, would be better.

 Btw, PT_PAGE_TABLE_LEVEL + 1 == PT_DIRECTORY_LEVEL)

> +	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
> +		unsigned long *rmapp;
> +		unsigned long last_index, index;
> +
> +		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
> +		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
> +
> +		for (index = 0; index <= last_index; ++index, ++rmapp) {
> +			if (*rmapp)
> +				flush |= __rmap_write_protect(kvm, rmapp,
> +						false);
> +
> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> +				cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	/* see kvm_mmu_slot_remove_write_access */
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +void kvm_mmu_slot_set_dirty(struct kvm *kvm,
> +			    struct kvm_memory_slot *memslot)
> +{
> +	gfn_t last_gfn;
> +	int i;
> +	bool flush = false;
> +
> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
> +
> +	spin_lock(&kvm->mmu_lock);
> +
> +	for (i = PT_PAGE_TABLE_LEVEL;
> +	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
> +		unsigned long *rmapp;
> +		unsigned long last_index, index;
> +
> +		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
> +		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
> +
> +		for (index = 0; index <= last_index; ++index, ++rmapp) {
> +			if (*rmapp)
> +				flush |= __rmap_set_dirty(kvm, rmapp);

(And yet another similar walker ... We can either pass a worker function
 accepting 'kvm' and 'rmapp', use a 'switch' with a new operations enum,
 or have code duplication.  Paolo?)
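
A rough sketch of the worker-function variant (names here are made up, and
__rmap_write_protect() would need a thin wrapper because of its extra bool
argument):

typedef bool (*slot_rmap_fn)(struct kvm *kvm, unsigned long *rmapp);

/*
 * One walker shared by the slot-wide functions; callers still take
 * mmu_lock and do the TLB flush exactly as the current code does.
 */
static bool slot_walk_rmaps(struct kvm *kvm, struct kvm_memory_slot *memslot,
			    int start_level, int end_level, slot_rmap_fn fn)
{
	gfn_t last_gfn = memslot->base_gfn + memslot->npages - 1;
	bool flush = false;
	int i;

	for (i = start_level; i <= end_level; ++i) {
		unsigned long *rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
		unsigned long last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
		unsigned long index;

		for (index = 0; index <= last_index; ++index, ++rmapp) {
			if (*rmapp)
				flush |= fn(kvm, rmapp);

			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
				cond_resched_lock(&kvm->mmu_lock);
		}
	}

	return flush;
}

kvm_mmu_slot_set_dirty() would then pass __rmap_set_dirty with the full level
range, the largepage variant would pass PT_DIRECTORY_LEVEL as start_level, and
kvm_mmu_slot_leaf_clear_dirty() would pass PT_PAGE_TABLE_LEVEL for both bounds.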

> +
> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> +				cond_resched_lock(&kvm->mmu_lock);
> +		}
> +	}
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	/* see kvm_mmu_slot_leaf_clear_dirty */
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
> +void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> +				   struct kvm_memory_slot *memslot)
> +{
> +	gfn_t last_gfn;
> +	unsigned long *rmapp;
> +	unsigned long last_index, index;
> +	bool flush = false;
> +
> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
> +
> +	spin_lock(&kvm->mmu_lock);
> +

(If we abstracted with switch or function, we could also add max level
 argument to cover this function.)

> +	rmapp = memslot->arch.rmap[PT_PAGE_TABLE_LEVEL - 1];
> +	last_index = gfn_to_index(last_gfn, memslot->base_gfn,
> +			PT_PAGE_TABLE_LEVEL);
> +
> +	for (index = 0; index <= last_index; ++index, ++rmapp) {
> +		if (*rmapp)
> +			flush |= __rmap_clear_dirty(kvm, rmapp);
> +
> +		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> +			cond_resched_lock(&kvm->mmu_lock);
> +	}
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	lockdep_assert_held(&kvm->slots_lock);
> +
> +	/*
> +	 * It's also safe to flush TLBs out of mmu lock here as currently this
> +	 * function is only used for dirty logging, in which case flushing TLB
> +	 * out of mmu lock also guarantees no dirty pages will be lost in
> +	 * dirty_bitmap.
> +	 */
> +	if (flush)
> +		kvm_flush_remote_tlbs(kvm);
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty);
> @@ -1215,6 +1215,60 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,

(After looking at following two pairs of functions, you'll hopefully
 understand why I don't like C very much.)

> +static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
> +{
> +	u64 spte = *sptep;
> +
> +	rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
> +
> +	spte &= ~shadow_dirty_mask;
> +
> +	return mmu_spte_update(sptep, spte);
> +}
> +static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
> +{
> +	u64 spte = *sptep;
> +
> +	rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
> +
> +	spte |= shadow_dirty_mask;
> +
> +	return mmu_spte_update(sptep, spte);
> +}

> +static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	bool flush = false;
> +
> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +		flush |= spte_clear_dirty(kvm, sptep);
> +		sptep = rmap_get_next(&iter);
> +	}
> +
> +	return flush;
> +}
> +static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
> +{
> +	u64 *sptep;
> +	struct rmap_iterator iter;
> +	bool flush = false;
> +
> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
> +
> +		flush |= spte_set_dirty(kvm, sptep);
> +		sptep = rmap_get_next(&iter);
> +	}
> +
> +	return flush;
> +}


* Re: [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML
  2015-02-03 17:34   ` Radim Krčmář
@ 2015-02-05  5:59     ` Kai Huang
  2015-02-05 14:51       ` Radim Krčmář
  0 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-02-05  5:59 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: pbonzini, gleb, linux, kvm


On 02/04/2015 01:34 AM, Radim Krčmář wrote:
> 2015-01-28 10:54+0800, Kai Huang:
>> This patch adds new mmu layer functions to clear/set D-bit for memory slot, and
>> to write protect superpages for memory slot.
>>
>> In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU
>> updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages,
>> instead, we only need to clear D-bit in order to log that GPA.
>>
>> For superpages, we still write protect it and let page fault code to handle
>> dirty page logging, as we still need to split superpage to 4K pages in PML.
>>
>> As PML is always enabled during guest's lifetime, to eliminate unnecessary PML
>> GPA logging, we set D-bit manually for the slot with dirty logging disabled.
>>
>> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
> This message contains just several  stylistic tips, I wasn't able to
> learn enough about large pages to review today.
Hi Radim,

Thanks for your review and sorry for the late reply. I was working on 
something else these days.

There are two reasons we still write protect large pages in case of 
PML. One is that we still need to split large pages into 4K pages in dirty 
logging mode, and PML doesn't split large pages automatically. The second 
is that PML only logs the dirty GPA; it doesn't tell us whether that GPA 
belongs to a 4K page or a large page.

For the rest of your comments: as this patch series is already in 
Paolo's queue branch, I think I can send further patches to address 
them, if Paolo agrees the enhancements are needed.

Thanks,
-Kai
>
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index b18e65c..c438224 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4368,6 +4448,121 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>> +void kvm_mmu_slot_largepage_remove_write_access(struct kvm *kvm,
>> +					struct kvm_memory_slot *memslot)
>> +{
>> +	gfn_t last_gfn;
>> +	int i;
>> +	bool flush = false;
>> +
>> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
>> +
>> +	spin_lock(&kvm->mmu_lock);
>> +
>> +	for (i = PT_PAGE_TABLE_LEVEL + 1; /* skip rmap for 4K page */
> ("+ 1" is the only difference from kvm_mmu_slot_remove_write_access();
>   new argument, initial level, would be better.
>
>   Btw, PT_PAGE_TABLE_LEVEL + 1 == PT_DIRECTORY_LEVEL)
>
>> +	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
>> +		unsigned long *rmapp;
>> +		unsigned long last_index, index;
>> +
>> +		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
>> +		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
>> +
>> +		for (index = 0; index <= last_index; ++index, ++rmapp) {
>> +			if (*rmapp)
>> +				flush |= __rmap_write_protect(kvm, rmapp,
>> +						false);
>> +
>> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> +				cond_resched_lock(&kvm->mmu_lock);
>> +		}
>> +	}
>> +	spin_unlock(&kvm->mmu_lock);
>> +
>> +	/* see kvm_mmu_slot_remove_write_access */
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +
>> +	if (flush)
>> +		kvm_flush_remote_tlbs(kvm);
>> +}
>> +void kvm_mmu_slot_set_dirty(struct kvm *kvm,
>> +			    struct kvm_memory_slot *memslot)
>> +{
>> +	gfn_t last_gfn;
>> +	int i;
>> +	bool flush = false;
>> +
>> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
>> +
>> +	spin_lock(&kvm->mmu_lock);
>> +
>> +	for (i = PT_PAGE_TABLE_LEVEL;
>> +	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
>> +		unsigned long *rmapp;
>> +		unsigned long last_index, index;
>> +
>> +		rmapp = memslot->arch.rmap[i - PT_PAGE_TABLE_LEVEL];
>> +		last_index = gfn_to_index(last_gfn, memslot->base_gfn, i);
>> +
>> +		for (index = 0; index <= last_index; ++index, ++rmapp) {
>> +			if (*rmapp)
>> +				flush |= __rmap_set_dirty(kvm, rmapp);
> (And yet another similar walker ... We can either pass a worker function
>   accepting 'kvm' and 'rmapp', use a 'switch' with a new operations enum,
>   or have code duplication.  Paolo?)
>
>> +
>> +			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> +				cond_resched_lock(&kvm->mmu_lock);
>> +		}
>> +	}
>> +
>> +	spin_unlock(&kvm->mmu_lock);
>> +
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +
>> +	/* see kvm_mmu_slot_leaf_clear_dirty */
>> +	if (flush)
>> +		kvm_flush_remote_tlbs(kvm);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
>> +void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
>> +				   struct kvm_memory_slot *memslot)
>> +{
>> +	gfn_t last_gfn;
>> +	unsigned long *rmapp;
>> +	unsigned long last_index, index;
>> +	bool flush = false;
>> +
>> +	last_gfn = memslot->base_gfn + memslot->npages - 1;
>> +
>> +	spin_lock(&kvm->mmu_lock);
>> +
> (If we abstracted with switch or function, we could also add max level
>   argument to cover this function.)
>
>> +	rmapp = memslot->arch.rmap[PT_PAGE_TABLE_LEVEL - 1];
>> +	last_index = gfn_to_index(last_gfn, memslot->base_gfn,
>> +			PT_PAGE_TABLE_LEVEL);
>> +
>> +	for (index = 0; index <= last_index; ++index, ++rmapp) {
>> +		if (*rmapp)
>> +			flush |= __rmap_clear_dirty(kvm, rmapp);
>> +
>> +		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> +			cond_resched_lock(&kvm->mmu_lock);
>> +	}
>> +
>> +	spin_unlock(&kvm->mmu_lock);
>> +
>> +	lockdep_assert_held(&kvm->slots_lock);
>> +
>> +	/*
>> +	 * It's also safe to flush TLBs out of mmu lock here as currently this
>> +	 * function is only used for dirty logging, in which case flushing TLB
>> +	 * out of mmu lock also guarantees no dirty pages will be lost in
>> +	 * dirty_bitmap.
>> +	 */
>> +	if (flush)
>> +		kvm_flush_remote_tlbs(kvm);
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_mmu_slot_leaf_clear_dirty);
>> @@ -1215,6 +1215,60 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> (After looking at following two pairs of functions, you'll hopefully
>   understand why I don't like C very much.)
>
>> +static bool spte_clear_dirty(struct kvm *kvm, u64 *sptep)
>> +{
>> +	u64 spte = *sptep;
>> +
>> +	rmap_printk("rmap_clear_dirty: spte %p %llx\n", sptep, *sptep);
>> +
>> +	spte &= ~shadow_dirty_mask;
>> +
>> +	return mmu_spte_update(sptep, spte);
>> +}
>> +static bool spte_set_dirty(struct kvm *kvm, u64 *sptep)
>> +{
>> +	u64 spte = *sptep;
>> +
>> +	rmap_printk("rmap_set_dirty: spte %p %llx\n", sptep, *sptep);
>> +
>> +	spte |= shadow_dirty_mask;
>> +
>> +	return mmu_spte_update(sptep, spte);
>> +}
>> +static bool __rmap_clear_dirty(struct kvm *kvm, unsigned long *rmapp)
>> +{
>> +	u64 *sptep;
>> +	struct rmap_iterator iter;
>> +	bool flush = false;
>> +
>> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
>> +
>> +		flush |= spte_clear_dirty(kvm, sptep);
>> +		sptep = rmap_get_next(&iter);
>> +	}
>> +
>> +	return flush;
>> +}
>> +static bool __rmap_set_dirty(struct kvm *kvm, unsigned long *rmapp)
>> +{
>> +	u64 *sptep;
>> +	struct rmap_iterator iter;
>> +	bool flush = false;
>> +
>> +	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>> +		BUG_ON(!(*sptep & PT_PRESENT_MASK));
>> +
>> +		flush |= spte_set_dirty(kvm, sptep);
>> +		sptep = rmap_get_next(&iter);
>> +	}
>> +
>> +	return flush;
>> +}



* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-03 15:18   ` Radim Krčmář
  2015-02-03 15:39     ` Paolo Bonzini
@ 2015-02-05  6:23     ` Kai Huang
  2015-02-05 15:04       ` Radim Krčmář
  2015-02-06 16:00       ` Paolo Bonzini
  1 sibling, 2 replies; 22+ messages in thread
From: Kai Huang @ 2015-02-05  6:23 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: pbonzini, gleb, linux, kvm


On 02/03/2015 11:18 PM, Radim Krčmář wrote:
> 2015-01-28 10:54+0800, Kai Huang:
>> This patch adds PML support in VMX. A new module parameter 'enable_pml' is added
> (+module_param_named(pml, enable_pml, bool, S_IRUGO);)
>
>> to allow user to enable/disable it manually.
>>
>> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
>> ---
>> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
>> index 587149b..a139977 100644
>> --- a/arch/x86/kvm/trace.h
>> +++ b/arch/x86/kvm/trace.h
>> @@ -846,6 +846,24 @@ TRACE_EVENT(kvm_track_tsc,
>> +TRACE_EVENT(kvm_pml_full,
>> +	TP_PROTO(unsigned int vcpu_id),
>> +	TP_ARGS(vcpu_id),
>> +
>> +	TP_STRUCT__entry(
>> +		__field(	unsigned int,	vcpu_id			)
>> +	),
>> +
>> +	TP_fast_assign(
>> +		__entry->vcpu_id		= vcpu_id;
>> +	),
>> +
>> +	TP_printk("vcpu %d: PML full", __entry->vcpu_id)
>> +);
>> +
>>   #endif /* CONFIG_X86_64 */
> You have it protected by CONFIG_X86_64, but use it unconditionally.
Thanks for catching. This has been fixed by another patch, and the fix 
has also been merged by Paolo.

Thanks,
-Kai
>
> (Also, we can get all this information from kvm_exit tracepoint;
>   I'd just drop it.)
>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index c987374..de5ce82 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -516,6 +519,10 @@ struct vcpu_vmx {
>> +	/* Support for PML */
>> +#define PML_ENTITY_NUM		512
> (Readers of this struct likely aren't interested in this number, so it
>   would be nicer to define it outside.  I thought it is here to hint the
>   amount of allocated pages, but PML_ENTITY_NUM / 512 isn't obvious.)
>
>> +	struct page *pml_pg;
>> @@ -4355,6 +4368,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
>>   	   a current VMCS12
>>   	*/
>>   	exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS;
>> +	/* PML is enabled/disabled in creating/destroying vcpu */
>> +	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
> What is the harm of enabling it here?
>
> (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
Because the PML feature detection is unconditional (meaning 
SECONDARY_EXEC_ENABLE_PML is always in 
vmcs_config.cpu_based_2nd_exec_ctrl), while the PML buffer is only created 
when the vcpu is created, and that is controlled by the 'enable_pml' module 
parameter. If we always enabled SECONDARY_EXEC_ENABLE_PML here, the bit 
would be set even though no PML buffer is created when PML is disabled by 
the 'enable_pml' parameter, so it's better to enable the bit along with 
creating the PML buffer.

>
>> +
>>   	return exec_control;
>>   }
>> @@ -6971,6 +7001,31 @@ static bool vmx_test_pir(struct kvm_vcpu *vcpu, int vector)
>> +static int handle_pml_full(struct kvm_vcpu *vcpu)
>> +{
>> +	unsigned long exit_qualification;
>> +
>> +	trace_kvm_pml_full(vcpu->vcpu_id);
>> +
>> +	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> +
>> +	/*
>> +	 * PML buffer FULL happened while executing iret from NMI,
>> +	 * "blocked by NMI" bit has to be set before next VM entry.
>> +	 */
>> +	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
>> +			cpu_has_virtual_nmis() &&
>> +			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
>> +		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
>> +				GUEST_INTR_STATE_NMI);
> Relevant part of the specification is pasted from 27.2.2 (Information
> for VM Exits Due to Vectored Events), which makes me think that this bit
> is mirrored to IDT-vectoring information field ...
>
> Isn't vmx_recover_nmi_blocking() enough?
>
> (I see the same code in handle_ept_violation(), but wasn't that needed
>   just because of a hardware error?)
This needs to be handled in both EPT violation and PML. If you look at 
the PML specification (the link is in the cover letter), you can see the 
definition of bit 12 of exit_qualification (section 1.5), which explains 
the code above. The same definition of exit_qualification is in the EPT 
violation part of Intel's SDM.

>> +
>> +	/*
>> +	 * PML buffer already flushed at beginning of VMEXIT. Nothing to do
>> +	 * here.., and there's no userspace involvement needed for PML.
>> +	 */
>> +	return 1;
>> +}
>> @@ -7325,6 +7381,89 @@ static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2)
>> +static int vmx_enable_pml(struct vcpu_vmx *vmx)
>> +{
>> +	struct page *pml_pg;
>> +	u32 exec_control;
>> +
>> +	pml_pg = alloc_page(GFP_KERNEL | __GFP_ZERO);
>> +	if (!pml_pg)
>> +		return -ENOMEM;
>> +
>> +	vmx->pml_pg = pml_pg;
> (It's safe to use vmx->pml_pg directly.)
>
>> +
>> +	vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
>> +	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>> +
>> +	exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
>> +	exec_control |= SECONDARY_EXEC_ENABLE_PML;
>> +	vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>> +
>> +	return 0;
>> +}
>> +static void vmx_flush_pml_buffer(struct vcpu_vmx *vmx)
>> +{
>> +	struct kvm *kvm = vmx->vcpu.kvm;
>> +	u64 *pml_buf;
>> +	u16 pml_idx;
>> +
>> +	pml_idx = vmcs_read16(GUEST_PML_INDEX);
>> +
>> +	/* Do nothing if PML buffer is empty */
>> +	if (pml_idx == (PML_ENTITY_NUM - 1))
>> +		return;
>> +
>> +	/* PML index always points to next available PML buffer entity */
>> +	if (pml_idx >= PML_ENTITY_NUM)
>> +		pml_idx = 0;
>> +	else
>> +		pml_idx++;
> If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
In this case, the log is full, actually. The PML index is set to 
PML_ENTITY_NUM - 1 initially, and hardware decreases it after a GPA is logged.

Below is the pseudocode of PML hardware logic.

IF (PML Index[15:9] ≠ 0)
     THEN VM exit;
FI;
set accessed and dirty flags for EPT;
IF (a dirty flag was updated from 0 to 1)
THEN
     PML address[PML index] ← 4-KByte-aligned guest-physical address;
     PML index is decremented;
FI;

Thanks,
-Kai
> so it would be better to use just 'pml_idx++' and unsigned modulo.
>
> (Could also 'assert(pml_idx < PML_ENTITY_NUM)' then.)
>
>> +
>> +	pml_buf = page_address(vmx->pml_pg);
>> +	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
>> +		u64 gpa;
>> +
>> +		gpa = pml_buf[pml_idx];
>> +		WARN_ON(gpa & (PAGE_SIZE - 1));
>> +		mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
>> +	}
>> +
>> +	/* reset PML index */
>> +	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>> +}
>> @@ -9492,6 +9655,31 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
>> +static void vmx_slot_enable_log_dirty(struct kvm *kvm,
>> +				     struct kvm_memory_slot *slot)
>> +{
>> +	kvm_mmu_slot_leaf_clear_dirty(kvm, slot);
> (New slot contains dirty pages?)
>
>> +	kvm_mmu_slot_largepage_remove_write_access(kvm, slot);
>> +}
> Thanks.



* Re: [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML
  2015-02-03 15:53   ` Radim Krčmář
@ 2015-02-05  6:29     ` Kai Huang
  2015-02-05 14:52       ` Radim Krčmář
  0 siblings, 1 reply; 22+ messages in thread
From: Kai Huang @ 2015-02-05  6:29 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: pbonzini, gleb, linux, kvm


On 02/03/2015 11:53 PM, Radim Krčmář wrote:
> 2015-01-28 10:54+0800, Kai Huang:
>> This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
>> logging for particular memory slot, and to flush potentially logged dirty GPAs
>> before reporting slot->dirty_bitmap to userspace.
>>
>> kvm x86 common code calls these hooks when they are available so PML logic can
>> be hidden to VMX specific. Other ARCHs won't be impacted as these hooks are NULL
>> for them.
>>
>> Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
>> ---
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -802,6 +802,31 @@ struct kvm_x86_ops {
>> +
>> +	/*
>> +	 * Arch-specific dirty logging hooks. These hooks are only supposed to
>> +	 * be valid if the specific arch has hardware-accelerated dirty logging
>> +	 * mechanism. Currently only for PML on VMX.
>> +	 *
>> +	 *  - slot_enable_log_dirty:
>> +	 *	called when enabling log dirty mode for the slot.
> (I guess that "log dirty mode" isn't the meaning that people will think
>   after seeing 'log_dirty' ...
>   I'd at least change 'log_dirty' to 'dirty_log' in these names.)
>
>> +	 *  - slot_disable_log_dirty:
>> +	 *	called when disabling log dirty mode for the slot.
>> +	 *	also called when slot is created with log dirty disabled.
>> +	 *  - flush_log_dirty:
>> +	 *	called before reporting dirty_bitmap to userspace.
>> +	 *  - enable_log_dirty_pt_masked:
>> +	 *	called when reenabling log dirty for the GFNs in the mask after
>> +	 *	corresponding bits are cleared in slot->dirty_bitmap.
> This name is very confusing ... I think we should hint that this is
> called after we learn that the page has been written to and would like
> to monitor it again.
>
> Using something like collected/refresh?  (I'd have to do horrible things
> to come up with a good name, sorry.)
>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3780,6 +3780,12 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>>   
>>   	mutex_lock(&kvm->slots_lock);
>>   
>> +	/*
>> +	 * Flush potentially hardware-cached dirty pages to dirty_bitmap.
>> +	 */
>> +	if (kvm_x86_ops->flush_log_dirty)
>> +		kvm_x86_ops->flush_log_dirty(kvm);
> (Flushing would make more sense in kvm_get_dirty_log_protect().)
>
>> +
>>   	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
>>   
>>   	/*
>> @@ -7533,6 +7539,56 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>>   	return 0;
>>   }
>>   
>> +static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>> +				     struct kvm_memory_slot *new)
>> +{
>> +	/* Still write protect RO slot */
>> +	if (new->flags & KVM_MEM_READONLY) {
>> +		kvm_mmu_slot_remove_write_access(kvm, new);
> We didn't write protect RO slots before, does this patch depend on it?
No, PML doesn't depend on it to work. It was suggested by Paolo.

Thanks,
-Kai
>
>> @@ -7562,16 +7618,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>> -	if ((change != KVM_MR_DELETE) && (new->flags & KVM_MEM_LOG_DIRTY_PAGES))
>> -		kvm_mmu_slot_remove_write_access(kvm, new);
>> +	if (change != KVM_MR_DELETE)
>> +		kvm_mmu_slot_apply_flags(kvm, new);



* Re: [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML
  2015-02-05  5:59     ` Kai Huang
@ 2015-02-05 14:51       ` Radim Krčmář
  0 siblings, 0 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-05 14:51 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-02-05 13:59+0800, Kai Huang:
> Thanks for your review and sorry for the late reply. I was working on
> something else these days.

(No problem, a reply within a week is instantaneous for me :)

> For rest of your comments, as this patch series have already been in Paolo's
> queue branch, I think I can send further patches to enhance them, if Paolo
> agrees the enhancement is needed.

Ah, sorry, I didn't notice the patches were already in ...
it's better to wait for a bug in the common part now.


* Re: [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML
  2015-02-05  6:29     ` Kai Huang
@ 2015-02-05 14:52       ` Radim Krčmář
  0 siblings, 0 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-05 14:52 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-02-05 14:29+0800, Kai Huang:
> >>+	/* Still write protect RO slot */
> >>+	if (new->flags & KVM_MEM_READONLY) {
> >>+		kvm_mmu_slot_remove_write_access(kvm, new);
> >We didn't write protect RO slots before, does this patch depend on it?
> No PML doesn't depend on it to work. It's suggested by Paolo.

Thanks, it would have deserved a separate patch, IMO.


* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-05  6:23     ` Kai Huang
@ 2015-02-05 15:04       ` Radim Krčmář
  2015-02-06  0:22         ` Kai Huang
  2015-02-06  0:28         ` Kai Huang
  2015-02-06 16:00       ` Paolo Bonzini
  1 sibling, 2 replies; 22+ messages in thread
From: Radim Krčmář @ 2015-02-05 15:04 UTC (permalink / raw)
  To: Kai Huang; +Cc: pbonzini, gleb, linux, kvm

2015-02-05 14:23+0800, Kai Huang:
> On 02/03/2015 11:18 PM, Radim Krčmář wrote:
> >You have it protected by CONFIG_X86_64, but use it unconditionally.
> Thanks for catching. This has been fixed by another patch, and the fix has
> also been merged by Paolo.

(And I haven't noticed the followup either ...)

I don't know of any present bugs in this patch and all questions have
been cleared,

Thanks.

> >>+	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
> >What is the harm of enabling it here?
> >
> >(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
> Because the PML feature detection is unconditional (meaning
> SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
> but the PML buffer is only created when vcpu is created, and it is
> controlled by 'enable_pml' module parameter, if we always enable
> SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
> disabled by 'enable_pml' parameter,

I meant

  if (!enable_pml)
    exec_control &= ~SECONDARY_EXEC_ENABLE_PML

here and no exec_control operations in vmx_create_vcpu().

>                                     so it's better to enable it along with
> creating PML buffer.

I think the reason why KVM split the setup into vmx_create_vcpu() and
vmx_secondary_exec_control() is to simplify nested virtualization.

> >>+	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
> >>+			cpu_has_virtual_nmis() &&
> >>+			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
> >>+		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
> >>+				GUEST_INTR_STATE_NMI);
> >Relevant part of the specification is pasted from 27.2.2 (Information
> >for VM Exits Due to Vectored Events), which makes me think that this bit
> >is mirrored to IDT-vectoring information field ...
> >
> >Isn't vmx_recover_nmi_blocking() enough?
> >
> >(I see the same code in handle_ept_violation(), but wasn't that needed
> >  just because of a hardware error?)
> This needs to be handled in both EPT violation and PML. If you look at the
> PML specification (the link is in cover letter), you can see the definition
> of bit 12 of exit_qualification (section 1.5), which explains above code.
> The same definition of exit_qualification is in EPT violation part in
> Intel's SDM.

True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
(This was humourously pasted to PML as well.)

> >>+	/* PML index always points to next available PML buffer entity */
> >>+	if (pml_idx >= PML_ENTITY_NUM)
> >>+		pml_idx = 0;
> >>+	else
> >>+		pml_idx++;
> >If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
> In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
> - 1 initially, and hardware decrease it after GPA is logged.
> 
> Below is the pseudocode of PML hardware logic.
> 
> IF (PML Index[15:9] ≠ 0)
>     THEN VM exit;
> FI;

pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,

> set accessed and dirty flags for EPT;
> IF (a dirty flag was updated from 0 to 1)
> THEN
>     PML address[PML index] ← 4-KByte-aligned guest-physical address;
>     PML index is decremented;

0xffff is the only value that specifies a full buffer; any other
out-of-range value means that we incorrectly set up the initial index and
VMX exited without changing the buffer => we shouldn't read it.
(I should have said "untouched" instead of "empty".)

> FI;


* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-05 15:04       ` Radim Krčmář
@ 2015-02-06  0:22         ` Kai Huang
  2015-02-06  0:28         ` Kai Huang
  1 sibling, 0 replies; 22+ messages in thread
From: Kai Huang @ 2015-02-06  0:22 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: pbonzini, gleb, linux, kvm


On 02/05/2015 11:04 PM, Radim Krčmář wrote:
> 2015-02-05 14:23+0800, Kai Huang:
>> On 02/03/2015 11:18 PM, Radim Krčmář wrote:
>>> You have it protected by CONFIG_X86_64, but use it unconditionally.
>> Thanks for catching. This has been fixed by another patch, and the fix has
>> also been merged by Paolo.
> (And I haven't noticed the followup either ...)
>
> I don't know of any present bugs in this patch and all questions have
> been cleared,
>
> Thanks.
>
>>>> +	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
>>> What is the harm of enabling it here?
>>>
>>> (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
>> Because the PML feature detection is unconditional (meaning
>> SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
>> but the PML buffer is only created when vcpu is created, and it is
>> controlled by 'enable_pml' module parameter, if we always enable
>> SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
>> disabled by 'enable_pml' parameter,
> I meant
>
>    if (!enable_pml)
>      exec_control &= ~SECONDARY_EXEC_ENABLE_PML
>
> here and no exec_control operations in vmx_create_vcpu().
This also works. I originally thought that putting the enabling part and 
the buffer allocation together was more straightforward.
>
>>                                      so it's better to enable it along with
>> creating PML buffer.
> I think the reason why KVM split the setup into vmx_create_vcpu() and
> vmx_secondary_exec_control() is to simplify nested virtualization.
This is a good point and I'll think about it :)

>
>>>> +	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
>>>> +			cpu_has_virtual_nmis() &&
>>>> +			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
>>>> +		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
>>>> +				GUEST_INTR_STATE_NMI);
>>> Relevant part of the specification is pasted from 27.2.2 (Information
>>> for VM Exits Due to Vectored Events), which makes me think that this bit
>>> is mirrored to IDT-vectoring information field ...
>>>
>>> Isn't vmx_recover_nmi_blocking() enough?
>>>
>>> (I see the same code in handle_ept_violation(), but wasn't that needed
>>>   just because of a hardware error?)
>> This needs to be handled in both EPT violation and PML. If you look at the
>> PML specification (the link is in cover letter), you can see the definition
>> of bit 12 of exit_qualification (section 1.5), which explains above code.
>> The same definition of exit_qualification is in EPT violation part in
>> Intel's SDM.
> True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
> and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
> (This was humourously pasted to PML as well.)
>
>>>> +	/* PML index always points to next available PML buffer entity */
>>>> +	if (pml_idx >= PML_ENTITY_NUM)
>>>> +		pml_idx = 0;
>>>> +	else
>>>> +		pml_idx++;
>>> If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
>> In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
>> - 1 initially, and hardware decrease it after GPA is logged.
>>
>> Below is the pseudocode of PML hardware logic.
>>
>> IF (PML Index[15:9] ≠ 0)
>>      THEN VM exit;
>> FI;
> pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,
Yes.
>
>> set accessed and dirty flags for EPT;
>> IF (a dirty flag was updated from 0 to 1)
>> THEN
>>      PML address[PML index] ← 4-KByte-aligned guest-physical address;
>>      PML index is decremented;
> 0xffff is the only value that specifies full buffer, the rest means
> that we incorrectly set up the initial index and VMX exited without
> changing the buffer => we shouldn't read it.
> (I should have said "untouched" instead of "empty".)
Yes, theoretically, according to the pseudocode, 0xffff is the only 
possible value that specifies a full buffer. Probably a better way is to 
check that pml_idx is in the range [-1, 511] at the beginning and give a 
warning if it's not, or just assert it. Then we can simply do a 
++pml_idx (after changing pml_idx to a signed short type). But the current 
code should also work anyway.
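
Something along these lines, just as a sketch of that idea inside
vmx_flush_pml_buffer(), with kvm and pml_buf set up as in the patch (this is
not the code that was merged):

	short pml_idx = (short)vmcs_read16(GUEST_PML_INDEX); /* 0xffff becomes -1 */

	/*
	 * Valid range is [-1, 511]: -1 (0xffff) means the buffer is full,
	 * 511 means nothing has been logged since the last reset.
	 */
	if (WARN_ON_ONCE(pml_idx < -1 || pml_idx >= PML_ENTITY_NUM))
		return;

	/* logged GPAs occupy indices pml_idx + 1 .. PML_ENTITY_NUM - 1 */
	for (++pml_idx; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa = pml_buf[pml_idx];

		WARN_ON(gpa & (PAGE_SIZE - 1));
		mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
	}

	/* reset PML index as before */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);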

Thanks,
-Kai
>
>> FI;



* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-05 15:04       ` Radim Krčmář
  2015-02-06  0:22         ` Kai Huang
@ 2015-02-06  0:28         ` Kai Huang
  1 sibling, 0 replies; 22+ messages in thread
From: Kai Huang @ 2015-02-06  0:28 UTC (permalink / raw)
  To: Radim Krčmář; +Cc: pbonzini, gleb, linux, kvm


On 02/05/2015 11:04 PM, Radim Krčmář wrote:
> 2015-02-05 14:23+0800, Kai Huang:
>> On 02/03/2015 11:18 PM, Radim Krčmář wrote:
>>> You have it protected by CONFIG_X86_64, but use it unconditionally.
>> Thanks for catching. This has been fixed by another patch, and the fix has
>> also been merged by Paolo.
> (And I haven't noticed the followup either ...)
>
> I don't know of any present bugs in this patch and all questions have
> been cleared,
So far I see no other bug reported :)

Thanks,
-Kai
>
> Thanks.
>
>>>> +	exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
>>> What is the harm of enabling it here?
>>>
>>> (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
>> Because the PML feature detection is unconditional (meaning
>> SECONDARY_EXEC_ENABLE_PML is always in vmcs_config.cpu_based_2nd_exec_ctrl),
>> but the PML buffer is only created when vcpu is created, and it is
>> controlled by 'enable_pml' module parameter, if we always enable
>> SECONDARY_EXEC_ENABLE_PML here, no PML buffer will be created if PML is
>> disabled by 'enable_pml' parameter,
> I meant
>
>    if (!enable_pml)
>      exec_control &= ~SECONDARY_EXEC_ENABLE_PML
>
> here and no exec_control operations in vmx_create_vcpu().
>
>>                                      so it's better to enable it along with
>> creating PML buffer.
> I think the reason why KVM split the setup into vmx_create_vcpu() and
> vmx_secondary_exec_control() is to simplify nested virtualization.
>
>>>> +	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
>>>> +			cpu_has_virtual_nmis() &&
>>>> +			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
>>>> +		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
>>>> +				GUEST_INTR_STATE_NMI);
>>> Relevant part of the specification is pasted from 27.2.2 (Information
>>> for VM Exits Due to Vectored Events), which makes me think that this bit
>>> is mirrored to IDT-vectoring information field ...
>>>
>>> Isn't vmx_recover_nmi_blocking() enough?
>>>
>>> (I see the same code in handle_ept_violation(), but wasn't that needed
>>>   just because of a hardware error?)
>> This needs to be handled in both EPT violation and PML. If you look at the
>> PML specification (the link is in cover letter), you can see the definition
>> of bit 12 of exit_qualification (section 1.5), which explains above code.
>> The same definition of exit_qualification is in EPT violation part in
>> Intel's SDM.
> True ... IDT-vectoring doesn't set bit 12 because PML exit isn't a fault
> and 27.2.2: "— For all other relevant VM exits, bit 12 is cleared to 0."
> (This was humourously pasted to PML as well.)
>
>>>> +	/* PML index always points to next available PML buffer entity */
>>>> +	if (pml_idx >= PML_ENTITY_NUM)
>>>> +		pml_idx = 0;
>>>> +	else
>>>> +		pml_idx++;
>>> If the pml_idx is >= PML_ENTITY_NUM and < 0xffff, the log is empty,
>> In this case, the log is full, actually. PML index is set to PML_ENTITY_NUM
>> - 1 initially, and hardware decrease it after GPA is logged.
>>
>> Below is the pseudocode of PML hardware logic.
>>
>> IF (PML Index[15:9] ≠ 0)
>>      THEN VM exit;
>> FI;
> pml_idx >= PML_ENTITY_NUM exits without modifying the buffer,
>
>> set accessed and dirty flags for EPT;
>> IF (a dirty flag was updated from 0 to 1)
>> THEN
>>      PML address[PML index] ← 4-KByte-aligned guest-physical address;
>>      PML index is decremented;
> 0xffff is the only value that specifies full buffer, the rest means
> that we incorrectly set up the initial index and VMX exited without
> changing the buffer => we shouldn't read it.
> (I should have said "untouched" instead of "empty".)
>
>> FI;



* Re: [PATCH 6/6] KVM: VMX: Add PML support in VMX
  2015-02-05  6:23     ` Kai Huang
  2015-02-05 15:04       ` Radim Krčmář
@ 2015-02-06 16:00       ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2015-02-06 16:00 UTC (permalink / raw)
  To: Kai Huang, Radim Krčmář; +Cc: gleb, linux, kvm



On 05/02/2015 07:23, Kai Huang wrote:
>>>
>>> +    /* PML is enabled/disabled in creating/destorying vcpu */
>>> +    exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
>> What is the harm of enabling it here?
>>
>> (SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES seems similar and does it.)
>
> Because the PML feature detection is unconditional (meaning
> SECONDARY_EXEC_ENABLE_PML is always in
> vmcs_config.cpu_based_2nd_exec_ctrl), but the PML buffer is only created
> when vcpu is created, and it is controlled by 'enable_pml' module
> parameter, if we always enable SECONDARY_EXEC_ENABLE_PML here, no PML
> buffer will be created if PML is disabled by 'enable_pml' parameter, so
> it's better to enable it along with creating PML buffer.

I guess this is the most interesting comment from Radim.

Paolo


end of thread

Thread overview: 22+ messages
2015-01-28  2:54 [PATCH 0/6] KVM: VMX: Page Modification Logging (PML) support Kai Huang
2015-01-28  2:54 ` [PATCH 1/6] KVM: Rename kvm_arch_mmu_write_protect_pt_masked to be more generic for log dirty Kai Huang
2015-01-28  2:54 ` [PATCH 2/6] KVM: MMU: Add mmu help functions to support PML Kai Huang
2015-02-03 17:34   ` Radim Krčmář
2015-02-05  5:59     ` Kai Huang
2015-02-05 14:51       ` Radim Krčmář
2015-01-28  2:54 ` [PATCH 3/6] KVM: MMU: Explicitly set D-bit for writable spte Kai Huang
2015-01-28  2:54 ` [PATCH 4/6] KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access Kai Huang
2015-02-03 16:28   ` Radim Krčmář
2015-01-28  2:54 ` [PATCH 5/6] KVM: x86: Add new dirty logging kvm_x86_ops for PML Kai Huang
2015-02-03 15:53   ` Radim Krčmář
2015-02-05  6:29     ` Kai Huang
2015-02-05 14:52       ` Radim Krčmář
2015-01-28  2:54 ` [PATCH 6/6] KVM: VMX: Add PML support in VMX Kai Huang
2015-02-03 15:18   ` Radim Krčmář
2015-02-03 15:39     ` Paolo Bonzini
2015-02-03 16:02       ` Radim Krčmář
2015-02-05  6:23     ` Kai Huang
2015-02-05 15:04       ` Radim Krčmář
2015-02-06  0:22         ` Kai Huang
2015-02-06  0:28         ` Kai Huang
2015-02-06 16:00       ` Paolo Bonzini
