public inbox for kvm@vger.kernel.org
* [PATCH 0/3] KVM: MMU: Improve rmap handling
@ 2012-03-15  9:18 Takuya Yoshikawa
  2012-03-15  9:19 ` [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well Takuya Yoshikawa
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15  9:18 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm

This patch series is based on my srcu-less dirty logging patch series
and is being sent to let everybody know about possible improvements
we can add on top of that.

As pasted below, we can achieve 15% improvement for the worst case.

I will rebase and post this again after srcu-less gets merged.

	Takuya


===
with this    simple-srcu-less   pages  improvement
 469907.8            493900.4      8K           5%
 668887.6            760268.2     16K          14%
1110787.2           1238709.6     32K          12%
2079386.2           2359523.6     64K          13%
3944530.8           4540780.6    128K          15%
7536232.6           8747274.0    256K          16%

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well
  2012-03-15  9:18 [PATCH 0/3] KVM: MMU: Improve rmap handling Takuya Yoshikawa
@ 2012-03-15  9:19 ` Takuya Yoshikawa
  2012-03-15  9:22   ` Avi Kivity
  2012-03-15  9:20 ` [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap Takuya Yoshikawa
  2012-03-15  9:21 ` [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next() Takuya Yoshikawa
  2 siblings, 1 reply; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15  9:19 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm

We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
bytes do not fit cache lines well.  Furthermore some allocators may
use 64/32-byte objects for the pte_list_desc cache.

This patch solves this problem by raising PTE_LIST_EXT to 7.

Note: with EPT/NPT we almost always have a single spte in each reverse
mapping and nothing will be changed by this.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc5f245..384d3c0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -135,8 +135,6 @@ module_param(dbg, bool, 0644);
 #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
 			| PT64_NX_MASK)
 
-#define PTE_LIST_EXT 4
-
 #define ACC_EXEC_MASK    1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
 #define ACC_USER_MASK    PT_USER_MASK
@@ -151,6 +149,9 @@ module_param(dbg, bool, 0644);
 
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
+/* make pte_list_desc fit well in cache line */
+#define PTE_LIST_EXT 7
+
 struct pte_list_desc {
 	u64 *sptes[PTE_LIST_EXT];
 	struct pte_list_desc *more;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap
  2012-03-15  9:18 [PATCH 0/3] KVM: MMU: Improve rmap handling Takuya Yoshikawa
  2012-03-15  9:19 ` [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well Takuya Yoshikawa
@ 2012-03-15  9:20 ` Takuya Yoshikawa
  2012-03-15  9:44   ` Avi Kivity
  2012-03-15  9:21 ` [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next() Takuya Yoshikawa
  2 siblings, 1 reply; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15  9:20 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm

Iteration using rmap_next(), whose actual body is pte_list_next(), is
inefficient: every time we call it we start by checking whether rmap
holds a single spte or points to a descriptor which links more sptes.

In the case of shadow paging, this quadratic total iteration cost is a
problem.  Even for two dimensional paging, with EPT/NPT on, in which we
almost always have a single spte, the extra checks at the end of the
iteration should be eliminated.

This patch fixes this by introducing rmap_iterator which keeps the
iteration context for the next search.  Furthermore the implementation
of rmap_next() is split into two functions - rmap_get_first() and
rmap_get_next() - to avoid repeatedly checking whether the rmap being
iterated on has only one spte.

Note: we just remove pte_list_next() because we can think of parent_ptes
as a reverse mapping.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c       |  198 ++++++++++++++++++++++++++++------------------
 arch/x86/kvm/mmu_audit.c |    8 +-
 2 files changed, 124 insertions(+), 82 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 384d3c0..d042087 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -842,32 +842,6 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 	return count;
 }
 
-static u64 *pte_list_next(unsigned long *pte_list, u64 *spte)
-{
-	struct pte_list_desc *desc;
-	u64 *prev_spte;
-	int i;
-
-	if (!*pte_list)
-		return NULL;
-	else if (!(*pte_list & 1)) {
-		if (!spte)
-			return (u64 *)*pte_list;
-		return NULL;
-	}
-	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
-	prev_spte = NULL;
-	while (desc) {
-		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
-			if (prev_spte == spte)
-				return desc->sptes[i];
-			prev_spte = desc->sptes[i];
-		}
-		desc = desc->more;
-	}
-	return NULL;
-}
-
 static void
 pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
 			   int i, struct pte_list_desc *prev_desc)
@@ -988,11 +962,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 	return pte_list_add(vcpu, spte, rmapp);
 }
 
-static u64 *rmap_next(unsigned long *rmapp, u64 *spte)
-{
-	return pte_list_next(rmapp, spte);
-}
-
 static void rmap_remove(struct kvm *kvm, u64 *spte)
 {
 	struct kvm_mmu_page *sp;
@@ -1005,6 +974,72 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	pte_list_remove(spte, rmapp);
 }
 
+/*
+ * Used by the following functions to iterate over the sptes linked by a rmap.
+ * Only sptep can be used outside of these functions.
+ */
+struct rmap_iterator {
+	u64 *sptep;			/* points to the current spte */
+	/* private fields */
+	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
+	int pos;			/* index of the sptep */
+};
+
+/*
+ * Iteration must be started by this function.  This should also be used after
+ * removing/dropping sptes from rmap because in such cases the information in
+ * the iterator may not be valid.
+ *
+ * Returns true if spte is found, false otherwise.
+ */
+static bool rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
+{
+	if (!rmap) {
+		iter->sptep = NULL;
+		return false;
+	}
+
+	if (!(rmap & 1)) {
+		iter->sptep = (u64 *)rmap;
+		iter->desc = NULL;
+	} else {
+		iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
+		iter->pos = 0;
+		iter->sptep = iter->desc->sptes[iter->pos];
+	}
+
+	return true;
+}
+
+/*
+ * Must be used with a valid iterator: e.g. after rmap_get_first().
+ *
+ * Returns true if spte is found, false otherwise.
+ */
+static bool rmap_get_next(struct rmap_iterator *iter)
+{
+	if (iter->desc) {
+		if (iter->pos < PTE_LIST_EXT - 1) {
+			++iter->pos;
+			iter->sptep = iter->desc->sptes[iter->pos];
+			if (iter->sptep)
+				return true;
+		}
+
+		iter->desc = iter->desc->more;
+
+		if (iter->desc) {
+			iter->pos = 0;
+			iter->sptep = iter->desc->sptes[iter->pos];
+			/* desc->sptes[0] cannot be NULL */
+			return true;
+		}
+	}
+
+	iter->sptep = NULL;
+	return false;
+}
+
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
 	if (mmu_spte_clear_track_bits(sptep))
@@ -1013,23 +1048,28 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 
 static int __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp, int level)
 {
-	u64 *spte = NULL;
+	struct rmap_iterator iter;
 	int write_protected = 0;
 
-	while ((spte = rmap_next(rmapp, spte))) {
-		BUG_ON(!(*spte & PT_PRESENT_MASK));
-		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
+	for (rmap_get_first(*rmapp, &iter); iter.sptep;) {
+		u64 spte = *iter.sptep;
 
-		if (!is_writable_pte(*spte))
+		BUG_ON(!(spte & PT_PRESENT_MASK));
+		rmap_printk("rmap_write_protect: spte %p %llx\n", iter.sptep, spte);
+
+		if (!is_writable_pte(spte)) {
+			rmap_get_next(&iter);
 			continue;
+		}
 
 		if (level == PT_PAGE_TABLE_LEVEL) {
-			mmu_spte_update(spte, *spte & ~PT_WRITABLE_MASK);
+			mmu_spte_update(iter.sptep, spte & ~PT_WRITABLE_MASK);
+			rmap_get_next(&iter);
 		} else {
-			BUG_ON(!is_large_pte(*spte));
-			drop_spte(kvm, spte);
+			BUG_ON(!is_large_pte(spte));
+			drop_spte(kvm, iter.sptep);
 			--kvm->stat.lpages;
-			spte = NULL;
+			rmap_get_first(*rmapp, &iter);
 		}
 
 		write_protected = 1;
@@ -1084,48 +1124,58 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			   unsigned long data)
 {
-	u64 *spte;
+	struct rmap_iterator iter;
 	int need_tlb_flush = 0;
 
-	while ((spte = rmap_next(rmapp, NULL))) {
-		BUG_ON(!(*spte & PT_PRESENT_MASK));
-		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
-		drop_spte(kvm, spte);
+	while (rmap_get_first(*rmapp, &iter)) {
+		BUG_ON(!(*iter.sptep & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n",
+			    iter.sptep, *iter.sptep);
+
+		drop_spte(kvm, iter.sptep);
 		need_tlb_flush = 1;
 	}
+
 	return need_tlb_flush;
 }
 
 static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			     unsigned long data)
 {
+	struct rmap_iterator iter;
 	int need_flush = 0;
-	u64 *spte, new_spte;
+	u64 spte, new_spte;
 	pte_t *ptep = (pte_t *)data;
 	pfn_t new_pfn;
 
 	WARN_ON(pte_huge(*ptep));
 	new_pfn = pte_pfn(*ptep);
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
-		BUG_ON(!is_shadow_present_pte(*spte));
-		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
+
+	for (rmap_get_first(*rmapp, &iter); iter.sptep;) {
+		spte = *iter.sptep;
+
+		BUG_ON(!is_shadow_present_pte(spte));
+		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", iter.sptep, spte);
+
 		need_flush = 1;
+
 		if (pte_write(*ptep)) {
-			drop_spte(kvm, spte);
-			spte = rmap_next(rmapp, NULL);
+			drop_spte(kvm, iter.sptep);
+			rmap_get_first(*rmapp, &iter);
 		} else {
-			new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
+			new_spte = spte & ~PT64_BASE_ADDR_MASK;
 			new_spte |= (u64)new_pfn << PAGE_SHIFT;
 
 			new_spte &= ~PT_WRITABLE_MASK;
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 			new_spte &= ~shadow_accessed_mask;
-			mmu_spte_clear_track_bits(spte);
-			mmu_spte_set(spte, new_spte);
-			spte = rmap_next(rmapp, spte);
+
+			mmu_spte_clear_track_bits(iter.sptep);
+			mmu_spte_set(iter.sptep, new_spte);
+			rmap_get_next(&iter);
 		}
 	}
+
 	if (need_flush)
 		kvm_flush_remote_tlbs(kvm);
 
@@ -1184,7 +1234,7 @@ void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			 unsigned long data)
 {
-	u64 *spte;
+	struct rmap_iterator iter;
 	int young = 0;
 
 	/*
@@ -1197,25 +1247,22 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	if (!shadow_accessed_mask)
 		return kvm_unmap_rmapp(kvm, rmapp, data);
 
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
-		int _young;
-		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
-		if (_young) {
+	for (rmap_get_first(*rmapp, &iter); iter.sptep; rmap_get_next(&iter)) {
+		BUG_ON(!(*iter.sptep & PT_PRESENT_MASK));
+
+		if (*iter.sptep & PT_ACCESSED_MASK) {
 			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)iter.sptep);
 		}
-		spte = rmap_next(rmapp, spte);
 	}
+
 	return young;
 }
 
 static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			      unsigned long data)
 {
-	u64 *spte;
+	struct rmap_iterator iter;
 	int young = 0;
 
 	/*
@@ -1226,16 +1273,13 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	if (!shadow_accessed_mask)
 		goto out;
 
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
-		u64 _spte = *spte;
-		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		young = _spte & PT_ACCESSED_MASK;
-		if (young) {
+	for (rmap_get_first(*rmapp, &iter); iter.sptep; rmap_get_next(&iter)) {
+		BUG_ON(!(*iter.sptep & PT_PRESENT_MASK));
+
+		if (*iter.sptep & PT_ACCESSED_MASK) {
 			young = 1;
 			break;
 		}
-		spte = rmap_next(rmapp, spte);
 	}
 out:
 	return young;
@@ -1887,10 +1931,10 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	u64 *parent_pte;
+	struct rmap_iterator iter;
 
-	while ((parent_pte = pte_list_next(&sp->parent_ptes, NULL)))
-		drop_parent_pte(sp, parent_pte);
+	while (rmap_get_first(sp->parent_ptes, &iter))
+		drop_parent_pte(sp, iter.sptep);
 }
 
 static int mmu_zap_unsync_children(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index 6eabae3..568941d 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -192,7 +192,7 @@ static void audit_write_protection(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long *rmapp;
-	u64 *spte;
+	struct rmap_iterator iter;
 
 	if (sp->role.direct || sp->unsync || sp->role.invalid)
 		return;
@@ -200,13 +200,11 @@ static void audit_write_protection(struct kvm *kvm, struct kvm_mmu_page *sp)
 	slot = gfn_to_memslot(kvm, sp->gfn);
 	rmapp = &slot->rmap[sp->gfn - slot->base_gfn];
 
-	spte = rmap_next(rmapp, NULL);
-	while (spte) {
-		if (is_writable_pte(*spte))
+	for (rmap_get_first(*rmapp, &iter); iter.sptep; rmap_get_next(&iter)) {
+		if (is_writable_pte(*iter.sptep))
 			audit_printk(kvm, "shadow page has writable "
 				     "mappings: gfn %llx role %x\n",
 				     sp->gfn, sp->role.word);
-		spte = rmap_next(rmapp, spte);
 	}
 }
 
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15  9:18 [PATCH 0/3] KVM: MMU: Improve rmap handling Takuya Yoshikawa
  2012-03-15  9:19 ` [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well Takuya Yoshikawa
  2012-03-15  9:20 ` [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap Takuya Yoshikawa
@ 2012-03-15  9:21 ` Takuya Yoshikawa
  2012-03-15  9:49   ` Avi Kivity
  2 siblings, 1 reply; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15  9:21 UTC (permalink / raw)
  To: avi, mtosatti; +Cc: kvm

Checking whether iter->desc is NULL is not worth a function call,
especially when we use EPT/NPT because we know it would almost always
be NULL.

Although using "inline" like this does not look clean, we could see
measurable performance improvements: get_dirty_log for 1GB dirty memory
became faster by more than 10% on my test box.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
---
 arch/x86/kvm/mmu.c |   38 +++++++++++++++++++++++---------------
 1 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d042087..8aa58a1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1016,30 +1016,38 @@ static bool rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
  *
  * Returns true if spte is found, false otherwise.
  */
-static bool rmap_get_next(struct rmap_iterator *iter)
+static bool __rmap_get_next(struct rmap_iterator *iter)
 {
-	if (iter->desc) {
-		if (iter->pos < PTE_LIST_EXT - 1) {
-			++iter->pos;
-			iter->sptep = iter->desc->sptes[iter->pos];
-			if (iter->sptep)
-				return true;
-		}
+	if (iter->pos < PTE_LIST_EXT - 1) {
+		++iter->pos;
+		iter->sptep = iter->desc->sptes[iter->pos];
+		if (iter->sptep)
+			return true;
+	}
 
-		iter->desc = iter->desc->more;
+	iter->desc = iter->desc->more;
 
-		if (iter->desc) {
-			iter->pos = 0;
-			iter->sptep = iter->desc->sptes[iter->pos];
-			/* desc->sptes[0] cannot be NULL */
-			return true;
-		}
+	if (iter->desc) {
+		iter->pos = 0;
+		iter->sptep = iter->desc->sptes[iter->pos];
+		/* desc->sptes[0] cannot be NULL */
+		return true;
 	}
 
 	iter->sptep = NULL;
 	return false;
 }
 
+/* Inline only this trivial check; the slow path stays out of line. */
+static inline bool rmap_get_next(struct rmap_iterator *iter)
+{
+	if (iter->desc)
+		return __rmap_get_next(iter);
+
+	iter->sptep = NULL;
+	return false;
+}
+
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
 	if (mmu_spte_clear_track_bits(sptep))
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well
  2012-03-15  9:19 ` [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well Takuya Yoshikawa
@ 2012-03-15  9:22   ` Avi Kivity
  0 siblings, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2012-03-15  9:22 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm

On 03/15/2012 11:19 AM, Takuya Yoshikawa wrote:
> We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
> bytes do not fit cache lines well.  Furthermore some allocators may
> use 64/32-byte objects for the pte_list_desc cache.
>
> This patch solves this problem by raising PTE_LIST_EXT to 7.
>
> Note: with EPT/NPT we almost always have a single spte in each reverse
> mapping and nothing will be changed by this.

It might be better to drop it to 3.  Without EPT/NPT, anonymous guest
pages will have two mappings (a kernel mapping and a user mapping). 
Only file pages will have many mappings.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap
  2012-03-15  9:20 ` [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap Takuya Yoshikawa
@ 2012-03-15  9:44   ` Avi Kivity
  2012-03-15 10:25     ` Takuya Yoshikawa
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2012-03-15  9:44 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm

On 03/15/2012 11:20 AM, Takuya Yoshikawa wrote:
> Iteration using rmap_next(), whose actual body is pte_list_next(), is
> inefficient: every time we call it we start by checking whether rmap
> holds a single spte or points to a descriptor which links more sptes.
>
> In the case of shadow paging, this quadratic total iteration cost is a
> problem.  Even for two dimensional paging, with EPT/NPT on, in which we
> almost always have a single spte, the extra checks at the end of the
> iteration should be eliminated.
>
> This patch fixes this by introducing rmap_iterator which keeps the
> iteration context for the next search.  Furthermore the implementation
> of rmap_next() is split into two functions - rmap_get_first() and
> rmap_get_next() - to avoid repeatedly checking whether the rmap being
> iterated on has only one spte.
>
> Note: we just remove pte_list_next() because we can think of parent_ptes
> as a reverse mapping.
>
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
> ---
>  arch/x86/kvm/mmu.c       |  198 ++++++++++++++++++++++++++++------------------
>  arch/x86/kvm/mmu_audit.c |    8 +-
>  2 files changed, 124 insertions(+), 82 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 384d3c0..d042087 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -842,32 +842,6 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>  	return count;
>  }
>  
> -static u64 *pte_list_next(unsigned long *pte_list, u64 *spte)
> -{
> -	struct pte_list_desc *desc;
> -	u64 *prev_spte;
> -	int i;
> -
> -	if (!*pte_list)
> -		return NULL;
> -	else if (!(*pte_list & 1)) {
> -		if (!spte)
> -			return (u64 *)*pte_list;
> -		return NULL;
> -	}
> -	desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> -	prev_spte = NULL;
> -	while (desc) {
> -		for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) {
> -			if (prev_spte == spte)
> -				return desc->sptes[i];
> -			prev_spte = desc->sptes[i];
> -		}
> -		desc = desc->more;
> -	}
> -	return NULL;
> -}
> -
>  static void
>  pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
>  			   int i, struct pte_list_desc *prev_desc)
> @@ -988,11 +962,6 @@ static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
>  	return pte_list_add(vcpu, spte, rmapp);
>  }
>  
> -static u64 *rmap_next(unsigned long *rmapp, u64 *spte)
> -{
> -	return pte_list_next(rmapp, spte);
> -}
> -
>  static void rmap_remove(struct kvm *kvm, u64 *spte)
>  {
>  	struct kvm_mmu_page *sp;
> @@ -1005,6 +974,72 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
>  	pte_list_remove(spte, rmapp);
>  }
>  
> +/*
> + * Used by the following functions to iterate over the sptes linked by a rmap.
> + * Only sptep can be used outside of these functions.
> + */
> +struct rmap_iterator {
> +	u64 *sptep;			/* points to the current spte */
> +	/* private fields */
> +	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
> +	int pos;			/* index of the sptep */
> +};
> +
> +/*
> + * Iteration must be started by this function.  This should also be used after
> + * removing/dropping sptes from rmap because in such cases the information in
> + * the iterator may not be valid.

Note: this suggests rmap_remove(struct rmap_iterator *ri) to remove an
rmap from the iterator while keeping it valid.  Converts a potentially
quadratic kvm_mmu_rmap_write_protect() to linear.

> + *
> + * Returns true if spte is found, false otherwise.
> + */
> +static bool rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> +{
> +	if (!rmap) {
> +		iter->sptep = NULL;
> +		return false;
> +	}
> +
> +	if (!(rmap & 1)) {
> +		iter->sptep = (u64 *)rmap;
> +		iter->desc = NULL;
> +	} else {
> +		iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
> +		iter->pos = 0;
> +		iter->sptep = iter->desc->sptes[iter->pos];
> +	}
> +
> +	return true;
> +}

Might be simplified to return the sptep or NULL; so you don't have to
copy it to the iterator.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15  9:21 ` [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next() Takuya Yoshikawa
@ 2012-03-15  9:49   ` Avi Kivity
  2012-03-15 10:15     ` Takuya Yoshikawa
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2012-03-15  9:49 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm

On 03/15/2012 11:21 AM, Takuya Yoshikawa wrote:
> Checking whether iter->desc is NULL is not worth a function call,
> especially when we use EPT/NPT because we know it would almost always
> be NULL.
>
> Although using "inline" like this does not look clean, we could see
> measurable performance improvements: get_dirty_log for 1GB dirty memory
> became faster by more than 10% on my test box.
>

WOW.  I'd have assumed the processor deals better with this; it should
be 100% predicted branches.

But I won't argue with cold data.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15  9:49   ` Avi Kivity
@ 2012-03-15 10:15     ` Takuya Yoshikawa
  2012-03-15 12:01       ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15 10:15 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm

Avi Kivity <avi@redhat.com> wrote:

> > Although using "inline" like this does not look clean, we could see
> > measurable performance improvements: get_dirty_log for 1GB dirty memory
> > became faster by more than 10% on my test box.
> >
> 
> WOW.  I'd have assumed the processor deals better with this; it should
> be 100% predicted branches.
> 
> But I won't argue with cold data.

What I checked was:

original   with-patch2   with-patch3
8.7ms      8.5ms         7.5ms

I assumed that without "inline" only __rmap_get_next() would be inlined
into rmap_get_next(), so I did it like this.

I thought the improvement was just from removing one function call for
each rmap_write_protect.  Not sure if anything was changed with branch
predictions.

	Takuya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap
  2012-03-15  9:44   ` Avi Kivity
@ 2012-03-15 10:25     ` Takuya Yoshikawa
  0 siblings, 0 replies; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15 10:25 UTC (permalink / raw)
  To: Avi Kivity; +Cc: mtosatti, kvm

Avi Kivity <avi@redhat.com> wrote:

> > + * Iteration must be started by this function.  This should also be used after
> > + * removing/dropping sptes from rmap because in such cases the information in
> > + * the iterator may not be valid.
> 
> Note: this suggests rmap_remove(struct rmap_iterator *ri) to remove an
> rmap from the iterator while keeping it valid.  Converts a potentially
> quadratic kvm_mmu_rmap_write_protect() to linear.

Yes, but a bit scary because we need to touch drop_spte() and so on.

Actually drop_spte() itself should be changed to take rmap as an argument to
avoid recalculating it from sp.

I personally do not like current rmap_* interface.  We should think of
a better design for the future.

> > +static bool rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> > +{
> > +	if (!rmap) {
> > +		iter->sptep = NULL;
> > +		return false;
> > +	}
> > +
> > +	if (!(rmap & 1)) {
> > +		iter->sptep = (u64 *)rmap;
> > +		iter->desc = NULL;
> > +	} else {
> > +		iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
> > +		iter->pos = 0;
> > +		iter->sptep = iter->desc->sptes[iter->pos];
> > +	}
> > +
> > +	return true;
> > +}
> 
> Might be simplified to return the sptep or NULL; so you don't have to
> copy it to the iterator.

while ((spte = rmap_...)) or for (rmap_...; iter.sptep; ...)

Will think again!

	Takuya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15 10:15     ` Takuya Yoshikawa
@ 2012-03-15 12:01       ` Avi Kivity
  2012-03-15 13:41         ` Takuya Yoshikawa
  0 siblings, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2012-03-15 12:01 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: mtosatti, kvm

On 03/15/2012 12:15 PM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > > Although using "inline" like this does not look clean, we could see
> > > measurable performance improvements: get_dirty_log for 1GB dirty memory
> > > became faster by more than 10% on my test box.
> > >
> > 
> > WOW.  I'd have assumed the processor deals better with this; it should
> > be 100% predicted branches.
> > 
> > But I won't argue with cold data.
>
> What I checked was:
>
> original   with-patch2   with-patch3
> 8.7ms      8.5ms         7.5ms

What's the per-call numbers?

> I assumed that without "inline" only __rmap_get_next() would be inlined
> into rmap_get_next() so did like this.
>
> I thought the improvement was just from removing one function call for
> each rmap_write_protect.  Not sure if anything was changed with branch
> predictions.

What I mean is, modern cpus effectively inline simple function calls by
predicting the call, and branchs within the function, and the return, so
they don't have to stop their pipelines at any of these points.  But
again, the numbers talk louder than speculation about cpu architecture.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15 12:01       ` Avi Kivity
@ 2012-03-15 13:41         ` Takuya Yoshikawa
  2012-03-15 13:45           ` Avi Kivity
  2012-03-15 13:46           ` Avi Kivity
  0 siblings, 2 replies; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-15 13:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Takuya Yoshikawa, mtosatti, kvm

Avi Kivity <avi@redhat.com> wrote:

> > What I checked was:
> >
> > original   with-patch2   with-patch3
> > 8.7ms      8.5ms         7.5ms
> 
> What's the per-call numbers?

I did not look into details at that time.
I will try to see more details later if possible!

> What I mean is, modern cpus effectively inline simple function calls by
> predicting the call, and branchs within the function, and the return, so
> they don't have to stop their pipelines at any of these points.  But
> again, the numbers talk louder than speculation about cpu architecture.

I need to update my knowledge, thank you!

Anyway I will re-check if we can achieve the same performance with a bit
cleaner implementation.


	Takuya

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15 13:41         ` Takuya Yoshikawa
@ 2012-03-15 13:45           ` Avi Kivity
  2012-03-15 13:46           ` Avi Kivity
  1 sibling, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2012-03-15 13:45 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, mtosatti, kvm

On 03/15/2012 03:41 PM, Takuya Yoshikawa wrote:
> Anyway I will re-check if we can achieve the same performance with a bit
> cleaner implementation.
>

Not needed -- it's fine as it is.  I'm just interested in the
normalized numbers (nanoseconds per call, instead of aggregated time per
unknown number of calls).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15 13:41         ` Takuya Yoshikawa
  2012-03-15 13:45           ` Avi Kivity
@ 2012-03-15 13:46           ` Avi Kivity
  2012-03-20  6:37             ` Takuya Yoshikawa
  1 sibling, 1 reply; 15+ messages in thread
From: Avi Kivity @ 2012-03-15 13:46 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, mtosatti, kvm

On 03/15/2012 03:41 PM, Takuya Yoshikawa wrote:
> > What I mean is, modern cpus effectively inline simple function calls by
> > predicting the call, and branchs within the function, and the return, so
> > they don't have to stop their pipelines at any of these points.  But
> > again, the numbers talk louder than speculation about cpu architecture.
>
> I need to update my knowledge, thank you!
>
>

You can look at what's happening by doing

  perf record -a -f
  perf report

using the TUI, select 'annotate rmap_get_next'

You should see where the time is spent.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-15 13:46           ` Avi Kivity
@ 2012-03-20  6:37             ` Takuya Yoshikawa
  2012-03-20 11:56               ` Avi Kivity
  0 siblings, 1 reply; 15+ messages in thread
From: Takuya Yoshikawa @ 2012-03-20  6:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Takuya Yoshikawa, mtosatti, kvm

On Thu, 15 Mar 2012 15:46:31 +0200
Avi Kivity <avi@redhat.com> wrote:

> You can look at what's happening by doing
> 
>   perf record -a -f
>   perf report
> 
> using the TUI, select 'annotate rmap_get_next'
> 
> You should see where the time is spent.
> 

Since the default sampling rate was not enough to get consistent numbers,
I raised the frequency to 50K/sec, so there was surely some overhead
during that sampling.

From the result below, it seems to be the prologue and epilogue on which
we spent most of our time.

	Takuya

===
 Percent |	Source code & Disassembly of kvm.ko
------------------------------------------------
         :
         :
         :
         :	Disassembly of section .text:
         :
         :	0001d260 <rmap_get_next>:
         :	 * Must be used with a valid iterator: e.g. after rmap_get_first().
         :	 *
         :	 * Returns true if spte is found, false otherwise.
         :	 */
         :	static bool rmap_get_next(struct rmap_iterator *iter)
         :	{
    1.75 :	   1d260:       push   %ebp
   17.54 :	   1d261:       mov    %esp,%ebp
   12.28 :	   1d263:       push   %ebx
    7.02 :	   1d264:       call   1d265 <rmap_get_next+0x5>
         :	        if (iter->desc) {
    1.75 :	   1d269:       mov    0x4(%eax),%ecx
         :	 * Must be used with a valid iterator: e.g. after rmap_get_first().
         :	 *
         :	 * Returns true if spte is found, false otherwise.
         :	 */
         :	static bool rmap_get_next(struct rmap_iterator *iter)
         :	{
   22.81 :	   1d26c:       mov    %eax,%edx
         :	        if (iter->desc) {
    0.00 :	   1d26e:       test   %ecx,%ecx
    0.00 :	   1d270:       je     1d2b8 <rmap_get_next+0x58>
         :	                if (iter->pos < PTE_LIST_EXT - 1) {
    0.00 :	   1d272:       mov    0x8(%eax),%eax
    0.00 :	   1d275:       cmp    $0x5,%eax
    0.00 :	   1d278:       jg     1d298 <rmap_get_next+0x38>
         :	                        ++iter->pos;
    0.00 :	   1d27a:       add    $0x1,%eax
    0.00 :	   1d27d:       mov    %eax,0x8(%edx)
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
    0.00 :	   1d280:       mov    (%ecx,%eax,4),%ebx
         :	                        if (iter->sptep)
         :	                                return true;
    0.00 :	   1d283:       mov    $0x1,%eax
         :	{
         :	        if (iter->desc) {
         :	                if (iter->pos < PTE_LIST_EXT - 1) {
         :	                        ++iter->pos;
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
         :	                        if (iter->sptep)
    0.00 :	   1d288:       test   %ebx,%ebx
         :	static bool rmap_get_next(struct rmap_iterator *iter)
         :	{
         :	        if (iter->desc) {
         :	                if (iter->pos < PTE_LIST_EXT - 1) {
         :	                        ++iter->pos;
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
    0.00 :	   1d28a:       mov    %ebx,(%edx)
         :	                        if (iter->sptep)
    0.00 :	   1d28c:       je     1d298 <rmap_get_next+0x38>
         :	                }
         :	        }
         :
         :	        iter->sptep = NULL;
         :	        return false;
         :	}
    0.00 :	   1d28e:       pop    %ebx
    0.00 :	   1d28f:       pop    %ebp
    0.00 :	   1d290:       ret    
    0.00 :	   1d291:       lea    0x0(%esi,%eiz,1),%esi
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
         :	                        if (iter->sptep)
         :	                                return true;
         :	                }
         :
         :	                iter->desc = iter->desc->more;
    0.00 :	   1d298:       mov    0x1c(%ecx),%eax
         :
         :	                if (iter->desc) {
    0.00 :	   1d29b:       test   %eax,%eax
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
         :	                        if (iter->sptep)
         :	                                return true;
         :	                }
         :
         :	                iter->desc = iter->desc->more;
    0.00 :	   1d29d:       mov    %eax,0x4(%edx)
         :
         :	                if (iter->desc) {
    0.00 :	   1d2a0:       je     1d2b8 <rmap_get_next+0x58>
         :	                        iter->pos = 0;
    0.00 :	   1d2a2:       movl   $0x0,0x8(%edx)
         :	                        iter->sptep = iter->desc->sptes[iter->pos];
    0.00 :	   1d2a9:       mov    (%eax),%eax
    0.00 :	   1d2ab:       mov    %eax,(%edx)
    0.00 :	   1d2ad:       mov    $0x1,%eax
         :	                }
         :	        }
         :
         :	        iter->sptep = NULL;
         :	        return false;
         :	}
    0.00 :	   1d2b2:       pop    %ebx
    0.00 :	   1d2b3:       pop    %ebp
    0.00 :	   1d2b4:       ret    
    0.00 :	   1d2b5:       lea    0x0(%esi),%esi
         :	                        /* desc->sptes[0] cannot be NULL */
         :	                        return true;
         :	                }
         :	        }
         :
         :	        iter->sptep = NULL;
    0.00 :	   1d2b8:       movl   $0x0,(%edx)
         :	        return false;
    1.75 :	   1d2be:       xor    %eax,%eax
         :	}
   12.28 :	   1d2c0:       pop    %ebx
   22.81 :	   1d2c1:       pop    %ebp
    0.00 :	   1d2c2:       ret    

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next()
  2012-03-20  6:37             ` Takuya Yoshikawa
@ 2012-03-20 11:56               ` Avi Kivity
  0 siblings, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2012-03-20 11:56 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: Takuya Yoshikawa, mtosatti, kvm

On 03/20/2012 08:37 AM, Takuya Yoshikawa wrote:
> On Thu, 15 Mar 2012 15:46:31 +0200
> Avi Kivity <avi@redhat.com> wrote:
>
> > You can look at what's happening by doing
> > 
> >   perf record -a -f
> >   perf report
> > 
> > using the TUI, select 'annotate rmap_get_next'
> > 
> > You should see where the time is spent.
> > 
>
> Since the default sampling rate was not enough to get consistent numbers,
> I raised the frequency to 50K/sec; so there was surely some overhead
> from the sampling itself.
>
> From the result below, it seems that most of the time is spent in the
> prologue and epilogue.
>
>

I can only assume that the cpu is processing these instructions
perfectly and each is taking 1 cycle (or less).  I don't see how else we
can arrive at such results.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-03-20 11:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-15  9:18 [PATCH 0/3] KVM: MMU: Improve rmap handling Takuya Yoshikawa
2012-03-15  9:19 ` [PATCH 1/3] KVM: MMU: Make pte_list_desc fit cache lines well Takuya Yoshikawa
2012-03-15  9:22   ` Avi Kivity
2012-03-15  9:20 ` [PATCH 2/3] KVM: MMU: Improve iteration over sptes linked by rmap Takuya Yoshikawa
2012-03-15  9:44   ` Avi Kivity
2012-03-15 10:25     ` Takuya Yoshikawa
2012-03-15  9:21 ` [PATCH 3/3] KVM: MMU: Separate trivial NULL check out from rmap_get_next() Takuya Yoshikawa
2012-03-15  9:49   ` Avi Kivity
2012-03-15 10:15     ` Takuya Yoshikawa
2012-03-15 12:01       ` Avi Kivity
2012-03-15 13:41         ` Takuya Yoshikawa
2012-03-15 13:45           ` Avi Kivity
2012-03-15 13:46           ` Avi Kivity
2012-03-20  6:37             ` Takuya Yoshikawa
2012-03-20 11:56               ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox