[PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Matlack <dmatlack@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	David Matlack <dmatlack@google.com>,
	Ben Gardon <bgardon@google.com>,
	Zhenzhong Duan <zhenzhong.duan@intel.com>,
	"open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)" 
	<kvm@vger.kernel.org>, Peter Xu <peterx@redhat.com>
Subject: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Date: Fri,  1 Apr 2022 23:37:37 +0000	[thread overview]
Message-ID: <20220401233737.3021889-4-dmatlack@google.com> (raw)
In-Reply-To: <20220401233737.3021889-1-dmatlack@google.com>

Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.

This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.

For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.748378795s
Executing from memory         : 2.899670885s

With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.729544474s
Executing from memory         : 0.111965688s   <---

This change also reduces the performance impact of dirty logging when
eager_page_split=N for the same reasons as above but write faults.
eager_page_split=N (abbreviated "eps=N" below) can be desirable for
read-heavy workloads, as it avoids allocating memory to split huge pages
that are never written and avoids increasing the TLB miss cost on reads
of those pages.

             | Config: ept=Y, tdp_mmu=Y, 5% writes           |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
96           | 3.016535213s   | 0.062608739s  | 0.044760838s |

Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.

             | Config: ept=Y, tdp_mmu=Y, 100% writes         |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
96           | 2.681123208s   | 0.673778787s  | 0.165386739s |

Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9263765c8068..5a2120d85347 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	return 0;
 }
 
+static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
+					  struct tdp_iter *iter,
+					  bool account_nx);
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
+	bool account_nx;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (iter.level == fault->goal_level)
 			break;
 
+		account_nx = fault->huge_page_disallowed &&
+			     fault->req_level >= iter.level;
+
 		/*
 		 * If there is an SPTE mapping a large page at a higher level
-		 * than the target, that SPTE must be cleared and replaced
-		 * with a non-leaf SPTE.
+		 * than the target, split it down one level.
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
+			if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
 				break;
 
-			/*
-			 * The iter must explicitly re-read the spte here
-			 * because the new value informs the !present
-			 * path below.
-			 */
-			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
+			continue;
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
-			bool account_nx = fault->huge_page_disallowed &&
-					  fault->req_level >= iter.level;
-
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1496,6 +1495,20 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
+static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
+					  struct tdp_iter *iter,
+					  bool account_nx)
+{
+	struct kvm_mmu_page *sp = tdp_mmu_alloc_sp(vcpu);
+	int r;
+
+	r = tdp_mmu_split_huge_page(vcpu->kvm, iter, sp, true, account_nx);
+	if (r)
+		tdp_mmu_free_sp(sp);
+
+	return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-- 
2.35.1.1094.g7c7d902a7c-goog

next prev parent reply	other threads:[~2022-04-01 23:37 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
2022-04-04 18:13   ` Ben Gardon
2022-04-05 20:59     ` David Matlack
2022-04-05 21:00       ` David Matlack
2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
2022-04-04 18:26   ` Ben Gardon
2022-04-01 23:37 ` David Matlack [this message]
2022-04-04 18:48   ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault Ben Gardon
2022-04-05 19:02     ` David Matlack
2022-04-07 19:39   ` Sean Christopherson
2022-04-07 20:52     ` David Matlack

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:9263765c806 dfblob:5a2120d8534 )
 OR (
bs:"[PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220401233737.3021889-4-dmatlack@google.com \
    --to=dmatlack@google.com \
    --cc=bgardon@google.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=seanjc@google.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=zhenzhong.duan@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.