From: David Matlack <dmatlack@google.com>
To: Ben Gardon <bgardon@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
Zhenzhong Duan <zhenzhong.duan@intel.com>,
"open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86)"
<kvm@vger.kernel.org>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Date: Tue, 5 Apr 2022 19:02:08 +0000 [thread overview]
Message-ID: <YkySMOyOK+25UAd9@google.com> (raw)
In-Reply-To: <CANgfPd-pt1K0D0TqX9qJpk0kMOszDeWxuScNeFHbVP01a8XJwQ@mail.gmail.com>
On Mon, Apr 04, 2022 at 11:48:46AM -0700, Ben Gardon wrote:
> On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Now that the TDP MMU has a mechanism to split huge pages, use it in the
> > fault path when a huge page needs to be replaced with a mapping at a
> > lower level.
> >
> > This change reduces the negative performance impact of NX HugePages.
> > Prior to this change if a vCPU executed from a huge page and NX
> > HugePages was enabled, the vCPU would take a fault, zap the huge page,
> > and mapping the faulting address at 4KiB with execute permissions
> > enabled. The rest of the memory would be left *unmapped* and have to be
> > faulted back in by the guest upon access (read, write, or execute). If
> > guest is backed by 1GiB, a single execute instruction can zap an entire
> > GiB of its physical address space.
> >
> > For example, it can take a VM longer to execute from its memory than to
> > populate that memory in the first place:
> >
> > $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> >
> > Populating memory : 2.748378795s
> > Executing from memory : 2.899670885s
> >
> > With this change, such faults split the huge page instead of zapping it,
> > which avoids the non-present faults on the rest of the huge page:
> >
> > $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> >
> > Populating memory : 2.729544474s
> > Executing from memory : 0.111965688s <---
> >
> > This change also reduces the performance impact of dirty logging when
> > eager_page_split=N for the same reasons as above but write faults.
> > eager_page_split=N (abbreviated "eps=N" below) can be desirable for
> > read-heavy workloads, as it avoids allocating memory to split huge pages
> > that are never written and avoids increasing the TLB miss cost on reads
> > of those pages.
> >
> > | Config: ept=Y, tdp_mmu=Y, 5% writes |
> > | Iteration 1 dirty memory time |
> > | --------------------------------------------- |
> > vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
> > ------------ | -------------- | ------------- | ------------ |
> > 2 | 0.332305091s | 0.019615027s | 0.006108211s |
> > 4 | 0.353096020s | 0.019452131s | 0.006214670s |
> > 8 | 0.453938562s | 0.019748246s | 0.006610997s |
> > 16 | 0.719095024s | 0.019972171s | 0.007757889s |
> > 32 | 1.698727124s | 0.021361615s | 0.012274432s |
> > 64 | 2.630673582s | 0.031122014s | 0.016994683s |
> > 96 | 3.016535213s | 0.062608739s | 0.044760838s |
> >
> > Eager page splitting remains beneficial for write-heavy workloads, but
> > the gap is now reduced.
> >
> > | Config: ept=Y, tdp_mmu=Y, 100% writes |
> > | Iteration 1 dirty memory time |
> > | --------------------------------------------- |
> > vCPU Count | eps=N (Before) | eps=N (After) | eps=Y |
> > ------------ | -------------- | ------------- | ------------ |
> > 2 | 0.317710329s | 0.296204596s | 0.058689782s |
> > 4 | 0.337102375s | 0.299841017s | 0.060343076s |
> > 8 | 0.386025681s | 0.297274460s | 0.060399702s |
> > 16 | 0.791462524s | 0.298942578s | 0.062508699s |
> > 32 | 1.719646014s | 0.313101996s | 0.075984855s |
> > 64 | 2.527973150s | 0.455779206s | 0.079789363s |
> > 96 | 2.681123208s | 0.673778787s | 0.165386739s |
> >
> > Further study is needed to determine if the remaining gap is acceptable
> > for customer workloads or if eager_page_split=N still requires a-priori
> > knowledge of the VM workload, especially when considering these costs
> > extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
> > 1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9263765c8068..5a2120d85347 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> > return 0;
> > }
> >
> > +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> > + struct tdp_iter *iter,
> > + bool account_nx);
> > +
> > /*
> > * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> > * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > struct kvm_mmu *mmu = vcpu->arch.mmu;
> > struct tdp_iter iter;
> > struct kvm_mmu_page *sp;
> > + bool account_nx;
> > int ret;
> >
> > kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > if (iter.level == fault->goal_level)
> > break;
> >
> > + account_nx = fault->huge_page_disallowed &&
> > + fault->req_level >= iter.level;
> > +
> > /*
> > * If there is an SPTE mapping a large page at a higher level
> > - * than the target, that SPTE must be cleared and replaced
> > - * with a non-leaf SPTE.
> > + * than the target, split it down one level.
> > */
> > if (is_shadow_present_pte(iter.old_spte) &&
> > is_large_pte(iter.old_spte)) {
> > - if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> > + if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
> > break;
>
> I don't think we necessarily want to break here, as splitting a 1G
> page would require two splits.
>
> ...
>
> Oh tdp_mmu_split_huge_page_atomic returns non-zero to indicate an
> error and if everything works we will split again. In the case of
> failure, should we fall back to zapping?
The only way for tdp_mmu_split_huge_page_atomic() to fail is if
tdp_mmu_set_spte_atomic() fails (i.e. the huge page SPTE is frozen or
being concurrently modified). Breaking here means we go back into the
guest and retry the access.
I don't think we should fall back to zapping:
- If the SPTE is frozen, zapping will also fail.
- Otherwise, the SPTE is being modified by another CPU. It'd be a waste
to immediately zap that CPU's work. e.g. Maybe another CPU just split
this huge page for us :).
>
>
> >
> > - /*
> > - * The iter must explicitly re-read the spte here
> > - * because the new value informs the !present
> > - * path below.
> > - */
> > - iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> > + continue;
> > }
> >
> > if (!is_shadow_present_pte(iter.old_spte)) {
> > - bool account_nx = fault->huge_page_disallowed &&
> > - fault->req_level >= iter.level;
> > -
> > /*
> > * If SPTE has been frozen by another thread, just
> > * give up and retry, avoiding unnecessary page table
> > @@ -1496,6 +1495,20 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> > return ret;
> > }
> >
> > +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> > + struct tdp_iter *iter,
> > + bool account_nx)
> > +{
> > + struct kvm_mmu_page *sp = tdp_mmu_alloc_sp(vcpu);
> > + int r;
> > +
> > + r = tdp_mmu_split_huge_page(vcpu->kvm, iter, sp, true, account_nx);
> > + if (r)
> > + tdp_mmu_free_sp(sp);
> > +
> > + return r;
> > +}
> > +
> > static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> > struct kvm_mmu_page *root,
> > gfn_t start, gfn_t end,
> > --
> > 2.35.1.1094.g7c7d902a7c-goog
> >
next prev parent reply other threads:[~2022-04-05 22:31 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
2022-04-04 18:13 ` Ben Gardon
2022-04-05 20:59 ` David Matlack
2022-04-05 21:00 ` David Matlack
2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
2022-04-04 18:26 ` Ben Gardon
2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-04-04 18:48 ` Ben Gardon
2022-04-05 19:02 ` David Matlack [this message]
2022-04-07 19:39 ` Sean Christopherson
2022-04-07 20:52 ` David Matlack
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YkySMOyOK+25UAd9@google.com \
--to=dmatlack@google.com \
--cc=bgardon@google.com \
--cc=jmattson@google.com \
--cc=joro@8bytes.org \
--cc=kvm@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=seanjc@google.com \
--cc=vkuznets@redhat.com \
--cc=wanpengli@tencent.com \
--cc=zhenzhong.duan@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.