Re: [PATCH v2 2/2] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mingwei Zhang <mizhang@google.com>
To: David Matlack <dmatlack@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Ben Gardon <bgardon@google.com>,
	kvm@vger.kernel.org
Subject: Re: [PATCH v2 2/2] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
Date: Thu, 20 Oct 2022 22:03:17 +0000	[thread overview]
Message-ID: <Y1HFpTl+bZBFHaCQ@google.com> (raw)
In-Reply-To: <20221019234050.3919566-3-dmatlack@google.com>

On Wed, Oct 19, 2022, David Matlack wrote:
> Now that the TDP MMU has a mechanism to split huge pages, use it in the
> fault path when a huge page needs to be replaced with a mapping at a
> lower level.
> 
> This change reduces the negative performance impact of NX HugePages.
> Prior to this change if a vCPU executed from a huge page and NX
> HugePages was enabled, the vCPU would take a fault, zap the huge page,
> and mapping the faulting address at 4KiB with execute permissions
> enabled. The rest of the memory would be left *unmapped* and have to be
> faulted back in by the guest upon access (read, write, or execute). If
> guest is backed by 1GiB, a single execute instruction can zap an entire
> GiB of its physical address space.
> 
> For example, it can take a VM longer to execute from its memory than to
> populate that memory in the first place:
> 
> $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> 
> Populating memory             : 2.748378795s
> Executing from memory         : 2.899670885s
> 
> With this change, such faults split the huge page instead of zapping it,
> which avoids the non-present faults on the rest of the huge page:
> 
> $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> 
> Populating memory             : 2.729544474s
> Executing from memory         : 0.111965688s   <---
> 
> This change also reduces the performance impact of dirty logging when
> eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
> be desirable for read-heavy workloads, as it avoids allocating memory to
> split huge pages that are never written and avoids increasing the TLB
> miss cost on reads of those pages.
> 
>              | Config: ept=Y, tdp_mmu=Y, 5% writes           |
>              | Iteration 1 dirty memory time                 |
>              | --------------------------------------------- |
> vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> ------------ | -------------- | ------------- | ------------ |
> 2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
> 4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
> 8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
> 16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
> 32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
> 64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
> 96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
> 
> Eager page splitting remains beneficial for write-heavy workloads, but
> the gap is now reduced.
> 
>              | Config: ept=Y, tdp_mmu=Y, 100% writes         |
>              | Iteration 1 dirty memory time                 |
>              | --------------------------------------------- |
> vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> ------------ | -------------- | ------------- | ------------ |
> 2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
> 4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
> 8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
> 16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
> 32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
> 64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
> 96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
> 
> Further study is needed to determine if the remaining gap is acceptable
> for customer workloads or if eager_page_split=N still requires a-priori
> knowledge of the VM workload, especially when considering these costs
> extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 72 ++++++++++++++++++--------------------
>  1 file changed, 34 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4e5b3ae824c1..c53767104d5b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1146,6 +1146,9 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  	return 0;
>  }
>  
> +static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> +				   struct kvm_mmu_page *sp, bool shared);
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1171,49 +1174,42 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		if (iter.level == fault->goal_level)
>  			break;
>  
> -		/*
> -		 * If there is an SPTE mapping a large page at a higher level
> -		 * than the target, that SPTE must be cleared and replaced
> -		 * with a non-leaf SPTE.
> -		 */
> +		/* Step down into the lower level page table if it exists. */
>  		if (is_shadow_present_pte(iter.old_spte) &&
> -		    is_large_pte(iter.old_spte)) {
> -			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> -				break;
> +		    !is_large_pte(iter.old_spte))
> +			continue;
>  
> -			/*
> -			 * The iter must explicitly re-read the spte here
> -			 * because the new value informs the !present
> -			 * path below.
> -			 */
> -			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> -		}
> +		/*
> +		 * If SPTE has been frozen by another thread, just give up and
> +		 * retry, avoiding unnecessary page table allocation and free.
> +		 */
> +		if (is_removed_spte(iter.old_spte))
> +			break;
>  
> -		if (!is_shadow_present_pte(iter.old_spte)) {
> -			/*
> -			 * If SPTE has been frozen by another thread, just
> -			 * give up and retry, avoiding unnecessary page table
> -			 * allocation and free.
> -			 */
> -			if (is_removed_spte(iter.old_spte))
> -				break;
> +		/*
> +		 * The SPTE is either non-present or points to a huge page that
> +		 * needs to be split.
> +		 */
> +		sp = tdp_mmu_alloc_sp(vcpu);
> +		tdp_mmu_init_child_sp(sp, &iter);
>  
> -			sp = tdp_mmu_alloc_sp(vcpu);
> -			tdp_mmu_init_child_sp(sp, &iter);
> +		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
>  
> -			sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> +		if (is_shadow_present_pte(iter.old_spte))
> +			ret = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> +		else
> +			ret = tdp_mmu_link_sp(kvm, &iter, sp, true);
>  
> -			if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> -				tdp_mmu_free_sp(sp);
> -				break;
> -			}
> +		if (ret) {
> +			tdp_mmu_free_sp(sp);
> +			break;
> +		}
>  
> -			if (fault->huge_page_disallowed &&
> -			    fault->req_level >= iter.level) {
> -				spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -				track_possible_nx_huge_page(kvm, sp);
> -				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -			}
> +		if (fault->huge_page_disallowed &&
> +		    fault->req_level >= iter.level) {
> +			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +			track_possible_nx_huge_page(kvm, sp);
> +			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  		}
>  	}
>  
> @@ -1484,8 +1480,6 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>  	const int level = iter->level;
>  	int ret, i;
>  
> -	tdp_mmu_init_child_sp(sp, iter);
> -
David, thanks for the alignment with the precise nx huge page series.

Nit: since this patch puts tdp_mmu_init_child_sp() out of the
tdp_mmu_split_huge_page(), can we add a comment mentioned that, i.e.,
initialization of child sp is required before invoking the function?

With that action done,

Reviewed-by: Mingwei Zhang <mizhang@google.com>

>  	/*
>  	 * No need for atomics when writing to sp->spt since the page table has
>  	 * not been linked in yet and thus is not reachable from any other CPU.
> @@ -1561,6 +1555,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>  				continue;
>  		}
>  
> +		tdp_mmu_init_child_sp(sp, &iter);
> +
>  		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
>  			goto retry;
>  
> -- 
> 2.38.0.413.g74048e4d9e-goog
>

     prev parent reply	other threads:[~2022-10-20 22:03 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-19 23:40 [PATCH v2 0/2] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-10-19 23:40 ` [PATCH v2 1/2] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
2022-10-19 23:40 ` [PATCH v2 2/2] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-10-20 22:03   ` Mingwei Zhang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1HFpTl+bZBFHaCQ@google.com \
    --to=mizhang@google.com \
    --cc=bgardon@google.com \
    --cc=dmatlack@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.