From: David Matlack <dmatlack@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ben Gardon <bgardon@google.com>,
	Mingwei Zhang <mizhang@google.com>
Subject: Re: [PATCH v2 25/30] KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
Date: Wed, 5 Jan 2022 00:19:09 +0000	[thread overview]
Message-ID: <YdTj/eHur+9Vqdw6@google.com> (raw)
In-Reply-To: <20211223222318.1039223-26-seanjc@google.com>

On Thu, Dec 23, 2021 at 10:23:13PM +0000, Sean Christopherson wrote:
> When zapping a TDP MMU root, perform the zap in two passes to avoid
> zapping an entire top-level SPTE while holding RCU, which can induce RCU
> stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
> zap top-level entries in the second pass.
> 
> With 4-level paging, zapping a PGD that is fully populated with 4kb leaf
> SPTEs takes up to ~7 or so seconds (the time varies based on kernel
> config, number of CPUs, vCPUs, etc...).  With 5-level paging, that time
> can balloon well into hundreds of seconds.
> 
> Before remote TLB flushes were omitted, the problem was even worse as
> waiting for all active vCPUs to respond to the IPI introduced significant
> overhead for VMs with large numbers of vCPUs.
> 
> By zapping 1gb SPTEs (both shadow pages and hugepages) in the first pass,
> the amount of work that is done without dropping RCU protection is
> strictly bounded, with the worst case latency for a single operation
> being less than 100ms.
> 
> Zapping at 1gb in the first pass is not arbitrary.  First and foremost,
> KVM relies on being able to zap 1gb shadow pages in a single shot when
> replacing a shadow page with a hugepage.

When dirty logging is disabled, zap_collapsible_spte_range() does the
bulk of the work zapping leaf SPTEs and allows yielding. I guess that
could race with a vCPU faulting in the huge page though, in which case
the vCPU could end up doing the bulk of the work.

Are there any other scenarios where KVM relies on zapping 1GB worth of
4KB SPTEs without yielding?

In any case, 100ms is a long time to hog the CPU. Why not just do the
safe thing and zap each level? 4K, then 2M, then 1GB, ..., then root
level. The only argument against it I can think of is performance (lots
of redundant walks through the page table). But I don't think root
zapping is especially latency critical.
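
For example (just a rough sketch of what I mean, assuming a hypothetical
tdp_mmu_zap_root_level() helper that zaps only SPTEs at the given level;
see the sketch further down this reply):

  static void tdp_mmu_zap_root(...)
  {
          int level;

          /*
           * Zap leaf SPTEs first and work up one level at a time, so the
           * work done in any single zap, and thus the time spent without
           * yielding, stays bounded.
           */
          for (level = PG_LEVEL_4K; level <= root->role.level; level++)
                  tdp_mmu_zap_root_level(..., level);
  }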

> Zapping a 1gb shadow page
> that is fully populated with 4kb dirty SPTEs also triggers the worst case
> latency due to writing back the struct page accessed/dirty bits for each
> 4kb page, i.e. the two-pass approach is guaranteed to work so long as KVM
> can cleanly zap a 1gb shadow page.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000
>                                           softirq=15759/15759 fqs=5058
>    (t=21016 jiffies g=66453 q=238577)
>   NMI backtrace for cpu 52
>   Call Trace:
>    ...
>    mark_page_accessed+0x266/0x2f0
>    kvm_set_pfn_accessed+0x31/0x40
>    handle_removed_tdp_mmu_page+0x259/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    handle_removed_tdp_mmu_page+0x1c1/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    handle_removed_tdp_mmu_page+0x1c1/0x2e0
>    __handle_changed_spte+0x223/0x2c0
>    zap_gfn_range+0x141/0x3b0
>    kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
>    kvm_mmu_zap_all_fast+0x121/0x190
>    kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
>    kvm_page_track_flush_slot+0x5c/0x80
>    kvm_arch_flush_shadow_memslot+0xe/0x10
>    kvm_set_memslot+0x172/0x4e0
>    __kvm_set_memory_region+0x337/0x590
>    kvm_vm_ioctl+0x49c/0xf80
> 
> Reported-by: David Matlack <dmatlack@google.com>
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index aec97e037a8d..2e28f5e4b761 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -809,6 +809,18 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	gfn_t end = tdp_mmu_max_gfn_host();
>  	gfn_t start = 0;
>  
> +	/*
> +	 * To avoid RCU stalls due to recursively removing huge swaths of SPs,
> +	 * split the zap into two passes.  On the first pass, zap at the 1gb
> +	 * level, and then zap top-level SPs on the second pass.  "1gb" is not
> +	 * arbitrary, as KVM must be able to zap a 1gb shadow page without
> +	 * inducing a stall to allow in-place replacement with a 1gb hugepage.
> +	 *
> +	 * Because zapping a SP recurses on its children, stepping down to
> +	 * PG_LEVEL_4K in the iterator itself is unnecessary.
> +	 */
> +	int zap_level = PG_LEVEL_1G;
> +
>  	/*
>  	 * The root must have an elevated refcount so that it's reachable via
>  	 * mmu_notifier callbacks, which allows this path to yield and drop
> @@ -825,12 +837,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  	rcu_read_lock();
>  
> -	/*
> -	 * No need to try to step down in the iterator when zapping an entire
> -	 * root, zapping an upper-level SPTE will recurse on its children.
> -	 */
> +start:
>  	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
> -				   root->role.level, start, end) {
> +				   zap_level, start, end) {
>  retry:
>  		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
>  			continue;
> @@ -838,6 +847,9 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  		if (!is_shadow_present_pte(iter.old_spte))
>  			continue;
>  
> +		if (iter.level > zap_level)
> +			continue;
> +
>  		if (!shared) {
>  			tdp_mmu_set_spte(kvm, &iter, 0);
>  		} else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0)) {
> @@ -846,6 +858,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  		}
>  	}
>  
> +	if (zap_level < root->role.level) {
> +		zap_level = root->role.level;
> +		goto start;
> +	}

This is probably just personal opinion, but I find the 2-iteration goto
loop harder to understand than just open-coding the 2 passes.

e.g.

  static void tdp_mmu_zap_root(...)
  {
          /*
           * To avoid RCU stalls due to recursively removing huge swaths of SPs,
           * split the zap into two passes.  On the first pass, zap at the 1gb
           * level, and then zap top-level SPs on the second pass.  "1gb" is not
           * arbitrary, as KVM must be able to zap a 1gb shadow page without
           * inducing a stall to allow in-place replacement with a 1gb hugepage.
           *
           * Because zapping a SP recurses on its children, stepping down to
           * PG_LEVEL_4K in the iterator itself is unnecessary.
           */
          tdp_mmu_zap_root_level(..., PG_LEVEL_1G);
          tdp_mmu_zap_root_level(..., root->role.level);
  }
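
where a sketch of the helper (my illustration only; it basically lifts
the loop body from this patch, and the name is hypothetical) could be:

  static void tdp_mmu_zap_root_level(struct kvm *kvm,
                                     struct kvm_mmu_page *root,
                                     bool shared, int zap_level)
  {
          struct tdp_iter iter;
          gfn_t end = tdp_mmu_max_gfn_host();
          gfn_t start = 0;

          /* Caller holds rcu_read_lock(), as in tdp_mmu_zap_root(). */
          for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
                                     zap_level, start, end) {
  retry:
                  if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
                          continue;

                  if (!is_shadow_present_pte(iter.old_spte))
                          continue;

                  /* SPTEs above the target level are zapped by a later pass. */
                  if (iter.level > zap_level)
                          continue;

                  if (!shared)
                          tdp_mmu_set_spte(kvm, &iter, 0);
                  else if (!tdp_mmu_set_spte_atomic(kvm, &iter, 0))
                          goto retry;
          }
  }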

Or just go ahead and zap each level from 4K up to root->role.level as I
mentioned above.

> +
>  	rcu_read_unlock();
>  }
>  
> -- 
> 2.34.1.448.ga2b2bfdf31-goog
> 
