All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Vipin Sharma <vipinsh@google.com>,
	 David Matlack <dmatlack@google.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
Date: Wed, 23 Jul 2025 13:34:59 -0700	[thread overview]
Message-ID: <aIFHc83PtfB9fkKB@google.com> (raw)
In-Reply-To: <20250707224720.4016504-4-jthoughton@google.com>

On Mon, Jul 07, 2025, James Houghton wrote:
> From: Vipin Sharma <vipinsh@google.com>
> 
> Use MMU read lock to recover TDP MMU NX huge pages. Iterate

Wrap at ~75 chars.

> over the huge pages list under tdp_mmu_pages_lock protection and
> unaccount the page before dropping the lock.
> 
> We must not zap an SPTE if:

No pronouns!

> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
> 
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so we cannot safely zap it.
> 
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table. This should never happen.
> 
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal is increasing
> the window of that race. Functionally, it is okay, because accounting
> doesn't protect against iTLB multi-hit bug, it is there purely to
> prevent KVM from bouncing a gfn between two page sizes. The only
> downside is that a vCPU will end up doing more work in tearing down all
> the child SPTEs. This should be a very rare race.
> 
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock. This optimizaion is done to solve a guest jitter issue on
> Windows VM which was observing an increase in network latency.

With slight tweaking:

Use MMU read lock to recover TDP MMU NX huge pages.  To prevent
concurrent modification of the list of potential huge pages, iterate over
the list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.

Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
read lock, which solves a guest jitter issue on Windows VMs which were
observing an increase in network latency.

Do not zap an SPTE if:
- The SPTE is a root page.
- The SPTE does not point at the SP's page table.

If the SPTE does not point at the SP's page table, then something else
has change the SPTE, so KVM cannot safely zap it.

Warn if zapping SPTE fails and current SPTE is still pointing to same
page table, as it should be impossible for the CMPXCHG to fail due to all
other write scenarios being mutually exclusive.

There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable
small page.  Unaccounting sooner during the list traversal increases the
window of that race, but functionally, it is okay.  Accounting doesn't
protect against iTLB multi-hit bug, it is there purely to prevent KVM
from bouncing a gfn between two page sizes. The only  downside is that a
vCPU will end up doing more work in tearing down all  the child SPTEs.
This should be a very rare race.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> Co-developed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 107 ++++++++++++++++++++++++-------------
>  arch/x86/kvm/mmu/tdp_mmu.c |  42 ++++++++++++---
>  2 files changed, 105 insertions(+), 44 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b074f7bb5cc58..7df1b4ead705b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
>  	return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
>  }
>  
> +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> +					     struct kvm_mmu_page *sp)
> +{
> +	struct kvm_memory_slot *slot = NULL;
> +
> +	/*
> +	 * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> +	 * it the test cannot possibly return true.  On the other hand, if any
> +	 * memslot has logging enabled, chances are good that all of them do, in
> +	 * which case unaccount_nx_huge_page() is much cheaper than zapping the
> +	 * page.

And largely irrelevant, because KVM should unaccount the NX no matter what.  I
kinda get what you're saying, but honestly it adds a lot of confusion, especially
since unaccount_nx_huge_page() is in the caller.

> +	 *
> +	 * If a memslot update is in progress, reading an incorrect value of
> +	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> +	 * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> +	 * nonzero, the page will be zapped unnecessarily.  Either way, this only
> +	 * affects efficiency in racy situations, and not correctness.
> +	 */
> +	if (atomic_read(&kvm->nr_memslots_dirty_logging)) {

Short-circuit the function to decrease indentation, and so that "slot" doesn't
need to be NULL-initialized.

> +		struct kvm_memslots *slots;
> +
> +		slots = kvm_memslots_for_spte_role(kvm, sp->role);
> +		slot = __gfn_to_memslot(slots, sp->gfn);

Then this can be:

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);

without creating a stupid-long line.

> +		WARN_ON_ONCE(!slot);

And then:

	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);

With a comment cleanup:

	struct kvm_memory_slot *slot;

	/*
	 * Skip the memslot lookup if dirty tracking can't possibly be enabled,
	 * as memslot lookups are relatively expensive.
	 *
	 * If a memslot update is in progress, reading an incorrect value of
	 * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
	 * zero, KVM will  do an unnecessary memslot lookup;  if it is becoming
	 * nonzero, the page will be zapped unnecessarily.  Either way, this
	 * only affects efficiency in racy situations, and not correctness.
	 */
	if (!atomic_read(&kvm->nr_memslots_dirty_logging))
		return false;

	slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
	if (WARN_ON_ONCE(!slot))
		return false;

	return kvm_slot_dirty_track_enabled(slot);
> @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
>  	rcu_read_lock();
>  
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(nx_huge_pages))
> +#ifdef CONFIG_X86_64

These #ifdefs still make me sad, but I also still think they're the least awful
solution.  And hopefully we will jettison 32-bit sooner than later :-)

  reply	other threads:[~2025-07-23 20:35 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-07 22:47 [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock James Houghton
2025-07-07 22:47 ` [PATCH v5 1/7] KVM: x86/mmu: Track TDP MMU NX huge pages separately James Houghton
2025-08-19 17:57   ` Sean Christopherson
2025-07-07 22:47 ` [PATCH v5 2/7] KVM: x86/mmu: Rename kvm_tdp_mmu_zap_sp() to better indicate its purpose James Houghton
2025-07-07 22:47 ` [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock James Houghton
2025-07-23 20:34   ` Sean Christopherson [this message]
2025-07-28 18:07     ` James Houghton
2025-07-28 18:17       ` David Matlack
2025-07-28 21:38         ` Sean Christopherson
2025-07-28 21:48           ` James Houghton
2025-08-01 18:17             ` David Matlack
2025-08-01 22:00               ` Sean Christopherson
2025-08-12 19:21                 ` David Matlack
2025-07-07 22:47 ` [PATCH v5 4/7] KVM: x86/mmu: Only grab RCU lock for nx hugepage recovery for TDP MMU James Houghton
2025-07-23 20:38   ` Sean Christopherson
2025-07-28 17:51     ` James Houghton
2025-07-07 22:47 ` [PATCH v5 5/7] KVM: selftests: Introduce a selftest to measure execution performance James Houghton
2025-07-23 20:50   ` Sean Christopherson
2025-07-29  0:18     ` James Houghton
2025-07-07 22:47 ` [PATCH v5 6/7] KVM: selftests: Provide extra mmap flags in vm_mem_add() James Houghton
2025-07-07 22:47 ` [PATCH v5 7/7] KVM: selftests: Add an NX huge pages jitter test James Houghton
2025-07-23 21:04   ` Sean Christopherson
2025-07-28 18:40     ` James Houghton
2025-08-01 14:11       ` Sean Christopherson
2025-08-01 18:45         ` James Houghton
2025-08-01 22:30           ` Sean Christopherson
2025-07-23 20:44 ` [PATCH v5 0/7] KVM: x86/mmu: Run TDP MMU NX huge page recovery under MMU read lock Sean Christopherson
2025-07-29  0:19   ` James Houghton
2025-08-19 23:12 ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aIFHc83PtfB9fkKB@google.com \
    --to=seanjc@google.com \
    --cc=dmatlack@google.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=vipinsh@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.