From: Sean Christopherson <seanjc@google.com>
To: Vipin Sharma <vipinsh@google.com>
Cc: pbonzini@redhat.com, dmatlack@google.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 4/4] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
Date: Thu, 29 Aug 2024 13:06:55 -0700 [thread overview]
Message-ID: <ZtDU3x9t66BnpPt_@google.com> (raw)
In-Reply-To: <20240829191135.2041489-5-vipinsh@google.com>
On Thu, Aug 29, 2024, Vipin Sharma wrote:
> @@ -1806,7 +1832,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
> struct kvm_mmu_page *sp;
> bool flush = false;
>
> - lockdep_assert_held_write(&kvm->mmu_lock);
> + lockdep_assert_held_read(&kvm->mmu_lock);
> /*
> * Zapping TDP MMU shadow pages, including the remote TLB flush, must
> * be done under RCU protection, because the pages are freed via RCU
> @@ -1815,8 +1841,11 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
> rcu_read_lock();
>
> for ( ; to_zap; --to_zap) {
> - if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages))
> + spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> + if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) {
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> break;
> + }
>
> /*
> * We use a separate list instead of just using active_mmu_pages
> @@ -1832,16 +1861,29 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
> WARN_ON_ONCE(!sp->role.direct);
>
> /*
> - * Unaccount and do not attempt to recover any NX Huge Pages
> - * that are being dirty tracked, as they would just be faulted
> - * back in as 4KiB pages. The NX Huge Pages in this slot will be
> + * Unaccount the shadow page before zapping its SPTE so as to
> + * avoid bouncing tdp_mmu_pages_lock more than is necessary.
> + * Clearing nx_huge_page_disallowed before zapping is safe, as
> + * the flag doesn't protect against iTLB multi-hit, it's there
> + * purely to prevent bouncing the gfn between an NX huge page
> +	 * and an X small page. A vCPU could get stuck tearing down
> + * the shadow page, e.g. if it happens to fault on the region
> + * before the SPTE is zapped and replaces the shadow page with
> + * an NX huge page and get stuck tearing down the child SPTEs,
> + * but that is a rare race, i.e. shouldn't impact performance.
> + */
> + unaccount_nx_huge_page(kvm, sp);
> + spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +
> + /*
> + * Don't bother zapping shadow pages if the memslot is being
> + * dirty logged, as the relevant pages would just be faulted back
> + * in as 4KiB pages. Potential NX Huge Pages in this slot will be
> * recovered, along with all the other huge pages in the slot,
> * when dirty logging is disabled.
> */
> - if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> - unaccount_nx_huge_page(kvm, sp);
> - else
> - flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
> + if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> + flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
> WARN_ON_ONCE(sp->nx_huge_page_disallowed);
>
> if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
Tsk, tsk. There's a cond_resched_rwlock_write() lurking here.
Which is a decent argument/segue for my main comment: I would very, very strongly
prefer to have a single flow for the control logic. Almost all of the code is
copy+pasted to the TDP MMU, and there are unnecessary and confusing differences between
the two flows. E.g. the TDP MMU does unaccount_nx_huge_page() preemptively, while
the shadow MMU does not.
The TDP MMU has a few extra "locks" (rcu_read_lock() + tdp_mmu_pages_lock), but
I don't see any harm in taking those in the shadow MMU flow. KVM holds a spinlock
and so RCU is practically meaningless, and tdp_mmu_pages_lock will never be
contended since it's only ever taken under mmu_lock.
If we _really_ wanted to bury tdp_mmu_pages_lock in the TDP MMU, it could be
passed in as an optional parameter.  I'm not super opposed to that, it just looks
ugly, and I'm not convinced it's worth the effort. Maybe a middle ground would
be to provide a helper so that tdp_mmu_pages_lock can stay 64-bit only, but this
code doesn't need #ifdefs?
Anyways, if the helper is __always_inline, there shouldn't be an indirect call
to recover_page(). Completely untested, but this is the basic gist.
---
typedef bool (*nx_recover_page_t)(struct kvm *kvm, struct kvm_mmu_page *sp,
				  struct list_head *invalid_pages);

static __always_inline void kvm_recover_nx_huge_pages(struct kvm *kvm,
						      bool shared,
						      struct list_head *pages,
						      unsigned long nr_pages,
						      nx_recover_page_t recover_page)
{
	unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
	unsigned long to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int rcu_idx;
	bool flush = false;

	kvm_lockdep_assert_mmu_lock_held(kvm, shared);

	rcu_idx = srcu_read_lock(&kvm->srcu);
	rcu_read_lock();

	for ( ; to_zap; --to_zap) {
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		if (list_empty(pages)) {
			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
			break;
		}

		sp = list_first_entry(pages, struct kvm_mmu_page,
				      possible_nx_huge_page_link);
		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
		WARN_ON_ONCE(!sp->role.direct);

		unaccount_nx_huge_page(kvm, sp);
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
			flush |= recover_page(kvm, sp, &invalid_list);

		WARN_ON_ONCE(sp->nx_huge_page_disallowed);

		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
			rcu_read_unlock();

			if (shared)
				cond_resched_rwlock_read(&kvm->mmu_lock);
			else
				cond_resched_rwlock_write(&kvm->mmu_lock);

			flush = false;
			rcu_read_lock();
		}
	}

	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

	rcu_read_unlock();
	srcu_read_unlock(&kvm->srcu, rcu_idx);
}