Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Vipin Sharma <vipinsh@google.com>
Cc: pbonzini@redhat.com, dmatlack@google.com, kvm@vger.kernel.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock
Date: Mon, 26 Aug 2024 07:34:35 -0700	[thread overview]
Message-ID: <ZsySe8tpDyZAvb6l@google.com> (raw)
In-Reply-To: <20240823223800.GB678289.vipinsh@google.com>

On Fri, Aug 23, 2024, Vipin Sharma wrote:
> On 2024-08-19 15:12:42, Sean Christopherson wrote:
> > On Mon, Aug 19, 2024, Vipin Sharma wrote:
> > Huh.  Actually, after a lot of fiddling and staring, there's a simpler solution,
> > and it would force us to comment/document an existing race that's subly ok.
> > 
> > For the dirty logging case, the result of kvm_mmu_sp_dirty_logging_enabled() is
> > visible to the NX recovery thread before the memslot update task is guaranteed
> > to finish (or even start) kvm_mmu_zap_collapsible_sptes().  I.e. KVM could
> > unaccount an NX shadow page before it is zapped, and that could lead to a vCPU
> > replacing the shadow page with an NX huge page.
> > 
> > Functionally, that's a-ok, because the accounting doesn't provide protection
> > against iTLB multi-hit bug, it's there purely to prevent KVM from bouncing a gfn
> > between an NX hugepage and an execute small page.  The only downside to the vCPU
> > doing the replacement is that the vCPU will get saddle with tearing down all the
> > child SPTEs.  But this should be a very rare race, so I can't imagine that would
> > be problematic in practice.
> 
> I am worried that whenever this happens it might cause guest jitter
> which we are trying to avoid as handle_changed_spte() might be keep a
> vCPU busy for sometime.

That race already exists today, and your series already extends the ways in which
the race can be hit.  My suggestion is to (a) explicit document that race and (b)
expand the window in which it can occur to also apply to dirty logging being off.

> > void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm, unsigned long to_zap)
> > 
> > 		/*
> > 		 * Unaccount the shadow page before zapping its SPTE so as to
> > 		 * avoid bouncing tdp_mmu_pages_lock() more than is necessary.
> > 		 * Clearing nx_huge_page_disallowed before zapping is safe, as
> > 		 * the flag doesn't protect against iTLB multi-hit, it's there
> > 		 * purely to prevent bouncing the gfn between an NX huge page
> > 		 * and an X small spage.  A vCPU could get stuck tearing down
> > 		 * the shadow page, e.g. if it happens to fault on the region
> > 		 * before the SPTE is zapped and replaces the shadow page with
> > 		 * an NX huge page and get stuck tearing down the child SPTEs,
> > 		 * but that is a rare race, i.e. shouldn't impact performance.
> > 		 */
> > 		unaccount_nx_huge_page(kvm, sp);
> 
> Might cause jitter. A long jitter might cause an escalation.
> 
> What if I do not unaccount in the beginning, and  move page to the end
> of the list only if it is still in the list?

The race between kvm_mmu_sp_dirty_logging_enabled() vs. kvm_tdp_mmu_map() vs.
kvm_mmu_zap_collapsible_sptes() still exists.

> If zapping failed because some other flow might be removing this page but it
> still in the possible_nx_huge_pages list, then just move it to the end. The
> thread which is removing will remove it from the list eventually.
> 
> for ( ; to_zap; --to_zap) {
> 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 	if (list_empty(&kvm->arch.possible_tdp_mmu_nx_huge_pages)) {
> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 		break;
> 	}
> 
> 	sp = list_first_entry(&kvm->arch.possible_tdp_mmu_nx_huge_pages,
> 			      struct kvm_mmu_page,
> 			      possible_nx_huge_page_link);
> 
> 	WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
> 	WARN_ON_ONCE(!sp->role.direct);
> 
> 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 
> 
> 	/*
> 	 * Don't bother zapping shadow pages if the memslot is being
> 	 * dirty logged, as the relevant pages would just be faulted
> 	 * back in as 4KiB pages.  Potential NX Huge Pages in this slot
> 	 * will be recovered, along with all the other huge pages in
> 	 * the slot, when dirty logging is disabled.
> 	 */
> 	if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
> 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 		unaccount_nx_huge_page(kvm, sp);
> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> 	} else if (tdp_mmu_zap_possible_nx_huge_page(kvm, sp)) {
> 		flush = true;
> 		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
> 	} else {
> 		/*
> 		 * Try again in future if the page is still in the
> 		 * list
> 		 */
> 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> 		if (!list_empty(&sp->possible_nx_huge_page_link))
> 			list_move_tail(&sp->possible_nx_huge_page_link,
> 			kvm-> &kvm->arch.possible_nx_huge_pages);

This is unsafe.  The only thing that prevents a use-after-free of "sp" is the fact
that this task holds rcu_read_lock().  The sp could already been queued for freeing
via call_rcu().

> 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> 	}
> 
> 	/* Resched code below */
> }

next prev parent reply	other threads:[~2024-08-26 14:34 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-12 17:13 [PATCH 0/2] KVM: x86/mmu: Run NX huge page recovery under MMU read lock Vipin Sharma
2024-08-12 17:13 ` [PATCH 1/2] KVM: x86/mmu: Split NX hugepage recovery flow into TDP and non-TDP flow Vipin Sharma
2024-08-16 23:29   ` Sean Christopherson
2024-08-19 17:20     ` Vipin Sharma
2024-08-19 17:28       ` David Matlack
2024-08-19 18:31         ` Sean Christopherson
2024-08-19 21:57           ` Vipin Sharma
2024-08-12 17:13 ` [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock Vipin Sharma
2024-08-14  9:33   ` kernel test robot
2024-08-14 18:23     ` Vipin Sharma
2024-08-14 22:50   ` kernel test robot
2024-08-15 16:42     ` Vipin Sharma
2024-08-16 23:38   ` Sean Christopherson
2024-08-19 17:34     ` Vipin Sharma
2024-08-19 22:12       ` Sean Christopherson
2024-08-23 22:38         ` Vipin Sharma
2024-08-26 14:34           ` Sean Christopherson [this message]
2024-08-26 19:24             ` Vipin Sharma
2024-08-26 19:51               ` Sean Christopherson
2024-08-19 22:19   ` Sean Christopherson
2024-08-23 19:27     ` Vipin Sharma
2024-08-23 20:45       ` Sean Christopherson
2024-08-23 21:41         ` Vipin Sharma

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZsySe8tpDyZAvb6l@google.com \
    --to=seanjc@google.com \
    --cc=dmatlack@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=vipinsh@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).