public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Xu Yilun <yilun.xu@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	 Yan Zhao <yan.y.zhao@intel.com>, Kai Huang <kai.huang@intel.com>
Subject: Re: [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
Date: Thu, 18 Jan 2024 09:22:37 -0800	[thread overview]
Message-ID: <ZaleXWn7reOI5yJY@google.com> (raw)
In-Reply-To: <ZalUDLVJSVN/rEf2@yilunxu-OptiPlex-7050>

On Fri, Jan 19, 2024, Xu Yilun wrote:
> On Tue, Jan 09, 2024 at 05:20:45PM -0800, Sean Christopherson wrote:
> > Retry page faults without acquiring mmu_lock if the resolved gfn is covered
> > by an active invalidation.  Contending for mmu_lock is especially
> > problematic on preemptible kernels as the mmu_notifier invalidation task
> > will yield mmu_lock (see rwlock_needbreak()), delay the in-progress
> 
> Is it possible fault-in task avoids contending mmu_lock by using _trylock()?
> Like:
> 
> 	while (!read_trylock(&vcpu->kvm->mmu_lock))
> 		cpu_relax();
> 
> 	if (is_page_fault_stale(vcpu, fault))
> 		goto out_unlock;
>   
> 	r = kvm_tdp_mmu_map(vcpu, fault);
> 
> out_unlock:
> 	read_unlock(&vcpu->kvm->mmu_lock)

It's definitely possible, but the downsides far outweigh any potential benefits.

Doing trylock means the CPU is NOT put into the queue for acquiring the lock,
which means that forward progress isn't guaranteed.  E.g. in a pathological
scenario (and by "pathological", I mean if NUMA balancing or KSM is active ;-)),
it's entirely possible for a near-endless stream of mmu_lock writers to be in
the queue, thus preventing the vCPU from acquiring mmu_lock in a timely manner.

And hacking the page fault path to bypass KVM's lock contention detection would
be a very willful, deliberate violation of the intent of the MMU's yielding logic
for preemptible kernels.

That said, I would love to revisit KVM's use of rwlock_needbreak(), at least in
the TDP MMU.  As evidenced by this rash of issues, it's not at all obvious that
yielding on mmu_lock contention is *ever* a net positive for KVM, or even for the
kernel.  The shadow MMU is probably a different story since it can't parallelize
page faults with other operations, e.g. yielding in kvm_zap_obsolete_pages() to
allow vCPUs to make forward progress is probably a net positive.

But AFAIK, no one has done any testing to prove that yielding on contention in
the TDP MMU is actually a good thing.  I'm 99% certain the only reason the TDP
MMU yields on contention is because the shadow MMU yields on contention, i.e.
I'm confident that no one ever did performance testing to shadow that there is
any benefit whatsoever to yielding mmu_lock in the TDP MMU.

> > invalidation, and ultimately increase the latency of resolving the page
> > fault.  And in the worst case scenario, yielding will be accompanied by a
> > remote TLB flush, e.g. if the invalidation covers a large range of memory
> > and vCPUs are accessing addresses that were already zapped.
> 
> This case covers all usage of mmu_invalidate_retry_gfn(), is it? Should
> we also consider vmx_set_apic_access_page_addr()?

Nah, reloading the APIC access page is an very, very infrequent operation.  And
post-boot (of the VM), the page will only be reloaded if it's migrated by the
primary MMU, i.e. after a relevant mmu_notifier operation.  And the APIC access
page is a one-off allocation, i.e. it has its own VMA of PAGE_SIZE, so even if
vmx_set_apic_access_page_addr() does manage to trigger contention with a relevant
mmu_notifier event, e.g. races ahead of page migration completion, the odds of it
forcing KVM's mmu_notifier handler to yield mmu_lock are low (maybe impossible?)
because the invaldation event shouldn't be spanning multipl pages.

  reply	other threads:[~2024-01-18 17:22 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-10  1:20 [PATCH v2] KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing Sean Christopherson
2024-01-11  2:11 ` Yan Zhao
2024-01-12  7:48 ` Yuan Yao
2024-01-12 16:31   ` Sean Christopherson
2024-01-18 16:38 ` Xu Yilun
2024-01-18 17:22   ` Sean Christopherson [this message]
2024-01-19  5:50     ` Xu Yilun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZaleXWn7reOI5yJY@google.com \
    --to=seanjc@google.com \
    --cc=kai.huang@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yilun.xu@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox