public inbox for kvm@vger.kernel.org
From: David Matlack <dmatlack@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: maobibo <maobibo@loongson.cn>,
	Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	Tianrui Zhao <zhaotianrui@loongson.cn>,
	Huacai Chen <chenhuacai@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Anup Patel <anup@brainfault.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>
Subject: Re: [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG
Date: Mon, 15 Apr 2024 10:20:34 -0700	[thread overview]
Message-ID: <Zh1h4gfOpImWHQsC@google.com> (raw)
In-Reply-To: <CALzav=coESqsXnLbX2emiO_P12WrPZh9WutxF6JWWqwX-6RFDg@mail.gmail.com>

On 2024-04-12 09:14 AM, David Matlack wrote:
> On Thu, Apr 4, 2024 at 11:17 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Apr 04, 2024, David Matlack wrote:
> > > > I don't love the idea of adding more arch specific MMU behavior (going the wrong
> > > > direction), but it doesn't seem like an unreasonable approach in this case.
> > >
> > > I wonder if this is being overly cautious.
> >
> > Probably.  "Lazy" is another word for it ;-)
> >
> > > I would expect only more benefit on architectures that more aggressively take
> > > the mmu_lock on vCPU threads during faults. The more lock acquisition on vCPU
> > > threads, the more this patch will help reduce vCPU starvation during
> > > CLEAR_DIRTY_LOG.
> > >
> > > Hm, perhaps testing with ept=N (which will use the write-lock for even
> > > dirty logging faults) would be a way to increase confidence in the
> > > effect on other architectures?
> >
> > Turning off the TDP MMU would be more representative, just manually disable the
> > fast-path, e.g.
> 
> Good idea. I'm actually throwing in some writable module parameters
> too to make it easy to toggle between configurations.
> 
> I'll report back when I have some data.

tl;dr

 * My patch likely _will_ regress migration performance on other architectures.
   Thank you Bibo and Sean for keeping me honest here.

 * I suspect the original issue my patch is trying to fix is actually specific
   to the way the TDP MMU does eager page splitting and a more targeted fix is
   warranted.

---

To evaluate my patch I tested on x86 with different mmu_lock configurations
to simulate other architectures.

 Config 1: tdp_mmu=Y fast_page_fault_read_lock=N eager_page_split=Y
 Config 2: tdp_mmu=Y fast_page_fault_read_lock=Y eager_page_split=Y
 Config 3: tdp_mmu=N fast_page_fault_read_lock=N eager_page_split=N

Note: "fast_page_fault_read_lock" is a non-upstream parameter I introduced
that adds a read_lock/unlock() pair around fast_page_fault().

Config 1 is vanilla KVM/x86. Config 2 emulates KVM/arm64. Config 3 emulates
LoongArch if LoongArch added support for lockless write-protection fault
handling.

The test I ran was a Live Migration of a 16VCPU 64GB VM running an aggressive
write-heavy workload. To compare runs I evaluated 3 metrics:

 * Duration of pre-copy.
 * Amount of dirty memory going into post-copy.
 * Total CPU usage of CLEAR_DIRTY_LOG.

The following table shows how each metric changed after adding my patch to drop
mmu_lock during CLEAR_DIRTY_LOG.

          | Precopy Duration | Post-Copy Dirty | CLEAR_DIRTY_LOG CPU
 ---------|------------------|-----------------|---------------------
 Config 1 | -1%              | -1%             | +6%
 Config 2 | -1%              | +1%             | +123%
 Config 3 | +32%             | +158%           | +5200%

Configs 2 and 3 both show regressions, with Config 3 severely regressed in all
3 dimensions.

Given these regressions, I started rethinking the original issue this patch is
trying to fix.

The dips in guest performance during CLEAR_DIRTY_LOG occurred during the first
pre-copy pass but not during subsequent passes. One thing unique to the first
pass is eager page splitting.

Ah ha, a theory! The TDP MMU allocates shadow pages while holding the mmu_lock
during eager page splitting and only drops the lock if need_resched=True or a
GFP_NOWAIT allocation fails. If neither occurs, CLEAR_DIRTY_LOG could potentially
hold mmu_lock in write-mode for a long time.
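
As a toy model of that theory (the helper and numbers are illustrative, not
taken from KVM): the eager-split loop only resets its critical section when
one of the two drop conditions fires, so if neither fires, the whole pass
runs under one lock acquisition.

```c
/* Toy model of the eager-page-splitting loop described above: mmu_lock is
 * dropped only when need_resched is set or a GFP_NOWAIT allocation fails.
 * resched_interval models how often either condition fires (0 = never). */
static long longest_locked_streak(long nr_huge_pages, long resched_interval)
{
	long streak = 0, longest = 0;

	for (long i = 0; i < nr_huge_pages; i++) {
		if (resched_interval && i && (i % resched_interval == 0))
			streak = 0;	/* write_unlock() + write_lock() */
		if (++streak > longest)
			longest = streak;
	}
	return longest;
}
```

For the 64GB VM in the test above (~32768 2MiB huge pages), if neither
condition fires the lock is held across all 32768 splits in a single
critical section.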

Second, the host platform where we saw the dips has nx_huge_pages=Y. I suspect
the long CLEAR_DIRTY_LOG calls are blocking vCPUs taking steady-state
faults for NX Huge Pages, causing the dips in performance.

This theory also explains why we (Google) haven't seen similar drops in guest
performance when using !manual-protect, as in that configuration the TDP MMU
does eager page splitting under the read-lock instead of write-lock.

If this is all true, then a better / more targeted fix for this issue would be
to drop mmu_lock in the TDP MMU eager page splitting path. For example, we
could limit the "allocate under lock" behavior to only when the read-lock is
held, e.g.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7dfdc49a6ade..ea34f8232d97 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1472,9 +1472,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
         * If this allocation fails we drop the lock and retry with reclaim
         * allowed.
         */
-       sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
-       if (sp)
-               return sp;
+       if (shared) {
+               sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, nid);
+               if (sp)
+                       return sp;
+       }

        rcu_read_unlock();

I checked the KVM/arm64 eager page splitting code, and it drops the mmu_lock to
allocate page tables. So I suspect no fix is needed there and this is, in fact,
purely an x86-specific issue.

Thread overview: 16+ messages
2024-04-02 21:36 [PATCH v2] KVM: Aggressively drop and reacquire mmu_lock during CLEAR_DIRTY_LOG David Matlack
2024-04-03  1:50 ` maobibo
2024-04-03 17:20   ` Sean Christopherson
2024-04-04 16:29   ` David Matlack
2024-04-04 17:10     ` Sean Christopherson
2024-04-04 18:12       ` David Matlack
2024-04-04 18:17         ` Sean Christopherson
2024-04-12 16:14           ` David Matlack
2024-04-15 17:20             ` David Matlack [this message]
2024-04-15 20:00               ` Sean Christopherson
2024-04-18 18:50                 ` David Matlack
2024-04-18 19:39                   ` Sean Christopherson
2024-04-07  2:25       ` maobibo
2024-04-12 16:12         ` David Matlack
2024-04-15  1:21           ` maobibo
2024-04-07  1:36     ` maobibo
