Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Maxim Levitsky <mlevitsk@redhat.com>
Cc: James Houghton <jthoughton@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	 David Matlack <dmatlack@google.com>,
	David Rientjes <rientjes@google.com>,
	Marc Zyngier <maz@kernel.org>,
	 Oliver Upton <oliver.upton@linux.dev>,
	Wei Xu <weixugc@google.com>, Yu Zhao <yuzhao@google.com>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	kvm@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly
Date: Tue, 18 Feb 2025 17:13:07 -0800	[thread overview]
Message-ID: <Z7UwI-9zqnhpmg30@google.com> (raw)
In-Reply-To: <025b409c5ca44055a5f90d2c67e76af86617e222.camel@redhat.com>

On Tue, Feb 18, 2025, Maxim Levitsky wrote:
> On Tue, 2025-02-04 at 00:40 +0000, James Houghton wrote:
> > By aging sptes locklessly with the TDP MMU and the shadow MMU, neither
> > vCPUs nor reclaim (mmu_notifier_invalidate_range*) will get stuck
> > waiting for aging. This contention reduction improves guest performance
> > and saves a significant amount of Google Cloud's CPU usage, and it has
> > valuable improvements for ChromeOS, as Yu has mentioned previously[1].
> > 
> > Please see v8[8] for some performance results using
> > access_tracking_perf_test patched to use MGLRU.
> > 
> > Neither access_tracking_perf_test nor mmu_stress_test trigger any
> > splats (with CONFIG_LOCKDEP=y) with the TDP MMU and with the shadow MMU.
> 
> 
> Hi, I have a question about this patch series and about the
> access_tracking_perf_test:
> 
> Some time ago, I investigated a failure in access_tracking_perf_test which
> shows up in our CI.
> 
> The root cause was that 'folio_clear_idle' doesn't clear the idle bit when
> MGLRU is enabled, and overall I got the impression that MGLRU is not
> compatible with idle page tracking.
>
> I thought that this patch series and the 'mm: multi-gen LRU: Have secondary
> MMUs participate in MM_WALK' patch series could address this but the test
> still fails.
> 
> 
> For the reference the exact problem is:
> 
> 1. Idle bits for guest memory under test are set via /sys/kernel/mm/page_idle/bitmap
> 
> 2. Guest dirties memory, which leads to A/D bits being set in the secondary mappings.
> 
> 3. A NUMA autobalance code write protects the guest memory. KVM in response
>    evicts the SPTE mappings with A/D bit set, and while doing so tells mm
>    that pages were accessed using 'folio_mark_accessed' (via kvm_set_page_accessed (*) )
>    but due to MLGRU the call doesn't clear the idle bit and thus all the traces
>    of the guest access disappear and the kernel thinks that the page is still idle.
> 
> I can say that the root cause of this is that folio_mark_accessed doesn't do
> what it supposed to do.
> 
> Calling 'folio_clear_idle(folio);' in MLGRU case in folio_mark_accessed()
> will probably fix this but I don't have enough confidence to say if this is
> all that is needed to fix this.  If this is the case I can send a patch.

My understanding is that the behavior is deliberate.  Per Yu[1], page_idle/bitmap
effectively isn't supported by MGLRU.

[1] https://lore.kernel.org/all/CAOUHufZeADNp_y=Ng+acmMMgnTR=ZGFZ7z-m6O47O=CmJauWjw@mail.gmail.com

> This patch makes the test pass (but only on 6.12 kernel and below, see below):
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 59f30a981c6f..2013e1f4d572 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -460,7 +460,7 @@ void folio_mark_accessed(struct folio *folio)
>  {
>         if (lru_gen_enabled()) {
>                 folio_inc_refs(folio);
> -               return;
> +               goto clear_idle_bit;
>         }
>  
>         if (!folio_test_referenced(folio)) {
> @@ -485,6 +485,7 @@ void folio_mark_accessed(struct folio *folio)
>                 folio_clear_referenced(folio);
>                 workingset_activation(folio);
>         }
> +clear_idle_bit:
>         if (folio_test_idle(folio))
>                 folio_clear_idle(folio);
>  }
> 
> 
> To always reproduce this, it is best to use a patch to make the test run in a
> loop, like below (although the test fails without this as well).

..

> With the above patch applied, you will notice after 4-6 iterations that the
> number of still idle pages soars:
> 
> Populating memory             : 0.798882357s

...

> vCPU0: idle pages: 132558 out of 262144, failed to mark idle: 0 no pfn: 0
> Mark memory idle              : 2.711946690s
> Writing to idle memory        : 0.302882502s
> 
> ...
> 
> (*) Turns out that since kernel 6.13, this code that sets accessed bit in the
> primary paging structure, when the secondary was zapped was *removed*. I
> bisected this to commit:
> 
> 66bc627e7fee KVM: x86/mmu: Don't mark "struct page" accessed when zapping SPTEs
> 
> So now the access_tracking_test is broken regardless of MGLRU.

Just to confirm, do you see failures on 6.13 with MGLRU disabled?  

> Any ideas on how to fix all this mess?

The easy answer is to skip the test if MGLRU is in use, or if NUMA balancing is
enabled.  In a real-world scenario, if the guest is actually accessing the pages
that get PROT_NONE'd by NUMA balancing, they will be marked accessed when they're
faulted back in.  There's a window where page_idle/bitmap could be read between
making the VMA PROT_NONE and re-accessing the page from the guest, but IMO that's
one of the many flaws of NUMA balancing.

That said, one thing is quite odd.  In the failing case, *half* of the guest pages
are still idle.  That's quite insane.

Aha!  I wonder if in the failing case, the vCPU gets migrated to a pCPU on a
different node, and that causes NUMA balancing to go crazy and zap pretty much
all of guest memory.  If that's what's happening, then a better solution for the
NUMA balancing issue would be to affine the vCPU to a single NUMA node (or hard
pin it to a single pCPU?).

next prev parent reply	other threads:[~2025-02-19  1:13 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-04  0:40 [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly James Houghton
2025-02-04  0:40 ` [PATCH v9 01/11] KVM: Rename kvm_handle_hva_range() James Houghton
2025-02-04  0:40 ` [PATCH v9 02/11] KVM: Add lockless memslot walk to KVM James Houghton
2025-02-14 15:26   ` Sean Christopherson
2025-02-14 19:27     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 03/11] KVM: x86/mmu: Factor out spte atomic bit clearing routine James Houghton
2025-02-04  0:40 ` [PATCH v9 04/11] KVM: x86/mmu: Relax locking for kvm_test_age_gfn() and kvm_age_gfn() James Houghton
2025-02-12 22:07   ` Sean Christopherson
2025-02-13  0:25     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 05/11] KVM: x86/mmu: Rename spte_has_volatile_bits() to spte_needs_atomic_write() James Houghton
2025-02-12 22:09   ` Sean Christopherson
2025-02-13  0:26     ` James Houghton
2025-02-04  0:40 ` [PATCH v9 06/11] KVM: x86/mmu: Skip shadow MMU test_young if TDP MMU reports page as young James Houghton
2025-02-04  0:40 ` [PATCH v9 07/11] KVM: x86/mmu: Only check gfn age in shadow MMU if indirect_shadow_pages > 0 James Houghton
2025-02-04  0:40 ` [PATCH v9 08/11] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock James Houghton
2025-02-04  0:40 ` [PATCH v9 09/11] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock James Houghton
2025-02-04  0:40 ` [PATCH v9 10/11] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs James Houghton
2025-02-04  0:40 ` [PATCH v9 11/11] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns James Houghton
2025-02-15  0:50 ` [PATCH v9 00/11] KVM: x86/mmu: Age sptes locklessly Sean Christopherson
2025-02-18 19:29 ` Maxim Levitsky
2025-02-19  1:13   ` Sean Christopherson [this message]
2025-02-19 18:56     ` James Houghton
2025-02-25 22:00     ` Maxim Levitsky
2025-02-26  0:50       ` Sean Christopherson
2025-02-26 18:39         ` Maxim Levitsky
2025-02-27  0:51           ` Sean Christopherson
2025-02-27  1:54             ` Maxim Levitsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z7UwI-9zqnhpmg30@google.com \
    --to=seanjc@google.com \
    --cc=axelrasmussen@google.com \
    --cc=dmatlack@google.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=rientjes@google.com \
    --cc=weixugc@google.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.