* access_tracking_perf_test kvm selftest doesn't work when Multi-Gen LRU is in use
From: Maxim Levitsky @ 2024-05-15 23:39 UTC
To: kvm; +Cc: Sean Christopherson, Paolo Bonzini, Henry Huang, linux-mm
Hi,
I would like to share a long rabbit-hole dive I did some time ago into why the access_tracking_perf_test selftest
sometimes fails, and why it fails only on some RHEL9 machines.
When it fails you see an error like this:
Populating memory : 0.693662489s
Writing to populated memory : 0.022868074s
Reading from populated memory : 0.009497503s
Mark memory idle : 2.206361533s
Writing to idle memory : 0.282340559s
==== Test Assertion Failure ====
access_tracking_perf_test.c:188: this_cpu_has(X86_FEATURE_HYPERVISOR)
pid=78914 tid=78918 errno=4 - Interrupted system call
1 0x0000000000402e99: mark_vcpu_memory_idle at access_tracking_perf_test.c:188
2 (inlined by) vcpu_thread_main at access_tracking_perf_test.c:240
3 0x000000000040745d: vcpu_thread_main at memstress.c:283
4 0x00007f68e66a1911: ?? ??:0
5 0x00007f68e663f44f: ?? ??:0
vCPU0: Too many pages still idle (123013 out of 262144)
access_tracking_perf_test uses the '/sys/kernel/mm/page_idle/bitmap' interface to:
- run a guest once, which writes to its memory pages and thus allocates and dirties them.
- clear the A/D bits in the primary and secondary translations of the guest pages
  (note that it clears the bits in the actual PTEs only)
- set the so-called 'idle' page flag bit on these pages
  (this bit is private to the page_idle code; it is not used in generic mm code, because
  generic mm code only tracks whether a page is dirty, not whether it was accessed)
- run the guest again, which dirties those memory pages once more.
- use the same 'page_idle' interface to check that most (90%) of the guest pages are now reported as accessed again.
In terms of the page_idle code, a page is reported as not idle (i.e., accessed) if either:
- its idle bit is clear, or
- A/D bits are set in the primary or secondary PTEs that map the page
  (in this case page_idle also clears the idle bit,
  so that subsequent queries won't need to check the PTEs again).
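For reference, here is a minimal sketch of how this userspace interface is driven
(based on Documentation/admin-guide/mm/idle_page_tracking.rst; the helper names are
made up, the offsets and semantics are the documented ones):

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Each 64-bit word at offset (pfn / 64) * 8 covers 64 PFNs. */

static void mark_pfn_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t bits = 1ULL << (pfn % 64);

	/*
	 * Setting a bit marks the page idle and clears the A bits in all
	 * PTEs (and, via MMU notifiers, SPTEs) that map it.
	 */
	pwrite(bitmap_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
}

static bool is_pfn_still_idle(int bitmap_fd, uint64_t pfn)
{
	uint64_t bits = 0;

	/* Reading samples the A bits; an accessed page reads back as 0. */
	pread(bitmap_fd, &bits, sizeof(bits), (pfn / 64) * sizeof(bits));
	return bits & (1ULL << (pfn % 64));
}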
The problem is that sometimes the secondary translations (that is, the SPTEs) are destroyed/flushed by KVM,
which causes KVM to mark the guest pages that were mapped through these SPTEs as accessed:
KVM calls kvm_set_pfn_accessed(), and that call eventually leads to folio_mark_accessed().
This function used to clear the idle bit of the page
(note, though, that it would not set the accessed bits in the primary translation of the page!).
But with MGLRU enabled it no longer does this:
void folio_mark_accessed(struct folio *folio)
{
	if (lru_gen_enabled()) {
		/* The MGLRU path returns early and never touches the idle bit. */
		folio_inc_refs(folio);
		return;
	}

	...

	if (folio_test_idle(folio))
		folio_clear_idle(folio);
}
EXPORT_SYMBOL(folio_mark_accessed);
Thus, when the page_idle code checks the page, it sees no A/D bits in the primary translation,
no A/D bits in the secondary translation (because it no longer exists), and the idle bit still set,
so it considers the page idle, that is, not accessed.
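In other words (a hedged paraphrase of the read side of mm/page_idle.c, not a verbatim copy):

/*
 * On read, page_idle clears-and-samples the A bits of every mapping of
 * the folio, including KVM's via the clear_young MMU notifier, and a
 * folio found to be referenced loses its idle flag.  With the SPTEs
 * already zapped there is nothing left to sample, so only the idle flag
 * decides; and under MGLRU, nothing ever cleared it.
 */
static bool reported_idle(struct folio *folio)
{
	page_idle_clear_pte_refs(folio);	/* rmap walk + clear_young */
	return folio_test_idle(folio);
}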
There is a patch series that seems to fix this, but it apparently wasn't accepted upstream;
I don't know the current status of this work:
https://patchew.org/linux/951fb7edab535cf522def4f5f2613947ed7b7d28.1701853894.git.henry.hj@antgroup.com/
Now the question is: what do you think we should do to fix this?
Should we at least disable the page_idle interface when MGLRU is enabled?
Best regards,
Maxim Levitsky
PS:
A small note on why we started seeing this failure on RHEL 9, and only on some machines:
- RHEL9 has MGLRU enabled, RHEL8 doesn't.
- The machine needs to have more than one NUMA node, because NUMA balancing
  (enabled by default) apparently tries to write-protect the primary PTEs
  of (all?) processes every few seconds, and that causes KVM to flush the
  secondary PTEs (at least with the new TDP MMU):
access_tracking-3448 [091] ....1.. 1380.244666: handle_changed_spte <-tdp_mmu_set_spte
access_tracking-3448 [091] ....1.. 1380.244667: <stack trace>
=> cdc_driver_init
=> handle_changed_spte
=> tdp_mmu_set_spte
=> tdp_mmu_zap_leafs
=> kvm_tdp_mmu_unmap_gfn_range
=> kvm_unmap_gfn_range
=> kvm_mmu_notifier_invalidate_range_start
=> __mmu_notifier_invalidate_range_start
=> change_p4d_range
=> change_protection
=> change_prot_numa
=> task_numa_work
=> task_work_run
=> exit_to_user_mode_prepare
=> syscall_exit_to_user_mode
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
Whether NUMA balancing should do this, or whether NUMA balancing should be enabled by default, is a separate question,
because there are other reasons that can force KVM to invalidate the secondary mappings and trigger this issue.
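(Practical aside for anyone reproducing this: automatic NUMA balancing can be toggled
at runtime, the shell equivalent being 'sysctl kernel.numa_balancing=0'; a minimal
sketch, with the helper name made up:)

#include <fcntl.h>
#include <unistd.h>

/* Disable (or re-enable) automatic NUMA balancing system-wide. */
static int set_numa_balancing(int enable)
{
	int fd = open("/proc/sys/kernel/numa_balancing", O_WRONLY);
	char val = enable ? '1' : '0';
	int ret = -1;

	if (fd >= 0) {
		ret = (write(fd, &val, 1) == 1) ? 0 : -1;
		close(fd);
	}
	return ret;
}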
* Re: access_tracking_perf_test kvm selftest doesn't work when Multi-Gen LRU is in use
From: Sean Christopherson @ 2024-05-21 23:29 UTC
To: Maxim Levitsky; +Cc: kvm, Paolo Bonzini, Henry Huang, linux-mm
On Wed, May 15, 2024, Maxim Levitsky wrote:
> A small note on why we started seeing this failure on RHEL 9, and only on some machines:
>
> - RHEL9 has MGLRU enabled, RHEL8 doesn't.
For a stopgap in KVM selftests, or possibly even a long-term solution in case the
decision is that page_idle will simply have different behavior for MGLRU, couldn't
we tweak the test to not assert if MGLRU is enabled?
E.g. refactor get_module_param_integer() and/or get_module_param() to add
get_sysfs_value_integer() or so, and then do this?
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..1e759df36098 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -123,6 +123,11 @@ static void mark_page_idle(int page_idle_fd, uint64_t pfn)
"Set page_idle bits for PFN 0x%" PRIx64, pfn);
}
+static bool is_lru_gen_enabled(void)
+{
+ return !!get_sysfs_value_integer("/sys/kernel/mm/lru_gen/enabled");
+}
+
static void mark_vcpu_memory_idle(struct kvm_vm *vm,
struct memstress_vcpu_args *vcpu_args)
{
@@ -185,7 +190,8 @@ static void mark_vcpu_memory_idle(struct kvm_vm *vm,
*/
if (still_idle >= pages / 10) {
#ifdef __x86_64__
- TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR),
+ TEST_ASSERT(this_cpu_has(X86_FEATURE_HYPERVISOR) ||
+ is_lru_gen_enabled(),
"vCPU%d: Too many pages still idle (%lu out of %lu)",
vcpu_idx, still_idle, pages);
#endif
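For reference, a minimal sketch of what the assumed get_sysfs_value_integer() helper
could look like; the name comes from the diff above, but the helper doesn't exist in
the selftests library yet, so the error handling and parsing here are illustrative:

/*
 * Hypothetical helper assumed by the diff above; not (yet) part of the
 * selftests library.  /sys/kernel/mm/lru_gen/enabled prints a hex
 * bitmask (e.g. 0x0007), hence strtol() with base 0.
 */
static long get_sysfs_value_integer(const char *path)
{
	char buf[64];
	ssize_t r;
	int fd;

	fd = open(path, O_RDONLY);
	TEST_ASSERT(fd >= 0, "Failed to open '%s'", path);

	r = read(fd, buf, sizeof(buf) - 1);
	TEST_ASSERT(r > 0, "Failed to read '%s'", path);
	buf[r] = '\0';
	close(fd);

	return strtol(buf, NULL, 0);
}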
> - The machine needs to have more than one NUMA node, because NUMA balancing
>   (enabled by default) apparently tries to write-protect the primary PTEs
>   of (all?) processes every few seconds, and that causes KVM to flush the
>   secondary PTEs (at least with the new TDP MMU):
>
> access_tracking-3448 [091] ....1.. 1380.244666: handle_changed_spte <-tdp_mmu_set_spte
> access_tracking-3448 [091] ....1.. 1380.244667: <stack trace>
> => cdc_driver_init
> => handle_changed_spte
> => tdp_mmu_set_spte
> => tdp_mmu_zap_leafs
> => kvm_tdp_mmu_unmap_gfn_range
> => kvm_unmap_gfn_range
> => kvm_mmu_notifier_invalidate_range_start
> => __mmu_notifier_invalidate_range_start
> => change_p4d_range
> => change_protection
> => change_prot_numa
> => task_numa_work
> => task_work_run
> => exit_to_user_mode_prepare
> => syscall_exit_to_user_mode
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
>
> Whether NUMA balancing should do this, or whether NUMA balancing should be
> enabled by default, is a separate question,
FWIW, IMO, enabling NUMA balancing on a system whose primary purpose is to run VMs
is a bad idea. NUMA balancing operates under the assumption that a !PRESENT #PF is
relatively cheap. When secondary MMUs are involved, that is simply not the case,
e.g. to honor the mmu_notifier event, KVM zaps _and_ does a remote TLB flush. Even
if we reworked KVM and/or the mmu_notifiers so that KVM didn't need to do such a
heavy operation, the cost of a page-fault VM-Exit is significantly higher than the
cost of a host #PF.
> because there are other reasons that can force KVM to invalidate the
> secondary mappings and trigger this issue.
Ya.