public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: pbonzini@redhat.com, rick.p.edgecombe@intel.com,
	 linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO in mmu_stress_test
Date: Thu, 27 Feb 2025 14:18:02 -0800	[thread overview]
Message-ID: <Z8Dkmu_57EmWUdk5@google.com> (raw)
In-Reply-To: <Z7/8EOKH5Z1iShQB@yzhao56-desk.sh.intel.com>

On Thu, Feb 27, 2025, Yan Zhao wrote:
> On Wed, Feb 26, 2025 at 11:30:15AM -0800, Sean Christopherson wrote:
> > On Wed, Feb 26, 2025, Yan Zhao wrote:
> > > On Tue, Feb 25, 2025 at 05:48:39PM -0800, Sean Christopherson wrote:
> > > > On Sat, Feb 08, 2025, Yan Zhao wrote:
> > > > > The test then fails and reports "Unhandled exception '0xe' at guest RIP
> > > > > '0x402638'", since the next valid guest rip address is 0x402639, i.e. the
> > > > > "(mem) = val" in vcpu_arch_put_guest() is compiled into a mov instruction
> > > > > of length 4.
> > > > 
> > > > This shouldn't happen.  On x86, stage 3 is a hand-coded "mov %rax, (%rax)", not
> > > > vcpu_arch_put_guest().  Either something else is going on, or __x86_64__ isn't
> > > > defined?
> > > stage 3 is hand-coded "mov %rax, (%rax)", but stage 4 is with
> > > vcpu_arch_put_guest().
> > > 
> > > The original code expects that "mov %rax, (%rax)" in stage 3 can produce
> > > -EFAULT, so that in the host thread can jump out of stage 3's 1st vcpu_run()
> > > loop.
> > 
> > Ugh, I forgot that there are two loops in stage-3.  I tried to prevent this race,
> > but violated my own rule of not using arbitrary delays to avoid races.
> > 
> > Completely untested, but I think this should address the problem (I'll test
> > later today; you already did the hard work of debugging).  The only thing I'm
> > not positive is correct is making the first _vcpu_run() a one-off instead of a
> > loop.
> Right, making the first _vcpu_run() a one-off could produce below error:
> "Expected EFAULT on write to RO memory, got r = 0, errno = 4".

/facepalm

There are multiple vCPU, using a single flag isn't sufficient.  I also remembered
(well, re-discovered) why I added the weird looping on "!":

	do {                                                                    
		r = _vcpu_run(vcpu);                                            
	} while (!r);

On x86, with forced emulation, the vcpu_arch_put_guest() path hits an MMIO exit
due to a longstanding (like, forever longstanding) bug in KVM's emulator.  Given
that the vcpu_arch_put_guest() path is only reachable by disabling the x86 specific
code (which I did for testing those paths), and that the bug only manifests on x86,
I think it makes sense to drop that code as it's super confusing, gets in the way,
and is unreachable unless the user is going way out of their way to hit it.

I still haven't reproduced the failure without "help", but I was able to force
failure by doing a single write and dropping the mprotect_ro_done check:

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index a1f3f6d83134..3524dcc0dfcf 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -50,15 +50,15 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         */
        GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
        do {
-               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+               // for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
+                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(end_gpa - stride) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
                        vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-       } while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
+       } while (!READ_ONCE(all_vcpus_hit_ro_fault));
 
        /*
         * Only architectures that write the entire range can explicitly sync,

The below makes everything happy, can you verify the fix on your end?

---
 tools/testing/selftests/kvm/mmu_stress_test.c | 22 ++++++++++++-------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index d9c76b4c0d88..a1f3f6d83134 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -18,6 +18,7 @@
 #include "ucall_common.h"
 
 static bool mprotect_ro_done;
+static bool all_vcpus_hit_ro_fault;
 
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
@@ -36,9 +37,9 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 
 	/*
 	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
-	 * looping until the memory is guaranteed to be read-only, otherwise
-	 * vCPUs may complete their writes and advance to the next stage
-	 * prematurely.
+	 * looping until the memory is guaranteed to be read-only and a fault
+	 * has occured, otherwise vCPUs may complete their writes and advance
+	 * to the next stage prematurely.
 	 *
 	 * For architectures that support skipping the faulting instruction,
 	 * generate the store via inline assembly to ensure the exact length
@@ -47,6 +48,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 	 * is low in this case).  For x86, hand-code the exact opcode so that
 	 * there is no room for variability in the generated instruction.
 	 */
+	GUEST_ASSERT(!READ_ONCE(all_vcpus_hit_ro_fault));
 	do {
 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
@@ -56,7 +58,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 #else
 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 #endif
-	} while (!READ_ONCE(mprotect_ro_done));
+	} while (!READ_ONCE(mprotect_ro_done) && !READ_ONCE(all_vcpus_hit_ro_fault));
 
 	/*
 	 * Only architectures that write the entire range can explicitly sync,
@@ -81,6 +83,7 @@ struct vcpu_info {
 
 static int nr_vcpus;
 static atomic_t rendezvous;
+static atomic_t nr_ro_faults;
 
 static void rendezvous_with_boss(void)
 {
@@ -148,12 +151,16 @@ static void *vcpu_worker(void *data)
 	 * be stuck on the faulting instruction for other architectures.  Go to
 	 * stage 3 without a rendezvous
 	 */
-	do {
-		r = _vcpu_run(vcpu);
-	} while (!r);
+	r = _vcpu_run(vcpu);
 	TEST_ASSERT(r == -1 && errno == EFAULT,
 		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
 
+	atomic_inc(&nr_ro_faults);
+	if (atomic_read(&nr_ro_faults) == nr_vcpus) {
+		WRITE_ONCE(all_vcpus_hit_ro_fault, true);
+		sync_global_to_guest(vm, all_vcpus_hit_ro_fault);
+	}
+
 #if defined(__x86_64__) || defined(__aarch64__)
 	/*
 	 * Verify *all* writes from the guest hit EFAULT due to the VMA now
@@ -378,7 +385,6 @@ int main(int argc, char *argv[])
 	rendezvous_with_vcpus(&time_run2, "run 2");
 
 	mprotect(mem, slot_size, PROT_READ);
-	usleep(10);
 	mprotect_ro_done = true;
 	sync_global_to_guest(vm, mprotect_ro_done);
 

base-commit: 557953f8b75fce49dc65f9b0f7e811c060fc7860
-- 

  reply	other threads:[~2025-02-27 22:18 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-08 10:53 [PATCH] KVM: selftests: Wait mprotect_ro_done before write to RO in mmu_stress_test Yan Zhao
2025-02-11  1:03 ` Edgecombe, Rick P
2025-02-11  1:42   ` Yan Zhao
2025-02-11 18:29     ` Edgecombe, Rick P
2025-02-12  6:59       ` Yan Zhao
2025-02-26  1:53       ` Sean Christopherson
2025-02-26  2:07         ` Yan Zhao
2025-02-26  2:20           ` Sean Christopherson
2025-02-26  3:09             ` Yan Zhao
2025-02-26  1:48 ` Sean Christopherson
2025-02-26  3:08   ` Yan Zhao
2025-02-26 19:30     ` Sean Christopherson
2025-02-27  5:45       ` Yan Zhao
2025-02-27 22:18         ` Sean Christopherson [this message]
2025-02-28 10:55           ` Yan Zhao
2025-02-28 14:00             ` Sean Christopherson
2025-03-03  3:14               ` Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8Dkmu_57EmWUdk5@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox