[PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap
@ 2024-08-09 19:43 Sean Christopherson
  2024-08-09 19:43 ` [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded Sean Christopherson
                   ` (22 more replies)
  0 siblings, 23 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

The main intent of this series is to allow yielding, i.e. cond_resched(),
when unmapping memory in shadow MMUs in response to an mmu_notifier
invalidation.  There is zero reason not to yield, and in fact I _thought_
KVM did yield, but because of how KVM grew over the years, the unmap path
got left behind.

The first half of the series is reworks max_guest_memory_test into
mmu_stress_test, to give some confidence in the mmu_notifier-related
changes.

Oliver and Marc, there's on patch lurking in here to enable said test on
arm64.  It's as well tested as I can make it (and that took much longer
than anticipated because arm64 hit races in the test that x86 doesn't
for whatever reason).

The middle of the series reworks x86's shadow MMU logic to use the
zap flow that can yield.

The last third or so is a wee bit adventurous, and is kinda of an RFC, but
well tested.  It's essentially prep/post work for James' MGLRU, and allows
aging SPTEs in x86's shadow MMU to run outside of mmu_lock, e.g. so that
nested TDP (stage-2) MMUs can participate in MGLRU.

If everything checks out, my goal is to land the selftests and yielding
changes in 6.12.  The aging stuff is incomplete and meaningless without
James' MGLRU, I'm posting it here purely so that folks can see the end
state when the mmu_notifier invalidation paths also moves to a different
API.

James, the aging stuff is quite well tested (see below).  Can you try
working into/on-top of your MGLRU series?  And if you're feeling very
kind, hammer it a bit more? :-)  I haven't looked at the latest ideas
and/or discussion on the MGLRU series, but I'm hoping that being able to
support the shadow MMU (absent the stupid eptad=0 case) in MGLRU will
allow for few shenanigans, e.g. no need to toggle flags during runtime.

As for testing, I spun up a VM and ran a compilation loop and `stress` in
the VM, while simultaneously running a small userspace program to age the
VM's memory (also in an infinite loop), using the same basic methodology as
access_tracking_perf_test.c (I put almost all of guest memory into a
memfd and then aged only that range of memory).

I confirmed that the locking does work, e.g. that there was (infrequent)
contention, and am fairly confident that the idea pans out.  E.g. I hit
the BUG_ON(!is_shadow_present_pte()) using that setup, which is the only
reason those patches exist :-)

Sean Christopherson (22):
  KVM: selftests: Check for a potential unhandled exception iff KVM_RUN
    succeeded
  KVM: selftests: Rename max_guest_memory_test to mmu_stress_test
  KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test
  KVM: selftests: Compute number of extra pages needed in
    mmu_stress_test
  KVM: selftests: Enable mmu_stress_test on arm64
  KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test
  KVM: selftests: Precisely limit the number of guest loops in
    mmu_stress_test
  KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test
  KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
  KVM: x86/mmu: Move walk_slot_rmaps() up near
    for_each_slot_rmap_range()
  KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
  KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
  KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is
    allowed
  KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific
    helper
  KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
  KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul'
    literals
  KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o
    mmu_lock
  KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded
    equivalent
  KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of
    mmu_lock
  KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging
    gfns
  ***HACK*** KVM: x86: Don't take mmu_lock when aging gfns

 arch/x86/kvm/mmu/mmu.c                        | 527 +++++++++++-------
 arch/x86/kvm/svm/svm.c                        |   2 +
 arch/x86/kvm/vmx/vmx.c                        |   2 +
 tools/testing/selftests/kvm/Makefile          |   3 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    |   3 +-
 ..._guest_memory_test.c => mmu_stress_test.c} | 144 ++++-
 virt/kvm/kvm_main.c                           |   7 +-
 7 files changed, 482 insertions(+), 206 deletions(-)
 rename tools/testing/selftests/kvm/{max_guest_memory_test.c => mmu_stress_test.c} (65%)

base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
-- 
2.46.0.76.ge559c4bf1a-goog

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 02/22] KVM: selftests: Rename max_guest_memory_test to mmu_stress_test Sean Christopherson
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Don't check for an unhandled exception if KVM_RUN failed, e.g. if it
returned errno=EFAULT, as reporting unhandled exceptions is done via a
ucall, i.e. requires KVM_RUN to exit cleanly.  Theoretically, checking
for a ucall on a failed KVM_RUN could get a false positive, e.g. if there
were stale data in vcpu->run from a previous exit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 56b170b725b3..0e25011d9b51 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1719,7 +1719,8 @@ int _vcpu_run(struct kvm_vcpu *vcpu)
 		rc = __vcpu_run(vcpu);
 	} while (rc == -1 && errno == EINTR);
 
-	assert_on_unhandled_exception(vcpu);
+	if (!rc)
+		assert_on_unhandled_exception(vcpu);
 
 	return rc;
 }
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/22] KVM: selftests: Rename max_guest_memory_test to mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
  2024-08-09 19:43 ` [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 03/22] KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test Sean Christopherson
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Rename max_guest_memory_test to mmu_stress_test so that the name isn't
horribly misleading when future changes extend the test to verify things
like mprotect() interactions, and because the test is useful even when its
configured to populate far less than the maximum amount of guest memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile                            | 2 +-
 .../kvm/{max_guest_memory_test.c => mmu_stress_test.c}          | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename tools/testing/selftests/kvm/{max_guest_memory_test.c => mmu_stress_test.c} (100%)

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b084ba2262a0..cc25f19b62da 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -138,7 +138,7 @@ TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
-TEST_GEN_PROGS_x86_64 += max_guest_memory_test
+TEST_GEN_PROGS_x86_64 += mmu_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86_64 += memslot_perf_test
 TEST_GEN_PROGS_x86_64 += rseq_test
diff --git a/tools/testing/selftests/kvm/max_guest_memory_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
similarity index 100%
rename from tools/testing/selftests/kvm/max_guest_memory_test.c
rename to tools/testing/selftests/kvm/mmu_stress_test.c
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/22] KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
  2024-08-09 19:43 ` [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded Sean Christopherson
  2024-08-09 19:43 ` [PATCH 02/22] KVM: selftests: Rename max_guest_memory_test to mmu_stress_test Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 04/22] KVM: selftests: Compute number of extra pages needed " Sean Christopherson
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Try to get/set SREGS in mmu_stress_test only when running on x86, as the
ioctls are supported only by x86 and PPC, and the latter doesn't yet
support KVM selftests.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 0b9678858b6d..847da23ec1b1 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -59,10 +59,10 @@ static void run_vcpu(struct kvm_vcpu *vcpu)
 
 static void *vcpu_worker(void *data)
 {
+	struct kvm_sregs __maybe_unused sregs;
 	struct vcpu_info *info = data;
 	struct kvm_vcpu *vcpu = info->vcpu;
 	struct kvm_vm *vm = vcpu->vm;
-	struct kvm_sregs sregs;
 
 	vcpu_args_set(vcpu, 3, info->start_gpa, info->end_gpa, vm->page_size);
 
@@ -70,12 +70,12 @@ static void *vcpu_worker(void *data)
 
 	run_vcpu(vcpu);
 	rendezvous_with_boss();
+#ifdef __x86_64__
 	vcpu_sregs_get(vcpu, &sregs);
-#ifdef __x86_64__
 	/* Toggle CR0.WP to trigger a MMU context reset. */
 	sregs.cr0 ^= X86_CR0_WP;
-#endif
 	vcpu_sregs_set(vcpu, &sregs);
+#endif
 	rendezvous_with_boss();
 
 	run_vcpu(vcpu);
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/22] KVM: selftests: Compute number of extra pages needed in mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (2 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 03/22] KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-09-06  0:03   ` James Houghton
  2024-08-09 19:43 ` [PATCH 05/22] KVM: selftests: Enable mmu_stress_test on arm64 Sean Christopherson
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Create mmu_stress_tests's VM with the correct number of extra pages needed
to map all of memory in the guest.  The bug hasn't been noticed before as
the test currently runs only on x86, which maps guest memory with 1GiB
pages, i.e. doesn't need much memory in the guest for page tables.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 847da23ec1b1..5467b12f5903 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -209,7 +209,13 @@ int main(int argc, char *argv[])
 	vcpus = malloc(nr_vcpus * sizeof(*vcpus));
 	TEST_ASSERT(vcpus, "Failed to allocate vCPU array");
 
-	vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus);
+	vm = __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus,
+#ifdef __x86_64__
+				    max_mem / SZ_1G,
+#else
+				    max_mem / vm_guest_mode_params[VM_MODE_DEFAULT].page_size,
+#endif
+				    guest_code, vcpus);
 
 	max_gpa = vm->max_gfn << vm->page_shift;
 	TEST_ASSERT(max_gpa > (4 * slot_size), "MAXPHYADDR <4gb ");
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/22] KVM: selftests: Enable mmu_stress_test on arm64
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (3 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 04/22] KVM: selftests: Compute number of extra pages needed " Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 06/22] KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test Sean Christopherson
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Enable the mmu_stress_test on arm64.  The intent was to enable the test
across all architectures when it was first added, but a few goofs made it
unrunnable on !x86.  Now that those goofs are fixed, at least for arm64,
enable the test.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index cc25f19b62da..4198df29503f 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -174,6 +174,7 @@ TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_aarch64 += kvm_page_table_test
 TEST_GEN_PROGS_aarch64 += memslot_modification_stress_test
 TEST_GEN_PROGS_aarch64 += memslot_perf_test
+TEST_GEN_PROGS_aarch64 += mmu_stress_test
 TEST_GEN_PROGS_aarch64 += rseq_test
 TEST_GEN_PROGS_aarch64 += set_memory_region_test
 TEST_GEN_PROGS_aarch64 += steal_time
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/22] KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (4 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 05/22] KVM: selftests: Enable mmu_stress_test on arm64 Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 07/22] KVM: selftests: Precisely limit the number of guest loops " Sean Christopherson
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Use vcpu_arch_put_guest() to write memory from the guest in
mmu_stress_test as an easy way to provide a bit of extra coverage.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 5467b12f5903..80863e8290db 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -22,7 +22,7 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 
 	for (;;) {
 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
-			*((volatile uint64_t *)gpa) = gpa;
+			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
 		GUEST_SYNC(0);
 	}
 }
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/22] KVM: selftests: Precisely limit the number of guest loops in mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (5 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 06/22] KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 08/22] KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test Sean Christopherson
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Run the exact number of guest loops required in mmu_stress_test instead
of looping indefinitely in anticipation of adding more stages that run
different code (e.g. reads instead of writes).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 25 ++++++++++++++-----
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 80863e8290db..9573ed0e696d 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -19,12 +19,15 @@
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
 	uint64_t gpa;
+	int i;
 
-	for (;;) {
+	for (i = 0; i < 2; i++) {
 		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
-		GUEST_SYNC(0);
+		GUEST_SYNC(i);
 	}
+
+	GUEST_ASSERT(0);
 }
 
 struct vcpu_info {
@@ -51,10 +54,18 @@ static void rendezvous_with_boss(void)
 	}
 }
 
-static void run_vcpu(struct kvm_vcpu *vcpu)
+static void assert_sync_stage(struct kvm_vcpu *vcpu, int stage)
+{
+	struct ucall uc;
+
+	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
+	TEST_ASSERT_EQ(uc.args[1], stage);
+}
+
+static void run_vcpu(struct kvm_vcpu *vcpu, int stage)
 {
 	vcpu_run(vcpu);
-	TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_SYNC);
+	assert_sync_stage(vcpu, stage);
 }
 
 static void *vcpu_worker(void *data)
@@ -68,7 +79,8 @@ static void *vcpu_worker(void *data)
 
 	rendezvous_with_boss();
 
-	run_vcpu(vcpu);
+	/* Stage 0, write all of guest memory. */
+	run_vcpu(vcpu, 0);
 	rendezvous_with_boss();
 #ifdef __x86_64__
 	vcpu_sregs_get(vcpu, &sregs);
@@ -78,7 +90,8 @@ static void *vcpu_worker(void *data)
 #endif
 	rendezvous_with_boss();
 
-	run_vcpu(vcpu);
+	/* Stage 1, re-write all of guest memory. */
+	run_vcpu(vcpu, 1);
 	rendezvous_with_boss();
 
 	return NULL;
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/22] KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (6 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 07/22] KVM: selftests: Precisely limit the number of guest loops " Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ) Sean Christopherson
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Add a third phase of mmu_stress_test to verify that mprotect()ing guest
memory to make it read-only doesn't cause explosions, e.g. to verify KVM
correctly handles the resulting mmu_notifier invalidations.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 22 +++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 9573ed0e696d..50c3a17418c4 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -27,6 +27,10 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 		GUEST_SYNC(i);
 	}
 
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+		*((volatile uint64_t *)gpa);
+	GUEST_SYNC(2);
+
 	GUEST_ASSERT(0);
 }
 
@@ -94,6 +98,10 @@ static void *vcpu_worker(void *data)
 	run_vcpu(vcpu, 1);
 	rendezvous_with_boss();
 
+	/* Stage 2, read all of guest memory, which is now read-only. */
+	run_vcpu(vcpu, 2);
+	rendezvous_with_boss();
+
 	return NULL;
 }
 
@@ -174,7 +182,7 @@ int main(int argc, char *argv[])
 	const uint64_t start_gpa = SZ_4G;
 	const int first_slot = 1;
 
-	struct timespec time_start, time_run1, time_reset, time_run2;
+	struct timespec time_start, time_run1, time_reset, time_run2, time_ro;
 	uint64_t max_gpa, gpa, slot_size, max_mem, i;
 	int max_slots, slot, opt, fd;
 	bool hugepages = false;
@@ -278,14 +286,20 @@ int main(int argc, char *argv[])
 	rendezvous_with_vcpus(&time_reset, "reset");
 	rendezvous_with_vcpus(&time_run2, "run 2");
 
+	mprotect(mem, slot_size, PROT_READ);
+	rendezvous_with_vcpus(&time_ro, "mprotect RO");
+
+	time_ro    = timespec_sub(time_ro,     time_run2);
 	time_run2  = timespec_sub(time_run2,   time_reset);
-	time_reset = timespec_sub(time_reset, time_run1);
+	time_reset = timespec_sub(time_reset,  time_run1);
 	time_run1  = timespec_sub(time_run1,   time_start);
 
-	pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 =  %ld.%.9lds\n",
+	pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 = %ld.%.9lds, "
+		"ro = %ld.%.9lds\n",
 		time_run1.tv_sec, time_run1.tv_nsec,
 		time_reset.tv_sec, time_reset.tv_nsec,
-		time_run2.tv_sec, time_run2.tv_nsec);
+		time_run2.tv_sec, time_run2.tv_nsec,
+		time_ro.tv_sec, time_ro.tv_nsec);
 
 	/*
 	 * Delete even numbered slots (arbitrary) and unmap the first half of
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (7 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 08/22] KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
       [not found]   ` <CADrL8HXcD--jn1iLeCJycCd3Btv4_rBPxz6NMnTREXfeh0vRZA@mail.gmail.com>
  2024-08-09 19:43 ` [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() Sean Christopherson
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Add two phases to mmu_stress_test to verify that KVM correctly handles
guest memory that was writable, and then made read-only in the primary MMU,
and then made writable again.

Add bonus coverage for x86 to verify that all of guest memory was marked
read-only.  Making forward progress (without making memory writable)
requires arch specific code to skip over the faulting instruction, but the
test can at least verify each vCPU's starting page was made read-only.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
 1 file changed, 84 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 50c3a17418c4..98f9a4660269 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -16,6 +16,8 @@
 #include "guest_modes.h"
 #include "processor.h"
 
+static bool mprotect_ro_done;
+
 static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 {
 	uint64_t gpa;
@@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
 		*((volatile uint64_t *)gpa);
 	GUEST_SYNC(2);
 
+	/*
+	 * Write to the region while mprotect(PROT_READ) is underway.  Keep
+	 * looping until the memory is guaranteed to be read-only, otherwise
+	 * vCPUs may complete their writes and advance to the next stage
+	 * prematurely.
+	 */
+	do {
+		for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+#ifdef __x86_64__
+			asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
+#else
+			vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+#endif
+	} while (!READ_ONCE(mprotect_ro_done));
+
+	/*
+	 * Only x86 can explicitly sync, as other architectures will be stuck
+	 * on the write fault.
+	 */
+#ifdef __x86_64__
+	GUEST_SYNC(3);
+#endif
+
+	for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
+		vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
+	GUEST_SYNC(4);
+
 	GUEST_ASSERT(0);
 }
 
@@ -78,6 +107,7 @@ static void *vcpu_worker(void *data)
 	struct vcpu_info *info = data;
 	struct kvm_vcpu *vcpu = info->vcpu;
 	struct kvm_vm *vm = vcpu->vm;
+	int r;
 
 	vcpu_args_set(vcpu, 3, info->start_gpa, info->end_gpa, vm->page_size);
 
@@ -100,6 +130,49 @@ static void *vcpu_worker(void *data)
 
 	/* Stage 2, read all of guest memory, which is now read-only. */
 	run_vcpu(vcpu, 2);
+
+	/*
+	 * Stage 3, write guest memory and verify KVM returns -EFAULT for once
+	 * the mprotect(PROT_READ) lands.  Only architectures that support
+	 * validating *all* of guest memory sync for this stage, as vCPUs will
+	 * be stuck on the faulting instruction for other architectures.  Go to
+	 * stage 3 without a rendezvous
+	 */
+	do {
+		r = _vcpu_run(vcpu);
+	} while (!r);
+	TEST_ASSERT(r == -1 && errno == EFAULT,
+		    "Expected EFAULT on write to RO memory, got r = %d, errno = %d", r, errno);
+
+#ifdef __x86_64__
+	/*
+	 * Verify *all* writes from the guest hit EFAULT due to the VMA now
+	 * being read-only.  x86-only at this time as skipping the instruction
+	 * that hits the EFAULT requires advancing the program counter, which
+	 * is arch specific and currently relies on hand-coded assembly.
+	 */
+	vcpu->run->kvm_valid_regs = KVM_SYNC_X86_REGS;
+	for (;;) {
+		r = _vcpu_run(vcpu);
+		if (!r)
+			break;
+		TEST_ASSERT_EQ(errno, EFAULT);
+		WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
+		vcpu->run->s.regs.regs.rip += 4;
+	}
+	assert_sync_stage(vcpu, 3);
+#endif
+	rendezvous_with_boss();
+
+	/*
+	 * Stage 4.  Run to completion, waiting for mprotect(PROT_WRITE) to
+	 * make the memory writable again.
+	 */
+	do {
+		r = _vcpu_run(vcpu);
+	} while (r && errno == EFAULT);
+	TEST_ASSERT_EQ(r, 0);
+	assert_sync_stage(vcpu, 4);
 	rendezvous_with_boss();
 
 	return NULL;
@@ -182,7 +255,7 @@ int main(int argc, char *argv[])
 	const uint64_t start_gpa = SZ_4G;
 	const int first_slot = 1;
 
-	struct timespec time_start, time_run1, time_reset, time_run2, time_ro;
+	struct timespec time_start, time_run1, time_reset, time_run2, time_ro, time_rw;
 	uint64_t max_gpa, gpa, slot_size, max_mem, i;
 	int max_slots, slot, opt, fd;
 	bool hugepages = false;
@@ -287,19 +360,27 @@ int main(int argc, char *argv[])
 	rendezvous_with_vcpus(&time_run2, "run 2");
 
 	mprotect(mem, slot_size, PROT_READ);
+	usleep(10);
+	mprotect_ro_done = true;
+	sync_global_to_guest(vm, mprotect_ro_done);
+
 	rendezvous_with_vcpus(&time_ro, "mprotect RO");
+	mprotect(mem, slot_size, PROT_READ | PROT_WRITE);
+	rendezvous_with_vcpus(&time_rw, "mprotect RW");
 
+	time_rw    = timespec_sub(time_rw,     time_ro);
 	time_ro    = timespec_sub(time_ro,     time_run2);
 	time_run2  = timespec_sub(time_run2,   time_reset);
 	time_reset = timespec_sub(time_reset,  time_run1);
 	time_run1  = timespec_sub(time_run1,   time_start);
 
 	pr_info("run1 = %ld.%.9lds, reset = %ld.%.9lds, run2 = %ld.%.9lds, "
-		"ro = %ld.%.9lds\n",
+		"ro = %ld.%.9lds, rw = %ld.%.9lds\n",
 		time_run1.tv_sec, time_run1.tv_nsec,
 		time_reset.tv_sec, time_reset.tv_nsec,
 		time_run2.tv_sec, time_run2.tv_nsec,
-		time_ro.tv_sec, time_ro.tv_nsec);
+		time_ro.tv_sec, time_ro.tv_nsec,
+		time_rw.tv_sec, time_rw.tv_nsec);
 
 	/*
 	 * Delete even numbered slots (arbitrary) and unmap the first half of
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (8 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ) Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-28 18:20   ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() Sean Christopherson
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so
that the walkers can be used to handle mmu_notifier invalidations, and so
that similar function has some amount of locality in code.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 107 +++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..676cb7dfcbf9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1534,6 +1534,60 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 	     slot_rmap_walk_okay(_iter_);				\
 	     slot_rmap_walk_next(_iter_))
 
+
+/* The return value indicates if tlb flush on all vcpus is needed. */
+typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
+				    struct kvm_rmap_head *rmap_head,
+				    const struct kvm_memory_slot *slot);
+
+static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
+					      const struct kvm_memory_slot *slot,
+					      slot_rmaps_handler fn,
+					      int start_level, int end_level,
+					      gfn_t start_gfn, gfn_t end_gfn,
+					      bool flush_on_yield, bool flush)
+{
+	struct slot_rmap_walk_iterator iterator;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
+			end_gfn, &iterator) {
+		if (iterator.rmap)
+			flush |= fn(kvm, iterator.rmap, slot);
+
+		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+			if (flush && flush_on_yield) {
+				kvm_flush_remote_tlbs_range(kvm, start_gfn,
+							    iterator.gfn - start_gfn + 1);
+				flush = false;
+			}
+			cond_resched_rwlock_write(&kvm->mmu_lock);
+		}
+	}
+
+	return flush;
+}
+
+static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
+					    const struct kvm_memory_slot *slot,
+					    slot_rmaps_handler fn,
+					    int start_level, int end_level,
+					    bool flush_on_yield)
+{
+	return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
+				 slot->base_gfn, slot->base_gfn + slot->npages - 1,
+				 flush_on_yield, false);
+}
+
+static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
+					       const struct kvm_memory_slot *slot,
+					       slot_rmaps_handler fn,
+					       bool flush_on_yield)
+{
+	return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
+}
+
 typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			       struct kvm_memory_slot *slot, gfn_t gfn,
 			       int level);
@@ -6199,59 +6253,6 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 }
 EXPORT_SYMBOL_GPL(kvm_configure_mmu);
 
-/* The return value indicates if tlb flush on all vcpus is needed. */
-typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
-				    struct kvm_rmap_head *rmap_head,
-				    const struct kvm_memory_slot *slot);
-
-static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
-					      const struct kvm_memory_slot *slot,
-					      slot_rmaps_handler fn,
-					      int start_level, int end_level,
-					      gfn_t start_gfn, gfn_t end_gfn,
-					      bool flush_on_yield, bool flush)
-{
-	struct slot_rmap_walk_iterator iterator;
-
-	lockdep_assert_held_write(&kvm->mmu_lock);
-
-	for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
-			end_gfn, &iterator) {
-		if (iterator.rmap)
-			flush |= fn(kvm, iterator.rmap, slot);
-
-		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
-			if (flush && flush_on_yield) {
-				kvm_flush_remote_tlbs_range(kvm, start_gfn,
-							    iterator.gfn - start_gfn + 1);
-				flush = false;
-			}
-			cond_resched_rwlock_write(&kvm->mmu_lock);
-		}
-	}
-
-	return flush;
-}
-
-static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
-					    const struct kvm_memory_slot *slot,
-					    slot_rmaps_handler fn,
-					    int start_level, int end_level,
-					    bool flush_on_yield)
-{
-	return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
-				 slot->base_gfn, slot->base_gfn + slot->npages - 1,
-				 flush_on_yield, false);
-}
-
-static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
-					       const struct kvm_memory_slot *slot,
-					       slot_rmaps_handler fn,
-					       bool flush_on_yield)
-{
-	return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
-}
-
 static void free_mmu_pages(struct kvm_mmu *mmu)
 {
 	if (!tdp_enabled && mmu->pae_root)
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (9 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot Sean Christopherson
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Add a @can_yield param to __walk_slot_rmaps() to control whether or not
dropping mmu_lock and conditionally rescheduling is allowed.  This will
allow using __walk_slot_rmaps() and thus cond_resched() to handle
mmu_notifier invalidations, which usually allow blocking/yielding, but not
when invoked by the OOM killer.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 676cb7dfcbf9..a5a7e476f5bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1545,7 +1545,8 @@ static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
 					      slot_rmaps_handler fn,
 					      int start_level, int end_level,
 					      gfn_t start_gfn, gfn_t end_gfn,
-					      bool flush_on_yield, bool flush)
+					      bool can_yield, bool flush_on_yield,
+					      bool flush)
 {
 	struct slot_rmap_walk_iterator iterator;
 
@@ -1556,6 +1557,9 @@ static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
 		if (iterator.rmap)
 			flush |= fn(kvm, iterator.rmap, slot);
 
+		if (!can_yield)
+			continue;
+
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
 			if (flush && flush_on_yield) {
 				kvm_flush_remote_tlbs_range(kvm, start_gfn,
@@ -1577,7 +1581,7 @@ static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
 {
 	return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
 				 slot->base_gfn, slot->base_gfn + slot->npages - 1,
-				 flush_on_yield, false);
+				 true, flush_on_yield, false);
 }
 
 static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
@@ -6528,7 +6532,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
 
 			flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
 						  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
-						  start, end - 1, true, flush);
+						  start, end - 1, true, true, flush);
 		}
 	}
 
@@ -6816,7 +6820,7 @@ static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
 	 */
 	for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
 		__walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
-				  level, level, start, end - 1, true, false);
+				  level, level, start, end - 1, true, true, false);
 }
 
 /* Must be called with the mmu_lock held in write-mode. */
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (10 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed Sean Christopherson
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Add a dedicated helper to walk and zap rmaps for a given memslot so that
the code can be shared between KVM-initiated zaps and mmu_notifier
invalidations.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5a7e476f5bb..4016f63d03e8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1592,6 +1592,16 @@ static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
 	return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
 }
 
+static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
+				     const struct kvm_memory_slot *slot,
+				     gfn_t start, gfn_t end, bool can_yield,
+				     bool flush)
+{
+	return __walk_slot_rmaps(kvm, slot, __kvm_zap_rmap,
+				 PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+				 start, end - 1, can_yield, true, flush);
+}
+
 typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
 			       struct kvm_memory_slot *slot, gfn_t gfn,
 			       int level);
@@ -6530,9 +6540,8 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
 			if (WARN_ON_ONCE(start >= end))
 				continue;
 
-			flush = __walk_slot_rmaps(kvm, memslot, __kvm_zap_rmap,
-						  PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
-						  start, end - 1, true, true, flush);
+			flush = __kvm_rmap_zap_gfn_range(kvm, memslot, start,
+							 end, true, flush);
 		}
 	}
 
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (11 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper Sean Christopherson
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Convert kvm_unmap_gfn_range(), which is the helper that zaps rmap SPTEs in
response to an mmu_notifier invalidation, to use __kvm_rmap_zap_gfn_range()
and feed in range->may_block.  In other words, honor NEED_RESCHED by way of
cond_resched() when zapping rmaps.  This fixes a long-standing issue where
KVM could process an absurd number of rmap entries without ever yielding,
e.g. if an mmu_notifier fired on a PUD (or larger) range.

Opportunistically rename __kvm_zap_rmap() to kvm_zap_rmap(), and drop the
old kvm_zap_rmap().  Ideally, the shuffling would be done in a different
patch, but that just makes the compiler unhappy, e.g.

  arch/x86/kvm/mmu/mmu.c:1462:13: error: ‘kvm_zap_rmap’ defined but not used

Reported-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4016f63d03e8..0a33857d668a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1453,16 +1453,10 @@ static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
 	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
 }
 
-static bool __kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			   const struct kvm_memory_slot *slot)
-{
-	return kvm_zap_all_rmap_sptes(kvm, rmap_head);
-}
-
 static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			 struct kvm_memory_slot *slot, gfn_t gfn, int level)
+			 const struct kvm_memory_slot *slot)
 {
-	return __kvm_zap_rmap(kvm, rmap_head, slot);
+	return kvm_zap_all_rmap_sptes(kvm, rmap_head);
 }
 
 struct slot_rmap_walk_iterator {
@@ -1597,7 +1591,7 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				     gfn_t start, gfn_t end, bool can_yield,
 				     bool flush)
 {
-	return __walk_slot_rmaps(kvm, slot, __kvm_zap_rmap,
+	return __walk_slot_rmaps(kvm, slot, kvm_zap_rmap,
 				 PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
 				 start, end - 1, can_yield, true, flush);
 }
@@ -1626,7 +1620,9 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	bool flush = false;
 
 	if (kvm_memslots_have_rmaps(kvm))
-		flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
+		flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
+						 range->start, range->end,
+						 range->may_block, flush);
 
 	if (tdp_mmu_enabled)
 		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (12 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-12 21:53   ` David Matlack
  2024-08-09 19:43 ` [PATCH 15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() Sean Christopherson
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Rework kvm_handle_gfn_range() into an aging-specic helper,
kvm_rmap_age_gfn_range().  In addition to purging a bunch of unnecessary
boilerplate code, this sets the stage for aging rmap SPTEs outside of
mmu_lock.

Note, there's a small functional change, as kvm_test_age_gfn() will now
return immediately if a young SPTE is found, whereas previously KVM would
continue iterating over other levels.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 68 ++++++++++++++----------------------------
 1 file changed, 22 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0a33857d668a..88b656a1453d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1596,25 +1596,6 @@ static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
 				 start, end - 1, can_yield, true, flush);
 }
 
-typedef bool (*rmap_handler_t)(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			       struct kvm_memory_slot *slot, gfn_t gfn,
-			       int level);
-
-static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
-						 struct kvm_gfn_range *range,
-						 rmap_handler_t handler)
-{
-	struct slot_rmap_walk_iterator iterator;
-	bool ret = false;
-
-	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
-				 range->start, range->end - 1, &iterator)
-		ret |= handler(kvm, iterator.rmap, range->slot, iterator.gfn,
-			       iterator.level);
-
-	return ret;
-}
-
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;
@@ -1634,31 +1615,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	return flush;
 }
 
-static bool kvm_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			 struct kvm_memory_slot *slot, gfn_t gfn, int level)
-{
-	u64 *sptep;
-	struct rmap_iterator iter;
-	int young = 0;
-
-	for_each_rmap_spte(rmap_head, &iter, sptep)
-		young |= mmu_spte_age(sptep);
-
-	return young;
-}
-
-static bool kvm_test_age_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
-			      struct kvm_memory_slot *slot, gfn_t gfn, int level)
-{
-	u64 *sptep;
-	struct rmap_iterator iter;
-
-	for_each_rmap_spte(rmap_head, &iter, sptep)
-		if (is_accessed_spte(*sptep))
-			return true;
-	return false;
-}
-
 #define RMAP_RECYCLE_THRESHOLD 1000
 
 static void __rmap_add(struct kvm *kvm,
@@ -1693,12 +1649,32 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
 }
 
+static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
+				   struct kvm_gfn_range *range, bool test_only)
+{
+	struct slot_rmap_walk_iterator iterator;
+	struct rmap_iterator iter;
+	bool young = false;
+	u64 *sptep;
+
+	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+				 range->start, range->end - 1, &iterator) {
+		for_each_rmap_spte(iterator.rmap, &iter, sptep) {
+			if (test_only && is_accessed_spte(*sptep))
+				return true;
+
+			young = mmu_spte_age(sptep);
+		}
+	}
+	return young;
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
 	if (kvm_memslots_have_rmaps(kvm))
-		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+		young = kvm_rmap_age_gfn_range(kvm, range, false);
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1711,7 +1687,7 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	bool young = false;
 
 	if (kvm_memslots_have_rmaps(kvm))
-		young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+		young = kvm_rmap_age_gfn_range(kvm, range, true);
 
 	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (13 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals Sean Christopherson
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Fold mmu_spte_age() into its sole caller now that aging and testing for
young SPTEs is handled in a common location, i.e. doesn't require more
helpers.

Opportunistically remove the use of mmu_spte_get_lockless(), as mmu_lock
is held (for write!), and marking SPTEs for access tracking outside of
mmu_lock is unsafe (at least, as written).  I.e. using the lockless
accessor is quite misleading.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 50 +++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 88b656a1453d..c536a069d6b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -614,32 +614,6 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
 	return __get_spte_lockless(sptep);
 }
 
-/* Returns the Accessed status of the PTE and resets it at the same time. */
-static bool mmu_spte_age(u64 *sptep)
-{
-	u64 spte = mmu_spte_get_lockless(sptep);
-
-	if (!is_accessed_spte(spte))
-		return false;
-
-	if (spte_ad_enabled(spte)) {
-		clear_bit((ffs(shadow_accessed_mask) - 1),
-			  (unsigned long *)sptep);
-	} else {
-		/*
-		 * Capture the dirty status of the page, so that it doesn't get
-		 * lost when the SPTE is marked for access tracking.
-		 */
-		if (is_writable_pte(spte))
-			kvm_set_pfn_dirty(spte_to_pfn(spte));
-
-		spte = mark_spte_for_access_track(spte);
-		mmu_spte_update_no_track(sptep, spte);
-	}
-
-	return true;
-}
-
 static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
 {
 	return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
@@ -1660,10 +1634,30 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
 	for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
 				 range->start, range->end - 1, &iterator) {
 		for_each_rmap_spte(iterator.rmap, &iter, sptep) {
-			if (test_only && is_accessed_spte(*sptep))
+			u64 spte = *sptep;
+
+			if (!is_accessed_spte(spte))
+				continue;
+
+			if (test_only)
 				return true;
 
-			young = mmu_spte_age(sptep);
+			if (spte_ad_enabled(spte)) {
+				clear_bit((ffs(shadow_accessed_mask) - 1),
+					(unsigned long *)sptep);
+			} else {
+				/*
+				 * Capture the dirty status of the page, so that
+				 * it doesn't get lost when the SPTE is marked
+				 * for access tracking.
+				 */
+				if (is_writable_pte(spte))
+					kvm_set_pfn_dirty(spte_to_pfn(spte));
+
+				spte = mark_spte_for_access_track(spte);
+				mmu_spte_update_no_track(sptep, spte);
+			}
+			young = true;
 		}
 	}
 	return young;
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (14 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 17/22] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock Sean Christopherson
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Replace all of the open coded '1' literals used to mark a PTE list as
having many/multiple entries with a proper define.  It's hard enough to
read the code with one magic bit, and a future patch to support "locking"
a single rmap will add another.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c536a069d6b9..73569979130d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -912,6 +912,7 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
  * pte_list_desc containing more mappings.
  */
+#define KVM_RMAP_MANY	BIT(0)
 
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
@@ -924,16 +925,16 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 
 	if (!rmap_head->val) {
 		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & 1)) {
+	} else if (!(rmap_head->val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | 1;
+		rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -942,10 +943,10 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+			desc->more = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | 1;
+			rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
@@ -956,7 +957,7 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 				       struct kvm_rmap_head *rmap_head,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -985,7 +986,7 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	if (!head_desc->more)
 		rmap_head->val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | 1;
+		rmap_head->val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -998,13 +999,13 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
 		return;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_head->val & KVM_RMAP_MANY)) {
 		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
 			return;
 
 		rmap_head->val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
@@ -1037,12 +1038,12 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	if (!rmap_head->val)
 		return false;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_head->val & KVM_RMAP_MANY)) {
 		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1062,10 +1063,10 @@ unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 
 	if (!rmap_head->val)
 		return 0;
-	else if (!(rmap_head->val & 1))
+	else if (!(rmap_head->val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1127,13 +1128,13 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 	if (!rmap_head->val)
 		return NULL;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_head->val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
 		sptep = (u64 *)rmap_head->val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 17/22] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (15 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent Sean Christopherson
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Refactor the pte_list and rmap code to always read and write rmap_head->val
exactly once, e.g. by collecting changes in a local variable and then
propagating those changes back to rmap_head->val as appropriate.  This will
allow implementing a per-rmap rwlock (of sorts) by adding a LOCKED bit into
the rmap value alongside the MANY bit.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 83 +++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 73569979130d..49c1f6cc1526 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -920,21 +920,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	old_val = rmap_head->val;
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -943,21 +946,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | KVM_RMAP_MANY;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	rmap_head->val = new_val;
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -984,9 +991,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -994,24 +1001,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = rmap_head->val;
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -1019,6 +1028,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	rmap_head->val = rmap_val;
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1033,17 +1045,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = rmap_head->val;
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1059,14 +1073,15 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = rmap_head->val;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & KVM_RMAP_MANY))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1109,6 +1124,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1123,18 +1139,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = rmap_head->val;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & KVM_RMAP_MANY)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~KVM_RMAP_MANY);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (16 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 17/22] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Use KVM_PAGES_PER_HPAGE() instead of open coding equivalent logic that is
anything but obvious.

No functional change intended, and verified by compiling with the below
assertions:

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_4K));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_2M));

        BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) !=
                      KVM_PAGES_PER_HPAGE(PG_LEVEL_1G));

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 49c1f6cc1526..8ca7f51c2da3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1499,7 +1499,7 @@ static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator)
 static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
 {
 	while (++iterator->rmap <= iterator->end_rmap) {
-		iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));
+		iterator->gfn += KVM_PAGES_PER_HPAGE(iterator->level);
 
 		if (iterator->rmap->val)
 			return;
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (17 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-12  8:39   ` Lai Jiangshan
                     ` (2 more replies)
  2024-08-09 19:43 ` [PATCH 20/22] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs Sean Christopherson
                   ` (3 subsequent siblings)
  22 siblings, 3 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Steal another bit from rmap entries (which are word aligned pointers, i.e.
have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
the bit to implement a *very* rudimentary per-rmap spinlock.  The only
anticipated usage of the lock outside of mmu_lock is for aging gfns, and
collisions between aging and other MMU rmap operations are quite rare,
e.g. unless userspace is being silly and aging a tiny range over and over
in a tight loop, time between contention when aging an actively running VM
is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
necessary.

Note, the lock only protects the rmap structure itself, SPTEs that are
pointed at by a locked rmap can still be modified and zapped by another
task (KVM drops/zaps SPTEs before deleting the rmap entries)

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 71 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8ca7f51c2da3..a683b5fc4026 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
 #define KVM_RMAP_MANY	BIT(0)
 
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write.  This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	old_val = READ_ONCE(rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		/*
+		 * If the rmap is locked, wait for it to be unlocked before
+		 * trying acquire the lock, e.g. to bounce the cache line.
+		 */
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = READ_ONCE(rmap_head->val);
+			cpu_relax();
+		}
+
+		/*
+		 * Recheck for an empty rmap, it may have been purged by the
+		 * task that held the lock.
+		 */
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
+
+	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	return old_val;
+}
+
+static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	WRITE_ONCE(rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+	return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
  */
@@ -924,7 +986,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	old_val = rmap_head->val;
+	old_val = kvm_rmap_lock(rmap_head);
 
 	if (!old_val) {
 		new_val = (unsigned long)spte;
@@ -956,7 +1018,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		desc->sptes[desc->spte_count++] = spte;
 	}
 
-	rmap_head->val = new_val;
+	kvm_rmap_unlock(rmap_head, new_val);
 
 	return count;
 }
@@ -1004,7 +1066,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
 		goto out;
 
@@ -1030,7 +1092,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 	}
 
 out:
-	rmap_head->val = rmap_val;
+	kvm_rmap_unlock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1048,7 +1110,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	unsigned long rmap_val;
 	int i;
 
-	rmap_val = rmap_head->val;
+	rmap_val = kvm_rmap_lock(rmap_head);
 	if (!rmap_val)
 		return false;
 
@@ -1067,13 +1129,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	struct pte_list_desc *desc;
 
 	if (!rmap_val)
@@ -1139,7 +1201,7 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
-	unsigned long rmap_val = rmap_head->val;
+	unsigned long rmap_val = kvm_rmap_get(rmap_head);
 	u64 *sptep;
 
 	if (!rmap_val)
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 20/22] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (18 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-08-09 19:43 ` [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns Sean Christopherson
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

Add a lockless version of for_each_rmap_spte(), which is pretty much the
same as the normal version, except that it doesn't BUG() the host if a
non-present SPTE is encountered.  When mmu_lock is held, it should be
impossible for a different task to zap a SPTE, _and_ zapped SPTEs must
be removed from their rmap chain prior to dropping mmu_lock.  Thus, the
normal walker BUG()s if a non-present SPTE is encountered as something is
wildly broken.

When walking rmaps without holding mmu_lock, the SPTEs pointed at by the
rmap chain can be zapped/dropped, and so a lockless walk can observe a
non-present SPTE if it runs concurrently with a different operation that
is zapping SPTEs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 86 +++++++++++++++++++++++++++++++-----------
 1 file changed, 63 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a683b5fc4026..48e8608c2738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -932,10 +932,16 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  */
 #define KVM_RMAP_LOCKED	BIT(1)
 
-static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 {
 	unsigned long old_val, new_val;
 
+	/*
+	 * Elide the lock if the rmap is empty, as lockless walkers (read-only
+	 * mode) don't need to (and can't) walk an empty rmap, nor can they add
+	 * entries to the rmap.  I.e. the only paths that process empty rmaps
+	 * do so while holding mmu_lock for write, and are mutually exclusive.
+	 */
 	old_val = READ_ONCE(rmap_head->val);
 	if (!old_val)
 		return 0;
@@ -960,17 +966,53 @@ static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
 		new_val = old_val | KVM_RMAP_LOCKED;
 	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
 
-	/* Return the old value, i.e. _without_ the LOCKED bit set. */
+	/*
+	 * Return the old value, i.e. _without_ the LOCKED bit set.  It's
+	 * impossible for the return value to be 0 (see above), i.e. the read-
+	 * only unlock flow can't get a false positive and fail to unlock.
+	 */
 	return old_val;
 }
 
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	/*
+	 * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
+	 *       held for write.
+	 */
+	return __kvm_rmap_lock(rmap_head);
+}
+
 static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
 			    unsigned long new_val)
 {
-	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	KVM_MMU_WARN_ON(new_val & KVM_RMAP_LOCKED);
 	WRITE_ONCE(rmap_head->val, new_val);
 }
 
+/*
+ * If mmu_lock isn't held, rmaps can only locked in read-only mode.  The actual
+ * locking is the same, but the caller is disallowed from modifying the rmap,
+ * and so the unlock flow is a nop if the rmap is/was empty.
+ */
+__maybe_unused
+static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
+{
+	return __kvm_rmap_lock(rmap_head);
+}
+
+__maybe_unused
+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				     unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
+	WRITE_ONCE(rmap_head->val, old_val);
+}
+
+
 static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
 {
 	return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
@@ -1202,23 +1244,18 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
 	unsigned long rmap_val = kvm_rmap_get(rmap_head);
-	u64 *sptep;
 
 	if (!rmap_val)
 		return NULL;
 
 	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_val;
-		goto out;
+		return (u64 *)rmap_val;
 	}
 
 	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
-	sptep = iter->desc->sptes[iter->pos];
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
+	return iter->desc->sptes[iter->pos];
 }
 
 /*
@@ -1228,14 +1265,11 @@ static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
  */
 static u64 *rmap_get_next(struct rmap_iterator *iter)
 {
-	u64 *sptep;
-
 	if (iter->desc) {
 		if (iter->pos < PTE_LIST_EXT - 1) {
 			++iter->pos;
-			sptep = iter->desc->sptes[iter->pos];
-			if (sptep)
-				goto out;
+			if (iter->desc->sptes[iter->pos])
+				return iter->desc->sptes[iter->pos];
 		}
 
 		iter->desc = iter->desc->more;
@@ -1243,20 +1277,26 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
 		if (iter->desc) {
 			iter->pos = 0;
 			/* desc->sptes[0] cannot be NULL */
-			sptep = iter->desc->sptes[iter->pos];
-			goto out;
+			return iter->desc->sptes[iter->pos];
 		}
 	}
 
 	return NULL;
-out:
-	BUG_ON(!is_shadow_present_pte(*sptep));
-	return sptep;
 }
 
-#define for_each_rmap_spte(_rmap_head_, _iter_, _spte_)			\
-	for (_spte_ = rmap_get_first(_rmap_head_, _iter_);		\
-	     _spte_; _spte_ = rmap_get_next(_iter_))
+#define __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)	\
+	for (_sptep_ = rmap_get_first(_rmap_head_, _iter_);	\
+	     _sptep_; _sptep_ = rmap_get_next(_iter_))
+
+#define for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)	\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)	\
+		if (!is_shadow_present_pte(*(_sptep_)))		\
+			BUG();					\
+		else
+
+#define for_each_rmap_spte_lockless(_rmap_head_, _iter_, _sptep_, _spte_)	\
+	__for_each_rmap_spte(_rmap_head_, _iter_, _sptep_)			\
+		if (is_shadow_present_pte(_spte_ = mmu_spte_get_lockless(sptep)))
 
 static void drop_spte(struct kvm *kvm, u64 *sptep)
 {
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (19 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 20/22] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-09-03 20:36   ` James Houghton
  2024-08-09 19:43 ` [PATCH 22/22] ***HACK*** KVM: x86: Don't take " Sean Christopherson
  2024-09-10  4:56 ` [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
  22 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

DO NOT MERGE, yet...

Cc: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 48e8608c2738..9df6b465de06 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
  * locking is the same, but the caller is disallowed from modifying the rmap,
  * and so the unlock flow is a nop if the rmap is/was empty.
  */
-__maybe_unused
 static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
 {
 	return __kvm_rmap_lock(rmap_head);
 }
 
-__maybe_unused
 static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
 				     unsigned long old_val)
 {
@@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
 	__rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
 }
 
-static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
-				   struct kvm_gfn_range *range, bool test_only)
+static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
+					    struct kvm_gfn_range *range,
+					    bool test_only)
+{
+	struct kvm_rmap_head *rmap_head;
+	struct rmap_iterator iter;
+	unsigned long rmap_val;
+	bool young = false;
+	u64 *sptep;
+	gfn_t gfn;
+	int level;
+	u64 spte;
+
+	for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+		for (gfn = range->start; gfn < range->end;
+		     gfn += KVM_PAGES_PER_HPAGE(level)) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_lock_readonly(rmap_head);
+
+			for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
+				if (!is_accessed_spte(spte))
+					continue;
+
+				if (test_only) {
+					kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+					return true;
+				}
+
+				/*
+				 * Marking SPTEs for access tracking outside of
+				 * mmu_lock is unsupported.  Report the page as
+				 * young, but otherwise leave it as-is.
+				 */
+				if (spte_ad_enabled(spte))
+					clear_bit((ffs(shadow_accessed_mask) - 1),
+						  (unsigned long *)sptep);
+				young = true;
+			}
+
+			kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+		}
+	}
+	return young;
+}
+
+static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
+				     struct kvm_gfn_range *range, bool test_only)
 {
 	struct slot_rmap_walk_iterator iterator;
 	struct rmap_iterator iter;
@@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
 	return young;
 }
 
+
+static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
+				   struct kvm_gfn_range *range, bool test_only)
+{
+	/* FIXME: This also needs to be guarded with something like range->fast_only. */
+	if (kvm_ad_enabled())
+		return kvm_rmap_age_gfn_range_lockless(kvm, range, test_only);
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	return __kvm_rmap_age_gfn_range(kvm, range, test_only);
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 22/22] ***HACK*** KVM: x86: Don't take mmu_lock when aging gfns
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (20 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns Sean Christopherson
@ 2024-08-09 19:43 ` Sean Christopherson
  2024-09-10  4:56 ` [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

DO NOT MERGE, this is a horrific hack, breaks TDP MMU, etc.

Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 2 ++
 arch/x86/kvm/vmx/vmx.c | 2 ++
 virt/kvm/kvm_main.c    | 7 ++++---
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c115d26844f7..e5c5d0f9a69d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5300,6 +5300,8 @@ static __init int svm_hardware_setup(void)
 	if (!boot_cpu_has(X86_FEATURE_NPT))
 		npt_enabled = false;
 
+	npt_enabled = false;
+
 	/* Force VM NPT level equal to the host's paging level */
 	kvm_configure_mmu(npt_enabled, get_npt_level(),
 			  get_npt_level(), PG_LEVEL_1G);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..28f3493d6391 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8419,6 +8419,8 @@ __init int vmx_hardware_setup(void)
 	    !cpu_has_vmx_invept_global())
 		enable_ept = 0;
 
+	enable_ept = 0;
+
 	/* NX support is required for shadow paging. */
 	if (!enable_ept && !boot_cpu_has(X86_FEATURE_NX)) {
 		pr_err_ratelimited("NX (Execute Disable) not supported\n");
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..1b9b5dea2ac8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -642,10 +642,11 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.slot = slot;
 
 			if (!r.found_memslot) {
-				r.found_memslot = true;
-				KVM_MMU_LOCK(kvm);
-				if (!IS_KVM_NULL_FN(range->on_lock))
+				if (!IS_KVM_NULL_FN(range->on_lock)) {
+					r.found_memslot = true;
+					KVM_MMU_LOCK(kvm);
 					range->on_lock(kvm);
+				}
 
 				if (IS_KVM_NULL_FN(range->handler))
 					goto mmu_unlock;
-- 
2.46.0.76.ge559c4bf1a-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
@ 2024-08-12  8:39   ` Lai Jiangshan
  2024-08-12 15:22     ` Sean Christopherson
  2024-09-03 20:17   ` James Houghton
  2024-09-09 19:00   ` James Houghton
  2 siblings, 1 reply; 50+ messages in thread
From: Lai Jiangshan @ 2024-08-12  8:39 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote:

> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;
> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();

The sequence of these two lines of code can be improved.

> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-12  8:39   ` Lai Jiangshan
@ 2024-08-12 15:22     ` Sean Christopherson
  2024-08-13  6:35       ` Lai Jiangshan
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-08-12 15:22 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Mon, Aug 12, 2024, Lai Jiangshan wrote:
> On Sat, Aug 10, 2024 at 3:49 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> > +
> > +       do {
> > +               /*
> > +                * If the rmap is locked, wait for it to be unlocked before
> > +                * trying acquire the lock, e.g. to bounce the cache line.
> > +                */
> > +               while (old_val & KVM_RMAP_LOCKED) {
> > +                       old_val = READ_ONCE(rmap_head->val);
> > +                       cpu_relax();
> 
> The sequence of these two lines of code can be improved.

Oh yeah, duh, re-read after PAUSE, not before.

Definitely holler if you have any alternative ideas for walking rmaps without
taking mmu_lock, I guarantee you've spent more time than me thinking about the
shadow MMU :-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  2024-08-09 19:43 ` [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper Sean Christopherson
@ 2024-08-12 21:53   ` David Matlack
  2024-08-12 21:58     ` David Matlack
  0 siblings, 1 reply; 50+ messages in thread
From: David Matlack @ 2024-08-12 21:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Fri, Aug 9, 2024 at 12:48 PM Sean Christopherson <seanjc@google.com> wrote:
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0a33857d668a..88b656a1453d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> +                                  struct kvm_gfn_range *range, bool test_only)
> +{
> +       struct slot_rmap_walk_iterator iterator;
> +       struct rmap_iterator iter;
> +       bool young = false;
> +       u64 *sptep;
> +
> +       for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
> +                                range->start, range->end - 1, &iterator) {
> +               for_each_rmap_spte(iterator.rmap, &iter, sptep) {
> +                       if (test_only && is_accessed_spte(*sptep))
> +                               return true;
> +
> +                       young = mmu_spte_age(sptep);

It's jarring to see that mmu_spte_age() can get called in the
test_only case, even though I think the code is technically correct
(it will only be called if !is_accessed_spte() in which case
mmu_spte_age() will do nothing).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  2024-08-12 21:53   ` David Matlack
@ 2024-08-12 21:58     ` David Matlack
  0 siblings, 0 replies; 50+ messages in thread
From: David Matlack @ 2024-08-12 21:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Mon, Aug 12, 2024 at 2:53 PM David Matlack <dmatlack@google.com> wrote:
>
> On Fri, Aug 9, 2024 at 12:48 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 0a33857d668a..88b656a1453d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                  struct kvm_gfn_range *range, bool test_only)
> > +{
> > +       struct slot_rmap_walk_iterator iterator;
> > +       struct rmap_iterator iter;
> > +       bool young = false;
> > +       u64 *sptep;
> > +
> > +       for_each_slot_rmap_range(range->slot, PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
> > +                                range->start, range->end - 1, &iterator) {
> > +               for_each_rmap_spte(iterator.rmap, &iter, sptep) {
> > +                       if (test_only && is_accessed_spte(*sptep))
> > +                               return true;
> > +
> > +                       young = mmu_spte_age(sptep);
>
> It's jarring to see that mmu_spte_age() can get called in the
> test_only case, even though I think the code is technically correct
> (it will only be called if !is_accessed_spte() in which case
> mmu_spte_age() will do nothing).

Nevermind, I see this is cleaned up in the following patch.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-12 15:22     ` Sean Christopherson
@ 2024-08-13  6:35       ` Lai Jiangshan
  2024-08-13 15:19         ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Lai Jiangshan @ 2024-08-13  6:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:

>
> Oh yeah, duh, re-read after PAUSE, not before.
>
> Definitely holler if you have any alternative ideas for walking rmaps
> without taking mmu_lock, I guarantee you've spent more time than me
> thinking about the shadow MMU :-)

We use the same bit and the same way for the rmap lock.

We just use bit_spin_lock() and the optimization for empty rmap_head is
handled out of kvm_rmap_lock().

bit_spin_lock() has the most-needed preempt_disable(). I'm not sure if the
new kvm_rmap_age_gfn_range_lockless() is called in a preempt disabled region.

Thanks
Lai

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-13  6:35       ` Lai Jiangshan
@ 2024-08-13 15:19         ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-13 15:19 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Tue, Aug 13, 2024, Lai Jiangshan wrote:
> On Mon, Aug 12, 2024 at 11:22 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> >
> > Oh yeah, duh, re-read after PAUSE, not before.
> >
> > Definitely holler if you have any alternative ideas for walking rmaps
> > without taking mmu_lock, I guarantee you've spent more time than me
> > thinking about the shadow MMU :-)
> 
> We use the same bit and the same way for the rmap lock.
> 
> We just use bit_spin_lock() and the optimization for empty rmap_head is
> handled out of kvm_rmap_lock().

Hmm, I'm leaning towards keeping the custom locking.  There are a handful of
benefits, none of which are all that meaningful on their own, but do add up.

 - Callers don't need to manually check for an empty rmap_head.
 - Can avoid the redundant preempt_{disable,enable}() entirely in the common case
   of being called while mmu_lock is held.
 - Handles the (likely super rare) edge case where a read-only walker encounters
   an rmap that was just emptied (rmap_head->val goes to zero after the initial
   check to elide the lock).
 - Avoids an atomic when releasing the lock, and any extra instructions entirely
   for writers since they always write the full rmap_head->val when releasing.

> bit_spin_lock() has the most-needed preempt_disable(). I'm not sure if the
> new kvm_rmap_age_gfn_range_lockless() is called in a preempt disabled region.

Oof, it doesn't.  Disabling IRQs crossed my mind, but I completely forgot about
preemption.

Thanks much!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
  2024-08-09 19:43 ` [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() Sean Christopherson
@ 2024-08-28 18:20   ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-08-28 18:20 UTC (permalink / raw)
  To: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Fri, Aug 09, 2024, Sean Christopherson wrote:
> Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so
> that the walkers can be used to handle mmu_notifier invalidations, and so
> that similar function has some amount of locality in code.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 107 +++++++++++++++++++++--------------------
>  1 file changed, 54 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 901be9e420a4..676cb7dfcbf9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1534,6 +1534,60 @@ static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
>  	     slot_rmap_walk_okay(_iter_);				\
>  	     slot_rmap_walk_next(_iter_))
>  
> +

Doh, extra newline here.

> +/* The return value indicates if tlb flush on all vcpus is needed. */
> +typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
> +				    struct kvm_rmap_head *rmap_head,
> +				    const struct kvm_memory_slot *slot);
> +

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
  2024-08-12  8:39   ` Lai Jiangshan
@ 2024-09-03 20:17   ` James Houghton
  2024-09-03 21:27     ` Sean Christopherson
  2024-09-09 19:00   ` James Houghton
  2 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-03 20:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Steal another bit from rmap entries (which are word aligned pointers, i.e.
> have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
> the bit to implement a *very* rudimentary per-rmap spinlock.  The only
> anticipated usage of the lock outside of mmu_lock is for aging gfns, and
> collisions between aging and other MMU rmap operations are quite rare,
> e.g. unless userspace is being silly and aging a tiny range over and over
> in a tight loop, time between contention when aging an actively running VM
> is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
> necessary.
>
> Note, the lock only protects the rmap structure itself, SPTEs that are
> pointed at by a locked rmap can still be modified and zapped by another
> task (KVM drops/zaps SPTEs before deleting the rmap entries)
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8ca7f51c2da3..a683b5fc4026 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
>   * About rmap_head encoding:
>   *
>   * If the bit zero of rmap_head->val is clear, then it points to the only spte
> - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
> + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
>   * pte_list_desc containing more mappings.
>   */
>  #define KVM_RMAP_MANY  BIT(0)
>
> +/*
> + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> + * operates with mmu_lock held for write), but rmaps can be walked without
> + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> + * being zapped/dropped _while the rmap is locked_.
> + *
> + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> + * done while holding mmu_lock for write.  This allows a task walking rmaps
> + * without holding mmu_lock to concurrently walk the same entries as a task
> + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> + * the rmaps, thus the walks are stable.
> + *
> + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> + * ensures all "struct pte_list_desc" fields are stable.
> + */
> +#define KVM_RMAP_LOCKED        BIT(1)
> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;

I'm having trouble understanding how this bit works. What exactly is
stopping the rmap from being populated while we have it "locked"? We
aren't holding the MMU lock at all in the lockless case, and given
this bit, it is impossible (I think?) for the MMU-write-lock-holding,
rmap-modifying side to tell that this rmap is locked.

Concretely, my immediate concern is that we will still unconditionally
write 0 back at unlock time even if the value has changed.

I expect that this works and I'm just not getting it... :)


> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();
> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +                           unsigned long new_val)
> +{
> +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
> +       WRITE_ONCE(rmap_head->val, new_val);
> +}
> +
> +static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
> +{
> +       return READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
> +}
> +
>  /*
>   * Returns the number of pointers in the rmap chain, not counting the new one.
>   */
> @@ -924,7 +986,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>         struct pte_list_desc *desc;
>         int count = 0;
>
> -       old_val = rmap_head->val;
> +       old_val = kvm_rmap_lock(rmap_head);
>
>         if (!old_val) {
>                 new_val = (unsigned long)spte;
> @@ -956,7 +1018,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>                 desc->sptes[desc->spte_count++] = spte;
>         }
>
> -       rmap_head->val = new_val;
> +       kvm_rmap_unlock(rmap_head, new_val);
>
>         return count;
>  }
> @@ -1004,7 +1066,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
>         unsigned long rmap_val;
>         int i;
>
> -       rmap_val = rmap_head->val;
> +       rmap_val = kvm_rmap_lock(rmap_head);
>         if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
>                 goto out;
>
> @@ -1030,7 +1092,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
>         }
>
>  out:
> -       rmap_head->val = rmap_val;
> +       kvm_rmap_unlock(rmap_head, rmap_val);
>  }
>
>  static void kvm_zap_one_rmap_spte(struct kvm *kvm,
> @@ -1048,7 +1110,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
>         unsigned long rmap_val;
>         int i;
>
> -       rmap_val = rmap_head->val;
> +       rmap_val = kvm_rmap_lock(rmap_head);
>         if (!rmap_val)
>                 return false;
>
> @@ -1067,13 +1129,13 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
>         }
>  out:
>         /* rmap_head is meaningless now, remember to reset it */
> -       rmap_head->val = 0;
> +       kvm_rmap_unlock(rmap_head, 0);
>         return true;
>  }
>
>  unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
>  {
> -       unsigned long rmap_val = rmap_head->val;
> +       unsigned long rmap_val = kvm_rmap_get(rmap_head);
>         struct pte_list_desc *desc;
>
>         if (!rmap_val)
> @@ -1139,7 +1201,7 @@ struct rmap_iterator {
>  static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
>                            struct rmap_iterator *iter)
>  {
> -       unsigned long rmap_val = rmap_head->val;
> +       unsigned long rmap_val = kvm_rmap_get(rmap_head);
>         u64 *sptep;
>
>         if (!rmap_val)
> --
> 2.46.0.76.ge559c4bf1a-goog
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
  2024-08-09 19:43 ` [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns Sean Christopherson
@ 2024-09-03 20:36   ` James Houghton
  2024-09-04 15:48     ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-03 20:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> DO NOT MERGE, yet...
>
> Cc: James Houghton <jthoughton@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 48e8608c2738..9df6b465de06 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
>   * locking is the same, but the caller is disallowed from modifying the rmap,
>   * and so the unlock flow is a nop if the rmap is/was empty.
>   */
> -__maybe_unused
>  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
>  {
>         return __kvm_rmap_lock(rmap_head);
>  }
>
> -__maybe_unused
>  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
>                                      unsigned long old_val)
>  {
> @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
>         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
>  }
>
> -static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> -                                  struct kvm_gfn_range *range, bool test_only)
> +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
> +                                           struct kvm_gfn_range *range,
> +                                           bool test_only)
> +{
> +       struct kvm_rmap_head *rmap_head;
> +       struct rmap_iterator iter;
> +       unsigned long rmap_val;
> +       bool young = false;
> +       u64 *sptep;
> +       gfn_t gfn;
> +       int level;
> +       u64 spte;
> +
> +       for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> +               for (gfn = range->start; gfn < range->end;
> +                    gfn += KVM_PAGES_PER_HPAGE(level)) {
> +                       rmap_head = gfn_to_rmap(gfn, level, range->slot);
> +                       rmap_val = kvm_rmap_lock_readonly(rmap_head);
> +
> +                       for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
> +                               if (!is_accessed_spte(spte))
> +                                       continue;
> +
> +                               if (test_only) {
> +                                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> +                                       return true;
> +                               }
> +
> +                               /*
> +                                * Marking SPTEs for access tracking outside of
> +                                * mmu_lock is unsupported.  Report the page as
> +                                * young, but otherwise leave it as-is.

Just for my own understanding, what's the main reason why it's unsafe
to mark PTEs for access tracking outside the mmu_lock?

> +                                */
> +                               if (spte_ad_enabled(spte))
> +                                       clear_bit((ffs(shadow_accessed_mask) - 1),
> +                                                 (unsigned long *)sptep);

I feel like it'd be kinda nice to de-duplicate this clear_bit() piece
with the one in kvm_rmap_age_gfn_range().

> +                               young = true;
> +                       }
> +
> +                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> +               }
> +       }
> +       return young;
> +}
> +
> +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
> +                                    struct kvm_gfn_range *range, bool test_only)
>  {
>         struct slot_rmap_walk_iterator iterator;
>         struct rmap_iterator iter;
> @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
>         return young;
>  }
>
> +
> +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> +                                  struct kvm_gfn_range *range, bool test_only)
> +{
> +       /* FIXME: This also needs to be guarded with something like range->fast_only. */
> +       if (kvm_ad_enabled())

I expect this to be something like `if (kvm_ad_enabled() ||
range->fast_only)`. With MGLRU, that means the pages will always be
the last candidates for eviction, though it is still possible for them
to be evicted (though I think this would basically never happen). I
think this is fine.

I think the only other possible choice is to record/return 'not
young'/false instead of 'young'/true if the spte is young but
!spte_ad_enabled(). That doesn't seem to be obviously better, though
we *will* get correct age information at eviction time, when
!range->fast_only, at which point the page will not be evicted, and
Accessed will be cleared.


> +               return kvm_rmap_age_gfn_range_lockless(kvm, range, test_only);
> +
> +       lockdep_assert_held_write(&kvm->mmu_lock);
> +       return __kvm_rmap_age_gfn_range(kvm, range, test_only);
> +}
> +
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>         bool young = false;
> --
> 2.46.0.76.ge559c4bf1a-goog
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-03 20:17   ` James Houghton
@ 2024-09-03 21:27     ` Sean Christopherson
  2024-09-03 21:40       ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-03 21:27 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
> > + */
> > +#define KVM_RMAP_LOCKED        BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> 
> I'm having trouble understanding how this bit works. What exactly is
> stopping the rmap from being populated while we have it "locked"?

Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for
aging only cares about existing mappings.  The key to correctness is that this
helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do
nothing and thus will never see any rmaps that come along in the future.

> aren't holding the MMU lock at all in the lockless case, and given
> this bit, it is impossible (I think?) for the MMU-write-lock-holding,
> rmap-modifying side to tell that this rmap is locked.
> 
> Concretely, my immediate concern is that we will still unconditionally
> write 0 back at unlock time even if the value has changed.

The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty,
i.e. wasn't actually locked.

+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+				     unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
+	WRITE_ONCE(rmap_head->val, old_val);
+}

The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the
"normal", non-readonly versions if and only if mmu_lock is held for write.

+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	/*
+	 * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
+	 *       held for write.
+	 */
+	return __kvm_rmap_lock(rmap_head);
+}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-03 21:27     ` Sean Christopherson
@ 2024-09-03 21:40       ` James Houghton
  0 siblings, 0 replies; 50+ messages in thread
From: James Houghton @ 2024-09-03 21:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Tue, Sep 3, 2024 at 2:27 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Sep 03, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > +/*
> > > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > > + * operates with mmu_lock held for write), but rmaps can be walked without
> > > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > > + * being zapped/dropped _while the rmap is locked_.
> > > + *
> > > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > > + * without holding mmu_lock to concurrently walk the same entries as a task
> > > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > > + * the rmaps, thus the walks are stable.
> > > + *
> > > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > > + * ensures all "struct pte_list_desc" fields are stable.
> > > + */
> > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > +       unsigned long old_val, new_val;
> > > +
> > > +       old_val = READ_ONCE(rmap_head->val);
> > > +       if (!old_val)
> > > +               return 0;
> >
> > I'm having trouble understanding how this bit works. What exactly is
> > stopping the rmap from being populated while we have it "locked"?
>
> Nothing prevents the 0=>1 transition, but that's a-ok because walking rmaps for
> aging only cares about existing mappings.  The key to correctness is that this
> helper returns '0' when there are no rmaps, i.e. the caller is guaranteed to do
> nothing and thus will never see any rmaps that come along in the future.
>
> > aren't holding the MMU lock at all in the lockless case, and given
> > this bit, it is impossible (I think?) for the MMU-write-lock-holding,
> > rmap-modifying side to tell that this rmap is locked.
> >
> > Concretely, my immediate concern is that we will still unconditionally
> > write 0 back at unlock time even if the value has changed.
>
> The "readonly" unlocker (not added in this patch) is a nop if the rmap was empty,
> i.e. wasn't actually locked.

Ah, that's what I was missing. Thanks! This all makes sense now.

>
> +static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> +                                    unsigned long old_val)
> +{
> +       if (!old_val)
> +               return;
> +
> +       KVM_MMU_WARN_ON(old_val != (rmap_head->val & ~KVM_RMAP_LOCKED));
> +       WRITE_ONCE(rmap_head->val, old_val);
> +}
>
> The TODO in kvm_rmap_lock() pretty much sums things up: it's safe to call the
> "normal", non-readonly versions if and only if mmu_lock is held for write.
>
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       /*
> +        * TODO: Plumb in @kvm and add a lockdep assertion that mmu_lock is
> +        *       held for write.
> +        */
> +       return __kvm_rmap_lock(rmap_head);
> +}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns
  2024-09-03 20:36   ` James Houghton
@ 2024-09-04 15:48     ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-09-04 15:48 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Tue, Sep 03, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > DO NOT MERGE, yet...
> >
> > Cc: James Houghton <jthoughton@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 59 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 48e8608c2738..9df6b465de06 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> >   * locking is the same, but the caller is disallowed from modifying the rmap,
> >   * and so the unlock flow is a nop if the rmap is/was empty.
> >   */
> > -__maybe_unused
> >  static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
> >  {
> >         return __kvm_rmap_lock(rmap_head);
> >  }
> >
> > -__maybe_unused
> >  static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
> >                                      unsigned long old_val)
> >  {
> > @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
> >         __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
> >  }
> >
> > -static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > -                                  struct kvm_gfn_range *range, bool test_only)
> > +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm,
> > +                                           struct kvm_gfn_range *range,
> > +                                           bool test_only)
> > +{
> > +       struct kvm_rmap_head *rmap_head;
> > +       struct rmap_iterator iter;
> > +       unsigned long rmap_val;
> > +       bool young = false;
> > +       u64 *sptep;
> > +       gfn_t gfn;
> > +       int level;
> > +       u64 spte;
> > +
> > +       for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
> > +               for (gfn = range->start; gfn < range->end;
> > +                    gfn += KVM_PAGES_PER_HPAGE(level)) {
> > +                       rmap_head = gfn_to_rmap(gfn, level, range->slot);
> > +                       rmap_val = kvm_rmap_lock_readonly(rmap_head);
> > +
> > +                       for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
> > +                               if (!is_accessed_spte(spte))
> > +                                       continue;
> > +
> > +                               if (test_only) {
> > +                                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +                                       return true;
> > +                               }
> > +
> > +                               /*
> > +                                * Marking SPTEs for access tracking outside of
> > +                                * mmu_lock is unsupported.  Report the page as
> > +                                * young, but otherwise leave it as-is.
> 
> Just for my own understanding, what's the main reason why it's unsafe

Note, I specifically said "unsupported", not "unsafe" :-D

> to mark PTEs for access tracking outside the mmu_lock?

It probably can be done safely?  The main issue is that marking the SPTE for
access tracking can also clear the Writable bit, and so we'd need to audit all
the flows that consume is_writable_pte().  Hmm, actually, that's less scary than
it first seems, because thanks to kvm_mmu_notifier_clear_young(), KVM already
clears the Writable bit in AD-disabled SPTEs without a TLB flush.  E.g.
mmu_spte_update() specifically looks at MMU-writable, not the Writable bit, when
deciding if a TLB flush is required.

On a related note, I missed is that KVM would need to leaf SPTEs as volatile at
all times, as your MGLRU series modified kvm_tdp_mmu_spte_need_atomic_write(),
not the common spte_has_volatile_bits().

Actually, on second though, maybe it isn't necessary for the AD-enabled case.
Effectively clobbering the Accessed bit is completely fine, as aging is tolerant
of false negatives and false positives, so long as they aren't excessive.  And
that's doubly true if KVM x86 follows MM and doesn't force a TLB flush[1]

Oooh, and triply true if KVM stops marking the folio accesses when zapping SPTEs[2].

So yeah, after thinking though all of the moving parts, maybe we should commit
to aging AD-disabled SPTEs out of mmu_lock.  AD-disabled leaf SPTEs do end up being
"special", because KVM needs to ensure it doesn't clobber the Writable bit, i.e.
AD-disabled leaf SPTEs need to be treated as volatile at all times.  But in practice,
forcing an atomic update for all AD-disabled leaf SPTEs probably doesn't actually
change much, because in most cases KVM is probably using an atomic access anyways,
e.g. because KVM is clearing the Writable bit and the Writable bit is already volatile.

FWIW, marking the folio dirty if the SPTE was writable, as is done today in
mmu_spte_age(), is sketchy if mmu_lock isn't held, but probably ok since this is
invoked from an mmu_notifier and presumably the caller holds a reference to the
page/folio.  But that's largely a moot point since I want to yank out that code
anyways[3].

[1] https://lore.kernel.org/all/ZsS_OmxwFzrqDcfY@google.com
[2] https://lore.kernel.org/all/20240726235234.228822-82-seanjc@google.com
[3] https://lore.kernel.org/all/20240726235234.228822-8-seanjc@google.com

> > +                               if (spte_ad_enabled(spte))
> > +                                       clear_bit((ffs(shadow_accessed_mask) - 1),
> > +                                                 (unsigned long *)sptep);
> 
> I feel like it'd be kinda nice to de-duplicate this clear_bit() piece
> with the one in kvm_rmap_age_gfn_range().

Ya, definitely no argument against adding a helper.

> > +                               young = true;
> > +                       }
> > +
> > +                       kvm_rmap_unlock_readonly(rmap_head, rmap_val);
> > +               }
> > +       }
> > +       return young;
> > +}
> > +
> > +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                    struct kvm_gfn_range *range, bool test_only)
> >  {
> >         struct slot_rmap_walk_iterator iterator;
> >         struct rmap_iterator iter;
> > @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> >         return young;
> >  }
> >
> > +
> > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
> > +                                  struct kvm_gfn_range *range, bool test_only)
> > +{
> > +       /* FIXME: This also needs to be guarded with something like range->fast_only. */
> > +       if (kvm_ad_enabled())
> 
> I expect this to be something like `if (kvm_ad_enabled() ||
> range->fast_only)`. With MGLRU, that means the pages will always be the last
> candidates for eviction, though it is still possible for them to be evicted
> (though I think this would basically never happen). I think this is fine.
> 
> I think the only other possible choice is to record/return 'not young'/false
> instead of 'young'/true if the spte is young but !spte_ad_enabled(). That
> doesn't seem to be obviously better, though we *will* get correct age
> information at eviction time, when !range->fast_only, at which point the page
> will not be evicted, and Accessed will be cleared.

As above, I think the simpler solution overall is to support aging AD-disabled
SPTEs out of mmu_lock.  The sequence of getting to that end state will be more
complex, but most of that complexity is going to happen irrespective of this series.
And it would mean KVM MGLRU support has no chance of landing in 6.12, but again
I think that's the reality either way.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/22] KVM: selftests: Compute number of extra pages needed in mmu_stress_test
  2024-08-09 19:43 ` [PATCH 04/22] KVM: selftests: Compute number of extra pages needed " Sean Christopherson
@ 2024-09-06  0:03   ` James Houghton
  2024-09-06  4:42     ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-06  0:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Create mmu_stress_tests's VM with the correct number of extra pages needed
> to map all of memory in the guest.  The bug hasn't been noticed before as
> the test currently runs only on x86, which maps guest memory with 1GiB
> pages, i.e. doesn't need much memory in the guest for page tables.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  tools/testing/selftests/kvm/mmu_stress_test.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index 847da23ec1b1..5467b12f5903 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -209,7 +209,13 @@ int main(int argc, char *argv[])
>         vcpus = malloc(nr_vcpus * sizeof(*vcpus));
>         TEST_ASSERT(vcpus, "Failed to allocate vCPU array");
>
> -       vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus);
> +       vm = __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus,
> +#ifdef __x86_64__
> +                                   max_mem / SZ_1G,
> +#else
> +                                   max_mem / vm_guest_mode_params[VM_MODE_DEFAULT].page_size,
> +#endif
> +                                   guest_code, vcpus);

Hmm... I'm trying to square this change with the logic in
vm_nr_pages_required(). That logic seems to be doing what you want
(though it always assumes small mappings IIUC).

So it seems like there's something else that's not being accounted
for? (Also without the extra pages, how does this test actually fail?)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/22] KVM: selftests: Compute number of extra pages needed in mmu_stress_test
  2024-09-06  0:03   ` James Houghton
@ 2024-09-06  4:42     ` Sean Christopherson
  2024-09-06 18:25       ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-06  4:42 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Thu, Sep 05, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Create mmu_stress_tests's VM with the correct number of extra pages needed
> > to map all of memory in the guest.  The bug hasn't been noticed before as
> > the test currently runs only on x86, which maps guest memory with 1GiB
> > pages, i.e. doesn't need much memory in the guest for page tables.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  tools/testing/selftests/kvm/mmu_stress_test.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > index 847da23ec1b1..5467b12f5903 100644
> > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > @@ -209,7 +209,13 @@ int main(int argc, char *argv[])
> >         vcpus = malloc(nr_vcpus * sizeof(*vcpus));
> >         TEST_ASSERT(vcpus, "Failed to allocate vCPU array");
> >
> > -       vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus);
> > +       vm = __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus,
> > +#ifdef __x86_64__
> > +                                   max_mem / SZ_1G,
> > +#else
> > +                                   max_mem / vm_guest_mode_params[VM_MODE_DEFAULT].page_size,
> > +#endif
> > +                                   guest_code, vcpus);
> 
> Hmm... I'm trying to square this change with the logic in
> vm_nr_pages_required(). 

vm_nr_pages_required() mostly operates on the number of pages that are needed to
setup the VM, e.g. for vCPU stacks.  The one calculation that guesstimates the
number of bytes needed, ucall_nr_pages_required(), does the same thing this code
does: divide the number of total bytes by bytes-per-page.

> That logic seems to be doing what you want (though it always assumes small
> mappings IIUC).

Ya, AFAIK, x86 is the only architecture that let's test map huge pages on the
guest side.

> So it seems like there's something else that's not being accounted for?

I don't think so?  vm_nr_pages_required() uses @nr_pages to determine how many
page table pages will be needed in the guest, and then adds that many non-huge
pages worth of bytes to the size of memslot 0.

> (Also without the extra pages, how does this test actually fail?)

Guest memory allocation failure when trying to create the guest mappings.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/22] KVM: selftests: Compute number of extra pages needed in mmu_stress_test
  2024-09-06  4:42     ` Sean Christopherson
@ 2024-09-06 18:25       ` James Houghton
  0 siblings, 0 replies; 50+ messages in thread
From: James Houghton @ 2024-09-06 18:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Thu, Sep 5, 2024 at 9:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Sep 05, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Create mmu_stress_tests's VM with the correct number of extra pages needed
> > > to map all of memory in the guest.  The bug hasn't been noticed before as
> > > the test currently runs only on x86, which maps guest memory with 1GiB
> > > pages, i.e. doesn't need much memory in the guest for page tables.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  tools/testing/selftests/kvm/mmu_stress_test.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > index 847da23ec1b1..5467b12f5903 100644
> > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > @@ -209,7 +209,13 @@ int main(int argc, char *argv[])
> > >         vcpus = malloc(nr_vcpus * sizeof(*vcpus));
> > >         TEST_ASSERT(vcpus, "Failed to allocate vCPU array");
> > >
> > > -       vm = vm_create_with_vcpus(nr_vcpus, guest_code, vcpus);
> > > +       vm = __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus,
> > > +#ifdef __x86_64__
> > > +                                   max_mem / SZ_1G,
> > > +#else
> > > +                                   max_mem / vm_guest_mode_params[VM_MODE_DEFAULT].page_size,
> > > +#endif
> > > +                                   guest_code, vcpus);
> >
> > Hmm... I'm trying to square this change with the logic in
> > vm_nr_pages_required().
>
> vm_nr_pages_required() mostly operates on the number of pages that are needed to
> setup the VM, e.g. for vCPU stacks.  The one calculation that guesstimates the
> number of bytes needed, ucall_nr_pages_required(), does the same thing this code
> does: divide the number of total bytes by bytes-per-page.

Oh, yes, you're right. It's only accounting for the page tables for
the 512 pages for memslot 0. Sorry for the noise. Feel free to add:

Reviewed-by: James Houghton <jthoughton@google.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
       [not found]   ` <CADrL8HXcD--jn1iLeCJycCd3Btv4_rBPxz6NMnTREXfeh0vRZA@mail.gmail.com>
@ 2024-09-07  0:53     ` Sean Christopherson
  2024-09-07  2:26       ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-07  0:53 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu, James Houghton

On Fri, Sep 06, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > guest memory that was writable, and then made read-only in the primary MMU,
> > and then made writable again.
> >
> > Add bonus coverage for x86 to verify that all of guest memory was marked
> > read-only.  Making forward progress (without making memory writable)
> > requires arch specific code to skip over the faulting instruction, but the
> > test can at least verify each vCPU's starting page was made read-only.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Writing off-list because I just have some smaller questions that I
> don't want to bother the list with.

Pulling everyone and the lists back in :-)

IMO, no question is too small for kvm@, and lkml@ is gigantic firehose that's 99%
archival and 1% list, at best.  Odds are very, very good that if you have a
question, however trivial or small, then someone else has the exact same question,
or _will_ have the question in the future.

I strongly prefer that all questions, review, feedback, etc. happen on list, even
if the questions/feedback may seem trivial or noisy.  The only exception is if
information can't/shouldn't be made public, e.g. because of an embargo, NDA,
security implications, etc.

> For the next version, feel free to add:
> 
> Reviewed-by: James Houghton <jthoughton@google.com>
> 
> All of the selftest patches look fine to me.
> 
> > ---
> >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> >  1 file changed, 84 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > index 50c3a17418c4..98f9a4660269 100644
> > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > @@ -16,6 +16,8 @@
> >  #include "guest_modes.h"
> >  #include "processor.h"
> >
> > +static bool mprotect_ro_done;
> > +
> >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >  {
> >         uint64_t gpa;
> > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> >                 *((volatile uint64_t *)gpa);
> >         GUEST_SYNC(2);
> >
> > +       /*
> > +        * Write to the region while mprotect(PROT_READ) is underway.  Keep
> > +        * looping until the memory is guaranteed to be read-only, otherwise
> > +        * vCPUs may complete their writes and advance to the next stage
> > +        * prematurely.
> > +        */
> > +       do {
> > +               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > +#ifdef __x86_64__
> > +                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> 
> Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> 
> Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> BYTE PTR [rax], 0x0`. (just curious, no need to change it)

LOL, yes, but as evidenced by the trailing comment, my intent was to generate
"mov rax, [rax]", not "movb $0, [rax]".  I suspect I was too lazy to consult the
SDM to recall the correct opcode and simply copied an instruction from some random
disassembly output without looking too closely at the output.

	asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */

> And I take it you wrote it out like this (instead of using mnemonics)
> so that you could guarantee that IP + 4 would be the right way to skip
> forwards. Does it make sense to leave a comment about that? 

Yes and yes.

> The translation from mnemonic --> bytes won't change...

Heh, this is x86, by no means is that guaranteed.  E.g. see the above, where the
same mnemonic can be represented multiple ways.

> so could you just write the proper assembly? (not a request,  just curious)

In practice, probably.  But the rules for inline assembly are, at best, fuzzy.
So long as the correct instruction is generated, the assembler has quite a bit
of freedom.

E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:

  48 89 00

but can also be encoded as

  48 89 40 00

Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
it's perfectly legal to do so.  E.g. with gcc-13, this

  mov %rax, 0x0(%rax)

generates

  48 89 00

even though a more literal interpretation would be

  48 89 40 00

So yeah, while the hand-coded opcode is gross and annoying, given that a failure
due to the "wrong" instruction being generated would be painful and time consuming
to debug, hand-coding is worth avoiding the risk and potential pain if the compiler
decides to be mean :-)

> A comment that 0x40 corresponds to %rax and that "a" also corresponds
> to %rax would have been helpful for me. :)

Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
really all that reasonable because _so_ much information needs to be conveyed to
capture the entire picture, and some things are essentially table stakes when it
comes to x86 kernel programming.

E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the
addressing mode, which in turn depends on the operating mode (64-bit).

And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
distinctly different than:

  asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */

even though in this specific scenario they generate the same code.

And with the correct "48 89 00", understanding the full encoding requires describing
REX prefixes, which are a mess unto themselves.

So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
though I 100% agree that it puts a decent sized load on the reader.  There's just
_too_ much information to communicate to the reader, at least for x86.

> > +#else
> > +                       vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > +#endif
> > +       } while (!READ_ONCE(mprotect_ro_done));
> > +
> > +       /*
> > +        * Only x86 can explicitly sync, as other architectures will be stuck
> > +        * on the write fault.
> 
> It would also have been a little clearer if the comment also said how
> this is just because the test has been written to increment for the PC
> upon getting these write faults *for x86 only*. IDK something like
> "For x86, the test will adjust the PC for each write fault, allowing
> the above loop to complete. Other architectures will get stuck, so the
> #3 sync step is skipped."

Ya.  Untested, but how about this?

diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
index 2d66c2724336..29acb22ea387 100644
--- a/tools/testing/selftests/kvm/mmu_stress_test.c
+++ b/tools/testing/selftests/kvm/mmu_stress_test.c
@@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
         * looping until the memory is guaranteed to be read-only, otherwise
         * vCPUs may complete their writes and advance to the next stage
         * prematurely.
+        *
+        * For architectures that support skipping the faulting instruction,
+        * generate the store via inline assembly to ensure the exact length
+        * of the instruction is known and stable (vcpu_arch_put_guest() on
+        * fixed-length architectures should work, but the cost of paranoia
+        * is low in this case).  For x86, hand-code the exact opcode so that
+        * there is no room for variability in the generated instruction.
         */
        do {
                for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
 #ifdef __x86_64__
-                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
+                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
 #elif defined(__aarch64__)
                        asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
 #else
@@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
                TEST_ASSERT_EQ(errno, EFAULT);
 #ifdef __x86_64__
                WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
-               vcpu->run->s.regs.regs.rip += 4;
+               vcpu->run->s.regs.regs.rip += 3;
 #endif
 #ifdef __aarch64__
                vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
  2024-09-07  0:53     ` Sean Christopherson
@ 2024-09-07  2:26       ` James Houghton
  2024-09-09 15:49         ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-07  2:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Sep 06, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > > guest memory that was writable, and then made read-only in the primary MMU,
> > > and then made writable again.
> > >
> > > Add bonus coverage for x86 to verify that all of guest memory was marked
> > > read-only.  Making forward progress (without making memory writable)
> > > requires arch specific code to skip over the faulting instruction, but the
> > > test can at least verify each vCPU's starting page was made read-only.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> >
> > Writing off-list because I just have some smaller questions that I
> > don't want to bother the list with.
>
> Pulling everyone and the lists back in :-)
>
> IMO, no question is too small for kvm@, and lkml@ is gigantic firehose that's 99%
> archival and 1% list, at best.  Odds are very, very good that if you have a
> question, however trivial or small, then someone else has the exact same question,
> or _will_ have the question in the future.
>
> I strongly prefer that all questions, review, feedback, etc. happen on list, even
> if the questions/feedback may seem trivial or noisy.  The only exception is if
> information can't/shouldn't be made public, e.g. because of an embargo, NDA,
> security implications, etc.

I'll keep this in mind, thanks!

> > For the next version, feel free to add:
> >
> > Reviewed-by: James Houghton <jthoughton@google.com>
> >
> > All of the selftest patches look fine to me.
> >
> > > ---
> > >  tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> > >  1 file changed, 84 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > index 50c3a17418c4..98f9a4660269 100644
> > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > @@ -16,6 +16,8 @@
> > >  #include "guest_modes.h"
> > >  #include "processor.h"
> > >
> > > +static bool mprotect_ro_done;
> > > +
> > >  static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >  {
> > >         uint64_t gpa;
> > > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > >                 *((volatile uint64_t *)gpa);
> > >         GUEST_SYNC(2);
> > >
> > > +       /*
> > > +        * Write to the region while mprotect(PROT_READ) is underway.  Keep
> > > +        * looping until the memory is guaranteed to be read-only, otherwise
> > > +        * vCPUs may complete their writes and advance to the next stage
> > > +        * prematurely.
> > > +        */
> > > +       do {
> > > +               for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > > +#ifdef __x86_64__
> > > +                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> >
> > Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> >
> > Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> > BYTE PTR [rax], 0x0`. (just curious, no need to change it)
>
> LOL, yes, but as evidenced by the trailing comment, my intent was to generate
> "mov rax, [rax]", not "movb $0, [rax]".  I suspect I was too lazy to consult the
> SDM to recall the correct opcode and simply copied an instruction from some random
> disassembly output without looking too closely at the output.
>
>         asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
>
> > And I take it you wrote it out like this (instead of using mnemonics)
> > so that you could guarantee that IP + 4 would be the right way to skip
> > forwards. Does it make sense to leave a comment about that?
>
> Yes and yes.
>
> > The translation from mnemonic --> bytes won't change...
>
> Heh, this is x86, by no means is that guaranteed.  E.g. see the above, where the
> same mnemonic can be represented multiple ways.
>
> > so could you just write the proper assembly? (not a request,  just curious)
>
> In practice, probably.  But the rules for inline assembly are, at best, fuzzy.
> So long as the correct instruction is generated, the assembler has quite a bit
> of freedom.
>
> E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:
>
>   48 89 00
>
> but can also be encoded as
>
>   48 89 40 00
>
> Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
> it's perfectly legal to do so.  E.g. with gcc-13, this
>
>   mov %rax, 0x0(%rax)
>
> generates
>
>   48 89 00
>
> even though a more literal interpretation would be
>
>   48 89 40 00

Oh... neat! I'm glad I asked about this.

>
> So yeah, while the hand-coded opcode is gross and annoying, given that a failure
> due to the "wrong" instruction being generated would be painful and time consuming
> to debug, hand-coding is worth avoiding the risk and potential pain if the compiler
> decides to be mean :-)

I 100% agree with hand-writing the opcode given what you've said. :)

> > A comment that 0x40 corresponds to %rax and that "a" also corresponds
> > to %rax would have been helpful for me. :)
>
> Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
> really all that reasonable because _so_ much information needs to be conveyed to
> capture the entire picture, and some things are essentially table stakes when it
> comes to x86 kernel programming.
>
>
> E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the
> addressing mode, which in turn depends on the operating mode (64-bit).
>
> And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
> distinctly different than:
>
>   asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */
>
> even though in this specific scenario they generate the same code.
>
> And with the correct "48 89 00", understanding the full encoding requires describing
> REX prefixes, which are a mess unto themselves.
>
> So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
> though I 100% agree that it puts a decent sized load on the reader.  There's just
> _too_ much information to communicate to the reader, at least for x86.

The trailing comment works for me! Thanks for all the detail -- I am
learning so much.

>
> > > +#else
> > > +                       vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > > +#endif
> > > +       } while (!READ_ONCE(mprotect_ro_done));
> > > +
> > > +       /*
> > > +        * Only x86 can explicitly sync, as other architectures will be stuck
> > > +        * on the write fault.
> >
> > It would also have been a little clearer if the comment also said how
> > this is just because the test has been written to increment for the PC
> > upon getting these write faults *for x86 only*. IDK something like
> > "For x86, the test will adjust the PC for each write fault, allowing
> > the above loop to complete. Other architectures will get stuck, so the
> > #3 sync step is skipped."
>
> Ya.  Untested, but how about this?

LGTM! (I haven't tested it either.)

>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index 2d66c2724336..29acb22ea387 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
>          * looping until the memory is guaranteed to be read-only, otherwise
>          * vCPUs may complete their writes and advance to the next stage
>          * prematurely.
> +        *
> +        * For architectures that support skipping the faulting instruction,
> +        * generate the store via inline assembly to ensure the exact length
> +        * of the instruction is known and stable (vcpu_arch_put_guest() on
> +        * fixed-length architectures should work, but the cost of paranoia
> +        * is low in this case).  For x86, hand-code the exact opcode so that
> +        * there is no room for variability in the generated instruction.
>          */
>         do {
>                 for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
>  #ifdef __x86_64__
> -                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> +                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */

FWIW I much prefer the trailing comment you have ended up with vs. the
one you had before. (To me, the older one _seems_ like it's Intel
syntax, in which case the comment says it's a load..? The comment you
have now is, to me, obviously indicating a store. Though... perhaps
"movq"?)


>  #elif defined(__aarch64__)
>                         asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
>  #else
> @@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
>                 TEST_ASSERT_EQ(errno, EFAULT);
>  #ifdef __x86_64__
>                 WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
> -               vcpu->run->s.regs.regs.rip += 4;
> +               vcpu->run->s.regs.regs.rip += 3;
>  #endif
>  #ifdef __aarch64__
>                 vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)
  2024-09-07  2:26       ` James Houghton
@ 2024-09-09 15:49         ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-09-09 15:49 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Sep 06, 2024, James Houghton wrote:
> On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@google.com> wrote:
> >  #ifdef __x86_64__
> > -                       asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> > +                       asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */
> 
> FWIW I much prefer the trailing comment you have ended up with vs. the
> one you had before. (To me, the older one _seems_ like it's Intel
> syntax, in which case the comment says it's a load..? The comment you
> have now is, to me, obviously indicating a store. Though... perhaps
> "movq"?)

TL;DR: "movq" is arguably a worse mnemonic than simply "mov" because MOV *and*
        MOVQ are absurdly overloaded mnemonics, and because x86-64 is wonky.

Heh, "movq" is technically a different instruction (MMX/SSE instruction).  For
ambiguous mnemonics, the assembler infers the exact instructions from the operands.
When a register is the source or destination, appending the size to a vanilla MOV
is 100% optional, as the width of the register communicates the desired size
without any ambiguity.

When there is no register operand, e.g. storing an immediate to memory, the size
becomes necessary, sort of.  The assembler will still happily accept an inferred
size, but the size is simply the default operand size for the current mode.

E.g.

  mov $0xffff, (%0)

will generate a 4-byte MOV

  c7 00 ff ff 00 00

so if you actually wanted a 2-byte MOV, the mnemonic needs to be:

  movw $0xffff, (%0)

There is still value in specifying an explicit operand size in assembly, as it
disambiguates the size of human readers, and also generates an error if the
operands mismatch.

E.g.

  movw $0xffff, %%eax

will fail with

  incorrect register `%eax' used with `w' suffix

The really fun one is if ou want to load a 64-bit gpr with an immediate.  All
else being equal, the assembler will generally optimize for code size, and so
if the desired value can be generated by sign-extension, the compiler will opt
for opcode 0xc7 or 0xb8

E.g.

  mov $0xffffffffffffffff, %%rax

generates

  48 c7 c0 ff ff ff ff

whereas, somewhat counter-intuitively, this

  mov $0xffffffff, %%rax

generates the more gnarly

  48 b8 ff ff ff ff 00 00 00 00

But wait, there's more!  If the developer were a wee bit smarter, they could/should
actually write

  mov $0xffffffff, %%eax

to generate

  b8 ff ff ff ff

because in x86-64, writing the lower 32 bits of a 64-bit register architecturally
clears the upper 32 bits.  I mention this because you'll actually see the compiler
take advantage of this behavior.

E.g. if you were to load RAX through an inline asm constraint

  asm volatile(".byte 0xcc" :: "a"(0xffffffff) : "memory");

the generated code will indeed be:

  b8 ff ff ff ff          mov    $0xffffffff,%eax

or if you explicitly load a register with '0'

  31 c0                   xor    %eax,%eax

Lastly, because "%0" in 64-bit mode refers to RAX, not EAX, this:

  asm volatile("mov $0xffffffff, %0" :: "a"(gpa) : "memory");

generates

  48 b8 ff ff ff ff 00 00 00 00

i.e. is equivalent to "mov .., %%rax".

Jumping back to "movq", it's perfectly fine in this case, but also fully
redundant.  And so I would prefer to document it simply as "mov", because "movq"
would be more appropriate to document something like this:

  asm volatile("movq %0, %%xmm0" :: "a"(gpa) : "memory");

  66 48 0f 6e c0          movq   %rax,%xmm0

LOL, which brings up more quirks/warts with x86-64.  Many instructions in x86,
especially SIMD instructions, have mandatory "prefixes" in order to squeeze more
instructions out of the available opcodes.  E.g. the operand size prefix, 0x66,
is reserved for MMX instructions, which allows the architecture to usurp the
reserved combination for XMM instructions.   Table 9-3. Effect of Prefixes on MMX
Instructions says this

  Operand Size (66H)Reserved and may result in unpredictable behavior.

and specifically says "unpredictable behavior" instead of #UD, because prefixing
most MMX instructions with 0x66 "promotes" the instruction to operate on XMM
registers.

And then there's the REX prefix, which is actually four prefixes built into one.
The "base" prefix ix 0x40, with the lower 4 bits encoding the four "real" prefixes.
From Table 2-4. REX Prefix Fields [BITS: 0100WRXB]

  Field Name      Bit Position    Definition
  -               7:4             0100
  W               3               0 = Operand size determined by CS.D, 1 = 64 Bit Operand Size
  R               2               Extension of the ModR/M reg field
  X               1               Extension of the SIB index field
  B               0               Extension of the ModR/M r/m field, SIB base field, or Opcode reg field

e.g. 0x48 is REX.W, 0x49 is REX.W+REX.B, etc.

The first quirky thing with REX, and REX.W (0x48) + the legacy operand size
prefix (0x66) in particular, is that the legacy prefix is ignored in most cases
if REX.W=1.

  For non-byte operations: if a 66H prefix is used with prefix (REX.W = 1), 66H is ignored.

But because 0x66 is a mandatory prefix for MOVQ, it's not ignored (at least, I
don't _think_ it's ignored; objdump and gdb both seem to happy decoding MOVQ
without the prefix).

Anyways, the second quirky thing with REX is that, because REX usurps single-byte
opcodes for DEC and INC

  In 64-bit mode, DEC r16 and DEC r32 are not encodable (because opcodes 48H through
  4FH are REX prefixes).

  In 64-bit mode, INC r16 and INC r32 are not encodable (because opcodes 40H through
  47H are REX prefixes).

i.e. uses opcodes that are actual instructions outside of 64-bit mode, the REX
prefix _must_ be the last byte before the non-prefix opcode, otherwise it's
ignored (presumably this avoids extra complexity in the instruction decoder).

  Only one REX prefix is allowed per instruction. If used, the REX prefix byte
  must immediately precede the opcode byte or the escape opcode byte (0FH). When
  a REX prefix is used in conjunction with an instruction containing a mandatory
  prefix, the mandatory prefix must come before the REX so the REX prefix can be
  immediately preceding the opcode or the escape byte. For example, CVTDQ2PD with
  a REX prefix should have REX placed between F3 and 0F E6. Other placements are
  ignored. The instruction-size limit of 15 bytes still applies to instructions
  with a REX prefix.

So even though the "opcode" for MOVQ is "66 0F 6E" , when encoding with REX.W to
address RAX instead of EAX, the full encoding needs to be "66 48 0F DE", otherwise
REX.W will be ignored, e.g. objdump will interpret it as this, even though the
CPU will decode REX.W as part of the MOVD.

  4024d1:       48                      rex.W
  4024d2:       66 0f 6e c0             movd   %eax,%xmm0

And because _that's_ just not confusing enough, there are actually _six_ distinct
opcodes for MOVQ (I think; might be more): two which are REX.W promotions of MOVD
(6E and 7E, ignoring mandatory prefixes and the escape opcode 0F), and four that
are straight quadword moves that can't target registers, i.e. can be encoded even
in 32-bit mode (6F, 7E, 7F, and D6).

So yeah, MOVQ in particular is a disaster, especially in 64-bit mode, so I'd much
prefer to just say "mov %rax, (%rax)" and leave it to the reader to understand
that it's a 64-bit store.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
  2024-08-12  8:39   ` Lai Jiangshan
  2024-09-03 20:17   ` James Houghton
@ 2024-09-09 19:00   ` James Houghton
  2024-09-09 20:02     ` Sean Christopherson
  2 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-09 19:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Steal another bit from rmap entries (which are word aligned pointers, i.e.
> have 2 free bits on 32-bit KVM, and 3 free bits on 64-bit KVM), and use
> the bit to implement a *very* rudimentary per-rmap spinlock.  The only
> anticipated usage of the lock outside of mmu_lock is for aging gfns, and
> collisions between aging and other MMU rmap operations are quite rare,
> e.g. unless userspace is being silly and aging a tiny range over and over
> in a tight loop, time between contention when aging an actively running VM
> is O(seconds).  In short, a more sophisticated locking scheme shouldn't be
> necessary.
>
> Note, the lock only protects the rmap structure itself, SPTEs that are
> pointed at by a locked rmap can still be modified and zapped by another
> task (KVM drops/zaps SPTEs before deleting the rmap entries)
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 80 +++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8ca7f51c2da3..a683b5fc4026 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -909,11 +909,73 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
>   * About rmap_head encoding:
>   *
>   * If the bit zero of rmap_head->val is clear, then it points to the only spte
> - * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
> + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
>   * pte_list_desc containing more mappings.
>   */
>  #define KVM_RMAP_MANY  BIT(0)
>
> +/*
> + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> + * operates with mmu_lock held for write), but rmaps can be walked without
> + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> + * being zapped/dropped _while the rmap is locked_.
> + *
> + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> + * done while holding mmu_lock for write.  This allows a task walking rmaps
> + * without holding mmu_lock to concurrently walk the same entries as a task
> + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> + * the rmaps, thus the walks are stable.
> + *
> + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> + * ensures all "struct pte_list_desc" fields are stable.

This last sentence makes me think we need to be careful about memory ordering.

> + */
> +#define KVM_RMAP_LOCKED        BIT(1)
> +
> +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> +{
> +       unsigned long old_val, new_val;
> +
> +       old_val = READ_ONCE(rmap_head->val);
> +       if (!old_val)
> +               return 0;
> +
> +       do {
> +               /*
> +                * If the rmap is locked, wait for it to be unlocked before
> +                * trying acquire the lock, e.g. to bounce the cache line.
> +                */
> +               while (old_val & KVM_RMAP_LOCKED) {
> +                       old_val = READ_ONCE(rmap_head->val);
> +                       cpu_relax();
> +               }
> +
> +               /*
> +                * Recheck for an empty rmap, it may have been purged by the
> +                * task that held the lock.
> +                */
> +               if (!old_val)
> +                       return 0;
> +
> +               new_val = old_val | KVM_RMAP_LOCKED;
> +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

I think we (technically) need an smp_rmb() here. I think cmpxchg
implicitly has that on x86 (and this code is x86-only), but should we
nonetheless document that we need smp_rmb() (if it indeed required)?
Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

kvm_rmap_lock_readonly() should have an smb_rmb(), but it seems like
adding it here will do the right thing for the read-only lock side.

> +
> +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> +       return old_val;
> +}
> +
> +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> +                           unsigned long new_val)
> +{
> +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);

Same goes with having an smp_wmb() here. Is it necessary? If so,
should it at least be documented?

And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.

> +       WRITE_ONCE(rmap_head->val, new_val);
> +}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-09 19:00   ` James Houghton
@ 2024-09-09 20:02     ` Sean Christopherson
  2024-09-09 20:46       ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-09 20:02 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 09, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write.  This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock.  Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected.  E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
> 
> This last sentence makes me think we need to be careful about memory ordering.
> 
> > + */
> > +#define KVM_RMAP_LOCKED        BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +       unsigned long old_val, new_val;
> > +
> > +       old_val = READ_ONCE(rmap_head->val);
> > +       if (!old_val)
> > +               return 0;
> > +
> > +       do {
> > +               /*
> > +                * If the rmap is locked, wait for it to be unlocked before
> > +                * trying acquire the lock, e.g. to bounce the cache line.
> > +                */
> > +               while (old_val & KVM_RMAP_LOCKED) {
> > +                       old_val = READ_ONCE(rmap_head->val);
> > +                       cpu_relax();
> > +               }
> > +
> > +               /*
> > +                * Recheck for an empty rmap, it may have been purged by the
> > +                * task that held the lock.
> > +                */
> > +               if (!old_val)
> > +                       return 0;
> > +
> > +               new_val = old_val | KVM_RMAP_LOCKED;
> > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> 
> I think we (technically) need an smp_rmb() here. I think cmpxchg
> implicitly has that on x86 (and this code is x86-only), but should we
> nonetheless document that we need smp_rmb() (if it indeed required)?
> Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
even wrong.

For the !old_val case, there is a address/data dependency that can't be broken by
the CPU without violating the x86 memory model (all future actions with relevant
memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
kernel memory model, READ_ONCE() is responsible for ensuring that the address
dependency can't be morphed into a control dependency by the compiler and
subsequently reordered by the CPU.

I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
__READ_ONCE() promotes the operation to an acquire.

Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
the rmaps are fully protected by the critical section.  And for the SPTEs, there
is no required ordering.  The reader (aging thread) can observe a !PRESENT or a
PRESENT SPTE, and must be prepared for either.  I.e. there is no requirement that
the reader observe a PRESENT SPTE if there is a valid rmap.

So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
an ordering requirement that doesn't exist.

> kvm_rmap_lock_readonly() should have an smb_rmb(), but it seems like
> adding it here will do the right thing for the read-only lock side.
> 
> > +
> > +       /* Return the old value, i.e. _without_ the LOCKED bit set. */
> > +       return old_val;
> > +}
> > +
> > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> > +                           unsigned long new_val)
> > +{
> > +       WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
> 
> Same goes with having an smp_wmb() here. Is it necessary? If so,
> should it at least be documented?
> 
> And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.
> 
> > +       WRITE_ONCE(rmap_head->val, new_val);
> > +}

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-09 20:02     ` Sean Christopherson
@ 2024-09-09 20:46       ` James Houghton
  2024-09-09 22:02         ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-09 20:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > + */
> > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > +
> > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > +{
> > > +       unsigned long old_val, new_val;
> > > +
> > > +       old_val = READ_ONCE(rmap_head->val);
> > > +       if (!old_val)
> > > +               return 0;
> > > +
> > > +       do {
> > > +               /*
> > > +                * If the rmap is locked, wait for it to be unlocked before
> > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > +                */
> > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > +                       old_val = READ_ONCE(rmap_head->val);
> > > +                       cpu_relax();
> > > +               }
> > > +
> > > +               /*
> > > +                * Recheck for an empty rmap, it may have been purged by the
> > > +                * task that held the lock.
> > > +                */
> > > +               if (!old_val)
> > > +                       return 0;
> > > +
> > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> >
> > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > implicitly has that on x86 (and this code is x86-only), but should we
> > nonetheless document that we need smp_rmb() (if it indeed required)?
> > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
>
> Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> even wrong.

I don't think smp_mb__after_spinlock() is right either. This seems to
be used following the acquisition of a spinlock to promote the memory
ordering from an acquire barrier (that is implicit with the lock
acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
stronger-than-usual barrier. But I guess I'm not really sure.

In this case, I'm complaining that we don't have the usual memory
ordering restrictions that come with a spinlock.

> For the !old_val case, there is a address/data dependency that can't be broken by
> the CPU without violating the x86 memory model (all future actions with relevant
> memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> kernel memory model, READ_ONCE() is responsible for ensuring that the address
> dependency can't be morphed into a control dependency by the compiler and
> subsequently reordered by the CPU.
>
> I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> __READ_ONCE() promotes the operation to an acquire.
>
> Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> the rmaps are fully protected by the critical section.

I feel like a spinlock must include the appropriate barriers for it to
correctly function as a spinlock, so I'm not sure I fully understand
what you mean here.

> And for the SPTEs, there
> is no required ordering.  The reader (aging thread) can observe a !PRESENT or a
> PRESENT SPTE, and must be prepared for either.  I.e. there is no requirement that
> the reader observe a PRESENT SPTE if there is a valid rmap.

This makes sense.

> So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> an ordering requirement that doesn't exist.

So we have: the general kvm_rmap_lock() and the read-only
kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
those names (sorry if it's confusing).

For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
any changes we make to the rmap will be properly published to other
threads that subsequently grab kvm_rmap_lock() because we had to
properly release and then re-acquire mmu_lock, which comes with the
barriers I'm saying we need.

For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
it is possible that there may observe external changes to the
pte_list_desc while we are in this critical section (for a
sufficiently weak architecture). The changes that the kvm_rmap_lock()
(mmu_lock) side made were half-published with an smp_wmb() (really
[3]), but the read side didn't use a load-acquire or smp_rmb(), so it
hasn't held up its end of the deal.

I don't think READ_ONCE() has the guarantees we need to be a
sufficient replacement for smp_rmb() or a load-acquire that a real
lock would use, although I agree with you that, on arm64, it
apparently *is* a sufficient replacement.

Now this isn't a problem if the kvm_rmap_lock_readonly() side can
tolerate changes to pte_list_desc while in the critical section. I
don't think this is true (given for_each_rmap_spte_lockless),
therefore an smp_rmb() or equivalent is (technically) needed.

Am I confused?

(Though all of this works just fine as written on x86.)

[1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
[2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
[3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-09 20:46       ` James Houghton
@ 2024-09-09 22:02         ` Sean Christopherson
  2024-09-10  0:28           ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-09 22:02 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > + */
> > > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > > +
> > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > +{
> > > > +       unsigned long old_val, new_val;
> > > > +
> > > > +       old_val = READ_ONCE(rmap_head->val);
> > > > +       if (!old_val)
> > > > +               return 0;
> > > > +
> > > > +       do {
> > > > +               /*
> > > > +                * If the rmap is locked, wait for it to be unlocked before
> > > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > > +                */
> > > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > > +                       old_val = READ_ONCE(rmap_head->val);
> > > > +                       cpu_relax();
> > > > +               }
> > > > +
> > > > +               /*
> > > > +                * Recheck for an empty rmap, it may have been purged by the
> > > > +                * task that held the lock.
> > > > +                */
> > > > +               if (!old_val)
> > > > +                       return 0;
> > > > +
> > > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > >
> > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > implicitly has that on x86 (and this code is x86-only), but should we
> > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> >
> > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > even wrong.
> 
> I don't think smp_mb__after_spinlock() is right either. This seems to
> be used following the acquisition of a spinlock to promote the memory
> ordering from an acquire barrier (that is implicit with the lock
> acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> stronger-than-usual barrier. But I guess I'm not really sure.
> 
> In this case, I'm complaining that we don't have the usual memory
> ordering restrictions that come with a spinlock.

What makes you think that?

> > For the !old_val case, there is a address/data dependency that can't be broken by
> > the CPU without violating the x86 memory model (all future actions with relevant
> > memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > dependency can't be morphed into a control dependency by the compiler and
> > subsequently reordered by the CPU.
> >
> > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > __READ_ONCE() promotes the operation to an acquire.
> >
> > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> > the rmaps are fully protected by the critical section.
> 
> I feel like a spinlock must include the appropriate barriers for it to
> correctly function as a spinlock, so I'm not sure I fully understand
> what you mean here.

On TSO architectures, the atomic _is_ the barrier.  E.g. atomic_try_cmpxchg_acquire()
eventually resolves to atomic_try_cmpxchg() on x86.  And jumping back to the
"we don't have the usual memory ordering restrictions that come with a spinlock",
x86's virt_spin_lock() uses atomic_try_cmpxchg().  So while using the acquire
variant here is obviously not wrong, it also feels somewhat weird.  Though some
of that is undoubtedly due to explicit "acquire" semantics being rather rare in
x86.

> > And for the SPTEs, there is no required ordering.  The reader (aging
> > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > either.  I.e. there is no requirement that the reader observe a PRESENT
> > SPTE if there is a valid rmap.
> 
> This makes sense.
> 
> > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > an ordering requirement that doesn't exist.
> 
> So we have: the general kvm_rmap_lock() and the read-only
> kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> those names (sorry if it's confusing).
> 
> For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> any changes we make to the rmap will be properly published to other
> threads that subsequently grab kvm_rmap_lock() because we had to
> properly release and then re-acquire mmu_lock, which comes with the
> barriers I'm saying we need.
> 
> For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> it is possible that there may observe external changes to the
> pte_list_desc while we are in this critical section (for a
> sufficiently weak architecture). The changes that the kvm_rmap_lock()
> (mmu_lock) side made were half-published with an smp_wmb() (really
> [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> hasn't held up its end of the deal.
> 
> I don't think READ_ONCE() has the guarantees we need to be a
> sufficient replacement for smp_rmb() or a load-acquire that a real
> lock would use, although I agree with you that, on arm64, it
> apparently *is* a sufficient replacement.
> 
> Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> tolerate changes to pte_list_desc while in the critical section. I
> don't think this is true (given for_each_rmap_spte_lockless),
> therefore an smp_rmb() or equivalent is (technically) needed.
> 
> Am I confused?

Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
can change, but the pointers themselves cannot.  And with aging, the code is
completely tolerant of an instable SPTE _value_ because test-only doesn't care
about false negatives/positives, and test-and-clear is itself an atomic access
i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

> 
> (Though all of this works just fine as written on x86.)
> 
> [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-09 22:02         ` Sean Christopherson
@ 2024-09-10  0:28           ` James Houghton
  2024-09-10  1:42             ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-10  0:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Mon, Sep 09, 2024, James Houghton wrote:
> > > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > > + */
> > > > > +#define KVM_RMAP_LOCKED        BIT(1)
> > > > > +
> > > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > > +{
> > > > > +       unsigned long old_val, new_val;
> > > > > +
> > > > > +       old_val = READ_ONCE(rmap_head->val);
> > > > > +       if (!old_val)
> > > > > +               return 0;
> > > > > +
> > > > > +       do {
> > > > > +               /*
> > > > > +                * If the rmap is locked, wait for it to be unlocked before
> > > > > +                * trying acquire the lock, e.g. to bounce the cache line.
> > > > > +                */
> > > > > +               while (old_val & KVM_RMAP_LOCKED) {
> > > > > +                       old_val = READ_ONCE(rmap_head->val);
> > > > > +                       cpu_relax();
> > > > > +               }
> > > > > +
> > > > > +               /*
> > > > > +                * Recheck for an empty rmap, it may have been purged by the
> > > > > +                * task that held the lock.
> > > > > +                */
> > > > > +               if (!old_val)
> > > > > +                       return 0;
> > > > > +
> > > > > +               new_val = old_val | KVM_RMAP_LOCKED;
> > > > > +       } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > > >
> > > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > > implicitly has that on x86 (and this code is x86-only), but should we
> > > > nonetheless document that we need smp_rmb() (if it indeed required)?
> > > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> > >
> > > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > even wrong.
> >
> > I don't think smp_mb__after_spinlock() is right either. This seems to
> > be used following the acquisition of a spinlock to promote the memory
> > ordering from an acquire barrier (that is implicit with the lock
> > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > stronger-than-usual barrier. But I guess I'm not really sure.
> >
> > In this case, I'm complaining that we don't have the usual memory
> > ordering restrictions that come with a spinlock.
>
> What makes you think that?

Ok I was under the impression that try_cmpxchg() did not carry memory
ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
Sorry....

So the way I would write this is with try_cmpxchg_relaxed() in the
loop and then smp_rmb() after we break out of the loop, at least for
the `old_val != 0` case. Kinda like this [4].

Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Just comparing against bit_spin_lock, test_and_set_bit_lock()
documents the requirement for acquire barrier semantics[5].

[4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
[5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

>
> > > For the !old_val case, there is a address/data dependency that can't be broken by
> > > the CPU without violating the x86 memory model (all future actions with relevant
> > > memory loads depend on rmap_head->val being non-zero).  And AIUI, in the Linux
> > > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > > dependency can't be morphed into a control dependency by the compiler and
> > > subsequently reordered by the CPU.
> > >
> > > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > > __READ_ONCE() promotes the operation to an acquire.
> > >
> > > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > > hence my smp_mb__after_spinlock() suggestion.  Though _because_ it's a spinlock,
> > > the rmaps are fully protected by the critical section.
> >
> > I feel like a spinlock must include the appropriate barriers for it to
> > correctly function as a spinlock, so I'm not sure I fully understand
> > what you mean here.
>
> On TSO architectures, the atomic _is_ the barrier.  E.g. atomic_try_cmpxchg_acquire()
> eventually resolves to atomic_try_cmpxchg() on x86.

Yeah I'm with you here.

> And jumping back to the
> "we don't have the usual memory ordering restrictions that come with a spinlock",
> x86's virt_spin_lock() uses atomic_try_cmpxchg().  So while using the acquire
> variant here is obviously not wrong, it also feels somewhat weird.

Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
atomic_try_cmpxchg_acquire(), so there is no issue.

But if virt_spin_lock() were written to use
atomic_try_cmpxchg_relaxed() (and nothing else) instead, then you'd
complain right? It would work on x86 (I think?), but it's not written
properly! That's basically what I'm saying in this thread.

> Though some
> of that is undoubtedly due to explicit "acquire" semantics being rather rare in
> x86.
>
> > > And for the SPTEs, there is no required ordering.  The reader (aging
> > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > either.  I.e. there is no requirement that the reader observe a PRESENT
> > > SPTE if there is a valid rmap.
> >
> > This makes sense.
> >
> > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > an ordering requirement that doesn't exist.
> >
> > So we have: the general kvm_rmap_lock() and the read-only
> > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > those names (sorry if it's confusing).
> >
> > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > any changes we make to the rmap will be properly published to other
> > threads that subsequently grab kvm_rmap_lock() because we had to
> > properly release and then re-acquire mmu_lock, which comes with the
> > barriers I'm saying we need.
> >
> > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > it is possible that there may observe external changes to the
> > pte_list_desc while we are in this critical section (for a
> > sufficiently weak architecture). The changes that the kvm_rmap_lock()
> > (mmu_lock) side made were half-published with an smp_wmb() (really
> > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > hasn't held up its end of the deal.
> >
> > I don't think READ_ONCE() has the guarantees we need to be a
> > sufficient replacement for smp_rmb() or a load-acquire that a real
> > lock would use, although I agree with you that, on arm64, it
> > apparently *is* a sufficient replacement.
> >
> > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > tolerate changes to pte_list_desc while in the critical section. I
> > don't think this is true (given for_each_rmap_spte_lockless),
> > therefore an smp_rmb() or equivalent is (technically) needed.
> >
> > Am I confused?
>
> Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
> any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
> at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.

I take back what I said about this working on x86. I think it's
possible for there to be a race.

Say...

1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
2. T2 then locks kvm_rmap_lock_readonly().

The modifications that T1 has made are not guaranteed to be visible to
T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
and T2 has an smp_rmb() before reading the data.

Now the way you had it, T2, because it uses try_cmpxchg() with full
ordering, will effectively do a smp_rmb(). But T1 only does an
smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
may enter its critical section and then *later* observe the changes
that T1 made.

Now this is impossible on x86 (IIUC) if, in the compiled list of
instructions, T1's writes occur in the same order that we have written
them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
at compile time is forbidden.

So what I'm saying is:

1. kvm_rmap_unlock() must have an smp_wmb().
2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
is what I think you want), then you must also have an smp_rmb()
following a successful cmpxchg/acquisition (at least for the case
where we then follow the pte_list_desc pointer).

> The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
> can change, but the pointers themselves cannot.  And with aging, the code is
> completely tolerant of an instable SPTE _value_ because test-only doesn't care
> about false negatives/positives, and test-and-clear is itself an atomic access
> i.e. won't corrupt a SPTE (and is also tolerant of false positives/negatives).

I think we're on the same page for the rest of this.

>
> >
> > (Though all of this works just fine as written on x86.)
> >
> > [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> > [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> > [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-10  0:28           ` James Houghton
@ 2024-09-10  1:42             ` Sean Christopherson
  2024-09-10 21:11               ` James Houghton
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2024-09-10  1:42 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 3:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > Hmm, no, not smp_rmb().  If anything, the appropriate barrier here would be
> > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > > > even wrong.
> > >
> > > I don't think smp_mb__after_spinlock() is right either. This seems to
> > > be used following the acquisition of a spinlock to promote the memory
> > > ordering from an acquire barrier (that is implicit with the lock
> > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> > > stronger-than-usual barrier. But I guess I'm not really sure.
> > >
> > > In this case, I'm complaining that we don't have the usual memory
> > > ordering restrictions that come with a spinlock.
> >
> > What makes you think that?
> 
> Ok I was under the impression that try_cmpxchg() did not carry memory
> ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()).
> Sorry....
> 
> So the way I would write this is with try_cmpxchg_relaxed() in the
> loop and then smp_rmb() after we break out of the loop, at least for
> the `old_val != 0` case. Kinda like this [4].
> 
> Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()?

Heh, why would I care?  On x86, they are the same thing.  Yeah, I could tease out
some subtle difference to super precisely describe KVM's behavior, but that more
or less defeats the value proposition of total-store ordered architectures: take
on more complexity in hardware (and possibly loss in performance) to allow for
simpler software.

> Just comparing against bit_spin_lock, test_and_set_bit_lock()
> documents the requirement for acquire barrier semantics[5].

Again, that's generic library code.

> [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmutex.c#L253
> [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/bitops/instrumented-lock.h#L51

...

> > And jumping back to the "we don't have the usual memory ordering
> > restrictions that come with a spinlock", x86's virt_spin_lock() uses
> > atomic_try_cmpxchg().  So while using the acquire variant here is obviously
> > not wrong, it also feels somewhat weird.
> 
> Yeah that's fine. atomic_try_cmpxchg() is at least as strong as
> atomic_try_cmpxchg_acquire(), so there is no issue.
> 
> But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() (and
> nothing else) instead, then you'd complain right?

I'd complain because someone made me think hard for no reason. :-)

> It would work on x86 (I think?), but it's not written properly! That's
> basically what I'm saying in this thread.
> 
> > Though some of that is undoubtedly due to explicit "acquire" semantics
> > being rather rare in x86.
> >
> > > > And for the SPTEs, there is no required ordering.  The reader (aging
> > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > > > either.  I.e. there is no requirement that the reader observe a PRESENT
> > > > SPTE if there is a valid rmap.
> > >
> > > This makes sense.
> > >
> > > > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > > > an ordering requirement that doesn't exist.
> > >
> > > So we have: the general kvm_rmap_lock() and the read-only
> > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> > > those names (sorry if it's confusing).
> > >
> > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> > > any changes we make to the rmap will be properly published to other
> > > threads that subsequently grab kvm_rmap_lock() because we had to
> > > properly release and then re-acquire mmu_lock, which comes with the
> > > barriers I'm saying we need.
> > >
> > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> > > it is possible that there may observe external changes to the
> > > pte_list_desc while we are in this critical section (for a
> > > sufficiently weak architecture). The changes that the kvm_rmap_lock()
> > > (mmu_lock) side made were half-published with an smp_wmb() (really
> > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> > > hasn't held up its end of the deal.
> > >
> > > I don't think READ_ONCE() has the guarantees we need to be a
> > > sufficient replacement for smp_rmb() or a load-acquire that a real
> > > lock would use, although I agree with you that, on arm64, it
> > > apparently *is* a sufficient replacement.
> > >
> > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> > > tolerate changes to pte_list_desc while in the critical section. I
> > > don't think this is true (given for_each_rmap_spte_lockless),
> > > therefore an smp_rmb() or equivalent is (technically) needed.
> > >
> > > Am I confused?
> >
> > Yes, I think so.  kvm_rmap_lock_readonly() creates a critical section that prevents
> > any pte_list_desc changes.  rmap_head->val, and every pte_list_desc that is pointed
> > at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.
> 
> I take back what I said about this working on x86. I think it's
> possible for there to be a race.
> 
> Say...
> 
> 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
> 2. T2 then locks kvm_rmap_lock_readonly().
> 
> The modifications that T1 has made are not guaranteed to be visible to
> T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
> and T2 has an smp_rmb() before reading the data.
> 
> Now the way you had it, T2, because it uses try_cmpxchg() with full
> ordering, will effectively do a smp_rmb(). But T1 only does an
> smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> may enter its critical section and then *later* observe the changes
> that T1 made.
> 
> Now this is impossible on x86 (IIUC) if, in the compiled list of
> instructions, T1's writes occur in the same order that we have written
> them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> at compile time is forbidden.
> 
> So what I'm saying is:
> 
> 1. kvm_rmap_unlock() must have an smp_wmb().

No, because beating a dead horse, this is not generic code, this is x86. 

> 2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which
> is what I think you want), 

No, I don't.  If this were generic code, then it would be a different conversation,
but using a "relaxed" atomic in x86 specific code is nonsensical, because such an
operation simply does not exist in the world of x86.  There are memory operations
that have relaxed ordering, but none of them are atomic; they are very much the
exact opposite of atomic.

E.g. the only references to _relaxed that you'll find in x86 are to override the
generics to x86's non-relaxed semantics.

$ git grep _relaxed arch/x86
arch/x86/include/asm/io.h:#define readb_relaxed(a) __readb(a)
arch/x86/include/asm/io.h:#define readw_relaxed(a) __readw(a)
arch/x86/include/asm/io.h:#define readl_relaxed(a) __readl(a)
arch/x86/include/asm/io.h:#define writeb_relaxed(v, a) __writeb(v, a)
arch/x86/include/asm/io.h:#define writew_relaxed(v, a) __writew(v, a)
arch/x86/include/asm/io.h:#define writel_relaxed(v, a) __writel(v, a)
arch/x86/include/asm/io.h:#define readq_relaxed(a)      __readq(a)
arch/x86/include/asm/io.h:#define writeq_relaxed(v, a)  __writeq(v, a)

There are a few places in KVM x86 where _acquire() and _release() variants are
used, but that's done purely to document the roles and who is doing what, and
almost always (maybe always?) when there are lockless programming shenanigans
in play.

E.g. kvm_recalculate_apic_map() is probably the closest example, where KVM uses
atomic_read_acquire(), atomic_cmpxchg_acquire(), and atomic_cmpxchg_release() to
document the ordering between marking the map as dirty and actually processing
the map.  But that is still quite different, as apic_map_dirty is not a spinlock,
there isn't a direct data/address dependency between apic_map_dirty and all of
the data it consumes, _and_ the data that is guarded by apic_map_dirty can be
modified by other vCPUs _while it is being processed_.

Hmm, actually, sev_lock_two_vms() is arguably a better comparison.  That's also
a rudimentary spinlock (try-lock only behavior).

If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
I wouldn't be opposed to doing this in the locking code to document things:

 s/READ_ONCE/atomic_read_acquire
 s/WRITE_ONCE/atomic_set_release
 s/try_cmpxchg/atomic_cmpxchg_acquire

But it's an "unsigned long", and so trying to massage it into the above isn't a
net positive for me.  This is gnarly x86 specific code; it's very unlikely that
someone will be able to understand the rest of the rmap code, but won't be able
to understand that the code is safe due to x86's strong memory ordering.

> then you must also have an smp_rmb() following a successful
> cmpxchg/acquisition (at least for the case where we then follow the
> pte_list_desc pointer).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap
  2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
                   ` (21 preceding siblings ...)
  2024-08-09 19:43 ` [PATCH 22/22] ***HACK*** KVM: x86: Don't take " Sean Christopherson
@ 2024-09-10  4:56 ` Sean Christopherson
  22 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-09-10  4:56 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Oliver Upton, Marc Zyngier, Peter Xu,
	James Houghton

On Fri, 09 Aug 2024 12:43:12 -0700, Sean Christopherson wrote:
> The main intent of this series is to allow yielding, i.e. cond_resched(),
> when unmapping memory in shadow MMUs in response to an mmu_notifier
> invalidation.  There is zero reason not to yield, and in fact I _thought_
> KVM did yield, but because of how KVM grew over the years, the unmap path
> got left behind.
> 
> The first half of the series is reworks max_guest_memory_test into
> mmu_stress_test, to give some confidence in the mmu_notifier-related
> changes.
> 
> [...]

Applied the KVM proper changes to kvm-x86 mmu, sans the rmap locking stuff.
I'll send a v2 for just the selftest changes, and probably put them in the
dedicated selftests branch, as there is no actual dependency between the
selftest and the mmu changes.

[10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
        https://github.com/kvm-x86/linux/commit/0a37fffda145
[11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
        https://github.com/kvm-x86/linux/commit/5b1fb116e1a6
[12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
        https://github.com/kvm-x86/linux/commit/dd9eaad744f4
[13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
        https://github.com/kvm-x86/linux/commit/548f87f667a3
[14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
        https://github.com/kvm-x86/linux/commit/c17f150000f6
[15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
        https://github.com/kvm-x86/linux/commit/7aac9dc680da
[16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
        https://github.com/kvm-x86/linux/commit/7645829145a9

[18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
        https://github.com/kvm-x86/linux/commit/9a5bff7f5ec2

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-10  1:42             ` Sean Christopherson
@ 2024-09-10 21:11               ` James Houghton
  2024-09-11 15:33                 ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: James Houghton @ 2024-09-10 21:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Sep 09, 2024, James Houghton wrote:
> > I take back what I said about this working on x86. I think it's
> > possible for there to be a race.
> >
> > Say...
> >
> > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
> > 2. T2 then locks kvm_rmap_lock_readonly().
> >
> > The modifications that T1 has made are not guaranteed to be visible to
> > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
> > and T2 has an smp_rmb() before reading the data.
> >
> > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > ordering, will effectively do a smp_rmb(). But T1 only does an
> > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > may enter its critical section and then *later* observe the changes
> > that T1 made.
> >
> > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > instructions, T1's writes occur in the same order that we have written
> > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > at compile time is forbidden.
> >
> > So what I'm saying is:
> >
> > 1. kvm_rmap_unlock() must have an smp_wmb().
>
> No, because beating a dead horse, this is not generic code, this is x86.

What prevents the compiler from reordering (non-atomic, non-volatile)
stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
the WRITE_ONCE()?

IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
missing), and should the compiler choose to do this, the lock will not
function correctly.

> If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> I wouldn't be opposed to doing this in the locking code to document things:
>
>  s/READ_ONCE/atomic_read_acquire
>  s/WRITE_ONCE/atomic_set_release
>  s/try_cmpxchg/atomic_cmpxchg_acquire

I think we can use atomic_long_t.

It would be really great if we did a substitution like this. That
would address my above concern about barrier() (atomic_set_release,
for example, implies a barrier() that we otherwise need to include).

[1]: https://www.kernel.org/doc/Documentation/memory-barriers.txt
(GUARANTEES + COMPILER BARRIER)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
  2024-09-10 21:11               ` James Houghton
@ 2024-09-11 15:33                 ` Sean Christopherson
  0 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2024-09-11 15:33 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Oliver Upton, Marc Zyngier,
	Peter Xu

On Tue, Sep 10, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 6:42 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > I take back what I said about this working on x86. I think it's
> > > possible for there to be a race.
> > >
> > > Say...
> > >
> > > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock().
> > > 2. T2 then locks kvm_rmap_lock_readonly().
> > >
> > > The modifications that T1 has made are not guaranteed to be visible to
> > > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication
> > > and T2 has an smp_rmb() before reading the data.
> > >
> > > Now the way you had it, T2, because it uses try_cmpxchg() with full
> > > ordering, will effectively do a smp_rmb(). But T1 only does an
> > > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1
> > > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2
> > > may enter its critical section and then *later* observe the changes
> > > that T1 made.
> > >
> > > Now this is impossible on x86 (IIUC) if, in the compiled list of
> > > instructions, T1's writes occur in the same order that we have written
> > > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering
> > > at compile time is forbidden.
> > >
> > > So what I'm saying is:
> > >
> > > 1. kvm_rmap_unlock() must have an smp_wmb().
> >
> > No, because beating a dead horse, this is not generic code, this is x86.
> 
> What prevents the compiler from reordering (non-atomic, non-volatile)
> stores that happen before WRITE_ONCE() in kvm_rmap_unlock() to after
> the WRITE_ONCE()?

Oof, right, nothing.  Which is why __smp_store_release() has an explicit
barrier() before its WRITE_ONCE().

> IMV, such a reordering is currently permitted[1] (i.e., a barrier() is
> missing), and should the compiler choose to do this, the lock will not
> function correctly.
> 
> > If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t, then
> > I wouldn't be opposed to doing this in the locking code to document things:
> >
> >  s/READ_ONCE/atomic_read_acquire
> >  s/WRITE_ONCE/atomic_set_release
> >  s/try_cmpxchg/atomic_cmpxchg_acquire
> 
> I think we can use atomic_long_t.

Duh.  That's a /facepalm moment.

> It would be really great if we did a substitution like this. That
> would address my above concern about barrier() (atomic_set_release,
> for example, implies a barrier() that we otherwise need to include).

Ya, agreed, not just for warm fuzzies, but because it's necessary to prevent
the compiler from being clever.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-09-11 15:33 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-09 19:43 [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson
2024-08-09 19:43 ` [PATCH 01/22] KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded Sean Christopherson
2024-08-09 19:43 ` [PATCH 02/22] KVM: selftests: Rename max_guest_memory_test to mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 03/22] KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 04/22] KVM: selftests: Compute number of extra pages needed " Sean Christopherson
2024-09-06  0:03   ` James Houghton
2024-09-06  4:42     ` Sean Christopherson
2024-09-06 18:25       ` James Houghton
2024-08-09 19:43 ` [PATCH 05/22] KVM: selftests: Enable mmu_stress_test on arm64 Sean Christopherson
2024-08-09 19:43 ` [PATCH 06/22] KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 07/22] KVM: selftests: Precisely limit the number of guest loops " Sean Christopherson
2024-08-09 19:43 ` [PATCH 08/22] KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test Sean Christopherson
2024-08-09 19:43 ` [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ) Sean Christopherson
     [not found]   ` <CADrL8HXcD--jn1iLeCJycCd3Btv4_rBPxz6NMnTREXfeh0vRZA@mail.gmail.com>
2024-09-07  0:53     ` Sean Christopherson
2024-09-07  2:26       ` James Houghton
2024-09-09 15:49         ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 10/22] KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() Sean Christopherson
2024-08-28 18:20   ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 11/22] KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() Sean Christopherson
2024-08-09 19:43 ` [PATCH 12/22] KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot Sean Christopherson
2024-08-09 19:43 ` [PATCH 13/22] KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed Sean Christopherson
2024-08-09 19:43 ` [PATCH 14/22] KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper Sean Christopherson
2024-08-12 21:53   ` David Matlack
2024-08-12 21:58     ` David Matlack
2024-08-09 19:43 ` [PATCH 15/22] KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() Sean Christopherson
2024-08-09 19:43 ` [PATCH 16/22] KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals Sean Christopherson
2024-08-09 19:43 ` [PATCH 17/22] KVM: x86/mmu: Refactor low level rmap helpers to prep for walking w/o mmu_lock Sean Christopherson
2024-08-09 19:43 ` [PATCH 18/22] KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent Sean Christopherson
2024-08-09 19:43 ` [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock Sean Christopherson
2024-08-12  8:39   ` Lai Jiangshan
2024-08-12 15:22     ` Sean Christopherson
2024-08-13  6:35       ` Lai Jiangshan
2024-08-13 15:19         ` Sean Christopherson
2024-09-03 20:17   ` James Houghton
2024-09-03 21:27     ` Sean Christopherson
2024-09-03 21:40       ` James Houghton
2024-09-09 19:00   ` James Houghton
2024-09-09 20:02     ` Sean Christopherson
2024-09-09 20:46       ` James Houghton
2024-09-09 22:02         ` Sean Christopherson
2024-09-10  0:28           ` James Houghton
2024-09-10  1:42             ` Sean Christopherson
2024-09-10 21:11               ` James Houghton
2024-09-11 15:33                 ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 20/22] KVM: x86/mmu: Add support for lockless walks of rmap SPTEs Sean Christopherson
2024-08-09 19:43 ` [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns Sean Christopherson
2024-09-03 20:36   ` James Houghton
2024-09-04 15:48     ` Sean Christopherson
2024-08-09 19:43 ` [PATCH 22/22] ***HACK*** KVM: x86: Don't take " Sean Christopherson
2024-09-10  4:56 ` [PATCH 00/22] KVM: x86/mmu: Allow yielding on mmu_notifier zap Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).