kvm.vger.kernel.org archive mirror
* [PATCH v3 0/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking
@ 2025-08-22  8:01 Yan Zhao
  2025-08-22  8:02 ` [PATCH v3 1/3] " Yan Zhao
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Yan Zhao @ 2025-08-22  8:01 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: peterx, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Hi,
This series addresses a bug where userspace can request KVM to reset dirty
GFNs in memslots that do not have dirty tracking enabled.
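
For background, the dirty-ring handshake can be sketched in userspace terms as
follows. This is a minimal model: the struct layout mirrors struct
kvm_dirty_gfn from <linux/kvm.h>, but the helper, flag handling, and ring size
are illustrative assumptions, not the real uapi wrappers:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct kvm_dirty_gfn from <linux/kvm.h>. */
struct dirty_gfn {
	uint32_t flags;		/* dirty/harvested state of the entry */
	uint32_t slot;		/* (as_id << 16) | slot_id */
	uint64_t offset;	/* GFN offset within the memslot */
};

#define DIRTY_GFN_F_DIRTY	0x1	/* set by KVM when the GFN is dirtied */
#define DIRTY_GFN_F_RESET	0x2	/* set by userspace after harvesting */

/*
 * Harvest dirtied entries starting at *fetch_index, marking each as
 * collected; a later KVM_RESET_DIRTY_RINGS ioctl tells KVM to reprotect
 * the harvested GFNs.  Nothing stops buggy userspace from writing any
 * slot/offset it likes before resetting, which is the hole patch 1 closes.
 */
static int harvest_ring(struct dirty_gfn *ring, int size, int *fetch_index)
{
	int harvested = 0;

	while (ring[*fetch_index % size].flags & DIRTY_GFN_F_DIRTY) {
		ring[*fetch_index % size].flags = DIRTY_GFN_F_RESET;
		(*fetch_index)++;
		harvested++;
	}
	return harvested;
}
```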

Patch 1 provides the fix.
Patch 2 is an optimization that avoids unnecessarily invoking handlers in
        kvm_handle_hva_range() when a GFN range is entirely private.

        Patch 2 is not directly related to the issue in this series, but
        was found while implementing the selftest in patch 3. It also
        enhances the reliability of the selftest results in patch 3 by
        ruling out zap-related changes to private mappings of the test slot.

Patch 3 converts the TDX-specific selftest in v2 to test
        KVM_X86_SW_PROTECTED_VM VMs.

        Unlike the TDX cases, which would trigger KVM_BUG_ON() when GFNs
        are incorrectly reset in memslots that do not enable dirty
        tracking, there are no obvious errors for KVM_X86_SW_PROTECTED_VM
        VMs. So, patch 3 detects the kvm_tdp_mmu_spte_changed event
        instead.

        TDX cases will be provided once the TDX selftest framework is
        finalized.

v3:
- Rebased patch 1.
- Added patch 2.
- Converted patch 3 to test KVM_X86_SW_PROTECTED_VM VMs.
- code base: kvm-x86-next-2025.08.21

v2: https://lore.kernel.org/all/20241223070427.29583-1-yan.y.zhao@intel.com
- Added a comment in patch 1, explaining that it's possible to try to
  update a memslot that isn't being dirty-logged if userspace is
  misbehaving. Specifically, userspace can write arbitrary data into the
  ring. (Sean)

v1: https://lore.kernel.org/all/20241220082027.15851-1-yan.y.zhao@intel.com


Yan Zhao (3):
  KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking
  KVM: Skip invoking shared memory handler for entirely private GFN
    ranges
  KVM: selftests: Test resetting dirty ring in gmem slots in protected
    VMs

 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/x86/reset_dirty_ring_on_gmem_test.c   | 392 ++++++++++++++++++
 virt/kvm/dirty_ring.c                         |   8 +-
 virt/kvm/kvm_main.c                           |  11 +
 4 files changed, 411 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kvm/x86/reset_dirty_ring_on_gmem_test.c

-- 
2.43.2


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 1/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking
  2025-08-22  8:01 [PATCH v3 0/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking Yan Zhao
@ 2025-08-22  8:02 ` Yan Zhao
  2025-08-25 20:42   ` Sean Christopherson
  2025-08-22  8:02 ` [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges Yan Zhao
  2025-08-22  8:03 ` [PATCH v3 3/3] KVM: selftests: Test resetting dirty ring in gmem slots in protected VMs Yan Zhao
  2 siblings, 1 reply; 8+ messages in thread
From: Yan Zhao @ 2025-08-22  8:02 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: peterx, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Do not allow resetting dirty GFNs in memslots that do not enable dirty
tracking.

vCPUs' dirty rings are shared between userspace and KVM. After KVM sets
dirtied entries in the dirty rings, userspace is responsible for
harvesting/resetting these entries and calling the ioctl
KVM_RESET_DIRTY_RINGS to inform KVM to advance the reset_index in the dirty
rings and invoke kvm_arch_mmu_enable_log_dirty_pt_masked() to clear the
SPTEs' dirty bits or perform write protection of the GFNs.

Although KVM does not set dirty entries for GFNs in a memslot that does not
enable dirty tracking, userspace can write arbitrary data into the dirty
ring. This makes it possible for misbehaving userspace to specify that it
has harvested a GFN from such a memslot. When this happens, KVM will be
asked to clear dirty bits or perform write protection for GFNs in a memslot
that does not enable dirty tracking, which is undesirable.

For TDX, this unexpected resetting of dirty GFNs could cause inconsistency
between the mirror SPTE and the external SPTE in hardware (e.g., the mirror
SPTE has no write bit while the external SPTE is writable). When
kvm_dirty_log_manual_protect_and_init_set() is true and huge pages are
enabled in TDX, this could even lead to kvm_mmu_slot_gfn_write_protect()
being called and trigger KVM_BUG_ON() due to permission reduction changes
in the huge mirror SPTEs.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/dirty_ring.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 02bc6b00d76c..b38b4b7d7667 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -63,7 +63,13 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 
 	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
 
-	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+	/*
+	 * Userspace can write arbitrary data into the dirty ring, making it
+	 * possible for misbehaving userspace to try to reset an out-of-memslot
+	 * GFN or a GFN in a memslot that isn't being dirty-logged.
+	 */
+	if (!memslot || (offset + __fls(mask)) >= memslot->npages ||
+	    !kvm_slot_dirty_track_enabled(memslot))
 		return;
 
 	KVM_MMU_LOCK(kvm);
-- 
2.43.2



* [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges
  2025-08-22  8:01 [PATCH v3 0/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking Yan Zhao
  2025-08-22  8:02 ` [PATCH v3 1/3] " Yan Zhao
@ 2025-08-22  8:02 ` Yan Zhao
  2025-08-25 21:05   ` Sean Christopherson
  2025-08-22  8:03 ` [PATCH v3 3/3] KVM: selftests: Test resetting dirty ring in gmem slots in protected VMs Yan Zhao
  2 siblings, 1 reply; 8+ messages in thread
From: Yan Zhao @ 2025-08-22  8:02 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: peterx, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

When a GFN range is entirely private, it's unnecessary for
kvm_handle_hva_range() to invoke handlers for the GFN range, because
1) the gfn_range.attr_filter for the handler is KVM_FILTER_SHARED, which
   is for shared mappings only;
2) KVM has already zapped all shared mappings before setting the memory
   attribute to private.

This can avoid unnecessary zaps on private mappings for VMs of type
KVM_X86_SW_PROTECTED_VM, e.g., during auto numa balancing scans of VMAs.
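
The skip relies on the range being uniformly private; conceptually the
attribute check behaves like the following scan. This is a simplified
stand-in for kvm_range_has_memory_attributes(), which in KVM walks an
xarray under appropriate locking; the flat array and helper name here
are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MEMORY_ATTRIBUTE_PRIVATE	0x8	/* models KVM_MEMORY_ATTRIBUTE_PRIVATE */

/*
 * Return true if every GFN in [start, end) satisfies
 * (attrs & mask) == goal.  attrs[] is a flat per-GFN array standing in
 * for KVM's xarray of memory attributes.
 */
static bool range_has_attrs(const unsigned long *attrs, size_t start,
			    size_t end, unsigned long mask, unsigned long goal)
{
	for (size_t i = start; i < end; i++)
		if ((attrs[i] & mask) != goal)
			return false;
	return true;
}
```

With mask ~0 and goal MEMORY_ATTRIBUTE_PRIVATE, this returns true only when
every GFN in the range is exactly private, matching how the patch calls the
real helper.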

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 virt/kvm/kvm_main.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f769d1dccc21..e615ad405ce4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -620,6 +620,17 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.slot = slot;
 			gfn_range.lockless = range->lockless;
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+			/*
+			 * If the GFN range is entirely private, there is
+			 * no need to invoke the handler.
+			 */
+			if (kvm_range_has_memory_attributes(kvm, gfn_range.start,
+							    gfn_range.end, ~0,
+							    KVM_MEMORY_ATTRIBUTE_PRIVATE))
+				continue;
+#endif
+
 			if (!r.found_memslot) {
 				r.found_memslot = true;
 				if (!range->lockless) {
-- 
2.43.2



* [PATCH v3 3/3] KVM: selftests: Test resetting dirty ring in gmem slots in protected VMs
  2025-08-22  8:01 [PATCH v3 0/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking Yan Zhao
  2025-08-22  8:02 ` [PATCH v3 1/3] " Yan Zhao
  2025-08-22  8:02 ` [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges Yan Zhao
@ 2025-08-22  8:03 ` Yan Zhao
  2 siblings, 0 replies; 8+ messages in thread
From: Yan Zhao @ 2025-08-22  8:03 UTC (permalink / raw)
  To: pbonzini, seanjc; +Cc: peterx, rick.p.edgecombe, linux-kernel, kvm, Yan Zhao

Test resetting dirty ring in slots with the KVM_MEM_GUEST_MEMFD flag in
KVM_X86_SW_PROTECTED_VM VMs.

The test purposely resets dirty ring entries incorrectly so that they point to
a gmem slot.

Unlike in TDX VMs, where resetting the dirty ring in a gmem slot could
trigger KVM_BUG_ON(), there are no obvious errors for
KVM_X86_SW_PROTECTED_VM VMs. Therefore, detect SPTE changes by reading
trace messages with the kvm_tdp_mmu_spte_changed event enabled.
Consequently, the test is conducted only when tdp_mmu is enabled and
tracing is available.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/x86/reset_dirty_ring_on_gmem_test.c   | 392 ++++++++++++++++++
 2 files changed, 393 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/reset_dirty_ring_on_gmem_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index f6fe7a07a0a2..ebd1d829c3f9 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -136,6 +136,7 @@ TEST_GEN_PROGS_x86 += x86/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86 += x86/triple_fault_event_test
 TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test
 TEST_GEN_PROGS_x86 += x86/aperfmperf_test
+TEST_GEN_PROGS_x86 += x86/reset_dirty_ring_on_gmem_test
 TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
diff --git a/tools/testing/selftests/kvm/x86/reset_dirty_ring_on_gmem_test.c b/tools/testing/selftests/kvm/x86/reset_dirty_ring_on_gmem_test.c
new file mode 100644
index 000000000000..cf1746c0149f
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/reset_dirty_ring_on_gmem_test.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test resetting the dirty ring on a gmem slot on x86.
+ * Copyright (C) 2025, Intel, Inc.
+ *
+ * The slot flag KVM_MEM_GUEST_MEMFD is incompatible with the flag
+ * KVM_MEM_LOG_DIRTY_PAGES, which means KVM does not permit dirty page tracking
+ * on gmem slots.
+ *
+ * When the dirty ring is enabled, KVM does not mark GFNs in gmem slots as
+ * dirty, but userspace can still reset entries and write arbitrary data into
+ * the dirty ring shared between KVM and userspace. This can lead KVM to
+ * incorrectly clear write permission or dirty bits on SPTEs of gmem slots.
+ *
+ * While this might be harmless for non-TDX VMs, it could cause inconsistencies
+ * between the mirror SPTEs and the external SPTEs in hardware, or even trigger
+ * KVM_BUG_ON() for TDX.
+ *
+ * Purposely reset the dirty ring incorrectly on gmem slots (which do not allow
+ * dirty page tracking) to verify that misbehaving userspace cannot cause any
+ * SPTE permission-reduction change.
+ *
+ * Steps conducted in this test:
+ * 1. echo nop > ${TRACING_ROOT}/current_tracer
+ *    echo 1 > ${TRACING_ROOT}/events/kvmmmu/kvm_tdp_mmu_spte_changed/enable
+ *    echo > ${TRACING_ROOT}/set_event_pid
+ *    echo > ${TRACING_ROOT}/set_event_notrace_pid
+ *
+ * 2. echo "common_pid == $tid && gfn == 0xc0400" > \
+ *    ${TRACING_ROOT}/events/kvmmmu/kvm_tdp_mmu_spte_changed/filter
+ *
+ * 3. echo 0 > ${TRACING_ROOT}/tracing_on
+ *    echo > ${TRACING_ROOT}/trace
+ *    echo 1 > ${TRACING_ROOT}/tracing_on
+ *
+ * 4. purposely reset dirty ring incorrectly
+ *
+ * 5. cat ${TRACING_ROOT}/trace
+ */
+#include <linux/kvm.h>
+#include <asm/barrier.h>
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+#define DEBUGFS "/sys/kernel/debug/tracing"
+#define TRACEFS "/sys/kernel/tracing"
+
+#define TEST_DIRTY_RING_GPA (0xc0400000)
+#define TEST_DIRTY_RING_GVA (0x90400000)
+#define TEST_DIRTY_RING_REGION_SLOT 11
+#define TEST_DIRTY_RING_REGION_SIZE 0x200000
+#define TEST_DIRTY_RING_COUNT 4096
+#define TEST_DIRTY_RING_GUEST_WRITE_MAX_CNT 3
+
+static const char *PATTEN = "spte_changed";
+static char *tracing_root;
+
+static int open_path(char *subpath, int flags)
+{
+	static char path[100];
+	int count, fd;
+
+	count = snprintf(path, sizeof(path), "%s/%s", tracing_root, subpath);
+	TEST_ASSERT(count > 0, "Incorrect path\n");
+	fd = open(path, flags);
+	TEST_ASSERT(fd >= 0, "Cannot open %s\n", path);
+
+	return fd;
+}
+
+static void setup_tracing(void)
+{
+	int fd;
+
+	/* set current_tracer to nop */
+	fd = open_path("current_tracer", O_WRONLY);
+	test_write(fd, "nop\n", 4);
+	close(fd);
+
+	/* turn on event kvm_tdp_mmu_spte_changed */
+	fd = open_path("events/kvmmmu/kvm_tdp_mmu_spte_changed/enable", O_WRONLY);
+	test_write(fd, "1\n", 2);
+	close(fd);
+
+	/* clear set_event_pid & set_event_notrace_pid */
+	fd = open_path("set_event_pid", O_WRONLY | O_TRUNC);
+	close(fd);
+
+	fd = open_path("set_event_notrace_pid", O_WRONLY | O_TRUNC);
+	close(fd);
+}
+
+static void filter_event(void)
+{
+	int count, fd;
+	char buf[100];
+
+	fd = open_path("events/kvmmmu/kvm_tdp_mmu_spte_changed/filter",
+		       O_WRONLY | O_TRUNC);
+
+	count = snprintf(buf, sizeof(buf), "common_pid == %d && gfn == 0x%x\n",
+			 gettid(), TEST_DIRTY_RING_GPA >> PAGE_SHIFT);
+	TEST_ASSERT(count > 0, "Failed to format event filter\n");
+	test_write(fd, buf, count);
+	close(fd);
+}
+
+static void enable_tracing(bool enable)
+{
+	char *val = enable ? "1\n" : "0\n";
+	int fd;
+
+	if (enable) {
+		/* clear trace log before enabling */
+		fd = open_path("trace", O_WRONLY | O_TRUNC);
+		close(fd);
+	}
+
+	fd = open_path("tracing_on", O_WRONLY);
+	test_write(fd, val, 2);
+	close(fd);
+}
+
+static void reset_tracing(void)
+{
+	enable_tracing(false);
+	enable_tracing(true);
+}
+
+static void detect_spte_change(void)
+{
+	static char buf[1024];
+	FILE *file;
+	int count;
+
+	count = snprintf(buf, sizeof(buf), "%s/trace", tracing_root);
+	TEST_ASSERT(count > 0, "Incorrect path\n");
+	file = fopen(buf, "r");
+	TEST_ASSERT(file, "Cannot open %s\n", buf);
+
+	while (fgets(buf, sizeof(buf), file))
+		TEST_ASSERT(!strstr(buf, PATTEN), "Unexpected SPTE change %s\n", buf);
+
+	fclose(file);
+}
+
+/*
+ * Write to a gmem slot and exit to host after each write to allow host to check
+ * dirty ring.
+ */
+void guest_code(void)
+{
+	uint64_t count = 0;
+
+	while (count < TEST_DIRTY_RING_GUEST_WRITE_MAX_CNT) {
+		count++;
+		memset((void *)TEST_DIRTY_RING_GVA, 1, 8);
+		GUEST_SYNC(count);
+	}
+	GUEST_DONE();
+}
+
+/*
+ * Verify that KVM_MEM_LOG_DIRTY_PAGES cannot be set on a memslot with flag
+ * KVM_MEM_GUEST_MEMFD.
+ */
+static void verify_turn_on_log_dirty_pages_flag(struct kvm_vcpu *vcpu)
+{
+	struct userspace_mem_region *region;
+	int ret;
+
+	region = memslot2region(vcpu->vm, TEST_DIRTY_RING_REGION_SLOT);
+	region->region.flags |= KVM_MEM_LOG_DIRTY_PAGES;
+
+	ret = __vm_ioctl(vcpu->vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+	TEST_ASSERT(ret, "KVM_SET_USER_MEMORY_REGION2 incorrectly succeeds\n");
+	region->region.flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+}
+
+static inline bool dirty_gfn_is_dirtied(struct kvm_dirty_gfn *gfn)
+{
+	return smp_load_acquire(&gfn->flags) == KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
+{
+	smp_store_release(&gfn->flags, KVM_DIRTY_GFN_F_RESET);
+}
+
+static bool dirty_ring_empty(struct kvm_vcpu *vcpu)
+{
+	struct kvm_dirty_gfn *dirty_gfns = vcpu_map_dirty_ring(vcpu);
+	struct kvm_dirty_gfn *cur;
+	int i;
+
+	for (i = 0; i < TEST_DIRTY_RING_COUNT; i++) {
+		cur = &dirty_gfns[i];
+
+		if (dirty_gfn_is_dirtied(cur))
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Purposely reset the dirty ring incorrectly by resetting a dirty ring entry
+ * even when KVM does not report the entry as dirty.
+ *
+ * In the kvm_dirty_gfn entry, specify the slot to the gmem slot that does not
+ * allow dirty page tracking and has no flag KVM_MEM_LOG_DIRTY_PAGES.
+ */
+static void reset_dirty_ring(struct kvm_vcpu *vcpu, int *reset_index)
+{
+	struct kvm_dirty_gfn *dirty_gfns = vcpu_map_dirty_ring(vcpu);
+	struct kvm_dirty_gfn *cur = &dirty_gfns[*reset_index];
+	uint32_t cleared;
+
+	reset_tracing();
+
+	cur->slot = TEST_DIRTY_RING_REGION_SLOT;
+	cur->offset = 0;
+	dirty_gfn_set_collected(cur);
+	cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
+	*reset_index += cleared;
+	TEST_ASSERT(cleared == 1, "Unexpected cleared count %d\n", cleared);
+
+	detect_spte_change();
+}
+
+/*
+ * The vCPU worker to loop vcpu_run(). After each vCPU access to a GFN, check if
+ * the dirty ring is empty and reset the dirty ring.
+ */
+static void reset_dirty_ring_worker(struct kvm_vcpu *vcpu)
+{
+	struct kvm_run *run = vcpu->run;
+	struct ucall uc;
+	uint64_t cmd;
+	int index = 0;
+
+	filter_event();
+	while (1) {
+		vcpu_run(vcpu);
+
+		if (run->exit_reason == KVM_EXIT_IO) {
+			cmd = get_ucall(vcpu, &uc);
+			if (cmd != UCALL_SYNC)
+				break;
+
+			TEST_ASSERT(dirty_ring_empty(vcpu),
+				    "Guest write should not cause GFN dirty\n");
+
+			reset_dirty_ring(vcpu, &index);
+		}
+	}
+}
+
+static struct kvm_vm *create_vm(unsigned long vm_type, struct kvm_vcpu **vcpu,
+				bool private)
+{
+	unsigned int npages = TEST_DIRTY_RING_REGION_SIZE / getpagesize();
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = vm_type,
+	};
+	struct kvm_vm *vm;
+
+	vm = __vm_create(shape, 1, 0);
+	vm_enable_dirty_ring(vm, TEST_DIRTY_RING_COUNT * sizeof(struct kvm_dirty_gfn));
+	*vcpu = vm_vcpu_add(vm, 0, guest_code);
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    TEST_DIRTY_RING_GPA,
+				    TEST_DIRTY_RING_REGION_SLOT,
+				    npages, KVM_MEM_GUEST_MEMFD);
+	vm->memslots[MEM_REGION_TEST_DATA] = TEST_DIRTY_RING_REGION_SLOT;
+	virt_map(vm, TEST_DIRTY_RING_GVA, TEST_DIRTY_RING_GPA, npages);
+	if (private)
+		vm_mem_set_private(vm, TEST_DIRTY_RING_GPA,
+				   TEST_DIRTY_RING_REGION_SIZE);
+	return vm;
+}
+
+struct test_config {
+	unsigned long vm_type;
+	bool manual_protect_and_init_set;
+	bool private_access;
+	char *test_desc;
+};
+
+void test_dirty_ring_on_gmem_slot(struct test_config *config)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	if (config->vm_type &&
+	    !(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(config->vm_type))) {
+		ksft_test_result_skip("\n");
+		return;
+	}
+
+	vm = create_vm(config->vm_type, &vcpu, config->private_access);
+
+	/*
+	 * Let KVM detect that kvm_dirty_log_manual_protect_and_init_set() is
+	 * true in kvm_arch_mmu_enable_log_dirty_pt_masked() to check if
+	 * kvm_mmu_slot_gfn_write_protect() will be called on a gmem memslot.
+	 */
+	if (config->manual_protect_and_init_set) {
+		u64 manual_caps;
+
+		manual_caps = kvm_check_cap(KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
+
+		manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
+				KVM_DIRTY_LOG_INITIALLY_SET);
+
+		if (!manual_caps)
+			return;
+
+		vm_enable_cap(vm, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, manual_caps);
+	}
+
+	verify_turn_on_log_dirty_pages_flag(vcpu);
+
+	reset_dirty_ring_worker(vcpu);
+
+	kvm_vm_free(vm);
+	ksft_test_result_pass("\n");
+}
+
+static bool dirty_ring_supported(void)
+{
+	return (kvm_has_cap(KVM_CAP_DIRTY_LOG_RING) ||
+		kvm_has_cap(KVM_CAP_DIRTY_LOG_RING_ACQ_REL));
+}
+
+static bool has_tracing(void)
+{
+	if (faccessat(AT_FDCWD, DEBUGFS, F_OK, AT_EACCESS) == 0) {
+		tracing_root = DEBUGFS;
+		return true;
+	}
+
+	if (faccessat(AT_FDCWD, TRACEFS, F_OK, AT_EACCESS) == 0) {
+		tracing_root = TRACEFS;
+		return true;
+	}
+
+	return false;
+}
+
+static struct test_config tests[] = {
+	{
+		.vm_type = KVM_X86_SW_PROTECTED_VM,
+		.manual_protect_and_init_set = false,
+		.private_access = true,
+		.test_desc = "SW_PROTECTED_VM, manual_protect_and_init_set=false, private access",
+	},
+	{
+		.vm_type = KVM_X86_SW_PROTECTED_VM,
+		.manual_protect_and_init_set = true,
+		.private_access = true,
+		.test_desc = "SW_PROTECTED_VM, manual_protect_and_init_set=true, private access",
+	},
+};
+
+int main(int argc, char **argv)
+{
+	int test_cnt = ARRAY_SIZE(tests);
+
+	ksft_print_header();
+	ksft_set_plan(test_cnt);
+
+	TEST_REQUIRE(get_kvm_param_bool("tdp_mmu"));
+	TEST_REQUIRE(has_tracing());
+	TEST_REQUIRE(dirty_ring_supported());
+
+	setup_tracing();
+
+	for (int i = 0; i < test_cnt; i++) {
+		pthread_t vm_thread;
+
+		pthread_create(&vm_thread, NULL,
+			       (void *(*)(void *))test_dirty_ring_on_gmem_slot,
+			       &tests[i]);
+		pthread_join(vm_thread, NULL);
+	}
+
+	ksft_finished();
+	return 0;
+}
-- 
2.43.2



* Re: [PATCH v3 1/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking
  2025-08-22  8:02 ` [PATCH v3 1/3] " Yan Zhao
@ 2025-08-25 20:42   ` Sean Christopherson
  2025-08-26  1:22     ` Yan Zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-08-25 20:42 UTC (permalink / raw)
  To: Yan Zhao; +Cc: pbonzini, peterx, rick.p.edgecombe, linux-kernel, kvm

On Fri, Aug 22, 2025, Yan Zhao wrote:
> Do not allow resetting dirty GFNs in memslots that do not enable dirty
> tracking.
> 
> vCPUs' dirty rings are shared between userspace and KVM. After KVM sets
> dirtied entries in the dirty rings, userspace is responsible for
> harvesting/resetting these entries and calling the ioctl
> KVM_RESET_DIRTY_RINGS to inform KVM to advance the reset_index in the dirty
> rings and invoke kvm_arch_mmu_enable_log_dirty_pt_masked() to clear the
> SPTEs' dirty bits or perform write protection of the GFNs.
> 
> Although KVM does not set dirty entries for GFNs in a memslot that does not
> enable dirty tracking, userspace can write arbitrary data into the dirty
> ring. This makes it possible for misbehaving userspace to specify that it
> has harvested a GFN from such a memslot. When this happens, KVM will be
> asked to clear dirty bits or perform write protection for GFNs in a memslot
> that does not enable dirty tracking, which is undesirable.
> 
> For TDX, this unexpected resetting of dirty GFNs could cause inconsistency
> between the mirror SPTE and the external SPTE in hardware (e.g., the mirror
> SPTE has no write bit while the external SPTE is writable). When
> kvm_dirty_log_manual_protect_and_init_set() is true and huge pages are
> enabled in TDX, this could even lead to kvm_mmu_slot_gfn_write_protect()
> being called and trigger KVM_BUG_ON() due to permission reduction changes
> in the huge mirror SPTEs.
> 

Sounds like this needs a Fixes and Cc: stable?

> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/dirty_ring.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 02bc6b00d76c..b38b4b7d7667 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -63,7 +63,13 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>  
>  	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
>  
> -	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
> +	/*
> +	 * Userspace can write arbitrary data into the dirty ring, making it
> +	 * possible for misbehaving userspace to try to reset an out-of-memslot
> +	 * GFN or a GFN in a memslot that isn't being dirty-logged.
> +	 */
> +	if (!memslot || (offset + __fls(mask)) >= memslot->npages ||
> +	    !kvm_slot_dirty_track_enabled(memslot))

Maybe check for dirty tracking being enabled before checking the range?  Purely
because checking if _any_  gfn can be recorded seems like something that should
be checked before a specific gfn can be recorded.  I.e.

	if (!memslot || !kvm_slot_dirty_track_enabled(memslot) ||
	    (offset + __fls(mask)) >= memslot->npages)
	    
>  		return;
>  
>  	KVM_MMU_LOCK(kvm);
> -- 
> 2.43.2
> 


* Re: [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges
  2025-08-22  8:02 ` [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges Yan Zhao
@ 2025-08-25 21:05   ` Sean Christopherson
  2025-08-26  6:51     ` Yan Zhao
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2025-08-25 21:05 UTC (permalink / raw)
  To: Yan Zhao; +Cc: pbonzini, peterx, rick.p.edgecombe, linux-kernel, kvm

On Fri, Aug 22, 2025, Yan Zhao wrote:
> When a GFN range is entirely private, it's unnecessary for
> kvm_handle_hva_range() to invoke handlers for the GFN range, because
> 1) the gfn_range.attr_filter for the handler is KVM_FILTER_SHARED, which
>    is for shared mappings only;
> 2) KVM has already zapped all shared mappings before setting the memory
>    attribute to private.
> 
> This can avoid unnecessary zaps on private mappings for VMs of type
> KVM_X86_SW_PROTECTED_VM, e.g., during auto numa balancing scans of VMAs.

This feels like the wrong place to try and optimize spurious zaps.  x86 should
be skipping SPTEs that don't match.  For KVM_X86_SW_PROTECTED_VM, I don't think
we care about spurious zaps, because that's a testing-only type that doesn't
have line of sight to being a "real" type.

For SNP, we might care?  But actually zapping private SPTEs would require
userspace to retain the shared mappings across a transition, _and_ be running
NUMA autobalancing in the first place.  If someone actually cares about optimizing
this scenario, KVM x86 could track private SPTEs via a software-available bit.

We also want to move away from KVM_MEMORY_ATTRIBUTE_PRIVATE and instead track
private vs. shared in the gmem instance.

So I'm inclined to skip this...

> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  virt/kvm/kvm_main.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f769d1dccc21..e615ad405ce4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -620,6 +620,17 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
>  			gfn_range.slot = slot;
>  			gfn_range.lockless = range->lockless;
>  
> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> +			/*
> +			 * If GFN range are all private, no need to invoke the
> +			 * handler.
> +			 */
> +			if (kvm_range_has_memory_attributes(kvm, gfn_range.start,
> +							    gfn_range.end, ~0,
> +							    KVM_MEMORY_ATTRIBUTE_PRIVATE))
> +				continue;
> +#endif
> +
>  			if (!r.found_memslot) {
>  				r.found_memslot = true;
>  				if (!range->lockless) {
> -- 
> 2.43.2
> 


* Re: [PATCH v3 1/3] KVM: Do not reset dirty GFNs in a memslot not enabling dirty tracking
  2025-08-25 20:42   ` Sean Christopherson
@ 2025-08-26  1:22     ` Yan Zhao
  0 siblings, 0 replies; 8+ messages in thread
From: Yan Zhao @ 2025-08-26  1:22 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: pbonzini, peterx, rick.p.edgecombe, linux-kernel, kvm

On Mon, Aug 25, 2025 at 01:42:43PM -0700, Sean Christopherson wrote:
> On Fri, Aug 22, 2025, Yan Zhao wrote:
> > Do not allow resetting dirty GFNs in memslots that do not enable dirty
> > tracking.
> > 
> > vCPUs' dirty rings are shared between userspace and KVM. After KVM sets
> > dirtied entries in the dirty rings, userspace is responsible for
> > harvesting/resetting these entries and calling the ioctl
> > KVM_RESET_DIRTY_RINGS to inform KVM to advance the reset_index in the dirty
> > rings and invoke kvm_arch_mmu_enable_log_dirty_pt_masked() to clear the
> > SPTEs' dirty bits or perform write protection of the GFNs.
> > 
> > Although KVM does not set dirty entries for GFNs in a memslot that does not
> > enable dirty tracking, userspace can write arbitrary data into the dirty
> > ring. This makes it possible for misbehaving userspace to specify that it
> > has harvested a GFN from such a memslot. When this happens, KVM will be
> > asked to clear dirty bits or perform write protection for GFNs in a memslot
> > that does not enable dirty tracking, which is undesirable.
> > 
> > For TDX, this unexpected resetting of dirty GFNs could cause inconsistency
> > between the mirror SPTE and the external SPTE in hardware (e.g., the mirror
> > SPTE has no write bit while the external SPTE is writable). When
> > kvm_dirty_log_manual_protect_and_init_set() is true and huge pages are
> > enabled in TDX, this could even lead to kvm_mmu_slot_gfn_write_protect()
> > being called and trigger KVM_BUG_ON() due to permission reduction changes
> > in the huge mirror SPTEs.
> > 
> 
> Sounds like this needs a Fixes and Cc: stable?
Ok. Will include them in the next version.

> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  virt/kvm/dirty_ring.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index 02bc6b00d76c..b38b4b7d7667 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -63,7 +63,13 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> >  
> >  	memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
> >  
> > -	if (!memslot || (offset + __fls(mask)) >= memslot->npages)
> > +	/*
> > +	 * Userspace can write arbitrary data into the dirty ring, making it
> > +	 * possible for misbehaving userspace to try to reset an out-of-memslot
> > +	 * GFN or a GFN in a memslot that isn't being dirty-logged.
> > +	 */
> > +	if (!memslot || (offset + __fls(mask)) >= memslot->npages ||
> > +	    !kvm_slot_dirty_track_enabled(memslot))
> 
> Maybe check for dirty tracking being enabled before checking the range?  Purely
> because checking if _any_  gfn can be recorded seems like something that should
> be checked before a specific gfn can be recorded.  I.e.
> 
> 	if (!memslot || !kvm_slot_dirty_track_enabled(memslot) ||
> 	    (offset + __fls(mask)) >= memslot->npages)
Makes sense.
Thank you!

> >  		return;
> >  
> >  	KVM_MMU_LOCK(kvm);
> > -- 
> > 2.43.2
> > 


* Re: [PATCH v3 2/3] KVM: Skip invoking shared memory handler for entirely private GFN ranges
  2025-08-25 21:05   ` Sean Christopherson
@ 2025-08-26  6:51     ` Yan Zhao
  0 siblings, 0 replies; 8+ messages in thread
From: Yan Zhao @ 2025-08-26  6:51 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: pbonzini, peterx, rick.p.edgecombe, linux-kernel, kvm

On Mon, Aug 25, 2025 at 02:05:22PM -0700, Sean Christopherson wrote:
> On Fri, Aug 22, 2025, Yan Zhao wrote:
> > When a GFN range is entirely private, it's unnecessary for
> > kvm_handle_hva_range() to invoke handlers for the GFN range, because
> > 1) the gfn_range.attr_filter for the handler is KVM_FILTER_SHARED, which
> >    is for shared mappings only;
> > 2) KVM has already zapped all shared mappings before setting the memory
> >    attribute to private.
> > 
> > This can avoid unnecessary zaps on private mappings for VMs of type
> > KVM_X86_SW_PROTECTED_VM, e.g., during auto numa balancing scans of VMAs.
> 
> This feels like the wrong place to try and optimize spurious zaps.  x86 should
> be skipping SPTEs that don't match.  For KVM_X86_SW_PROTECTED_VM, I don't think
> we care about spurious zaps, because that's a testing-only type that doesn't
> have line of sight to being a "real" type.
> 
> For SNP, we might care?  But actually zapping private SPTEs would require
> userspace to retain the shared mappings across a transition, _and_ be running
> NUMA autobalancing in the first place.  If someone actually cares about optimizing
Hmm, "running NUMA autobalancing" + "madvise(MADV_DONTNEED)" can still trigger
the spurious zaps.

task_numa_work  ==> found a VMA
  change_prot_numa
    change_protection
      change_pud_range ==> mmu_notifier_invalidate_range_start() if !pud_none()
 
Let me use munmap() in patch 3 to guard against spurious zaps then.

> this scenario, KVM x86 could track private SPTEs via a software-available bit.
> 
> We also want to move away from KVM_MEMORY_ATTRIBUTE_PRIVATE and instead track
> private vs. shared in the gmem instance.
>
> So I'm inclined to skip this...
Fair enough. Thank you for the detailed explanation!

