* [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested
@ 2024-12-11 19:37 Maxim Levitsky
2024-12-11 19:37 ` [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written Maxim Levitsky
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-11 19:37 UTC (permalink / raw)
To: kvm
Cc: linux-kselftest, Sean Christopherson, Paolo Bonzini, linux-kernel,
x86, Maxim Levitsky
As part of the effort to start running kvm selftests nested, this patch
series contains several fixes to the dirty_log_test which allow the test
to run nested reliably.
I also included a mostly-nop change to KVM that reverses the order in which
the PML log is read, to align more closely with the hardware. It should
not affect regular users of dirty logging, but it fixes a test-specific
assumption in the dirty_log_test dirty-ring mode.
Patch 4 fixes a very rare problem which is hard to reproduce with standard
test parameters, but due to some weird timing it actually happened a few
times on my machine, which prompted me to investigate it.
The issue can be reproduced well by running the test nested
(without patch 4 applied) with a very short iteration time and with a
few iterations in a loop like this:
while ./dirty_log_test -i 10 -I 1 -M dirty-ring ; do true ; done
Or even better, it's possible to manually patch the test to not wait at all
(effectively setting iteration time to 0), then it fails pretty fast.
Best regards,
Maxim Levitsky
Maxim Levitsky (4):
KVM: VMX: read the PML log in the same order as it was written
KVM: selftests: dirty_log_test: Limit s390x workaround to s390x
KVM: selftests: dirty_log_test: run the guest until some dirty ring
entries were harvested
KVM: selftests: dirty_log_test: support multiple write retries
arch/x86/kvm/vmx/vmx.c | 32 +++++---
arch/x86/kvm/vmx/vmx.h | 1 +
tools/testing/selftests/kvm/dirty_log_test.c | 79 +++++++++++++++++---
3 files changed, 91 insertions(+), 21 deletions(-)
--
2.26.3
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-11 19:37 [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested Maxim Levitsky
@ 2024-12-11 19:37 ` Maxim Levitsky
2024-12-12 0:44 ` Sean Christopherson
2024-12-11 19:37 ` [PATCH 2/4] KVM: selftests: dirty_log_test: Limit s390x workaround to s390x Maxim Levitsky
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-11 19:37 UTC (permalink / raw)
To: kvm
Cc: linux-kselftest, Sean Christopherson, Paolo Bonzini, linux-kernel,
x86, Maxim Levitsky
X86 spec specifies that the CPU writes to the PML log 'backwards'
or in other words, it first writes entry 511, then entry 510 and so on,
until it writes entry 0, after which the 'PML log full' VM exit happens.
I also confirmed on the bare metal that the CPU indeed writes the entries
in this order.
KVM on the other hand, reads the entries in the opposite order, from the
last written entry and towards entry 511 and dumps them in this order to
the dirty ring.
Usually this doesn't matter, except for one complex nesting case:
KVM retries the instructions that cause MMU faults.
This might cause an emulated PML log entry to be visible to L1's hypervisor
before the actual memory write was committed.
This happens when the L0 MMU fault is followed directly by the VM exit to
L1, for example due to a pending L1 interrupt or due to the L1's
'PML log full' event.
This problem doesn't have a noticeable real-world impact because this
write retry is not much different from the guest writing to the same page
multiple times, which is also not reflected in the dirty log. The users of
the dirty logging only rely on correct reporting of the clean pages, or
in other words they assume that if a page is clean, then no writes were
committed to it since the moment it was marked clean.
However KVM has a kvm_dirty_log_test selftest, a test that tests both
the clean and the dirty pages vs the memory contents, and can fail if it
detects a dirty page which has an old value at the offset 0 which the test
writes.
To avoid failure, the test has a workaround for this specific problem:
The test skips checking memory that belongs to the last dirty ring entry,
which it has seen, relying on the fact that as long as memory writes are
committed in-order, only the last entry can belong to a not yet committed
memory write.
However, since L1's KVM is reading the PML log in the opposite direction
that L0 wrote it, the last dirty ring entry often will be not the last
entry written by the L0.
To fix this, switch the order in which KVM reads the PML log.
Note that this issue is not present on the bare metal, because on the
bare metal, an update of the A/D bits of a present entry, PML logging and
the actual memory write are all done by the CPU without any hypervisor
intervention and pending interrupt evaluation, thus once a PML log and/or
vCPU kick happens, all memory writes that are in the PML log are
committed to memory.
The only exception to this rule is when the guest hits a not present EPT
entry, in which case KVM first reads (backward) the PML log, dumps it to
the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
this to the dirty ring, thus making the entry be the last one in the
dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
arch/x86/kvm/vmx/vmx.h | 1 +
2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f008f5ef6f0..6fb946b58a75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u64 *pml_buf;
- u16 pml_idx;
+ u16 pml_idx, pml_last_written_entry;
+ int i;
pml_idx = vmcs_read16(GUEST_PML_INDEX);
/* Do nothing if PML buffer is empty */
- if (pml_idx == (PML_ENTITY_NUM - 1))
+ if (pml_idx == PML_LAST_ENTRY)
return;
+ /*
+ * PML index always points to the next available PML buffer entity
+ * unless PML log has just overflowed, in which case PML index will be
+ * 0xFFFF.
+ */
+ pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
- /* PML index always points to next available PML buffer entity */
- if (pml_idx >= PML_ENTITY_NUM)
- pml_idx = 0;
- else
- pml_idx++;
-
+ /*
+ * PML log is written backwards: the CPU first writes the entity 511
+ * then the entity 510, and so on, until it writes the entity 0 at which
+ * point the PML log full VM exit happens and the logging stops.
+ *
+ * Read the entries in the same order they were written, to ensure that
+ * the dirty ring is filled in the same order the CPU wrote them.
+ */
pml_buf = page_address(vmx->pml_pg);
- for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
+
+ for (i = PML_LAST_ENTRY; i >= pml_last_written_entry; i--) {
u64 gpa;
- gpa = pml_buf[pml_idx];
+ gpa = pml_buf[i];
WARN_ON(gpa & (PAGE_SIZE - 1));
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
}
/* reset PML index */
- vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+ vmcs_write16(GUEST_PML_INDEX, PML_LAST_ENTRY);
}
static void vmx_dump_sel(char *name, uint32_t sel)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 43f573f6ca46..d14401c8e499 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -331,6 +331,7 @@ struct vcpu_vmx {
/* Support for PML */
#define PML_ENTITY_NUM 512
+#define PML_LAST_ENTRY (PML_ENTITY_NUM - 1)
struct page *pml_pg;
/* apic deadline value in host tsc */
--
2.26.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/4] KVM: selftests: dirty_log_test: Limit s390x workaround to s390x
2024-12-11 19:37 [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested Maxim Levitsky
2024-12-11 19:37 ` [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written Maxim Levitsky
@ 2024-12-11 19:37 ` Maxim Levitsky
2024-12-11 19:37 ` [PATCH 3/4] KVM: selftests: dirty_log_test: run the guest until some dirty ring entries were harvested Maxim Levitsky
2024-12-11 19:37 ` [PATCH 4/4] KVM: selftests: dirty_log_test: support multiple write retries Maxim Levitsky
3 siblings, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-11 19:37 UTC (permalink / raw)
To: kvm
Cc: linux-kselftest, Sean Christopherson, Paolo Bonzini, linux-kernel,
x86, Maxim Levitsky
The s390x-specific workaround causes the dirty-log mode of the test to dirty
all of the guest memory on the first iteration, which is very slow when the
test runs nested.
Limit this workaround to s390x.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
tools/testing/selftests/kvm/dirty_log_test.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index aacf80f57439..f60e2aceeae0 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -98,6 +98,7 @@ static void guest_code(void)
uint64_t addr;
int i;
+#ifdef __s390x__
/*
* On s390x, all pages of a 1M segment are initially marked as dirty
* when a page of the segment is written to for the very first time.
@@ -108,6 +109,7 @@ static void guest_code(void)
addr = guest_test_virt_mem + i * guest_page_size;
vcpu_arch_put_guest(*(uint64_t *)addr, READ_ONCE(iteration));
}
+#endif
while (true) {
for (i = 0; i < TEST_PAGES_PER_LOOP; i++) {
--
2.26.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/4] KVM: selftests: dirty_log_test: run the guest until some dirty ring entries were harvested
2024-12-11 19:37 [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested Maxim Levitsky
2024-12-11 19:37 ` [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written Maxim Levitsky
2024-12-11 19:37 ` [PATCH 2/4] KVM: selftests: dirty_log_test: Limit s390x workaround to s390x Maxim Levitsky
@ 2024-12-11 19:37 ` Maxim Levitsky
2024-12-11 19:37 ` [PATCH 4/4] KVM: selftests: dirty_log_test: support multiple write retries Maxim Levitsky
3 siblings, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-11 19:37 UTC (permalink / raw)
To: kvm
Cc: linux-kselftest, Sean Christopherson, Paolo Bonzini, linux-kernel,
x86, Maxim Levitsky
When the dirty_log_test is run nested, due to the very slow nature of the
environment, it is possible to reach a situation in which no entries were
harvested from the dirty ring, which breaks the test logic.
Detect this case and just let the guest run a bit more, until the test
harvests some entries from the dirty ring.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
tools/testing/selftests/kvm/dirty_log_test.c | 25 +++++++++++++-------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index f60e2aceeae0..a9428076a681 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -227,19 +227,21 @@ static void clear_log_create_vm_done(struct kvm_vm *vm)
vm_enable_cap(vm, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, manual_caps);
}
-static void dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool dirty_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
void *bitmap, uint32_t num_pages,
uint32_t *unused)
{
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
+ return true;
}
-static void clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool clear_log_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
void *bitmap, uint32_t num_pages,
uint32_t *unused)
{
kvm_vm_get_dirty_log(vcpu->vm, slot, bitmap);
kvm_vm_clear_dirty_log(vcpu->vm, slot, bitmap, 0, num_pages);
+ return true;
}
/* Should only be called after a GUEST_SYNC */
@@ -350,7 +352,7 @@ static void dirty_ring_continue_vcpu(void)
sem_post(&sem_vcpu_cont);
}
-static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
void *bitmap, uint32_t num_pages,
uint32_t *ring_buf_idx)
{
@@ -373,6 +375,10 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
slot, bitmap, num_pages,
ring_buf_idx);
+ /* Retry if no entries were collected */
+ if (count == 0)
+ return false;
+
cleared = kvm_vm_reset_dirty_ring(vcpu->vm);
/*
@@ -389,6 +395,7 @@ static void dirty_ring_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
}
pr_info("Iteration %ld collected %u pages\n", iteration, count);
+ return true;
}
static void dirty_ring_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
@@ -424,7 +431,7 @@ struct log_mode {
/* Hook when the vm creation is done (before vcpu creation) */
void (*create_vm_done)(struct kvm_vm *vm);
/* Hook to collect the dirty pages into the bitmap provided */
- void (*collect_dirty_pages) (struct kvm_vcpu *vcpu, int slot,
+ bool (*collect_dirty_pages)(struct kvm_vcpu *vcpu, int slot,
void *bitmap, uint32_t num_pages,
uint32_t *ring_buf_idx);
/* Hook to call when after each vcpu run */
@@ -488,7 +495,7 @@ static void log_mode_create_vm_done(struct kvm_vm *vm)
mode->create_vm_done(vm);
}
-static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
+static bool log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
void *bitmap, uint32_t num_pages,
uint32_t *ring_buf_idx)
{
@@ -496,7 +503,7 @@ static void log_mode_collect_dirty_pages(struct kvm_vcpu *vcpu, int slot,
TEST_ASSERT(mode->collect_dirty_pages != NULL,
"collect_dirty_pages() is required for any log mode!");
- mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx);
+ return mode->collect_dirty_pages(vcpu, slot, bitmap, num_pages, ring_buf_idx);
}
static void log_mode_after_vcpu_run(struct kvm_vcpu *vcpu, int ret, int err)
@@ -783,9 +790,11 @@ static void run_test(enum vm_guest_mode mode, void *arg)
while (iteration < p->iterations) {
/* Give the vcpu thread some time to dirty some pages */
usleep(p->interval * 1000);
- log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX,
+
+ if (!log_mode_collect_dirty_pages(vcpu, TEST_MEM_SLOT_INDEX,
bmap, host_num_pages,
- &ring_buf_idx);
+ &ring_buf_idx))
+ continue;
/*
* See vcpu_sync_stop_requested definition for details on why
--
2.26.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 4/4] KVM: selftests: dirty_log_test: support multiple write retries
2024-12-11 19:37 [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested Maxim Levitsky
` (2 preceding siblings ...)
2024-12-11 19:37 ` [PATCH 3/4] KVM: selftests: dirty_log_test: run the guest until some dirty ring entries were harvested Maxim Levitsky
@ 2024-12-11 19:37 ` Maxim Levitsky
3 siblings, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-11 19:37 UTC (permalink / raw)
To: kvm
Cc: linux-kselftest, Sean Christopherson, Paolo Bonzini, linux-kernel,
x86, Maxim Levitsky
If dirty_log_test is run nested, it is possible for entries in the emulated
PML log to appear before the actual memory write is committed to RAM, due
to the way KVM retries memory writes in response to an MMU fault.
In addition, in some very rare cases the retry can happen more than once,
which leads to a test failure, because once the write is finally committed
it may carry a very outdated iteration value.
Detect and avoid this case.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
tools/testing/selftests/kvm/dirty_log_test.c | 52 +++++++++++++++++++-
1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index a9428076a681..f07126b0205d 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -154,6 +154,7 @@ static atomic_t vcpu_sync_stop_requested;
* sem_vcpu_stop and before vcpu continues to run.
*/
static bool dirty_ring_vcpu_ring_full;
+
/*
* This is only used for verifying the dirty pages. Dirty ring has a very
* tricky case when the ring just got full, kvm will do userspace exit due to
@@ -168,7 +169,51 @@ static bool dirty_ring_vcpu_ring_full;
* dirty gfn we've collected, so that if a mismatch of data found later in the
* verifying process, we let it pass.
*/
-static uint64_t dirty_ring_last_page;
+static uint64_t dirty_ring_last_page = -1ULL;
+
+/*
+ * In addition to the above, it is possible (especially if this
+ * test is run nested) for the above scenario to repeat multiple times:
+ *
+ * The following can happen:
+ *
+ * - L1 vCPU: Memory write is logged to PML but not committed.
+ *
+ * - L1 test thread: Ignores the write because it is the last dirty ring entry.
+ * Resets the dirty ring which:
+ * - Resets the A/D bits in EPT
+ * - Issues tlb flush (invept), which is intercepted by L0
+ *
+ * - L0: frees the whole nested ept mmu root as the response to invept,
+ * and thus ensures that when memory write is retried, it will fault again
+ *
+ * - L1 vCPU: Same memory write is logged to the PML but not committed again.
+ *
+ * - L1 test thread: Ignores the write because it is the last dirty ring entry (again).
+ * Resets the dirty ring which:
+ * - Resets the A/D bits in EPT (again)
+ * - Issues tlb flush (again) which is intercepted by L0
+ *
+ * ...
+ *
+ * N times
+ *
+ * - L1 vCPU: Memory write is logged in the PML and then committed.
+ * Lots of other memory writes are logged and committed.
+ * ...
+ *
+ * - L1 test thread: Sees the memory write along with other memory writes
+ * in the dirty ring, and since the write is usually not
+ * the last entry in the dirty-ring and has a very outdated
+ * iteration, the test fails.
+ *
+ *
+ * Note that this is only possible when the write was the last log entry
+ * written during iteration N-1, thus remember the last log entry of the
+ * previous iteration, and also don't fail when it is reported in the next
+ * iteration together with an outdated iteration count.
+ */
+static uint64_t dirty_ring_prev_iteration_last_page;
enum log_mode_t {
/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -320,6 +365,8 @@ static uint32_t dirty_ring_collect_one(struct kvm_dirty_gfn *dirty_gfns,
struct kvm_dirty_gfn *cur;
uint32_t count = 0;
+ dirty_ring_prev_iteration_last_page = dirty_ring_last_page;
+
while (true) {
cur = &dirty_gfns[*fetch_index % test_dirty_ring_count];
if (!dirty_gfn_is_dirtied(cur))
@@ -622,7 +669,8 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
*/
min_iter = iteration - 1;
continue;
- } else if (page == dirty_ring_last_page) {
+ } else if (page == dirty_ring_last_page ||
+ page == dirty_ring_prev_iteration_last_page) {
/*
* Please refer to comments in
* dirty_ring_last_page.
--
2.26.3
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-11 19:37 ` [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written Maxim Levitsky
@ 2024-12-12 0:44 ` Sean Christopherson
2024-12-12 21:37 ` Maxim Levitsky
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2024-12-12 0:44 UTC (permalink / raw)
To: Maxim Levitsky; +Cc: kvm, linux-kselftest, Paolo Bonzini, linux-kernel, x86
On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> X86 spec specifies that the CPU writes to the PML log 'backwards'
SDM, because this is Intel specific.
> or in other words, it first writes entry 511, then entry 510 and so on,
> until it writes entry 0, after which the 'PML log full' VM exit happens.
>
> I also confirmed on the bare metal that the CPU indeed writes the entries
> in this order.
>
> KVM on the other hand, reads the entries in the opposite order, from the
> last written entry and towards entry 511 and dumps them in this order to
> the dirty ring.
>
> Usually this doesn't matter, except for one complex nesting case:
>
> KVM retries the instructions that cause MMU faults.
> This might cause an emulated PML log entry to be visible to L1's hypervisor
> before the actual memory write was committed.
>
> This happens when the L0 MMU fault is followed directly by the VM exit to
> L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> event.
Hmm, this is an L0 bug. Exiting to L1 to deliver a pending IRQ in the middle of an
instruction is a blatant architectural violation. As discussed in the RSM =>
SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
needs to enter the guest before generating an exit to L1.
Oof. It's probably worse than that. For this case, KVM would need to ensure the
original instruction *completed*. That would get really, really ugly. And for
something like VSCATTER, where each write can be completed independently, trying
to do the right thing for PML would be absurdly complex.
I'm not opposed to fudging around processing the PML log in the "correct" order,
because irrespective of this bug, populating the dirty ring using order in which
accesses occurred is probably a good idea.
But, I can't help but wonder why KVM bothers emulating PML. I can appreciate
that avoiding exits to L1 would be beneficial, but what use case actually cares
about dirty logging performance in L1?
[*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com
> This problem doesn't have a noticeable real-world impact because this
> write retry is not much different from the guest writing to the same page
> multiple times, which is also not reflected in the dirty log. The users of
> the dirty logging only rely on correct reporting of the clean pages, or
> in other words they assume that if a page is clean, then no writes were
> committed to it since the moment it was marked clean.
>
> However KVM has a kvm_dirty_log_test selftest, a test that tests both
> the clean and the dirty pages vs the memory contents, and can fail if it
> detects a dirty page which has an old value at the offset 0 which the test
> writes.
>
> To avoid failure, the test has a workaround for this specific problem:
>
> The test skips checking memory that belongs to the last dirty ring entry,
> which it has seen, relying on the fact that as long as memory writes are
> committed in-order, only the last entry can belong to a not yet committed
> memory write.
>
> However, since L1's KVM is reading the PML log in the opposite direction
> that L0 wrote it, the last dirty ring entry often will be not the last
> entry written by the L0.
>
> To fix this, switch the order in which KVM reads the PML log.
>
> Note that this issue is not present on the bare metal, because on the
> bare metal, an update of the A/D bits of a present entry, PML logging and
> the actual memory write are all done by the CPU without any hypervisor
> intervention and pending interrupt evaluation, thus once a PML log and/or
> vCPU kick happens, all memory writes that are in the PML log are
> committed to memory.
>
> The only exception to this rule is when the guest hits a not present EPT
> entry, in which case KVM first reads (backward) the PML log, dumps it to
> the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> this to the dirty ring, thus making the entry be the last one in the
> dirty ring.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
> arch/x86/kvm/vmx/vmx.h | 1 +
> 2 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0f008f5ef6f0..6fb946b58a75 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> u64 *pml_buf;
> - u16 pml_idx;
> + u16 pml_idx, pml_last_written_entry;
> + int i;
>
> pml_idx = vmcs_read16(GUEST_PML_INDEX);
>
> /* Do nothing if PML buffer is empty */
> - if (pml_idx == (PML_ENTITY_NUM - 1))
> + if (pml_idx == PML_LAST_ENTRY)
Heh, this is mildly confusing, in that the first entry filled is actually called
the "last" entry by KVM. And then below, pml_last_written_entry could point at
the first entry.
The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
becomes pml_tail_index. It's not a circular buffer, but I think/hope head/tail
terminology would be intuitive for most readers.
E.g. the for-loop becomes:
for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
u64 gpa;
gpa = pml_buf[i];
WARN_ON(gpa & (PAGE_SIZE - 1));
kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
}
> return;
> + /*
> + * PML index always points to the next available PML buffer entity
> + * unless PML log has just overflowed, in which case PML index will be
If you don't have a strong preference, I vote to do s/entity/entry and then rename
PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?). I find the
existing "entity" terminology weird and unhelpful, and arguably wrong.
entity - a thing with distinct and independent existence.
The things being consumed are entries in a buffer.
> + * 0xFFFF.
> + */
> + pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
>
> - /* PML index always points to next available PML buffer entity */
> - if (pml_idx >= PML_ENTITY_NUM)
> - pml_idx = 0;
> - else
> - pml_idx++;
> -
> + /*
> + * PML log is written backwards: the CPU first writes the entity 511
> + * then the entity 510, and so on, until it writes the entity 0 at which
> + * point the PML log full VM exit happens and the logging stops.
This is technically wrong. The PML Full exit only occurs on the next write.
E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.
  If the PML index is not in the range 0–511, there is a page-modification
  log-full event and a VM exit occurs. In this case, the accessed or dirty
  flag is not set, and the guest-physical access that triggered the event
  does not occur.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-12 0:44 ` Sean Christopherson
@ 2024-12-12 21:37 ` Maxim Levitsky
2024-12-13 6:19 ` Sean Christopherson
0 siblings, 1 reply; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-12 21:37 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kselftest, Paolo Bonzini, linux-kernel, x86
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> > X86 spec specifies that the CPU writes to the PML log 'backwards'
>
> SDM, because this is Intel specific.
True.
>
> > or in other words, it first writes entry 511, then entry 510 and so on,
> > until it writes entry 0, after which the 'PML log full' VM exit happens.
> >
> > I also confirmed on the bare metal that the CPU indeed writes the entries
> > in this order.
> >
> > KVM on the other hand, reads the entries in the opposite order, from the
> > last written entry and towards entry 511 and dumps them in this order to
> > the dirty ring.
> >
> > Usually this doesn't matter, except for one complex nesting case:
> >
> > KVM retries the instructions that cause MMU faults.
> > This might cause an emulated PML log entry to be visible to L1's hypervisor
> > before the actual memory write was committed.
> >
> > This happens when the L0 MMU fault is followed directly by the VM exit to
> > L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> > event.
>
> Hmm, this is an L0 bug. Exiting to L1 to deliver a pending IRQ in the middle of an
> instruction is a blatant architectural violation. As discussed in the RSM =>
> SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
> needs to enter the guest before generating an exit to L1.
Agreed. Note that this is visible to some extent even when not nested; for
example, the workaround that the test already has is for the non-nested case.
One can argue that the dirty ring is not an x86 feature, but I am sure there
are other, more complicated cases in which a retried write can be observed by
this or other vCPUs in violation of the x86 spec.
>
> Oof. It's probably worse than that. For this case, KVM would need to ensure the
> original instruction *completed*. That would get really, really ugly. And for
> something like VSCATTER, where each write can be completed independently, trying
> to do the right thing for PML would be absurdly complex.
I also agree. Instruction retry is much simpler and safer than emulating the
instruction; KVM really can't stop doing this.
> I'm not opposed to fudging around processing the PML log in the "correct" order,
> because irrespective of this bug, populating the dirty ring using order in which
> accesses occurred is probably a good idea.
>
> But, I can't help but wonder why KVM bothers emulating PML. I can appreciate
> that avoiding exits to L1 would be beneficial, but what use case actually cares
> about dirty logging performance in L1?
It does help performance by a lot, and the emulated implementation is simple.
For example, with default parameters this test collects about 500 pages on
each iteration without PML, and about 2400 pages per iteration with PML
(after the caches warm up).
>
> [*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com
>
> > This problem doesn't have a noticeable real-world impact because this
> > write retry is not much different from the guest writing to the same page
> > multiple times, which is also not reflected in the dirty log. The users of
> > the dirty logging only rely on correct reporting of the clean pages, or
> > in other words they assume that if a page is clean, then no writes were
> > committed to it since the moment it was marked clean.
> >
> > However KVM has a kvm_dirty_log_test selftest, a test that tests both
> > the clean and the dirty pages vs the memory contents, and can fail if it
> > detects a dirty page which has an old value at the offset 0 which the test
> > writes.
> >
> > To avoid failure, the test has a workaround for this specific problem:
> >
> > The test skips checking memory that belongs to the last dirty ring entry,
> > which it has seen, relying on the fact that as long as memory writes are
> > committed in-order, only the last entry can belong to a not yet committed
> > memory write.
> >
> > However, since L1's KVM is reading the PML log in the opposite direction
> > that L0 wrote it, the last dirty ring entry often will be not the last
> > entry written by the L0.
> >
> > To fix this, switch the order in which KVM reads the PML log.
> >
> > Note that this issue is not present on the bare metal, because on the
> > bare metal, an update of the A/D bits of a present entry, PML logging and
> > the actual memory write are all done by the CPU without any hypervisor
> > intervention and pending interrupt evaluation, thus once a PML log and/or
> > vCPU kick happens, all memory writes that are in the PML log are
> > committed to memory.
> >
> > The only exception to this rule is when the guest hits a not present EPT
> > entry, in which case KVM first reads (backward) the PML log, dumps it to
> > the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> > this to the dirty ring, thus making the entry be the last one in the
> > dirty ring.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
> > arch/x86/kvm/vmx/vmx.h | 1 +
> > 2 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 0f008f5ef6f0..6fb946b58a75 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
> > {
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > u64 *pml_buf;
> > - u16 pml_idx;
> > + u16 pml_idx, pml_last_written_entry;
> > + int i;
> >
> > pml_idx = vmcs_read16(GUEST_PML_INDEX);
> >
> > /* Do nothing if PML buffer is empty */
> > - if (pml_idx == (PML_ENTITY_NUM - 1))
> > + if (pml_idx == PML_LAST_ENTRY)
>
> Heh, this is mildly confusing, in that the first entry filled is actually called
> the "last" entry by KVM. And then below, pml_list_written_entry could point at
> the first entry.
>
> The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
> becomes pml_tail_index. It's not a circular buffer, but I think/hope head/tail
> terminology would be intuitive for most readers.
I agree here. Your proposal does seem better to me, so I'll adopt it in v2.
>
> E.g. the for-loop becomes:
>
> 	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
> 		u64 gpa;
>
> 		gpa = pml_buf[i];
> 		WARN_ON(gpa & (PAGE_SIZE - 1));
> 		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> 	}
>
> > return;
> > + /*
> > + * PML index always points to the next available PML buffer entity
> > + * unless PML log has just overflowed, in which case PML index will be
>
> If you don't have a strong preference, I vote to do s/entity/entry and then rename
> PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?). I find the
> existing "entity" terminology weird and unhelpful, and arguably wrong.
I don't mind renaming this.
>
> entity - a thing with distinct and independent existence.
>
> The things being consumed are entries in a buffer.
>
> > + * 0xFFFF.
> > + */
> > + pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
> >
> > - /* PML index always points to next available PML buffer entity */
> > - if (pml_idx >= PML_ENTITY_NUM)
> > - pml_idx = 0;
> > - else
> > - pml_idx++;
> > -
> > + /*
> > + * PML log is written backwards: the CPU first writes the entity 511
> > + * then the entity 510, and so on, until it writes the entity 0 at which
> > + * point the PML log full VM exit happens and the logging stops.
>
> This is technically wrong. The PML Full exit only occurs on the next write.
> E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.
I skipped over this part of the PRM when I was reading upon how PML works.
I will drop this sentence in the next version.
>
> If the PML index is not in the range 0–511, there is a page-modification log-full
> event and a VM exit occurs. In this case, the accessed or dirty flag is not set,
> and the guest-physical access that triggered the event does not occur.
>
Do you have any comments for the rest of the patch series? If not then I'll send
v2 of the patch series.
Best regards,
Maxim Levitsky
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-12 21:37 ` Maxim Levitsky
@ 2024-12-13 6:19 ` Sean Christopherson
2024-12-13 19:56 ` Maxim Levitsky
0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2024-12-13 6:19 UTC (permalink / raw)
To: Maxim Levitsky; +Cc: kvm, linux-kselftest, Paolo Bonzini, linux-kernel, x86
On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > But, I can't help but wonder why KVM bothers emulating PML. I can appreciate
> > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > about dirty logging performance in L1?
>
> It does help performance by a lot, and the emulated implementation is simple.
Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get
that it helps with performance, I'm just stumped as to the use case for dirty
logging in a nested VM in the first place.
> Do you have any comments for the rest of the patch series? If not then I'll send
> v2 of the patch series.
*sigh*
I do. Through no fault of your own. I was trying to figure out a way to ensure
the vCPU made meaningful progress, versus just guaranteeing at least one write,
and stumbled onto a plethora of flaws and unnecessary complexity in the test.
Can you post this patch as a standalone v2? I'd like to do a more aggressive
cleanup of the selftest, but I don't want to hold this up, and there's no hard
dependency.
As for the issues I encountered with the selftest:
1. Tracing how many pages have been written for the current iteration with a
guest side counter doesn't work without more fixes, because the test doesn't
   collect all dirty entries for the current iteration. For the dirty ring,
this results in a vCPU *starting* an iteration with a full dirty ring, and
the test hangs because the guest can't make forward progress until
log_mode_collect_dirty_pages() is called.
2. The test presumably doesn't collect all dirty entries because of the weird
and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
synchronization that comes with it. The kick is "justified" with a comment
saying "This makes sure that hardware PML cache flushed", but there's no
   reason to do that *if* the test collects dirty pages *after* stopping
the vCPU. Which is easy to do while also collecting while the vCPU is
running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
wound of sorts).
3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
every other iteration runs until the ring is full. Testing the "soft full"
logic is interesting, but not _that_ interesting, and filling the dirty ring
basically ignores the "interval". Fixing this reduces the runtime by a
significant amount, especially on nested, at the cost of providing less
coverage for the dirty ring with default interval in a nested VM (but if
someone cares about testing the dirty ring soft full in a nested VM, they
can darn well bump the interval).
4. Fixing the test to collect all dirty entries for the current iteration
exposes another flaw. The bitmaps (not dirty ring) start with all bits
set. And so the first iteration can see "dirty" pages that have never
been written, but only when applying your fix to limit the hack to s390.
5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
guest could see a stale iteration if the main thread is delayed.
6. host_bmap_track and all of the weird exemptions for writes from previous
   iterations go away if all entries are collected for the current iteration
(though a second bitmap is needed to handle the second collection; KVM's
"get" of the bitmap clobbers the previous value).
I have everything more or less coded up, but I need to split it into patches,
write changelogs, and interleave it with your fixes. Hopefully I'll get to that
tomorrow.
* Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-13 6:19 ` Sean Christopherson
@ 2024-12-13 19:56 ` Maxim Levitsky
2024-12-13 20:31 ` Sean Christopherson
0 siblings, 1 reply; 10+ messages in thread
From: Maxim Levitsky @ 2024-12-13 19:56 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kselftest, Paolo Bonzini, linux-kernel, x86, Peter Xu
On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
> On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> > On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > > But, I can't help but wonder why KVM bothers emulating PML. I can appreciate
> > > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > > about dirty logging performance in L1?
> >
> > It does help performance by a lot, and the emulated implementation is simple.
>
> Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get
> that it helps with performance, I'm just stumped as to the use case for dirty
> logging in a nested VM in the first place.
>
> > Do you have any comments for the rest of the patch series? If not then I'll send
> > v2 of the patch series.
>
> *sigh*
>
> I do. Through no fault of your own. I was trying to figure out a way to ensure
> the vCPU made meaningful progress, versus just guaranteeing at least one write,
> and stumbled onto a plethora of flaws and unnecessary complexity in the test.
>
> Can you post this patch as a standalone v2? I'd like to do a more aggressive
> cleanup of the selftest, but I don't want to hold this up, and there's no hard
> dependency.
>
> As for the issues I encountered with the selftest:
>
> 1. Tracing how many pages have been written for the current iteration with a
> guest side counter doesn't work without more fixes, because the test doesn't
>    collect all dirty entries for the current iteration. For the dirty ring,
> this results in a vCPU *starting* an iteration with a full dirty ring, and
> the test hangs because the guest can't make forward progress until
> log_mode_collect_dirty_pages() is called.
>
> 2. The test presumably doesn't collect all dirty entries because of the weird
> and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
> synchronization that comes with it. The kick is "justified" with a comment
> saying "This makes sure that hardware PML cache flushed", but there's no
>    reason to do that *if* the test collects dirty pages *after* stopping
> the vCPU. Which is easy to do while also collecting while the vCPU is
> running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
> wound of sorts).
>
> 3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
> every other iteration runs until the ring is full. Testing the "soft full"
> logic is interesting, but not _that_ interesting, and filling the dirty ring
> basically ignores the "interval". Fixing this reduces the runtime by a
> significant amount, especially on nested, at the cost of providing less
> coverage for the dirty ring with default interval in a nested VM (but if
> someone cares about testing the dirty ring soft full in a nested VM, they
> can darn well bump the interval).
>
> 4. Fixing the test to collect all dirty entries for the current iteration
> exposes another flaw. The bitmaps (not dirty ring) start with all bits
> set. And so the first iteration can see "dirty" pages that have never
> been written, but only when applying your fix to limit the hack to s390.
>
> 5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
> guest could see a stale iteration if the main thread is delayed.
>
> 6. host_bmap_track and all of the weird exemptions for writes from previous
>    iterations go away if all entries are collected for the current iteration
> (though a second bitmap is needed to handle the second collection; KVM's
> "get" of the bitmap clobbers the previous value).
>
> I have everything more or less coded up, but I need to split it into patches,
> write changelogs, and interleave it with your fixes. Hopefully I'll get to that
> tomorrow.
>
Hi!
I will take a look at your patch series once you post it.
I also think that the logic in the test is somewhat broken, but then this also
serves as a way to cause as much havoc as possible.
The fact that not all dirty pages are collected is because the ring harvest happens
at the same time the guest continues dirtying the pages, adding more entries to the
ring, simulating what would happen during real-life migration.
Kicking the guest just before the ring harvest is also IMHO a good thing, as it
also simulates the IRQ load that would happen.
We could avoid kicking the guest if it is already stopped due to the dirty ring;
in fact, still kicking it delays the kick to the point where we resume the guest
and wait for it to stop again before we do the verify step, which often makes it
exit for a reason other than a log-full event.
I tried this, but it makes the test way less random, and the whole point of this
test is to cause as much havoc as possible.
I do think that we don't need to stop the guest during verify for the dirty-ring
case; this is probably code that only the dirty-bitmap part of the test needs.
I added Peter Xu to CC to hear his opinion about this as well.
Best regards,
Maxim Levitsky
* Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written
2024-12-13 19:56 ` Maxim Levitsky
@ 2024-12-13 20:31 ` Sean Christopherson
0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2024-12-13 20:31 UTC (permalink / raw)
To: Maxim Levitsky
Cc: kvm, linux-kselftest, Paolo Bonzini, linux-kernel, x86, Peter Xu
On Fri, Dec 13, 2024, Maxim Levitsky wrote:
> On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
> > On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> > > On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > > > But, I can't help but wonder why KVM bothers emulating PML. I can appreciate
> > > > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > > > about dirty logging performance in L1?
> > >
> > > It does help performance by a lot, and the emulated implementation is simple.
> >
> > Yeah, it's not a lot of complexity, but it's architecturally flawed. And I get
> > that it helps with performance, I'm just stumped as to the use case for dirty
> > logging in a nested VM in the first place.
> >
> > > Do you have any comments for the rest of the patch series? If not then I'll send
> > > v2 of the patch series.
> >
> > *sigh*
> >
> > I do. Through no fault of your own. I was trying to figure out a way to ensure
> > the vCPU made meaningful progress, versus just guaranteeing at least one write,
> > and stumbled onto a plethora of flaws and unnecessary complexity in the test.
> >
> > Can you post this patch as a standalone v2? I'd like to do a more agressive
> > cleanup of the selftest, but I don't want to hold this up, and there's no hard
> > dependency.
> >
> > As for the issues I encountered with the selftest:
> >
> > 1. Tracing how many pages have been written for the current iteration with a
> > guest side counter doesn't work without more fixes, because the test doesn't
> >    collect all dirty entries for the current iteration. For the dirty ring,
> > this results in a vCPU *starting* an iteration with a full dirty ring, and
> > the test hangs because the guest can't make forward progress until
> > log_mode_collect_dirty_pages() is called.
> >
> > 2. The test presumably doesn't collect all dirty entries because of the weird
> > and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
> > synchronization that comes with it. The kick is "justified" with a comment
> > saying "This makes sure that hardware PML cache flushed", but there's no
> >    reason to do that *if* the test collects dirty pages *after* stopping
> > the vCPU. Which is easy to do while also collecting while the vCPU is
> > running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
> > wound of sorts).
> >
> > 3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
> > every other iteration runs until the ring is full. Testing the "soft full"
> > logic is interesting, but not _that_ interesting, and filling the dirty ring
> > basically ignores the "interval". Fixing this reduces the runtime by a
> > significant amount, especially on nested, at the cost of providing less
> > coverage for the dirty ring with default interval in a nested VM (but if
> > someone cares about testing the dirty ring soft full in a nested VM, they
> > can darn well bump the interval).
> >
> > 4. Fixing the test to collect all dirty entries for the current iteration
> > exposes another flaw. The bitmaps (not dirty ring) start with all bits
> > set. And so the first iteration can see "dirty" pages that have never
> > been written, but only when applying your fix to limit the hack to s390.
> >
> > 5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
> > guest could see a stale iteration if the main thread is delayed.
> >
> > 6. host_bmap_track and all of the weird exemptions for writes from previous
> >    iterations go away if all entries are collected for the current iteration
> > (though a second bitmap is needed to handle the second collection; KVM's
> > "get" of the bitmap clobbers the previous value).
> >
> > I have everything more or less coded up, but I need to split it into patches,
> > write changelogs, and interleave it with your fixes. Hopefully I'll get to that
> > tomorrow.
> >
>
> Hi!
>
> I will take a look at your patch series once you post it.
> I also think that the logic in the test is somewhat broken, but then this also
> serves as a way to cause as much havoc as possible.
>
> The fact that not all dirty pages are collected is because the ring harvest happens
> at the same time the guest continues dirtying the pages, adding more entries to the
> ring, simulating what would happen during real-life migration.
But as above, that behavior is trivially easy to mimic even when collecting all
entries simply by playing nice with multiple collections per iteration.
> Kicking the guest just before the ring harvest is also IMHO a good thing, as it
> also simulates the IRQ load that would happen.
I am not at all convinced that's interesting. And *if* it's really truly all
that interesting, then the kick should be done for all flavors.
Unless the host is tickless, the vCPU will get interrupts from time to time, at
least for any decently large interval. The kick from the test itself adds an
absurd amount of complexity for no meaningful test coverage.
> We could avoid kicking the guest if it is already stopped due to the dirty ring;
> in fact, still kicking it delays the kick to the point where we resume the guest
> and wait for it to stop again before we do the verify step, which often makes it
> exit for a reason other than a log-full event.
>
> I tried this, but it makes the test way less random, and the whole point of this
> test is to cause as much havoc as possible.
I agree randomness is a good thing for testing, but this is more noise than
random/controlled chaos.
E.g. we can do _far_ better for large interval numbers. As is, collecting _once_
per iteration means the vCPU is all but guaranteed to stall on a dirty ring for
any decently large interval.
And conversely, not honoring the "stop" request means every other iteration is all
but guaranteed to fill the dirty ring, even for small intervals.
> I do think that we don't need to stop the guest during verify for the
> dirty-ring case; this is probably code that only the dirty-bitmap part of the
> test needs.
Not stopping the vCPU would reduce test coverage (which is one of my complaints
with not fully harvesting the dirty entries). If KVM misses a dirty log event
on iteration X, and the vCPU also writes the same page in iteration X+1, then the
test will get a false pass because iteration X+1 will see the page as dirty and
think all is well.
end of thread, other threads:[~2024-12-13 20:31 UTC | newest]
Thread overview: 10+ messages
2024-12-11 19:37 [PATCH 0/4] KVM: selftests: dirty_log_test: fixes for running the test nested Maxim Levitsky
2024-12-11 19:37 ` [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written Maxim Levitsky
2024-12-12 0:44 ` Sean Christopherson
2024-12-12 21:37 ` Maxim Levitsky
2024-12-13 6:19 ` Sean Christopherson
2024-12-13 19:56 ` Maxim Levitsky
2024-12-13 20:31 ` Sean Christopherson
2024-12-11 19:37 ` [PATCH 2/4] KVM: selftests: dirty_log_test: Limit s390x workaround to s390x Maxim Levitsky
2024-12-11 19:37 ` [PATCH 3/4] KVM: selftests: dirty_log_test: run the guest until some dirty ring entries were harvested Maxim Levitsky
2024-12-11 19:37 ` [PATCH 4/4] KVM: selftests: dirty_log_test: support multiple write retires Maxim Levitsky