* [PATCH 0/5] KVM: Dirty ring fixes and cleanups
@ 2025-01-11  1:04 Sean Christopherson
  2025-01-11  1:04 ` [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Yan's series to fix a flaw where a buggy/malicious userspace could coerce
KVM into corrupting TDX SPTEs[*] got me wondering what other bad things userspace
could do by writing into the dirty ring...

The main issue is that KVM doesn't bound the processing of harvested entries
in any way, which allows userspace to keep KVM in a tight loop indefinitely.

E.g.

        struct kvm_dirty_gfn *dirty_gfns = vcpu_map_dirty_ring(vcpu);

        if (fork()) {
                int r;

                /* Parent: reset the ring as fast as KVM will allow. */
                for (;;) {
                        r = kvm_vm_reset_dirty_ring(vcpu->vm);
                        if (r)
                                printf("RESET %d dirty ring entries\n", r);
                }
        } else {
                int i;

                /* Child: populate every entry in the ring once... */
                for (i = 0; i < test_dirty_ring_count; i++) {
                        dirty_gfns[i].slot = TEST_MEM_SLOT_INDEX;
                        dirty_gfns[i].offset = (i * 64) % host_num_pages;
                }

                /* ...then endlessly re-flag them all as harvested. */
                for (;;) {
                        for (i = 0; i < test_dirty_ring_count; i++)
                                WRITE_ONCE(dirty_gfns[i].flags, KVM_DIRTY_GFN_F_RESET);
                }
        }

Patches 1-3 address that class of bugs.  Patches 4 and 5 are cleanups.

[*] https://lore.kernel.org/all/20241220082027.15851-1-yan.y.zhao@intel.com

Sean Christopherson (5):
  KVM: Bound the number of dirty ring entries in a single reset at
    INT_MAX
  KVM: Bail from the dirty ring reset flow if a signal is pending
  KVM: Conditionally reschedule when resetting the dirty ring
  KVM: Check for empty mask of harvested dirty ring entries in caller
  KVM: Use mask of harvested dirty ring entries to coalesce dirty ring
    resets

 include/linux/kvm_dirty_ring.h |  8 ++--
 virt/kvm/dirty_ring.c          | 88 +++++++++++++++++++++-------------
 virt/kvm/kvm_main.c            |  9 ++--
 3 files changed, 66 insertions(+), 39 deletions(-)


base-commit: 10485c4bc3caad3e93a6a4e99003e8ffffcd826a
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
@ 2025-01-11  1:04 ` Sean Christopherson
  2025-01-13  6:48   ` Yan Zhao
  2025-01-11  1:04 ` [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Cap the number of ring entries that are reset in a single ioctl to INT_MAX
to ensure userspace isn't confused by a wrap into negative space, and so
that, in a truly pathological scenario, KVM doesn't miss a TLB flush due
to the count wrapping to zero.  While the size of the ring is fixed at
0x10000 entries and KVM (currently) supports at most 4096, userspace is
allowed to harvest entries from the ring while the reset is in-progress,
i.e. it's possible for the ring to always have harvested entries.
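
For illustration, a standalone userspace sketch of the hazard (not KVM
code; the narrowing conversions are implementation-defined, but wrap as
two's-complement on every configuration KVM runs on):

	#include <limits.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Model an unbounded running count of reset entries. */
		uint64_t entries_reset = (uint64_t)INT_MAX + 1;

		/* Viewed through the ioctl's "int" return value... */
		printf("%d\n", (int)(uint32_t)entries_reset);	/* -2147483648 */

		/*
		 * After 2^32 entries, the count passes through zero, and a
		 * "flush TLBs only if the count is non-zero" check is
		 * silently skipped.
		 */
		entries_reset = (uint64_t)1 << 32;
		printf("%d\n", (int)(uint32_t)entries_reset);	/* 0 */
		return 0;
	}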

Opportunistically return an actual error code from the helper so that a
future fix to handle pending signals can gracefully return -EINTR.

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_dirty_ring.h |  8 +++++---
 virt/kvm/dirty_ring.c          | 10 +++++-----
 virt/kvm/kvm_main.c            |  9 ++++++---
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index 4862c98d80d3..82829243029d 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -49,9 +49,10 @@ static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
 }
 
 static inline int kvm_dirty_ring_reset(struct kvm *kvm,
-				       struct kvm_dirty_ring *ring)
+				       struct kvm_dirty_ring *ring,
+				       int *nr_entries_reset)
 {
-	return 0;
+	return -ENOENT;
 }
 
 static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
@@ -81,7 +82,8 @@ int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
  * called with kvm->slots_lock held, returns the number of
  * processed pages.
  */
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
+			 int *nr_entries_reset);
 
 /*
  * returns =0: successfully pushed
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 7bc74969a819..2faf894dec5a 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -104,19 +104,19 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }
 
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
+			 int *nr_entries_reset)
 {
 	u32 cur_slot, next_slot;
 	u64 cur_offset, next_offset;
 	unsigned long mask;
-	int count = 0;
 	struct kvm_dirty_gfn *entry;
 	bool first_round = true;
 
 	/* This is only needed to make compilers happy */
 	cur_slot = cur_offset = mask = 0;
 
-	while (true) {
+	while (likely((*nr_entries_reset) < INT_MAX)) {
 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
 
 		if (!kvm_dirty_gfn_harvested(entry))
@@ -129,7 +129,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 		kvm_dirty_gfn_set_invalid(entry);
 
 		ring->reset_index++;
-		count++;
+		(*nr_entries_reset)++;
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
@@ -166,7 +166,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 	trace_kvm_dirty_ring_reset(ring);
 
-	return count;
+	return 0;
 }
 
 void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9d54473d18e3..2d63b4d46ccb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4877,15 +4877,18 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 {
 	unsigned long i;
 	struct kvm_vcpu *vcpu;
-	int cleared = 0;
+	int cleared = 0, r;
 
 	if (!kvm->dirty_ring_size)
 		return -EINVAL;
 
 	mutex_lock(&kvm->slots_lock);
 
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		r = kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring, &cleared);
+		if (r)
+			break;
+	}
 
 	mutex_unlock(&kvm->slots_lock);
 
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
  2025-01-11  1:04 ` [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
@ 2025-01-11  1:04 ` Sean Christopherson
  2025-01-13  9:31   ` Yan Zhao
  2025-01-11  1:04 ` [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Abort a dirty ring reset if the current task has a pending signal, as the
hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
in a timely fashion.

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 2faf894dec5a..a81ad17d5eef 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -117,6 +117,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 	cur_slot = cur_offset = mask = 0;
 
 	while (likely((*nr_entries_reset) < INT_MAX)) {
+		if (signal_pending(current))
+			return -EINTR;
+
 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
 
 		if (!kvm_dirty_gfn_harvested(entry))
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
  2025-01-11  1:04 ` [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
  2025-01-11  1:04 ` [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
@ 2025-01-11  1:04 ` Sean Christopherson
  2025-01-13  7:04   ` Yan Zhao
  2025-01-11  1:04 ` [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

When resetting a dirty ring, conditionally reschedule on each iteration
after the first.  The recently introduced hard limit mitigates the issue
of an endless reset, but isn't sufficient to completely prevent RCU
stalls, soft lockups, etc., nor is the hard limit intended to guard
against such badness.

Note!  Take care to check for reschedule even in the "continue" paths,
as a pathological scenario (or malicious userspace) could dirty the same
gfn over and over, i.e. always hit the continue path.

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
  rcu: 	(t=5250 jiffies g=-319 q=608 ncpus=24)
  CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index a81ad17d5eef..37eb2b7142bd 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 		ring->reset_index++;
 		(*nr_entries_reset)++;
+
+		/*
+		 * While the size of each ring is fixed, it's possible for the
+		 * ring to be constantly re-dirtied/harvested while the reset
+		 * is in-progress (the hard limit exists only to guard against
+		 * wrapping the count into negative space).
+		 */
+		if (!first_round)
+			cond_resched();
+
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-01-11  1:04 ` [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
@ 2025-01-11  1:04 ` Sean Christopherson
  2025-01-13 10:30   ` Yan Zhao
  2025-01-11  1:04 ` [PATCH 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
  2025-01-13  9:51 ` [PATCH 0/5] KVM: Dirty ring fixes and cleanups Yan Zhao
  5 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

When resetting a dirty ring, explicitly check that there is work to be
done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries
are found and/or on the loop's first iteration, and delete the extremely
misleading comment "This is only needed to make compilers happy".  KVM
absolutely relies on mask to be zero-initialized, i.e. the comment is an
outright lie.  Furthermore, the compiler is right to complain that KVM is
calling a function with uninitialized data, as there are no guarantees
the implementation details of kvm_reset_dirty_gfn() will be visible to
kvm_dirty_ring_reset().

While the flaw could be fixed by simply deleting (or rewording) the
comment, and duplicating the check is unfortunate, checking mask in the
caller will allow for additional cleanups.

Opportunistically drop the zero-initialization of cur_slot and cur_offset.
If a bug were introduced where either the slot or offset was consumed
before mask is set to a non-zero value, then it is highly desirable for
the compiler (or some other sanitizer) to yell.

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 37eb2b7142bd..95ab0e3cf9da 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -55,9 +55,6 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	struct kvm_memory_slot *memslot;
 	int as_id, id;
 
-	if (!mask)
-		return;
-
 	as_id = slot >> 16;
 	id = (u16)slot;
 
@@ -109,13 +106,10 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 {
 	u32 cur_slot, next_slot;
 	u64 cur_offset, next_offset;
-	unsigned long mask;
+	unsigned long mask = 0;
 	struct kvm_dirty_gfn *entry;
 	bool first_round = true;
 
-	/* This is only needed to make compilers happy */
-	cur_slot = cur_offset = mask = 0;
-
 	while (likely((*nr_entries_reset) < INT_MAX)) {
 		if (signal_pending(current))
 			return -EINTR;
@@ -163,14 +157,31 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 				continue;
 			}
 		}
-		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+		/*
+		 * Reset the slot for all the harvested entries that have been
+		 * gathered, but not yet fully processed.
+		 */
+		if (mask)
+			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+		/*
+		 * The current slot was reset or this is the first harvested
+		 * entry, (re)initialize the metadata.
+		 */
 		cur_slot = next_slot;
 		cur_offset = next_offset;
 		mask = 1;
 		first_round = false;
 	}
 
-	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+	/*
+	 * Perform a final reset if there are harvested entries that haven't
+	 * been processed. The loop only performs a reset when an entry can't
+	 * be coalesced, i.e. always leaves at least one entry pending.
+	 */
+	if (mask)
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
 
 	/*
 	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
-- 
2.47.1.613.gc27f4b7a9f-goog



* [PATCH 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-01-11  1:04 ` [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
@ 2025-01-11  1:04 ` Sean Christopherson
  2025-01-13  9:51 ` [PATCH 0/5] KVM: Dirty ring fixes and cleanups Yan Zhao
  5 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Use "mask" instead of a dedicated boolean to track whether or not there
is at least one to-be-reset entry for the current slot+offset.  In the
body of the loop, mask is zero only on the first iteration, i.e. !mask is
equivalent to first_round.

Opportunistically combine the adjacent "if (mask)" statements into a single
if-statement.

No functional change intended.

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 60 +++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 95ab0e3cf9da..9b23f86ff7b6 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -108,7 +108,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 	u64 cur_offset, next_offset;
 	unsigned long mask = 0;
 	struct kvm_dirty_gfn *entry;
-	bool first_round = true;
 
 	while (likely((*nr_entries_reset) < INT_MAX)) {
 		if (signal_pending(current))
@@ -128,42 +127,42 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 		ring->reset_index++;
 		(*nr_entries_reset)++;
 
-		/*
-		 * While the size of each ring is fixed, it's possible for the
-		 * ring to be constantly re-dirtied/harvested while the reset
-		 * is in-progress (the hard limit exists only to guard against
-		 * wrapping the count into negative space).
-		 */
-		if (!first_round)
+		if (mask) {
+			/*
+			 * While the size of each ring is fixed, it's possible
+			 * for the ring to be constantly re-dirtied/harvested
+			 * while the reset is in-progress (the hard limit exists
+			 * only to guard against the count becoming negative).
+			 */
 			cond_resched();
 
-		/*
-		 * Try to coalesce the reset operations when the guest is
-		 * scanning pages in the same slot.
-		 */
-		if (!first_round && next_slot == cur_slot) {
-			s64 delta = next_offset - cur_offset;
+			/*
+			 * Try to coalesce the reset operations when the guest
+			 * is scanning pages in the same slot.
+			 */
+			if (next_slot == cur_slot) {
+				s64 delta = next_offset - cur_offset;
 
-			if (delta >= 0 && delta < BITS_PER_LONG) {
-				mask |= 1ull << delta;
-				continue;
-			}
+				if (delta >= 0 && delta < BITS_PER_LONG) {
+					mask |= 1ull << delta;
+					continue;
+				}
 
-			/* Backwards visit, careful about overflows!  */
-			if (delta > -BITS_PER_LONG && delta < 0 &&
-			    (mask << -delta >> -delta) == mask) {
-				cur_offset = next_offset;
-				mask = (mask << -delta) | 1;
-				continue;
+				/* Backwards visit, careful about overflows! */
+				if (delta > -BITS_PER_LONG && delta < 0 &&
+				    (mask << -delta >> -delta) == mask) {
+					cur_offset = next_offset;
+					mask = (mask << -delta) | 1;
+					continue;
+				}
 			}
-		}
 
-		/*
-		 * Reset the slot for all the harvested entries that have been
-		 * gathered, but not yet fully processed.
-		 */
-		if (mask)
+			/*
+			 * Reset the slot for all the harvested entries that
+			 * have been gathered, but not yet fully processed.
+			 */
 			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		}
 
 		/*
 		 * The current slot was reset or this is the first harvested
@@ -172,7 +171,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 		cur_slot = next_slot;
 		cur_offset = next_offset;
 		mask = 1;
-		first_round = false;
 	}
 
 	/*
-- 
2.47.1.613.gc27f4b7a9f-goog



* Re: [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX
  2025-01-11  1:04 ` [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
@ 2025-01-13  6:48   ` Yan Zhao
  2025-01-13  6:57     ` Yan Zhao
  0 siblings, 1 reply; 19+ messages in thread
From: Yan Zhao @ 2025-01-13  6:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Fri, Jan 10, 2025 at 05:04:05PM -0800, Sean Christopherson wrote:
> Cap the number of ring entries that are reset in a single ioctl to INT_MAX
> to ensure userspace isn't confused by a wrap into negative space, and so
> that, in a truly pathological scenario, KVM doesn't miss a TLB flush due
> to the count wrapping to zero.  While the size of the ring is fixed at
> 0x10000 entries and KVM (currently) supports at most 4096, userspace is
> allowed to harvest entries from the ring while the reset is in-progress,
> i.e. it's possible for the ring to always have harvested entries.
> 
> Opportunistically return an actual error code from the helper so that a
> future fix to handle pending signals can gracefully return -EINTR.
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_dirty_ring.h |  8 +++++---
>  virt/kvm/dirty_ring.c          | 10 +++++-----
>  virt/kvm/kvm_main.c            |  9 ++++++---
>  3 files changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> index 4862c98d80d3..82829243029d 100644
> --- a/include/linux/kvm_dirty_ring.h
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -49,9 +49,10 @@ static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
>  }
>  
>  static inline int kvm_dirty_ring_reset(struct kvm *kvm,
> -				       struct kvm_dirty_ring *ring)
> +				       struct kvm_dirty_ring *ring,
> +				       int *nr_entries_reset)
>  {
> -	return 0;
> +	return -ENOENT;
>  }
>  
>  static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
> @@ -81,7 +82,8 @@ int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
>   * called with kvm->slots_lock held, returns the number of
>   * processed pages.
>   */
> -int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> +			 int *nr_entries_reset);
>  
>  /*
>   * returns =0: successfully pushed
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 7bc74969a819..2faf894dec5a 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -104,19 +104,19 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
>  	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
>  }
>  
> -int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> +			 int *nr_entries_reset)
>  {
>  	u32 cur_slot, next_slot;
>  	u64 cur_offset, next_offset;
>  	unsigned long mask;
> -	int count = 0;
>  	struct kvm_dirty_gfn *entry;
>  	bool first_round = true;
>  
>  	/* This is only needed to make compilers happy */
>  	cur_slot = cur_offset = mask = 0;
>  
> -	while (true) {
> +	while (likely((*nr_entries_reset) < INT_MAX)) {
>  		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>  
>  		if (!kvm_dirty_gfn_harvested(entry))
> @@ -129,7 +129,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>  		kvm_dirty_gfn_set_invalid(entry);
>  
>  		ring->reset_index++;
> -		count++;
> +		(*nr_entries_reset)++;
>  		/*
>  		 * Try to coalesce the reset operations when the guest is
>  		 * scanning pages in the same slot.
> @@ -166,7 +166,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>  
>  	trace_kvm_dirty_ring_reset(ring);
>  
> -	return count;
> +	return 0;
>  }
>  
>  void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 9d54473d18e3..2d63b4d46ccb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4877,15 +4877,18 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
>  {
>  	unsigned long i;
>  	struct kvm_vcpu *vcpu;
> -	int cleared = 0;
> +	int cleared = 0, r;
>  
>  	if (!kvm->dirty_ring_size)
>  		return -EINVAL;
>  
>  	mutex_lock(&kvm->slots_lock);
>  
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
Previously "cleared" counts all cleared count in all vCPUs.

> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		r = kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring, &cleared);
Here it's reset to the cleared count in the last vCPU, possibly causing loss of
kvm_flush_remote_tlbs().

> +		if (r)
> +			break;
> +	}
>  
>  	mutex_unlock(&kvm->slots_lock);
>  
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 


* Re: [PATCH 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX
  2025-01-13  6:48   ` Yan Zhao
@ 2025-01-13  6:57     ` Yan Zhao
  0 siblings, 0 replies; 19+ messages in thread
From: Yan Zhao @ 2025-01-13  6:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel, Peter Xu,
	Maxim Levitsky

On Mon, Jan 13, 2025 at 02:48:05PM +0800, Yan Zhao wrote:
> On Fri, Jan 10, 2025 at 05:04:05PM -0800, Sean Christopherson wrote:
> > Cap the number of ring entries that are reset in a single ioctl to INT_MAX
> > to ensure userspace isn't confused by a wrap into negative space, and so
> > that, in a truly pathological scenario, KVM doesn't miss a TLB flush due
> > to the count wrapping to zero.  While the size of the ring is fixed at
> > 0x10000 entries and KVM (currently) supports at most 4096, userspace is
> > allowed to harvest entries from the ring while the reset is in-progress,
> > i.e. it's possible for the ring to always have harvested entries.
> > 
> > Opportunistically return an actual error code from the helper so that a
> > future fix to handle pending signals can gracefully return -EINTR.
> > 
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Yan Zhao <yan.y.zhao@intel.com>
> > Cc: Maxim Levitsky <mlevitsk@redhat.com>
> > Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  include/linux/kvm_dirty_ring.h |  8 +++++---
> >  virt/kvm/dirty_ring.c          | 10 +++++-----
> >  virt/kvm/kvm_main.c            |  9 ++++++---
> >  3 files changed, 16 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> > index 4862c98d80d3..82829243029d 100644
> > --- a/include/linux/kvm_dirty_ring.h
> > +++ b/include/linux/kvm_dirty_ring.h
> > @@ -49,9 +49,10 @@ static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
> >  }
> >  
> >  static inline int kvm_dirty_ring_reset(struct kvm *kvm,
> > -				       struct kvm_dirty_ring *ring)
> > +				       struct kvm_dirty_ring *ring,
> > +				       int *nr_entries_reset)
> >  {
> > -	return 0;
> > +	return -ENOENT;
> >  }
> >  
> >  static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
> > @@ -81,7 +82,8 @@ int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
> >   * called with kvm->slots_lock held, returns the number of
> >   * processed pages.
> >   */
> > -int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > +			 int *nr_entries_reset);
> >  
> >  /*
> >   * returns =0: successfully pushed
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index 7bc74969a819..2faf894dec5a 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -104,19 +104,19 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
> >  	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
> >  }
> >  
> > -int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> > +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > +			 int *nr_entries_reset)
> >  {
> >  	u32 cur_slot, next_slot;
> >  	u64 cur_offset, next_offset;
> >  	unsigned long mask;
> > -	int count = 0;
> >  	struct kvm_dirty_gfn *entry;
> >  	bool first_round = true;
> >  
> >  	/* This is only needed to make compilers happy */
> >  	cur_slot = cur_offset = mask = 0;
> >  
> > -	while (true) {
> > +	while (likely((*nr_entries_reset) < INT_MAX)) {
> >  		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> >  
> >  		if (!kvm_dirty_gfn_harvested(entry))
> > @@ -129,7 +129,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >  		kvm_dirty_gfn_set_invalid(entry);
> >  
> >  		ring->reset_index++;
> > -		count++;
> > +		(*nr_entries_reset)++;
> >  		/*
> >  		 * Try to coalesce the reset operations when the guest is
> >  		 * scanning pages in the same slot.
> > @@ -166,7 +166,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >  
> >  	trace_kvm_dirty_ring_reset(ring);
> >  
> > -	return count;
> > +	return 0;
> >  }
> >  
> >  void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 9d54473d18e3..2d63b4d46ccb 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4877,15 +4877,18 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
> >  {
> >  	unsigned long i;
> >  	struct kvm_vcpu *vcpu;
> > -	int cleared = 0;
> > +	int cleared = 0, r;
> >  
> >  	if (!kvm->dirty_ring_size)
> >  		return -EINVAL;
> >  
> >  	mutex_lock(&kvm->slots_lock);
> >  
> > -	kvm_for_each_vcpu(i, vcpu, kvm)
> > -		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
> Previously, "cleared" accumulated the cleared counts across all vCPUs.
> 
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		r = kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring, &cleared);
> Here it's reset to the cleared count in the last vCPU, possibly causing loss of
> kvm_flush_remote_tlbs().
Ah, sorry, it's not: (*nr_entries_reset)++ accumulates across each vCPU.

> 
> > +		if (r)
> > +			break;
> > +	}
> >  
> >  	mutex_unlock(&kvm->slots_lock);
> >  
> > -- 
> > 2.47.1.613.gc27f4b7a9f-goog
> > 
> 


* Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-01-11  1:04 ` [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
@ 2025-01-13  7:04   ` Yan Zhao
  2025-01-13 16:28     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yan Zhao @ 2025-01-13  7:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first.  The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
> 
> Note!  Take care to check for reschedule even in the "continue" paths,
> as a pathological scenario (or malicious userspace) could dirty the same
> gfn over and over, i.e. always hit the continue path.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu: 	4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
>   rcu: 	(t=5250 jiffies g=-319 q=608 ncpus=24)
>   CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
> 
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/dirty_ring.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index a81ad17d5eef..37eb2b7142bd 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  
>  		ring->reset_index++;
>  		(*nr_entries_reset)++;
> +
> +		/*
> +		 * While the size of each ring is fixed, it's possible for the
> +		 * ring to be constantly re-dirtied/harvested while the reset
> +		 * is in-progress (the hard limit exists only to guard against
> +		 * wrapping the count into negative space).
> +		 */
> +		if (!first_round)
> +			cond_resched();
> +
Will cond_resched() per entry be too frequent?
Could we combine the cond_resched() per ring? e.g.

if (count >= ring->soft_limit)
	cond_resched();

or simply
while (count < ring->size) {
	...
}


>  		/*
>  		 * Try to coalesce the reset operations when the guest is
>  		 * scanning pages in the same slot.
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 


* Re: [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-01-11  1:04 ` [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
@ 2025-01-13  9:31   ` Yan Zhao
  2025-01-13 15:48     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yan Zhao @ 2025-01-13  9:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Fri, Jan 10, 2025 at 05:04:06PM -0800, Sean Christopherson wrote:
> Abort a dirty ring reset if the current task has a pending signal, as the
> hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
> in a timely fashion.
> 
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/dirty_ring.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 2faf894dec5a..a81ad17d5eef 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -117,6 +117,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  	cur_slot = cur_offset = mask = 0;
>  
>  	while (likely((*nr_entries_reset) < INT_MAX)) {
> +		if (signal_pending(current))
> +			return -EINTR;
Will it break userspace when a signal is pending? e.g. QEMU might report an
error like
"kvm_dirty_ring_reap_locked: Assertion `ret == total' failed".

>  		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
>  
>  		if (!kvm_dirty_gfn_harvested(entry))
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 


* Re: [PATCH 0/5] KVM: Dirty ring fixes and cleanups
  2025-01-11  1:04 [PATCH 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-01-11  1:04 ` [PATCH 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
@ 2025-01-13  9:51 ` Yan Zhao
  5 siblings, 0 replies; 19+ messages in thread
From: Yan Zhao @ 2025-01-13  9:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Fri, Jan 10, 2025 at 05:04:04PM -0800, Sean Christopherson wrote:
> Yan's series to fix a flaw where a buggy/malicious userspace could coerce
> KVM into corrupting TDX SPTEs[*]
> [*] https://lore.kernel.org/all/20241220082027.15851-1-yan.y.zhao@intel.com

Hi Sean,
Here's the v2
https://lore.kernel.org/kvm/20241223070427.29583-1-yan.y.zhao@intel.com.
FYI, in case you didn't catch it as it was sent close to Christmas :)

Thanks
Yan


* Re: [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-01-11  1:04 ` [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
@ 2025-01-13 10:30   ` Yan Zhao
  2025-01-13 16:48     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yan Zhao @ 2025-01-13 10:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Fri, Jan 10, 2025 at 05:04:08PM -0800, Sean Christopherson wrote:
> When resetting a dirty ring, explicitly check that there is work to be
> done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries
> are found and/or on the loop's first iteration, and delete the extremely
> misleading comment "This is only needed to make compilers happy".  KVM
> absolutely relies on mask to be zero-initialized, i.e. the comment is an
> outright lie.  Furthermore, the compiler is right to complain that KVM is
> calling a function with uninitialized data, as there are no guarantees
> the implementation details of kvm_reset_dirty_gfn() will be visible to
> kvm_dirty_ring_reset().
> 
> While the flaw could be fixed by simply deleting (or rewording) the
> comment, and duplicating the check is unfortunate, checking mask in the
> caller will allow for additional cleanups.
> 
> Opportunistically drop the zero-initialization of cur_slot and cur_offset.
> If a bug were introduced where either the slot or offset was consumed
> before mask is set to a non-zero value, then it is highly desirable for
> the compiler (or some other sanitizer) to yell.
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/dirty_ring.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 37eb2b7142bd..95ab0e3cf9da 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -55,9 +55,6 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>  	struct kvm_memory_slot *memslot;
>  	int as_id, id;
>  
> -	if (!mask)
> -		return;
> -
>  	as_id = slot >> 16;
>  	id = (u16)slot;
>  
> @@ -109,13 +106,10 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  {
>  	u32 cur_slot, next_slot;
>  	u64 cur_offset, next_offset;
> -	unsigned long mask;
> +	unsigned long mask = 0;
>  	struct kvm_dirty_gfn *entry;
>  	bool first_round = true;
>  
> -	/* This is only needed to make compilers happy */
> -	cur_slot = cur_offset = mask = 0;
> -
>  	while (likely((*nr_entries_reset) < INT_MAX)) {
>  		if (signal_pending(current))
>  			return -EINTR;
> @@ -163,14 +157,31 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>  				continue;
>  			}
>  		}
> -		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +		/*
> +		 * Reset the slot for all the harvested entries that have been
> +		 * gathered, but not yet fully processed.
> +		 */
I really like the logs as it took me quite a while figuring out how this part of
the code works :)

Does "processed" mean the entries have been reset, and "gathered" means they've
been read from the ring?

I'm not sure, but do you like this version? e.g.
"Combined reset of the harvested entries that can be identified by curr_slot
plus cur_offset+mask" ?


> +		if (mask)
> +			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +		/*
> +		 * The current slot was reset or this is the first harvested
> +		 * entry, (re)initialize the metadata.
> +		 */
What about
"Save the current slot and cur_offset (with mask initialized to 1) to check if
any future entries can be found for a combined reset." ?

>  		cur_slot = next_slot;
>  		cur_offset = next_offset;
>  		mask = 1;
>  		first_round = false;
>  	}
>  
> -	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +	/*
> +	 * Perform a final reset if there are harvested entries that haven't
> +	 * been processed. The loop only performs a reset when an entry can't
> +	 * be coalesced, i.e. always leaves at least one entry pending.
The loop only performs a reset when an entry can be coalesced?

> +	 */
> +	if (mask)
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
>  
>  	/*
>  	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
> -- 
> 2.47.1.613.gc27f4b7a9f-goog
> 


* Re: [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-01-13  9:31   ` Yan Zhao
@ 2025-01-13 15:48     ` Sean Christopherson
  2025-01-14  7:29       ` Yan Zhao
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-13 15:48 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025, Yan Zhao wrote:
> On Fri, Jan 10, 2025 at 05:04:06PM -0800, Sean Christopherson wrote:
> > Abort a dirty ring reset if the current task has a pending signal, as the
> > hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
> > in a timely fashion.
> > 
> > Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  virt/kvm/dirty_ring.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index 2faf894dec5a..a81ad17d5eef 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -117,6 +117,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >  	cur_slot = cur_offset = mask = 0;
> >  
> >  	while (likely((*nr_entries_reset) < INT_MAX)) {
> > +		if (signal_pending(current))
> > +			return -EINTR;
> > Will it break userspace when a signal is pending? e.g. QEMU might report an
> error like
> "kvm_dirty_ring_reap_locked: Assertion `ret == total' failed".

Ugh.  In theory, yes.  In practice, I hope not?  If it's a potential problem for
QEMU, the only idea I have is to only react to fatal signals by default, and then
let userspace opt-in to reacting to non-fatal signals.
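
E.g. something like (hypothetical flag and wiring, purely a sketch):

	if (kvm->honor_nonfatal_signals ? signal_pending(current) :
					  fatal_signal_pending(current))
		return -EINTR;

where honor_nonfatal_signals would be set via a new, opt-in capability.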

> 
> >  		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> >  
> >  		if (!kvm_dirty_gfn_harvested(entry))
> > -- 
> > 2.47.1.613.gc27f4b7a9f-goog
> > 


* Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-01-13  7:04   ` Yan Zhao
@ 2025-01-13 16:28     ` Sean Christopherson
  2025-01-14  7:58       ` Yan Zhao
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-13 16:28 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025, Yan Zhao wrote:
> On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index a81ad17d5eef..37eb2b7142bd 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >  
> >  		ring->reset_index++;
> >  		(*nr_entries_reset)++;
> > +
> > +		/*
> > +		 * While the size of each ring is fixed, it's possible for the
> > +		 * ring to be constantly re-dirtied/harvested while the reset
> > +		 * is in-progress (the hard limit exists only to guard against
> > +		 * wrapping the count into negative space).
> > +		 */
> > +		if (!first_round)
> > +			cond_resched();
> > +
> Will cond_resched() per entry be too frequent?

No, if it is too frequent, KVM has other problems.  cond_resched() only takes a
handful of cycles when no work needs to be done, and on PREEMPTION=y kernels,
dropping mmu_lock in kvm_reset_dirty_gfn() already includes a NEED_RESCHED check.

> Could we combine the cond_resched() per ring? e.g.
> 
> if (count >= ring->soft_limit)
> 	cond_resched();
> 
> or simply
> while (count < ring->size) {
> 	...
> }

I don't think I have any objections to bounding the reset at ring->size?  I
assumed the unbounded walk was deliberate, e.g. to let userspace reset entries
in a separate thread, but looking at the QEMU code, that doesn't appear to be
the case.

However, IMO that's an orthogonal discussion.  I think KVM should still check for
NEED_RESCHED after processing each entry regardless of how the loop is bounded.
E.g. write-protecting 65536 GFNs is definitely going to have measurable latency.


* Re: [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-01-13 10:30   ` Yan Zhao
@ 2025-01-13 16:48     ` Sean Christopherson
  2025-01-14  8:13       ` Yan Zhao
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-01-13 16:48 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025, Yan Zhao wrote:
> On Fri, Jan 10, 2025 at 05:04:08PM -0800, Sean Christopherson wrote:
> > @@ -163,14 +157,31 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >  				continue;
> >  			}
> >  		}
> > -		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > +		/*
> > +		 * Reset the slot for all the harvested entries that have been
> > +		 * gathered, but not yet fully processed.
> > +		 */
> I really like the logs as it took me quite a while figuring out how this part of
> the code works :)
> 
> Does "processed" mean the entries have been reset, and "gathered" means they've
> been read from the ring?

Yeah.

> I'm not sure, but do you like this version? e.g.
> "Combined reset of the harvested entries that can be identified by curr_slot
> plus cur_offset+mask" ?

I have no objection to documenting the mechanics *and* the high level intent,
but I definitely want to document the "what", not just the "how".

> > +		if (mask)
> > +			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +
> > +		/*
> > +		 * The current slot was reset or this is the first harvested
> > +		 * entry, (re)initialize the metadata.
> > +		 */
> What about
> "Save the current slot and cur_offset (with mask initialized to 1) to check if
> any future entries can be found for a combined reset." ?

Hmm, what if I add a comment at the top to document the overall behavior and the
variables,

	/*
	 * To minimize mmu_lock contention, batch resets for harvested entries
	 * whose gfns are in the same slot, and are within N frame numbers of
	 * each other, where N is the number of bits in an unsigned long.  For
	 * simplicity, process the current set of entries when the next entry
	 * can't be included in the batch.
	 *
	 * Track the current batch slot, the gfn offset into the slot for the
	 * batch, and the bitmask of gfns that need to be reset (relative to
	 * offset).  Note, the offset may be adjusted backwards, e.g. so that
	 * a sequence of gfns X, X-1, ... X-N can be batched.
	 */
	u32 cur_slot, next_slot;
	u64 cur_offset, next_offset;
	unsigned long mask = 0;
	struct kvm_dirty_gfn *entry;

and then keep this as:

		/*
		 * The current slot was reset or this is the first harvested
		 * entry, (re)initialize the batching metadata.
		 */
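
And maybe a worked example for the backwards adjustment (numbers picked
arbitrarily, purely illustrative):

	/*
	 * cur_offset = 10, mask = 0b00101    => offsets 10 and 12 pending
	 * harvest offset 8: delta = -2
	 * (mask << 2 >> 2) == mask           => no pending bit falls off the top
	 * cur_offset = 8
	 * mask = (0b00101 << 2) | 1 = 0b10101 => offsets 8, 10, 12 pending
	 */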

> 
> >  		cur_slot = next_slot;
> >  		cur_offset = next_offset;
> >  		mask = 1;
> >  		first_round = false;
> >  	}
> >  
> > -	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > +	/*
> > +	 * Perform a final reset if there are harvested entries that haven't
> > +	 * been processed. The loop only performs a reset when an entry can't
> > +	 * be coalesced, i.e. always leaves at least one entry pending.
> The loop only performs a reset when an entry can be coalesced?

No, if an entry can be coalesced then the loop doesn't perform a reset.  Does
this read better?

	/*
	 * Perform a final reset if there are harvested entries that haven't
	 * been processed, which is guaranteed if at least one harvested entry
	 * was found.  The loop only performs a reset when the "next" entry
	 * can't be batched with the "current" entry(s), and that reset
	 * processes the _current_ entry(s), i.e. the last harvested entry,
	 * a.k.a. next, will always be left pending.
	 */

> > +	 */
> > +	if (mask)
> > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> >  
> >  	/*
> >  	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
> > -- 
> > 2.47.1.613.gc27f4b7a9f-goog
> > 


* Re: [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-01-13 15:48     ` Sean Christopherson
@ 2025-01-14  7:29       ` Yan Zhao
  2025-01-14 17:16         ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yan Zhao @ 2025-01-14  7:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025 at 07:48:46AM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > On Fri, Jan 10, 2025 at 05:04:06PM -0800, Sean Christopherson wrote:
> > > Abort a dirty ring reset if the current task has a pending signal, as the
> > > hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
> > > in a timely fashion.
> > > 
> > > Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  virt/kvm/dirty_ring.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index 2faf894dec5a..a81ad17d5eef 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -117,6 +117,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >  	cur_slot = cur_offset = mask = 0;
> > >  
> > >  	while (likely((*nr_entries_reset) < INT_MAX)) {
> > > +		if (signal_pending(current))
> > > +			return -EINTR;
> > > Will it break userspace when a signal is pending? e.g. QEMU might report an
> > error like
> > "kvm_dirty_ring_reap_locked: Assertion `ret == total' failed".
> 
> Ugh.  In theory, yes.  In practice, I hope not?  If it's a potential problem for
> QEMU, the only idea I have is to only react to fatal signals by default, and then
> let userspace opt-in to reacting to non-fatal signals.
So, what about just fatal_signal_pending() as in other ioctls in the kernel?

if (fatal_signal_pending(current))
	return -EINTR;

> 
> > 
> > >  		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
> > >  
> > >  		if (!kvm_dirty_gfn_harvested(entry))
> > > -- 
> > > 2.47.1.613.gc27f4b7a9f-goog
> > > 
> 


* Re: [PATCH 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-01-13 16:28     ` Sean Christopherson
@ 2025-01-14  7:58       ` Yan Zhao
  0 siblings, 0 replies; 19+ messages in thread
From: Yan Zhao @ 2025-01-14  7:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025 at 08:28:06AM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > On Fri, Jan 10, 2025 at 05:04:07PM -0800, Sean Christopherson wrote:
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index a81ad17d5eef..37eb2b7142bd 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -133,6 +133,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >  
> > >  		ring->reset_index++;
> > >  		(*nr_entries_reset)++;
> > > +
> > > +		/*
> > > +		 * While the size of each ring is fixed, it's possible for the
> > > +		 * ring to be constantly re-dirtied/harvested while the reset
> > > +		 * is in-progress (the hard limit exists only to guard against
> > > +		 * wrapping the count into negative space).
> > > +		 */
> > > +		if (!first_round)
> > > +			cond_resched();
> > > +
> > Will cond_resched() per entry be too frequent?
> 
> No, if it is too frequent, KVM has other problems.  cond_resched() only takes a
> handful of cycles when no work needs to be done, and on PREEMPTION=y kernels,
> dropping mmu_lock in kvm_reset_dirty_gfn() already includes a NEED_RESCHED check.
Ok. I was just worried about live migration performance.
But it looks like per-entry should also be good.

> 
> > Could we combine the cond_resched() per ring? e.g.
> > 
> > if (count >= ring->soft_limit)
> > 	cond_resched();
> > 
> > or simply
> > while (count < ring->size) {
> > 	...
> > }
> 
> I don't think I have any objections to bounding the reset at ring->size?  I
> assumed the unbounded walk was deliberate, e.g. to let userspace reset entries
> in a separate thread, but looking at the QEMU code, that doesn't appear to be
> the case.
Ok.

> However, IMO that's an orthogonal discussion.  I think KVM should still check for
> NEED_RESCHED after processing each entry regardless of how the loop is bounded.
> E.g. write-protecting 65536 GFNs is definitely going to have measurable latency.
Yes.


* Re: [PATCH 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-01-13 16:48     ` Sean Christopherson
@ 2025-01-14  8:13       ` Yan Zhao
  0 siblings, 0 replies; 19+ messages in thread
From: Yan Zhao @ 2025-01-14  8:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Mon, Jan 13, 2025 at 08:48:28AM -0800, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Yan Zhao wrote:
> > On Fri, Jan 10, 2025 at 05:04:08PM -0800, Sean Christopherson wrote:
> > > @@ -163,14 +157,31 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >  				continue;
> > >  			}
> > >  		}
> > > -		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +		/*
> > > +		 * Reset the slot for all the harvested entries that have been
> > > +		 * gathered, but not yet fully processed.
> > > +		 */
> > I really like the logs as it took me quite a while figuring out how this part of
> > the code works :)
> > 
> > Does "processed" mean the entries have been reset, and "gathered" means they've
> > been read from the ring?
> 
> Yeah.
> 
> > I'm not sure, but do you like this version? e.g.
> > "Combined reset of the harvested entries that can be identified by curr_slot
> > plus cur_offset+mask" ?
> 
> I have no objection to documenting the mechanics *and* the high level intent,
> but I definitely want to document the "what", not just the "how".
> 
> > > +		if (mask)
> > > +			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +
> > > +		/*
> > > +		 * The current slot was reset or this is the first harvested
> > > +		 * entry, (re)initialize the metadata.
> > > +		 */
> > What about
> > "Save the current slot and cur_offset (with mask initialized to 1) to check if
> > any future entries can be found for a combined reset." ?
> 
> Hmm, what if I add a comment at the top to document the overall behavior and the
> variables,
> 
> 	/*
> 	 * To minimize mmu_lock contention, batch resets for harvested entries
> 	 * whose gfns are in the same slot, and are within N frame numbers of
> 	 * each other, where N is the number of bits in an unsigned long.  For
> 	 * simplicity, process the current set of entries when the next entry
> 	 * can't be included in the batch.
> 	 *
> 	 * Track the current batch slot, the gfn offset into the slot for the
> 	 * batch, and the bitmask of gfns that need to be reset (relative to
> 	 * offset).  Note, the offset may be adjusted backwards, e.g. so that
> 	 * a sequence of gfns X, X-1, ... X-N can be batched.
> 	 */
Nice log.
> 	u32 cur_slot, next_slot;
> 	u64 cur_offset, next_offset;
> 	unsigned long mask = 0;
> 	struct kvm_dirty_gfn *entry;
> 
> and then keep this as:
> 
> 		/*
> 		 * The current slot was reset or this is the first harvested
> 		 * entry, (re)initialize the batching metadata.
> 		 */
> 
> > 
> > >  		cur_slot = next_slot;
> > >  		cur_offset = next_offset;
> > >  		mask = 1;
> > >  		first_round = false;
> > >  	}
> > >  
> > > -	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > > +	/*
> > > +	 * Perform a final reset if there are harvested entries that haven't
> > > +	 * been processed. The loop only performs a reset when an entry can't
> > > +	 * be coalesced, i.e. always leaves at least one entry pending.
> > The loop only performs a reset when an entry can be coalesced?
> 
> No, if an entry can be coalesced then the loop doesn't perform a reset.  Does
> this read better?
> 
> 	/*
> 	 * Perform a final reset if there are harvested entries that haven't
> 	 * been processed, which is guaranteed if at least one harvested entry
> 	 * was found.  The loop only performs a reset when the "next" entry
> 	 * can't be batched with the "current" entry(s), and that reset
> 	 * processes the _current_ entry(s), i.e. the last harvested entry,
> 	 * a.k.a. next, will always be left pending.
> 	 */
LGTM!

> > > +	 */
> > > +	if (mask)
> > > +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> > >  
> > >  	/*
> > >  	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
> > > -- 
> > > 2.47.1.613.gc27f4b7a9f-goog
> > > 


* Re: [PATCH 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-01-14  7:29       ` Yan Zhao
@ 2025-01-14 17:16         ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-01-14 17:16 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Maxim Levitsky

On Tue, Jan 14, 2025, Yan Zhao wrote:
> On Mon, Jan 13, 2025 at 07:48:46AM -0800, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Yan Zhao wrote:
> > > On Fri, Jan 10, 2025 at 05:04:06PM -0800, Sean Christopherson wrote:
> > > > Abort a dirty ring reset if the current task has a pending signal, as the
> > > > hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
> > > > in a timely fashion.
> > > > 
> > > > Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  virt/kvm/dirty_ring.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > > index 2faf894dec5a..a81ad17d5eef 100644
> > > > --- a/virt/kvm/dirty_ring.c
> > > > +++ b/virt/kvm/dirty_ring.c
> > > > @@ -117,6 +117,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > > >  	cur_slot = cur_offset = mask = 0;
> > > >  
> > > >  	while (likely((*nr_entries_reset) < INT_MAX)) {
> > > > +		if (signal_pending(current))
> > > > +			return -EINTR;
> > > Will it break userspace when a signal is pending? e.g. QEMU might report an
> > > error like
> > > "kvm_dirty_ring_reap_locked: Assertion `ret == total' failed".
> > 
> > Ugh.  In theory, yes.  In practice, I hope not?  If it's a potential problem for
> > QEMU, the only idea I have is to only react to fatal signals by default, and then
> > let userspace opt-in to reacting to non-fatal signals.
> So, what about just fatal_signal_pending() as in other ioctls in the kernel?

Ya, though I would leave the decision up to Peter or Paolo (or someone else that
knows what QEMU wants/prefers/tolerates).

> if (fatal_signal_pending(current))
> 	return -EINTR;

