public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups
@ 2025-05-08 14:10 Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Fix issues with dirty ring harvesting where KVM doesn't bound the processing
of entries in any way, which allows userspace to keep KVM in a tight loop
indefinitely.

E.g.

        struct kvm_dirty_gfn *dirty_gfns = vcpu_map_dirty_ring(vcpu);

        if (fork()) {
                int r;

                for (;;) {
                        r = kvm_vm_reset_dirty_ring(vcpu->vm);
                        if (r)
                                printf("RESET %d dirty ring entries\n", r);
                }
        } else {
                int i;

                for (i = 0; i < test_dirty_ring_count; i++) {
                        dirty_gfns[i].slot = TEST_MEM_SLOT_INDEX;
                        dirty_gfns[i].offset = (i * 64) % host_num_pages;
                }

                for (;;) {
                        for (i = 0; i < test_dirty_ring_count; i++)
                                WRITE_ONCE(dirty_gfns[i].flags, KVM_DIRTY_GFN_F_RESET);
                }
        }

Patches 1-3 address that class of bugs.  Patches 4 and 5 are cleanups.


v2: Expand on comments in dirty ring harvesting code. [Yan]

v1: https://lore.kernel.org/all/20250111010409.1252942-1-seanjc@google.com

Sean Christopherson (5):
  KVM: Bound the number of dirty ring entries in a single reset at
    INT_MAX
  KVM: Bail from the dirty ring reset flow if a signal is pending
  KVM: Conditionally reschedule when resetting the dirty ring
  KVM: Check for empty mask of harvested dirty ring entries in caller
  KVM: Use mask of harvested dirty ring entries to coalesce dirty ring
    resets

 include/linux/kvm_dirty_ring.h |   8 ++-
 virt/kvm/dirty_ring.c          | 103 ++++++++++++++++++++++-----------
 virt/kvm/kvm_main.c            |   9 ++-
 3 files changed, 81 insertions(+), 39 deletions(-)


base-commit: d1b88431c59e94799aff0e31ce1467af2a86b0cf
-- 
2.49.0.1015.ga840276032-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX
  2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
@ 2025-05-08 14:10 ` Sean Christopherson
  2025-05-13  1:25   ` Binbin Wu
  2025-05-08 14:10 ` [PATCH v2 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Cap the number of ring entries that are reset in a single ioctl to INT_MAX
to ensure userspace isn't confused by a wrap into negative space, and so
that, in a truly pathological scenario, KVM doesn't miss a TLB flush due
to the count wrapping to zero.  While the size of the ring is fixed at
0x10000 entries and KVM (currently) supports at most 4096 vCPUs, userspace
is allowed to harvest entries from the ring while the reset is in-progress,
i.e. it's possible for the ring to always have harvested entries.

Opportunistically return an actual error code from the helper so that a
future fix to handle pending signals can gracefully return -EINTR.
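
As a sketch of the failure mode (a toy model, not KVM code): a reset count
that grows past INT_MAX goes negative when truncated to a signed 32-bit int,
and a wrap all the way around to zero would look to the caller like no work
was done:

```python
INT_MAX = 2**31 - 1

def as_s32(count):
    """Truncate an unbounded count to a signed 32-bit int, mimicking
    the C 'int' return value userspace sees from the reset ioctl."""
    count &= 0xffffffff
    return count - 2**32 if count >= 2**31 else count

assert as_s32(INT_MAX) == INT_MAX        # below the cap: exact
assert as_s32(INT_MAX + 1) == -2**31     # one past: negative space
assert as_s32(2**32) == 0                # full wrap: looks like a no-op
```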

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_dirty_ring.h |  8 +++++---
 virt/kvm/dirty_ring.c          | 10 +++++-----
 virt/kvm/kvm_main.c            |  9 ++++++---
 3 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
index da4d9b5f58f1..ee61ff6c3fe4 100644
--- a/include/linux/kvm_dirty_ring.h
+++ b/include/linux/kvm_dirty_ring.h
@@ -49,9 +49,10 @@ static inline int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *r
 }
 
 static inline int kvm_dirty_ring_reset(struct kvm *kvm,
-				       struct kvm_dirty_ring *ring)
+				       struct kvm_dirty_ring *ring,
+				       int *nr_entries_reset)
 {
-	return 0;
+	return -ENOENT;
 }
 
 static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
@@ -82,7 +83,8 @@ int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring,
  * called with kvm->slots_lock held, returns the number of
  * processed pages.
  */
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
+			 int *nr_entries_reset);
 
 /*
  * returns =0: successfully pushed
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index d14ffc7513ee..77986f34eff8 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -105,19 +105,19 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 	return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
 }
 
-int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
+			 int *nr_entries_reset)
 {
 	u32 cur_slot, next_slot;
 	u64 cur_offset, next_offset;
 	unsigned long mask;
-	int count = 0;
 	struct kvm_dirty_gfn *entry;
 	bool first_round = true;
 
 	/* This is only needed to make compilers happy */
 	cur_slot = cur_offset = mask = 0;
 
-	while (true) {
+	while (likely((*nr_entries_reset) < INT_MAX)) {
 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
 
 		if (!kvm_dirty_gfn_harvested(entry))
@@ -130,7 +130,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 		kvm_dirty_gfn_set_invalid(entry);
 
 		ring->reset_index++;
-		count++;
+		(*nr_entries_reset)++;
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
@@ -167,7 +167,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 	trace_kvm_dirty_ring_reset(ring);
 
-	return count;
+	return 0;
 }
 
 void kvm_dirty_ring_push(struct kvm_vcpu *vcpu, u32 slot, u64 offset)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 16fe54cf2808..f9de2c8b9c1e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4903,15 +4903,18 @@ static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
 {
 	unsigned long i;
 	struct kvm_vcpu *vcpu;
-	int cleared = 0;
+	int cleared = 0, r;
 
 	if (!kvm->dirty_ring_size)
 		return -EINVAL;
 
 	mutex_lock(&kvm->slots_lock);
 
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		r = kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring, &cleared);
+		if (r)
+			break;
+	}
 
 	mutex_unlock(&kvm->slots_lock);
 
-- 
2.49.0.1015.ga840276032-goog



* [PATCH v2 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending
  2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
@ 2025-05-08 14:10 ` Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Abort a dirty ring reset if the current task has a pending signal, as the
hard limit of INT_MAX entries doesn't ensure KVM will respond to a signal
in a timely fashion.
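
As a rough model of the control flow (hypothetical helper names, not the
kernel code), the reset loop now has two exit conditions: the INT_MAX cap
and a pending signal:

```python
EINTR = 4          # errno value; KVM returns it negated, as -EINTR
INT_MAX = 2**31 - 1

def reset_dirty_ring(harvested, signal_pending, cap=INT_MAX):
    """Toy model of the kvm_dirty_ring_reset() loop: process harvested
    entries until the cap is hit, bailing out with -EINTR as soon as a
    signal is pending."""
    nr_reset = 0
    while nr_reset < cap:
        if signal_pending():
            return -EINTR, nr_reset
        if not harvested():
            break            # no more harvested entries in the ring
        nr_reset += 1
    return 0, nr_reset

# A ring that is endlessly re-harvested: without the signal check the
# loop would spin until the INT_MAX cap; with it, a signal arriving
# after three entries stops the reset promptly.
arrived = iter([False, False, False, True])
r, n = reset_dirty_ring(lambda: True, lambda: next(arrived))
assert (r, n) == (-EINTR, 3)
```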

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 77986f34eff8..e844e869e8c7 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -118,6 +118,9 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 	cur_slot = cur_offset = mask = 0;
 
 	while (likely((*nr_entries_reset) < INT_MAX)) {
+		if (signal_pending(current))
+			return -EINTR;
+
 		entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
 
 		if (!kvm_dirty_gfn_harvested(entry))
-- 
2.49.0.1015.ga840276032-goog



* [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
@ 2025-05-08 14:10 ` Sean Christopherson
  2025-05-12 22:02   ` James Houghton
  2025-05-08 14:10 ` [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
  2025-05-08 14:10 ` [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
  4 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

When resetting a dirty ring, conditionally reschedule on each iteration
after the first.  The recently introduced hard limit mitigates the issue
of an endless reset, but isn't sufficient to completely prevent RCU
stalls, soft lockups, etc., nor is the hard limit intended to guard
against such badness.

Note!  Take care to check for reschedule even in the "continue" paths,
as a pathological scenario (or malicious userspace) could dirty the same
gfn over and over, i.e. always hit the continue path.

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
  rcu: 	(t=5250 jiffies g=-319 q=608 ncpus=24)
  CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
  Tainted: [L]=SOFTLOCKUP
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
  Call Trace:
   <TASK>
   kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
   kvm_dirty_ring_reset+0x58/0x220 [kvm]
   kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x5b/0x160
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>
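
The "same gfn over and over" case can be sketched with a toy model of the
forward-coalescing check (hypothetical helper, not KVM code): when every
harvested entry names the same slot and offset, delta is always zero, so
every iteration takes the continue path:

```python
BITS_PER_LONG = 64

def takes_continue_path(cur, nxt):
    """Toy model of the forward-coalescing check in
    kvm_dirty_ring_reset(): does entry 'nxt' join the batch rooted at
    'cur' (a (slot, offset) pair) without triggering a reset?"""
    (cur_slot, cur_offset), (next_slot, next_offset) = cur, nxt
    if next_slot != cur_slot:
        return False
    delta = next_offset - cur_offset
    return 0 <= delta < BITS_PER_LONG

# Userspace re-dirtying one gfn forever: delta == 0 every time, so the
# loop always continues and, before this patch, never reached a
# schedule point.
assert all(takes_continue_path((0, 42), (0, 42)) for _ in range(1000))
```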

Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index e844e869e8c7..97cca0c02fd1 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 
 		ring->reset_index++;
 		(*nr_entries_reset)++;
+
+		/*
+		 * While the size of each ring is fixed, it's possible for the
+		 * ring to be constantly re-dirtied/harvested while the reset
+		 * is in-progress (the hard limit exists only to guard against
+		 * wrapping the count into negative space).
+		 */
+		if (!first_round)
+			cond_resched();
+
 		/*
 		 * Try to coalesce the reset operations when the guest is
 		 * scanning pages in the same slot.
-- 
2.49.0.1015.ga840276032-goog



* [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-05-08 14:10 ` [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
@ 2025-05-08 14:10 ` Sean Christopherson
  2025-05-13  9:17   ` Binbin Wu
  2025-05-13 12:51   ` Gupta, Pankaj
  2025-05-08 14:10 ` [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
  4 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

When resetting a dirty ring, explicitly check that there is work to be
done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries
are found and/or on the loop's first iteration, and delete the extremely
misleading comment "This is only needed to make compilers happy".  KVM
absolutely relies on mask to be zero-initialized, i.e. the comment is an
outright lie.  Furthermore, the compiler is right to complain that KVM is
calling a function with uninitialized data, as there are no guarantees
the implementation details of kvm_reset_dirty_gfn() will be visible to
kvm_dirty_ring_reset().

While the flaw could be fixed by simply deleting (or rewording) the
comment, and duplicating the check is unfortunate, checking mask in the
caller will allow for additional cleanups.

Opportunisticaly drop the zero-initialization of cur_slot and cur_offset.
If a bug were introduced where either the slot or offset was consumed
before mask is set to a non-zero value, then it is highly desirable for
the compiler (or some other sanitizer) to yell.
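
The batching that the new comment describes can be modeled in a few lines
(a sketch with hypothetical names, not the kernel code): entries in the
same slot whose offsets fall within BITS_PER_LONG of each other collapse
into one (slot, offset, mask) reset, with a final flush after the loop
whenever mask is non-zero:

```python
BITS_PER_LONG = 64

def batch_resets(entries):
    """Toy model of kvm_dirty_ring_reset() batching: coalesce a stream
    of harvested (slot, offset) entries into (slot, offset, mask)
    resets, where mask bit N covers gfn 'offset + N' in the slot."""
    batches = []
    cur_slot = cur_offset = None
    mask = 0
    for next_slot, next_offset in entries:
        if mask and next_slot == cur_slot:
            delta = next_offset - cur_offset
            if 0 <= delta < BITS_PER_LONG:
                mask |= 1 << delta
                continue
            # Backwards visit: OK only if shifting the mask up by
            # -delta doesn't drop bits off the top of a 64-bit word.
            if -BITS_PER_LONG < delta < 0 and \
               (mask << -delta) < (1 << BITS_PER_LONG):
                cur_offset = next_offset
                mask = (mask << -delta) | 1
                continue
        if mask:
            batches.append((cur_slot, cur_offset, mask))
        cur_slot, cur_offset, mask = next_slot, next_offset, 1
    if mask:  # final reset; guaranteed if any entry was harvested
        batches.append((cur_slot, cur_offset, mask))
    return batches

# Offsets 3,4,5 in slot 0 coalesce into one reset with a 3-bit mask;
# slot 1 starts a new batch, which is flushed after the loop.
assert batch_resets([(0, 3), (0, 4), (0, 5), (1, 0)]) == \
       [(0, 3, 0b111), (1, 0, 1)]
```

Note how the last harvested entry is always left pending by the loop body
and only written out by the final flush, which is the behavior the new
comment in the diff below documents.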

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 44 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index 97cca0c02fd1..a3434be8f00d 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -55,9 +55,6 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
 	struct kvm_memory_slot *memslot;
 	int as_id, id;
 
-	if (!mask)
-		return;
-
 	as_id = slot >> 16;
 	id = (u16)slot;
 
@@ -108,15 +105,24 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
 int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 			 int *nr_entries_reset)
 {
+	/*
+	 * To minimize mmu_lock contention, batch resets for harvested entries
+	 * whose gfns are in the same slot, and are within N frame numbers of
+	 * each other, where N is the number of bits in an unsigned long.  For
+	 * simplicity, process the current set of entries when the next entry
+	 * can't be included in the batch.
+	 *
+	 * Track the current batch slot, the gfn offset into the slot for the
+	 * batch, and the bitmask of gfns that need to be reset (relative to
+	 * offset).  Note, the offset may be adjusted backwards, e.g. so that
+	 * a sequence of gfns X, X-1, ... X-N can be batched.
+	 */
 	u32 cur_slot, next_slot;
 	u64 cur_offset, next_offset;
-	unsigned long mask;
+	unsigned long mask = 0;
 	struct kvm_dirty_gfn *entry;
 	bool first_round = true;
 
-	/* This is only needed to make compilers happy */
-	cur_slot = cur_offset = mask = 0;
-
 	while (likely((*nr_entries_reset) < INT_MAX)) {
 		if (signal_pending(current))
 			return -EINTR;
@@ -164,14 +170,34 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 				continue;
 			}
 		}
-		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+		/*
+		 * Reset the slot for all the harvested entries that have been
+		 * gathered, but not yet fully processed.
+		 */
+		if (mask)
+			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+		/*
+		 * The current slot was reset or this is the first harvested
+		 * entry, (re)initialize the metadata.
+		 */
 		cur_slot = next_slot;
 		cur_offset = next_offset;
 		mask = 1;
 		first_round = false;
 	}
 
-	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+	/*
+	 * Perform a final reset if there are harvested entries that haven't
+	 * been processed, which is guaranteed if at least one harvested was
+	 * found.  The loop only performs a reset when the "next" entry can't
+	 * be batched with "current" the entry(s), and that reset processes the
+	 * _current_ entry(s), i.e. the last harvested entry, a.k.a. next, will
+	 * always be left pending.
+	 */
+	if (mask)
+		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
 
 	/*
 	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared
-- 
2.49.0.1015.ga840276032-goog



* [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets
  2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-05-08 14:10 ` [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
@ 2025-05-08 14:10 ` Sean Christopherson
  2025-05-12 22:33   ` James Houghton
  2025-05-13 12:16   ` Gupta, Pankaj
  4 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-05-08 14:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky,
	Sean Christopherson

Use "mask" instead of a dedicated boolean to track whether or not there
is at least one to-be-reset entry for the current slot+offset.  In the
body of the loop, mask is zero only on the first iteration, i.e. !mask is
equivalent to first_round.

Opportunstically combine the adjacent "if (mask)" statements into a single
if-statement.

No function change intended.
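
For reference, the "Backwards visit" guard that survives the refactor,
`(mask << -delta >> -delta) == mask`, checks that left-shifting the mask
by -delta doesn't drop bits off the top of the unsigned long; a toy model
with explicit 64-bit truncation (illustration only, not kernel code):

```python
BITS_PER_LONG = 64
ULONG_MASK = (1 << BITS_PER_LONG) - 1

def can_shift_back(mask, delta):
    """Model the C test '(mask << -delta >> -delta) == mask' on an
    unsigned long: the round-trip preserves mask only if the left
    shift loses no high bits."""
    assert -BITS_PER_LONG < delta < 0
    shifted = (mask << -delta) & ULONG_MASK   # C shift truncates
    return (shifted >> -delta) == mask

# A small mask can absorb a one-step backwards visit...
assert can_shift_back(0b111, -1)
# ...but a mask already occupying the top bit cannot.
assert not can_shift_back(1 << 63, -1)
```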

Cc: Peter Xu <peterx@redhat.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/dirty_ring.c | 60 +++++++++++++++++++++----------------------
 1 file changed, 29 insertions(+), 31 deletions(-)

diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index a3434be8f00d..934828d729e5 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -121,7 +121,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 	u64 cur_offset, next_offset;
 	unsigned long mask = 0;
 	struct kvm_dirty_gfn *entry;
-	bool first_round = true;
 
 	while (likely((*nr_entries_reset) < INT_MAX)) {
 		if (signal_pending(current))
@@ -141,42 +140,42 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 		ring->reset_index++;
 		(*nr_entries_reset)++;
 
-		/*
-		 * While the size of each ring is fixed, it's possible for the
-		 * ring to be constantly re-dirtied/harvested while the reset
-		 * is in-progress (the hard limit exists only to guard against
-		 * wrapping the count into negative space).
-		 */
-		if (!first_round)
+		if (mask) {
+			/*
+			 * While the size of each ring is fixed, it's possible
+			 * for the ring to be constantly re-dirtied/harvested
+			 * while the reset is in-progress (the hard limit exists
+			 * only to guard against the count becoming negative).
+			 */
 			cond_resched();
 
-		/*
-		 * Try to coalesce the reset operations when the guest is
-		 * scanning pages in the same slot.
-		 */
-		if (!first_round && next_slot == cur_slot) {
-			s64 delta = next_offset - cur_offset;
+			/*
+			 * Try to coalesce the reset operations when the guest
+			 * is scanning pages in the same slot.
+			 */
+			if (next_slot == cur_slot) {
+				s64 delta = next_offset - cur_offset;
 
-			if (delta >= 0 && delta < BITS_PER_LONG) {
-				mask |= 1ull << delta;
-				continue;
-			}
+				if (delta >= 0 && delta < BITS_PER_LONG) {
+					mask |= 1ull << delta;
+					continue;
+				}
 
-			/* Backwards visit, careful about overflows!  */
-			if (delta > -BITS_PER_LONG && delta < 0 &&
-			    (mask << -delta >> -delta) == mask) {
-				cur_offset = next_offset;
-				mask = (mask << -delta) | 1;
-				continue;
+				/* Backwards visit, careful about overflows! */
+				if (delta > -BITS_PER_LONG && delta < 0 &&
+				    (mask << -delta >> -delta) == mask) {
+					cur_offset = next_offset;
+					mask = (mask << -delta) | 1;
+					continue;
+				}
 			}
-		}
 
-		/*
-		 * Reset the slot for all the harvested entries that have been
-		 * gathered, but not yet fully processed.
-		 */
-		if (mask)
+			/*
+			 * Reset the slot for all the harvested entries that
+			 * have been gathered, but not yet fully processed.
+			 */
 			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+		}
 
 		/*
 		 * The current slot was reset or this is the first harvested
@@ -185,7 +184,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
 		cur_slot = next_slot;
 		cur_offset = next_offset;
 		mask = 1;
-		first_round = false;
 	}
 
 	/*
-- 
2.49.0.1015.ga840276032-goog



* Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-05-08 14:10 ` [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
@ 2025-05-12 22:02   ` James Houghton
  2025-05-13 14:13     ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: James Houghton @ 2025-05-12 22:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky

On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
>
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first.  The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
>
> Note!  Take care to check for reschedule even in the "continue" paths,
> as a pathological scenario (or malicious userspace) could dirty the same
> gfn over and over, i.e. always hit the continue path.
>
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:  4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
>   rcu:  (t=5250 jiffies g=-319 q=608 ncpus=24)
>   CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/dirty_ring.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index e844e869e8c7..97cca0c02fd1 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>
>                 ring->reset_index++;
>                 (*nr_entries_reset)++;
> +
> +               /*
> +                * While the size of each ring is fixed, it's possible for the
> +                * ring to be constantly re-dirtied/harvested while the reset
> +                * is in-progress (the hard limit exists only to guard against
> +                * wrapping the count into negative space).
> +                */
> +               if (!first_round)
> +                       cond_resched();

Should we be dropping slots_lock here?

It seems like we need to be holding slots_lock to call
kvm_reset_dirty_gfn(), but that's it. Userspace can already change the
memslots after enabling the dirty ring, so `entry->slot` can already
be stale, so dropping slots_lock for the cond_resched() seems harmless
(and better than not dropping it).


* Re: [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets
  2025-05-08 14:10 ` [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
@ 2025-05-12 22:33   ` James Houghton
  2025-05-13 12:16   ` Gupta, Pankaj
  1 sibling, 0 replies; 15+ messages in thread
From: James Houghton @ 2025-05-12 22:33 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky

On Thu, May 8, 2025 at 7:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Use "mask" instead of a dedicated boolean to track whether or not there
> is at least one to-be-reset entry for the current slot+offset.  In the
> body of the loop, mask is zero only on the first iteration, i.e. !mask is
> equivalent to first_round.
>
> Opportunstically combine the adjacent "if (mask)" statements into a single
> if-statement.
>
> No function change intended.

nit: "Opportunistically" and "functional" :)

>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Thanks, Sean! This logic is much easier to read now.

Feel free to add to the entire series if you'd like:

Reviewed-by: James Houghton <jthoughton@google.com>


* Re: [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX
  2025-05-08 14:10 ` [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
@ 2025-05-13  1:25   ` Binbin Wu
  0 siblings, 0 replies; 15+ messages in thread
From: Binbin Wu @ 2025-05-13  1:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky



On 5/8/2025 10:10 PM, Sean Christopherson wrote:
> Cap the number of ring entries that are reset in a single ioctl to INT_MAX
> to ensure userspace isn't confused by a wrap into negative space, and so
> that, in a truly pathological scenario, KVM doesn't miss a TLB flush due
> to the count wrapping to zero.  While the size of the ring is fixed at
> 0x10000 entries and KVM (currently) supports at most 4096 vCPUs, userspace
> is allowed to harvest entries from the ring while the reset is in-progress,
> i.e. it's possible for the ring to always have harvested entries.
>
> Opportunistically return an actual error code from the helper so that a
> future fix to handle pending signals can gracefully return -EINTR.
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   include/linux/kvm_dirty_ring.h |  8 +++++---
>   virt/kvm/dirty_ring.c          | 10 +++++-----
>   virt/kvm/kvm_main.c            |  9 ++++++---
>   3 files changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
> index da4d9b5f58f1..ee61ff6c3fe4 100644
> --- a/include/linux/kvm_dirty_ring.h
> +++ b/include/linux/kvm_dirty_ring.h
> @@ -49,9 +49,10 @@ static inline int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *r
>   }
>   
>   static inline int kvm_dirty_ring_reset(struct kvm *kvm,
> -				       struct kvm_dirty_ring *ring)
> +				       struct kvm_dirty_ring *ring,
> +				       int *nr_entries_reset)
>   {
> -	return 0;
> +	return -ENOENT;
>   }
>   
>   static inline void kvm_dirty_ring_push(struct kvm_vcpu *vcpu,
> @@ -82,7 +83,8 @@ int kvm_dirty_ring_alloc(struct kvm *kvm, struct kvm_dirty_ring *ring,
>    * called with kvm->slots_lock held, returns the number of
>    * processed pages.
>    */
The comment should be updated as well, since the return value is not the
number of processed pages now.


> -int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
> +int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> +			 int *nr_entries_reset);
>   
[...]


* Re: [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-05-08 14:10 ` [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
@ 2025-05-13  9:17   ` Binbin Wu
  2025-05-13 12:51   ` Gupta, Pankaj
  1 sibling, 0 replies; 15+ messages in thread
From: Binbin Wu @ 2025-05-13  9:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky



On 5/8/2025 10:10 PM, Sean Christopherson wrote:
> When resetting a dirty ring, explicitly check that there is work to be
> done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries
> are found and/or on the loop's first iteration, and delete the extremely
> misleading comment "This is only needed to make compilers happy".  KVM
> absolutely relies on mask to be zero-initialized, i.e. the comment is an
> outright lie.  Furthermore, the compiler is right to complain that KVM is
> calling a function with uninitialized data, as there are no guarantees
> the implementation details of kvm_reset_dirty_gfn() will be visible to
> kvm_dirty_ring_reset().
>
> While the flaw could be fixed by simply deleting (or rewording) the
> comment, and duplicating the check is unfortunate, checking mask in the
> caller will allow for additional cleanups.
>
> Opportunisticaly drop the zero-initialization of cur_slot and cur_offset.

Opportunisticaly -> Opportunistically


> If a bug were introduced where either the slot or offset was consumed
> before mask is set to a non-zero value, then it is highly desirable for
> the compiler (or some other sanitizer) to yell.
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   virt/kvm/dirty_ring.c | 44 ++++++++++++++++++++++++++++++++++---------
>   1 file changed, 35 insertions(+), 9 deletions(-)
>
[...]
>   
> -	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +	/*
> +	 * Perform a final reset if there are harvested entries that haven't
> +	 * been processed, which is guaranteed if at least one harvested entry
> +	 * was found.  The loop only performs a reset when the "next" entry can't
> +	 * be batched with "current" the entry(s), and that reset processes the
"current" the entry(s) -> the "current" entry(s) ?

> +	 * _current_ entry(s), i.e. the last harvested entry, a.k.a. next, will
> +	 * always be left pending.
> +	 */
> +	if (mask)
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
>   
>   	/*
>   	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets
  2025-05-08 14:10 ` [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
  2025-05-12 22:33   ` James Houghton
@ 2025-05-13 12:16   ` Gupta, Pankaj
  1 sibling, 0 replies; 15+ messages in thread
From: Gupta, Pankaj @ 2025-05-13 12:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky

On 5/8/2025 4:10 PM, Sean Christopherson wrote:
> Use "mask" instead of a dedicated boolean to track whether or not there
> is at least one to-be-reset entry for the current slot+offset.  In the
> body of the loop, mask is zero only on the first iteration, i.e. !mask is
> equivalent to first_round.
> 
> Opportunistically combine the adjacent "if (mask)" statements into a single
> if-statement.
> 
> No functional change intended.
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

Thanks!

> ---
>   virt/kvm/dirty_ring.c | 60 +++++++++++++++++++++----------------------
>   1 file changed, 29 insertions(+), 31 deletions(-)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index a3434be8f00d..934828d729e5 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -121,7 +121,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>   	u64 cur_offset, next_offset;
>   	unsigned long mask = 0;
>   	struct kvm_dirty_gfn *entry;
> -	bool first_round = true;
>   
>   	while (likely((*nr_entries_reset) < INT_MAX)) {
>   		if (signal_pending(current))
> @@ -141,42 +140,42 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>   		ring->reset_index++;
>   		(*nr_entries_reset)++;
>   
> -		/*
> -		 * While the size of each ring is fixed, it's possible for the
> -		 * ring to be constantly re-dirtied/harvested while the reset
> -		 * is in-progress (the hard limit exists only to guard against
> -		 * wrapping the count into negative space).
> -		 */
> -		if (!first_round)
> +		if (mask) {
> +			/*
> +			 * While the size of each ring is fixed, it's possible
> +			 * for the ring to be constantly re-dirtied/harvested
> +			 * while the reset is in-progress (the hard limit exists
> +			 * only to guard against the count becoming negative).
> +			 */
>   			cond_resched();
>   
> -		/*
> -		 * Try to coalesce the reset operations when the guest is
> -		 * scanning pages in the same slot.
> -		 */
> -		if (!first_round && next_slot == cur_slot) {
> -			s64 delta = next_offset - cur_offset;
> +			/*
> +			 * Try to coalesce the reset operations when the guest
> +			 * is scanning pages in the same slot.
> +			 */
> +			if (next_slot == cur_slot) {
> +				s64 delta = next_offset - cur_offset;
>   
> -			if (delta >= 0 && delta < BITS_PER_LONG) {
> -				mask |= 1ull << delta;
> -				continue;
> -			}
> +				if (delta >= 0 && delta < BITS_PER_LONG) {
> +					mask |= 1ull << delta;
> +					continue;
> +				}
>   
> -			/* Backwards visit, careful about overflows!  */
> -			if (delta > -BITS_PER_LONG && delta < 0 &&
> -			    (mask << -delta >> -delta) == mask) {
> -				cur_offset = next_offset;
> -				mask = (mask << -delta) | 1;
> -				continue;
> +				/* Backwards visit, careful about overflows! */
> +				if (delta > -BITS_PER_LONG && delta < 0 &&
> +				    (mask << -delta >> -delta) == mask) {
> +					cur_offset = next_offset;
> +					mask = (mask << -delta) | 1;
> +					continue;
> +				}
>   			}
> -		}
>   
> -		/*
> -		 * Reset the slot for all the harvested entries that have been
> -		 * gathered, but not yet fully processed.
> -		 */
> -		if (mask)
> +			/*
> +			 * Reset the slot for all the harvested entries that
> +			 * have been gathered, but not yet fully processed.
> +			 */
>   			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +		}
>   
>   		/*
>   		 * The current slot was reset or this is the first harvested
> @@ -185,7 +184,6 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>   		cur_slot = next_slot;
>   		cur_offset = next_offset;
>   		mask = 1;
> -		first_round = false;
>   	}
>   
>   	/*



* Re: [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller
  2025-05-08 14:10 ` [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
  2025-05-13  9:17   ` Binbin Wu
@ 2025-05-13 12:51   ` Gupta, Pankaj
  1 sibling, 0 replies; 15+ messages in thread
From: Gupta, Pankaj @ 2025-05-13 12:51 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Peter Xu, Yan Zhao, Maxim Levitsky

On 5/8/2025 4:10 PM, Sean Christopherson wrote:
> When resetting a dirty ring, explicitly check that there is work to be
> done before calling kvm_reset_dirty_gfn(), e.g. if no harvested entries
> are found and/or on the loop's first iteration, and delete the extremely
> misleading comment "This is only needed to make compilers happy".  KVM
> absolutely relies on mask to be zero-initialized, i.e. the comment is an
> outright lie.  Furthermore, the compiler is right to complain that KVM is
> calling a function with uninitialized data, as there are no guarantees
> the implementation details of kvm_reset_dirty_gfn() will be visible to
> kvm_dirty_ring_reset().
> 
> While the flaw could be fixed by simply deleting (or rewording) the
> comment, and duplicating the check is unfortunate, checking mask in the
> caller will allow for additional cleanups.
> 
> Opportunisticaly drop the zero-initialization of cur_slot and cur_offset.
> If a bug were introduced where either the slot or offset was consumed
> before mask is set to a non-zero value, then it is highly desirable for
> the compiler (or some other sanitizer) to yell.
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Yan Zhao <yan.y.zhao@intel.com>
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

LGTM

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>

> ---
>   virt/kvm/dirty_ring.c | 44 ++++++++++++++++++++++++++++++++++---------
>   1 file changed, 35 insertions(+), 9 deletions(-)
> 
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index 97cca0c02fd1..a3434be8f00d 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -55,9 +55,6 @@ static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
>   	struct kvm_memory_slot *memslot;
>   	int as_id, id;
>   
> -	if (!mask)
> -		return;
> -
>   	as_id = slot >> 16;
>   	id = (u16)slot;
>   
> @@ -108,15 +105,24 @@ static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
>   int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>   			 int *nr_entries_reset)
>   {
> +	/*
> +	 * To minimize mmu_lock contention, batch resets for harvested entries
> +	 * whose gfns are in the same slot, and are within N frame numbers of
> +	 * each other, where N is the number of bits in an unsigned long.  For
> +	 * simplicity, process the current set of entries when the next entry
> +	 * can't be included in the batch.
> +	 *
> +	 * Track the current batch slot, the gfn offset into the slot for the
> +	 * batch, and the bitmask of gfns that need to be reset (relative to
> +	 * offset).  Note, the offset may be adjusted backwards, e.g. so that
> +	 * a sequence of gfns X, X-1, ... X-N can be batched.
> +	 */
>   	u32 cur_slot, next_slot;
>   	u64 cur_offset, next_offset;
> -	unsigned long mask;
> +	unsigned long mask = 0;
>   	struct kvm_dirty_gfn *entry;
>   	bool first_round = true;
>   
> -	/* This is only needed to make compilers happy */
> -	cur_slot = cur_offset = mask = 0;
> -
>   	while (likely((*nr_entries_reset) < INT_MAX)) {
>   		if (signal_pending(current))
>   			return -EINTR;
> @@ -164,14 +170,34 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>   				continue;
>   			}
>   		}
> -		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +		/*
> +		 * Reset the slot for all the harvested entries that have been
> +		 * gathered, but not yet fully processed.
> +		 */
> +		if (mask)
> +			kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +
> +		/*
> +		 * The current slot was reset or this is the first harvested
> +		 * entry, (re)initialize the metadata.
> +		 */
>   		cur_slot = next_slot;
>   		cur_offset = next_offset;
>   		mask = 1;
>   		first_round = false;
>   	}
>   
> -	kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
> +	/*
> +	 * Perform a final reset if there are harvested entries that haven't
> +	 * been processed, which is guaranteed if at least one harvested entry
> +	 * was found.  The loop only performs a reset when the "next" entry can't
> +	 * be batched with "current" the entry(s), and that reset processes the
> +	 * _current_ entry(s), i.e. the last harvested entry, a.k.a. next, will
> +	 * always be left pending.
> +	 */
> +	if (mask)
> +		kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
>   
>   	/*
>   	 * The request KVM_REQ_DIRTY_RING_SOFT_FULL will be cleared



* Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-05-12 22:02   ` James Houghton
@ 2025-05-13 14:13     ` Sean Christopherson
  2025-05-13 22:27       ` James Houghton
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-05-13 14:13 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky

On Mon, May 12, 2025, James Houghton wrote:
> On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > ---
> >  virt/kvm/dirty_ring.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index e844e869e8c7..97cca0c02fd1 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> >
> >                 ring->reset_index++;
> >                 (*nr_entries_reset)++;
> > +
> > +               /*
> > +                * While the size of each ring is fixed, it's possible for the
> > +                * ring to be constantly re-dirtied/harvested while the reset
> > +                * is in-progress (the hard limit exists only to guard against
> > +                * wrapping the count into negative space).
> > +                */
> > +               if (!first_round)
> > +                       cond_resched();
> 
> Should we be dropping slots_lock here?

Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
practice, it's extremely unlikely anything will be waiting on slots_lock.

kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
competing calls to reset other rings.  A well-behaved userspace won't be modifying
memslots or dirty logs, and won't be toggling nx_huge_pages.

That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
Except for snp_launch_update(), those are all brutally slow paths, e.g. require
SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
fairly slow too.

And dropping slots_lock only makes any sense for non-preemptible kernels, because
preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

It's also possible that dropping slots_lock in this case could be a net negative.
I don't think that's likely, but I don't think it's any more or less likely than
dropping slots_lock being a net positive.  Without performance data to guide us,
it'd be little more than a guess, and I really, really don't want to set a
precedent of dropping a mutex on cond_resched() without a very strong reason
for doing so.

> It seems like we need to be holding slots_lock to call kvm_reset_dirty_gfn(),
> but that's it. Userspace can already change the memslots after enabling the
> dirty ring, so `entry->slot` can already be stale, so dropping slots_lock for
> the cond_resched() seems harmless (and better than not dropping it).


* Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-05-13 14:13     ` Sean Christopherson
@ 2025-05-13 22:27       ` James Houghton
  2025-05-14 14:24         ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: James Houghton @ 2025-05-13 22:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky

On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, May 12, 2025, James Houghton wrote:
> > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > ---
> > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > index e844e869e8c7..97cca0c02fd1 100644
> > > --- a/virt/kvm/dirty_ring.c
> > > +++ b/virt/kvm/dirty_ring.c
> > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > >
> > >                 ring->reset_index++;
> > >                 (*nr_entries_reset)++;
> > > +
> > > +               /*
> > > +                * While the size of each ring is fixed, it's possible for the
> > > +                * ring to be constantly re-dirtied/harvested while the reset
> > > +                * is in-progress (the hard limit exists only to guard against
> > > +                * wrapping the count into negative space).
> > > +                */
> > > +               if (!first_round)
> > > +                       cond_resched();
> >
> > Should we be dropping slots_lock here?
>
> Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
> practice, it's extremely unlikely anything will be waiting on slots_lock.
>
> kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> competing calls to reset other rings.  A well-behaved userspace won't be modifying
> memslots or dirty logs, and won't be toggling nx_huge_pages.
>
> That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
> Except for snp_launch_update(), those are all brutally slow paths, e.g. require
> SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
> fairly slow too.

Okay, that makes sense.

> And dropping slots_lock only makes any sense for non-preemptible kernels, because
> preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().

I'm not really sure what "equivalent check" you mean, sorry. For
preemptible kernels, we could reschedule at any time, so dropping the
slots_lock on a cond_resched() wouldn't do much, yeah. Hopefully
that's partially what you mean.

> It's also possible that dropping slots_lock in this case could be a net negative.
> I don't think that's likely, but I don't think it's any more or less likely than
> dropping slots_lock being a net positive.  Without performance data to guide us,
> it'd be little more than a guess, and I really, really don't want to set a
> precedent of dropping a mutex on cond_resched() without a very strong reason
> for doing so.

Fair enough.

Also, while we're at it, could you add a
`lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not
necessarily in this patch.


* Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring
  2025-05-13 22:27       ` James Houghton
@ 2025-05-14 14:24         ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-05-14 14:24 UTC (permalink / raw)
  To: James Houghton
  Cc: Paolo Bonzini, kvm, linux-kernel, Peter Xu, Yan Zhao,
	Maxim Levitsky

On Tue, May 13, 2025, James Houghton wrote:
> On Tue, May 13, 2025 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
> > On Mon, May 12, 2025, James Houghton wrote:
> > > On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > ---
> > > >  virt/kvm/dirty_ring.c | 10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > > > index e844e869e8c7..97cca0c02fd1 100644
> > > > --- a/virt/kvm/dirty_ring.c
> > > > +++ b/virt/kvm/dirty_ring.c
> > > > @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
> > > >
> > > >                 ring->reset_index++;
> > > >                 (*nr_entries_reset)++;
> > > > +
> > > > +               /*
> > > > +                * While the size of each ring is fixed, it's possible for the
> > > > +                * ring to be constantly re-dirtied/harvested while the reset
> > > > +                * is in-progress (the hard limit exists only to guard against
> > > > +                * wrapping the count into negative space).
> > > > +                */
> > > > +               if (!first_round)
> > > > +                       cond_resched();
> > >
> > > Should we be dropping slots_lock here?
> >
> > Could we?  Yes.  Should we?  Eh.  I don't see any value in doing so, because in
> > practice, it's extremely unlikely anything will be waiting on slots_lock.
> >
> > kvm_vm_ioctl_reset_dirty_pages() operates on all vCPUs, i.e. there won't be
> > competing calls to reset other rings.  A well-behaved userspace won't be modifying
> > memslots or dirty logs, and won't be toggling nx_huge_pages.
> >
> > That leaves kvm_vm_ioctl_set_mem_attributes(), kvm_inhibit_apic_access_page(),
> > kvm_assign_ioeventfd(), snp_launch_update(), and coalesced IO/MMIO (un)registration.
> > Except for snp_launch_update(), those are all brutally slow paths, e.g. require
> > SRCU synchronization and/or zapping of SPTEs.  And snp_launch_update() is probably
> > fairly slow too.
> 
> Okay, that makes sense.

As discussed offlist, dropping slots_lock would also be functionally problematic,
as concurrent calls to KVM_RESET_DIRTY_RINGS could get interwoven, which could
result in one of the calls returning to userspace without actually completing the
reset, i.e. if a different task has reaped entries but not yet called
kvm_reset_dirty_gfn().

> > And dropping slots_lock only makes any sense for non-preemptible kernels, because
> > preemptible kernels include an equivalent check in KVM_MMU_UNLOCK().
> 
> I'm not really sure what "equivalent check" you mean, sorry. For preemptible
> kernels, we could reschedule at any time, so dropping the slots_lock on a
> cond_resched() wouldn't do much, yeah. Hopefully that's partially what you
> mean.

Ya, that's essentially what I mean.  What I was referencing with KVM_MMU_UNLOCK()
is the explicit check for NEED_RESCHED that happens when the preempt count hits
'0' on preemptible kernels.

> > It's also possible that dropping slots_lock in this case could be a net negative.
> > I don't think that's likely, but I don't think it's any more or less likely than
> > dropping slots_lock being a net positive.  Without performance data to guide us,
> > it'd be little more than a guess, and I really, really don't want to set a
> > precedent of dropping a mutex on cond_resched() without a very strong reason
> > for doing so.
> 
> Fair enough.
> 
> Also, while we're at it, could you add a
> `lockdep_assert_held(&kvm->slots_lock)` to this function? :) Not necessarily
> in this patch.

Heh, yep, my mind jumped to that as well.  I'll tack on a patch to add a lockdep
assertion, along with a comment explaining what all it protects.


end of thread, other threads:[~2025-05-14 14:24 UTC | newest]

Thread overview: 15+ messages
2025-05-08 14:10 [PATCH v2 0/5] KVM: Dirty ring fixes and cleanups Sean Christopherson
2025-05-08 14:10 ` [PATCH v2 1/5] KVM: Bound the number of dirty ring entries in a single reset at INT_MAX Sean Christopherson
2025-05-13  1:25   ` Binbin Wu
2025-05-08 14:10 ` [PATCH v2 2/5] KVM: Bail from the dirty ring reset flow if a signal is pending Sean Christopherson
2025-05-08 14:10 ` [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring Sean Christopherson
2025-05-12 22:02   ` James Houghton
2025-05-13 14:13     ` Sean Christopherson
2025-05-13 22:27       ` James Houghton
2025-05-14 14:24         ` Sean Christopherson
2025-05-08 14:10 ` [PATCH v2 4/5] KVM: Check for empty mask of harvested dirty ring entries in caller Sean Christopherson
2025-05-13  9:17   ` Binbin Wu
2025-05-13 12:51   ` Gupta, Pankaj
2025-05-08 14:10 ` [PATCH v2 5/5] KVM: Use mask of harvested dirty ring entries to coalesce dirty ring resets Sean Christopherson
2025-05-12 22:33   ` James Houghton
2025-05-13 12:16   ` Gupta, Pankaj
