kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/13] KVM: Make irqfd registration globally unique
@ 2025-05-22 23:52 Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd Sean Christopherson
                   ` (15 more replies)
  0 siblings, 16 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Non-KVM folks,

I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM
changes should be glorified nops.  Please holler if you object to that idea.

Hyper-V folks in particular, let me know if you want a stable topic branch/tag,
e.g. on the off chance you want to make similar changes to the Hyper-V code,
and I'll make sure that happens.


As for what this series actually does...

Rework KVM's irqfd registration to require that an eventfd is bound to at
most one irqfd throughout the entire system.  KVM currently disallows
binding an eventfd to multiple irqfds for a single VM, but doesn't reject
attempts to bind an eventfd to multiple VMs.

This is obviously an ABI change, but I'm fairly confident that it won't
break userspace, because binding an eventfd to multiple irqfds hasn't
truly worked since commit e8dbf19508a1 ("kvm/eventfd: Use priority waitqueue
to catch events before userspace").  A somewhat undocumented, and perhaps
even unintentional, side effect of suppressing eventfd notifications for
userspace is that the priority+exclusive behavior also suppresses eventfd
notifications for any subsequent waiters, even if they are priority waiters.
I.e. only the first VM with an irqfd+eventfd binding will get notifications.

And for IRQ bypass, a.k.a. device posted interrupts, globally unique
bindings are a hard requirement (at least on x86; I assume other archs are
the same).  KVM and the IRQ bypass manager kinda sorta handle this, but in
the absolute worst way possible (IMO).  Instead of surfacing an error to
userspace, KVM silently ignores IRQ bypass registration errors.

The motivation for this series is to harden against userspace goofs.  AFAIK,
we (Google) have never actually had a bug where userspace tries to assign
an eventfd to multiple VMs, but the possibility has come up in more than one
bug investigation (our intra-host, a.k.a. copyless, migration scheme
transfers eventfds from the old to the new VM when updating the host VMM).

v3:
 - Retain WQ_FLAG_EXCLUSIVE in mshv_eventfd.c, which snuck in between v1
   and v2. [Peter]
 - Use EXPORT_SYMBOL_GPL. [Peter]
 - Move WQ_FLAG_EXCLUSIVE out of add_wait_queue_priority() in a prep patch
   so that the affected subsystems are more explicitly documented (and then
   immediately drop the flag from drivers/xen/privcmd.c, which amusingly
   hides that file from the diff stats).

v2:
 - https://lore.kernel.org/all/20250519185514.2678456-1-seanjc@google.com
 - Use guard(spinlock_irqsave). [Prateek]

v1: https://lore.kernel.org/all/20250401204425.904001-1-seanjc@google.com


Sean Christopherson (13):
  KVM: Use a local struct to do the initial vfs_poll() on an irqfd
  KVM: Acquire SCRU lock outside of irqfds.lock during assignment
  KVM: Initialize irqfd waitqueue callback when adding to the queue
  KVM: Add irqfd to KVM's list via the vfs_poll() callback
  KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
  sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
  xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
  sched/wait: Add a waitqueue helper for fully exclusive priority
    waiters
  KVM: Disallow binding multiple irqfds to an eventfd with a priority
    waiter
  KVM: Drop sanity check that per-VM list of irqfds is unique
  KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
  KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
  KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements

 drivers/hv/mshv_eventfd.c                     |   8 ++
 include/linux/kvm_irqfd.h                     |   1 -
 include/linux/wait.h                          |   2 +
 kernel/sched/wait.c                           |  22 ++-
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 tools/testing/selftests/kvm/arm64/vgic_irq.c  |  12 +-
 .../testing/selftests/kvm/include/kvm_util.h  |  40 ++++++
 tools/testing/selftests/kvm/irqfd_test.c      | 130 ++++++++++++++++++
 .../selftests/kvm/x86/xen_shinfo_test.c       |  21 +--
 virt/kvm/eventfd.c                            | 130 +++++++++++++-----
 10 files changed, 302 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/irqfd_test.c


base-commit: 45eb29140e68ffe8e93a5471006858a018480a45
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3 01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 02/13] KVM: Acquire SCRU lock outside of irqfds.lock during assignment Sean Christopherson
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Use a function-local struct for the poll_table passed to vfs_poll(), as
nothing in the vfs_poll() callchain grabs a long-term reference to the
structure, i.e. its lifetime doesn't need to be tied to the irqfd.  Using
a local structure will also allow propagating failures out of the polling
callback without further polluting kvm_kernel_irqfd.

Opportunstically rename irqfd_ptable_queue_proc() to kvm_irqfd_register()
to capture what it actually does.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_irqfd.h |  1 -
 virt/kvm/eventfd.c        | 26 +++++++++++++++++---------
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_irqfd.h b/include/linux/kvm_irqfd.h
index 8ad43692e3bb..44fd2a20b09e 100644
--- a/include/linux/kvm_irqfd.h
+++ b/include/linux/kvm_irqfd.h
@@ -55,7 +55,6 @@ struct kvm_kernel_irqfd {
 	/* Used for setup/shutdown */
 	struct eventfd_ctx *eventfd;
 	struct list_head list;
-	poll_table pt;
 	struct work_struct shutdown;
 	struct irq_bypass_consumer consumer;
 	struct irq_bypass_producer *producer;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 11e5d1e3f12e..39e42b19d9f7 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -245,12 +245,17 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	return ret;
 }
 
-static void
-irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
-			poll_table *pt)
+struct kvm_irqfd_pt {
+	struct kvm_kernel_irqfd *irqfd;
+	poll_table pt;
+};
+
+static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
+			       poll_table *pt)
 {
-	struct kvm_kernel_irqfd *irqfd =
-		container_of(pt, struct kvm_kernel_irqfd, pt);
+	struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);
+	struct kvm_kernel_irqfd *irqfd = p->irqfd;
+
 	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
@@ -305,6 +310,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct kvm_kernel_irqfd *irqfd, *tmp;
 	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
+	struct kvm_irqfd_pt irqfd_pt;
 	int ret;
 	__poll_t events;
 	int idx;
@@ -394,7 +400,6 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	 * a callback whenever someone signals the underlying eventfd
 	 */
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
-	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
 
 	spin_lock_irq(&kvm->irqfds.lock);
 
@@ -416,11 +421,14 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	spin_unlock_irq(&kvm->irqfds.lock);
 
 	/*
-	 * Check if there was an event already pending on the eventfd
-	 * before we registered, and trigger it as if we didn't miss it.
+	 * Register the irqfd with the eventfd by polling on the eventfd.  If
+	 * there was en event pending on the eventfd prior to registering,
+	 * manually trigger IRQ injection.
 	 */
-	events = vfs_poll(fd_file(f), &irqfd->pt);
+	irqfd_pt.irqfd = irqfd;
+	init_poll_funcptr(&irqfd_pt.pt, kvm_irqfd_register);
 
+	events = vfs_poll(fd_file(f), &irqfd_pt.pt);
 	if (events & EPOLLIN)
 		schedule_work(&irqfd->inject);
 
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 02/13] KVM: Acquire SCRU lock outside of irqfds.lock during assignment
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 03/13] KVM: Initialize irqfd waitqueue callback when adding to the queue Sean Christopherson
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Acquire SRCU outside of irqfds.lock so that the locking is symmetrical,
and add a comment explaining why on earth KVM holds SRCU for so long.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 39e42b19d9f7..42c02c35e542 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -401,6 +401,18 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	 */
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
 
+	/*
+	 * Set the irqfd routing and add it to KVM's list before registering
+	 * the irqfd with the eventfd, so that the routing information is valid
+	 * and stays valid, e.g. if there are GSI routing changes, prior to
+	 * making the irqfd visible, i.e. before it might be signaled.
+	 *
+	 * Note, holding SRCU ensures a stable read of routing information, and
+	 * also prevents irqfd_shutdown() from freeing the irqfd before it's
+	 * fully initialized.
+	 */
+	idx = srcu_read_lock(&kvm->irq_srcu);
+
 	spin_lock_irq(&kvm->irqfds.lock);
 
 	ret = 0;
@@ -409,11 +421,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 			continue;
 		/* This fd is used for another irq already. */
 		ret = -EBUSY;
-		spin_unlock_irq(&kvm->irqfds.lock);
-		goto fail;
+		goto fail_duplicate;
 	}
 
-	idx = srcu_read_lock(&kvm->irq_srcu);
 	irqfd_update(kvm, irqfd);
 
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
@@ -449,6 +459,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return 0;
 
+fail_duplicate:
+	spin_unlock_irq(&kvm->irqfds.lock);
+	srcu_read_unlock(&kvm->irq_srcu, idx);
 fail:
 	if (irqfd->resampler)
 		irqfd_resampler_shutdown(irqfd);
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 03/13] KVM: Initialize irqfd waitqueue callback when adding to the queue
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 02/13] KVM: Acquire SCRU lock outside of irqfds.lock during assignment Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 04/13] KVM: Add irqfd to KVM's list via the vfs_poll() callback Sean Christopherson
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Initialize the irqfd waitqueue callback immediately prior to inserting the
irqfd into the eventfd's waitqueue.  Pre-initializing the state in a
completely different context is all kinds of confusing, and incorrectly
suggests that the waitqueue function needs to be initialize prior to
vfs_poll().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 42c02c35e542..8b9a87daa2bb 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -256,6 +256,13 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 	struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);
 	struct kvm_kernel_irqfd *irqfd = p->irqfd;
 
+	/*
+	 * Add the irqfd as a priority waiter on the eventfd, with a custom
+	 * wake-up handler, so that KVM *and only KVM* is notified whenever the
+	 * underlying eventfd is signaled.
+	 */
+	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
+
 	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
@@ -395,12 +402,6 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 		mutex_unlock(&kvm->irqfds.resampler_lock);
 	}
 
-	/*
-	 * Install our own custom wake-up handling so we are notified via
-	 * a callback whenever someone signals the underlying eventfd
-	 */
-	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
-
 	/*
 	 * Set the irqfd routing and add it to KVM's list before registering
 	 * the irqfd with the eventfd, so that the routing information is valid
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 04/13] KVM: Add irqfd to KVM's list via the vfs_poll() callback
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 03/13] KVM: Initialize irqfd waitqueue callback when adding to the queue Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 05/13] KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock Sean Christopherson
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Add the irqfd structure to KVM's list of irqfds in kvm_irqfd_register(),
i.e. via the vfs_poll() callback.  This will allow taking irqfds.lock
across the entire registration sequence (add to waitqueue, add to list),
and more importantly will allow inserting into KVM's list if and only if
adding to the waitqueue succeeds (spoiler alert), without needing to
juggle return codes in weird ways.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 102 +++++++++++++++++++++++++--------------------
 1 file changed, 57 insertions(+), 45 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 8b9a87daa2bb..99274d60335d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -245,34 +245,14 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	return ret;
 }
 
-struct kvm_irqfd_pt {
-	struct kvm_kernel_irqfd *irqfd;
-	poll_table pt;
-};
-
-static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
-			       poll_table *pt)
-{
-	struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);
-	struct kvm_kernel_irqfd *irqfd = p->irqfd;
-
-	/*
-	 * Add the irqfd as a priority waiter on the eventfd, with a custom
-	 * wake-up handler, so that KVM *and only KVM* is notified whenever the
-	 * underlying eventfd is signaled.
-	 */
-	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
-
-	add_wait_queue_priority(wqh, &irqfd->wait);
-}
-
-/* Must be called under irqfds.lock */
 static void irqfd_update(struct kvm *kvm, struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
 	int n_entries;
 
+	lockdep_assert_held(&kvm->irqfds.lock);
+
 	n_entries = kvm_irq_map_gsi(kvm, entries, irqfd->gsi);
 
 	write_seqcount_begin(&irqfd->irq_entry_sc);
@@ -286,6 +266,49 @@ static void irqfd_update(struct kvm *kvm, struct kvm_kernel_irqfd *irqfd)
 	write_seqcount_end(&irqfd->irq_entry_sc);
 }
 
+struct kvm_irqfd_pt {
+	struct kvm_kernel_irqfd *irqfd;
+	struct kvm *kvm;
+	poll_table pt;
+	int ret;
+};
+
+static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
+			       poll_table *pt)
+{
+	struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);
+	struct kvm_kernel_irqfd *irqfd = p->irqfd;
+	struct kvm_kernel_irqfd *tmp;
+	struct kvm *kvm = p->kvm;
+
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
+		if (irqfd->eventfd != tmp->eventfd)
+			continue;
+		/* This fd is used for another irq already. */
+		p->ret = -EBUSY;
+		spin_unlock_irq(&kvm->irqfds.lock);
+		return;
+	}
+
+	irqfd_update(kvm, irqfd);
+
+	list_add_tail(&irqfd->list, &kvm->irqfds.items);
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+	/*
+	 * Add the irqfd as a priority waiter on the eventfd, with a custom
+	 * wake-up handler, so that KVM *and only KVM* is notified whenever the
+	 * underlying eventfd is signaled.
+	 */
+	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
+
+	add_wait_queue_priority(wqh, &irqfd->wait);
+	p->ret = 0;
+}
+
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
 void __attribute__((weak)) kvm_arch_irq_bypass_stop(
 				struct irq_bypass_consumer *cons)
@@ -315,7 +338,7 @@ bool __attribute__((weak)) kvm_arch_irqfd_route_changed(
 static int
 kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	struct kvm_kernel_irqfd *irqfd, *tmp;
+	struct kvm_kernel_irqfd *irqfd;
 	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
 	struct kvm_irqfd_pt irqfd_pt;
 	int ret;
@@ -414,32 +437,22 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	 */
 	idx = srcu_read_lock(&kvm->irq_srcu);
 
-	spin_lock_irq(&kvm->irqfds.lock);
-
-	ret = 0;
-	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
-		if (irqfd->eventfd != tmp->eventfd)
-			continue;
-		/* This fd is used for another irq already. */
-		ret = -EBUSY;
-		goto fail_duplicate;
-	}
-
-	irqfd_update(kvm, irqfd);
-
-	list_add_tail(&irqfd->list, &kvm->irqfds.items);
-
-	spin_unlock_irq(&kvm->irqfds.lock);
-
 	/*
-	 * Register the irqfd with the eventfd by polling on the eventfd.  If
-	 * there was en event pending on the eventfd prior to registering,
-	 * manually trigger IRQ injection.
+	 * Register the irqfd with the eventfd by polling on the eventfd, and
+	 * simultaneously and the irqfd to KVM's list.  If there was en event
+	 * pending on the eventfd prior to registering, manually trigger IRQ
+	 * injection.
 	 */
 	irqfd_pt.irqfd = irqfd;
+	irqfd_pt.kvm = kvm;
 	init_poll_funcptr(&irqfd_pt.pt, kvm_irqfd_register);
 
 	events = vfs_poll(fd_file(f), &irqfd_pt.pt);
+
+	ret = irqfd_pt.ret;
+	if (ret)
+		goto fail_poll;
+
 	if (events & EPOLLIN)
 		schedule_work(&irqfd->inject);
 
@@ -460,8 +473,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 	return 0;
 
-fail_duplicate:
-	spin_unlock_irq(&kvm->irqfds.lock);
+fail_poll:
 	srcu_read_unlock(&kvm->irq_srcu, idx);
 fail:
 	if (irqfd->resampler)
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 05/13] KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 04/13] KVM: Add irqfd to KVM's list via the vfs_poll() callback Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 06/13] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() Sean Christopherson
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Add an irqfd to its target eventfd's waitqueue while holding irqfds.lock,
which is mildly terrifying but functionally safe.  irqfds.lock is taken
inside the waitqueue's lock, but if and only if the eventfd is being
released, i.e. that path is mutually exclusive with registration as KVM
holds a reference to the eventfd (and obviously must do so to avoid UAF).

This will allow using the eventfd's waitqueue to enforce KVM's requirement
that eventfd is assigned to at most one irqfd, without introducing races.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 99274d60335d..04877b297267 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -204,6 +204,11 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	int ret = 0;
 
 	if (flags & EPOLLIN) {
+		/*
+		 * WARNING: Do NOT take irqfds.lock in any path except EPOLLHUP,
+		 * as KVM holds irqfds.lock when registering the irqfd with the
+		 * eventfd.
+		 */
 		u64 cnt;
 		eventfd_ctx_do_read(irqfd->eventfd, &cnt);
 
@@ -225,6 +230,11 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		/* The eventfd is closing, detach from KVM */
 		unsigned long iflags;
 
+		/*
+		 * Taking irqfds.lock is safe here, as KVM holds a reference to
+		 * the eventfd when registering the irqfd, i.e. this path can't
+		 * be reached while kvm_irqfd_add() is running.
+		 */
 		spin_lock_irqsave(&kvm->irqfds.lock, iflags);
 
 		/*
@@ -296,16 +306,21 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 
-	spin_unlock_irq(&kvm->irqfds.lock);
-
 	/*
 	 * Add the irqfd as a priority waiter on the eventfd, with a custom
 	 * wake-up handler, so that KVM *and only KVM* is notified whenever the
-	 * underlying eventfd is signaled.
+	 * underlying eventfd is signaled.  Temporarily lie to lockdep about
+	 * holding irqfds.lock to avoid a false positive regarding potential
+	 * deadlock with irqfd_wakeup() (see irqfd_wakeup() for details).
 	 */
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
 
+	spin_release(&kvm->irqfds.lock.dep_map, _RET_IP_);
 	add_wait_queue_priority(wqh, &irqfd->wait);
+	spin_acquire(&kvm->irqfds.lock.dep_map, 0, 0, _RET_IP_);
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+
 	p->ret = 0;
 }
 
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 06/13] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (4 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 05/13] KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 07/13] xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE Sean Christopherson
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and
instead have callers manually add the flag prior to adding their structure
to the queue.  Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature
of exclusive, priority waiters means that only the first waiter added will
ever receive notifications.

Pushing the flawed behavior to callers will allow fixing the problem one
hypervisor at a time (KVM added the flawed API, and then KVM's code was
copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for
adding an API that provides true exclusivity, i.e. that guarantees at most
one priority waiter is in the queue.

Opportunistically add a comment in Hyper-V to call out the mess.  Xen
privcmd's irqfd_wakefup() doesn't actually operate in exclusive mode, i.e.
can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE.  And KVM is primed to
switch to the aforementioned fully exclusive API, i.e. won't be carrying
the flawed code for long.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/hv/mshv_eventfd.c | 8 ++++++++
 drivers/xen/privcmd.c     | 1 +
 kernel/sched/wait.c       | 4 ++--
 virt/kvm/eventfd.c        | 1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 8dd22be2ca0b..b348928871c2 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -368,6 +368,14 @@ static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
 			container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);
 
 	irqfd->irqfd_wqh = wqh;
+
+	/*
+	 * TODO: Ensure there isn't already an exclusive, priority waiter, e.g.
+	 * that the irqfd isn't already bound to another partition.  Only the
+	 * first exclusive waiter encountered will be notified, and
+	 * add_wait_queue_priority() doesn't enforce exclusivity.
+	 */
+	irqfd->irqfd_wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
 }
 
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 13a10f3294a8..c08ec8a7d27c 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -957,6 +957,7 @@ irqfd_poll_func(struct file *file, wait_queue_head_t *wqh, poll_table *pt)
 	struct privcmd_kernel_irqfd *kirqfd =
 		container_of(pt, struct privcmd_kernel_irqfd, pt);
 
+	kirqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &kirqfd->wait);
 }
 
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..4ab3ab195277 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -40,7 +40,7 @@ void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_
 {
 	unsigned long flags;
 
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	wq_entry->flags |= WQ_FLAG_PRIORITY;
 	spin_lock_irqsave(&wq_head->lock, flags);
 	__add_wait_queue(wq_head, wq_entry);
 	spin_unlock_irqrestore(&wq_head->lock, flags);
@@ -64,7 +64,7 @@ EXPORT_SYMBOL(remove_wait_queue);
  * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
  * the list and any non-exclusive tasks will be woken first. A priority task
  * may be at the head of the list, and can consume the event without any other
- * tasks being woken.
+ * tasks being woken if it's also an exclusive task.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 04877b297267..c7969904637a 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -316,6 +316,7 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
 
 	spin_release(&kvm->irqfds.lock.dep_map, _RET_IP_);
+	irqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &irqfd->wait);
 	spin_acquire(&kvm->irqfds.lock.dep_map, 0, 0, _RET_IP_);
 
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 07/13] xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (5 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 06/13] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters Sean Christopherson
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Don't set WQ_FLAG_EXCLUSIVE when adding an irqfd to a wait queue, as
irqfd_wakeup() unconditionally returns '0', i.e. doesn't actually operate
in exclusive mode.

Note, the use of WQ_FLAG_PRIORITY is also dubious, but that's a problem
for another day.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 drivers/xen/privcmd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index c08ec8a7d27c..13a10f3294a8 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -957,7 +957,6 @@ irqfd_poll_func(struct file *file, wait_queue_head_t *wqh, poll_table *pt)
 	struct privcmd_kernel_irqfd *kirqfd =
 		container_of(pt, struct privcmd_kernel_irqfd, pt);
 
-	kirqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
 	add_wait_queue_priority(wqh, &kirqfd->wait);
 }
 
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (6 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 07/13] xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-30  8:45   ` K Prateek Nayak
  2025-05-22 23:52 ` [PATCH v3 09/13] KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter Sean Christopherson
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Add a waitqueue helper to add a priority waiter that requires exclusive
wakeups, i.e. that requires that it be the _only_ priority waiter.  The
API will be used by KVM to ensure that at most one of KVM's irqfds is
bound to a single eventfd (across the entire kernel).

Open code the helper instead of using __add_wait_queue() so that the
common path doesn't need to "handle" impossible failures.

Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/wait.h |  2 ++
 kernel/sched/wait.c  | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 965a19809c7e..09855d819418 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -164,6 +164,8 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern int add_wait_queue_priority_exclusive(struct wait_queue_head *wq_head,
+					     struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 4ab3ab195277..15632c89c4f2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -47,6 +47,24 @@ void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_
 }
 EXPORT_SYMBOL_GPL(add_wait_queue_priority);
 
+int add_wait_queue_priority_exclusive(struct wait_queue_head *wq_head,
+				      struct wait_queue_entry *wq_entry)
+{
+	struct list_head *head = &wq_head->head;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+
+	guard(spinlock_irqsave)(&wq_head->lock);
+
+	if (!list_empty(head) &&
+	    (list_first_entry(head, typeof(*wq_entry), entry)->flags & WQ_FLAG_PRIORITY))
+		return -EBUSY;
+
+	list_add(&wq_entry->entry, head);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority_exclusive);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 09/13] KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (7 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 10/13] KVM: Drop sanity check that per-VM list of irqfds is unique Sean Christopherson
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Disallow binding an irqfd to an eventfd that already has a priority waiter,
i.e. to an eventfd that already has an attached irqfd.  KVM always
operates in exclusive mode for EPOLL_IN (unconditionally returns '1'),
i.e. only the first waiter will be notified.

KVM already disallows binding multiple irqfds to an eventfd in a single
VM, but doesn't guard against multiple VMs binding to an eventfd.  Adding
the extra protection reduces the pain of a userspace VMM bug, e.g. if
userspace fails to de-assign before re-assigning when transferring state
for intra-host migration, then the migration will explicitly fail as
opposed to dropping IRQs on the destination VM.

Temporarily keep KVM's manual check on irqfds.items, but add a WARN, e.g.
to allow sanity checking the waitqueue enforcement.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 55 +++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c7969904637a..7b2e1f858f6d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -291,38 +291,57 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 	struct kvm_kernel_irqfd *tmp;
 	struct kvm *kvm = p->kvm;
 
+	/*
+	 * Note, irqfds.lock protects the irqfd's irq_entry, i.e. its routing,
+	 * and irqfds.items.  It does NOT protect registering with the eventfd.
+	 */
 	spin_lock_irq(&kvm->irqfds.lock);
 
-	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
-		if (irqfd->eventfd != tmp->eventfd)
-			continue;
-		/* This fd is used for another irq already. */
-		p->ret = -EBUSY;
-		spin_unlock_irq(&kvm->irqfds.lock);
-		return;
-	}
-
+	/*
+	 * Initialize the routing information prior to adding the irqfd to the
+	 * eventfd's waitqueue, as irqfd_wakeup() can be invoked as soon as the
+	 * irqfd is registered.
+	 */
 	irqfd_update(kvm, irqfd);
 
-	list_add_tail(&irqfd->list, &kvm->irqfds.items);
-
 	/*
 	 * Add the irqfd as a priority waiter on the eventfd, with a custom
 	 * wake-up handler, so that KVM *and only KVM* is notified whenever the
-	 * underlying eventfd is signaled.  Temporarily lie to lockdep about
-	 * holding irqfds.lock to avoid a false positive regarding potential
-	 * deadlock with irqfd_wakeup() (see irqfd_wakeup() for details).
+	 * underlying eventfd is signaled.
 	 */
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
 
+	/*
+	 * Temporarily lie to lockdep about holding irqfds.lock to avoid a
+	 * false positive regarding potential deadlock with irqfd_wakeup()
+	 * (see irqfd_wakeup() for details).
+	 *
+	 * Adding to the wait queue will fail if there is already a priority
+	 * waiter, i.e. if the eventfd is associated with another irqfd (in any
+	 * VM).  Note, kvm_irqfd_deassign() waits for all in-flight shutdown
+	 * jobs to complete, i.e. ensures the irqfd has been removed from the
+	 * eventfd's waitqueue before returning to userspace.
+	 */
 	spin_release(&kvm->irqfds.lock.dep_map, _RET_IP_);
-	irqfd->wait.flags |= WQ_FLAG_EXCLUSIVE;
-	add_wait_queue_priority(wqh, &irqfd->wait);
+	p->ret = add_wait_queue_priority_exclusive(wqh, &irqfd->wait);
 	spin_acquire(&kvm->irqfds.lock.dep_map, 0, 0, _RET_IP_);
+	if (p->ret)
+		goto out;
 
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
+		if (irqfd->eventfd != tmp->eventfd)
+			continue;
+
+		WARN_ON_ONCE(1);
+		/* This fd is used for another irq already. */
+		p->ret = -EBUSY;
+		goto out;
+	}
+
+	list_add_tail(&irqfd->list, &kvm->irqfds.items);
+
+out:
 	spin_unlock_irq(&kvm->irqfds.lock);
-
-	p->ret = 0;
 }
 
 #if IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS)
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 10/13] KVM: Drop sanity check that per-VM list of irqfds is unique
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (8 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 09/13] KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 11/13] KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test Sean Christopherson
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Now that the eventfd's waitqueue ensures it has at most one priority
waiter, i.e. prevents KVM from binding multiple irqfds to one eventfd,
drop KVM's sanity check that eventfds are unique for a single VM.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/eventfd.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7b2e1f858f6d..d5258fd16033 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -288,7 +288,6 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_irqfd_pt *p = container_of(pt, struct kvm_irqfd_pt, pt);
 	struct kvm_kernel_irqfd *irqfd = p->irqfd;
-	struct kvm_kernel_irqfd *tmp;
 	struct kvm *kvm = p->kvm;
 
 	/*
@@ -328,16 +327,6 @@ static void kvm_irqfd_register(struct file *file, wait_queue_head_t *wqh,
 	if (p->ret)
 		goto out;
 
-	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
-		if (irqfd->eventfd != tmp->eventfd)
-			continue;
-
-		WARN_ON_ONCE(1);
-		/* This fd is used for another irq already. */
-		p->ret = -EBUSY;
-		goto out;
-	}
-
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 
 out:
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 11/13] KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (9 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 10/13] KVM: Drop sanity check that per-VM list of irqfds is unique Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 12/13] KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD Sean Christopherson
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Assert that eventfd() succeeds in the Xen shinfo test instead of skipping
the associated testcase.  While eventfd() is outside the scope of KVM, KVM
unconditionally selects EVENTFD, i.e. the syscall should always succeed.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/x86/xen_shinfo_test.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
index 287829f850f7..34d180cf4eed 100644
--- a/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
+++ b/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
@@ -548,14 +548,11 @@ int main(int argc, char *argv[])
 
 	if (do_eventfd_tests) {
 		irq_fd[0] = eventfd(0, 0);
+		TEST_ASSERT(irq_fd[0] >= 0, __KVM_SYSCALL_ERROR("eventfd()", irq_fd[0]));
+
 		irq_fd[1] = eventfd(0, 0);
+		TEST_ASSERT(irq_fd[1] >= 0, __KVM_SYSCALL_ERROR("eventfd()", irq_fd[1]));
 
-		/* Unexpected, but not a KVM failure */
-		if (irq_fd[0] == -1 || irq_fd[1] == -1)
-			do_evtchn_tests = do_eventfd_tests = false;
-	}
-
-	if (do_eventfd_tests) {
 		irq_routes.info.nr = 2;
 
 		irq_routes.entries[0].gsi = 32;
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 12/13] KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (10 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 11/13] KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-22 23:52 ` [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements Sean Christopherson
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Add helpers to create eventfds and to (de)assign eventfds via KVM_IRQFD.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/arm64/vgic_irq.c  | 12 ++----
 .../testing/selftests/kvm/include/kvm_util.h  | 40 +++++++++++++++++++
 .../selftests/kvm/x86/xen_shinfo_test.c       | 18 ++-------
 3 files changed, 47 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/kvm/arm64/vgic_irq.c b/tools/testing/selftests/kvm/arm64/vgic_irq.c
index f4ac28d53747..a09dd423c2d7 100644
--- a/tools/testing/selftests/kvm/arm64/vgic_irq.c
+++ b/tools/testing/selftests/kvm/arm64/vgic_irq.c
@@ -620,18 +620,12 @@ static void kvm_routing_and_irqfd_check(struct kvm_vm *vm,
 	 * that no actual interrupt was injected for those cases.
 	 */
 
-	for (f = 0, i = intid; i < (uint64_t)intid + num; i++, f++) {
-		fd[f] = eventfd(0, 0);
-		TEST_ASSERT(fd[f] != -1, __KVM_SYSCALL_ERROR("eventfd()", fd[f]));
-	}
+	for (f = 0, i = intid; i < (uint64_t)intid + num; i++, f++)
+		fd[f] = kvm_new_eventfd();
 
 	for (f = 0, i = intid; i < (uint64_t)intid + num; i++, f++) {
-		struct kvm_irqfd irqfd = {
-			.fd  = fd[f],
-			.gsi = i - MIN_SPI,
-		};
 		assert(i <= (uint64_t)UINT_MAX);
-		vm_ioctl(vm, KVM_IRQFD, &irqfd);
+		kvm_assign_irqfd(vm, i - MIN_SPI, fd[f]);
 	}
 
 	for (f = 0, i = intid; i < (uint64_t)intid + num; i++, f++) {
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 373912464fb4..4f7bf8f000bb 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -18,6 +18,7 @@
 #include <asm/atomic.h>
 #include <asm/kvm.h>
 
+#include <sys/eventfd.h>
 #include <sys/ioctl.h>
 
 #include "kvm_util_arch.h"
@@ -496,6 +497,45 @@ static inline int vm_get_stats_fd(struct kvm_vm *vm)
 	return fd;
 }
 
+static inline int __kvm_irqfd(struct kvm_vm *vm, uint32_t gsi, int eventfd,
+			      uint32_t flags)
+{
+	struct kvm_irqfd irqfd = {
+		.fd = eventfd,
+		.gsi = gsi,
+		.flags = flags,
+		.resamplefd = -1,
+	};
+
+	return __vm_ioctl(vm, KVM_IRQFD, &irqfd);
+}
+
+static inline void kvm_irqfd(struct kvm_vm *vm, uint32_t gsi, int eventfd,
+			      uint32_t flags)
+{
+	int ret = __kvm_irqfd(vm, gsi, eventfd, flags);
+
+	TEST_ASSERT_VM_VCPU_IOCTL(!ret, KVM_IRQFD, ret, vm);
+}
+
+static inline void kvm_assign_irqfd(struct kvm_vm *vm, uint32_t gsi, int eventfd)
+{
+	kvm_irqfd(vm, gsi, eventfd, 0);
+}
+
+static inline void kvm_deassign_irqfd(struct kvm_vm *vm, uint32_t gsi, int eventfd)
+{
+	kvm_irqfd(vm, gsi, eventfd, KVM_IRQFD_FLAG_DEASSIGN);
+}
+
+static inline int kvm_new_eventfd(void)
+{
+	int fd = eventfd(0, 0);
+
+	TEST_ASSERT(fd >= 0, __KVM_SYSCALL_ERROR("eventfd()", fd));
+	return fd;
+}
+
 static inline void read_stats_header(int stats_fd, struct kvm_stats_header *header)
 {
 	ssize_t ret;
diff --git a/tools/testing/selftests/kvm/x86/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
index 34d180cf4eed..23909b501ac2 100644
--- a/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
+++ b/tools/testing/selftests/kvm/x86/xen_shinfo_test.c
@@ -547,11 +547,8 @@ int main(int argc, char *argv[])
 	int irq_fd[2] = { -1, -1 };
 
 	if (do_eventfd_tests) {
-		irq_fd[0] = eventfd(0, 0);
-		TEST_ASSERT(irq_fd[0] >= 0, __KVM_SYSCALL_ERROR("eventfd()", irq_fd[0]));
-
-		irq_fd[1] = eventfd(0, 0);
-		TEST_ASSERT(irq_fd[1] >= 0, __KVM_SYSCALL_ERROR("eventfd()", irq_fd[1]));
+		irq_fd[0] = kvm_new_eventfd();
+		irq_fd[1] = kvm_new_eventfd();
 
 		irq_routes.info.nr = 2;
 
@@ -569,15 +566,8 @@ int main(int argc, char *argv[])
 
 		vm_ioctl(vm, KVM_SET_GSI_ROUTING, &irq_routes.info);
 
-		struct kvm_irqfd ifd = { };
-
-		ifd.fd = irq_fd[0];
-		ifd.gsi = 32;
-		vm_ioctl(vm, KVM_IRQFD, &ifd);
-
-		ifd.fd = irq_fd[1];
-		ifd.gsi = 33;
-		vm_ioctl(vm, KVM_IRQFD, &ifd);
+		kvm_assign_irqfd(vm, 32, irq_fd[0]);
+		kvm_assign_irqfd(vm, 33, irq_fd[1]);
 
 		struct sigaction sa = { };
 		sa.sa_handler = handle_alrm;
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (11 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 12/13] KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD Sean Christopherson
@ 2025-05-22 23:52 ` Sean Christopherson
  2025-05-23  7:23   ` Sairaj Kodilkar
  2025-05-23 11:14 ` [PATCH v3 00/13] KVM: Make irqfd registration globally unique Peter Zijlstra
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-22 23:52 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, Sean Christopherson
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

Add a selftest to verify that eventfd+irqfd bindings are globally unique,
i.e. that KVM doesn't allow multiple irqfds to bind to a single eventfd,
even across VMs.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm |   1 +
 tools/testing/selftests/kvm/irqfd_test.c | 130 +++++++++++++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/irqfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index f62b0a5aba35..318adf3ef6b6 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -54,6 +54,7 @@ TEST_PROGS_x86 += x86/nx_huge_pages_test.sh
 TEST_GEN_PROGS_COMMON = demand_paging_test
 TEST_GEN_PROGS_COMMON += dirty_log_test
 TEST_GEN_PROGS_COMMON += guest_print_test
+TEST_GEN_PROGS_COMMON += irqfd_test
 TEST_GEN_PROGS_COMMON += kvm_binary_stats_test
 TEST_GEN_PROGS_COMMON += kvm_create_max_vcpus
 TEST_GEN_PROGS_COMMON += kvm_page_table_test
diff --git a/tools/testing/selftests/kvm/irqfd_test.c b/tools/testing/selftests/kvm/irqfd_test.c
new file mode 100644
index 000000000000..286f2b15fde6
--- /dev/null
+++ b/tools/testing/selftests/kvm/irqfd_test.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <errno.h>
+#include <pthread.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <stdint.h>
+#include <sys/sysinfo.h>
+
+#include "kvm_util.h"
+
+static struct kvm_vm *vm1;
+static struct kvm_vm *vm2;
+static int __eventfd;
+static bool done;
+
+/*
+ * KVM de-assigns based on eventfd *and* GSI, but requires unique eventfds when
+ * assigning (the API isn't symmetrical).  Abuse the oddity and use a per-task
+ * GSI base to avoid false failures due to cross-task de-assign, i.e. so that
+ * the secondary doesn't de-assign the primary's eventfd and cause assign to
+ * unexpectedly succeed on the primary.
+ */
+#define GSI_BASE_PRIMARY	0x20
+#define GSI_BASE_SECONDARY	0x30
+
+static void juggle_eventfd_secondary(struct kvm_vm *vm, int eventfd)
+{
+	int r, i;
+
+	/*
+	 * The secondary task can encounter EBADF since the primary can close
+	 * the eventfd at any time.  And because the primary can recreate the
+	 * eventfd, at the safe fd in the file table, the secondary can also
+	 * encounter "unexpected" success, e.g. if the close+recreate happens
+	 * between the first and second assignments.  The secondary's role is
+	 * mostly to antagonize KVM, not to detect bugs.
+	 */
+	for (i = 0; i < 2; i++) {
+		r = __kvm_irqfd(vm, GSI_BASE_SECONDARY, eventfd, 0);
+		TEST_ASSERT(!r || errno == EBUSY || errno == EBADF,
+			    "Wanted success, EBUSY, or EBADF, r = %d, errno = %d",
+			    r, errno);
+
+		/* De-assign should succeed unless the eventfd was closed. */
+		r = __kvm_irqfd(vm, GSI_BASE_SECONDARY + i, eventfd, KVM_IRQFD_FLAG_DEASSIGN);
+		TEST_ASSERT(!r || errno == EBADF,
+			    "De-assign should succeed unless the fd was closed");
+	}
+}
+
+static void *secondary_irqfd_juggler(void *ign)
+{
+	while (!READ_ONCE(done)) {
+		juggle_eventfd_secondary(vm1, READ_ONCE(__eventfd));
+		juggle_eventfd_secondary(vm2, READ_ONCE(__eventfd));
+	}
+
+	return NULL;
+}
+
+static void juggle_eventfd_primary(struct kvm_vm *vm, int eventfd)
+{
+	int r1, r2;
+
+	/*
+	 * At least one of the assigns should fail.  KVM disallows assigning a
+	 * single eventfd to multiple GSIs (or VMs), so it's possible that both
+	 * assignments can fail, too.
+	 */
+	r1 = __kvm_irqfd(vm, GSI_BASE_PRIMARY, eventfd, 0);
+	TEST_ASSERT(!r1 || errno == EBUSY,
+		    "Wanted success or EBUSY, r = %d, errno = %d", r1, errno);
+
+	r2 = __kvm_irqfd(vm, GSI_BASE_PRIMARY + 1, eventfd, 0);
+	TEST_ASSERT(r1 || (r2 && errno == EBUSY),
+		    "Wanted failure (EBUSY), r1 = %d, r2 = %d, errno = %d",
+		    r1, r2, errno);
+
+	/*
+	 * De-assign should always succeed, even if the corresponding assign
+	 * failed.
+	 */
+	kvm_irqfd(vm, GSI_BASE_PRIMARY, eventfd, KVM_IRQFD_FLAG_DEASSIGN);
+	kvm_irqfd(vm, GSI_BASE_PRIMARY + 1, eventfd, KVM_IRQFD_FLAG_DEASSIGN);
+}
+
+int main(int argc, char *argv[])
+{
+	pthread_t racing_thread;
+	int r, i;
+
+	/* Create "full" VMs, as KVM_IRQFD requires an in-kernel IRQ chip. */
+	vm1 = vm_create(1);
+	vm2 = vm_create(1);
+
+	WRITE_ONCE(__eventfd, kvm_new_eventfd());
+
+	kvm_irqfd(vm1, 10, __eventfd, 0);
+
+	r = __kvm_irqfd(vm1, 11, __eventfd, 0);
+	TEST_ASSERT(r && errno == EBUSY,
+		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
+
+	r = __kvm_irqfd(vm2, 12, __eventfd, 0);
+	TEST_ASSERT(r && errno == EBUSY,
+		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
+
+	kvm_irqfd(vm1, 11, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
+	kvm_irqfd(vm1, 12, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
+	kvm_irqfd(vm1, 13, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
+	kvm_irqfd(vm1, 14, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
+	kvm_irqfd(vm1, 10, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
+
+	close(__eventfd);
+
+	pthread_create(&racing_thread, NULL, secondary_irqfd_juggler, vm2);
+
+	for (i = 0; i < 10000; i++) {
+		WRITE_ONCE(__eventfd, kvm_new_eventfd());
+
+		juggle_eventfd_primary(vm1, __eventfd);
+		juggle_eventfd_primary(vm2, __eventfd);
+		close(__eventfd);
+	}
+
+	WRITE_ONCE(done, true);
+	pthread_join(racing_thread, NULL);
+}
-- 
2.49.0.1151.ga128411c76-goog


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
  2025-05-22 23:52 ` [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements Sean Christopherson
@ 2025-05-23  7:23   ` Sairaj Kodilkar
  2025-05-23 14:33     ` Sean Christopherson
  0 siblings, 1 reply; 21+ messages in thread
From: Sairaj Kodilkar @ 2025-05-23  7:23 UTC (permalink / raw)
  To: Sean Christopherson, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Juergen Gross, Stefano Stabellini, Paolo Bonzini,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Shuah Khan, Marc Zyngier, Oliver Upton
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

On 5/23/2025 5:22 AM, Sean Christopherson wrote:

> +
> +int main(int argc, char *argv[])
> +{
> +	pthread_t racing_thread;
> +	int r, i;
> +
> +	/* Create "full" VMs, as KVM_IRQFD requires an in-kernel IRQ chip. */
> +	vm1 = vm_create(1);
> +	vm2 = vm_create(1);
> +
> +	WRITE_ONCE(__eventfd, kvm_new_eventfd());
> +
> +	kvm_irqfd(vm1, 10, __eventfd, 0);
> +
> +	r = __kvm_irqfd(vm1, 11, __eventfd, 0);
> +	TEST_ASSERT(r && errno == EBUSY,
> +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
> +
> +	r = __kvm_irqfd(vm2, 12, __eventfd, 0);
> +	TEST_ASSERT(r && errno == EBUSY,
> +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
> +
> +	kvm_irqfd(vm1, 11, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> +	kvm_irqfd(vm1, 12, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> +	kvm_irqfd(vm1, 13, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> +	kvm_irqfd(vm1, 14, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);

Hi Sean,
I dont see any allocation for the GSI 13 and 14..
Is there any reason for the deassigning these two GSIs ?

Regards
Sairaj Kodilkar

> +	kvm_irqfd(vm1, 10, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> +
> +	close(__eventfd);
> +
> +	pthread_create(&racing_thread, NULL, secondary_irqfd_juggler, vm2);
> +
> +	for (i = 0; i < 10000; i++) {
> +		WRITE_ONCE(__eventfd, kvm_new_eventfd());
> +
> +		juggle_eventfd_primary(vm1, __eventfd);
> +		juggle_eventfd_primary(vm2, __eventfd);
> +		close(__eventfd);
> +	}
> +
> +	WRITE_ONCE(done, true);
> +	pthread_join(racing_thread, NULL);
> +}


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/13] KVM: Make irqfd registration globally unique
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (12 preceding siblings ...)
  2025-05-22 23:52 ` [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements Sean Christopherson
@ 2025-05-23 11:14 ` Peter Zijlstra
  2025-05-30  8:49 ` K Prateek Nayak
  2025-06-24 19:38 ` Sean Christopherson
  15 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-23 11:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Shuah Khan, Marc Zyngier,
	Oliver Upton, linux-kernel, linux-hyperv, xen-devel, kvm,
	linux-kselftest, linux-arm-kernel, kvmarm, K Prateek Nayak,
	David Matlack

On Thu, May 22, 2025 at 04:52:10PM -0700, Sean Christopherson wrote:
>   sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
>   sched/wait: Add a waitqueue helper for fully exclusive priority
>     waiters

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
  2025-05-23  7:23   ` Sairaj Kodilkar
@ 2025-05-23 14:33     ` Sean Christopherson
  2025-05-26  3:36       ` Sairaj Kodilkar
  0 siblings, 1 reply; 21+ messages in thread
From: Sean Christopherson @ 2025-05-23 14:33 UTC (permalink / raw)
  To: Sairaj Kodilkar
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, linux-kernel, linux-hyperv, xen-devel,
	kvm, linux-kselftest, linux-arm-kernel, kvmarm, K Prateek Nayak,
	David Matlack

On Fri, May 23, 2025, Sairaj Kodilkar wrote:
> On 5/23/2025 5:22 AM, Sean Christopherson wrote:
> 
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +	pthread_t racing_thread;
> > +	int r, i;
> > +
> > +	/* Create "full" VMs, as KVM_IRQFD requires an in-kernel IRQ chip. */
> > +	vm1 = vm_create(1);
> > +	vm2 = vm_create(1);
> > +
> > +	WRITE_ONCE(__eventfd, kvm_new_eventfd());
> > +
> > +	kvm_irqfd(vm1, 10, __eventfd, 0);
> > +
> > +	r = __kvm_irqfd(vm1, 11, __eventfd, 0);
> > +	TEST_ASSERT(r && errno == EBUSY,
> > +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
> > +
> > +	r = __kvm_irqfd(vm2, 12, __eventfd, 0);
> > +	TEST_ASSERT(r && errno == EBUSY,
> > +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
> > +
> > +	kvm_irqfd(vm1, 11, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> > +	kvm_irqfd(vm1, 12, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> > +	kvm_irqfd(vm1, 13, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> > +	kvm_irqfd(vm1, 14, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
> 
> Hi Sean,
> I dont see any allocation for the GSI 13 and 14..
> Is there any reason for the deassigning these two GSIs ?

Yes, KVM's rather bizarre ABI is that DEASSIGN is allowed even if the VM doesn't
have a corresponding assigned irqfd.  The reason I added these early DEASSIGN
calls is so that there will be an easier-to-debug failure if KVM's behavior
changes (the racing threads part of the test abuses KVM's ABI).  I didn't add a
comment because the helpers already have comments, but looking at this again, I
agree that main() needs a better comment.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
  2025-05-23 14:33     ` Sean Christopherson
@ 2025-05-26  3:36       ` Sairaj Kodilkar
  0 siblings, 0 replies; 21+ messages in thread
From: Sairaj Kodilkar @ 2025-05-26  3:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Juergen Gross, Stefano Stabellini, Paolo Bonzini, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Shuah Khan,
	Marc Zyngier, Oliver Upton, linux-kernel, linux-hyperv, xen-devel,
	kvm, linux-kselftest, linux-arm-kernel, kvmarm, K Prateek Nayak,
	David Matlack



On 5/23/2025 8:03 PM, Sean Christopherson wrote:
> On Fri, May 23, 2025, Sairaj Kodilkar wrote:
>> On 5/23/2025 5:22 AM, Sean Christopherson wrote:
>>
>>> +
>>> +int main(int argc, char *argv[])
>>> +{
>>> +	pthread_t racing_thread;
>>> +	int r, i;
>>> +
>>> +	/* Create "full" VMs, as KVM_IRQFD requires an in-kernel IRQ chip. */
>>> +	vm1 = vm_create(1);
>>> +	vm2 = vm_create(1);
>>> +
>>> +	WRITE_ONCE(__eventfd, kvm_new_eventfd());
>>> +
>>> +	kvm_irqfd(vm1, 10, __eventfd, 0);
>>> +
>>> +	r = __kvm_irqfd(vm1, 11, __eventfd, 0);
>>> +	TEST_ASSERT(r && errno == EBUSY,
>>> +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
>>> +
>>> +	r = __kvm_irqfd(vm2, 12, __eventfd, 0);
>>> +	TEST_ASSERT(r && errno == EBUSY,
>>> +		    "Wanted EBUSY, r = %d, errno = %d", r, errno);
>>> +
>>> +	kvm_irqfd(vm1, 11, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
>>> +	kvm_irqfd(vm1, 12, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
>>> +	kvm_irqfd(vm1, 13, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
>>> +	kvm_irqfd(vm1, 14, READ_ONCE(__eventfd), KVM_IRQFD_FLAG_DEASSIGN);
>>
>> Hi Sean,
>> I dont see any allocation for the GSI 13 and 14..
>> Is there any reason for the deassigning these two GSIs ?
> 
> Yes, KVM's rather bizarre ABI is that DEASSIGN is allowed even if the VM doesn't
> have a corresponding assigned irqfd.  The reason I added these early DEASSIGN
> calls is so that there will be an easier-to-debug failure if KVM's behavior
> changes (the racing threads part of the test abuses KVM's ABI).  I didn't add a
> comment because the helpers already have comments, but looking at this again, I
> agree that main() needs a better comment.

Makes sense, thanks for the explanation.

Thanks
Sairaj Kodilkar


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters
  2025-05-22 23:52 ` [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters Sean Christopherson
@ 2025-05-30  8:45   ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-05-30  8:45 UTC (permalink / raw)
  To: Sean Christopherson, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Juergen Gross, Stefano Stabellini, Paolo Bonzini,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Shuah Khan, Marc Zyngier, Oliver Upton
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, David Matlack

Hello Sean,

On 5/23/2025 5:22 AM, Sean Christopherson wrote:
> Add a waitqueue helper to add a priority waiter that requires exclusive
> wakeups, i.e. that requires that it be the _only_ priority waiter.  The
> API will be used by KVM to ensure that at most one of KVM's irqfds is
> bound to a single eventfd (across the entire kernel).
> 
> Open code the helper instead of using __add_wait_queue() so that the
> common path doesn't need to "handle" impossible failures.
> 
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>

Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek

> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   include/linux/wait.h |  2 ++
>   kernel/sched/wait.c  | 18 ++++++++++++++++++
>   2 files changed, 20 insertions(+)
> [..snip..]


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/13] KVM: Make irqfd registration globally unique
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (13 preceding siblings ...)
  2025-05-23 11:14 ` [PATCH v3 00/13] KVM: Make irqfd registration globally unique Peter Zijlstra
@ 2025-05-30  8:49 ` K Prateek Nayak
  2025-06-24 19:38 ` Sean Christopherson
  15 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-05-30  8:49 UTC (permalink / raw)
  To: Sean Christopherson, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Juergen Gross, Stefano Stabellini, Paolo Bonzini,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Shuah Khan, Marc Zyngier, Oliver Upton
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, David Matlack

Hello Sean,

On 5/23/2025 5:22 AM, Sean Christopherson wrote:
> Non-KVM folks,
> 
> I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM
> changes should be glorified nops.  Please holler if you object to that idea.

I've tested this series with the selftests and also ran KVM unit test
on top of the specified base and didn't see anything unexpected. Feel
free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/13] KVM: Make irqfd registration globally unique
  2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
                   ` (14 preceding siblings ...)
  2025-05-30  8:49 ` K Prateek Nayak
@ 2025-06-24 19:38 ` Sean Christopherson
  15 siblings, 0 replies; 21+ messages in thread
From: Sean Christopherson @ 2025-06-24 19:38 UTC (permalink / raw)
  To: Sean Christopherson, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Juergen Gross, Stefano Stabellini, Paolo Bonzini,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Shuah Khan, Marc Zyngier, Oliver Upton
  Cc: linux-kernel, linux-hyperv, xen-devel, kvm, linux-kselftest,
	linux-arm-kernel, kvmarm, K Prateek Nayak, David Matlack

On Thu, 22 May 2025 16:52:10 -0700, Sean Christopherson wrote:
> Non-KVM folks,
> 
> I am hoping to route this through the KVM tree (6.17 or later), as the non-KVM
> changes should be glorified nops.  Please holler if you object to that idea.
> 
> Hyper-V folks in particular, let me know if you want a stable topic branch/tag,
> e.g. on the off chance you want to make similar changes to the Hyper-V code,
> and I'll make sure that happens.
> 
> [...]

Applied to kvm-x86 irqs, thanks!

[01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd
        https://github.com/kvm-x86/linux/commit/283ed5001d68
[02/13] KVM: Acquire SCRU lock outside of irqfds.lock during assignment
        https://github.com/kvm-x86/linux/commit/140768a7bf03
[03/13] KVM: Initialize irqfd waitqueue callback when adding to the queue
        https://github.com/kvm-x86/linux/commit/b5c543518ae9
[04/13] KVM: Add irqfd to KVM's list via the vfs_poll() callback
        https://github.com/kvm-x86/linux/commit/5f8ca05ea991
[05/13] KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock
        https://github.com/kvm-x86/linux/commit/86e00cd162a7
[06/13] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()
        https://github.com/kvm-x86/linux/commit/867347bb21e1
[07/13] xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE
        https://github.com/kvm-x86/linux/commit/a52664134a24
[08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters
        https://github.com/kvm-x86/linux/commit/0d09582b3a60
[09/13] KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter
        https://github.com/kvm-x86/linux/commit/2cdd64cbf990
[10/13] KVM: Drop sanity check that per-VM list of irqfds is unique
        https://github.com/kvm-x86/linux/commit/b599d44a71f1
[11/13] KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test
        https://github.com/kvm-x86/linux/commit/033b76bc7f06
[12/13] KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD
        https://github.com/kvm-x86/linux/commit/74e5e3fb0dd7
[13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements
        https://github.com/kvm-x86/linux/commit/7e9b231c402a

--
https://github.com/kvm-x86/kvm-unit-tests/tree/next

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-06-24 19:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 23:52 [PATCH v3 00/13] KVM: Make irqfd registration globally unique Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 01/13] KVM: Use a local struct to do the initial vfs_poll() on an irqfd Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 02/13] KVM: Acquire SCRU lock outside of irqfds.lock during assignment Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 03/13] KVM: Initialize irqfd waitqueue callback when adding to the queue Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 04/13] KVM: Add irqfd to KVM's list via the vfs_poll() callback Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 05/13] KVM: Add irqfd to eventfd's waitqueue while holding irqfds.lock Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 06/13] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 07/13] xen: privcmd: Don't mark eventfd waiter as EXCLUSIVE Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 08/13] sched/wait: Add a waitqueue helper for fully exclusive priority waiters Sean Christopherson
2025-05-30  8:45   ` K Prateek Nayak
2025-05-22 23:52 ` [PATCH v3 09/13] KVM: Disallow binding multiple irqfds to an eventfd with a priority waiter Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 10/13] KVM: Drop sanity check that per-VM list of irqfds is unique Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 11/13] KVM: selftests: Assert that eventfd() succeeds in Xen shinfo test Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 12/13] KVM: selftests: Add utilities to create eventfds and do KVM_IRQFD Sean Christopherson
2025-05-22 23:52 ` [PATCH v3 13/13] KVM: selftests: Add a KVM_IRQFD test to verify uniqueness requirements Sean Christopherson
2025-05-23  7:23   ` Sairaj Kodilkar
2025-05-23 14:33     ` Sean Christopherson
2025-05-26  3:36       ` Sairaj Kodilkar
2025-05-23 11:14 ` [PATCH v3 00/13] KVM: Make irqfd registration globally unique Peter Zijlstra
2025-05-30  8:49 ` K Prateek Nayak
2025-06-24 19:38 ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).