* [KVM PATCH v6 0/4] irqfd fixes and enhancements
@ 2009-06-29 15:43 Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
` (4 more replies)
0 siblings, 5 replies; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:43 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
(Applies to kvm.git/master:4631e094)
The following is the latest attempt to fix the races in irqfd/eventfd, as
well as restore DEASSIGN support. For more details, please read the patch
headers.
You can also find this applied as a git tree:
git pull git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git kvm/irqfd
For reviewing convenience, here is a link to the entire virt/kvm/eventfd.c
file after the patches are applied:
http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=blob;f=virt/kvm/eventfd.c;h=409d9e160f1f85618a5e3772937b2721a249399a;hb=85cfd57e33dcaea29971513334ca003764653b21
As always, this series has been tested against the kvm-eventfd unit test, and
appears to be functioning properly. You can download this test here:
ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
that it's a complete, reviewable series. Note, however, that there may be
later versions of his patch to consider for merging, so we should
coordinate with him.
-Greg
[Changelog:
v6:
*) Removed slow-work in favor of using a dedicated single-thread
workqueue.
*) Condensed cleanup path to always use deferred shutdown
*) Saved about 56 lines over v5, with the following diffstat:
include/linux/kvm_host.h | 2
virt/kvm/eventfd.c | 248 ++++++++++++++++++-----------------------------
2 files changed, 97 insertions(+), 153 deletions(-)
v5:
Untracked.
]
---
Davide Libenzi (1):
eventfd - revised interface and cleanups (4th rev)
Gregory Haskins (3):
KVM: add irqfd DEASSIGN feature
KVM: Fix races in irqfd using new eventfd_kref_get interface
kvm: prepare irqfd for having interrupts disabled during eventfd->release
drivers/lguest/lg.h | 2
drivers/lguest/lguest_user.c | 4 -
fs/aio.c | 24 +---
fs/eventfd.c | 126 ++++++++++++++++++++---
include/linux/aio.h | 4 -
include/linux/eventfd.h | 35 +++++-
include/linux/kvm.h | 2
include/linux/kvm_host.h | 5 +
virt/kvm/Kconfig | 1
virt/kvm/eventfd.c | 229 +++++++++++++++++++++++++++++++-----------
10 files changed, 321 insertions(+), 111 deletions(-)
--
Signature
^ permalink raw reply [flat|nested] 11+ messages in thread
* [KVM PATCH v6 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
@ 2009-06-29 15:44 ` Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:44 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
We need to plug some race conditions on eventfd shutdown. In order to
do this, we need to change the context in which the release notification
is delivered so that the wqh lock is now held. However, there is currently
code in the release callback that assumes it can sleep.
We have a slight chicken-and-egg problem: we can't fix the race
without adding the lock, and we can't add the lock without breaking
the sleepy code. Normally we could deal with this by making both
changes in an atomic changeset. However, we want to keep the eventfd
and kvm specific changes isolated to ease the reviewer burden on upstream
eventfd (at the specific request of upstream). Therefore, we have this
intermediate patch.
This intermediate patch allows the release() method to work in an atomic
context, at the expense of correctness w.r.t. memory-leaks. Today we have
a race condition. With this patch applied we leak. Both issues will be
resolved later in the series. It is the author's opinion that a leak is
better for bisectability than the hang we would get if we left the sleepy
code in place after the locking changeover.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
virt/kvm/eventfd.c | 89 ++++++++++++++++------------------------------------
1 files changed, 27 insertions(+), 62 deletions(-)
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a9e7de7..9656027 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,7 +28,6 @@
#include <linux/file.h>
#include <linux/list.h>
#include <linux/eventfd.h>
-#include <linux/srcu.h>
/*
* --------------------------------------------------------------------
@@ -38,8 +37,6 @@
* --------------------------------------------------------------------
*/
struct _irqfd {
- struct mutex lock;
- struct srcu_struct srcu;
struct kvm *kvm;
int gsi;
struct list_head list;
@@ -53,48 +50,12 @@ static void
irqfd_inject(struct work_struct *work)
{
struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
- struct kvm *kvm;
- int idx;
+ struct kvm *kvm = irqfd->kvm;
- idx = srcu_read_lock(&irqfd->srcu);
-
- kvm = rcu_dereference(irqfd->kvm);
- if (kvm) {
- mutex_lock(&kvm->irq_lock);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
- kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
- mutex_unlock(&kvm->irq_lock);
- }
-
- srcu_read_unlock(&irqfd->srcu, idx);
-}
-
-static void
-irqfd_disconnect(struct _irqfd *irqfd)
-{
- struct kvm *kvm;
-
- mutex_lock(&irqfd->lock);
-
- kvm = rcu_dereference(irqfd->kvm);
- rcu_assign_pointer(irqfd->kvm, NULL);
-
- mutex_unlock(&irqfd->lock);
-
- if (!kvm)
- return;
-
- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
-
- /*
- * It is important to not drop the kvm reference until the next grace
- * period because there might be lockless references in flight up
- * until then
- */
- synchronize_srcu(&irqfd->srcu);
- kvm_put_kvm(kvm);
+ mutex_lock(&kvm->irq_lock);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+ kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+ mutex_unlock(&kvm->irq_lock);
}
static int
@@ -103,26 +64,24 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
unsigned long flags = (unsigned long)key;
+ /*
+ * Assume we will be called with interrupts disabled
+ */
if (flags & POLLIN)
/*
- * The POLLIN wake_up is called with interrupts disabled.
- * Therefore we need to defer the IRQ injection until later
- * since we need to acquire the kvm->lock to do so.
+ * Defer the IRQ injection until later since we need to
+ * acquire the kvm->lock to do so.
*/
schedule_work(&irqfd->inject);
if (flags & POLLHUP) {
/*
- * The POLLHUP is called unlocked, so it theoretically should
- * be safe to remove ourselves from the wqh using the locked
- * variant of remove_wait_queue()
+ * for now, just remove ourselves from the list and let
+ * the rest dangle. We will fix this up later once
+ * the races in eventfd are fixed
*/
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
- flush_work(&irqfd->inject);
- irqfd_disconnect(irqfd);
-
- cleanup_srcu_struct(&irqfd->srcu);
- kfree(irqfd);
+ __remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ irqfd->wqh = NULL;
}
return 0;
@@ -150,8 +109,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
if (!irqfd)
return -ENOMEM;
- mutex_init(&irqfd->lock);
- init_srcu_struct(&irqfd->srcu);
irqfd->kvm = kvm;
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
@@ -172,8 +129,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
events = file->f_op->poll(file, &irqfd->pt);
- kvm_get_kvm(kvm);
-
mutex_lock(&kvm->lock);
list_add_tail(&irqfd->list, &kvm->irqfds);
mutex_unlock(&kvm->lock);
@@ -211,6 +166,16 @@ kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;
- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
- irqfd_disconnect(irqfd);
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
+ if (irqfd->wqh)
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+ flush_work(&irqfd->inject);
+
+ mutex_lock(&kvm->lock);
+ list_del(&irqfd->list);
+ mutex_unlock(&kvm->lock);
+
+ kfree(irqfd);
+ }
}
* [KVM PATCH v6 2/4] eventfd - revised interface and cleanups (4th rev)
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
@ 2009-06-29 15:44 ` Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:44 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
From: Davide Libenzi <davidel@xmailserver.org>
The following patch changes the eventfd interface to de-couple the eventfd
memory context, from the file pointer instance.
Without such a change, there is no race-free way to handle the
POLLHUP event sent when the last instance of the file* goes away.
Also, now the internal eventfd APIs are using the eventfd context instead
of the file*.
Gregory, none of the APIs changed, so your patches do not need to be
rebased over this one. The last three revisions just updated comments.
Andrew, I'm not posting the "AIO select EVENTFD" patch at this time. Will
eventually try another push in your skull once I come back from vacation ;)
[gmh: resolved merge conflict against prior POLLHUP patch]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
drivers/lguest/lg.h | 2 -
drivers/lguest/lguest_user.c | 4 +
fs/aio.c | 24 ++------
fs/eventfd.c | 126 ++++++++++++++++++++++++++++++++++++------
include/linux/aio.h | 4 +
include/linux/eventfd.h | 35 +++++++++---
6 files changed, 147 insertions(+), 48 deletions(-)
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index d4e8979..9c31382 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -82,7 +82,7 @@ struct lg_cpu {
struct lg_eventfd {
unsigned long addr;
- struct file *event;
+ struct eventfd_ctx *event;
};
struct lg_eventfd_map {
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
index 32e2971..9f9a295 100644
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -50,7 +50,7 @@ static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
/* Now append new entry. */
new->map[new->num].addr = addr;
- new->map[new->num].event = eventfd_fget(fd);
+ new->map[new->num].event = eventfd_ctx_fdget(fd);
if (IS_ERR(new->map[new->num].event)) {
kfree(new);
return PTR_ERR(new->map[new->num].event);
@@ -357,7 +357,7 @@ static int close(struct inode *inode, struct file *file)
/* Release any eventfds they registered. */
for (i = 0; i < lg->eventfds->num; i++)
- fput(lg->eventfds->map[i].event);
+ eventfd_ctx_put(lg->eventfds->map[i].event);
kfree(lg->eventfds);
/* If lg->dead doesn't contain an error code it will be NULL or a
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..d065b2c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -485,6 +485,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
{
assert_spin_locked(&ctx->ctx_lock);
+ if (req->ki_eventfd != NULL)
+ eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
if (req->ki_iovec != &req->ki_inline_vec)
@@ -509,8 +511,6 @@ static void aio_fput_routine(struct work_struct *data)
/* Complete the fput(s) */
if (req->ki_filp != NULL)
__fput(req->ki_filp);
- if (req->ki_eventfd != NULL)
- __fput(req->ki_eventfd);
/* Link the iocb into the context's free list */
spin_lock_irq(&ctx->ctx_lock);
@@ -528,8 +528,6 @@ static void aio_fput_routine(struct work_struct *data)
*/
static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
{
- int schedule_putreq = 0;
-
dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
req, atomic_long_read(&req->ki_filp->f_count));
@@ -549,24 +547,16 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
* we would not be holding the last reference to the file*, so
* this function will be executed w/out any aio kthread wakeup.
*/
- if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
- schedule_putreq++;
- else
- req->ki_filp = NULL;
- if (req->ki_eventfd != NULL) {
- if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
- schedule_putreq++;
- else
- req->ki_eventfd = NULL;
- }
- if (unlikely(schedule_putreq)) {
+ if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
get_ioctx(ctx);
spin_lock(&fput_lock);
list_add(&req->ki_list, &fput_head);
spin_unlock(&fput_lock);
queue_work(aio_wq, &fput_work);
- } else
+ } else {
+ req->ki_filp = NULL;
really_put_req(ctx, req);
+ }
return 1;
}
@@ -1622,7 +1612,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
* an eventfd() fd, and will be signaled for each completed
* event using the eventfd_signal() function.
*/
- req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+ req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
if (IS_ERR(req->ki_eventfd)) {
ret = PTR_ERR(req->ki_eventfd);
req->ki_eventfd = NULL;
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..31d12de 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -14,35 +14,44 @@
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/anon_inodes.h>
-#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>
+#include <linux/eventfd.h>
struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
* value of the __u64 being written is added to "count" and a
* wakeup is performed on "wqh". A read(2) will return the "count"
* value to userspace, and will reset "count" to zero. The kernel
- * size eventfd_signal() also, adds to the "count" counter and
+ * side eventfd_signal() also adds to the "count" counter and
* issue a wakeup.
*/
__u64 count;
unsigned int flags;
};
-/*
- * Adds "n" to the eventfd counter "count". Returns "n" in case of
- * success, or a value lower then "n" in case of coutner overflow.
- * This function is supposed to be called by the kernel in paths
- * that do not allow sleeping. In this function we allow the counter
- * to reach the ULLONG_MAX value, and we signal this as overflow
- * condition by returining a POLLERR to poll(2).
+/**
+ * eventfd_signal - Adds @n to the eventfd counter.
+ * @ctx: [in] Pointer to the eventfd context.
+ * @n: [in] Value of the counter to be added to the eventfd internal counter.
+ * The value cannot be negative.
+ *
+ * This function is supposed to be called by the kernel in paths that do not
+ * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
+ * value, and we signal this as an overflow condition by returning a POLLERR
+ * to poll(2).
+ *
+ * Returns @n in case of success, a non-negative number lower than @n in case
+ * of overflow, or the following error codes:
+ *
+ * -EINVAL : The value of @n is negative.
*/
-int eventfd_signal(struct file *file, int n)
+int eventfd_signal(struct eventfd_ctx *ctx, int n)
{
- struct eventfd_ctx *ctx = file->private_data;
unsigned long flags;
if (n < 0)
@@ -59,17 +68,45 @@ int eventfd_signal(struct file *file, int n)
}
EXPORT_SYMBOL_GPL(eventfd_signal);
+static void eventfd_free(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to the eventfd context.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
+{
+ kref_get(&ctx->kref);
+ return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);
+
+/**
+ * eventfd_ctx_put - Releases a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * The eventfd context reference must have been previously acquired either
+ * with eventfd_ctx_get() or eventfd_ctx_fdget().
+ */
+void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+ kref_put(&ctx->kref, eventfd_free);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_put);
+
static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;
- /*
- * No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
- */
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
- kfree(ctx);
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ eventfd_ctx_put(ctx);
return 0;
}
@@ -193,6 +230,16 @@ static const struct file_operations eventfd_fops = {
.write = eventfd_write,
};
+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the eventfd file structure in case of success, or the
+ * following error pointer:
+ *
+ * -EBADF : Invalid @fd file descriptor.
+ * -EINVAL : The @fd file descriptor is not an eventfd file.
+ */
struct file *eventfd_fget(int fd)
{
struct file *file;
@@ -209,6 +256,48 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);
+/**
+ * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointers returned by the following functions:
+ *
+ * eventfd_fget
+ */
+struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+ struct file *file;
+ struct eventfd_ctx *ctx;
+
+ file = eventfd_fget(fd);
+ if (IS_ERR(file))
+ return (struct eventfd_ctx *) file;
+ ctx = eventfd_ctx_get(file->private_data);
+ fput(file);
+
+ return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
+
+/**
+ * eventfd_ctx_fileget - Acquires a reference to the internal eventfd context.
+ * @file: [in] Eventfd file pointer.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointer:
+ *
+ * -EINVAL : The @fd file descriptor is not an eventfd file.
+ */
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
+{
+ if (file->f_op != &eventfd_fops)
+ return ERR_PTR(-EINVAL);
+
+ return eventfd_ctx_get(file->private_data);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
@@ -225,6 +314,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
if (!ctx)
return -ENOMEM;
+ kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b16a957..47f7d93 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -121,9 +121,9 @@ struct kiocb {
/*
* If the aio_resfd field of the userspace iocb is not zero,
- * this is the underlying file* to deliver event to.
+ * this is the underlying eventfd context to deliver events to.
*/
- struct file *ki_eventfd;
+ struct eventfd_ctx *ki_eventfd;
};
#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..3b85ba6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,10 +8,8 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H
-#ifdef CONFIG_EVENTFD
-
-/* For O_CLOEXEC and O_NONBLOCK */
#include <linux/fcntl.h>
+#include <linux/file.h>
/*
* CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -27,16 +25,37 @@
#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#ifdef CONFIG_EVENTFD
+
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx);
+void eventfd_ctx_put(struct eventfd_ctx *ctx);
struct file *eventfd_fget(int fd);
-int eventfd_signal(struct file *file, int n);
+struct eventfd_ctx *eventfd_ctx_fdget(int fd);
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
+int eventfd_signal(struct eventfd_ctx *ctx, int n);
#else /* CONFIG_EVENTFD */
-#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
-static inline int eventfd_signal(struct file *file, int n)
-{ return 0; }
+/*
+ * Ugly ugly ugly error layer to support modules that use eventfd but
+ * pretend to work in !CONFIG_EVENTFD configurations. Namely, AIO.
+ */
+static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+ return ERR_PTR(-ENOSYS);
+}
+
+static inline int eventfd_signal(struct eventfd_ctx *ctx, int n)
+{
+ return -ENOSYS;
+}
+
+static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+
+}
-#endif /* CONFIG_EVENTFD */
+#endif
#endif /* _LINUX_EVENTFD_H */
* [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
@ 2009-06-29 15:44 ` Gregory Haskins
2009-06-29 16:11 ` Michael S. Tsirkin
2009-06-29 15:44 ` [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
2009-06-29 15:46 ` [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
4 siblings, 1 reply; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:44 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
until recently it was not possible to use this feature of eventfd in a
race-free way. This patch utilizes a new eventfd interface to rectify
the problem.
Note that one final race is known to exist: the slow-work thread may race
with module removal. We are currently working with slow-work upstream
to fix this issue as well. Since the code prior to this patch also
races with module_put(), we are not making anything worse, but rather
shifting the cause of the race. Once the slow-work code is patched we
will be fixing the last remaining issue.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---
include/linux/kvm_host.h | 5 +
virt/kvm/Kconfig | 1
virt/kvm/eventfd.c | 154 ++++++++++++++++++++++++++++++++++++++--------
3 files changed, 131 insertions(+), 29 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2451f48..5a90c6a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -141,7 +141,10 @@ struct kvm {
struct kvm_io_bus mmio_bus;
struct kvm_io_bus pio_bus;
#ifdef CONFIG_HAVE_KVM_EVENTFD
- struct list_head irqfds;
+ struct {
+ spinlock_t lock;
+ struct list_head items;
+ } irqfds;
#endif
struct kvm_vm_stat stat;
struct kvm_arch arch;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index daece36..ab7848a 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
config HAVE_KVM_EVENTFD
bool
select EVENTFD
+ select SLOW_WORK
config KVM_APIC_ARCHITECTURE
bool
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9656027..76ad125 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,16 +36,22 @@
* Credit goes to Avi Kivity for the original idea.
* --------------------------------------------------------------------
*/
+
struct _irqfd {
struct kvm *kvm;
+ struct eventfd_ctx *eventfd;
int gsi;
struct list_head list;
poll_table pt;
wait_queue_head_t *wqh;
wait_queue_t wait;
struct work_struct inject;
+ struct work_struct shutdown;
+ int active:1;
};
+static struct workqueue_struct *irqfd_cleanup_wq;
+
static void
irqfd_inject(struct work_struct *work)
{
@@ -58,32 +64,81 @@ irqfd_inject(struct work_struct *work)
mutex_unlock(&kvm->irq_lock);
}
+/*
+ * Race-free decouple logic (ordering is critical)
+ */
+static void
+irqfd_shutdown(struct work_struct *work)
+{
+ struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
+
+ /*
+ * Synchronize with the wait-queue and unhook ourselves to prevent
+ * further events.
+ */
+ remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+ /*
+ * We know no new events will be scheduled at this point, so block
+ * until all previously outstanding events have completed
+ */
+ flush_work(&irqfd->inject);
+
+ /*
+ * It is now safe to release the object's resources
+ */
+ eventfd_ctx_put(irqfd->eventfd);
+ kfree(irqfd);
+}
+
+/*
+ * Mark the irqfd as inactive and schedule it for removal
+ *
+ * assumes kvm->irqfds.lock is held
+ */
+static void
+irqfd_deactivate(struct _irqfd *irqfd)
+{
+ BUG_ON(!irqfd->active);
+
+ irqfd->active = false;
+ list_del(&irqfd->list);
+
+ queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
+}
+
+/*
+ * Called with wqh->lock held and interrupts disabled
+ */
static int
irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
unsigned long flags = (unsigned long)key;
- /*
- * Assume we will be called with interrupts disabled
- */
if (flags & POLLIN)
- /*
- * Defer the IRQ injection until later since we need to
- * acquire the kvm->lock to do so.
- */
+ /* An event has been signaled, inject an interrupt */
schedule_work(&irqfd->inject);
if (flags & POLLHUP) {
- /*
- * for now, just remove ourselves from the list and let
- * the rest dangle. We will fix this up later once
- * the races in eventfd are fixed
- */
- __remove_wait_queue(irqfd->wqh, &irqfd->wait);
- irqfd->wqh = NULL;
+ /* The eventfd is closing, detach from KVM */
+ struct kvm *kvm = irqfd->kvm;
+ unsigned long flags;
+
+ spin_lock_irqsave(&kvm->irqfds.lock, flags);
+
+ if (irqfd->active)
+ /*
+ * If we get here, we can be sure no-one else is
+ * trying to shutdown this object at the same time
+ * since we hold the list lock.
+ */
+ irqfd_deactivate(irqfd);
+
+ spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
}
+
return 0;
}
@@ -102,6 +157,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
{
struct _irqfd *irqfd;
struct file *file = NULL;
+ struct eventfd_ctx *eventfd = NULL;
int ret;
unsigned int events;
@@ -113,6 +169,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
irqfd->gsi = gsi;
INIT_LIST_HEAD(&irqfd->list);
INIT_WORK(&irqfd->inject, irqfd_inject);
+ INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
+ irqfd->active = true;
file = eventfd_fget(fd);
if (IS_ERR(file)) {
@@ -120,6 +178,14 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
goto fail;
}
+ eventfd = eventfd_ctx_fileget(file);
+ if (IS_ERR(eventfd)) {
+ ret = PTR_ERR(eventfd);
+ goto fail;
+ }
+
+ irqfd->eventfd = eventfd;
+
/*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
@@ -129,12 +195,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
events = file->f_op->poll(file, &irqfd->pt);
- mutex_lock(&kvm->lock);
- list_add_tail(&irqfd->list, &kvm->irqfds);
- mutex_unlock(&kvm->lock);
+ spin_lock_irq(&kvm->irqfds.lock);
+ list_add_tail(&irqfd->list, &kvm->irqfds.items);
+ spin_unlock_irq(&kvm->irqfds.lock);
/*
- * Check if there was an event already queued
+ * Check if there was an event already pending on the eventfd
+ * before we registered, and trigger it as if we didn't miss it.
*/
if (events & POLLIN)
schedule_work(&irqfd->inject);
@@ -148,6 +215,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
return 0;
fail:
+ if (eventfd && !IS_ERR(eventfd))
+ eventfd_ctx_put(eventfd);
+
if (file && !IS_ERR(file))
fput(file);
@@ -158,24 +228,52 @@ fail:
void
kvm_irqfd_init(struct kvm *kvm)
{
- INIT_LIST_HEAD(&kvm->irqfds);
+ spin_lock_init(&kvm->irqfds.lock);
+ INIT_LIST_HEAD(&kvm->irqfds.items);
}
+/*
+ * This function is called as the kvm VM fd is being released. Shutdown all
+ * irqfds that still remain open
+ */
void
kvm_irqfd_release(struct kvm *kvm)
{
struct _irqfd *irqfd, *tmp;
- list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
- if (irqfd->wqh)
- remove_wait_queue(irqfd->wqh, &irqfd->wait);
+ spin_lock_irq(&kvm->irqfds.lock);
- flush_work(&irqfd->inject);
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
+ irqfd_deactivate(irqfd);
- mutex_lock(&kvm->lock);
- list_del(&irqfd->list);
- mutex_unlock(&kvm->lock);
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ /*
+ * Block until we know all outstanding shutdown jobs have completed
+ * since we do not take a kvm* reference.
+ */
+ flush_workqueue(irqfd_cleanup_wq);
- kfree(irqfd);
- }
}
+
+/*
+ * create a host-wide workqueue for issuing deferred shutdown requests
+ * aggregated from all vm* instances. We need our own isolated single-thread
+ * queue to prevent deadlock against flushing the normal work-queue.
+ */
+static int __init irqfd_module_init(void)
+{
+ irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
+ if (!irqfd_cleanup_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void __exit irqfd_module_exit(void)
+{
+ destroy_workqueue(irqfd_cleanup_wq);
+}
+
+module_init(irqfd_module_init);
+module_exit(irqfd_module_exit);
* [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
` (2 preceding siblings ...)
2009-06-29 15:44 ` [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
@ 2009-06-29 15:44 ` Gregory Haskins
2009-06-29 16:16 ` Michael S. Tsirkin
2009-06-29 15:46 ` [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
4 siblings, 1 reply; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:44 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
eventfd without destroying the eventfd in the process. This is useful
for conditions like live-migration which may have an eventfd associated
with a device and an IRQFD. We need to be able to decouple the guest
from the event source while not perturbing the event source itself.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Michael S. Tsirkin <mst@redhat.com>
---
include/linux/kvm.h | 2 ++
virt/kvm/eventfd.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 38ff31e..6710518 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -490,6 +490,8 @@ struct kvm_x86_mce {
};
#endif
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
struct kvm_irqfd {
__u32 fd;
__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 76ad125..409d9e1 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -152,8 +152,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
add_wait_queue(wqh, &irqfd->wait);
}
-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static int
+kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
struct _irqfd *irqfd;
struct file *file = NULL;
@@ -233,6 +233,48 @@ kvm_irqfd_init(struct kvm *kvm)
}
/*
* shutdown any irqfds that match fd+gsi
+ */
+static int
+kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+{
+ struct _irqfd *irqfd, *tmp;
+ struct eventfd_ctx *eventfd;
+
+ eventfd = eventfd_ctx_fdget(fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
+ if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
+ irqfd_deactivate(irqfd);
+ }
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+ eventfd_ctx_put(eventfd);
+
+ /*
+ * Block until we know all outstanding shutdown jobs have completed
+ * so that we guarantee there will not be any more interrupts on this
+ * gsi once this deassign function returns.
+ */
+ flush_workqueue(irqfd_cleanup_wq);
+
+ return 0;
+}
+
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+{
+ if (flags & KVM_IRQFD_FLAG_DEASSIGN)
+ return kvm_irqfd_deassign(kvm, fd, gsi);
+
+ return kvm_irqfd_assign(kvm, fd, gsi);
+}
+
+/*
* This function is called as the kvm VM fd is being released. Shutdown all
* irqfds that still remain open
*/
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [KVM PATCH v6 0/4] irqfd fixes and enhancements
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
` (3 preceding siblings ...)
2009-06-29 15:44 ` [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
@ 2009-06-29 15:46 ` Gregory Haskins
4 siblings, 0 replies; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 15:46 UTC (permalink / raw)
To: kvm; +Cc: linux-kernel, mst, avi, davidel
Gregory Haskins wrote:
> (Applies to kvm.git/master:4631e094)
>
> The following is the latest attempt to fix the races in irqfd/eventfd, as
> well as restore DEASSIGN support. For more details, please read the patch
> headers.
>
> You can also find this applied as a git tree:
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git kvm/irqfd
>
> For reviewing convenience, here is a link to the entire virt/kvm/eventfd.c
> file after the patches are applied:
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/linux-2.6-hacks.git;a=blob;f=virt/kvm/eventfd.c;h=409d9e160f1f85618a5e3772937b2721a249399a;hb=85cfd57e33dcaea29971513334ca003764653b21
>
> As always, this series has been tested against the kvm-eventfd unit test, and
> appears to be functioning properly. You can download this test here:
>
> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>
> I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
> that its a complete reviewable series. Note, however, that there may be
> later versions of his patch to consider for merging, so we should
> coordinate with him.
>
> -Greg
>
>
> [Changelog:
>
> v6:
> *) Removed slow-work in favor of using a dedicated single-thread
> workqueue.
> *) Condensed cleanup path to always use deferred shutdown
> *) Saved about 56 lines over v5, with the following diffstat:
>
> include/linux/kvm_host.h | 2
> virt/kvm/eventfd.c | 248 ++++++++++++++++++-----------------------------
> 2 files changed, 97 insertions(+), 153 deletions(-)
>
Forgot another change:
*) Fixed a race in ASSIGN by taking the irqfd->eventfd reference in
the proper acquisition order
> v5:
> Untracked..
> ]
>
> ---
>
> Davide Libenzi (1):
> eventfd - revised interface and cleanups (4th rev)
>
> Gregory Haskins (3):
> KVM: add irqfd DEASSIGN feature
> KVM: Fix races in irqfd using new eventfd_kref_get interface
> kvm: prepare irqfd for having interrupts disabled during eventfd->release
>
>
> drivers/lguest/lg.h | 2
> drivers/lguest/lguest_user.c | 4 -
> fs/aio.c | 24 +---
> fs/eventfd.c | 126 ++++++++++++++++++++---
> include/linux/aio.h | 4 -
> include/linux/eventfd.h | 35 +++++-
> include/linux/kvm.h | 2
> include/linux/kvm_host.h | 5 +
> virt/kvm/Kconfig | 1
> virt/kvm/eventfd.c | 229 +++++++++++++++++++++++++++++++-----------
> 10 files changed, 321 insertions(+), 111 deletions(-)
>
>
* Re: [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
2009-06-29 15:44 ` [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
@ 2009-06-29 16:11 ` Michael S. Tsirkin
2009-06-29 16:52 ` Gregory Haskins
0 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2009-06-29 16:11 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davidel
Looks good. A couple of minor nits:
On Mon, Jun 29, 2009 at 11:44:15AM -0400, Gregory Haskins wrote:
> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> "release" callback. This lets eventfd clients know if the eventfd is about
> to go away and is very useful particularly for in-kernel clients. However,
> until recently it was not possible to use this feature of eventfd in a
> race-free way. This patch utilizes a new eventfd interface to rectify
> the problem.
>
> Note that one final race is known to exist: the slow-work thread may race
> with module removal. We are currently working with slow-work upstream
> to fix this issue as well. Since the code prior to this patch also
> races with module_put(), we are not making anything worse, but rather
> shifting the cause of the race. Once the slow-work code is patched we
> will be fixing the last remaining issue.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
> include/linux/kvm_host.h | 5 +
> virt/kvm/Kconfig | 1
> virt/kvm/eventfd.c | 154 ++++++++++++++++++++++++++++++++++++++--------
> 3 files changed, 131 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2451f48..5a90c6a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -141,7 +141,10 @@ struct kvm {
> struct kvm_io_bus mmio_bus;
> struct kvm_io_bus pio_bus;
> #ifdef CONFIG_HAVE_KVM_EVENTFD
> - struct list_head irqfds;
> + struct {
> + spinlock_t lock;
> + struct list_head items;
> + } irqfds;
> #endif
> struct kvm_vm_stat stat;
> struct kvm_arch arch;
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index daece36..ab7848a 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
> config HAVE_KVM_EVENTFD
> bool
> select EVENTFD
> + select SLOW_WORK
not needed in this version?
> config KVM_APIC_ARCHITECTURE
> bool
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 9656027..76ad125 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -36,16 +36,22 @@
> * Credit goes to Avi Kivity for the original idea.
> * --------------------------------------------------------------------
> */
> +
> struct _irqfd {
> struct kvm *kvm;
> + struct eventfd_ctx *eventfd;
> int gsi;
> struct list_head list;
> poll_table pt;
> wait_queue_head_t *wqh;
> wait_queue_t wait;
> struct work_struct inject;
> + struct work_struct shutdown;
> + int active:1;
I think we need a comment here that active bit is protected by irqfds
lock. One idea I had to make it even clearer was to have a shutdown list
of irqfds per-kvm, together with the items list, and make work_struct for
shutdown global, not per-irqfd. We can then unconditionally do
list_move + schedule_work to shut down an irqfd, and it's safe to do
even if it is already on the shutdown list - it just gets moved to tail.
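The property Michael's proposal relies on, that `list_move` onto a shutdown list is safe to issue even for an entry already on that list, can be sketched with a small userspace analogue of the kernel's `list.h`. The names mirror the kernel API but this is only an illustrative model, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, modeled on the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

/* Unlink the entry from wherever it is and re-link it at the tail.
 * Calling this when the entry is already on 'h' merely repositions it,
 * so a duplicate shutdown request is harmless. */
static void list_move_tail(struct list_head *e, struct list_head *h)
{
	list_del(e);
	list_add_tail(e, h);
}

static int list_len(struct list_head *h)
{
	int n = 0;
	for (struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

With this model, moving an irqfd from the items list to a shutdown list twice still leaves it on the shutdown list exactly once, which is why the unconditional `list_move` + `schedule_work` sequence would be race-tolerant.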
> };
>
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> static void
> irqfd_inject(struct work_struct *work)
> {
> @@ -58,32 +64,81 @@ irqfd_inject(struct work_struct *work)
> mutex_unlock(&kvm->irq_lock);
> }
>
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void
> +irqfd_shutdown(struct work_struct *work)
> +{
> + struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
> +
> + /*
> + * Synchronize with the wait-queue and unhook ourselves to prevent
> + * further events.
> + */
> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> + /*
> + * We know no new events will be scheduled at this point, so block
> + * until all previously outstanding events have completed
> + */
> + flush_work(&irqfd->inject);
> +
> + /*
> + * It is now safe to release the object's resources
> + */
> + eventfd_ctx_put(irqfd->eventfd);
> + kfree(irqfd);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes kvm->irqfds.lock is held
> + */
> +static void
> +irqfd_deactivate(struct _irqfd *irqfd)
> +{
> + BUG_ON(!irqfd->active);
> +
> + irqfd->active = false;
> + list_del(&irqfd->list);
> +
> + queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> static int
> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> {
> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> unsigned long flags = (unsigned long)key;
>
> - /*
> - * Assume we will be called with interrupts disabled
> - */
> if (flags & POLLIN)
> - /*
> - * Defer the IRQ injection until later since we need to
> - * acquire the kvm->lock to do so.
> - */
> + /* An event has been signaled, inject an interrupt */
> schedule_work(&irqfd->inject);
>
> if (flags & POLLHUP) {
> - /*
> - * for now, just remove ourselves from the list and let
> - * the rest dangle. We will fix this up later once
> - * the races in eventfd are fixed
> - */
> - __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> - irqfd->wqh = NULL;
> + /* The eventfd is closing, detach from KVM */
> + struct kvm *kvm = irqfd->kvm;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
> +
> + if (irqfd->active)
> + /*
> + * If we get here, we can be sure no-one else is
> + * trying to shutdown this object at the same time
> + * since we hold the list lock.
> + */
> + irqfd_deactivate(irqfd);
> +
> + spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
> }
>
> +
extra empty line
> return 0;
> }
>
> @@ -102,6 +157,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> {
> struct _irqfd *irqfd;
> struct file *file = NULL;
> + struct eventfd_ctx *eventfd = NULL;
> int ret;
> unsigned int events;
>
> @@ -113,6 +169,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> irqfd->gsi = gsi;
> INIT_LIST_HEAD(&irqfd->list);
> INIT_WORK(&irqfd->inject, irqfd_inject);
> + INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> + irqfd->active = true;
>
> file = eventfd_fget(fd);
> if (IS_ERR(file)) {
> @@ -120,6 +178,14 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> goto fail;
> }
>
> + eventfd = eventfd_ctx_fileget(file);
> + if (IS_ERR(file)) {
> + ret = PTR_ERR(file);
> + goto fail;
> + }
> +
> + irqfd->eventfd = eventfd;
> +
> /*
> * Install our own custom wake-up handling so we are notified via
> * a callback whenever someone signals the underlying eventfd
> @@ -129,12 +195,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>
> events = file->f_op->poll(file, &irqfd->pt);
>
> - mutex_lock(&kvm->lock);
> - list_add_tail(&irqfd->list, &kvm->irqfds);
> - mutex_unlock(&kvm->lock);
> + spin_lock_irq(&kvm->irqfds.lock);
> + list_add_tail(&irqfd->list, &kvm->irqfds.items);
> + spin_unlock_irq(&kvm->irqfds.lock);
>
> /*
> - * Check if there was an event already queued
> + * Check if there was an event already pending on the eventfd
> + * before we registered, and trigger it as if we didn't miss it.
> */
> if (events & POLLIN)
> schedule_work(&irqfd->inject);
> @@ -148,6 +215,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> return 0;
>
> fail:
> + if (eventfd && !IS_ERR(eventfd))
> + eventfd_ctx_put(eventfd);
> +
> if (file && !IS_ERR(file))
> fput(file);
>
> @@ -158,24 +228,52 @@ fail:
> void
> kvm_irqfd_init(struct kvm *kvm)
> {
> - INIT_LIST_HEAD(&kvm->irqfds);
> + spin_lock_init(&kvm->irqfds.lock);
> + INIT_LIST_HEAD(&kvm->irqfds.items);
> }
>
> +/*
> + * This function is called as the kvm VM fd is being released. Shutdown all
> + * irqfds that still remain open
> + */
> void
> kvm_irqfd_release(struct kvm *kvm)
> {
> struct _irqfd *irqfd, *tmp;
>
> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> - if (irqfd->wqh)
> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> + spin_lock_irq(&kvm->irqfds.lock);
>
> - flush_work(&irqfd->inject);
> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
> + irqfd_deactivate(irqfd);
>
> - mutex_lock(&kvm->lock);
> - list_del(&irqfd->list);
> - mutex_unlock(&kvm->lock);
> + spin_unlock_irq(&kvm->irqfds.lock);
> +
> + /*
> + * Block until we know all outstanding shutdown jobs have completed
> + * since we do not take a kvm* reference.
> + */
> + flush_workqueue(irqfd_cleanup_wq);
>
> - kfree(irqfd);
> - }
> }
> +
> +/*
> + * create a host-wide workqueue for issuing deferred shutdown requests
> + * aggregated from all vm* instances. We need our own isolated single-thread
> + * queue to prevent deadlock against flushing the normal work-queue.
> + */
> +static int __init irqfd_module_init(void)
> +{
> + irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> + if (!irqfd_cleanup_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +static void __exit irqfd_module_exit(void)
> +{
> + destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +module_init(irqfd_module_init);
> +module_exit(irqfd_module_exit);
* Re: [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature
2009-06-29 15:44 ` [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
@ 2009-06-29 16:16 ` Michael S. Tsirkin
0 siblings, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2009-06-29 16:16 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davidel
On Mon, Jun 29, 2009 at 11:44:20AM -0400, Gregory Haskins wrote:
> DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
> eventfd without destroying the eventfd in the process. This is useful
> for conditions like live-migration which may have an eventfd associated
> with a device and an IRQFD. We need to be able to decouple the guest
> from the event source while not perturbing the event source itself.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> CC: Michael S. Tsirkin <mst@redhat.com>
Very happy to see this feature back. ACK.
> ---
>
> include/linux/kvm.h | 2 ++
> virt/kvm/eventfd.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 38ff31e..6710518 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -490,6 +490,8 @@ struct kvm_x86_mce {
> };
> #endif
>
> +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +
> struct kvm_irqfd {
> __u32 fd;
> __u32 gsi;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 76ad125..409d9e1 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -152,8 +152,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> add_wait_queue(wqh, &irqfd->wait);
> }
>
> -int
> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +static int
> +kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> struct _irqfd *irqfd;
> struct file *file = NULL;
> @@ -233,6 +233,48 @@ kvm_irqfd_init(struct kvm *kvm)
> }
>
> /*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int
> +kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> +{
> + struct _irqfd *irqfd, *tmp;
> + struct eventfd_ctx *eventfd;
> +
> + eventfd = eventfd_ctx_fdget(fd);
> + if (IS_ERR(eventfd))
> + return PTR_ERR(eventfd);
> +
> + spin_lock_irq(&kvm->irqfds.lock);
> +
> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> + if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
> + irqfd_deactivate(irqfd);
> + }
> +
> + spin_unlock_irq(&kvm->irqfds.lock);
> + eventfd_ctx_put(eventfd);
> +
> + /*
> + * Block until we know all outstanding shutdown jobs have completed
> + * so that we guarantee there will not be any more interrupts on this
> + * gsi once this deassign function returns.
> + */
> + flush_workqueue(irqfd_cleanup_wq);
> +
> + return 0;
> +}
> +
> +int
> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +{
> + if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> + return kvm_irqfd_deassign(kvm, fd, gsi);
> +
> + return kvm_irqfd_assign(kvm, fd, gsi);
> +}
> +
> +/*
> * This function is called as the kvm VM fd is being released. Shutdown all
> * irqfds that still remain open
> */
* Re: [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
2009-06-29 16:11 ` Michael S. Tsirkin
@ 2009-06-29 16:52 ` Gregory Haskins
2009-06-29 17:32 ` Michael S. Tsirkin
2009-07-01 9:44 ` Avi Kivity
0 siblings, 2 replies; 11+ messages in thread
From: Gregory Haskins @ 2009-06-29 16:52 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, davidel
Michael S. Tsirkin wrote:
> Looks good. A couple of minor nits:
>
> On Mon, Jun 29, 2009 at 11:44:15AM -0400, Gregory Haskins wrote:
>
>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>> "release" callback. This lets eventfd clients know if the eventfd is about
>> to go away and is very useful particularly for in-kernel clients. However,
>> until recently it was not possible to use this feature of eventfd in a
>> race-free way. This patch utilizes a new eventfd interface to rectify
>> the problem.
>>
>> Note that one final race is known to exist: the slow-work thread may race
>> with module removal. We are currently working with slow-work upstream
>> to fix this issue as well. Since the code prior to this patch also
>> races with module_put(), we are not making anything worse, but rather
>> shifting the cause of the race. Once the slow-work code is patched we
>> will be fixing the last remaining issue.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> ---
>>
>> include/linux/kvm_host.h | 5 +
>> virt/kvm/Kconfig | 1
>> virt/kvm/eventfd.c | 154 ++++++++++++++++++++++++++++++++++++++--------
>> 3 files changed, 131 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 2451f48..5a90c6a 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -141,7 +141,10 @@ struct kvm {
>> struct kvm_io_bus mmio_bus;
>> struct kvm_io_bus pio_bus;
>> #ifdef CONFIG_HAVE_KVM_EVENTFD
>> - struct list_head irqfds;
>> + struct {
>> + spinlock_t lock;
>> + struct list_head items;
>> + } irqfds;
>> #endif
>> struct kvm_vm_stat stat;
>> struct kvm_arch arch;
>> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
>> index daece36..ab7848a 100644
>> --- a/virt/kvm/Kconfig
>> +++ b/virt/kvm/Kconfig
>> @@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
>> config HAVE_KVM_EVENTFD
>> bool
>> select EVENTFD
>> + select SLOW_WORK
>>
>
> not needed in this version?
>
Good catch. Will remove.
>
>> config KVM_APIC_ARCHITECTURE
>> bool
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 9656027..76ad125 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -36,16 +36,22 @@
>> * Credit goes to Avi Kivity for the original idea.
>> * --------------------------------------------------------------------
>> */
>> +
>> struct _irqfd {
>> struct kvm *kvm;
>> + struct eventfd_ctx *eventfd;
>> int gsi;
>> struct list_head list;
>> poll_table pt;
>> wait_queue_head_t *wqh;
>> wait_queue_t wait;
>> struct work_struct inject;
>> + struct work_struct shutdown;
>> + int active:1;
>>
>
> I think we need a comment here that active bit is protected by irqfds
> lock.
Ack
> One idea I had to make it even clearer was to have a shutdown list
> of irqfds per-kvm, together with the items list, and make work_struct for
> shutdown global, not per-irqfd. We can then unconditionally do
> list_move + schedule_work to shut down an irqfd, and it's safe to do
> even if it is already on the shutdown list - it just gets moved to tail.
>
Hmm... I'm not sure that churn really buys us anything, though. Technically
the "active" bit is redundant with the list_del_init()+list_empty() approach I
employed in previous versions. However, I made it explicit with the
active bit to be more self-documenting. IMO, the latest code is pretty
clear, and the change you are proposing moves towards a slightly
trickier variant like I originally had. I'd say "let's leave this as is".
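The list_del_init()+list_empty() idiom Greg refers to can be sketched in userspace: after `list_del_init()` an entry is self-linked, so `list_empty()` on the entry itself answers "has this irqfd already been deactivated?", which is what makes the explicit `active` flag technically redundant. This is an illustrative model of the kernel's `list.h` behavior, not the actual irqfd code:

```c
#include <assert.h>

/* Minimal model of the kernel list primitives involved in the idiom. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* list_empty() on an entry (not just a list head) is true iff the entry
 * is self-linked, i.e. not currently on any list. */
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	INIT_LIST_HEAD(e);	/* self-link, so list_empty(e) becomes true */
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}
```

So `!list_empty(&irqfd->list)` would serve the same role as `irqfd->active`; the v6 code simply spells the state out with a named flag instead.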
>
>> };
>>
>> +static struct workqueue_struct *irqfd_cleanup_wq;
>> +
>> static void
>> irqfd_inject(struct work_struct *work)
>> {
>> @@ -58,32 +64,81 @@ irqfd_inject(struct work_struct *work)
>> mutex_unlock(&kvm->irq_lock);
>> }
>>
>> +/*
>> + * Race-free decouple logic (ordering is critical)
>> + */
>> +static void
>> +irqfd_shutdown(struct work_struct *work)
>> +{
>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
>> +
>> + /*
>> + * Synchronize with the wait-queue and unhook ourselves to prevent
>> + * further events.
>> + */
>> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> +
>> + /*
>> + * We know no new events will be scheduled at this point, so block
>> + * until all previously outstanding events have completed
>> + */
>> + flush_work(&irqfd->inject);
>> +
>> + /*
>> + * It is now safe to release the object's resources
>> + */
>> + eventfd_ctx_put(irqfd->eventfd);
>> + kfree(irqfd);
>> +}
>> +
>> +/*
>> + * Mark the irqfd as inactive and schedule it for removal
>> + *
>> + * assumes kvm->irqfds.lock is held
>> + */
>> +static void
>> +irqfd_deactivate(struct _irqfd *irqfd)
>> +{
>> + BUG_ON(!irqfd->active);
>> +
>> + irqfd->active = false;
>> + list_del(&irqfd->list);
>> +
>> + queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
>> +}
>> +
>> +/*
>> + * Called with wqh->lock held and interrupts disabled
>> + */
>> static int
>> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> {
>> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>> unsigned long flags = (unsigned long)key;
>>
>> - /*
>> - * Assume we will be called with interrupts disabled
>> - */
>> if (flags & POLLIN)
>> - /*
>> - * Defer the IRQ injection until later since we need to
>> - * acquire the kvm->lock to do so.
>> - */
>> + /* An event has been signaled, inject an interrupt */
>> schedule_work(&irqfd->inject);
>>
>> if (flags & POLLHUP) {
>> - /*
>> - * for now, just remove ourselves from the list and let
>> - * the rest dangle. We will fix this up later once
>> - * the races in eventfd are fixed
>> - */
>> - __remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> - irqfd->wqh = NULL;
>> + /* The eventfd is closing, detach from KVM */
>> + struct kvm *kvm = irqfd->kvm;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
>> +
>> + if (irqfd->active)
>> + /*
>> + * If we get here, we can be sure no-one else is
>> + * trying to shutdown this object at the same time
>> + * since we hold the list lock.
>> + */
>> + irqfd_deactivate(irqfd);
>> +
>> + spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
>> }
>>
>> +
>>
>
> extra empty line
>
Ack
>
>> return 0;
>> }
>>
>> @@ -102,6 +157,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> {
>> struct _irqfd *irqfd;
>> struct file *file = NULL;
>> + struct eventfd_ctx *eventfd = NULL;
>> int ret;
>> unsigned int events;
>>
>> @@ -113,6 +169,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> irqfd->gsi = gsi;
>> INIT_LIST_HEAD(&irqfd->list);
>> INIT_WORK(&irqfd->inject, irqfd_inject);
>> + INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
>> + irqfd->active = true;
>>
>> file = eventfd_fget(fd);
>> if (IS_ERR(file)) {
>> @@ -120,6 +178,14 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> goto fail;
>> }
>>
>> + eventfd = eventfd_ctx_fileget(file);
>> + if (IS_ERR(file)) {
>> + ret = PTR_ERR(file);
>> + goto fail;
>> + }
>> +
>> + irqfd->eventfd = eventfd;
>> +
>> /*
>> * Install our own custom wake-up handling so we are notified via
>> * a callback whenever someone signals the underlying eventfd
>> @@ -129,12 +195,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>
>> events = file->f_op->poll(file, &irqfd->pt);
>>
>> - mutex_lock(&kvm->lock);
>> - list_add_tail(&irqfd->list, &kvm->irqfds);
>> - mutex_unlock(&kvm->lock);
>> + spin_lock_irq(&kvm->irqfds.lock);
>> + list_add_tail(&irqfd->list, &kvm->irqfds.items);
>> + spin_unlock_irq(&kvm->irqfds.lock);
>>
>> /*
>> - * Check if there was an event already queued
>> + * Check if there was an event already pending on the eventfd
>> + * before we registered, and trigger it as if we didn't miss it.
>> */
>> if (events & POLLIN)
>> schedule_work(&irqfd->inject);
>> @@ -148,6 +215,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> return 0;
>>
>> fail:
>> + if (eventfd && !IS_ERR(eventfd))
>> + eventfd_ctx_put(eventfd);
>> +
>> if (file && !IS_ERR(file))
>> fput(file);
>>
>> @@ -158,24 +228,52 @@ fail:
>> void
>> kvm_irqfd_init(struct kvm *kvm)
>> {
>> - INIT_LIST_HEAD(&kvm->irqfds);
>> + spin_lock_init(&kvm->irqfds.lock);
>> + INIT_LIST_HEAD(&kvm->irqfds.items);
>> }
>>
>> +/*
>> + * This function is called as the kvm VM fd is being released. Shutdown all
>> + * irqfds that still remain open
>> + */
>> void
>> kvm_irqfd_release(struct kvm *kvm)
>> {
>> struct _irqfd *irqfd, *tmp;
>>
>> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
>> - if (irqfd->wqh)
>> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> + spin_lock_irq(&kvm->irqfds.lock);
>>
>> - flush_work(&irqfd->inject);
>> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
>> + irqfd_deactivate(irqfd);
>>
>> - mutex_lock(&kvm->lock);
>> - list_del(&irqfd->list);
>> - mutex_unlock(&kvm->lock);
>> + spin_unlock_irq(&kvm->irqfds.lock);
>> +
>> + /*
>> + * Block until we know all outstanding shutdown jobs have completed
>> + * since we do not take a kvm* reference.
>> + */
>> + flush_workqueue(irqfd_cleanup_wq);
>>
>> - kfree(irqfd);
>> - }
>> }
>> +
>> +/*
>> + * create a host-wide workqueue for issuing deferred shutdown requests
>> + * aggregated from all vm* instances. We need our own isolated single-thread
>> + * queue to prevent deadlock against flushing the normal work-queue.
>> + */
>> +static int __init irqfd_module_init(void)
>> +{
>> + irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
>> + if (!irqfd_cleanup_wq)
>> + return -ENOMEM;
>> +
>> + return 0;
>> +}
>> +
>> +static void __exit irqfd_module_exit(void)
>> +{
>> + destroy_workqueue(irqfd_cleanup_wq);
>> +}
>> +
>> +module_init(irqfd_module_init);
>> +module_exit(irqfd_module_exit);
>>
* Re: [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
2009-06-29 16:52 ` Gregory Haskins
@ 2009-06-29 17:32 ` Michael S. Tsirkin
2009-07-01 9:44 ` Avi Kivity
1 sibling, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2009-06-29 17:32 UTC (permalink / raw)
To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davidel
On Mon, Jun 29, 2009 at 12:52:24PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > Looks good. A couple of minor nits:
> >
> > On Mon, Jun 29, 2009 at 11:44:15AM -0400, Gregory Haskins wrote:
> >
> >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >> "release" callback. This lets eventfd clients know if the eventfd is about
> >> to go away and is very useful particularly for in-kernel clients. However,
> >> until recently it was not possible to use this feature of eventfd in a
> >> race-free way. This patch utilizes a new eventfd interface to rectify
> >> the problem.
> >>
> >> Note that one final race is known to exist: the slow-work thread may race
> >> with module removal. We are currently working with slow-work upstream
> >> to fix this issue as well. Since the code prior to this patch also
> >> races with module_put(), we are not making anything worse, but rather
> >> shifting the cause of the race. Once the slow-work code is patched we
> >> will be fixing the last remaining issue.
> >>
> >> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> >> ---
> >>
> >> include/linux/kvm_host.h | 5 +
> >> virt/kvm/Kconfig | 1
> >> virt/kvm/eventfd.c | 154 ++++++++++++++++++++++++++++++++++++++--------
> >> 3 files changed, 131 insertions(+), 29 deletions(-)
> >>
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index 2451f48..5a90c6a 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -141,7 +141,10 @@ struct kvm {
> >> struct kvm_io_bus mmio_bus;
> >> struct kvm_io_bus pio_bus;
> >> #ifdef CONFIG_HAVE_KVM_EVENTFD
> >> - struct list_head irqfds;
> >> + struct {
> >> + spinlock_t lock;
> >> + struct list_head items;
> >> + } irqfds;
> >> #endif
> >> struct kvm_vm_stat stat;
> >> struct kvm_arch arch;
> >> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> >> index daece36..ab7848a 100644
> >> --- a/virt/kvm/Kconfig
> >> +++ b/virt/kvm/Kconfig
> >> @@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
> >> config HAVE_KVM_EVENTFD
> >> bool
> >> select EVENTFD
> >> + select SLOW_WORK
> >>
> >
> > not needed in this version?
> >
>
> Good catch. Will remove.
>
> >
> >> config KVM_APIC_ARCHITECTURE
> >> bool
> >> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> >> index 9656027..76ad125 100644
> >> --- a/virt/kvm/eventfd.c
> >> +++ b/virt/kvm/eventfd.c
> >> @@ -36,16 +36,22 @@
> >> * Credit goes to Avi Kivity for the original idea.
> >> * --------------------------------------------------------------------
> >> */
> >> +
> >> struct _irqfd {
> >> struct kvm *kvm;
> >> + struct eventfd_ctx *eventfd;
> >> int gsi;
> >> struct list_head list;
> >> poll_table pt;
> >> wait_queue_head_t *wqh;
> >> wait_queue_t wait;
> >> struct work_struct inject;
> >> + struct work_struct shutdown;
> >> + int active:1;
> >>
> >
> > I think we need a comment here that active bit is protected by irqfds
> > lock.
>
> Ack
>
> > One idea I had to make it even clearer was to have a shutdown list
> > of irqfds per-kvm, together with the items list, and make work_struct for
> > shutdown global, not per-irqfd. We can then unconditionally do
> > list_move + schedule_work to shut down an irqfd, and it's safe to do
> > even if it is already on the shutdown list - it just gets moved to tail.
> >
>
> Hmm... I'm not sure that churn really buys us anything, though. Technically
> the "active" bit is redundant with the list_del_init()+list_empty() approach I
> employed in previous versions. However, I made it explicit with the
> active bit to be more self-documenting. IMO, the latest code is pretty
> clear, and the change you are proposing moves towards a slightly
> trickier variant like I originally had. I'd say "let's leave this as is".
ok.
> >
> >> };
> >>
> >> +static struct workqueue_struct *irqfd_cleanup_wq;
> >> +
> >> static void
> >> irqfd_inject(struct work_struct *work)
> >> {
> >> @@ -58,32 +64,81 @@ irqfd_inject(struct work_struct *work)
> >> mutex_unlock(&kvm->irq_lock);
> >> }
> >>
> >> +/*
> >> + * Race-free decouple logic (ordering is critical)
> >> + */
> >> +static void
> >> +irqfd_shutdown(struct work_struct *work)
> >> +{
> >> + struct _irqfd *irqfd = container_of(work, struct _irqfd, shutdown);
> >> +
> >> + /*
> >> + * Synchronize with the wait-queue and unhook ourselves to prevent
> >> + * further events.
> >> + */
> >> + remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> +
> >> + /*
> >> + * We know no new events will be scheduled at this point, so block
> >> + * until all previously outstanding events have completed
> >> + */
> >> + flush_work(&irqfd->inject);
> >> +
> >> + /*
> >> + * It is now safe to release the object's resources
> >> + */
> >> + eventfd_ctx_put(irqfd->eventfd);
> >> + kfree(irqfd);
> >> +}
> >> +
> >> +/*
> >> + * Mark the irqfd as inactive and schedule it for removal
> >> + *
> >> + * assumes kvm->irqfds.lock is held
> >> + */
> >> +static void
> >> +irqfd_deactivate(struct _irqfd *irqfd)
> >> +{
> >> + BUG_ON(!irqfd->active);
> >> +
> >> + irqfd->active = false;
> >> + list_del(&irqfd->list);
> >> +
> >> + queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> >> +}
> >> +
> >> +/*
> >> + * Called with wqh->lock held and interrupts disabled
> >> + */
> >> static int
> >> irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >> {
> >> struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >> unsigned long flags = (unsigned long)key;
> >>
> >> - /*
> >> - * Assume we will be called with interrupts disabled
> >> - */
> >> if (flags & POLLIN)
> >> - /*
> >> - * Defer the IRQ injection until later since we need to
> >> - * acquire the kvm->lock to do so.
> >> - */
> >> + /* An event has been signaled, inject an interrupt */
> >> schedule_work(&irqfd->inject);
> >>
> >> if (flags & POLLHUP) {
> >> - /*
> >> - * for now, just remove ourselves from the list and let
> >> - * the rest dangle. We will fix this up later once
> >> - * the races in eventfd are fixed
> >> - */
> >> - __remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> - irqfd->wqh = NULL;
> >> + /* The eventfd is closing, detach from KVM */
> >> + struct kvm *kvm = irqfd->kvm;
> >> + unsigned long flags;
> >> +
> >> + spin_lock_irqsave(&kvm->irqfds.lock, flags);
> >> +
> >> + if (irqfd->active)
> >> + /*
> >> + * If we get here, we can be sure no-one else is
> >> + * trying to shutdown this object at the same time
> >> + * since we hold the list lock.
> >> + */
> >> + irqfd_deactivate(irqfd);
> >> +
> >> + spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
> >> }
> >>
> >> +
> >>
> >
> > extra empty line
> >
>
> Ack
> >
> >> return 0;
> >> }
> >>
> >> @@ -102,6 +157,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> {
> >> struct _irqfd *irqfd;
> >> struct file *file = NULL;
> >> + struct eventfd_ctx *eventfd = NULL;
> >> int ret;
> >> unsigned int events;
> >>
> >> @@ -113,6 +169,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> irqfd->gsi = gsi;
> >> INIT_LIST_HEAD(&irqfd->list);
> >> INIT_WORK(&irqfd->inject, irqfd_inject);
> >> + INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> >> + irqfd->active = true;
> >>
> >> file = eventfd_fget(fd);
> >> if (IS_ERR(file)) {
> >> @@ -120,6 +178,14 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> goto fail;
> >> }
> >>
> >> + eventfd = eventfd_ctx_fileget(file);
> >> + if (IS_ERR(eventfd)) {
> >> + ret = PTR_ERR(eventfd);
> >> + goto fail;
> >> + }
> >> +
> >> + irqfd->eventfd = eventfd;
> >> +
> >> /*
> >> * Install our own custom wake-up handling so we are notified via
> >> * a callback whenever someone signals the underlying eventfd
> >> @@ -129,12 +195,13 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >>
> >> events = file->f_op->poll(file, &irqfd->pt);
> >>
> >> - mutex_lock(&kvm->lock);
> >> - list_add_tail(&irqfd->list, &kvm->irqfds);
> >> - mutex_unlock(&kvm->lock);
> >> + spin_lock_irq(&kvm->irqfds.lock);
> >> + list_add_tail(&irqfd->list, &kvm->irqfds.items);
> >> + spin_unlock_irq(&kvm->irqfds.lock);
> >>
> >> /*
> >> - * Check if there was an event already queued
> >> + * Check if there was an event already pending on the eventfd
> >> + * before we registered, and trigger it as if we didn't miss it.
> >> */
> >> if (events & POLLIN)
> >> schedule_work(&irqfd->inject);
> >> @@ -148,6 +215,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> >> return 0;
> >>
> >> fail:
> >> + if (eventfd && !IS_ERR(eventfd))
> >> + eventfd_ctx_put(eventfd);
> >> +
> >> if (file && !IS_ERR(file))
> >> fput(file);
> >>
> >> @@ -158,24 +228,52 @@ fail:
> >> void
> >> kvm_irqfd_init(struct kvm *kvm)
> >> {
> >> - INIT_LIST_HEAD(&kvm->irqfds);
> >> + spin_lock_init(&kvm->irqfds.lock);
> >> + INIT_LIST_HEAD(&kvm->irqfds.items);
> >> }
> >>
> >> +/*
> >> + * This function is called as the kvm VM fd is being released. Shutdown all
> >> + * irqfds that still remain open
> >> + */
> >> void
> >> kvm_irqfd_release(struct kvm *kvm)
> >> {
> >> struct _irqfd *irqfd, *tmp;
> >>
> >> - list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> >> - if (irqfd->wqh)
> >> - remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> + spin_lock_irq(&kvm->irqfds.lock);
> >>
> >> - flush_work(&irqfd->inject);
> >> + list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list)
> >> + irqfd_deactivate(irqfd);
> >>
> >> - mutex_lock(&kvm->lock);
> >> - list_del(&irqfd->list);
> >> - mutex_unlock(&kvm->lock);
> >> + spin_unlock_irq(&kvm->irqfds.lock);
> >> +
> >> + /*
> >> + * Block until we know all outstanding shutdown jobs have completed
> >> + * since we do not take a kvm* reference.
> >> + */
> >> + flush_workqueue(irqfd_cleanup_wq);
> >>
> >> - kfree(irqfd);
> >> - }
> >> }
> >> +
> >> +/*
> >> + * create a host-wide workqueue for issuing deferred shutdown requests
> >> + * aggregated from all vm* instances. We need our own isolated single-thread
> >> + * queue to prevent deadlock against flushing the normal work-queue.
> >> + */
> >> +static int __init irqfd_module_init(void)
> >> +{
> >> + irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");
> >> + if (!irqfd_cleanup_wq)
> >> + return -ENOMEM;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static void __exit irqfd_module_exit(void)
> >> +{
> >> + destroy_workqueue(irqfd_cleanup_wq);
> >> +}
> >> +
> >> +module_init(irqfd_module_init);
> >> +module_exit(irqfd_module_exit);
> >>
>
>
* Re: [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
2009-06-29 16:52 ` Gregory Haskins
2009-06-29 17:32 ` Michael S. Tsirkin
@ 2009-07-01 9:44 ` Avi Kivity
1 sibling, 0 replies; 11+ messages in thread
From: Avi Kivity @ 2009-07-01 9:44 UTC (permalink / raw)
To: Gregory Haskins; +Cc: Michael S. Tsirkin, kvm, linux-kernel, davidel
On 06/29/2009 07:52 PM, Gregory Haskins wrote:
>
>> One idea I had to make it even clearer was to have a shutdown list
>> of irqfds per-kvm, together with the items list, and make work_struct for
>> shutdown global, not per-irqfd. We can then unconditionally do
>> list_move + schedule_work to shut down an irqfd, and it's safe to do
>> even if it is already on the shutdown list - it just gets moved to tail.
>>
>>
>
> Hmm..I'm not sure that churn really buys us anything, tho. Technically
> the "active" bit is redundant with list_del_init()+list_empty() that I
> employed in previous versions. However, I made it explicit with the
> active bit to be more self-documenting. IMO, the latest code is pretty
> clear, and the change you are proposing is moving towards a slightly
> trickier variant like I originally had. I'd say "let's leave this as is".
>
Could retain self documentation by introducing a helper irqfd_active()
which does the list_blah() magic.
--
error compiling committee.c: too many arguments to function
2009-06-29 15:43 [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
2009-06-29 15:44 ` [KVM PATCH v6 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
2009-06-29 16:11 ` Michael S. Tsirkin
2009-06-29 16:52 ` Gregory Haskins
2009-06-29 17:32 ` Michael S. Tsirkin
2009-07-01 9:44 ` Avi Kivity
2009-06-29 15:44 ` [KVM PATCH v6 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
2009-06-29 16:16 ` Michael S. Tsirkin
2009-06-29 15:46 ` [KVM PATCH v6 0/4] irqfd fixes and enhancements Gregory Haskins