public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Gregory Haskins <ghaskins@novell.com>
To: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, mst@redhat.com, avi@redhat.com,
	paulmck@linux.vnet.ibm.com, davidel@xmailserver.org,
	mingo@elte.hu, rusty@rustcorp.com.au
Subject: [KVM PATCH v3 2/3] eventfd: add internal reference counting to fix notifier race conditions
Date: Mon, 22 Jun 2009 12:05:51 -0400	[thread overview]
Message-ID: <20090622160551.22967.32376.stgit@dev.haskins.net> (raw)
In-Reply-To: <20090622155504.22967.50532.stgit@dev.haskins.net>

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback.  This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients.  However,
as it stands today it is not possible to use this feature of eventfd in a
race-free way.  This patch adds some additional logic to eventfd in order
to rectify this problem.

Background:
-----------------------
Eventfd currently only has one reference count mechanism: fget/fput.  This
in of itself is normally fine.  However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS.  Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference.  To make matters more complicated,
the release callback is issued in an unlocked state.  This makes it nearly
impossible to design a mutual decoupling algorithm: you cannot unhook one
side from the other (or vice versa) without racing.

-----------------------

In summary, there are two fundamental problems:

1) The POLLHUP wakeup is broadcast lockless
2) There are no references to the wait-queue-head (embedded in eventfd_ctx)

We fix this by using the locked variant of wakeup for POLLHUP, and by
adding/exposing a kref to the underlying eventfd_ctx.  Clients should then
be able to govern their usage of the wait-queue as they do for any other
wait-queue in the kernel.

We propose this more raw solution rather than trying to encapsulate the
poll-callback because there are advantages to decoupling the
remove_wait_queue from the kref_put().  Namely, its nice to unhook the
wait-queue inside the wakeup, but to defer the kref_put() until we can
synchronize with the client.

Between these points, we believe we now have a race-free release
mechanism.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Davide Libenzi <davidel@xmailserver.org>
---

 fs/eventfd.c            |   43 ++++++++++++++++++++++++++++++++++++-------
 include/linux/eventfd.h |    7 +++++++
 2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..4806116 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -17,8 +17,10 @@
 #include <linux/eventfd.h>
 #include <linux/syscalls.h>
 #include <linux/module.h>
+#include <linux/kref.h>
 
 struct eventfd_ctx {
+	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
@@ -59,17 +61,24 @@ int eventfd_signal(struct file *file, int n)
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
+static void _eventfd_release(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+	kref_put(kref, &_eventfd_release);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 
-	/*
-	 * No need to hold the lock here, since we are on the file cleanup
-	 * path and the ones still attached to the wait queue will be
-	 * serialized by wake_up_locked_poll().
-	 */
-	wake_up_locked_poll(&ctx->wqh, POLLHUP);
-	kfree(ctx);
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	_eventfd_put(&ctx->kref);
 	return 0;
 }
 
@@ -209,6 +218,26 @@ struct file *eventfd_fget(int fd)
 }
 EXPORT_SYMBOL_GPL(eventfd_fget);
 
+struct kref *eventfd_kref_get(struct file *file)
+{
+	struct eventfd_ctx *ctx;
+
+	if (file->f_op != &eventfd_fops)
+		return ERR_PTR(-EINVAL);
+
+	ctx = file->private_data;
+	kref_get(&ctx->kref);
+
+	return &ctx->kref;
+}
+EXPORT_SYMBOL_GPL(eventfd_kref_get);
+
+void eventfd_kref_put(struct kref *kref)
+{
+	_eventfd_put(kref);
+}
+EXPORT_SYMBOL_GPL(eventfd_kref_put);
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	int fd;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..c0396b3 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,8 @@
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
+#include <linux/kref.h>
+
 #ifdef CONFIG_EVENTFD
 
 /* For O_CLOEXEC and O_NONBLOCK */
@@ -28,11 +30,16 @@
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
 struct file *eventfd_fget(int fd);
+struct kref *eventfd_kref_get(struct file *file);
+void eventfd_kref_put(struct kref *kref);
 int eventfd_signal(struct file *file, int n);
 
 #else /* CONFIG_EVENTFD */
 
 #define eventfd_fget(fd) ERR_PTR(-ENOSYS)
+#define eventfd_kref_get(file) ERR_PTR(-ENOSYS);
+static inline void eventfd_kref_put(struct kref *kref)
+{ }
 static inline int eventfd_signal(struct file *file, int n)
 { return 0; }
 


  parent reply	other threads:[~2009-06-22 16:06 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-22 16:05 [KVM PATCH v3 0/3] irqfd/eventfd fixes Gregory Haskins
2009-06-22 16:05 ` [KVM PATCH v3 1/3] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-22 16:05 ` Gregory Haskins [this message]
2009-06-22 16:05 ` [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
2009-06-22 16:57   ` Michael S. Tsirkin
2009-06-22 17:31     ` Gregory Haskins
2009-06-22 17:45       ` Michael S. Tsirkin
2009-06-22 18:03         ` Gregory Haskins
2009-06-22 18:26           ` Michael S. Tsirkin
2009-06-22 18:11         ` Davide Libenzi
2009-06-22 18:32           ` Michael S. Tsirkin
2009-06-22 18:41             ` Davide Libenzi
2009-06-22 18:52               ` Michael S. Tsirkin
2009-06-23 14:55       ` Gregory Haskins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090622160551.22967.32376.stgit@dev.haskins.net \
    --to=ghaskins@novell.com \
    --cc=avi@redhat.com \
    --cc=davidel@xmailserver.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mst@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox