From: Gregory Haskins <ghaskins@novell.com>
To: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, mst@redhat.com, avi@redhat.com,
paulmck@linux.vnet.ibm.com, davidel@xmailserver.org,
mingo@elte.hu, rusty@rustcorp.com.au
Subject: [KVM PATCH v3 2/3] eventfd: add internal reference counting to fix notifier race conditions
Date: Mon, 22 Jun 2009 12:05:51 -0400 [thread overview]
Message-ID: <20090622160551.22967.32376.stgit@dev.haskins.net> (raw)
In-Reply-To: <20090622155504.22967.50532.stgit@dev.haskins.net>
eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
as it stands today it is not possible to use this feature of eventfd in a
race-free way. This patch adds some additional logic to eventfd in order
to rectify this problem.
Background:
-----------------------
Eventfd currently only has one reference count mechanism: fget/fput. This
in of itself is normally fine. However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS. Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference. To make matters more complicated,
the release callback is issued in an unlocked state. This makes it nearly
impossible to design a mutual decoupling algorithm: you cannot unhook one
side from the other (or vice versa) without racing.
-----------------------
In summary, there are two fundamental problems:
1) The POLLHUP wakeup is broadcast lockless
2) There are no references to the wait-queue-head (embedded in eventfd_ctx)
We fix this by using the locked variant of wakeup for POLLHUP, and by
adding/exposing a kref to the underlying eventfd_ctx. Clients should then
be able to govern their usage of the wait-queue as they do for any other
wait-queue in the kernel.
We propose this more raw solution rather than trying to encapsulate the
poll-callback because there are advantages to decoupling the
remove_wait_queue from the kref_put(). Namely, its nice to unhook the
wait-queue inside the wakeup, but to defer the kref_put() until we can
synchronize with the client.
Between these points, we believe we now have a race-free release
mechanism.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Davide Libenzi <davidel@xmailserver.org>
---
fs/eventfd.c | 43 ++++++++++++++++++++++++++++++++++++-------
include/linux/eventfd.h | 7 +++++++
2 files changed, 43 insertions(+), 7 deletions(-)
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..4806116 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>
struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
/*
* Every time that a write(2) is performed on an eventfd, the
@@ -59,17 +61,24 @@ int eventfd_signal(struct file *file, int n)
}
EXPORT_SYMBOL_GPL(eventfd_signal);
+static void _eventfd_release(struct kref *kref)
+{
+ struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+ kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+ kref_put(kref, &_eventfd_release);
+}
+
static int eventfd_release(struct inode *inode, struct file *file)
{
struct eventfd_ctx *ctx = file->private_data;
- /*
- * No need to hold the lock here, since we are on the file cleanup
- * path and the ones still attached to the wait queue will be
- * serialized by wake_up_locked_poll().
- */
- wake_up_locked_poll(&ctx->wqh, POLLHUP);
- kfree(ctx);
+ wake_up_poll(&ctx->wqh, POLLHUP);
+ _eventfd_put(&ctx->kref);
return 0;
}
@@ -209,6 +218,26 @@ struct file *eventfd_fget(int fd)
}
EXPORT_SYMBOL_GPL(eventfd_fget);
+struct kref *eventfd_kref_get(struct file *file)
+{
+ struct eventfd_ctx *ctx;
+
+ if (file->f_op != &eventfd_fops)
+ return ERR_PTR(-EINVAL);
+
+ ctx = file->private_data;
+ kref_get(&ctx->kref);
+
+ return &ctx->kref;
+}
+EXPORT_SYMBOL_GPL(eventfd_kref_get);
+
+void eventfd_kref_put(struct kref *kref)
+{
+ _eventfd_put(kref);
+}
+EXPORT_SYMBOL_GPL(eventfd_kref_put);
+
SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
{
int fd;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..c0396b3 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,8 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H
+#include <linux/kref.h>
+
#ifdef CONFIG_EVENTFD
/* For O_CLOEXEC and O_NONBLOCK */
@@ -28,11 +30,16 @@
#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
struct file *eventfd_fget(int fd);
+struct kref *eventfd_kref_get(struct file *file);
+void eventfd_kref_put(struct kref *kref);
int eventfd_signal(struct file *file, int n);
#else /* CONFIG_EVENTFD */
#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
+#define eventfd_kref_get(file) ERR_PTR(-ENOSYS);
+static inline void eventfd_kref_put(struct kref *kref)
+{ }
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
next prev parent reply other threads:[~2009-06-22 16:06 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-06-22 16:05 [KVM PATCH v3 0/3] irqfd/eventfd fixes Gregory Haskins
2009-06-22 16:05 ` [KVM PATCH v3 1/3] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-22 16:05 ` Gregory Haskins [this message]
2009-06-22 16:05 ` [KVM PATCH v3 3/3] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
2009-06-22 16:57 ` Michael S. Tsirkin
2009-06-22 17:31 ` Gregory Haskins
2009-06-22 17:45 ` Michael S. Tsirkin
2009-06-22 18:03 ` Gregory Haskins
2009-06-22 18:26 ` Michael S. Tsirkin
2009-06-22 18:11 ` Davide Libenzi
2009-06-22 18:32 ` Michael S. Tsirkin
2009-06-22 18:41 ` Davide Libenzi
2009-06-22 18:52 ` Michael S. Tsirkin
2009-06-23 14:55 ` Gregory Haskins
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20090622160551.22967.32376.stgit@dev.haskins.net \
--to=ghaskins@novell.com \
--cc=avi@redhat.com \
--cc=davidel@xmailserver.org \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=mst@redhat.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=rusty@rustcorp.com.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.