All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Davide Libenzi <davidel@xmailserver.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@elte.hu>,
	Suparna Bhattacharya <suparna@in.ibm.com>,
	Zach Brown <zach.brown@oracle.com>,
	Benjamin LaHaise <bcrl@kvack.org>
Subject: Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...
Date: Fri, 30 Mar 2007 12:40:28 -0700	[thread overview]
Message-ID: <20070330124028.c6e54d2e.akpm@linux-foundation.org> (raw)
In-Reply-To: <send-serie.davidel@xmailserver.org.28743.1174415840.10>

On Tue, 20 Mar 2007 11:37:14 -0700
Davide Libenzi <davidel@xmailserver.org> wrote:

> This is a very simple and light file descriptor, that can be used as
> event wait/dispatch by userspace (both wait and dispatch) and by the
> kernel (dispatch only). It can be used instead of pipe(2) in all cases
> where those would simply be used to signal events. Their kernel overhead
> is much lower than pipes, and they do not consume two fds. When used in
> the kernel, it can offer an fd-bridge to enable, for example, functionalities
> like KAIO or syslets/threadlets to signal to an fd the completion of certain
> operations. But more in general, an eventfd can be used by the kernel to
> signal readiness, in a POSIX poll/select way, of interfaces that would
> otherwise be incompatible with it. The API is:
> 
> int eventfd(unsigned int count);
>
> The eventfd API accepts an initial "count" parameter, and returns an
> eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
> The POLLIN flag is raised when the internal counter is greater than zero.
> The POLLOUT flag is raised when at least a value of "1" can be written to
> the internal counter.
> The POLLERR flag is raised when an overflow in the counter value is detected.
> The write(2) operation can never overflow the counter, since it blocks
> (unless O_NONBLOCK is set, in which case -EAGAIN is returned).
> But the eventfd_signal() function can do it, since it's supposed to not
> sleep during its operation.
> The read(2) function reads the __u64 counter value, and reset the internal
> value to zero. If the value read is equal to (__u64) -1, an overflow
> happened on the internal counter (due to 2^64 eventfd_signal() posts
> that has never been retired - unlickely, but possible).
> The write(2) call writes an __u64 count value, and adds it
> to the current counter. The eventfd fd supports O_NONBLOCK also.
> On the kernel side, we have:
> 
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, unsigned int n);
> 
> The eventfd_fget() should be called to get a struct file* from an eventfd
> fd (this is an fget() + check of f_op being an eventfd fops pointer).
> The kernel can then call eventfd_signal() every time it wants to post
> an event to userspace. The eventfd_signal() function can be called from any
> context.
> An eventfd() simple test and bench is available here:
> 
> http://www.xmailserver.org/eventfd-bench.c
> 
> This is the eventfd-based version of pipetest-4 (pipe(2) based):
> 
> http://www.xmailserver.org/pipetest-4.c
> 
> Not that performance matters much in the eventfd case, but eventfd-bench
> shows almost as double as performance than pipetest-4.
> 
>...
> 
> Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.21-rc3.quilt/fs/eventfd.c	2007-03-19 19:01:58.000000000 -0700
> @@ -0,0 +1,237 @@
> +/*
> + *  fs/eventfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *
> + */
> +
> +#include <linux/file.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/jiffies.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/eventfd.h>
> +
> +#include <asm/uaccess.h>
> +
> +
> +
> +struct eventfd_ctx {
> +	spinlock_t lock;
> +	wait_queue_head_t wqh;
> +	__u64 count;
> +};

Again, can we borrow wqh.lock?

`count' needs documentation - these things are key to understanding the
code.

> +
> +
> +static int eventfd_close(struct inode *inode, struct file *file);
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait);
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos);
> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos);

dittos

> +
> +
> +

dittos

> +static const struct file_operations eventfd_fops = {
> +	.release	= eventfd_close,

dittos

> +	.poll		= eventfd_poll,
> +	.read		= eventfd_read,
> +	.write		= eventfd_write,
> +};
> +
> +
> +
> +struct file *eventfd_fget(int fd)
> +{
> +	struct file *file;
> +
> +	file = fget(fd);
> +	if (!file)
> +		return ERR_PTR(-EBADF);
> +	if (file->f_op != &eventfd_fops) {
> +		fput(file);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	return file;
> +}
> +
> +/*
> + * This function is supposed to be called by the kernel in paths
> + * that do not allow to sleep. In this function we allow the counter

"that do not allow sleeping"

> + * to reach the ULLONG_MAX value, and we signal this as overflow
> + * condition by returining a POLLERR to poll(2).

typo

> + */

So it is the caller's responsibility to ensure that *file refers to an
eventfd file?

> +int eventfd_signal(struct file *file, int n)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned long flags;
> +
> +	if (n < 0)
> +		return -EINVAL;
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ULLONG_MAX - ctx->count < n)
> +		n = (int) (ULLONG_MAX - ctx->count);
> +	ctx->count += n;
> +	if (waitqueue_active(&ctx->wqh))
> +		wake_up_locked(&ctx->wqh);
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return n;
> +}

Neither the incoming arg (usefully named "n") nor the return value are
documented.


> +asmlinkage long sys_eventfd(unsigned int count)
> +{
> +	int error, fd;
> +	struct eventfd_ctx *ctx;
> +	struct file *file;
> +	struct inode *inode;
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;
> +
> +	init_waitqueue_head(&ctx->wqh);
> +	spin_lock_init(&ctx->lock);
> +	ctx->count = count;
> +
> +	/*
> +	 * When we call this, the initialization must be complete, since
> +	 * aino_getfd() will install the fd.
> +	 */
> +	error = aino_getfd(&fd, &inode, &file, "[eventfd]",
> +			   &eventfd_fops, ctx);
> +	if (!error)
> +		return fd;
> +
> +	kfree(ctx);
> +	return error;
> +}

Documentation?

> +static int eventfd_close(struct inode *inode, struct file *file)
> +{
> +	kfree(file->private_data);
> +	return 0;
> +}
> +
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	unsigned int events = 0;
> +	unsigned long flags;
> +
> +	poll_wait(file, &ctx->wqh, wait);
> +
> +	spin_lock_irqsave(&ctx->lock, flags);
> +	if (ctx->count > 0)
> +		events |= POLLIN;
> +	if (ctx->count == ULLONG_MAX)
> +		events |= POLLERR;
> +	if (ULLONG_MAX - 1 > ctx->count)
> +		events |= POLLOUT;
> +	spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +	return events;
> +}
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> +			    loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	ucnt = ctx->count;
> +	if (ucnt > 0)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ctx->count > 0) {
> +				ucnt = ctx->count;
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count = 0;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> +		return -EFAULT;
> +
> +	return res;
> +}

Needs interface documentation, please.  Even the changelog doesn't tell us
what an EAGAIN return from read() means.

> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> +			     loff_t *ppos)
> +{
> +	struct eventfd_ctx *ctx = file->private_data;
> +	ssize_t res;
> +	__u64 ucnt;
> +	DECLARE_WAITQUEUE(wait, current);
> +
> +	if (count < sizeof(ucnt))
> +		return -EINVAL;
> +	if (get_user(ucnt, (const __u64 __user *) buf))
> +		return -EFAULT;

Some architectures do not implement 64-bit get_user()

> +	if (ucnt == ULLONG_MAX)
> +		return -EINVAL;
> +	spin_lock_irq(&ctx->lock);
> +	res = -EAGAIN;
> +	if (ULLONG_MAX - ctx->count > ucnt)
> +		res = sizeof(ucnt);
> +	else if (!(file->f_flags & O_NONBLOCK)) {
> +		__add_wait_queue(&ctx->wqh, &wait);
> +		for (res = 0;;) {
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			if (ULLONG_MAX - ctx->count > ucnt) {
> +				res = sizeof(ucnt);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				res = -ERESTARTSYS;
> +				break;
> +			}
> +			spin_unlock_irq(&ctx->lock);
> +			schedule();
> +			spin_lock_irq(&ctx->lock);
> +		}
> +		__remove_wait_queue(&ctx->wqh, &wait);
> +		__set_current_state(TASK_RUNNING);
> +	}
> +	if (res > 0) {
> +		ctx->count += ucnt;
> +		if (waitqueue_active(&ctx->wqh))
> +			wake_up_locked(&ctx->wqh);
> +	}
> +	spin_unlock_irq(&ctx->lock);
> +
> +	return res;
> +}
> +


  reply	other threads:[~2007-03-30 19:41 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-20 18:37 [patch 10/13] signal/timer/event fds v8 - eventfd core Davide Libenzi
2007-03-30 19:40 ` Andrew Morton [this message]
2007-03-31  1:11   ` Davide Libenzi
2007-03-31  1:14     ` Davide Libenzi
2007-03-31  1:26     ` Andrew Morton
2007-03-31 19:50       ` Davide Libenzi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070330124028.c6e54d2e.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=bcrl@kvack.org \
    --cc=davidel@xmailserver.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=suparna@in.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=zach.brown@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.