Re: async network I/O, event channels, etc

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: async network I/O, event channels, etc
       [not found] <44C66FC9.3050402@redhat.com>
@ 2006-07-25 22:01 ` David Miller
  2006-07-25 22:55   ` Nicholas Miell
  2006-07-26  6:28   ` Evgeniy Polyakov
  0 siblings, 2 replies; 73+ messages in thread
From: David Miller @ 2006-07-25 22:01 UTC (permalink / raw)
  To: drepper; +Cc: linux-kernel, netdev

From: Ulrich Drepper <drepper@redhat.com>
Date: Tue, 25 Jul 2006 12:23:53 -0700

> I was very much surprised by the reactions I got after my OLS talk.
> Lots of people declared interest and even agreed with the approach and
> asked me to do further ahead with all this.  For those who missed it,
> the paper and the slides are available on my home page:
> 
> http://people.redhat.com/drepper/
> 
> As for the next steps I see a number of possible ways.  The discussions
> can be held on the usual mailing lists (i.e., lkml and netdev) but due
> to the raw nature of the current proposal I would imagine that would be
> mainly perceived as noise.

Since I gave a big thumbs up for Evgivny's kevent work yesterday
on linux-kernel, you might want to start by comparing your work
to his.  Because his has the advantage that 1) we have code now
and 2) he has written many test applications and performed many
benchmarks against his code which has flushed out most of the
major implementation issues.

I think most of the people who have encouraged your work are unaware
of Evgivny's kevent stuff, which is extremely unfortunate, the two
works are more similar than they are different.

I do not think discussing all of this on netdev would be perceived
as noise. :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
@ 2006-07-25 22:55   ` Nicholas Miell
  2006-07-26  6:28   ` Evgeniy Polyakov
  1 sibling, 0 replies; 73+ messages in thread
From: Nicholas Miell @ 2006-07-25 22:55 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, netdev

On Tue, 2006-07-25 at 15:01 -0700, David Miller wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
> 
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to do further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> > 
> > http://people.redhat.com/drepper/
> > 
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev) but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
> 
> Since I gave a big thumbs up for Evgivny's kevent work yesterday
> on linux-kernel, you might want to start by comparing your work
> to his.  Because his has the advantage that 1) we have code now
> and 2) he has written many test applications and performed many
> benchmarks against his code which has flushed out most of the
> major implementation issues.
> 
> I think most of the people who have encouraged your work are unaware
> of Evgivny's kevent stuff, which is extremely unfortunate, the two
> works are more similar than they are different.
> 
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

While the comparing is going on, how does this compare to Solaris's
ports interface? It's documented at
http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrir?a=view

Also, since we're on the subject, why a whole new interface for event
queuing instead of extending the existing io_getevents(2) and friends?

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
  2006-07-25 22:55   ` Nicholas Miell
@ 2006-07-26  6:28   ` Evgeniy Polyakov
  2006-07-26  9:18     ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
  2006-07-27  6:10     ` async network I/O, event channels, etc David Miller
  1 sibling, 2 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  6:28 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, netdev

On Tue, Jul 25, 2006 at 03:01:22PM -0700, David Miller (davem@davemloft.net) wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
> 
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to do further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> > 
> > http://people.redhat.com/drepper/
> > 
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev) but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
> 
> Since I gave a big thumbs up for Evgivny's kevent work yesterday
> on linux-kernel, you might want to start by comparing your work
> to his.  Because his has the advantage that 1) we have code now
> and 2) he has written many test applications and performed many
> benchmarks against his code which has flushed out most of the
> major implementation issues.
> 
> I think most of the people who have encouraged your work are unaware
> of Evgivny's kevent stuff, which is extremely unfortunate, the two
> works are more similar than they are different.
> 
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

Hello David, Ulrich.

Here is brief description of what is kevent and how it works.

Kevent subsystem incorporates several AIO/kqueue design notes and ideas.
Kevent can be used both for edge and level notifications. It supports
socket notifications (accept, send, recv), network AIO (aio_send(),
aio_recv() and aio_sendfile()), inode notifications (create/remove),
generic poll()/select() notifications and timer notifications.

There are several object in the kevent system:
storage - each source of events (socket, inode, timer, aio, anything) has
	structure kevent_storage incorporated into it, which is basically a list
	of registered interests for this source of events.
user - it is abstraction which holds all requested kevents. It is
	similar to FreeBSD's kqueue.
kevent - set of interests for given source of events or storage.

When kevent is queued into storage, it will live there until removed by
kevent_dequeue(). When some activity is noticed in given storage, it
scans it's kevent_storage->list for kevents which match activity event.
If kevents are found and they are not already in the
kevent_user->ready_list, they will be added there at the end.

ioctl(WAIT) (or appropriate syscall) will wait until either requested
number of kevents are ready or timeout elapsed or at least one kevent is
ready, it's behaviour depends on parameters.

It is possible to have one-shot kevents, which are automatically removed
when are ready.

Any event can be added/removed/modified by ioctl or special controlling
syscall.

Network AIO is based on kevent and works as usual kevent storage on top
of inode.
When new socket is created it is associated with that inode and when
some activity is detected appropriate notifications are generated and
kevent_naio_callback() is called.
When new kevent is being registered, network AIO ->enqueue() callback
simply marks itself like usual socket event watcher. It also locks
physical userspace pages in memory and stores appropriate pointers in
private kevent structure. I have not created additional DMA memory
allocation methods, like Ulrich described in his article, so I handle it
inside NAIO which has some overhead (I posted get_user_pages()
sclability graph some time ago).
Network AIO callback gets pointers to userspace pages and tries to copy
data from receiving skb queue into them using protocol specific
callback. This callback is very similar to ->recvmsg(), so they could
share a lot in future (as far as I recall it worked only with hardware
capable to do checksumming, I'm a bit lazy).

Both network and aio implementation work on top of hooks inside
appropriate state machines, but not as repeated call design (curect AIO) 
or special thread (SGI AIO). AIO work was stopped, since I was unable to 
achieve the same speed as synchronous read 
(maximum speeds were 2Gb/sec vs. 2.1 GB/sec for aio and sync IO accordingly
when reading data from the cache).
Network aio_sendfile() works lazily - it asynchronously populates pages
into the VFS cache (which can be used for various tricks with adaptive
readahead) and then uses usual ->sendfile() callback.

I have not created an interface for userspace events (like Solaris), 
since right now I do not see it's usefullness, but if there is
requirements for that it is quite easy with kevents.

I'm preparing set of kevent patches resend (with cleanups mentioned in
previous e-mails), which will be ready in a couple of moments.

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. network aio homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

3. LWN.net published a very good article about kevent.
http://lwn.net/Articles/172844/

Thank you.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [0/4] kevent: generic event processing subsystem.
  2006-07-26  6:28   ` Evgeniy Polyakov
@ 2006-07-26  9:18     ` Evgeniy Polyakov
  2006-07-26  9:18       ` [1/4] kevent: core files Evgeniy Polyakov
  2006-07-27  6:10     ` async network I/O, event channels, etc David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

Kevent subsystem incorporates several AIO/kqueue design notes and ideas.
Kevent can be used both for edge and level notifications. It supports
socket notifications (accept, send, recv), network AIO (aio_send(),
aio_recv() and aio_sendfile()), inode notifications (create/remove),
generic poll()/select() notifications and timer notifications.

There are several object in the kevent system:
storage - each source of events (socket, inode, timer, aio, anything) has
	structure kevent_storage incorporated into it, which is basically a list
	of registered interests for this source of events.
user - it is abstraction which holds all requested kevents. It is
	similar to FreeBSD's kqueue.
kevent - set of interests for given source of events or storage.

When kevent is queued into storage, it will live there until removed by
kevent_dequeue(). When some activity is noticed in given storage, it
scans it's kevent_storage->list for kevents which match activity event.
If kevents are found and they are not already in the
kevent_user->ready_list, they will be added there at the end.

ioctl(WAIT) (or appropriate syscall) will wait until either requested
number of kevents are ready or timeout elapsed or at least one kevent is
ready, it's behaviour depends on parameters.

It is possible to have one-shot kevents, which are automatically removed
when are ready.

Any event can be added/removed/modified by ioctl or special controlling
syscall.

Network AIO is based on kevent and works as usual kevent storage on top
of inode.
When new socket is created it is associated with that inode and when
some activity is detected appropriate notifications are generated and
kevent_naio_callback() is called.
When new kevent is being registered, network AIO ->enqueue() callback
simply marks itself like usual socket event watcher. It also locks
physical userspace pages in memory and stores appropriate pointers in
private kevent structure. I have not created additional DMA memory
allocation methods, like Ulrich described in his article, so I handle it
inside NAIO which has some overhead (I posted get_user_pages()
sclability graph some time ago).
Network AIO callback gets pointers to userspace pages and tries to copy
data from receiving skb queue into them using protocol specific
callback. This callback is very similar to ->recvmsg(), so they could
share a lot in future (as far as I recall it worked only with hardware
capable to do checksumming, I'm a bit lazy).

Both network and aio implementation work on top of hooks inside
appropriate state machines, but not as repeated call design (curect AIO) 
or special thread (SGI AIO). AIO work was stopped, since I was unable to 
achieve the same speed as synchronous read 
(maximum speeds were 2Gb/sec vs. 2.1 GB/sec for aio and sync IO accordingly
when reading data from the cache).
Network aio_sendfile() works lazily - it asynchronously populates pages
into the VFS cache (which can be used for various tricks with adaptive
readahead) and then uses usual ->sendfile() callback.

I have not created an interface for userspace events (like Solaris), 
since right now I do not see it's usefullness, but if there is
requirements for that it is quite easy with kevents.

Patches currently include ifdefs and kevent can be disabled in config,
when things are settled that can be removed.

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. network aio homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

3. LWN.net published a very good article about kevent.
http://lwn.net/Articles/172844/

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [1/4] kevent: core files.
  2006-07-26  9:18     ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
@ 2006-07-26  9:18       ` Evgeniy Polyakov
  2006-07-26  9:18         ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
                           ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev


This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

It might also inlclude parts from other subsystem (like network related
syscalls, so it is possible that it will not compile without other
patches applied).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index af56987..93e23ff 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -316,3 +316,7 @@ ENTRY(sys_call_table)
 	.long sys_sync_file_range
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
+	.long sys_aio_recv
+	.long sys_aio_send
+	.long sys_aio_sendfile
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5a92fed..534d516 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -696,4 +696,8 @@ #endif
 	.quad sys_sync_file_range
 	.quad sys_tee
 	.quad compat_sys_vmsplice
+	.quad sys_aio_recv
+	.quad sys_aio_send
+	.quad sys_aio_sendfile
+	.quad sys_kevent_ctl
 ia32_syscall_end:		

diff --git a/include/asm-i386/socket.h b/include/asm-i386/socket.h
index 802ae76..3473f5c 100644
--- a/include/asm-i386/socket.h
+++ b/include/asm-i386/socket.h
@@ -49,4 +49,6 @@ #define SO_ACCEPTCONN		30
 
 #define SO_PEERSEC		31
 
+#define SO_ASYNC_SOCK		34
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index de2ccc1..52f8642 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -322,10 +322,14 @@ #define __NR_splice		313
 #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
+#define __NR_aio_recv		317
+#define __NR_aio_send		318
+#define __NR_aio_sendfile	319
+#define __NR_kevent_ctl		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 317
+#define NR_syscalls 321
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/socket.h b/include/asm-x86_64/socket.h
index f2cdbea..1f31f86 100644
--- a/include/asm-x86_64/socket.h
+++ b/include/asm-x86_64/socket.h
@@ -49,4 +49,6 @@ #define SO_ACCEPTCONN		30
 
 #define SO_PEERSEC             31
 
+#define SO_ASYNC_SOCK		34
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 0aff22b..352c34b 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -617,11 +617,18 @@ #define __NR_sync_file_range	277
 __SYSCALL(__NR_sync_file_range, sys_sync_file_range)
 #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
+#define __NR_aio_recv		279
+__SYSCALL(__NR_aio_recv, sys_aio_recv)
+#define __NR_aio_send		280
+__SYSCALL(__NR_aio_send, sys_aio_send)
+#define __NR_aio_sendfile	281
+__SYSCALL(__NR_aio_sendfile, sys_aio_sendfile)
+#define __NR_kevent_ctl		282
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_vmsplice
-
+#define __NR_syscall_max __NR_kevent_ctl
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..e94a7bf
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,263 @@
+/*
+ * 	kevent.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT	0x1		/* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN	0x1		/* Kevent is broken. */
+#define KEVENT_RET_DONE		0x2		/* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+#define KEVENT_MASK_ALL		0xffffffff	/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY	0x0		/* Empty mask of ready events. */
+
+struct kevent_id
+{
+	__u32		raw[2];
+};
+
+struct ukevent
+{
+	struct kevent_id	id;			/* Id of this request, e.g. socket number, file descriptor and so on... */
+	__u32			type;			/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			event;			/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			req_flags;		/* Per-event request flags */
+	__u32			ret_flags;		/* Per-event return flags */
+	__u32			ret_data[2];		/* Event return data. Event originator fills it with anything it likes. */
+	union {
+		__u32		user[2];		/* User's data. It is not used, just copied to/from user. */
+		void		*ptr;
+	};
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+#define	KEVENT_CTL_WAIT		3
+#define	KEVENT_CTL_INIT		4
+
+struct kevent_user_control
+{
+	unsigned int		cmd;			/* Control command, e.g. KEVENT_ADD, KEVENT_REMOVE... */
+	unsigned int		num;			/* Number of ukevents this strucutre controls. */
+	unsigned int		timeout;		/* Timeout in milliseconds waiting for "num" events to become ready. */
+};
+
+#define KEVENT_USER_SYMBOL	'K'
+#define KEVENT_USER_CTL		_IOWR(KEVENT_USER_SYMBOL, 0, struct kevent_user_control)
+#define KEVENT_USER_WAIT	_IOWR(KEVENT_USER_SYMBOL, 1, struct kevent_user_control)
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/kevent_storage.h>
+#include <asm/semaphore.h>
+
+struct inode;
+struct dentry;
+struct sock;
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+struct kevent
+{
+	struct ukevent		event;
+	spinlock_t		lock;			/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+
+	struct list_head	kevent_entry;		/* Entry of user's queue. */
+	struct list_head	storage_entry;		/* Entry of origin's queue. */
+	struct list_head	ready_entry;		/* Entry of user's ready. */
+
+	struct kevent_user	*user;			/* User who requested this kevent. */
+	struct kevent_storage	*st;			/* Kevent container. */
+
+	kevent_callback_t	callback;		/* Is called each time new event has been caught. */
+	kevent_callback_t	enqueue;		/* Is called each time new event is queued. */
+	kevent_callback_t	dequeue;		/* Is called each time event is dequeued. */
+
+	void			*priv;			/* Private data for different storages. 
+							 * poll()/select storage has a list of wait_queue_t containers 
+							 * for each ->poll() { poll_wait()' } here.
+							 */
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_list
+{
+	struct list_head	kevent_list;		/* List of all kevents. */
+	spinlock_t 		kevent_lock;		/* Protects all manipulations with queue of kevents. */
+};
+
+struct kevent_user
+{
+	struct kevent_list	kqueue[KEVENT_HASH_MASK+1];
+	unsigned int		kevent_num;		/* Number of queued kevents. */
+
+	struct list_head	ready_list;		/* List of ready kevents. */
+	unsigned int		ready_num;		/* Number of ready kevents. */
+	spinlock_t 		ready_lock;		/* Protects all manipulations with ready queue. */
+
+	unsigned int		max_ready_num;		/* Requested number of kevents. */
+
+	struct semaphore	ctl_mutex;		/* Protects against simultaneous kevent_user control manipulations. */
+	struct semaphore	wait_mutex;		/* Protects against simultaneous kevent_user waits. */
+	wait_queue_head_t	wait;			/* Wait until some events are ready. */
+
+	atomic_t		refcnt;			/* Reference counter, increased for each new kevent. */
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+#define KEVENT_MAX_REQUESTS		PAGE_SIZE/sizeof(struct kevent)
+
+struct kevent *kevent_alloc(gfp_t mask);
+void kevent_free(struct kevent *k);
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+
+#define list_for_each_entry_reverse_safe(pos, n, head, member)		\
+	for (pos = list_entry((head)->prev, typeof(*pos), member),	\
+		n = list_entry(pos->member.prev, typeof(*pos), member);	\
+	     prefetch(pos->member.prev), &pos->member != (head); 	\
+	     pos = n, n = list_entry(pos->member.prev, typeof(*pos), member))
+
+int kevent_break(struct kevent *k);
+int kevent_init(struct kevent *k);
+
+int kevent_init_socket(struct kevent *k);
+int kevent_init_inode(struct kevent *k);
+int kevent_init_timer(struct kevent *k);
+int kevent_init_poll(struct kevent *k);
+int kevent_init_naio(struct kevent *k);
+int kevent_init_aio(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_INODE
+void kevent_inode_notify(struct inode *inode, u32 event);
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
+void kevent_inode_remove(struct inode *inode);
+#else
+static inline void kevent_inode_notify(struct inode *inode, u32 event)
+{
+}
+static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+}
+static inline void kevent_inode_remove(struct inode *inode)
+{
+}
+#endif /* CONFIG_KEVENT_INODE */
+#ifdef CONFIG_KEVENT_SOCKET
+
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	0
+#endif
+#endif /* __KERNEL__ */
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..bd891f0
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,12 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	unsigned int		qlen;			/* Number of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 66f8819..ea914c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1269,6 +1269,8 @@ extern struct sk_buff *skb_recv_datagram
 					 int noblock, int *err);
 extern unsigned int    datagram_poll(struct file *file, struct socket *sock,
 				     struct poll_table_struct *wait);
+extern int	       skb_copy_datagram(const struct sk_buff *from, 
+					 int offset, void *dst, int size);
 extern int	       skb_copy_datagram_iovec(const struct sk_buff *from,
 					       int offset, struct iovec *to,
 					       int size);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bd67a44..33d436e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -587,4 +587,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, size_t size, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, void __user *buf);
 #endif
diff --git a/include/net/sock.h b/include/net/sock.h
index d10dfec..7a2bee3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -47,6 +47,7 @@ #include <linux/module.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -386,6 +387,8 @@ enum sock_flags {
 	SOCK_NO_LARGESEND, /* whether to sent large segments or not */
 	SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
 	SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+	SOCK_ASYNC,
+	SOCK_ASYNC_INUSE,
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -445,6 +448,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -472,6 +490,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -543,6 +562,12 @@ struct proto {
 
 	int			(*backlog_rcv) (struct sock *sk, 
 						struct sk_buff *skb);
+	
+	int			(*async_recv) (struct sock *sk, 
+						void *dst, size_t size);
+	int			(*async_send) (struct sock *sk, 
+						struct page **pages, unsigned int poffset, 
+						size_t size);
 
 	/* Keeping track of sk's, looking them up, and port selection methods. */
 	void			(*hash)(struct sock *sk);
@@ -674,21 +699,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5f4eb5c..820cd5a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -364,6 +364,8 @@ extern int			compat_tcp_setsockopt(struc
 					int level, int optname,
 					char __user *optval, int optlen);
 extern void			tcp_set_keepalive(struct sock *sk, int val);
+extern int			tcp_async_recv(struct sock *sk, void *dst, size_t size);
+extern int			tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t size);
 extern int			tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
 					    struct msghdr *msg,
 					    size_t len, int nonblock, 
@@ -857,6 +859,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 			wake_up_interruptible(sk->sk_sleep);
+			kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 			if (!inet_csk_ack_scheduled(sk))
 				inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						          (3 * TCP_RTO_MIN) / 4,
diff --git a/init/Kconfig b/init/Kconfig
index df864a3..6135afc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -185,6 +185,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index f6ef00f..eb057ea 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..88b35af
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,57 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables event queue mechanism.
+	  It can be used as replacement for poll()/select(), AIO callback invocations,
+	  advanced timer notifications and other kernel object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default N
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes total number of kevent, number of kevents which are ready
+	  immediately at insertion time and number of kevents which were removed through
+	  readiness completion. It will be printed each time control kevent descriptor
+	  is closed.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of 
+	  sockets operations, like new packet receiving conditions, ready for accept
+  	  conditions and so on.
+	
+config KEVENT_INODE
+	bool "Kernel event notifications for inodes"
+	depends on KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of 
+	  inode operations, like file creation, removal and so on.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for poll()/select() notifications.
+
+config KEVENT_NAIO
+	bool "Network asynchronous IO"
+	depends on KEVENT && KEVENT_SOCKET
+	help
+	  This option enables kevent based network asynchronous IO subsystem.
+
+config KEVENT_AIO
+	bool "Asynchronous IO"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for AIO operations.
+	  AIO read is currently supported.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..7dcd651
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,7 @@
+obj-y := kevent.o kevent_user.o kevent_init.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o
+obj-$(CONFIG_KEVENT_AIO) += kevent_aio.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..f699a13
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,260 @@
+/*
+ * 	kevent.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+static kmem_cache_t *kevent_cache;
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+
+	if (!k->enqueue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+	
+	return k->enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+	
+	if (!k->dequeue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return k->dequeue(k);
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	int err;
+
+	spin_lock_init(&k->lock);
+	k->kevent_entry.next = LIST_POISON1;
+	k->storage_entry.next = LIST_POISON1;
+	k->ready_entry.next = LIST_POISON1;
+
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+	
+	switch (k->event.type) {
+		case KEVENT_NAIO:
+			err = kevent_init_naio(k);
+			break;
+		case KEVENT_SOCKET:
+			err = kevent_init_socket(k);
+			break;
+		case KEVENT_INODE:
+			err = kevent_init_inode(k);
+			break;
+		case KEVENT_TIMER:
+			err = kevent_init_timer(k);
+			break;
+		case KEVENT_POLL:
+			err = kevent_init_poll(k);
+			break;
+		case KEVENT_AIO:
+			err = kevent_init_aio(k);
+			break;
+		default:
+			err = -ENODEV;
+	}
+
+	return err;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail(&k->storage_entry, &st->list);
+	st->qlen++;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue. 
+ * It does not decrease origin's reference counter in any way 
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->storage_entry.next != LIST_POISON1) {
+		list_del(&k->storage_entry);
+		st->qlen--;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int err, rem = 0;
+	unsigned long flags;
+
+	err = k->callback(k);
+
+	spin_lock_irqsave(&k->lock, flags);
+	if (err > 0) {
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	} else if (err < 0) {
+		k->event.ret_flags |= KEVENT_RET_BROKEN;
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	}
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	if (!err)
+		err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	spin_unlock_irqrestore(&k->lock, flags);
+
+	if (err) {
+		if (rem) {
+			list_del(&k->storage_entry);
+			k->st->qlen--;
+		}
+		
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (k->ready_entry.next == LIST_POISON1) {
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+	
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k, *n;
+
+	spin_lock(&st->lock);
+	list_for_each_entry_safe(k, n, &st->list, storage_entry) {
+		if (ready_callback)
+			ready_callback(k);
+
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	}
+	spin_unlock(&st->lock);
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	st->qlen = 0;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
+
+struct kevent *kevent_alloc(gfp_t mask)
+{
+	struct kevent *k;
+	
+	if (kevent_cache)
+		k = kmem_cache_alloc(kevent_cache, mask);
+	else
+		k = kzalloc(sizeof(struct kevent), mask);
+
+	return k;
+}
+
+void kevent_free(struct kevent *k)
+{
+	memset(k, 0xab, sizeof(struct kevent));
+
+	if (kevent_cache)
+		kmem_cache_free(kevent_cache, k);
+	else
+		kfree(k);
+}
+
+int __init kevent_sys_init(void)
+{
+	int err = 0;
+
+	kevent_cache = kmem_cache_create("kevent_cache", 
+			sizeof(struct kevent), 0, 0, NULL, NULL);
+	if (!kevent_cache)
+		err = -ENOMEM;
+	
+	return err;
+}
+
+late_initcall(kevent_sys_init);

diff --git a/kernel/kevent/kevent_init.c b/kernel/kevent/kevent_init.c
new file mode 100644
index 0000000..ec95114
--- /dev/null
+++ b/kernel/kevent/kevent_init.c
@@ -0,0 +1,85 @@
+/*
+ * 	kevent_init.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/kevent.h>
+
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->lock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->lock, flags);
+	return 0;
+}
+
+#ifndef CONFIG_KEVENT_SOCKET
+int kevent_init_socket(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_INODE
+int kevent_init_inode(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_TIMER
+int kevent_init_timer(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_POLL
+int kevent_init_poll(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_NAIO
+int kevent_init_naio(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_AIO
+int kevent_init_aio(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif

diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..2f71fe4
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,728 @@
+/*
+ * 	kevent_user.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+
+static struct class *kevent_user_class;
+static char kevent_name[] = "kevent";
+static int kevent_user_major;
+
+static int kevent_user_open(struct inode *, struct file *);
+static int kevent_user_release(struct inode *, struct file *);
+static int kevent_user_ioctl(struct inode *, struct file *, 
+		unsigned int, unsigned long);
+static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
+
+static struct file_operations kevent_user_fops = {
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.ioctl		= kevent_user_ioctl,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct super_block *kevent_get_sb(struct file_system_type *fs_type, 
+		int flags, const char *dev_name, void *data)
+{
+	/* So original magic... */
+	return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef);	
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= kevent_name,
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+	
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static struct kevent_user *kevent_user_alloc(void)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	u->ready_num = 0;
+#ifdef CONFIG_KEVENT_USER_STAT
+	u->wait_num = u->im_num = u->total = 0;
+#endif
+	for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
+		INIT_LIST_HEAD(&u->kqueue[i].kevent_list);
+		spin_lock_init(&u->kqueue[i].kevent_lock);
+	}
+	u->kevent_num = 0;
+	
+	init_MUTEX(&u->ctl_mutex);
+	init_MUTEX(&u->wait_mutex);
+	init_waitqueue_head(&u->wait);
+	u->max_ready_num = 0;
+
+	atomic_set(&u->refcnt, 1);
+
+	return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = kevent_user_alloc();
+	
+	if (!u)
+		return -ENOMEM;
+
+	file->private_data = u;
+	
+	return 0;
+}
+
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+#ifdef CONFIG_KEVENT_USER_STAT
+		printk("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", 
+				__func__, u, u->wait_num, u->im_num, u->total);
+#endif
+		kfree(u);
+	}
+}
+
+#if 0
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
+	
+	h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
+	h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
+
+	return h;
+}
+#else
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+#endif
+
+/*
+ * Remove kevent from user's list of all events, 
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int lock, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (lock) {
+		unsigned int hash = kevent_user_hash(&k->event);
+		struct kevent_list *l = &u->kqueue[hash];
+
+		spin_lock_irqsave(&l->kevent_lock, flags);
+		list_del(&k->kevent_entry);
+		u->kevent_num--;
+		spin_unlock_irqrestore(&l->kevent_lock, flags);
+	} else {
+		list_del(&k->kevent_entry);
+		u->kevent_num--;
+	}
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->ready_entry.next != LIST_POISON1) {
+		list_del(&k->ready_entry);
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+	
+	kevent_user_put(u);
+	kevent_free(k);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *__kqueue_dequeue_one_ready(struct list_head *q, 
+		unsigned int *qlen)
+{
+	struct kevent *k = NULL;
+	unsigned int len = *qlen;
+	
+	if (len && !list_empty(q)) {
+		k = list_entry(q->next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		*qlen = len - 1;
+	}
+	
+	return k;
+}
+
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	k = __kqueue_dequeue_one_ready(&u->ready_list, &u->ready_num);
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+static struct kevent *__kevent_search(struct kevent_list *l, struct ukevent *uk, 
+		struct kevent_user *u)
+{
+	struct kevent *k;
+	int found = 0;
+	
+	list_for_each_entry(k, &l->kevent_list, kevent_entry) {
+		spin_lock(&k->lock);
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] && 
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			found = 1;
+			spin_unlock(&k->lock);
+			break;
+		}
+		spin_unlock(&k->lock);
+	}
+
+	return (found)?k:NULL;
+}
+
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	struct kevent_list *l = &u->kqueue[hash];
+	int err = -ENODEV;
+	unsigned long flags;
+	
+	spin_lock_irqsave(&l->kevent_lock, flags);
+	k = __kevent_search(l, uk, u);
+	if (k) {
+		spin_lock(&k->lock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->lock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&l->kevent_lock, flags);
+	
+	return err;
+}
+
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	struct kevent_list *l = &u->kqueue[hash];
+	unsigned long flags;
+
+	spin_lock_irqsave(&l->kevent_lock, flags);
+	k = __kevent_search(l, uk, u);
+	if (k) {
+		kevent_finish_user(k, 0, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&l->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * No new entry can be added or removed from any list at this point.
+ * It is not permitted to call ->ioctl() and ->release() in parallel.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
+		struct kevent_list *l = &u->kqueue[i];
+		
+		list_for_each_entry_safe(k, n, &l->kevent_list, kevent_entry)
+			kevent_finish_user(k, 1, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+static int kevent_user_ctl_modify(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+static int kevent_user_ctl_remove(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kevent_alloc(GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kevent_free(k);
+		goto err_out_exit;
+	}
+	k->user = u;
+#ifdef CONFIG_KEVENT_USER_STAT
+	u->total++;
+#endif
+	{
+		unsigned long flags;
+		unsigned int hash = kevent_user_hash(&k->event);
+		struct kevent_list *l = &u->kqueue[hash];
+		
+		spin_lock_irqsave(&l->kevent_lock, flags);
+		list_add_tail(&k->kevent_entry, &l->kevent_list);
+		u->kevent_num++;
+		kevent_user_get(u);
+		spin_unlock_irqrestore(&l->kevent_lock, flags);
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		if (err < 0)
+			uk->ret_flags |= KEVENT_RET_BROKEN;
+		uk->ret_flags |= KEVENT_RET_DONE;
+		kevent_finish_user(k, 1, 0);
+	} 
+
+err_out_exit:
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one 
+ * and add them into appropriate kevent_storages, 
+ * e.g. sockets, inodes and so on...
+ * If something goes wrong, all events will be dequeued and 
+ * negative error will be returned. 
+ * On success zero is returned and 
+ * ctl->num will be a number of finished events, either completed or failed. 
+ * Array of finished events (struct ukevent) will be placed behind 
+ * kevent_user_control structure. User must run through that array and check 
+ * ret_flags field of each ukevent structure to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, cerr = 0, num = 0, knum = 0, i;
+	void __user *orig, *ctl_addr;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	orig = arg;
+	ctl_addr = arg - sizeof(struct kevent_user_control);
+#if 1
+	err = -ENFILE;
+	if (u->kevent_num + ctl->num >= 1024)
+		goto err_out_remove;
+#endif
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EINVAL;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+#ifdef CONFIG_KEVENT_USER_STAT
+			u->im_num++;
+#endif
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent)))
+				cerr = -EINVAL;
+			orig += sizeof(struct ukevent);
+			num++;
+		} else
+			knum++;
+	}
+
+	if (cerr < 0)
+		goto err_out_remove;
+
+	ctl->num = num;
+	if (copy_to_user(ctl_addr, ctl, sizeof(struct kevent_user_control)))
+		cerr = -EINVAL;
+
+	if (cerr)
+		err = cerr;
+	if (!err)
+		err = num;
+
+err_out_remove:
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Waits until at least ctl->ready_num events are ready or timeout and returns 
+ * number of ready events (in case of timeout) or number of requested events.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	struct kevent *k;
+	int cerr = 0, num = 0;
+	void __user *ptr = arg + sizeof(struct kevent_user_control);
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		if (ctl->timeout)
+			wait_event_interruptible_timeout(u->wait, 
+				u->ready_num >= ctl->num, msecs_to_jiffies(ctl->timeout));
+		else
+			wait_event_interruptible_timeout(u->wait, 
+					u->ready_num > 0, msecs_to_jiffies(1000));
+	}
+	while (num < ctl->num && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(ptr + num*sizeof(struct ukevent), 
+					&k->event, sizeof(struct ukevent)))
+			cerr = -EINVAL;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1, 1);
+		++num;
+#ifdef CONFIG_KEVENT_USER_STAT
+		u->wait_num++;
+#endif
+	}
+
+	ctl->num = num;
+	if (copy_to_user(arg, ctl, sizeof(struct kevent_user_control)))
+		cerr = -EINVAL;
+
+	up(&u->ctl_mutex);
+
+	return (cerr)?cerr:num;
+}
+
+static int kevent_ctl_init(void)
+{
+	struct kevent_user *u;
+	struct file *file;
+	int fd, ret;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+
+	file = get_empty_filp();
+	if (!file) {
+		ret = -ENFILE;
+		goto out_put_fd;
+	}
+
+	u = kevent_user_alloc();
+	if (unlikely(!u)) {
+		ret = -ENOMEM;
+		goto out_put_file;
+	}
+
+	file->f_op = &kevent_user_fops;
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dget(kevent_mnt->mnt_root);
+	file->f_mapping = file->f_dentry->d_inode->i_mapping;
+	file->f_mode = FMODE_READ;
+	file->f_flags = O_RDONLY;
+	file->private_data = u;
+	
+	fd_install(fd, file);
+
+	return fd;
+
+out_put_file:
+	put_filp(file);
+out_put_fd:
+	put_unused_fd(fd);
+	return ret;
+}
+
+static int kevent_ctl_process(struct file *file, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u)
+		return -EINVAL;
+
+	switch (ctl->cmd) {
+		case KEVENT_CTL_ADD:
+			err = kevent_user_ctl_add(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_REMOVE:
+			err = kevent_user_ctl_remove(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_MODIFY:
+			err = kevent_user_ctl_modify(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_WAIT:
+			err = kevent_user_wait(file, u, ctl, arg);
+			break;
+		case KEVENT_CTL_INIT:
+			err = kevent_ctl_init();
+		default:
+			err = -EINVAL;
+			break;
+	}
+
+	return err;
+}
+
+asmlinkage long sys_kevent_ctl(int fd, void __user *arg)
+{
+	int err, fput_needed;
+	struct kevent_user_control ctl;
+	struct file *file;
+
+	if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control)))
+		return -EINVAL;
+
+	if (ctl.cmd == KEVENT_CTL_INIT)
+		return kevent_ctl_init();
+
+	file = fget_light(fd, &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = kevent_ctl_process(file, &ctl, arg);
+
+	fput_light(file, fput_needed);
+	return err;
+}
+
+static int kevent_user_ioctl(struct inode *inode, struct file *file, 
+		unsigned int cmd, unsigned long arg)
+{
+	int err = -ENODEV;
+	struct kevent_user_control ctl;
+	struct kevent_user *u = file->private_data;
+	void __user *ptr = (void __user *)arg;
+
+	if (copy_from_user(&ctl, ptr, sizeof(struct kevent_user_control)))
+		return -EINVAL;
+
+	switch (cmd) {
+		case KEVENT_USER_CTL:
+			err = kevent_ctl_process(file, &ctl, ptr);
+			break;
+		case KEVENT_USER_WAIT:
+			err = kevent_user_wait(file, u, &ctl, ptr);
+			break;
+		default:
+			break;
+	}
+
+	return err;
+}
+
+static int __devinit kevent_user_init(void)
+{
+	struct class_device *dev;
+	int err = 0;
+	
+	err = register_filesystem(&kevent_fs_type);
+	if (err)
+		panic("%s: failed to register filesystem: err=%d.\n",
+			       kevent_name, err);
+
+	kevent_mnt = kern_mount(&kevent_fs_type);
+	if (IS_ERR(kevent_mnt))
+		panic("%s: failed to mount silesystem: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_mnt));
+	
+	kevent_user_major = register_chrdev(0, kevent_name, &kevent_user_fops);
+	if (kevent_user_major < 0) {
+		printk(KERN_ERR "Failed to register \"%s\" char device: err=%d.\n", 
+				kevent_name, kevent_user_major);
+		return -ENODEV;
+	}
+
+	kevent_user_class = class_create(THIS_MODULE, "kevent");
+	if (IS_ERR(kevent_user_class)) {
+		printk(KERN_ERR "Failed to register \"%s\" class: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_user_class));
+		err = PTR_ERR(kevent_user_class);
+		goto err_out_unregister;
+	}
+
+	dev = class_device_create(kevent_user_class, NULL, 
+			MKDEV(kevent_user_major, 0), NULL, kevent_name);
+	if (IS_ERR(dev)) {
+		printk(KERN_ERR "Failed to create %d.%d class device in \"%s\" class: err=%ld.\n", 
+				kevent_user_major, 0, kevent_name, PTR_ERR(dev));
+		err = PTR_ERR(dev);
+		goto err_out_class_destroy;
+	}
+
+	printk("KEVENT subsystem: chardev helper: major=%d.\n", kevent_user_major);
+
+	return 0;
+
+err_out_class_destroy:
+	class_destroy(kevent_user_class);
+err_out_unregister:
+	unregister_chrdev(kevent_user_major, kevent_name);
+
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	class_device_destroy(kevent_user_class, MKDEV(kevent_user_major, 0));
+	class_destroy(kevent_user_class);
+	unregister_chrdev(kevent_user_major, kevent_name);
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5433195..dcbacf5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -121,6 +121,11 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_aio_recv);
+cond_syscall(sys_aio_send);
+cond_syscall(sys_aio_sendfile);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index aecddcc..493245b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -236,6 +236,60 @@ void skb_kill_datagram(struct sock *sk, 
 EXPORT_SYMBOL(skb_kill_datagram);
 
 /**
+ *	skb_copy_datagram - Copy a datagram.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying from
+ *	@to: pointer to copy to
+ *	@len: amount of data to copy from buffer to iovec
+ */
+int skb_copy_datagram(const struct sk_buff *skb, int offset,
+			    void *to, int len)
+{
+	int i, fraglen, end = 0;
+	struct sk_buff *next = skb_shinfo(skb)->frag_list;
+
+	if (!len)
+		return 0;
+
+next_skb:
+	fraglen = skb_headlen(skb);
+	i = -1;
+
+	while (1) {
+		int start = end;
+
+		if ((end += fraglen) > offset) {
+			int copy = end - offset, o = offset - start;
+
+			if (copy > len)
+				copy = len;
+			if (i == -1)
+				memcpy(to, skb->data + o, copy);
+			else {
+				skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+				struct page *page = frag->page;
+				void *p = kmap(page) + frag->page_offset + o;
+				memcpy(to, p, copy);
+				kunmap(page);
+			}
+			if (!(len -= copy))
+				return 0;
+			offset += copy;
+		}
+		if (++i >= skb_shinfo(skb)->nr_frags)
+			break;
+		fraglen = skb_shinfo(skb)->frags[i].size;
+	}
+	if (next) {
+		skb = next;
+		BUG_ON(skb_shinfo(skb)->frag_list);
+		next = skb->next;
+		goto next_skb;
+	}
+	return -EFAULT;
+}
+
+/**
  *	skb_copy_datagram_iovec - Copy a datagram to an iovec.
  *	@skb: buffer to copy
  *	@offset: offset in the buffer to start copying from
@@ -530,6 +584,7 @@ unsigned int datagram_poll(struct file *
 
 EXPORT_SYMBOL(datagram_poll);
 EXPORT_SYMBOL(skb_copy_and_csum_datagram_iovec);
+EXPORT_SYMBOL(skb_copy_datagram);
 EXPORT_SYMBOL(skb_copy_datagram_iovec);
 EXPORT_SYMBOL(skb_free_datagram);
 EXPORT_SYMBOL(skb_recv_datagram);
diff --git a/net/core/sock.c b/net/core/sock.c
index 5d820c3..3345048 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -564,6 +564,16 @@ #endif
 			spin_unlock_bh(&sk->sk_lock.slock);
 			ret = -ENONET;
 			break;
+#ifdef CONFIG_KEVENT_SOCKET
+		case SO_ASYNC_SOCK:
+			spin_lock_bh(&sk->sk_lock.slock);
+			if (valbool)
+				sock_set_flag(sk, SOCK_ASYNC);
+			else
+				sock_reset_flag(sk, SOCK_ASYNC);
+			spin_unlock_bh(&sk->sk_lock.slock);
+			break;
+#endif
 
 		/* We implement the SO_SNDLOWAT etc to
 		   not be settable (1003.1g 5.3) */
@@ -1313,6 +1323,7 @@ static void sock_def_wakeup(struct sock 
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible_all(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_error_report(struct sock *sk)
@@ -1322,6 +1333,7 @@ static void sock_def_error_report(struct
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,0,POLL_ERR); 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_readable(struct sock *sk, int len)
@@ -1331,6 +1343,7 @@ static void sock_def_readable(struct soc
 		wake_up_interruptible(sk->sk_sleep);
 	sk_wake_async(sk,1,POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
 }
 
 static void sock_def_write_space(struct sock *sk)
@@ -1350,6 +1363,7 @@ static void sock_def_write_space(struct 
 	}
 
 	read_unlock(&sk->sk_callback_lock);
+	kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 }
 
 static void sock_def_destruct(struct sock *sk)
@@ -1454,8 +1468,10 @@ void fastcall release_sock(struct sock *
 	if (sk->sk_backlog.tail)
 		__release_sock(sk);
 	sk->sk_lock.owner = NULL;
-        if (waitqueue_active(&(sk->sk_lock.wq)))
+        if (waitqueue_active(&(sk->sk_lock.wq))) {
 		wake_up(&(sk->sk_lock.wq));
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+	}
 	spin_unlock_bh(&(sk->sk_lock.slock));
 }
 EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index e948969..91e2e07 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
 			wake_up_interruptible(sk->sk_sleep);
 		if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
 			sock_wake_async(sock, 2, POLL_OUT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
 	}
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 74998f2..403d33e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -206,6 +206,7 @@
  *					lingertime == 0 (RFC 793 ABORT Call)
  *	Hirokazu Takahashi	:	Use copy_from_user() instead of
  *					csum_and_copy_from_user() if possible.
+ *	Evgeniy Polyakov	:	Network asynchronous IO.
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@@ -1085,6 +1086,275 @@ int tcp_read_sock(struct sock *sk, read_
 }
 
 /*
+ * Must be called with locked sock.
+ */
+int tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t len)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int mss_now, size_goal;
+	int err = -EAGAIN;
+	ssize_t copied;
+
+	/* Wait for a connection to finish. */
+	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
+		goto out_err;
+
+	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+	mss_now = tcp_current_mss(sk, 1);
+	size_goal = tp->xmit_size_goal;
+	copied = 0;
+
+	err = -EPIPE;
+	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN) || sock_flag(sk, SOCK_DONE) ||
+			(sk->sk_state == TCP_CLOSE) || (atomic_read(&sk->sk_refcnt) == 1))
+		goto do_error;
+
+	while (len > 0) {
+		struct sk_buff *skb = sk->sk_write_queue.prev;
+		struct page *page = pages[poffset / PAGE_SIZE];
+		int copy, i, can_coalesce;
+		int offset = poffset % PAGE_SIZE;
+		int size = min_t(size_t, len, PAGE_SIZE - offset);
+
+		if (!sk->sk_send_head || (copy = size_goal - skb->len) <= 0) {
+new_segment:
+			if (!sk_stream_memory_free(sk))
+				goto wait_for_sndbuf;
+
+			skb = sk_stream_alloc_pskb(sk, 0, 0,
+						   sk->sk_allocation);
+			if (!skb)
+				goto wait_for_memory;
+
+			skb_entail(sk, tp, skb);
+			copy = size_goal;
+		}
+
+		if (copy > size)
+			copy = size;
+
+		i = skb_shinfo(skb)->nr_frags;
+		can_coalesce = skb_can_coalesce(skb, i, page, offset);
+		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
+			tcp_mark_push(tp, skb);
+			goto new_segment;
+		}
+		if (!sk_stream_wmem_schedule(sk, copy))
+			goto wait_for_memory;
+		
+		if (can_coalesce) {
+			skb_shinfo(skb)->frags[i - 1].size += copy;
+		} else {
+			get_page(page);
+			skb_fill_page_desc(skb, i, page, offset, copy);
+		}
+
+		skb->len += copy;
+		skb->data_len += copy;
+		skb->truesize += copy;
+		sk->sk_wmem_queued += copy;
+		sk->sk_forward_alloc -= copy;
+		skb->ip_summed = CHECKSUM_HW;
+		tp->write_seq += copy;
+		TCP_SKB_CB(skb)->end_seq += copy;
+		skb_shinfo(skb)->tso_segs = 0;
+
+		if (!copied)
+			TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
+
+		copied += copy;
+		poffset += copy;
+		if (!(len -= copy))
+			goto out;
+
+		if (skb->len < mss_now)
+			continue;
+
+		if (forced_push(tp)) {
+			tcp_mark_push(tp, skb);
+			__tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH);
+		} else if (skb == sk->sk_send_head)
+			tcp_push_one(sk, mss_now);
+		continue;
+
+wait_for_sndbuf:
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+wait_for_memory:
+		if (copied)
+			tcp_push(sk, tp, 0, mss_now, TCP_NAGLE_PUSH);
+
+		err = -EAGAIN;
+		goto do_error;
+	}
+
+out:
+	if (copied)
+		tcp_push(sk, tp, 0, mss_now, tp->nonagle);
+	return copied;
+
+do_error:
+	if (copied)
+		goto out;
+out_err:
+	return sk_stream_error(sk, 0, err);
+}
+
+/*
+ * Must be called with locked sock.
+ */
+int tcp_async_recv(struct sock *sk, void *dst, size_t len)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int copied = 0;
+	u32 *seq;
+	unsigned long used;
+	int err;
+	int target;		/* Read at least this many bytes */
+
+	TCP_CHECK_TIMER(sk);
+
+	err = -ENOTCONN;
+	if (sk->sk_state == TCP_LISTEN)
+		goto out;
+
+	seq = &tp->copied_seq;
+
+	target = sock_rcvlowat(sk, 0, len);
+
+	do {
+		struct sk_buff *skb;
+		u32 offset;
+
+		/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
+		if (tp->urg_data && tp->urg_seq == *seq) {
+			if (copied)
+				break;
+		}
+
+		/* Next get a buffer. */
+
+		skb = skb_peek(&sk->sk_receive_queue);
+		do {
+			if (!skb)
+				break;
+
+			/* Now that we have two receive queues this
+			 * shouldn't happen.
+			 */
+			if (before(*seq, TCP_SKB_CB(skb)->seq)) {
+				printk(KERN_INFO "async_recv bug: copied %X "
+				       "seq %X\n", *seq, TCP_SKB_CB(skb)->seq);
+				break;
+			}
+			offset = *seq - TCP_SKB_CB(skb)->seq;
+			if (skb->h.th->syn)
+				offset--;
+			if (offset < skb->len)
+				goto found_ok_skb;
+			if (skb->h.th->fin)
+				goto found_fin_ok;
+			skb = skb->next;
+		} while (skb != (struct sk_buff *)&sk->sk_receive_queue);
+
+		if (copied)
+			break;
+
+		if (sock_flag(sk, SOCK_DONE))
+			break;
+
+		if (sk->sk_err) {
+			copied = sock_error(sk);
+			break;
+		}
+
+		if (sk->sk_shutdown & RCV_SHUTDOWN)
+			break;
+
+		if (sk->sk_state == TCP_CLOSE) {
+			if (!sock_flag(sk, SOCK_DONE)) {
+				/* This occurs when user tries to read
+				 * from never connected socket.
+				 */
+				copied = -ENOTCONN;
+				break;
+			}
+			break;
+		}
+
+		copied = -EAGAIN;
+		break;
+
+	found_ok_skb:
+		/* Ok so how much can we use? */
+		used = skb->len - offset;
+		if (len < used)
+			used = len;
+
+		/* Do we have urgent data here? */
+		if (tp->urg_data) {
+			u32 urg_offset = tp->urg_seq - *seq;
+			if (urg_offset < used) {
+				if (!urg_offset) {
+					if (!sock_flag(sk, SOCK_URGINLINE)) {
+						++*seq;
+						offset++;
+						used--;
+						if (!used)
+							goto skip_copy;
+					}
+				} else
+					used = urg_offset;
+			}
+		}
+
+		err = skb_copy_datagram(skb, offset, dst, used);
+		if (err) {
+			/* Exception. Bailout! */
+			if (!copied)
+				copied = -EFAULT;
+			break;
+		}
+
+		*seq += used;
+		copied += used;
+		len -= used;
+		dst += used;
+
+		tcp_rcv_space_adjust(sk);
+
+skip_copy:
+		if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+			tp->urg_data = 0;
+			tcp_fast_path_check(sk, tp);
+		}
+		if (used + offset < skb->len)
+			continue;
+
+		if (skb->h.th->fin)
+			goto found_fin_ok;
+		sk_eat_skb(sk, skb);
+		continue;
+
+	found_fin_ok:
+		/* Process the FIN. */
+		++*seq;
+		sk_eat_skb(sk, skb);
+		break;
+	} while (len > 0);
+
+	/* Clean up data we have read: This will do ACK frames. */
+	cleanup_rbuf(sk, copied);
+
+	TCP_CHECK_TIMER(sk);
+	return copied;
+
+out:
+	TCP_CHECK_TIMER(sk);
+	return err;
+}
+
+/*
  *	This routine copies from a sock struct into the user buffer.
  *
  *	Technical note: in 2.3 we work on _locked_ socket, so that
@@ -2259,6 +2529,8 @@ EXPORT_SYMBOL(tcp_getsockopt);
 EXPORT_SYMBOL(tcp_ioctl);
 EXPORT_SYMBOL(tcp_poll);
 EXPORT_SYMBOL(tcp_read_sock);
+EXPORT_SYMBOL(tcp_async_recv);
+EXPORT_SYMBOL(tcp_async_send);
 EXPORT_SYMBOL(tcp_recvmsg);
 EXPORT_SYMBOL(tcp_sendmsg);
 EXPORT_SYMBOL(tcp_sendpage);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e08245b..5655b1e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3113,6 +3113,7 @@ static void tcp_ofo_queue(struct sock *s
 
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if(skb->h.th->fin)
 			tcp_fin(skb, sk, skb->h.th);
@@ -3956,7 +3957,8 @@ int tcp_rcv_established(struct sock *sk,
 			int copied_early = 0;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len) {
+			    len - tcp_header_len <= tp->ucopy.len &&
+			    !sock_async(sk)) {
 #ifdef CONFIG_NET_DMA
 				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
 					copied_early = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 25ecc6e..05d7086 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -62,6 +62,7 @@ #include <linux/cache.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
 #include <linux/times.h>
+#include <linux/kevent.h>
 
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -850,6 +851,7 @@ #endif
 	   	reqsk_free(req);
 	} else {
 		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
 	}
 	return 0;
 
@@ -1089,24 +1091,30 @@ process:
 
 	skb->dev = NULL;
 
-	bh_lock_sock(sk);
 	ret = 0;
-	if (!sock_owned_by_user(sk)) {
+	if (sock_async(sk)) {
+		spin_lock_bh(&sk->sk_lock.slock);
+		ret = tcp_v4_do_rcv(sk, skb);
+		spin_unlock_bh(&sk->sk_lock.slock);
+	} else {
+		bh_lock_sock(sk);
+		if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
-		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = get_softnet_dma();
-		if (tp->ucopy.dma_chan)
-			ret = tcp_v4_do_rcv(sk, skb);
-		else
+			struct tcp_sock *tp = tcp_sk(sk);
+			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+				tp->ucopy.dma_chan = get_softnet_dma();
+			if (tp->ucopy.dma_chan)
+				ret = tcp_v4_do_rcv(sk, skb);
+			else
 #endif
-		{
-			if (!tcp_prequeue(sk, skb))
-			ret = tcp_v4_do_rcv(sk, skb);
-		}
-	} else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+			{
+				if (!tcp_prequeue(sk, skb))
+				ret = tcp_v4_do_rcv(sk, skb);
+			}
+		} else
+			sk_add_backlog(sk, skb);
+		bh_unlock_sock(sk);
+	}
 
 	sock_put(sk);
 
@@ -1830,6 +1838,8 @@ struct proto tcp_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
+	.async_recv		= tcp_async_recv,
+	.async_send		= tcp_async_send,
 	.backlog_rcv		= tcp_v4_do_rcv,
 	.hash			= tcp_v4_hash,
 	.unhash			= tcp_unhash,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a50eb30..e27e231 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1215,22 +1215,28 @@ process:
 
 	skb->dev = NULL;
 
-	bh_lock_sock(sk);
 	ret = 0;
-	if (!sock_owned_by_user(sk)) {
+	if (sock_async(sk)) {
+		spin_lock_bh(&sk->sk_lock.slock);
+		ret = tcp_v4_do_rcv(sk, skb);
+		spin_unlock_bh(&sk->sk_lock.slock);
+	} else {
+		bh_lock_sock(sk);
+		if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
-                struct tcp_sock *tp = tcp_sk(sk);
-                if (tp->ucopy.dma_chan)
-                        ret = tcp_v6_do_rcv(sk, skb);
-                else
-#endif
-		{
-			if (!tcp_prequeue(sk, skb))
+			struct tcp_sock *tp = tcp_sk(sk);
+			if (tp->ucopy.dma_chan)
 				ret = tcp_v6_do_rcv(sk, skb);
-		}
-	} else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+			else
+#endif
+			{
+				if (!tcp_prequeue(sk, skb))
+					ret = tcp_v6_do_rcv(sk, skb);
+			}
+		} else
+			sk_add_backlog(sk, skb);
+		bh_unlock_sock(sk);
+	}
 
 	sock_put(sk);
 	return ret ? -1 : 0;
@@ -1580,6 +1586,8 @@ struct proto tcpv6_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
+	.async_recv		= tcp_async_recv,
+	.async_send		= tcp_async_send,
 	.backlog_rcv		= tcp_v6_do_rcv,
 	.hash			= tcp_v6_hash,
 	.unhash			= tcp_unhash,


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [2/4] kevent: network AIO, socket notifications.
  2006-07-26  9:18       ` [1/4] kevent: core files Evgeniy Polyakov
@ 2006-07-26  9:18         ` Evgeniy Polyakov
  2006-07-26  9:18           ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26 10:31         ` [1/4] kevent: core files Andrew Morton
  2006-07-26 10:44         ` Evgeniy Polyakov
  2 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev


This patchset includes socket notifications and network asynchronous IO.
Network AIO is based on kevent and works as usual kevent storage on top
of inode.

When new socket is created it is associated with inode (to save some space,
since inode already has kevent_storage embedded) and when some activity is 
detected appropriate notifications are generated and kevent_naio_callback() 
is called.

When new kevent is being registered, network AIO ->enqueue() callback
simply marks itself like usual socket event watcher. It also locks
physical userspace pages in memory and stores appropriate pointers in
private kevent structure. I have not created additional DMA memory
allocation methods, like Ulrich described in his article, so I handle it
inside NAIO which has some overhead (I posted get_user_pages()
sclability graph some time ago). New set of syscalls to allocate DMAable
memory is in TODO.

Network AIO callback gets pointers to userspace pages and tries to copy
data from receiving skb queue into them using protocol specific
callback. This callback is very similar to ->recvmsg(), so they could
share a lot in future (as far as I recall it worked only with hardware
capable to do checksumming, I'm a bit lazy, it is in TODO)

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>


diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..c230aaa
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,125 @@
+/*
+ * 	kevent_socket.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct sock *sk = SOCKET_I(inode)->sk;
+	int rmem;
+	
+	if (k->event.event & KEVENT_SOCKET_RECV) {
+		int ret = 0;
+		
+		if ((rmem = atomic_read(&sk->sk_rmem_alloc)) > 0 || 
+				!skb_queue_empty(&sk->sk_receive_queue))
+			ret = 1;
+		if (sk->sk_shutdown & RCV_SHUTDOWN)
+			ret = 1;
+		if (ret)
+			return ret;
+	}
+	if ((k->event.event & KEVENT_SOCKET_ACCEPT) && 
+		(!reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) || 
+		 	reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue))) {
+		k->event.ret_data[1] = reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+		return 1;
+	}
+
+	return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+	struct file *file;
+	struct inode *inode;
+	int err, fput_needed;
+
+	file = fget_light(k->event.id.raw[0], &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode)
+		goto err_out_fput;
+
+	inode = igrab(file->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	err = k->callback(k);
+	if (err)
+		goto err_out_dequeue;
+
+	fput_light(file, fput_needed);
+	return err;
+
+err_out_dequeue:
+	kevent_storage_dequeue(k->st, k);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+int kevent_init_socket(struct kevent *k)
+{
+	k->enqueue = &kevent_socket_enqueue;
+	k->dequeue = &kevent_socket_dequeue;
+	k->callback = &kevent_socket_callback;
+	return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+	if (sk->sk_socket && !test_and_set_bit(SOCK_ASYNC_INUSE, &sk->sk_flags)) {
+		kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+		sock_reset_flag(sk, SOCK_ASYNC_INUSE);
+	}
+}

diff --git a/kernel/kevent/kevent_naio.c b/kernel/kevent/kevent_naio.c
new file mode 100644
index 0000000..1c71021
--- /dev/null
+++ b/kernel/kevent/kevent_naio.c
@@ -0,0 +1,239 @@
+/*
+ * 	kevent_naio.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/tcp_states.h>
+
+static int kevent_naio_enqueue(struct kevent *k);
+static int kevent_naio_dequeue(struct kevent *k);
+static int kevent_naio_callback(struct kevent *k);
+
+static int kevent_naio_setup_aio(int ctl_fd, int s, void __user *buf, 
+		size_t size, u32 event)
+{
+	struct kevent_user *u;
+	struct file *file;
+	int err, fput_needed;
+	struct ukevent uk;
+
+	file = fget_light(ctl_fd, &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	u = file->private_data;
+	if (!u) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	memset(&uk, 0, sizeof(struct ukevent));
+	uk.type = KEVENT_NAIO;
+	uk.ptr = buf;
+	uk.req_flags = KEVENT_REQ_ONESHOT;
+	uk.event = event;
+	uk.id.raw[0] = s;
+	uk.id.raw[1] = size;
+
+	err = kevent_user_add_ukevent(&uk, u);
+
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, 
+		size_t size, unsigned flags)
+{
+	return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_RECV);
+}
+
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, 
+		size_t size, unsigned flags)
+{
+	return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_SEND);
+}
+
+static int kevent_naio_enqueue(struct kevent *k)
+{
+	int err, i;
+	struct page **page;
+	void *addr;
+	unsigned int size = k->event.id.raw[1];
+	int num = size/PAGE_SIZE;
+	struct file *file;
+	struct sock *sk = NULL;
+	int fput_needed;
+
+	file = fget_light(k->event.id.raw[0], &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode)
+		goto err_out_fput;
+
+	sk = SOCKET_I(file->f_dentry->d_inode)->sk;
+
+	err = -ESOCKTNOSUPPORT;
+	if (!sk || !sk->sk_prot->async_recv || !sk->sk_prot->async_send || 
+		!sock_flag(sk, SOCK_ASYNC))
+		goto err_out_fput;
+	
+	addr = k->event.ptr;
+	if (((unsigned long)addr & PAGE_MASK) != (unsigned long)addr)
+		num++;
+
+	page = kmalloc(sizeof(struct page *) * num, GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	down_read(&current->mm->mmap_sem);
+	err = get_user_pages(current, current->mm, (unsigned long)addr, 
+			num, 1, 0, page, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (err <= 0)
+		goto err_out_free;
+	num = err;
+
+	k->event.ret_data[0] = num;
+	k->event.ret_data[1] = offset_in_page(k->event.ptr);
+	k->priv = page;
+
+	sk->sk_allocation = GFP_ATOMIC;
+
+	spin_lock_bh(&sk->sk_lock.slock);
+	err = kevent_socket_enqueue(k);
+	spin_unlock_bh(&sk->sk_lock.slock);
+	if (err)
+		goto err_out_put_pages;
+
+	fput_light(file, fput_needed);
+
+	return err;
+
+err_out_put_pages:
+	for (i=0; i<num; ++i)
+		page_cache_release(page[i]);
+err_out_free:
+	kfree(page);
+err_out_fput:
+	fput_light(file, fput_needed);
+
+	return err;
+}
+
+static int kevent_naio_dequeue(struct kevent *k)
+{
+	int err, i, num;
+	struct page **page = k->priv;
+
+	num = k->event.ret_data[0];
+
+	err = kevent_socket_dequeue(k);
+
+	for (i=0; i<num; ++i)
+		page_cache_release(page[i]);
+
+	kfree(k->priv);
+	k->priv = NULL;
+
+	return err;
+}
+
+static int kevent_naio_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct sock *sk = SOCKET_I(inode)->sk;
+	unsigned int size = k->event.id.raw[1];
+	unsigned int off = k->event.ret_data[1];
+	struct page **pages = k->priv, *page;
+	int ready = 0, num = off/PAGE_SIZE, err = 0, send = 0;
+	void *ptr, *optr;
+	unsigned int len;
+
+	if (!sock_flag(sk, SOCK_ASYNC))
+		return -1;
+
+	if (k->event.event & KEVENT_SOCKET_SEND)
+		send = 1;
+	else if (!(k->event.event & KEVENT_SOCKET_RECV))
+		return -EINVAL;
+
+	/*
+	 * sk_prot->async_*() can return either number of bytes processed,
+	 * or negative error value, or zero if socket is closed.
+	 */
+
+	if (!send) {
+		page = pages[num];
+
+		optr = ptr = kmap_atomic(page, KM_IRQ0);
+		if (!ptr)
+			return -ENOMEM;
+
+		ptr += off % PAGE_SIZE;
+		len = min_t(unsigned int, PAGE_SIZE - (ptr - optr), size);
+
+		err = sk->sk_prot->async_recv(sk, ptr, len);
+
+		kunmap_atomic(optr, KM_IRQ0);
+	} else {
+		len = size;
+		err = sk->sk_prot->async_send(sk, pages, off, size);
+	}
+
+	if (err > 0) {
+		num++;
+		size -= err;
+		off += err;
+	}
+
+	k->event.ret_data[1] = off;
+	k->event.id.raw[1] = size;
+
+	if (err == 0 || (err < 0 && err != -EAGAIN))
+		ready = -1;
+
+	if (!size)
+		ready = 1;
+#if 0
+	printk("%s: sk=%p, k=%p, size=%4u, off=%4u, err=%3d, ready=%1d.\n",
+			__func__, sk, k, size, off, err, ready);
+#endif
+
+	return ready;
+}
+
+int kevent_init_naio(struct kevent *k)
+{
+	k->enqueue = &kevent_naio_enqueue;
+	k->dequeue = &kevent_naio_dequeue;
+	k->callback = &kevent_naio_callback;
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26  9:18         ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
@ 2006-07-26  9:18           ` Evgeniy Polyakov
  2006-07-26  9:18             ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
                               ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev


This patch includes asynchronous propagation of file's data into VFS
cache and aio_sendfile() implementation.
Network aio_sendfile() works lazily - it asynchronously populates pages
into the VFS cache (which can be used for various tricks with adaptive
readahead) and then uses usual ->sendfile() callback.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/bio.c b/fs/bio.c
index 6a0b9ad..a3ee530 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -119,7 +119,7 @@ void bio_free(struct bio *bio, struct bi
 /*
  * default destructor for a bio allocated with bio_alloc_bioset()
  */
-static void bio_fs_destructor(struct bio *bio)
+void bio_fs_destructor(struct bio *bio)
 {
 	bio_free(bio, fs_bio_set);
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 04af9c4..295fce9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -685,6 +685,7 @@ ext2_writepages(struct address_space *ma
 }
 
 struct address_space_operations ext2_aops = {
+	.get_block		= ext2_get_block,
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
 	.writepage		= ext2_writepage,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2edd7ee..e44f5ad 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1700,6 +1700,7 @@ static int ext3_journalled_set_page_dirt
 }
 
 static struct address_space_operations ext3_ordered_aops = {
+	.get_block	= ext3_get_block,
 	.readpage	= ext3_readpage,
 	.readpages	= ext3_readpages,
 	.writepage	= ext3_ordered_writepage,
diff --git a/fs/file_table.c b/fs/file_table.c
index bcea199..8759479 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -113,6 +113,9 @@ struct file *get_empty_filp(void)
 	if (security_file_alloc(f))
 		goto fail_sec;
 
+#ifdef CONFIG_KEVENT_POLL
+	kevent_storage_init(f, &f->st);
+#endif
 	tsk = current;
 	INIT_LIST_HEAD(&f->f_u.fu_list);
 	atomic_set(&f->f_count, 1);
@@ -160,6 +163,9 @@ void fastcall __fput(struct file *file)
 	might_sleep();
 
 	fsnotify_close(file);
+#ifdef CONFIG_KEVENT_POLL
+	kevent_storage_fini(&file->st);
+#endif
 	/*
 	 * The function eventpoll_release() should be the first called
 	 * in the file cleanup chain.
diff --git a/fs/inode.c b/fs/inode.c
index 3a2446a..0493935 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -166,12 +167,18 @@ #endif
 		}
 		memset(&inode->u, 0, sizeof(inode->u));
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode) 
 {
+#if defined CONFIG_KEVENT_INODE || defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 9857e50..bbbb578 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -2997,6 +2997,7 @@ int reiserfs_setattr(struct dentry *dent
 }
 
 struct address_space_operations reiserfs_address_space_operations = {
+	.get_block = reiserfs_get_block,
 	.writepage = reiserfs_writepage,
 	.readpage = reiserfs_readpage,
 	.readpages = reiserfs_readpages,

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ecc8c2c..248f6a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,9 @@ #include <linux/mutex.h>
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
 #include <asm/byteorder.h>
+#ifdef CONFIG_KEVENT
+#include <linux/kevent_storage.h>
+#endif
 
 struct hd_geometry;
 struct iovec;
@@ -348,6 +351,8 @@ struct address_space;
 struct writeback_control;
 
 struct address_space_operations {
+	int  (*get_block)(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create);
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
 	void (*sync_page)(struct page *);
@@ -526,6 +531,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_INODE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
@@ -659,6 +668,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index cc5dec7..0acc8db 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -15,6 +15,7 @@ #ifdef __KERNEL__
 
 #include <linux/dnotify.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/audit.h>
 
 /*
@@ -79,6 +80,7 @@ static inline void fsnotify_nameremove(s
 		isdir = IN_ISDIR;
 	dnotify_parent(dentry, DN_DELETE);
 	inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
+	kevent_inode_notify_parent(dentry, KEVENT_INODE_REMOVE);
 }
 
 /*
@@ -88,6 +90,7 @@ static inline void fsnotify_inoderemove(
 {
 	inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
 	inotify_inode_is_dead(inode);
+	kevent_inode_remove(inode);
 }
 
 /*
@@ -96,6 +99,7 @@ static inline void fsnotify_inoderemove(
 static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
 {
 	inode_dir_notify(inode, DN_CREATE);
+	kevent_inode_notify(inode, KEVENT_INODE_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
 				  dentry->d_inode);
 	audit_inode_child(dentry->d_name.name, dentry->d_inode, inode->i_ino);
@@ -107,6 +111,7 @@ static inline void fsnotify_create(struc
 static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
 {
 	inode_dir_notify(inode, DN_CREATE);
+	kevent_inode_notify(inode, KEVENT_INODE_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0, 
 				  dentry->d_name.name, dentry->d_inode);
 	audit_inode_child(dentry->d_name.name, dentry->d_inode, inode->i_ino);

diff --git a/kernel/kevent/kevent_inode.c b/kernel/kevent/kevent_inode.c
new file mode 100644
index 0000000..3af0e11
--- /dev/null
+++ b/kernel/kevent/kevent_inode.c
@@ -0,0 +1,110 @@
+/*
+ * 	kevent_inode.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/fs.h>
+
+static int kevent_inode_enqueue(struct kevent *k)
+{
+	struct file *file;
+	struct inode *inode;
+	int err, fput_needed;
+
+	file = fget_light(k->event.id.raw[0], &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode)
+		goto err_out_fput;
+	
+	inode = igrab(file->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	fput_light(file, fput_needed);
+	return 0;
+
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+static int kevent_inode_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+static int kevent_inode_callback(struct kevent *k)
+{
+	return 1;
+}
+
+int kevent_init_inode(struct kevent *k)
+{
+	k->enqueue = &kevent_inode_enqueue;
+	k->dequeue = &kevent_inode_dequeue;
+	k->callback = &kevent_inode_callback;
+	return 0;
+}
+
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+	struct dentry *parent;
+	struct inode *inode;
+	
+	spin_lock(&dentry->d_lock);
+	parent = dentry->d_parent;
+	inode = parent->d_inode;
+
+	dget(parent);
+	spin_unlock(&dentry->d_lock);
+	kevent_inode_notify(inode, KEVENT_INODE_REMOVE);
+	dput(parent);
+}
+	
+void kevent_inode_remove(struct inode *inode)
+{
+	kevent_storage_fini(&inode->st);
+}
+	
+void kevent_inode_notify(struct inode *inode, u32 event)
+{
+	kevent_storage_ready(&inode->st, NULL, event);
+}
diff --git a/kernel/kevent/kevent_aio.c b/kernel/kevent/kevent_aio.c
new file mode 100644
index 0000000..d4132a3
--- /dev/null
+++ b/kernel/kevent/kevent_aio.c
@@ -0,0 +1,580 @@
+/*
+ * 	kevent_aio.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+
+#define KEVENT_AIO_DEBUG
+
+#ifdef KEVENT_AIO_DEBUG
+#define dprintk(f, a...) printk(f, ##a)
+#else
+#define dprintk(f, a...) do {} while (0)
+#endif
+
+struct kevent_aio_private
+{
+	int			pg_num;
+	size_t			size;
+	loff_t			offset;
+	loff_t			processed;
+	atomic_t		bio_page_num;
+	struct completion	bio_complete;
+	struct file		*file, *sock;
+	struct work_struct	work;
+};
+
+static int kevent_aio_dequeue(struct kevent *k);
+static int kevent_aio_enqueue(struct kevent *k);
+static int kevent_aio_callback(struct kevent *k);
+
+extern void bio_fs_destructor(struct bio *bio);
+
+static void kevent_aio_bio_destructor(struct bio *bio)
+{
+	struct kevent *k = bio->bi_private;
+	struct kevent_aio_private *priv = k->priv;
+
+	dprintk("%s: bio=%p, num=%u, k=%p, inode=%p.\n", __func__, bio, bio->bi_vcnt, k, k->st->origin);
+	schedule_work(&priv->work);
+	bio_fs_destructor(bio);
+}
+
+static void kevent_aio_bio_put(struct kevent *k)
+{
+	struct kevent_aio_private *priv = k->priv;
+	
+	if (atomic_dec_and_test(&priv->bio_page_num))
+		complete(&priv->bio_complete);
+}
+
+static int kevent_mpage_end_io_read(struct bio *bio, unsigned int bytes_done, int err)
+{
+	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+	struct kevent *k = bio->bi_private;
+
+	if (bio->bi_size)
+		return 1;
+
+	do {
+		struct page *page = bvec->bv_page;
+
+		if (--bvec >= bio->bi_io_vec)
+			prefetchw(&bvec->bv_page->flags);
+
+		if (uptodate) {
+			SetPageUptodate(page);
+		} else {
+			ClearPageUptodate(page);
+			SetPageError(page);
+		}
+
+		unlock_page(page);
+		kevent_aio_bio_put(k);
+	} while (bvec >= bio->bi_io_vec);
+
+	bio_put(bio);
+	return 0;
+}
+
+static inline struct bio *kevent_mpage_bio_submit(int rw, struct bio *bio)
+{
+	if (bio) {
+		bio->bi_end_io = kevent_mpage_end_io_read;
+		dprintk("%s: bio=%p, num=%u.\n", __func__, bio, bio->bi_vcnt);
+		submit_bio(READ, bio);
+	}
+	return NULL;
+}
+
+static struct bio *kevent_mpage_readpage(struct kevent *k, struct bio *bio,
+		struct page *page, unsigned nr_pages, get_block_t get_block, 
+		loff_t *offset, sector_t *last_block_in_bio)
+{
+	struct inode *inode = k->st->origin;
+	const unsigned blkbits = inode->i_blkbits;
+	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocksize = 1 << blkbits;
+	sector_t block_in_file;
+	sector_t last_block;
+	struct block_device *bdev = NULL;
+	unsigned first_hole = blocks_per_page;
+	unsigned page_block;
+	sector_t blocks[MAX_BUF_PER_PAGE];
+	struct buffer_head bh;
+	int fully_mapped = 1, length;
+
+	block_in_file = (*offset + blocksize - 1) >> blkbits;
+	last_block = (i_size_read(inode) + blocksize - 1) >> blkbits;
+
+	bh.b_page = page;
+	for (page_block = 0; page_block < blocks_per_page; page_block++, block_in_file++) {
+		bh.b_state = 0;
+		if (block_in_file < last_block) {
+			if (get_block(inode, block_in_file, &bh, 0))
+				goto confused;
+		}
+
+		if (!buffer_mapped(&bh)) {
+			fully_mapped = 0;
+			if (first_hole == blocks_per_page)
+				first_hole = page_block;
+			continue;
+		}
+
+		/* some filesystems will copy data into the page during
+		 * the get_block call, in which case we don't want to
+		 * read it again.  map_buffer_to_page copies the data
+		 * we just collected from get_block into the page's buffers
+		 * so readpage doesn't have to repeat the get_block call
+		 */
+		if (buffer_uptodate(&bh)) {
+			BUG();
+			//map_buffer_to_page(page, &bh, page_block);
+			goto confused;
+		}
+	
+		if (first_hole != blocks_per_page)
+			goto confused;		/* hole -> non-hole */
+
+		/* Contiguous blocks? */
+		if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
+			goto confused;
+		blocks[page_block] = bh.b_blocknr;
+		bdev = bh.b_bdev;
+	}
+
+	if (!bdev)
+		goto confused;
+
+	if (first_hole != blocks_per_page) {
+		char *kaddr = kmap_atomic(page, KM_USER0);
+		memset(kaddr + (first_hole << blkbits), 0,
+				PAGE_CACHE_SIZE - (first_hole << blkbits));
+		flush_dcache_page(page);
+		kunmap_atomic(kaddr, KM_USER0);
+		if (first_hole == 0) {
+			SetPageUptodate(page);
+			goto out;
+		}
+	} else if (fully_mapped) {
+		SetPageMappedToDisk(page);
+	}
+	
+	/*
+	 * This page will go to BIO.  Do we need to send this BIO off first?
+	 */
+	if (bio && (*last_block_in_bio != blocks[0] - 1))
+		bio = kevent_mpage_bio_submit(READ, bio);
+
+alloc_new:
+	if (bio == NULL) {
+		nr_pages = min_t(unsigned, nr_pages, bio_get_nr_vecs(bdev));
+		bio = bio_alloc(GFP_KERNEL, nr_pages);
+		if (bio == NULL)
+			goto confused;
+
+		bio->bi_destructor = kevent_aio_bio_destructor;
+		bio->bi_bdev = bdev;
+		bio->bi_sector = blocks[0] << (blkbits - 9);
+		bio->bi_private = k;
+	}
+
+	length = first_hole << blkbits;
+	if (bio_add_page(bio, page, length, 0) < length) {
+		bio = kevent_mpage_bio_submit(READ, bio);
+		dprintk("%s: Failed to add a page: nr_pages=%d, length=%d, page=%p.\n", 
+				__func__, nr_pages, length, page);
+		goto alloc_new;
+	}
+	
+	dprintk("%s: bio=%p, b=%d, m=%d, u=%d, nr_pages=%d, offset=%Lu, "
+			"size=%Lu. page_block=%u, page=%p.\n", 
+			__func__, bio, buffer_boundary(&bh), buffer_mapped(&bh), 
+			buffer_uptodate(&bh), nr_pages, *offset, i_size_read(inode), 
+			page_block, page);
+	
+	*offset = *offset + length;
+
+	if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
+		bio = kevent_mpage_bio_submit(READ, bio);
+	else
+		*last_block_in_bio = blocks[blocks_per_page - 1];
+
+out:
+	return bio;
+
+confused:
+	dprintk("%s: confused. bio=%p, nr_pages=%d.\n", __func__, bio, nr_pages);
+	if (bio)
+		bio = kevent_mpage_bio_submit(READ, bio);
+	kevent_aio_bio_put(k);
+	SetPageUptodate(page);
+
+	if (nr_pages == 1) {
+		struct kevent_aio_private *priv = k->priv;
+
+		wait_for_completion(&priv->bio_complete);
+		kevent_storage_ready(k->st, NULL, KEVENT_AIO_BIO);
+		init_completion(&priv->bio_complete);
+		complete(&priv->bio_complete);
+	}
+	goto out;
+}
+
+static int kevent_aio_alloc_cached_page(struct kevent *k, struct page **cached_page)
+{
+	struct kevent_aio_private *priv = k->priv;
+	struct address_space *mapping = priv->file->f_mapping;
+	struct page *page;
+	int err = 0;
+	pgoff_t index = priv->offset >> PAGE_CACHE_SHIFT;
+
+	page = page_cache_alloc_cold(mapping);
+	if (!page) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (err) {
+		if (err == -EEXIST)
+			err = 0;
+		page_cache_release(page);
+		goto out;
+	}
+
+	dprintk("%s: page=%p, offset=%Lu, processed=%Lu, index=%lu, size=%zu.\n",
+			__func__, page, priv->offset, priv->processed, index, priv->size);
+
+	*cached_page = page;
+
+out:
+	return err;
+}
+
+static int kevent_mpage_readpages(struct kevent *k, int first,
+		int (* get_block)(struct inode *inode, sector_t iblock,	
+			struct buffer_head *bh_result, int create))
+{
+	struct bio *bio = NULL;
+	struct kevent_aio_private *priv = k->priv;
+	sector_t last_block_in_bio = 0;
+	int i, err = 0;
+
+	atomic_set(&priv->bio_page_num, priv->pg_num);
+
+	for (i=first; i<priv->pg_num; ++i) {
+		struct page *page;
+		
+		err = kevent_aio_alloc_cached_page(k, &page);
+		if (err)
+			break;
+
+		/*
+		 * If there is no error and page is NULL, this means
+		 * that someone added a page into VFS cache.
+		 * We will not process this page, since it is that who
+		 * added a page must read data from disk.
+		 */
+		if (!page)
+			continue;
+
+		bio = kevent_mpage_readpage(k, bio, page, priv->pg_num - i, 
+				get_block, &priv->offset, &last_block_in_bio);
+	}
+
+	if (bio)
+		bio = kevent_mpage_bio_submit(READ, bio);
+
+	return err;
+}
+
+static size_t kevent_aio_vfs_read_actor(struct kevent *k, struct page *kpage, size_t len)
+{
+	struct kevent_aio_private *priv = k->priv;
+	size_t ret;
+	
+	ret = priv->sock->f_op->sendpage(priv->sock, kpage, 0, len, &priv->sock->f_pos, 1);
+
+	dprintk("%s: k=%p, page=%p, len=%zu, ret=%zd.\n", 
+			__func__, k, kpage, len, ret);
+
+	return ret;
+}
+
+static int kevent_aio_vfs_read(struct kevent *k, 
+		size_t (*actor)(struct kevent *, struct page *, size_t))
+{
+	struct kevent_aio_private *priv = k->priv;
+	struct address_space *mapping;
+	size_t isize, actor_size;
+	int i;
+
+	mapping = priv->file->f_mapping;
+	isize = i_size_read(priv->file->f_dentry->d_inode);
+	
+	dprintk("%s: start: size_left=%zd, offset=%Lu, processed=%Lu, isize=%zu, pg_num=%d.\n", 
+			__func__, priv->size, priv->offset, priv->processed, isize, priv->pg_num);
+
+	for (i=0; i<priv->pg_num && priv->size; ++i) {
+		struct page *page;
+		size_t nr = PAGE_CACHE_SIZE;
+
+		cond_resched();
+		page = find_get_page(mapping, priv->processed >> PAGE_CACHE_SHIFT);
+		if (unlikely(page == NULL))
+			break;
+		if (!PageUptodate(page)) {
+			dprintk("%s: %2d: page=%p, processed=%Lu, size=%zu not uptodate.\n", 
+					__func__, i, page, priv->processed, priv->size);
+			page_cache_release(page);
+			break;
+		}
+
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_page(page);
+
+		mark_page_accessed(page);
+
+		if (nr + priv->processed > isize)
+			nr = isize - priv->processed;
+		if (nr > priv->size)
+			nr = priv->size;
+
+		actor_size = actor(k, page, nr);
+		if (actor_size < 0) {
+			page_cache_release(page);
+			break;
+		}
+
+		page_cache_release(page);
+
+		priv->processed += actor_size;
+		priv->size -= actor_size;
+	}
+
+	if (!priv->size)
+		i = priv->pg_num;
+
+	if (i != priv->pg_num)
+		priv->offset = priv->processed;
+
+	dprintk("%s: end: next=%d, num=%d, left=%zu, offset=%Lu, procesed=%Lu, ret=%d.\n", 
+			__func__, i, priv->pg_num, 
+			priv->size, priv->offset, priv->processed, i);
+
+	return i;
+}
+
+static int kevent_aio_callback(struct kevent *k)
+{
+	return 1;
+}
+
+static void kevent_aio_work(void *data)
+{
+	struct kevent *k = data;
+	struct kevent_aio_private *priv = k->priv;
+	struct inode *inode = k->st->origin;
+	struct address_space *mapping = priv->file->f_mapping;
+	int err, ready = 0, num;
+
+	dprintk("%s: k=%p, priv=%p, inode=%p.\n", __func__, k, priv, inode);
+
+	init_completion(&priv->bio_complete);
+	
+	num = ready = kevent_aio_vfs_read(k, &kevent_aio_vfs_read_actor);
+	if (ready > 0 && ready != priv->pg_num)
+		ready = 0;
+
+	dprintk("%s: k=%p, ready=%d, size=%zd.\n", __func__, k, ready, priv->size);
+
+	if (!ready) {
+		err = kevent_mpage_readpages(k, num, mapping->a_ops->get_block);
+		if (err) {
+			dprintk("%s: kevent_mpage_readpages failed: err=%d, k=%p, size=%zd.\n",
+					__func__, err, k, priv->size);
+			kevent_break(k);
+			kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+		}
+	} else {
+		dprintk("%s: next k=%p, size=%zd.\n", __func__, k, priv->size);
+
+		if (priv->size)
+			schedule_work(&priv->work);
+		else {
+			kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+		}
+
+		complete(&priv->bio_complete);
+	}
+}
+
+static int kevent_aio_enqueue(struct kevent *k)
+{
+	int err;
+	struct file *file, *sock;
+	struct inode *inode;
+	struct kevent_aio_private *priv;
+	struct address_space *mapping;
+	int fd = k->event.id.raw[0];
+	int num = k->event.id.raw[1];
+	int s = k->event.ret_data[0];
+	size_t size;
+
+	err = -ENODEV;
+	file = fget(fd);
+	if (!file)
+		goto err_out_exit;
+	
+	sock = fget(s);
+	if (!sock)
+		goto err_out_fput_file;
+	
+	mapping = file->f_mapping;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode || !mapping->a_ops->get_block)
+		goto err_out_fput;
+	if (!sock->f_dentry || !sock->f_dentry->d_inode)
+		goto err_out_fput;
+
+	inode = igrab(file->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	size = i_size_read(inode);
+	
+	num = (size > num << PAGE_SHIFT) ? num : (size >> PAGE_SHIFT);
+
+	err = -ENOMEM;
+	priv = kzalloc(sizeof(struct kevent_aio_private), GFP_KERNEL);
+	if (!priv)
+		goto err_out_iput;
+
+	priv->pg_num = num;
+	priv->size = size;
+	priv->offset = 0;
+	priv->file = file;
+	priv->sock = sock;
+	INIT_WORK(&priv->work, kevent_aio_work, k);
+	k->priv = priv;
+
+	dprintk("%s: read: k=%p, priv=%p, inode=%p, num=%u, size=%zu, off=%Lu.\n", 
+			__func__, k, priv, inode, priv->pg_num, priv->size, priv->offset);
+	
+	init_completion(&priv->bio_complete);
+	kevent_storage_enqueue(&inode->st, k);
+	schedule_work(&priv->work);
+	
+	return 0;
+
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput(sock);
+err_out_fput_file:
+	fput(file);
+err_out_exit:
+
+	return err;
+}
+
+static int kevent_aio_dequeue(struct kevent *k)
+{
+	struct kevent_aio_private *priv = k->priv;
+	struct inode *inode = k->st->origin;
+	struct file *file = priv->file;
+	struct file *sock = priv->sock;
+
+	kevent_storage_dequeue(k->st, k);
+	flush_scheduled_work();
+	wait_for_completion(&priv->bio_complete);
+
+	kfree(k->priv);
+	k->priv = NULL;
+	iput(inode);
+	fput(file);
+	fput(sock);
+
+	return 0;
+}
+
+asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, 
+		size_t size, unsigned flags)
+{
+	struct ukevent ukread, uksend;
+	struct kevent_user *u;
+	struct file *file;
+	int err, fput_needed;
+	int num = (flags & 7)?(flags & 7):8;
+
+	memset(&ukread, 0, sizeof(struct ukevent));
+	memset(&uksend, 0, sizeof(struct ukevent));
+
+	ukread.type = KEVENT_AIO;
+	ukread.event = KEVENT_AIO_BIO;
+
+	ukread.id.raw[0] = fd;
+	ukread.id.raw[1] = num;
+	ukread.ret_data[0] = s;
+
+	dprintk("%s: fd=%d, s=%d, num=%d.\n", __func__, fd, s, num);
+
+	file = fget_light(ctl_fd, &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	u = file->private_data;
+	if (!u) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	err = kevent_user_add_ukevent(&ukread, u);
+	if (err < 0)
+		goto err_out_fput;
+
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+int kevent_init_aio(struct kevent *k)
+{
+	k->enqueue = &kevent_aio_enqueue;
+	k->dequeue = &kevent_aio_dequeue;
+	k->callback = &kevent_aio_callback;
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [4/4] kevent: poll/select() notifications. Timer notifications.
  2006-07-26  9:18           ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
@ 2006-07-26  9:18             ` Evgeniy Polyakov
  2006-07-26 10:00             ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
  2006-07-26 10:04             ` Christoph Hellwig
  2 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26  9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev


This patch includes generic poll/select and timer notifications.

kevent_poll works simialr to epoll and has the same issues (callback
is invoked not from internal state machine of the caller, but through
process awake).

Timer notifications can be used for fine grained per-process time 
management, since iteractive timers are very inconveniently to use, 
and they are limited.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..4950e7c
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,223 @@
+/*
+ * 	kevent_poll.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct 	pt;
+	struct kevent			*k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head		container_entry;
+	wait_queue_head_t		*whead;
+	wait_queue_t			wait;
+	struct kevent			*k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head		container_list;
+	spinlock_t			container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait, 
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont = 
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	unsigned long flags;
+	u32 revents, event;
+
+	revents = file->f_op->poll(file, NULL);
+	spin_lock_irqsave(&k->lock, flags);
+	event = k->event.event;
+	spin_unlock_irqrestore(&k->lock, flags);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, 
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k = 
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+		
+	cont->k = k;
+	init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+	cont->whead = whead;
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_add_tail(&cont->container_entry, &priv->container_list);
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+
+	add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+	struct file *file;
+	int err, ready = 0;
+	unsigned int revents;
+	struct kevent_poll_ctl ctl;
+	struct kevent_poll_private *priv;
+
+	file = fget(k->event.id.raw[0]);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_op || !file->f_op->poll)
+		goto err_out_fput;
+
+	err = -ENOMEM;
+	priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+	if (!priv)
+		goto err_out_fput;
+
+	spin_lock_init(&priv->container_lock);
+	INIT_LIST_HEAD(&priv->container_list);
+
+	k->priv = priv;
+
+	ctl.k = k;
+	init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+	err = kevent_storage_enqueue(&file->st, k);
+	if (err)
+		goto err_out_free;
+
+	revents = file->f_op->poll(file, &ctl.pt);
+	if (revents & k->event.event) {
+		ready = 1;
+		kevent_poll_dequeue(k);
+	}
+	
+	return ready;
+
+err_out_free:
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+	fput(file);
+	return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *w, *n;
+	unsigned long flags;
+
+	kevent_storage_dequeue(k->st, k);
+
+	spin_lock_irqsave(&priv->container_lock, flags);
+	list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+		list_del(&w->container_entry);
+		remove_wait_queue(w->whead, &w->wait);
+		kmem_cache_free(kevent_poll_container_cache, w);
+	}
+	spin_unlock_irqrestore(&priv->container_lock, flags);
+	
+	kmem_cache_free(kevent_poll_priv_cache, priv);
+	k->priv = NULL;
+	
+	fput(file);
+
+	return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+	struct file *file = k->st->origin;
+	unsigned int revents = file->f_op->poll(file, NULL);
+	return (revents & k->event.event);
+}
+
+int kevent_init_poll(struct kevent *k)
+{
+	if (!kevent_poll_container_cache || !kevent_poll_priv_cache)
+		return -ENOMEM;
+
+	k->enqueue = &kevent_poll_enqueue;
+	k->dequeue = &kevent_poll_dequeue;
+	k->callback = &kevent_poll_callback;
+	return 0;
+}
+
+
+static int __init kevent_poll_sys_init(void)
+{
+	kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", 
+			sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+	if (!kevent_poll_container_cache) {
+		printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+		return -ENOMEM;
+	}
+	
+	kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", 
+			sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+	if (!kevent_poll_priv_cache) {
+		printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+		kmem_cache_destroy(kevent_poll_container_cache);
+		kevent_poll_container_cache = NULL;
+		return -ENOMEM;
+	}
+
+	printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+	return 0;
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+	kmem_cache_destroy(kevent_poll_priv_cache);
+	kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..53d3bdf
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 	kevent_timer.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+static void kevent_timer_func(unsigned long data)
+{
+	struct kevent *k = (struct kevent *)data;
+	struct timer_list *t = k->st->origin;
+
+	kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+	mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+	struct timer_list *t;
+	struct kevent_storage *st;
+	int err;
+
+	t = kmalloc(sizeof(struct timer_list) + sizeof(struct kevent_storage), 
+			GFP_KERNEL);
+	if (!t)
+		return -ENOMEM;
+
+	init_timer(t);
+	t->function = kevent_timer_func;
+	t->expires = jiffies + msecs_to_jiffies(k->event.id.raw[0]);
+	t->data = (unsigned long)k;
+
+	st = (struct kevent_storage *)(t+1);
+	err = kevent_storage_init(t, st);
+	if (err)
+		goto err_out_free;
+
+	err = kevent_storage_enqueue(st, k);
+	if (err)
+		goto err_out_st_fini;
+	
+	add_timer(t);
+
+	return 0;
+
+err_out_st_fini:	
+	kevent_storage_fini(st);
+err_out_free:
+	kfree(t);
+
+	return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct timer_list *t = st->origin;
+
+	if (!t)
+		return -ENODEV;
+
+	del_timer_sync(t);
+	
+	kevent_storage_dequeue(st, k);
+	
+	kfree(t);
+
+	return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+	struct kevent_storage *st = k->st;
+	struct timer_list *t = st->origin;
+
+	if (!t)
+		return -ENODEV;
+	
+	k->event.ret_data[0] = (__u32)jiffies;
+	return 1;
+}
+
+int kevent_init_timer(struct kevent *k)
+{
+	k->enqueue = &kevent_timer_enqueue;
+	k->dequeue = &kevent_timer_dequeue;
+	k->callback = &kevent_timer_callback;
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26  9:18           ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26  9:18             ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
@ 2006-07-26 10:00             ` Christoph Hellwig
  2006-07-26 10:08               ` Evgeniy Polyakov
  2006-07-26 10:04             ` Christoph Hellwig
  2 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:00 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> 
> This patch includes asynchronous propagation of file's data into VFS
> cache and aio_sendfile() implementation.
> Network aio_sendfile() works lazily - it asynchronously populates pages
> into the VFS cache (which can be used for various tricks with adaptive
> readahead) and then uses usual ->sendfile() callback.
> 
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index 6a0b9ad..a3ee530 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -119,7 +119,7 @@ void bio_free(struct bio *bio, struct bi
>  /*
>   * default destructor for a bio allocated with bio_alloc_bioset()
>   */
> -static void bio_fs_destructor(struct bio *bio)
> +void bio_fs_destructor(struct bio *bio)
>  {
>  	bio_free(bio, fs_bio_set);
>  }
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 04af9c4..295fce9 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -685,6 +685,7 @@ ext2_writepages(struct address_space *ma
>  }
>  
>  struct address_space_operations ext2_aops = {
> +	.get_block		= ext2_get_block,

No way in hell.  For whatever you do please provide a interface at
the readpage/writepage/sendfile/etc abstraction layer.  get_block is
nothing that can be exposed to the common code.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:00             ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
@ 2006-07-26 10:08               ` Evgeniy Polyakov
  2006-07-26 10:13                 ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:08 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:00:13AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> >  struct address_space_operations ext2_aops = {
> > +	.get_block		= ext2_get_block,
> 
> No way in hell.  For whatever you do please provide a interface at
> the readpage/writepage/sendfile/etc abstraction layer.  get_block is
> nothing that can be exposed to the common code.

Compare this with sync read methods - all they do is exactly the same
operations with low-level blocks, which are combined into nice exported
function, so there is _no_ readpage layer - it calls only one function
which works with blocks.

I would create the same, i.e. async_readpage(), which called kevent's
functions and processed low-level blocks, just like sync code does, but
that requires kevent to be deep part of the FS tree.

So I prefer to have
kevent/some_function_which_works_with_blocks_and_kevents() 

instead of
fs/some_function_which_works_with_block_and_kevents()
kevent/call_that_function_like_all_readpage_callbacks_do().

So it is not a technical problem, but political one.
-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:08               ` Evgeniy Polyakov
@ 2006-07-26 10:13                 ` Christoph Hellwig
  2006-07-26 10:25                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 02:08:49PM +0400, Evgeniy Polyakov wrote:
> On Wed, Jul 26, 2006 at 11:00:13AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > >  struct address_space_operations ext2_aops = {
> > > +	.get_block		= ext2_get_block,
> > 
> > No way in hell.  For whatever you do please provide a interface at
> > the readpage/writepage/sendfile/etc abstraction layer.  get_block is
> > nothing that can be exposed to the common code.
> 
> Compare this with sync read methods - all they do is exactly the same
> operations with low-level blocks, which are combined into nice exported
> function, so there is _no_ readpage layer - it calls only one function
> which works with blocks.

No.  The abtraction layer there is ->readpage(s).  _A_ common implementation
works with a get_block callback from the filesystem, but there are various
others.  We've been there before, up to mid-2.3.x we had a get_block inode
operation and we got rid of it because it is the wrong abstraction.

> So it is not a technical problem, but political one.

It's a technical problem, and it's called get you abstractions right.  And
ontop of that a political one and that's called get your abstraction coherent.
If you managed to argue all of us into accept that get_block is the right
abstraction (and as I mentioned above that's technically not true) you'd
still have the burden to update everything to use the same abstraction.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:13                 ` Christoph Hellwig
@ 2006-07-26 10:25                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:25 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:13:56AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, Jul 26, 2006 at 02:08:49PM +0400, Evgeniy Polyakov wrote:
> > On Wed, Jul 26, 2006 at 11:00:13AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > > >  struct address_space_operations ext2_aops = {
> > > > +	.get_block		= ext2_get_block,
> > > 
> > > No way in hell.  For whatever you do please provide a interface at
> > > the readpage/writepage/sendfile/etc abstraction layer.  get_block is
> > > nothing that can be exposed to the common code.
> > 
> > Compare this with sync read methods - all they do is exactly the same
> > operations with low-level blocks, which are combined into nice exported
> > function, so there is _no_ readpage layer - it calls only one function
> > which works with blocks.
> 
> No.  The abtraction layer there is ->readpage(s).  _A_ common implementation
> works with a get_block callback from the filesystem, but there are various
> others.  We've been there before, up to mid-2.3.x we had a get_block inode
> operation and we got rid of it because it is the wrong abstraction.

Well, kevent can work not from it's own, but with common implementation,
which works with get_block(). No problem here.

> > So it is not a technical problem, but political one.
> 
> It's a technical problem, and it's called get you abstractions right.  And
> ontop of that a political one and that's called get your abstraction coherent.
> If you managed to argue all of us into accept that get_block is the right
> abstraction (and as I mentioned above that's technically not true) you'd
> still have the burden to update everything to use the same abstraction.

Christoph, I completely understand your point of view.
There is absolutely no technical problem to create common async implementation,
and place it where existing sync lives and call from readpage() level.

It just requires to allow to change BIO callbacks instead of default
one, and (probably) event sync readpage can be used.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26  9:18           ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26  9:18             ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
  2006-07-26 10:00             ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
@ 2006-07-26 10:04             ` Christoph Hellwig
  2006-07-26 10:12               ` David Miller
  2006-07-26 10:19               ` Evgeniy Polyakov
  2 siblings, 2 replies; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:04 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> 
> This patch includes asynchronous propagation of file's data into VFS
> cache and aio_sendfile() implementation.
> Network aio_sendfile() works lazily - it asynchronously populates pages
> into the VFS cache (which can be used for various tricks with adaptive
> readahead) and then uses usual ->sendfile() callback.

And please don't base this on sendfile.  Please make the splice infrastructure
aynschronous without duplicating all the code but rather make the existing
code aynch and the existing synchronous call wait on them to finish, similar
to how we handle async/sync direct I/O.  And to be honest, I don't think
adding all this code is acceptable if it can't replace the existing aio
code while keeping the interface.  So while you interface looks pretty
sane the implementation needs a lot of work still :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:04             ` Christoph Hellwig
@ 2006-07-26 10:12               ` David Miller
  2006-07-26 10:15                 ` Christoph Hellwig
  2006-07-26 14:14                 ` Avi Kivity
  2006-07-26 10:19               ` Evgeniy Polyakov
  1 sibling, 2 replies; 73+ messages in thread
From: David Miller @ 2006-07-26 10:12 UTC (permalink / raw)
  To: hch; +Cc: johnpol, linux-kernel, drepper, netdev

From: Christoph Hellwig <hch@infradead.org>
Date: Wed, 26 Jul 2006 11:04:31 +0100

> And to be honest, I don't think adding all this code is acceptable
> if it can't replace the existing aio code while keeping the
> interface.  So while you interface looks pretty sane the
> implementation needs a lot of work still :)

Networking and disk AIO have significantly different needs.

Therefore, I really don't see it as reasonable to expect
a merge of these two things.  It doesn't make any sense.

I do agree that this stuff needs to be cleaned up, all the get_block
etc. hacks have to be pulled out and abstracted properly.  That part
of the kevent changes are indeed still crap :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:12               ` David Miller
@ 2006-07-26 10:15                 ` Christoph Hellwig
  2006-07-26 20:21                   ` Phillip Susi
  2006-07-26 14:14                 ` Avi Kivity
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:15 UTC (permalink / raw)
  To: David Miller; +Cc: hch, johnpol, linux-kernel, drepper, netdev

On Wed, Jul 26, 2006 at 03:12:47AM -0700, David Miller wrote:
> From: Christoph Hellwig <hch@infradead.org>
> Date: Wed, 26 Jul 2006 11:04:31 +0100
> 
> > And to be honest, I don't think adding all this code is acceptable
> > if it can't replace the existing aio code while keeping the
> > interface.  So while you interface looks pretty sane the
> > implementation needs a lot of work still :)
> 
> Networking and disk AIO have significantly different needs.
> 
> Therefore, I really don't see it as reasonable to expect
> a merge of these two things.  It doesn't make any sense.

I'm not sure about that.  The current aio interface isn't exactly nice
for disk I/O either.  I'm more than happy to have a discussion about
that aspect.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:15                 ` Christoph Hellwig
@ 2006-07-26 20:21                   ` Phillip Susi
  0 siblings, 0 replies; 73+ messages in thread
From: Phillip Susi @ 2006-07-26 20:21 UTC (permalink / raw)
  To: Christoph Hellwig, David Miller, johnpol, linux-kernel, drepper,
	netdev

Christoph Hellwig wrote:
>> Networking and disk AIO have significantly different needs.
>>
>> Therefore, I really don't see it as reasonable to expect
>> a merge of these two things.  It doesn't make any sense.
> 
> I'm not sure about that.  The current aio interface isn't exactly nice
> for disk I/O either.  I'm more than happy to have a discussion about
> that aspect.
> 

I agree that it makes perfect sense for a merger because aio and 
networking have very similar needs.  In both cases, the caller hands the 
kernel a buffer and wants the kernel to either fill it or consume it, 
and to be able to do so asynchronously.  You also want to maximize 
performance in both cases by taking advantage of zero copy IO.

I wonder though, why do you say the current aio interface isn't nice for 
disk IO?  It seems to work rather nicely to me, and is much better than 
the posix aio interface.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:12               ` David Miller
  2006-07-26 10:15                 ` Christoph Hellwig
@ 2006-07-26 14:14                 ` Avi Kivity
  1 sibling, 0 replies; 73+ messages in thread
From: Avi Kivity @ 2006-07-26 14:14 UTC (permalink / raw)
  To: David Miller; +Cc: hch, johnpol, linux-kernel, drepper, netdev

David Miller wrote:
>
> From: Christoph Hellwig <hch@infradead.org>
> Date: Wed, 26 Jul 2006 11:04:31 +0100
>
> > And to be honest, I don't think adding all this code is acceptable
> > if it can't replace the existing aio code while keeping the
> > interface.  So while you interface looks pretty sane the
> > implementation needs a lot of work still :)
>
> Networking and disk AIO have significantly different needs.
>
Surely, there needs to be a unified polling interface to support single 
threaded designs.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:04             ` Christoph Hellwig
  2006-07-26 10:12               ` David Miller
@ 2006-07-26 10:19               ` Evgeniy Polyakov
  2006-07-26 10:30                 ` Christoph Hellwig
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:19 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:04:31AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> > 
> > This patch includes asynchronous propagation of file's data into VFS
> > cache and aio_sendfile() implementation.
> > Network aio_sendfile() works lazily - it asynchronously populates pages
> > into the VFS cache (which can be used for various tricks with adaptive
> > readahead) and then uses usual ->sendfile() callback.
> 
> And please don't base this on sendfile.  Please make the splice infrastructure
> aynschronous without duplicating all the code but rather make the existing
> code aynch and the existing synchronous call wait on them to finish, similar
> to how we handle async/sync direct I/O.  And to be honest, I don't think
> adding all this code is acceptable if it can't replace the existing aio
> code while keeping the interface.  So while you interface looks pretty
> sane the implementation needs a lot of work still :)

Kevent was created quite before splice and friends, so I used what there
were :)

I stopped to work on AIO, since neither existing, nor mine
implementation were able to outperform sync speeds (one of the major problems
in my implementation is get_user_pages() overhead, which can be
completely eliminated with physical memory allocation being done in
advance in userspace, like Ulrich described).
My personal opinion on existing AIO is that it is not the right design.
Benjamin LaHaise agree with me (if I understood him right), but he
failed to move AIO outside repeated-call model (2.4 had state machine
based one, and out-of-the tree 2.6 patches have that design too).
In theory existing AIO (with all posix userspace API) can be replaced
with kevent (it will even take less space), but I would present it as a
TODO item, since kevent itself has nothing to do with AIO.

Kevent is a generic event processing mechanism, AIO, network AIO and all
others are just kernel users of it's functionality.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:19               ` Evgeniy Polyakov
@ 2006-07-26 10:30                 ` Christoph Hellwig
  2006-07-26 14:28                   ` Ulrich Drepper
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 02:19:21PM +0400, Evgeniy Polyakov wrote:
> I stopped to work on AIO, since neither existing, nor mine
> implementation were able to outperform sync speeds (one of the major problems
> in my implementation is get_user_pages() overhead, which can be
> completely eliminated with physical memory allocation being done in
> advance in userspace, like Ulrich described).
> My personal opinion on existing AIO is that it is not the right design.
> Benjamin LaHaise agree with me (if I understood him right),

I completely agree with that aswell.

> but he
> failed to move AIO outside repeated-call model (2.4 had state machine
> based one, and out-of-the tree 2.6 patches have that design too).
> In theory existing AIO (with all posix userspace API) can be replaced
> with kevent (it will even take less space), but I would present it as a
> TODO item, since kevent itself has nothing to do with AIO.

And replacing the existing aio code is exactly we I want you to do.  We
can't keep adding more and more code without getting rid of old mess forever.

And yes, the asynchronous pagecache population bit in your patchkit has a lot
to do with aio.  It's same variant of aio done right (or at least less bad).

I suspect the right way to go ahead is to drop that bit for now (it's the
by far worst code in the patchkit anyway) and then we can redo it later to
not get abstractions wrong and duplicate lots of code but also replace the
aio code.  I don't expect you to do that alone, you'll probably need quite
a bit help from us FS and VM people.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:30                 ` Christoph Hellwig
@ 2006-07-26 14:28                   ` Ulrich Drepper
  2006-07-26 16:22                     ` Badari Pulavarty
  0 siblings, 1 reply; 73+ messages in thread
From: Ulrich Drepper @ 2006-07-26 14:28 UTC (permalink / raw)
  To: Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller,
	Ulrich Drepper, netdev

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

Christoph Hellwig wrote:
>> My personal opinion on existing AIO is that it is not the right design.
>> Benjamin LaHaise agree with me (if I understood him right),
> 
> I completely agree with that aswell.

I agree, too, but the current code is not the last of the line.  Suparna
has a st of patches which make the current kernel aio code work much
better and especially make it really usable to implement POSIX AIO.

In Ottawa we were talking about submitting it and Suparna will.  We just
thought about a little longer timeframe.  I guess it could be
accelerated since he mostly has the patch done.  But I don't know her
schedule.

Important here is, don't base any decision on the current aio
implementation.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 14:28                   ` Ulrich Drepper
@ 2006-07-26 16:22                     ` Badari Pulavarty
  2006-07-27  6:49                       ` Sébastien Dugué
  0 siblings, 1 reply; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-26 16:22 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya

Ulrich Drepper wrote:
> Christoph Hellwig wrote:
>   
>>> My personal opinion on existing AIO is that it is not the right design.
>>> Benjamin LaHaise agree with me (if I understood him right),
>>>       
>> I completely agree with that aswell.
>>     
>
> I agree, too, but the current code is not the last of the line.  Suparna
> has a st of patches which make the current kernel aio code work much
> better and especially make it really usable to implement POSIX AIO.
>
> In Ottawa we were talking about submitting it and Suparna will.  We just
> thought about a little longer timeframe.  I guess it could be
> accelerated since he mostly has the patch done.  But I don't know her
> schedule.
>
> Important here is, don't base any decision on the current aio
> implementation.
>   
Ulrich,

Suparna mentioned your interest in making POSIX glibc aio work with 
kernel-aio at OLS.
We thought taking a re-look at the (kernel side) work BULL did, would be 
a nice starting
point. I re-based those patches to 2.6.18-rc2 and sent it to Zach Brown 
for review before
sending them out to list.

These patches does NOT make AIO any cleaner. All they do is add 
functionality to support
POSIX AIO easier. These are

[ PATCH 1/3 ]  Adding signal notification for event completion

[ PATCH 2/3 ]  lio (listio) completion semantics

[ PATCH 3/3 ] cancel_fd support

Suparna explained these in the following article:

http://lwn.net/Articles/148755/

If you think, this is a reasonable direction/approach for the kernel and 
you would take care
of glibc side of things - I can spend time on these patches, getting 
them to reasonable shape
and push for inclusion.

Please let us know.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 16:22                     ` Badari Pulavarty
@ 2006-07-27  6:49                       ` Sébastien Dugué
  2006-07-27 15:28                         ` Badari Pulavarty
  0 siblings, 1 reply; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-27  6:49 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml,
	David Miller, netdev, Suparna Bhattacharya

On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote:
> Ulrich Drepper wrote:
> > Christoph Hellwig wrote:
> >   
> >>> My personal opinion on existing AIO is that it is not the right design.
> >>> Benjamin LaHaise agree with me (if I understood him right),
> >>>       
> >> I completely agree with that aswell.
> >>     
> >
> > I agree, too, but the current code is not the last of the line.  Suparna
> > has a st of patches which make the current kernel aio code work much
> > better and especially make it really usable to implement POSIX AIO.
> >
> > In Ottawa we were talking about submitting it and Suparna will.  We just
> > thought about a little longer timeframe.  I guess it could be
> > accelerated since he mostly has the patch done.  But I don't know her
> > schedule.
> >
> > Important here is, don't base any decision on the current aio
> > implementation.
> >   
> Ulrich,
> 
> Suparna mentioned your interest in making POSIX glibc aio work with 
> kernel-aio at OLS.
> We thought taking a re-look at the (kernel side) work BULL did, would be 
> a nice starting
> point. I re-based those patches to 2.6.18-rc2 and sent it to Zach Brown 
> for review before
> sending them out to list.
> 
> These patches does NOT make AIO any cleaner. All they do is add 
> functionality to support
> POSIX AIO easier. These are
> 
> [ PATCH 1/3 ]  Adding signal notification for event completion
> 
> [ PATCH 2/3 ]  lio (listio) completion semantics
> 
> [ PATCH 3/3 ] cancel_fd support

  Badari,

  Thanks for refreshing those patches, they have been sitting here
for quite some time now and collected dust.

  I also think Suparna's patchset for doing buffered AIO would be
a real plus here.

> 
> Suparna explained these in the following article:
> 
> http://lwn.net/Articles/148755/
> 
> If you think, this is a reasonable direction/approach for the kernel and 
> you would take care
> of glibc side of things - I can spend time on these patches, getting 
> them to reasonable shape
> and push for inclusion.

  Ulrich, I you want to have a look at how those patches are put to
use in libposix-aio, have a look at http://sourceforge.net/projects/paiol.

  It could be a starting point for glibc.

  Thanks,

  Sébastien.

-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27  6:49                       ` Sébastien Dugué
@ 2006-07-27 15:28                         ` Badari Pulavarty
  2006-07-27 18:14                           ` Zach Brown
  2006-07-28  7:26                           ` Sébastien Dugué
  0 siblings, 2 replies; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-27 15:28 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml,
	David Miller, netdev, Suparna Bhattacharya

Sébastien Dugué wrote:
> On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote:
>   
>> Ulrich Drepper wrote:
>>     
>>> Christoph Hellwig wrote:
>>>   
>>>       
>>>>> My personal opinion on existing AIO is that it is not the right design.
>>>>> Benjamin LaHaise agree with me (if I understood him right),
>>>>>       
>>>>>           
>>>> I completely agree with that aswell.
>>>>     
>>>>         
>>> I agree, too, but the current code is not the last of the line.  Suparna
>>> has a st of patches which make the current kernel aio code work much
>>> better and especially make it really usable to implement POSIX AIO.
>>>
>>> In Ottawa we were talking about submitting it and Suparna will.  We just
>>> thought about a little longer timeframe.  I guess it could be
>>> accelerated since he mostly has the patch done.  But I don't know her
>>> schedule.
>>>
>>> Important here is, don't base any decision on the current aio
>>> implementation.
>>>   
>>>       
>> Ulrich,
>>
>> Suparna mentioned your interest in making POSIX glibc aio work with 
>> kernel-aio at OLS.
>> We thought taking a re-look at the (kernel side) work BULL did, would be 
>> a nice starting
>> point. I re-based those patches to 2.6.18-rc2 and sent it to Zach Brown 
>> for review before
>> sending them out to list.
>>
>> These patches does NOT make AIO any cleaner. All they do is add 
>> functionality to support
>> POSIX AIO easier. These are
>>
>> [ PATCH 1/3 ]  Adding signal notification for event completion
>>
>> [ PATCH 2/3 ]  lio (listio) completion semantics
>>
>> [ PATCH 3/3 ] cancel_fd support
>>     
>
>   Badari,
>
>   Thanks for refreshing those patches, they have been sitting here
> for quite some time now and collected dust.
>
>   I also think Suparna's patchset for doing buffered AIO would be
> a real plus here.
>
>   
>> Suparna explained these in the following article:
>>
>> http://lwn.net/Articles/148755/
>>
>> If you think, this is a reasonable direction/approach for the kernel and 
>> you would take care
>> of glibc side of things - I can spend time on these patches, getting 
>> them to reasonable shape
>> and push for inclusion.
>>     
>
>   Ulrich, I you want to have a look at how those patches are put to
> use in libposix-aio, have a look at http://sourceforge.net/projects/paiol.
>
>   It could be a starting point for glibc.
>
>   Thanks,
>
>   Sébastien.
>
>   
Sebastien,

Suparna mentioned at Ulrich wants us to concentrate on kernel-side 
support, so that he
can look at glibc side of things (along with other work he is already 
doing). So, if we
can get an agreement on what kind of kernel support is needed - we can 
focus our
efforts on kernel side first and leave glibc enablement to capable hands 
of Uli :)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 15:28                         ` Badari Pulavarty
@ 2006-07-27 18:14                           ` Zach Brown
  2006-07-27 18:29                             ` Badari Pulavarty
  2006-07-28  7:26                           ` Sébastien Dugué
  1 sibling, 1 reply; 73+ messages in thread
From: Zach Brown @ 2006-07-27 18:14 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Sébastien Dugué, Ulrich Drepper, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya


> Suparna mentioned at Ulrich wants us to concentrate on kernel-side 
> support, so that he can look at glibc side of things (along with
> other work he is already doing). So, if we can get an agreement on
> what kind of kernel support is needed - we can focus our efforts on
> kernel side first and leave glibc enablement to capable hands of Uli
> :)

Yeah, and the existing patches still need some cleanup.  Badari, did you
still want me to look into that?

We need someone to claim ultimate responsibility for getting these
patches suitable for merging :).  I'm happy to do that if Suparna isn't
already on it.

- z

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:14                           ` Zach Brown
@ 2006-07-27 18:29                             ` Badari Pulavarty
  2006-07-27 18:44                               ` Ulrich Drepper
  0 siblings, 1 reply; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-27 18:29 UTC (permalink / raw)
  To: Zach Brown
  Cc: Sébastien Dugué, Ulrich Drepper, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya

On Thu, 2006-07-27 at 11:14 -0700, Zach Brown wrote:
> > Suparna mentioned at Ulrich wants us to concentrate on kernel-side 
> > support, so that he can look at glibc side of things (along with
> > other work he is already doing). So, if we can get an agreement on
> > what kind of kernel support is needed - we can focus our efforts on
> > kernel side first and leave glibc enablement to capable hands of Uli
> > :)
> 
> Yeah, and the existing patches still need some cleanup.  Badari, did you
> still want me to look into that?
> 
> We need someone to claim ultimate responsibility for getting these
> patches suitable for merging :).  I'm happy to do that if Suparna isn't
> already on it.

Zach,

Thanks for volunteering !! Sebastien & I should be able to help you.

Before we spend too much time cleaning up and merging into mainline -
I would like an agreement that what we add is good enough for glibc
POSIX AIO. I hate to waste everyone's time and add complexity to the
kernel - if glibc side is not going to happen :(

Thanks,
Badari


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:29                             ` Badari Pulavarty
@ 2006-07-27 18:44                               ` Ulrich Drepper
  2006-07-27 21:02                                 ` Badari Pulavarty
                                                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Ulrich Drepper @ 2006-07-27 18:44 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Zach Brown, Sébastien Dugué, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

Badari Pulavarty wrote:
> Before we spend too much time cleaning up and merging into mainline -
> I would like an agreement that what we add is good enough for glibc
> POSIX AIO.

I haven't seen a description of the interface so far.  Would be good if
it existed.  But I briefly mentioned one quirk in the interface about
which Suparna wasn't sure whether it's implemented/implementable in the
current interface.

If a lio_listio call is made the individual requests are handle just as
if they'd be issue separately.  I.e., the notification specified in the
individual aiocb is performed when the specific request is done.  Then,
once all requests are done, another notification is made, this time
controlled by the sigevent parameter if lio_listio.

Another feature which I always wanted: the current lio_listio call
returns in blocking mode only if all requests are done.  In non-blocking
mode it returns immediately and the program needs to poll the aiocbs.
What is needed is something in the middle.  For instance, if multiple
read requests are issued the program might be able to start working as
soon as one request is satisfied.  I.e., a call similar to lio_listio
would be nice which also takes another parameter specifying how many of
the NENT aiocbs have to finish before the call returns.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:44                               ` Ulrich Drepper
@ 2006-07-27 21:02                                 ` Badari Pulavarty
  2006-07-28  7:31                                   ` Sébastien Dugué
  2006-07-28 12:58                                   ` Sébastien Dugué
  2006-07-28  7:29                                 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué
  2006-07-31 10:11                                 ` Suparna Bhattacharya
  2 siblings, 2 replies; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-27 21:02 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Zach Brown, Sébastien Dugué, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya

On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote:
> Badari Pulavarty wrote:
> > Before we spend too much time cleaning up and merging into mainline -
> > I would like an agreement that what we add is good enough for glibc
> > POSIX AIO.
> 
> I haven't seen a description of the interface so far.  Would be good if
> it existed.  But I briefly mentioned one quirk in the interface about
> which Suparna wasn't sure whether it's implemented/implementable in the
> current interface.

Sebastien, could you provide a description of interfaces you are
adding ? Since you did all the work, it would be appropriate for
you to do it :)

> If a lio_listio call is made the individual requests are handle just as
> if they'd be issue separately.  I.e., the notification specified in the
> individual aiocb is performed when the specific request is done.  Then,
> once all requests are done, another notification is made, this time
> controlled by the sigevent parameter if lio_listio.
> 
> 
> Another feature which I always wanted: the current lio_listio call
> returns in blocking mode only if all requests are done.  In non-blocking
> mode it returns immediately and the program needs to poll the aiocbs.
> What is needed is something in the middle.  For instance, if multiple
> read requests are issued the program might be able to start working as
> soon as one request is satisfied.  I.e., a call similar to lio_listio
> would be nice which also takes another parameter specifying how many of
> the NENT aiocbs have to finish before the call returns.

Looks reasonable.

Thanks,
Badari


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 21:02                                 ` Badari Pulavarty
@ 2006-07-28  7:31                                   ` Sébastien Dugué
  2006-07-28 12:58                                   ` Sébastien Dugué
  1 sibling, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-28  7:31 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

On Thu, 2006-07-27 at 14:02 -0700, Badari Pulavarty wrote:
> On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote:
> > Badari Pulavarty wrote:
> > > Before we spend too much time cleaning up and merging into mainline -
> > > I would like an agreement that what we add is good enough for glibc
> > > POSIX AIO.
> > 
> > I haven't seen a description of the interface so far.  Would be good if
> > it existed.  But I briefly mentioned one quirk in the interface about
> > which Suparna wasn't sure whether it's implemented/implementable in the
> > current interface.
> 
> Sebastien, could you provide a description of interfaces you are
> adding ? Since you did all the work, it would be appropriate for
> you to do it :)
> 

  I will clean up what description I have and send it soon.

  Sébastien.


-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 21:02                                 ` Badari Pulavarty
  2006-07-28  7:31                                   ` Sébastien Dugué
@ 2006-07-28 12:58                                   ` Sébastien Dugué
  2006-08-11 19:45                                     ` Ulrich Drepper
  1 sibling, 1 reply; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-28 12:58 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 1257 bytes --]

On Thu, 2006-07-27 at 14:02 -0700, Badari Pulavarty wrote:
> On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote:
> > Badari Pulavarty wrote:
> > > Before we spend too much time cleaning up and merging into mainline -
> > > I would like an agreement that what we add is good enough for glibc
> > > POSIX AIO.
> > 
> > I haven't seen a description of the interface so far.  Would be good if
> > it existed.  But I briefly mentioned one quirk in the interface about
> > which Suparna wasn't sure whether it's implemented/implementable in the
> > current interface.
> 
> Sebastien, could you provide a description of interfaces you are
> adding ? Since you did all the work, it would be appropriate for
> you to do it :)
> 

  Here are the descriptions for the AIO completion notification and
listio patches. Hope I did not leave out too much.

  Sébastien.

-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------

[-- Attachment #2: aioevent.txt --]
[-- Type: text/plain, Size: 2741 bytes --]

		     aio completion notification

Summary:
-------

  The current 2.6 kernel does not support notification of user space via
an RT signal upon an asynchronous IO completion. The POSIX specification
states that when an AIO request completes, a signal can be delivered to
the application as notification.

  The aioevent patch adds a struct sigevent *aio_sigeventp to the iocb.
The relevant fields (pid, signal number and value) are stored in the kiocb
for use when the request completes.

  That sigevent structure is filled by the application as part of the AIO
request preparation. Upon request completion, the kernel notifies the
application using those sigevent parameters. If SIGEV_NONE has been specified,
then the old behaviour is retained and the application must rely on polling
the completion queue using io_getevents().

Details:
-------

  A struct sigevent *aio_sigeventp is added to struct iocb in
include/linux/aio_abi.h

  An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
include/linux/aio.h:

	- IO_NOTIFY_SIGNAL means that the signal is to be sent to the
	  requesting thread 

	- IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
	  specifi thread.

  The following fields are added to struct kiocb in include/linux/aio.h:

	- pid_t ki_pid: target of the signal

	- __u16 ki_signo: signal number

	- __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
			   IO_NOTIFY_THREAD_ID

	- uid_t ki_uid, ki_euid: filled with the submitter credentials

	- sigval_t ki_sigev_value: value stuffed in siginfo

  these fields are only valid if ki_signo != 0.

  In io_submit_one(), if the application provided a sigevent then
iocb_setup_sigevent() is called which does the following:

	- save current->uid and current->euid in the kiocb fields ki_uid and
	  ki_euid for use in the completion path to check permissions

	- check access to the user sigevent

	- extract the needed fields from the sigevent (pid, signo, and value).
	  If the signal number passed from userspace is 0 then no notification
	  is to occur and ki_signo is set to 0

	- check whether the submitting thread wants to be notified directly
	  (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent
	  to another thread.
	  In the latter case a check is made to assert that the target thread
	  is in the same thread group

	- fill in the kiocb fields (ki_pid, ki_signo, ki_notify and ki_sigev_value)
	  for that request.

  Upon request completion, in aio_complete(), if ki_signo is not 0, then
__aio_send_signal() is called which sends the signal as follows:

	- fill in the siginfo struct to be sent to the application

	- check whether we have permission to signal the given thread

	- send the signal

[-- Attachment #3: lioevent.txt --]
[-- Type: text/plain, Size: 2489 bytes --]

			    listio support

Summary:
-------

  The lio patch adds POSIX listio completion notification support. It builds
on support provided by the aio event patch and adds an IOCB_CMD_GROUP
command to sys_io_submit().

  The purpose of IOCB_CMD_GROUP is to group together the following requests in
the list up to the end of the list.

  As part of listio submission, the user process prepends to a list of requests
an empty special aiocb with an aio_lio_opcode of IOCB_CMD_GROUP, filling only
the aio_sigevent fields.

Details:
-------

  An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h

  A struct lio_event is added in include/linux/aio.h

  A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h

 In sys_io_submit(), upon detecting such an IOCB_CMD_GROUP marker iocb, an
lio_event is created in lio_create() which contains the necessary information
for signaling a thread (signal number, pid, notify type and value) along with
a count of requests attached to this event.

  The following depicts the lio_event structure:

        struct lio_event {
                atomic_t        lio_users;
                int             lio_wait;
                __s32           lio_pid;
                __u16           lio_signo;
                __u16           lio_notify;
                __u64           lio_value;
                uid_t           lio_uid, lio_euid;
        };

  lio_users holds a count of the number of requests attached to this lio. It
is incremented with each request submitted and decremented at each request
completion. Thread notification occurs when this count reaches 0.

  Each subsequent submitted request is attached to this lio_event by setting
the request kiocb->*ki_lio to that lio_event (in io_submit_one()) and
incrementing the lio_users count.

  In aio_complete(), if the request is attached to an lio (ki_lio <> 0), then
lio_check() is called to decrement the lio_users count and eventually signal
the user process when all the requests in the group have completed.

  The IOCB_CMD_GROUP command semantic is as follows:

       - if the associated aiocb sigevent is NULL then we want to group
         requests for the purpose of blocking on the group completion
         (LIO_WAIT sync behavior).

       - if the associated sigevent is valid (not NULL) then we want to
         group requests for the purpose of being notified upon that
         group of requests completion (LIO_NOWAIT async behaviour).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-28 12:58                                   ` Sébastien Dugué
@ 2006-08-11 19:45                                     ` Ulrich Drepper
  2006-08-12 18:29                                       ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
  0 siblings, 1 reply; 73+ messages in thread
From: Ulrich Drepper @ 2006-08-11 19:45 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 4620 bytes --]

Sébastien Dugué wrote:
> 		     aio completion notification

I looked over this now but I don't think I understand everything.  Or I
don't see how it all is integrated.  And no, I'm not looking at the
proposed glibc code since would mean being tainted.

> Details:
> -------
> 
>   A struct sigevent *aio_sigeventp is added to struct iocb in
> include/linux/aio_abi.h
> 
>   An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
> include/linux/aio.h:
> 
> 	- IO_NOTIFY_SIGNAL means that the signal is to be sent to the
> 	  requesting thread 
> 
> 	- IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
> 	  specifi thread.

This has been proved to be sufficient in the timer code which basically
has the same problem.  But why do you need separate constants?  We have
the various SIGEV_* constants, among them SIGEV_THREAD_ID.  Just use
these constants for the values of ki_notify.

>   The following fields are added to struct kiocb in include/linux/aio.h:
> 
> 	- pid_t ki_pid: target of the signal
> 
> 	- __u16 ki_signo: signal number
> 
> 	- __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
> 			   IO_NOTIFY_THREAD_ID
> 
> 	- uid_t ki_uid, ki_euid: filled with the submitter credentials

These two fields aren't needed for the POSIX interfaces.  Where does the
requirement come from?  I don't say they should be removed, they might
be useful, but if the costs are non-negligible then they could go away.

> 	- check whether the submitting thread wants to be notified directly
> 	  (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent
> 	  to another thread.
> 	  In the latter case a check is made to assert that the target thread
> 	  is in the same thread group

Is this really how it's implemented?  This is not how it should be.
Either a signal is sent to a specific thread in the same process (this
is what SIGEV_THREAD_ID is for) or the signal is sent to a calling
process.  Sending a signal to the process means that from the kernel's
POV any thread which doesn't have the signal blocked can receive it.
The final decision is made by the kernel.  There is no mechanism to send
the signal to another process.

So, for the purpose of the POSIX AIO code the ki_pid value is only
needed when the SIGEV_THREAD_ID bit is set.

It could be an extension and I don't mind it being introduced.  But
again, it's not necessary and if it adds costs then it could be left
out.  It is something which could easily be introduced later if the need
arises.

> 			    listio support
> 

I really don't understand the kernel interface for this feature.

> Details:
> -------
> 
>   An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
> 
>   A struct lio_event is added in include/linux/aio.h
> 
>   A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h

So you have a pointer in the structure for the individual requests.  I
assume you use the atomic counter to trigger the final delivery.  I
further assume that if lio_wait is set the calling thread is suspended
until all requests are handled and that the final notification in this
case means that thread gets woken.

This is all fine.

But how do you pass the requests to the kernel?  If you have a new
lio_listio-like syscall it'll be easy.  But I haven't seen anything like
this mentioned.

The alternative is to pass the requests one-by-one in which case I don't
see how you create the reference to the lio_listio control block.  This
approach seems to be slower.

If all requests are passed at once, do you have the equivalent of
LIO_NOP entries?

How can we support the extension where we wait for a number of requests
which need not be all of them.  I.e., I submit N requests and want to be
notified when at least M (M <= N) notified.  I am not yet clear about
the actual semantics we should implement (e.g., do we send another
notification after the first one?) but it's something which IMO should
be taken into account in the design.

Finally, and this is very important, does you code send out the
individual requests notification and then in the end the lio_listio
completion?  I think Suparna wrote this is the case but I want to make sure.

Overall, this looks much better than the old code.  If the answers to my
questions show that the behavior is compatible with the POSIX AIO code
I'm certainly very much in favor of adding the kernel code.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-11 19:45                                     ` Ulrich Drepper
@ 2006-08-12 18:29                                       ` Suparna Bhattacharya
  2006-08-12 19:10                                         ` Ulrich Drepper
  2006-09-04 14:28                                         ` Sébastien Dugué
  0 siblings, 2 replies; 73+ messages in thread
From: Suparna Bhattacharya @ 2006-08-12 18:29 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: =?iso-8859-1?Q?S=E9bastien_Dugu=E9_=3Csebastien=2Edugue=40bull=2Enet?=.=?iso-8859-1?Q?=3E?=,
	Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, linux-aio

BTW, if anyone would like to be dropped off this growing cc list, please
let us know.

On Fri, Aug 11, 2006 at 12:45:55PM -0700, Ulrich Drepper wrote:
> Sébastien Dugué wrote:
> > 		     aio completion notification
> 
> I looked over this now but I don't think I understand everything.  Or I
> don't see how it all is integrated.  And no, I'm not looking at the
> proposed glibc code since would mean being tainted.

Oh, I didn't realise that. 
I'll make an attempt to clarify parts that I understand based on what I
have gleaned from my reading of the code and intent, but hopefully Sebastien,
Ben, Zach et al will be able to pitch in for a more accurate and complete
picture.

> 
> 
> > Details:
> > -------
> > 
> >   A struct sigevent *aio_sigeventp is added to struct iocb in
> > include/linux/aio_abi.h
> > 
> >   An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
> > include/linux/aio.h:
> > 
> > 	- IO_NOTIFY_SIGNAL means that the signal is to be sent to the
> > 	  requesting thread 
> > 
> > 	- IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
> > 	  specifi thread.
> 
> This has been proved to be sufficient in the timer code which basically
> has the same problem.  But why do you need separate constants?  We have
> the various SIGEV_* constants, among them SIGEV_THREAD_ID.  Just use
> these constants for the values of ki_notify.
> 

I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
part of the ABI, but only internal to the kernel implementation. I think
Zach had suggested inferring THREAD_ID notification if the pid specified
is not zero. But, I don't see why ->sigev_notify couldn't used directly
(just like the POSIX timers code does) thus doing away with the 
new constants altogether. Sebestian/Laurent, do you recall?

> 
> >   The following fields are added to struct kiocb in include/linux/aio.h:
> > 
> > 	- pid_t ki_pid: target of the signal
> > 
> > 	- __u16 ki_signo: signal number
> > 
> > 	- __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
> > 			   IO_NOTIFY_THREAD_ID
> > 
> > 	- uid_t ki_uid, ki_euid: filled with the submitter credentials
> 
> These two fields aren't needed for the POSIX interfaces.  Where does the
> requirement come from?  I don't say they should be removed, they might
> be useful, but if the costs are non-negligible then they could go away.

I'm guessing they are being used for validation of permissions at the time
of sending the signal, but maybe saving the task pointer in the iocb instead
of the pid would suffice ?

> 
> 
> > 	- check whether the submitting thread wants to be notified directly
> > 	  (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent
> > 	  to another thread.
> > 	  In the latter case a check is made to assert that the target thread
> > 	  is in the same thread group
> 
> Is this really how it's implemented?  This is not how it should be.
> Either a signal is sent to a specific thread in the same process (this
> is what SIGEV_THREAD_ID is for) or the signal is sent to a calling
> process.  Sending a signal to the process means that from the kernel's
> POV any thread which doesn't have the signal blocked can receive it.
> The final decision is made by the kernel.  There is no mechanism to send
> the signal to another process.

The code seems to be set up to call specific_send_sig_info() in the case
of *_THREAD_ID , and __group_send_sig_info() otherwise. So I think the
intended behaviour is as you describe it should be (__group_send_sig_info
does the equivalent of sending a signal to the process and so any thread
which doesn't have signals blocked can receive it, while specific_send_sig_info
sends it to a particular thread). 

But, I should really leave it to Sebestian to confirm that.

> 
> So, for the purpose of the POSIX AIO code the ki_pid value is only
> needed when the SIGEV_THREAD_ID bit is set.
> 
> It could be an extension and I don't mind it being introduced.  But
> again, it's not necessary and if it adds costs then it could be left
> out.  It is something which could easily be introduced later if the need
> arises.
> 
> 
> > 			    listio support
> > 
> 
> I really don't understand the kernel interface for this feature.

I'm sorry this is confusing. This probably means that we need to
separate the external interface description more clearly and completely
from the internals.

> 
> 
> > Details:
> > -------
> > 
> >   An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
> > 
> >   A struct lio_event is added in include/linux/aio.h
> > 
> >   A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h
> 
> So you have a pointer in the structure for the individual requests.  I
> assume you use the atomic counter to trigger the final delivery.  I
> further assume that if lio_wait is set the calling thread is suspended
> until all requests are handled and that the final notification in this
> case means that thread gets woken.
> 
> This is all fine.
> 
> But how do you pass the requests to the kernel?  If you have a new
> lio_listio-like syscall it'll be easy.  But I haven't seen anything like
> this mentioned.
> 
> The alternative is to pass the requests one-by-one in which case I don't
> see how you create the reference to the lio_listio control block.  This
> approach seems to be slower.

The way it works (and better ideas are welcome) is that, since the io_submit()
syscall already accepts an array of iocbs[], no new syscall was introduced.
To implement lio_listio, one has to set up such an array, with the first iocb
in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which
specifies the sigev notification to be associated with group completion
(a NULL value of the sigev notification pointer would imply equivalent of
LIO_WAIT). The following iocbs in the array should correspond to the set of
listio aiocbs. Whenever it encounters an IOCB_CMD_GROUP iocb opcode, the
kernel would interpret all subsequent iocbs[] submitted in the same
io_submit() call to be associated with the same lio control block. 

Does that clarify ?

Would an example help ?

> 
> If all requests are passed at once, do you have the equivalent of
> LIO_NOP entries?
> 

Good question - we do have an IOCB_CMD_NOOP defined, and I seem to even
recall a patch that implemented it, but am wondering if it ever got merged.
Ben/Zach ?

> 
> How can we support the extension where we wait for a number of requests
> which need not be all of them.  I.e., I submit N requests and want to be
> notified when at least M (M <= N) notified.  I am not yet clear about
> the actual semantics we should implement (e.g., do we send another
> notification after the first one?) but it's something which IMO should
> be taken into account in the design.
> 

My thought here was that it should be possible to include M as a parameter
to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control
block ... then whatever semantics are agreed upon can be implemented.

> 
> Finally, and this is very important, does you code send out the
> individual requests notification and then in the end the lio_listio
> completion?  I think Suparna wrote this is the case but I want to make sure.

Sebestian, could you confirm ?

> 
> 
> Overall, this looks much better than the old code.  If the answers to my
> questions show that the behavior is compatible with the POSIX AIO code
> I'm certainly very much in favor of adding the kernel code.

Thanks a lot for looking through this !
Let us know what you think about the listio interface ... hopefully the
other issues are mostly simple to resolve.

Regards
Suparna

> 
> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 18:29                                       ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
@ 2006-08-12 19:10                                         ` Ulrich Drepper
  2006-08-12 19:28                                           ` Jakub Jelinek
                                                             ` (2 more replies)
  2006-09-04 14:28                                         ` Sébastien Dugué
  1 sibling, 3 replies; 73+ messages in thread
From: Ulrich Drepper @ 2006-08-12 19:10 UTC (permalink / raw)
  To: suparna
  Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio

[-- Attachment #1: Type: text/plain, Size: 2751 bytes --]

Suparna Bhattacharya wrote:
> I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> part of the ABI, but only internal to the kernel implementation. I think
> Zach had suggested inferring THREAD_ID notification if the pid specified
> is not zero. But, I don't see why ->sigev_notify couldn't used directly
> (just like the POSIX timers code does) thus doing away with the 
> new constants altogether. Sebestian/Laurent, do you recall?

I suggest to model the implementation after the timer code which does
exactly what we need.


> I'm guessing they are being used for validation of permissions at the time
> of sending the signal, but maybe saving the task pointer in the iocb instead
> of the pid would suffice ?

Why should any verification be necessary?  The requests are generated in
the same process which will receive the notification.  Even if the POSIX
process (aka, kernel process group) changes the IDs the notifications
should be set.  The key is that notifications cannot be sent to another
POSIX process.

Adding this as a feature just makes things so much more complicated.


> So I think the
> intended behaviour is as you describe it should be

Then the documentation needs to be adjusted.


> The way it works (and better ideas are welcome) is that, since the io_submit()
> syscall already accepts an array of iocbs[], no new syscall was introduced.
> To implement lio_listio, one has to set up such an array, with the first iocb
> in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which
> specifies the sigev notification to be associated with group completion
> (a NULL value of the sigev notification pointer would imply equivalent of
> LIO_WAIT).

OK, this seems OK.  We have to construct the iocb arrays dynamically anyway.


> My thought here was that it should be possible to include M as a parameter
> to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control
> block ... then whatever semantics are agreed upon can be implemented.

If you have room for the parameter this is fine.  For the beginning we
can enforce the number to be the same as the total number of requests.


> Let us know what you think about the listio interface ... hopefully the
> other issues are mostly simple to resolve.

It should be fine and I would support adding all this assuming the
normal file support (as opposed to direct I/O only) is added, too.


But I have one last question: sockets, pipes and the like are already
supported, right?  If this is not the case we have a problem with the
currently proposed  lio_listio interface.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 19:10                                         ` Ulrich Drepper
@ 2006-08-12 19:28                                           ` Jakub Jelinek
  2006-09-04 14:37                                             ` Sébastien Dugué
  2006-08-14  7:02                                           ` Suparna Bhattacharya
  2006-09-04 14:36                                           ` Sébastien Dugué
  2 siblings, 1 reply; 73+ messages in thread
From: Jakub Jelinek @ 2006-08-12 19:28 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: suparna, sebastien.dugue, Badari Pulavarty, Zach Brown,
	Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev,
	linux-aio

On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote:
> > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> > part of the ABI, but only internal to the kernel implementation. I think
> > Zach had suggested inferring THREAD_ID notification if the pid specified
> > is not zero. But, I don't see why ->sigev_notify couldn't used directly
> > (just like the POSIX timers code does) thus doing away with the 
> > new constants altogether. Sebestian/Laurent, do you recall?
> 
> I suggest to model the implementation after the timer code which does
> exactly what we need.

Yeah, and if at all possible we want to use just one helper thread for
SIGEV_THREAD notification of timers/aio/etc., so it really should behave the
same as timer thread notification.

	Jakub

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 19:28                                           ` Jakub Jelinek
@ 2006-09-04 14:37                                             ` Sébastien Dugué
  0 siblings, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-09-04 14:37 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Ulrich Drepper, suparna, Badari Pulavarty, Zach Brown,
	Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev,
	linux-aio

On Sat, 2006-08-12 at 15:28 -0400, Jakub Jelinek wrote:
> On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote:
> > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> > > part of the ABI, but only internal to the kernel implementation. I think
> > > Zach had suggested inferring THREAD_ID notification if the pid specified
> > > is not zero. But, I don't see why ->sigev_notify couldn't used directly
> > > (just like the POSIX timers code does) thus doing away with the 
> > > new constants altogether. Sebestian/Laurent, do you recall?
> > 
> > I suggest to model the implementation after the timer code which does
> > exactly what we need.
> 
> Yeah, and if at all possible we want to use just one helper thread for
> SIGEV_THREAD notification of timers/aio/etc., so it really should behave the
> same as timer thread notification.
> 

  That's exactly what is done in libposix-aio.

  Sébastien.

-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 19:10                                         ` Ulrich Drepper
  2006-08-12 19:28                                           ` Jakub Jelinek
@ 2006-08-14  7:02                                           ` Suparna Bhattacharya
  2006-08-14 16:38                                             ` Ulrich Drepper
  2006-09-04 14:36                                           ` Sébastien Dugué
  2 siblings, 1 reply; 73+ messages in thread
From: Suparna Bhattacharya @ 2006-08-14  7:02 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, mingo

On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote:
> Suparna Bhattacharya wrote:
> > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> > part of the ABI, but only internal to the kernel implementation. I think
> > Zach had suggested inferring THREAD_ID notification if the pid specified
> > is not zero. But, I don't see why ->sigev_notify couldn't used directly
> > (just like the POSIX timers code does) thus doing away with the 
> > new constants altogether. Sebestian/Laurent, do you recall?
> 
> I suggest to model the implementation after the timer code which does
> exactly what we need.

Agreed.

> 
> 
> > I'm guessing they are being used for validation of permissions at the time
> > of sending the signal, but maybe saving the task pointer in the iocb instead
> > of the pid would suffice ?
> 
> Why should any verification be necessary?  The requests are generated in
> the same process which will receive the notification.  Even if the POSIX
> process (aka, kernel process group) changes the IDs the notifications
> should be set.  The key is that notifications cannot be sent to another
> POSIX process.

Is there a (remote) possibility that the thread could have died and its
pid got reused by a new thread in another process ? Or is there a mechanism
that prevents such a possibility from arising (not just in NPTL library,
but at the kernel level) ?

I think the timer code saves a reference to the task pointer instead of
the pid, which is what I was suggesting above (instead of the euid checks),
as way to avoid the above situation.

> 
> Adding this as a feature just makes things so much more complicated.
> 
> 
> > So I think the
> > intended behaviour is as you describe it should be
> 
> Then the documentation needs to be adjusted.

*Nod*

> 
> 
> > The way it works (and better ideas are welcome) is that, since the io_submit()
> > syscall already accepts an array of iocbs[], no new syscall was introduced.
> > To implement lio_listio, one has to set up such an array, with the first iocb
> > in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which
> > specifies the sigev notification to be associated with group completion
> > (a NULL value of the sigev notification pointer would imply equivalent of
> > LIO_WAIT).
> 
> OK, this seems OK.  We have to construct the iocb arrays dynamically anyway.
> 
> 
> > My thought here was that it should be possible to include M as a parameter
> > to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control
> > block ... then whatever semantics are agreed upon can be implemented.
> 
> If you have room for the parameter this is fine.  For the beginning we
> can enforce the number to be the same as the total number of requests.
> 

Sounds good.

> 
> > Let us know what you think about the listio interface ... hopefully the
> > other issues are mostly simple to resolve.
> 
> It should be fine and I would support adding all this assuming the
> normal file support (as opposed to direct I/O only) is added, too.

OK. I updated my patchset against 2618-rc3 just after OLS.

> 
> 
> But I have one last question: sockets, pipes and the like are already
> supported, right?  If this is not the case we have a problem with the
> currently proposed  lio_listio interface.

AIO for pipes should not be a problem - Chris Mason had a patch, so we can
just bring it up to the current levels, possibly with some additional
improvements.

I'm not sure what would be the right thing to do for the sockets case. While
we could put together a patch for basic aio_read/write (based on the same
model used for files), given the whole ongoing kevent effort, its not yet
clear to me what would make the most sense ...  

Ben had a patch to do a fallback to kernel threads for AIO operations that
are not yet supported natively. I had some concerns about the approach, but
I guess he had intended it as an interim path for cases like this.

Suggestions would be much appreciated ?  DaveM, Ingo, Andrew ?

Regards
Suparna

> 
> -- 
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
> 



-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-14  7:02                                           ` Suparna Bhattacharya
@ 2006-08-14 16:38                                             ` Ulrich Drepper
  2006-08-15  2:06                                               ` Nicholas Miell
  0 siblings, 1 reply; 73+ messages in thread
From: Ulrich Drepper @ 2006-08-14 16:38 UTC (permalink / raw)
  To: suparna
  Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, mingo

[-- Attachment #1: Type: text/plain, Size: 2036 bytes --]

Suparna Bhattacharya wrote:
> Is there a (remote) possibility that the thread could have died and its
> pid got reused by a new thread in another process ? Or is there a mechanism
> that prevents such a possibility from arising (not just in NPTL library,
> but at the kernel level) ?

The UID/GID won't help you with dying processes.  What if the same user
creates a process with the same PID?  That process will not expect the
notification and mustn't receive it.  If you cannot detect whether the
issuing process died you have problems which cannot be solved with a
uid/gid pair.

> AIO for pipes should not be a problem - Chris Mason had a patch, so we can
> just bring it up to the current levels, possibly with some additional
> improvements.

Good.

> I'm not sure what would be the right thing to do for the sockets case. While
> we could put together a patch for basic aio_read/write (based on the same
> model used for files), given the whole ongoing kevent effort, its not yet
> clear to me what would make the most sense ...  
> 
> Ben had a patch to do a fallback to kernel threads for AIO operations that
> are not yet supported natively. I had some concerns about the approach, but
> I guess he had intended it as an interim path for cases like this.

A fallback solution would be sufficient.  Nobody _should_ use POSIX AIO
for networking but people do and just giving them something that works
is good enough.  It cannot really be worse than the userlevel emulation
we have know.

The alternative, separately and sequentially handling network sockets at
userlevel is horrible.  We'd have to go over every file descriptor and
check whether it's a socket and then take if out of the request list for
the kernel.  Then they need to be handled separately before or after the
kernel AIO code.  This would punish unduly all the 99.9% of the programs
which don't use POSIX  AIO for network I/O.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-14 16:38                                             ` Ulrich Drepper
@ 2006-08-15  2:06                                               ` Nicholas Miell
  0 siblings, 0 replies; 73+ messages in thread
From: Nicholas Miell @ 2006-08-15  2:06 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: suparna, sebastien.dugue, Badari Pulavarty, Zach Brown,
	Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev,
	linux-aio, mingo

On Mon, 2006-08-14 at 09:38 -0700, Ulrich Drepper wrote:
> Suparna Bhattacharya wrote:
> > Is there a (remote) possibility that the thread could have died and its
> > pid got reused by a new thread in another process ? Or is there a mechanism
> > that prevents such a possibility from arising (not just in NPTL library,
> > but at the kernel level) ?
> 
> The UID/GID won't help you with dying processes.  What if the same user
> creates a process with the same PID?  That process will not expect the
> notification and mustn't receive it.  If you cannot detect whether the
> issuing process died you have problems which cannot be solved with a
> uid/gid pair.
> 
> 

Eric W. Biederman sent a series of patches that introduced a struct
task_ref specifically to solve this sort of problem on January 28 of
this year, but I don't think it went anywhere.


-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 19:10                                         ` Ulrich Drepper
  2006-08-12 19:28                                           ` Jakub Jelinek
  2006-08-14  7:02                                           ` Suparna Bhattacharya
@ 2006-09-04 14:36                                           ` Sébastien Dugué
  2 siblings, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-09-04 14:36 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: suparna, Badari Pulavarty, Zach Brown, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio

On Sat, 2006-08-12 at 12:10 -0700, Ulrich Drepper wrote:
> Suparna Bhattacharya wrote:
> > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> > part of the ABI, but only internal to the kernel implementation. I think
> > Zach had suggested inferring THREAD_ID notification if the pid specified
> > is not zero. But, I don't see why ->sigev_notify couldn't used directly
> > (just like the POSIX timers code does) thus doing away with the 
> > new constants altogether. Sebestian/Laurent, do you recall?
> 
> I suggest to model the implementation after the timer code which does
> exactly what we need.
> 

  Will do.

> 
> > I'm guessing they are being used for validation of permissions at the time
> > of sending the signal, but maybe saving the task pointer in the iocb instead
> > of the pid would suffice ?
> 
> Why should any verification be necessary?  The requests are generated in
> the same process which will receive the notification.  Even if the POSIX
> process (aka, kernel process group) changes the IDs the notifications
> should be set.  The key is that notifications cannot be sent to another
> POSIX process.
> 
> Adding this as a feature just makes things so much more complicated.
> 

  Agreed.

  Sébastien.


-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
  2006-08-12 18:29                                       ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
  2006-08-12 19:10                                         ` Ulrich Drepper
@ 2006-09-04 14:28                                         ` Sébastien Dugué
  1 sibling, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-09-04 14:28 UTC (permalink / raw)
  To: suparna
  Cc: Ulrich Drepper,
	=?iso-8859-1?Q?S=E9bastien_Dugu=E9_=3Csebastien=2Edugue=40bull=2Enet?=.=?iso-8859-1?Q?=3E?=,
	Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, linux-aio, Benjamin LaHaise

  Hi,

  just came back from vacation, sorry for the delay.

On Sat, 2006-08-12 at 23:59 +0530, Suparna Bhattacharya wrote:
> BTW, if anyone would like to be dropped off this growing cc list, please
> let us know.
> 
> On Fri, Aug 11, 2006 at 12:45:55PM -0700, Ulrich Drepper wrote:
> > Sébastien Dugué wrote:
> > > 		     aio completion notification
> > 
> > I looked over this now but I don't think I understand everything.  Or I
> > don't see how it all is integrated.  And no, I'm not looking at the
> > proposed glibc code since would mean being tainted.
> 
> Oh, I didn't realise that. 
> I'll make an attempt to clarify parts that I understand based on what I
> have gleaned from my reading of the code and intent, but hopefully Sebastien,
> Ben, Zach et al will be able to pitch in for a more accurate and complete
> picture.
> 
> > 
> > 
> > > Details:
> > > -------
> > > 
> > >   A struct sigevent *aio_sigeventp is added to struct iocb in
> > > include/linux/aio_abi.h
> > > 
> > >   An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
> > > include/linux/aio.h:
> > > 
> > > 	- IO_NOTIFY_SIGNAL means that the signal is to be sent to the
> > > 	  requesting thread 
> > > 
> > > 	- IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
> > > 	  specifi thread.
> > 
> > This has been proved to be sufficient in the timer code which basically
> > has the same problem.  But why do you need separate constants?  We have
> > the various SIGEV_* constants, among them SIGEV_THREAD_ID.  Just use
> > these constants for the values of ki_notify.
> > 
> 
> I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
> part of the ABI, but only internal to the kernel implementation. I think
> Zach had suggested inferring THREAD_ID notification if the pid specified
> is not zero. But, I don't see why ->sigev_notify couldn't used directly
> (just like the POSIX timers code does) thus doing away with the 
> new constants altogether. Sebestian/Laurent, do you recall?

  As I see it, those IO_NOTIFY_* constants are uneeded and we could use
->sigev_notify directly. I will change this so that we use the same
mechanism as the POSIX timers code.

> 
> > 
> > >   The following fields are added to struct kiocb in include/linux/aio.h:
> > > 
> > > 	- pid_t ki_pid: target of the signal
> > > 
> > > 	- __u16 ki_signo: signal number
> > > 
> > > 	- __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
> > > 			   IO_NOTIFY_THREAD_ID
> > > 
> > > 	- uid_t ki_uid, ki_euid: filled with the submitter credentials
> > 
> > These two fields aren't needed for the POSIX interfaces.  Where does the
> > requirement come from?  I don't say they should be removed, they might
> > be useful, but if the costs are non-negligible then they could go away.
> 
> I'm guessing they are being used for validation of permissions at the time
> of sending the signal, but maybe saving the task pointer in the iocb instead
> of the pid would suffice ?

  IIRC, Ben added these for that exact reason. Is this really needed?
Ben?

> 
> > 
> > 
> > > 	- check whether the submitting thread wants to be notified directly
> > > 	  (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent
> > > 	  to another thread.
> > > 	  In the latter case a check is made to assert that the target thread
> > > 	  is in the same thread group
> > 
> > Is this really how it's implemented?  This is not how it should be.
> > Either a signal is sent to a specific thread in the same process (this
> > is what SIGEV_THREAD_ID is for) or the signal is sent to a calling
> > process.  Sending a signal to the process means that from the kernel's
> > POV any thread which doesn't have the signal blocked can receive it.
> > The final decision is made by the kernel.  There is no mechanism to send
> > the signal to another process.
> 
> The code seems to be set up to call specific_send_sig_info() in the case
> of *_THREAD_ID , and __group_send_sig_info() otherwise. So I think the
> intended behaviour is as you describe it should be (__group_send_sig_info
> does the equivalent of sending a signal to the process and so any thread
> which doesn't have signals blocked can receive it, while specific_send_sig_info
> sends it to a particular thread). 
> 
> But, I should really leave it to Sebestian to confirm that.

  That's right, but I think that part needs to be reworked to follow
the same logic as the POSIX timers.


> > > 			    listio support
> > > 
> > 
> > I really don't understand the kernel interface for this feature.
> 
> I'm sorry this is confusing. This probably means that we need to
> separate the external interface description more clearly and completely
> from the internals.
> 
> > 
> > 
> > > Details:
> > > -------
> > > 
> > >   An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
> > > 
> > >   A struct lio_event is added in include/linux/aio.h
> > > 
> > >   A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h
> > 
> > So you have a pointer in the structure for the individual requests.  I
> > assume you use the atomic counter to trigger the final delivery.  I
> > further assume that if lio_wait is set the calling thread is suspended
> > until all requests are handled and that the final notification in this
> > case means that thread gets woken.
> > 
> > This is all fine.
> > 
> > But how do you pass the requests to the kernel?  If you have a new
> > lio_listio-like syscall it'll be easy.  But I haven't seen anything like
> > this mentioned.
> > 
> > The alternative is to pass the requests one-by-one in which case I don't
> > see how you create the reference to the lio_listio control block.  This
> > approach seems to be slower.
> 
> The way it works (and better ideas are welcome) is that, since the io_submit()
> syscall already accepts an array of iocbs[], no new syscall was introduced.
> To implement lio_listio, one has to set up such an array, with the first iocb
> in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which
> specifies the sigev notification to be associated with group completion
> (a NULL value of the sigev notification pointer would imply equivalent of
> LIO_WAIT). The following iocbs in the array should correspond to the set of
> listio aiocbs. Whenever it encounters an IOCB_CMD_GROUP iocb opcode, the
> kernel would interpret all subsequent iocbs[] submitted in the same
> io_submit() call to be associated with the same lio control block. 
> 
> Does that clarify ?
> 
> Would an example help ?
> 
> > 
> > If all requests are passed at once, do you have the equivalent of
> > LIO_NOP entries?

  So far, LIO_NOP entries are pruned by the support library 
(libposix-aio) and never sent to the kernel.
> > 
> 
> Good question - we do have an IOCB_CMD_NOOP defined, and I seem to even
> recall a patch that implemented it, but am wondering if it ever got merged.
> Ben/Zach ?
> 
> > 
> > How can we support the extension where we wait for a number of requests
> > which need not be all of them.  I.e., I submit N requests and want to be
> > notified when at least M (M <= N) notified.  I am not yet clear about
> > the actual semantics we should implement (e.g., do we send another
> > notification after the first one?) but it's something which IMO should
> > be taken into account in the design.
> > 
> 
> My thought here was that it should be possible to include M as a parameter
> to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control
> block ... then whatever semantics are agreed upon can be implemented.
> 
> > 
> > Finally, and this is very important, does you code send out the
> > individual requests notification and then in the end the lio_listio
> > completion?  I think Suparna wrote this is the case but I want to make sure.
> 
> Sebestian, could you confirm ?

  If (and only if) the user did setup a sigevent for one or more
individual requests then those requests completion will trigger a
notification and in the end the list completion notification is sent. 
Otherwise, only the list completion notification is sent.


-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:44                               ` Ulrich Drepper
  2006-07-27 21:02                                 ` Badari Pulavarty
@ 2006-07-28  7:29                                 ` Sébastien Dugué
  2006-07-31 10:11                                 ` Suparna Bhattacharya
  2 siblings, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-28  7:29 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote:
> Badari Pulavarty wrote:
> > Before we spend too much time cleaning up and merging into mainline -
> > I would like an agreement that what we add is good enough for glibc
> > POSIX AIO.
> 
> I haven't seen a description of the interface so far.  Would be good if
> it existed.  But I briefly mentioned one quirk in the interface about
> which Suparna wasn't sure whether it's implemented/implementable in the
> current interface.
> 
> If a lio_listio call is made the individual requests are handle just as
> if they'd be issue separately.  I.e., the notification specified in the
> individual aiocb is performed when the specific request is done.  Then,
> once all requests are done, another notification is made, this time
> controlled by the sigevent parameter if lio_listio.
> 
> 
> Another feature which I always wanted: the current lio_listio call
> returns in blocking mode only if all requests are done.  In non-blocking
> mode it returns immediately and the program needs to poll the aiocbs.
> What is needed is something in the middle.  For instance, if multiple
> read requests are issued the program might be able to start working as
> soon as one request is satisfied.  I.e., a call similar to lio_listio
> would be nice which also takes another parameter specifying how many of
> the NENT aiocbs have to finish before the call returns.

  You're right here, that definitely would be a plus.


-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:44                               ` Ulrich Drepper
  2006-07-27 21:02                                 ` Badari Pulavarty
  2006-07-28  7:29                                 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué
@ 2006-07-31 10:11                                 ` Suparna Bhattacharya
  2 siblings, 0 replies; 73+ messages in thread
From: Suparna Bhattacharya @ 2006-07-31 10:11 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Badari Pulavarty, Zach Brown,
	=?iso-8859-1?Q?S=E9bastien_Dugu=E9_=3Csebastien=2Edugue=40bull=2Enet?=.=?iso-8859-1?Q?=3E?=,
	Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev

On Thu, Jul 27, 2006 at 11:44:23AM -0700, Ulrich Drepper wrote:
> Badari Pulavarty wrote:
> > Before we spend too much time cleaning up and merging into mainline -
> > I would like an agreement that what we add is good enough for glibc
> > POSIX AIO.
> 
> I haven't seen a description of the interface so far.  Would be good if

Did Sébastien's mail with the description help ? 

> it existed.  But I briefly mentioned one quirk in the interface about
> which Suparna wasn't sure whether it's implemented/implementable in the
> current interface.
> 
> If a lio_listio call is made the individual requests are handle just as
> if they'd be issue separately.  I.e., the notification specified in the
> individual aiocb is performed when the specific request is done.  Then,
> once all requests are done, another notification is made, this time
> controlled by the sigevent parameter if lio_listio.

Looking at the code in lio kernel patch, this should be already covered:

        if (iocb->ki_signo)
                __aio_send_signal(iocb);

+       if (iocb->ki_lio)
+               lio_check(iocb->ki_lio);

That is, it first checks the notification in the individual iocb, and then
the one for the LIO.

> 
> 
> Another feature which I always wanted: the current lio_listio call
> returns in blocking mode only if all requests are done.  In non-blocking
> mode it returns immediately and the program needs to poll the aiocbs.
> What is needed is something in the middle.  For instance, if multiple
> read requests are issued the program might be able to start working as
> soon as one request is satisfied.  I.e., a call similar to lio_listio
> would be nice which also takes another parameter specifying how many of
> the NENT aiocbs have to finish before the call returns.

I imagine the kernel could enable this by incorporating this additional
parameter for IOCB_CMD_GROUP in the ABI (in the default case this should be the
same as the total number of iocbs submitted to lio_listio). Now should the
at least NENT check apply only to LIO_WAIT or also to the LIO_NOWAIT
notification case ? 

BTW, the native io_getevents does support a min_nr wakeup already, except that
it applies to any iocb on the io_context, and not just a given lio_listio call.

Regards
Suparna


-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 15:28                         ` Badari Pulavarty
  2006-07-27 18:14                           ` Zach Brown
@ 2006-07-28  7:26                           ` Sébastien Dugué
  1 sibling, 0 replies; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-28  7:26 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml,
	David Miller, netdev, Suparna Bhattacharya

On Thu, 2006-07-27 at 08:28 -0700, Badari Pulavarty wrote:
> Sébastien Dugué wrote:
> > On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote:
> >   
> >> Ulrich Drepper wrote:
> >>     
> >>> Christoph Hellwig wrote:
> >>>   
> >>>       
> >>>>> My personal opinion on existing AIO is that it is not the right design.
> >>>>> Benjamin LaHaise agree with me (if I understood him right),
> >>>>>       
> >>>>>           
> >>>> I completely agree with that aswell.
> >>>>     
> >>>>         
> >>> I agree, too, but the current code is not the last of the line.  Suparna
> >>> has a st of patches which make the current kernel aio code work much
> >>> better and especially make it really usable to implement POSIX AIO.
> >>>
> >>> In Ottawa we were talking about submitting it and Suparna will.  We just
> >>> thought about a little longer timeframe.  I guess it could be
> >>> accelerated since he mostly has the patch done.  But I don't know her
> >>> schedule.
> >>>
> >>> Important here is, don't base any decision on the current aio
> >>> implementation.
> >>>   
> >>>       
> >> Ulrich,
> >>
> >> Suparna mentioned your interest in making POSIX glibc aio work with 
> >> kernel-aio at OLS.
> >> We thought taking a re-look at the (kernel side) work BULL did, would be 
> >> a nice starting
> >> point. I re-based those patches to 2.6.18-rc2 and sent it to Zach Brown 
> >> for review before
> >> sending them out to list.
> >>
> >> These patches does NOT make AIO any cleaner. All they do is add 
> >> functionality to support
> >> POSIX AIO easier. These are
> >>
> >> [ PATCH 1/3 ]  Adding signal notification for event completion
> >>
> >> [ PATCH 2/3 ]  lio (listio) completion semantics
> >>
> >> [ PATCH 3/3 ] cancel_fd support
> >>     
> >
> >   Badari,
> >
> >   Thanks for refreshing those patches, they have been sitting here
> > for quite some time now and collected dust.
> >
> >   I also think Suparna's patchset for doing buffered AIO would be
> > a real plus here.
> >
> >   
> >> Suparna explained these in the following article:
> >>
> >> http://lwn.net/Articles/148755/
> >>
> >> If you think, this is a reasonable direction/approach for the kernel and 
> >> you would take care
> >> of glibc side of things - I can spend time on these patches, getting 
> >> them to reasonable shape
> >> and push for inclusion.
> >>     
> >
> >   Ulrich, I you want to have a look at how those patches are put to
> > use in libposix-aio, have a look at http://sourceforge.net/projects/paiol.
> >
> >   It could be a starting point for glibc.
> >
> >   Thanks,
> >
> >   Sébastien.
> >
> >   
> Sebastien,
> 
> Suparna mentioned at Ulrich wants us to concentrate on kernel-side 
> support, so that he
> can look at glibc side of things (along with other work he is already 
> doing). So, if we
> can get an agreement on what kind of kernel support is needed - we can 
> focus our
> efforts on kernel side first and leave glibc enablement to capable hands 
> of Uli :)
> 

  That's fine with me. 

  Sébastien.

-- 
-----------------------------------------------------

  Sébastien Dugué                BULL/FREC:B1-247
  phone: (+33) 476 29 77 70      Bullcom: 229-7770

  mailto:sebastien.dugue@bull.net

  Linux POSIX AIO: http://www.bullopensource.org/posix
                   http://sourceforge.net/projects/paiol

-----------------------------------------------------


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-07-26  9:18       ` [1/4] kevent: core files Evgeniy Polyakov
  2006-07-26  9:18         ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
@ 2006-07-26 10:31         ` Andrew Morton
  2006-07-26 10:37           ` Evgeniy Polyakov
  2006-07-26 10:44         ` Evgeniy Polyakov
  2 siblings, 1 reply; 73+ messages in thread
From: Andrew Morton @ 2006-07-26 10:31 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, davem, drepper, netdev

On Wed, 26 Jul 2006 13:18:15 +0400
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> +static int kevent_ctl_process(struct file *file, 
> +		struct kevent_user_control *ctl, void __user *arg)
> +{
> +	int err;
> +	struct kevent_user *u = file->private_data;
> +
> +	if (!u)
> +		return -EINVAL;
> +
> +	switch (ctl->cmd) {
> +		case KEVENT_CTL_ADD:
> +			err = kevent_user_ctl_add(u, ctl, 
> +					arg+sizeof(struct kevent_user_control));
> +			break;
> +		case KEVENT_CTL_REMOVE:
> +			err = kevent_user_ctl_remove(u, ctl, 
> +					arg+sizeof(struct kevent_user_control));
> +			break;
> +		case KEVENT_CTL_MODIFY:
> +			err = kevent_user_ctl_modify(u, ctl, 
> +					arg+sizeof(struct kevent_user_control));
> +			break;
> +		case KEVENT_CTL_WAIT:
> +			err = kevent_user_wait(file, u, ctl, arg);
> +			break;
> +		case KEVENT_CTL_INIT:
> +			err = kevent_ctl_init();
> +		default:
> +			err = -EINVAL;
> +			break;
> +	}
> +
> +	return err;
> +}

Please indent the body of the switch one tabstop to the left.

> +asmlinkage long sys_kevent_ctl(int fd, void __user *arg)
> +{
> +	int err, fput_needed;
> +	struct kevent_user_control ctl;
> +	struct file *file;
> +
> +	if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control)))
> +		return -EINVAL;
> +
> +	if (ctl.cmd == KEVENT_CTL_INIT)
> +		return kevent_ctl_init();
> +
> +	file = fget_light(fd, &fput_needed);
> +	if (!file)
> +		return -ENODEV;
> +
> +	err = kevent_ctl_process(file, &ctl, arg);
> +
> +	fput_light(file, fput_needed);
> +	return err;
> +}

If the user passes this an fd which was obtained via means other than
kevent_ctl_init(), the kernel will explode.  Do

	if (file->f_fop != &kevent_user_fops)
		return -EINVAL;


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-07-26 10:31         ` [1/4] kevent: core files Andrew Morton
@ 2006-07-26 10:37           ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, davem, drepper, netdev

On Wed, Jul 26, 2006 at 03:31:05AM -0700, Andrew Morton (akpm@osdl.org) wrote:
> Please indent the body of the switch one tabstop to the left.
..
> If the user passes this an fd which was obtained via means other than
> kevent_ctl_init(), the kernel will explode.  Do
> 
> 	if (file->f_fop != &kevent_user_fops)
> 		return -EINVAL;

Thanks, I will implement both.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-07-26  9:18       ` [1/4] kevent: core files Evgeniy Polyakov
  2006-07-26  9:18         ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
  2006-07-26 10:31         ` [1/4] kevent: core files Andrew Morton
@ 2006-07-26 10:44         ` Evgeniy Polyakov
  2 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:44 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> +struct kevent *kevent_alloc(gfp_t mask)
> +{
> +	struct kevent *k;
> +	
> +	if (kevent_cache)
> +		k = kmem_cache_alloc(kevent_cache, mask);
> +	else
> +		k = kzalloc(sizeof(struct kevent), mask);
> +
> +	return k;
> +}
> +

Sorry for that.
It is fixed already to always use cache, but I forget to commit that
change before I created pachset.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-26  6:28   ` Evgeniy Polyakov
  2006-07-26  9:18     ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
@ 2006-07-27  6:10     ` David Miller
  2006-07-27  7:49       ` Evgeniy Polyakov
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2006-07-27  6:10 UTC (permalink / raw)
  To: johnpol; +Cc: drepper, linux-kernel, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Wed, 26 Jul 2006 10:28:17 +0400

> I have not created additional DMA memory allocation methods, like
> Ulrich described in his article, so I handle it inside NAIO which
> has some overhead (I posted get_user_pages() sclability graph some
> time ago).

I've been thinking about this aspect, and I think it's very
interesting.  Let's be clear what the ramifications of this
are first.

Using the terminology of Network Algorithmics, this is an
instance of Principle 2, "Shift computation in time".

Instead of using get_user_pages() at AIO setup, we instead map the
thing to userspace later when the user wants it.  Pinning pages is a
pain because both user and kernel refer to the buffer at the same
time.  We get more flexibility when the user has to map the thing
explicitly.

I want us to think about how a user might want to use this.  What
I anticipate is that users will want to organize a pool of AIO
buffers for themselves using this DMA interface.  So the events
they are truly interested in are of a finer granularity than you
might expect.  They want to know when pieces of a buffer are
available for reuse.

And here is the core dilemma.

If you make the event granularity too coarse, a larger AIO buffer
pool is necessary.  If you make the event granuliary too fine,
event processing begins to dominate, and costs too much.  This is
true even for something as light weight as kevent.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  6:10     ` async network I/O, event channels, etc David Miller
@ 2006-07-27  7:49       ` Evgeniy Polyakov
  2006-07-27  8:02         ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-27  7:49 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, netdev

On Wed, Jul 26, 2006 at 11:10:55PM -0700, David Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Wed, 26 Jul 2006 10:28:17 +0400
> 
> > I have not created additional DMA memory allocation methods, like
> > Ulrich described in his article, so I handle it inside NAIO which
> > has some overhead (I posted get_user_pages() sclability graph some
> > time ago).
> 
> I've been thinking about this aspect, and I think it's very
> interesting.  Let's be clear what the ramifications of this
> are first.
> 
> Using the terminology of Network Algorithmics, this is an
> instance of Principle 2, "Shift computation in time".
> 
> Instead of using get_user_pages() at AIO setup, we instead map the
> thing to userspace later when the user wants it.  Pinning pages is a
> pain because both user and kernel refer to the buffer at the same
> time.  We get more flexibility when the user has to map the thing
> explicitly.

I.e. map skb's data to userspace? Not a good idea especially with it's
tricky lifetime and unability for userspace to inform kernel when it
finished and skb can be freed (without additional syscall).
I did it with af_tlb zero-copy sniffer (but I substitute mapped pages
with physical skb->data pages), and it was not very good.

> I want us to think about how a user might want to use this.  What
> I anticipate is that users will want to organize a pool of AIO
> buffers for themselves using this DMA interface.  So the events
> they are truly interested in are of a finer granularity than you
> might expect.  They want to know when pieces of a buffer are
> available for reuse.

Ah, I see.
Well, I think preallocate some buffers and use that in AIO setup is a
plus, since in that case user does not care about when it is possible to
reuse the same buffer - when appropriate kevent is completed, that means
that provided buffer is no longer in use by kernel, and user can reuse
it.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  7:49       ` Evgeniy Polyakov
@ 2006-07-27  8:02         ` David Miller
  2006-07-27  8:09           ` Jens Axboe
  2006-07-27  8:58           ` Evgeniy Polyakov
  0 siblings, 2 replies; 73+ messages in thread
From: David Miller @ 2006-07-27  8:02 UTC (permalink / raw)
  To: johnpol; +Cc: drepper, linux-kernel, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 27 Jul 2006 11:49:02 +0400

> I.e. map skb's data to userspace? Not a good idea especially with it's
> tricky lifetime and unability for userspace to inform kernel when it
> finished and skb can be freed (without additional syscall).

Hmmm...

If it is paged based, I do not see the problem.  Events and calls to
AIO I/O routines make transfer of buffer ownership.  The fact that
while kernel (and thus networking stack) "owns" the buffer for an AIO
call, the user can have a valid mapping to it is a unimportant detail.

If the user will scramble a piece of data that is in flight to or from
the network card, it is his problem.

If we are using a primitive network card that does not support
scatter-gather I/O and thus not page based SKBs, we will make
copies.  But this is transparent to the user.

The idea is that DMA mappings have page granularity.

At least on transmit it should work well.  Receive side is more
difficult and initial implementation will need to copy.

> I did it with af_tlb zero-copy sniffer (but I substitute mapped pages
> with physical skb->data pages), and it was not very good.

Trying to be too clever with skb->data has always been catastrophic. :)

> Well, I think preallocate some buffers and use that in AIO setup is a
> plus, since in that case user does not care about when it is possible to
> reuse the same buffer - when appropriate kevent is completed, that means
> that provided buffer is no longer in use by kernel, and user can reuse
> it.

We now enter the most interesting topic of AIO buffer pool management
and where it belongs. :-)  We are assuming up to this point that the
user manages this stuff with explicit DMA calls for allocation, then
passes the key based references to those buffers as arguments to the
AIO I/O calls.

But I want to suggest another possibility.  What if the kernel managed
the AIO buffer pool for a task?  It could grow this dynamically based
upon need.  The only implementation road block is how large to we
allow this to grow, but I think normal VM mechanisms can take care
of it.

On transmit this is not straightforward, but for receive it has really
nice possibilities. :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:02         ` David Miller
@ 2006-07-27  8:09           ` Jens Axboe
  2006-07-27  8:11             ` Jens Axboe
  2006-07-27  8:58           ` Evgeniy Polyakov
  1 sibling, 1 reply; 73+ messages in thread
From: Jens Axboe @ 2006-07-27  8:09 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev

On Thu, Jul 27 2006, David Miller wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 27 Jul 2006 11:49:02 +0400
> 
> > I.e. map skb's data to userspace? Not a good idea especially with it's
> > tricky lifetime and unability for userspace to inform kernel when it
> > finished and skb can be freed (without additional syscall).
> 
> Hmmm...
> 
> If it is paged based, I do not see the problem.  Events and calls to
> AIO I/O routines make transfer of buffer ownership.  The fact that
> while kernel (and thus networking stack) "owns" the buffer for an AIO
> call, the user can have a valid mapping to it is a unimportant detail.

Ownership may be clear, but "when can I reuse" is tricky. The same issue
comes up for vmsplice -> splice to socket.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:09           ` Jens Axboe
@ 2006-07-27  8:11             ` Jens Axboe
  2006-07-27  8:20               ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Jens Axboe @ 2006-07-27  8:11 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev

On Thu, Jul 27 2006, Jens Axboe wrote:
> On Thu, Jul 27 2006, David Miller wrote:
> > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> > Date: Thu, 27 Jul 2006 11:49:02 +0400
> > 
> > > I.e. map skb's data to userspace? Not a good idea especially with it's
> > > tricky lifetime and unability for userspace to inform kernel when it
> > > finished and skb can be freed (without additional syscall).
> > 
> > Hmmm...
> > 
> > If it is paged based, I do not see the problem.  Events and calls to
> > AIO I/O routines make transfer of buffer ownership.  The fact that
> > while kernel (and thus networking stack) "owns" the buffer for an AIO
> > call, the user can have a valid mapping to it is a unimportant detail.
> 
> Ownership may be clear, but "when can I reuse" is tricky. The same issue
> comes up for vmsplice -> splice to socket.

Ownership transition from user -> kernel that is, what I'm trying to say
that returning ownership to the user again is the tricky part.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:11             ` Jens Axboe
@ 2006-07-27  8:20               ` David Miller
  2006-07-27  8:29                 ` Jens Axboe
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2006-07-27  8:20 UTC (permalink / raw)
  To: axboe; +Cc: johnpol, drepper, linux-kernel, netdev

From: Jens Axboe <axboe@suse.de>
Date: Thu, 27 Jul 2006 10:11:15 +0200

> Ownership transition from user -> kernel that is, what I'm trying to say
> that returning ownership to the user again is the tricky part.

Yes, it is important that for TCP, for example, we don't give
the user the event until the data is acknowledged and the skb's
referencing that data are fully freed.

This is further complicated by the fact that packetization boundaries
are going to be different from AIO buffer boundaries.

I think this is what you are alluding to.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:20               ` David Miller
@ 2006-07-27  8:29                 ` Jens Axboe
  2006-07-27  8:37                   ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Jens Axboe @ 2006-07-27  8:29 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev

On Thu, Jul 27 2006, David Miller wrote:
> From: Jens Axboe <axboe@suse.de>
> Date: Thu, 27 Jul 2006 10:11:15 +0200
> 
> > Ownership transition from user -> kernel that is, what I'm trying to say
> > that returning ownership to the user again is the tricky part.
> 
> Yes, it is important that for TCP, for example, we don't give
> the user the event until the data is acknowledged and the skb's
> referencing that data are fully freed.
> 
> This is further complicated by the fact that packetization boundaries
> are going to be different from AIO buffer boundaries.
> 
> I think this is what you are alluding to.

Precisely. And this is the bit that is currently still broken for
splice-to-socket, since it gives that ack right after ->sendpage() has
been called. But that's a known deficiency right now, I think Alexey is
currently looking at that (as well as receive side support).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:29                 ` Jens Axboe
@ 2006-07-27  8:37                   ` David Miller
  2006-07-27  8:39                     ` Jens Axboe
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2006-07-27  8:37 UTC (permalink / raw)
  To: axboe; +Cc: johnpol, drepper, linux-kernel, netdev

From: Jens Axboe <axboe@suse.de>
Date: Thu, 27 Jul 2006 10:29:24 +0200

> Precisely. And this is the bit that is currently still broken for
> splice-to-socket, since it gives that ack right after ->sendpage() has
> been called. But that's a known deficiency right now, I think Alexey is
> currently looking at that (as well as receive side support).

That's right, I was discussing this with him just a few days ago.

It's good to hear that he's looking at those patches you were working
on several months ago.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:37                   ` David Miller
@ 2006-07-27  8:39                     ` Jens Axboe
  0 siblings, 0 replies; 73+ messages in thread
From: Jens Axboe @ 2006-07-27  8:39 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev

On Thu, Jul 27 2006, David Miller wrote:
> From: Jens Axboe <axboe@suse.de>
> Date: Thu, 27 Jul 2006 10:29:24 +0200
> 
> > Precisely. And this is the bit that is currently still broken for
> > splice-to-socket, since it gives that ack right after ->sendpage() has
> > been called. But that's a known deficiency right now, I think Alexey is
> > currently looking at that (as well as receive side support).
> 
> That's right, I was discussing this with him just a few days ago.
> 
> It's good to hear that he's looking at those patches you were working
> on several months ago.

It is. I never ventured much into the networking part, just noted that
as a current limitation with the ->sendpage() based approach. Basically
we need to pass more info in, which also gets rid of the limitation of
passing a single page at the time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:02         ` David Miller
  2006-07-27  8:09           ` Jens Axboe
@ 2006-07-27  8:58           ` Evgeniy Polyakov
  2006-07-27  9:31             ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-27  8:58 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, netdev

On Thu, Jul 27, 2006 at 01:02:55AM -0700, David Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 27 Jul 2006 11:49:02 +0400
> 
> > I.e. map skb's data to userspace? Not a good idea especially with it's
> > tricky lifetime and unability for userspace to inform kernel when it
> > finished and skb can be freed (without additional syscall).
> 
> Hmmm...
> 
> If it is paged based, I do not see the problem.  Events and calls to
> AIO I/O routines make transfer of buffer ownership.  The fact that
> while kernel (and thus networking stack) "owns" the buffer for an AIO
> call, the user can have a valid mapping to it is a unimportant detail.
> 
> If the user will scramble a piece of data that is in flight to or from
> the network card, it is his problem.
> 
> If we are using a primitive network card that does not support
> scatter-gather I/O and thus not page based SKBs, we will make
> copies.  But this is transparent to the user.
> 
> The idea is that DMA mappings have page granularity.
> 
> At least on transmit it should work well.  Receive side is more
> difficult and initial implementation will need to copy.

And what if several skb->data are placed on the same page?
Or do we want to allocate at least one page for one skb? 
Even if it is an 40 bytes ack?

> > I did it with af_tlb zero-copy sniffer (but I substitute mapped pages
> > with physical skb->data pages), and it was not very good.
> 
> Trying to be too clever with skb->data has always been catastrophic. :)

Yep :)

> > Well, I think preallocate some buffers and use that in AIO setup is a
> > plus, since in that case user does not care about when it is possible to
> > reuse the same buffer - when appropriate kevent is completed, that means
> > that provided buffer is no longer in use by kernel, and user can reuse
> > it.
> 
> We now enter the most interesting topic of AIO buffer pool management
> and where it belongs. :-)  We are assuming up to this point that the
> user manages this stuff with explicit DMA calls for allocation, then
> passes the key based references to those buffers as arguments to the
> AIO I/O calls.
> 
> But I want to suggest another possibility.  What if the kernel managed
> the AIO buffer pool for a task?  It could grow this dynamically based
> upon need.  The only implementation road block is how large to we
> allow this to grow, but I think normal VM mechanisms can take care
> of it.
> 
> On transmit this is not straightforward, but for receive it has really
> nice possibilities. :)

Btw, according to DMA allocations - there are some problems here too.
Some pieces of the world can not dma behind 16mb, and someone can do it
over 4gb. If only 16mb are used, it is just 8k packets with 1500 MTU,
and actually userspace does not know which NIC receives it's data, so it
is impossible in advance to allocate some pool, which will be used for
dma transfer, so we just need to allocate physical pages and use them
with memcpy() from skb->data.

Those physical pages can be managed within kernel and userspace can map
them. But there is another possibility - replace slab allocation for
network devices with allocation from premapped pool.
That naturally allows to manage that pool for AIO needs and have
zero-copy sending and receiving support. That is what I talked in
netchannel topic when question about allocation/freeing cost in atomic
context arised. I work on that solution, which can be used both for
netchannels (and full userspace processing) and usual networking code.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  8:58           ` Evgeniy Polyakov
@ 2006-07-27  9:31             ` David Miller
  2006-07-27  9:37               ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: David Miller @ 2006-07-27  9:31 UTC (permalink / raw)
  To: johnpol; +Cc: drepper, linux-kernel, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 27 Jul 2006 12:58:13 +0400

> Btw, according to DMA allocations - there are some problems here too.
> Some pieces of the world can not dma behind 16mb, and someone can do it
> over 4gb.

I think people take this "DMA" in Ulrich's interface names too
literally.  It is logically something different, although it could be
used directly for this purpose.

View it rather as memory you have by some key based ID, but need to
explicitly map to access directly.

> Those physical pages can be managed within kernel and userspace can map
> them. But there is another possibility - replace slab allocation for
> network devices with allocation from premapped pool.
> That naturally allows to manage that pool for AIO needs and have
> zero-copy sending and receiving support. That is what I talked in
> netchannel topic when question about allocation/freeing cost in atomic
> context arised. I work on that solution, which can be used both for
> netchannels (and full userspace processing) and usual networking code.

Interesting idea, and yes I have been watching you stress test your
AVL tree code :))

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: async network I/O, event channels, etc
  2006-07-27  9:31             ` David Miller
@ 2006-07-27  9:37               ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-27  9:37 UTC (permalink / raw)
  To: David Miller; +Cc: drepper, linux-kernel, netdev

On Thu, Jul 27, 2006 at 02:31:56AM -0700, David Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 27 Jul 2006 12:58:13 +0400
> 
> > Btw, according to DMA allocations - there are some problems here too.
> > Some pieces of the world can not dma behind 16mb, and someone can do it
> > over 4gb.
> 
> I think people take this "DMA" in Ulrich's interface names too
> literally.  It is logically something different, although it could be
> used directly for this purpose.
> 
> View it rather as memory you have by some key based ID, but need to
> explicitly map to access directly.

I mean here, that it is possible to have those Ulrich's dma regions to
be used as a real dma regions, and showed that it is not a good idea.

> > Those physical pages can be managed within kernel and userspace can map
> > them. But there is another possibility - replace slab allocation for
> > network devices with allocation from premapped pool.
> > That naturally allows to manage that pool for AIO needs and have
> > zero-copy sending and receiving support. That is what I talked in
> > netchannel topic when question about allocation/freeing cost in atomic
> > context arised. I work on that solution, which can be used both for
> > netchannels (and full userspace processing) and usual networking code.
> 
> Interesting idea, and yes I have been watching you stress test your
> AVL tree code :))

Tests are completed - actually it required 12 a4 papers filled with
small circles and numbers to prove it is correct, overnight run was just
for clarifications :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [1/1] Kevent subsystem.
@ 2006-06-22 17:14 Evgeniy Polyakov
  2006-06-23  7:09 ` [1/4] kevent: core files Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-22 17:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 1157 bytes --]

Hello.

Kevent subsystem incorporates several AIO/kqueue design notes and ideas.
Kevent can be used both for edge and level notifications. It supports
socket notifications, network AIO (aio_send(), aio_recv() and 
aio_sendfile()), inode notifications (create/remove),
generic poll()/select() notifications and timer notifications.

It was tested against FreeBSD kqueue and Linux epoll and showed
noticeble performance win.

Network asynchronous IO operations were tested against Linux synchronous 
socket code and showed noticeble performance win.

Patch against linux-2.6.17-git tree attached (gzipped).
I would like to hear some comments about the overall design,
implementation and plans about it's usefullness for generic kernel.

Design notes, patches, userspace application and perfomance tests can be
found at project's homepages.

1. Kevent subsystem.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. Network AIO.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

3. LWN article about kevent.
http://lwn.net/Articles/172844/

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

Thank you.

-- 
	Evgeniy Polyakov

[-- Attachment #2: kevent-2.6.17-git.diff.gz --]
[-- Type: application/x-gunzip, Size: 24054 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [1/4] kevent: core files.
  2006-06-22 17:14 [1/1] Kevent subsystem Evgeniy Polyakov
@ 2006-06-23  7:09 ` Evgeniy Polyakov
  2006-06-23 18:44   ` Benjamin LaHaise
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23  7:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

This patch includes core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialisation
 - notification state machines

It might also inlclude parts from other subsystem (like network related
syscalls so it is possible that it will not compile without other
patches applied).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index af56987..93e23ff 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -316,3 +316,7 @@ ENTRY(sys_call_table)
 	.long sys_sync_file_range
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
+	.long sys_aio_recv
+	.long sys_aio_send
+	.long sys_aio_sendfile
+	.long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5a92fed..534d516 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -696,4 +696,8 @@ #endif
 	.quad sys_sync_file_range
 	.quad sys_tee
 	.quad compat_sys_vmsplice
+	.quad sys_aio_recv
+	.quad sys_aio_send
+	.quad sys_aio_sendfile
+	.quad sys_kevent_ctl
 ia32_syscall_end:		
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index de2ccc1..52f8642 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -322,10 +322,14 @@ #define __NR_splice		313
 #define __NR_sync_file_range	314
 #define __NR_tee		315
 #define __NR_vmsplice		316
+#define __NR_aio_recv		317
+#define __NR_aio_send		318
+#define __NR_aio_sendfile	319
+#define __NR_kevent_ctl		320
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 317
+#define NR_syscalls 321
 
 /*
  * user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/socket.h b/include/asm-x86_64/socket.h
index f2cdbea..1f31f86 100644
--- a/include/asm-x86_64/socket.h
+++ b/include/asm-x86_64/socket.h
@@ -49,4 +49,6 @@ #define SO_ACCEPTCONN		30
 
 #define SO_PEERSEC             31
 
+#define SO_ASYNC_SOCK		34
+
 #endif /* _ASM_SOCKET_H */
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 0aff22b..352c34b 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -617,11 +617,18 @@ #define __NR_sync_file_range	277
 __SYSCALL(__NR_sync_file_range, sys_sync_file_range)
 #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
+#define __NR_aio_recv		279
+__SYSCALL(__NR_aio_recv, sys_aio_recv)
+#define __NR_aio_send		280
+__SYSCALL(__NR_aio_send, sys_aio_send)
+#define __NR_aio_sendfile	281
+__SYSCALL(__NR_aio_sendfile, sys_aio_sendfile)
+#define __NR_kevent_ctl		282
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_vmsplice
-
+#define __NR_syscall_max __NR_kevent_ctl
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..e94a7bf
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,263 @@
+/*
+ * 	kevent.h
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT	0x1		/* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN	0x1		/* Kevent is broken. */
+#define KEVENT_RET_DONE		0x2		/* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define	KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+#define KEVENT_MASK_ALL		0xffffffff	/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY	0x0		/* Empty mask of ready events. */
+
+struct kevent_id
+{
+	__u32		raw[2];
+};
+
+struct ukevent
+{
+	struct kevent_id	id;			/* Id of this request, e.g. socket number, file descriptor and so on... */
+	__u32			type;			/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			event;			/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			req_flags;		/* Per-event request flags */
+	__u32			ret_flags;		/* Per-event return flags */
+	__u32			ret_data[2];		/* Event return data. Event originator fills it with anything it likes. */
+	union {
+		__u32		user[2];		/* User's data. It is not used, just copied to/from user. */
+		void		*ptr;
+	};
+};
+
+#define	KEVENT_CTL_ADD 		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+#define	KEVENT_CTL_WAIT		3
+#define	KEVENT_CTL_INIT		4
+
+struct kevent_user_control
+{
+	unsigned int		cmd;			/* Control command, e.g. KEVENT_ADD, KEVENT_REMOVE... */
+	unsigned int		num;			/* Number of ukevents this strucutre controls. */
+	unsigned int		timeout;		/* Timeout in milliseconds waiting for "num" events to become ready. */
+};
+
+#define KEVENT_USER_SYMBOL	'K'
+#define KEVENT_USER_CTL		_IOWR(KEVENT_USER_SYMBOL, 0, struct kevent_user_control)
+#define KEVENT_USER_WAIT	_IOWR(KEVENT_USER_SYMBOL, 1, struct kevent_user_control)
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/kevent_storage.h>
+#include <asm/semaphore.h>
+
+struct inode;
+struct dentry;
+struct sock;
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+struct kevent
+{
+	struct ukevent		event;
+	spinlock_t		lock;			/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+
+	struct list_head	kevent_entry;		/* Entry of user's queue. */
+	struct list_head	storage_entry;		/* Entry of origin's queue. */
+	struct list_head	ready_entry;		/* Entry of user's ready. */
+
+	struct kevent_user	*user;			/* User who requested this kevent. */
+	struct kevent_storage	*st;			/* Kevent container. */
+
+	kevent_callback_t	callback;		/* Is called each time new event has been caught. */
+	kevent_callback_t	enqueue;		/* Is called each time new event is queued. */
+	kevent_callback_t	dequeue;		/* Is called each time event is dequeued. */
+
+	void			*priv;			/* Private data for different storages. 
+							 * poll()/select storage has a list of wait_queue_t containers 
+							 * for each ->poll() { poll_wait()' } here.
+							 */
+};
+
+#define KEVENT_HASH_MASK	0xff
+
+struct kevent_list
+{
+	struct list_head	kevent_list;		/* List of all kevents. */
+	spinlock_t 		kevent_lock;		/* Protects all manipulations with queue of kevents. */
+};
+
+struct kevent_user
+{
+	struct kevent_list	kqueue[KEVENT_HASH_MASK+1];
+	unsigned int		kevent_num;		/* Number of queued kevents. */
+
+	struct list_head	ready_list;		/* List of ready kevents. */
+	unsigned int		ready_num;		/* Number of ready kevents. */
+	spinlock_t 		ready_lock;		/* Protects all manipulations with ready queue. */
+
+	unsigned int		max_ready_num;		/* Requested number of kevents. */
+
+	struct semaphore	ctl_mutex;		/* Protects against simultaneous kevent_user control manipulations. */
+	struct semaphore	wait_mutex;		/* Protects against simultaneous kevent_user waits. */
+	wait_queue_head_t	wait;			/* Wait until some events are ready. */
+
+	atomic_t		refcnt;			/* Reference counter, increased for each new kevent. */
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num;
+	unsigned long		total;
+#endif
+};
+
+#define KEVENT_MAX_REQUESTS		PAGE_SIZE/sizeof(struct kevent)
+
+struct kevent *kevent_alloc(gfp_t mask);
+void kevent_free(struct kevent *k);
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+
+#define list_for_each_entry_reverse_safe(pos, n, head, member)		\
+	for (pos = list_entry((head)->prev, typeof(*pos), member),	\
+		n = list_entry(pos->member.prev, typeof(*pos), member);	\
+	     prefetch(pos->member.prev), &pos->member != (head); 	\
+	     pos = n, n = list_entry(pos->member.prev, typeof(*pos), member))
+
+int kevent_break(struct kevent *k);
+int kevent_init(struct kevent *k);
+
+int kevent_init_socket(struct kevent *k);
+int kevent_init_inode(struct kevent *k);
+int kevent_init_timer(struct kevent *k);
+int kevent_init_poll(struct kevent *k);
+int kevent_init_naio(struct kevent *k);
+int kevent_init_aio(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_INODE
+void kevent_inode_notify(struct inode *inode, u32 event);
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
+void kevent_inode_remove(struct inode *inode);
+#else
+static inline void kevent_inode_notify(struct inode *inode, u32 event)
+{
+}
+static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+}
+static inline void kevent_inode_remove(struct inode *inode)
+{
+}
+#endif /* CONFIG_KEVENT_INODE */
+#ifdef CONFIG_KEVENT_SOCKET
+
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	0
+#endif
+#endif /* __KERNEL__ */
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..bd891f0
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,12 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;		/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;			/* List of queued kevents. */
+	unsigned int		qlen;			/* Number of queued kevents. */
+	spinlock_t		lock;			/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bd67a44..33d436e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -587,4 +587,8 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, size_t size, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, void __user *buf);
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index df864a3..6135afc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -185,6 +185,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index f6ef00f..eb057ea 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..88b35af
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,57 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables event queue mechanism.
+	  It can be used as replacement for poll()/select(), AIO callback invocations,
+	  advanced timer notifications and other kernel object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	default N
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes total number of kevent, number of kevents which are ready
+	  immediately at insertion time and number of kevents which were removed through
+	  readiness completion. It will be printed each time control kevent descriptor
+	  is closed.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of 
+	  sockets operations, like new packet receiving conditions, ready for accept
+  	  conditions and so on.
+	
+config KEVENT_INODE
+	bool "Kernel event notifications for inodes"
+	depends on KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of 
+	  inode operations, like file creation, removal and so on.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for poll()/select() notifications.
+
+config KEVENT_NAIO
+	bool "Network asynchronous IO"
+	depends on KEVENT && KEVENT_SOCKET
+	help
+	  This option enables kevent based network asynchronous IO subsystem.
+
+config KEVENT_AIO
+	bool "Asynchronous IO"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for AIO operations.
+	  AIO read is currently supported.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..7dcd651
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,7 @@
+obj-y := kevent.o kevent_user.o kevent_init.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o
+obj-$(CONFIG_KEVENT_AIO) += kevent_aio.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..f699a13
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,260 @@
+/*
+ * 	kevent.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+static kmem_cache_t *kevent_cache;
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+
+	if (!k->enqueue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+	
+	return k->enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+	
+	if (!k->dequeue) {
+		kevent_break(k);
+		return -EINVAL;
+	}
+
+	return k->dequeue(k);
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	int err;
+
+	spin_lock_init(&k->lock);
+	k->kevent_entry.next = LIST_POISON1;
+	k->storage_entry.next = LIST_POISON1;
+	k->ready_entry.next = LIST_POISON1;
+
+	if (k->event.type >= KEVENT_MAX)
+		return -E2BIG;
+	
+	switch (k->event.type) {
+		case KEVENT_NAIO:
+			err = kevent_init_naio(k);
+			break;
+		case KEVENT_SOCKET:
+			err = kevent_init_socket(k);
+			break;
+		case KEVENT_INODE:
+			err = kevent_init_inode(k);
+			break;
+		case KEVENT_TIMER:
+			err = kevent_init_timer(k);
+			break;
+		case KEVENT_POLL:
+			err = kevent_init_poll(k);
+			break;
+		case KEVENT_AIO:
+			err = kevent_init_aio(k);
+			break;
+		default:
+			err = -ENODEV;
+	}
+
+	return err;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail(&k->storage_entry, &st->list);
+	st->qlen++;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue. 
+ * It does not decrease origin's reference counter in any way 
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->storage_entry.next != LIST_POISON1) {
+		list_del(&k->storage_entry);
+		st->qlen--;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int err, rem = 0;
+	unsigned long flags;
+
+	err = k->callback(k);
+
+	spin_lock_irqsave(&k->lock, flags);
+	if (err > 0) {
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	} else if (err < 0) {
+		k->event.ret_flags |= KEVENT_RET_BROKEN;
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	}
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	if (!err)
+		err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	spin_unlock_irqrestore(&k->lock, flags);
+
+	if (err) {
+		if (rem) {
+			list_del(&k->storage_entry);
+			k->st->qlen--;
+		}
+		
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (k->ready_entry.next == LIST_POISON1) {
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+	
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st, 
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k, *n;
+
+	spin_lock(&st->lock);
+	list_for_each_entry_safe(k, n, &st->list, storage_entry) {
+		if (ready_callback)
+			ready_callback(k);
+
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	}
+	spin_unlock(&st->lock);
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	st->qlen = 0;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
+
+struct kevent *kevent_alloc(gfp_t mask)
+{
+	struct kevent *k;
+	
+	if (kevent_cache)
+		k = kmem_cache_alloc(kevent_cache, mask);
+	else
+		k = kzalloc(sizeof(struct kevent), mask);
+
+	return k;
+}
+
+void kevent_free(struct kevent *k)
+{
+	memset(k, 0xab, sizeof(struct kevent));
+
+	if (kevent_cache)
+		kmem_cache_free(kevent_cache, k);
+	else
+		kfree(k);
+}
+
+int __init kevent_sys_init(void)
+{
+	int err = 0;
+
+	kevent_cache = kmem_cache_create("kevent_cache", 
+			sizeof(struct kevent), 0, 0, NULL, NULL);
+	if (!kevent_cache)
+		err = -ENOMEM;
+	
+	return err;
+}
+
+late_initcall(kevent_sys_init);
diff --git a/kernel/kevent/kevent_init.c b/kernel/kevent/kevent_init.c
new file mode 100644
index 0000000..ec95114
--- /dev/null
+++ b/kernel/kevent/kevent_init.c
@@ -0,0 +1,85 @@
+/*
+ * 	kevent_init.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/kevent.h>
+
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->lock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->lock, flags);
+	return 0;
+}
+
+#ifndef CONFIG_KEVENT_SOCKET
+int kevent_init_socket(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_INODE
+int kevent_init_inode(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_TIMER
+int kevent_init_timer(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_POLL
+int kevent_init_poll(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_NAIO
+int kevent_init_naio(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
+
+#ifndef CONFIG_KEVENT_AIO
+int kevent_init_aio(struct kevent *k)
+{
+	kevent_break(k);
+	return -ENODEV;
+}
+#endif
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..566b62b
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,728 @@
+/*
+ * 	kevent_user.c
+ * 
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+
+static struct class *kevent_user_class;
+static char kevent_name[] = "kevent";
+static int kevent_user_major;
+
+static int kevent_user_open(struct inode *, struct file *);
+static int kevent_user_release(struct inode *, struct file *);
+static int kevent_user_ioctl(struct inode *, struct file *, 
+		unsigned int, unsigned long);
+static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
+
+static struct file_operations kevent_user_fops = {
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.ioctl		= kevent_user_ioctl,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct super_block *kevent_get_sb(struct file_system_type *fs_type, 
+		int flags, const char *dev_name, void *data)
+{
+	/* So original magic... */
+	return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef);	
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= kevent_name,
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+	
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+static struct kevent_user *kevent_user_alloc(void)
+{
+	struct kevent_user *u;
+	int i;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	u->ready_num = 0;
+#ifdef CONFIG_KEVENT_USER_STAT
+	u->wait_num = u->im_num = u->total = 0;
+#endif
+	for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
+		INIT_LIST_HEAD(&u->kqueue[i].kevent_list);
+		spin_lock_init(&u->kqueue[i].kevent_lock);
+	}
+	u->kevent_num = 0;
+	
+	init_MUTEX(&u->ctl_mutex);
+	init_MUTEX(&u->wait_mutex);
+	init_waitqueue_head(&u->wait);
+	u->max_ready_num = 0;
+
+	atomic_set(&u->refcnt, 1);
+
+	return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = kevent_user_alloc();
+	
+	if (!u)
+		return -ENOMEM;
+
+	file->private_data = u;
+	
+	return 0;
+}
+
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+#ifdef CONFIG_KEVENT_USER_STAT
+		printk("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", 
+				__func__, u, u->wait_num, u->im_num, u->total);
+#endif
+		kfree(u);
+	}
+}
+
+#if 0
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
+	
+	h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
+	h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
+
+	return h;
+}
+#else
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+	return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+#endif
+
+/*
+ * Remove kevent from user's list of all events, 
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int lock, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (lock) {
+		unsigned int hash = kevent_user_hash(&k->event);
+		struct kevent_list *l = &u->kqueue[hash];
+
+		spin_lock_irqsave(&l->kevent_lock, flags);
+		list_del(&k->kevent_entry);
+		u->kevent_num--;
+		spin_unlock_irqrestore(&l->kevent_lock, flags);
+	} else {
+		list_del(&k->kevent_entry);
+		u->kevent_num--;
+	}
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->ready_entry.next != LIST_POISON1) {
+		list_del(&k->ready_entry);
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+	
+	kevent_user_put(u);
+	kevent_free(k);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *__kqueue_dequeue_one_ready(struct list_head *q, 
+		unsigned int *qlen)
+{
+	struct kevent *k = NULL;
+	unsigned int len = *qlen;
+	
+	if (len && !list_empty(q)) {
+		k = list_entry(q->next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		*qlen = len - 1;
+	}
+	
+	return k;
+}
+
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	k = __kqueue_dequeue_one_ready(&u->ready_list, &u->ready_num);
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+static struct kevent *__kevent_search(struct kevent_list *l, struct ukevent *uk, 
+		struct kevent_user *u)
+{
+	struct kevent *k;
+	int found = 0;
+	
+	list_for_each_entry(k, &l->kevent_list, kevent_entry) {
+		spin_lock(&k->lock);
+		if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+				k->event.id.raw[0] == uk->id.raw[0] && 
+				k->event.id.raw[1] == uk->id.raw[1]) {
+			found = 1;
+			spin_unlock(&k->lock);
+			break;
+		}
+		spin_unlock(&k->lock);
+	}
+
+	return (found)?k:NULL;
+}
+
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	struct kevent_list *l = &u->kqueue[hash];
+	int err = -ENODEV;
+	unsigned long flags;
+	
+	spin_lock_irqsave(&l->kevent_lock, flags);
+	k = __kevent_search(l, uk, u);
+	if (k) {
+		spin_lock(&k->lock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->lock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&l->kevent_lock, flags);
+	
+	return err;
+}
+
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned int hash = kevent_user_hash(uk);
+	struct kevent_list *l = &u->kqueue[hash];
+	unsigned long flags;
+
+	spin_lock_irqsave(&l->kevent_lock, flags);
+	k = __kevent_search(l, uk, u);
+	if (k) {
+		kevent_finish_user(k, 0, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&l->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * No new entry can be added or removed from any list at this point.
+ * It is not permitted to call ->ioctl() and ->release() in parallel.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k, *n;
+	int i;
+
+	for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
+		struct kevent_list *l = &u->kqueue[i];
+		
+		list_for_each_entry_safe(k, n, &l->kevent_list, kevent_entry)
+			kevent_finish_user(k, 1, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+static int kevent_user_ctl_modify(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+static int kevent_user_ctl_remove(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EINVAL;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kevent_alloc(GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kevent_free(k);
+		goto err_out_exit;
+	}
+	k->user = u;
+#ifdef CONFIG_KEVENT_USER_STAT
+	u->total++;
+#endif
+	{
+		unsigned long flags;
+		unsigned int hash = kevent_user_hash(&k->event);
+		struct kevent_list *l = &u->kqueue[hash];
+		
+		spin_lock_irqsave(&l->kevent_lock, flags);
+		list_add_tail(&k->kevent_entry, &l->kevent_list);
+		u->kevent_num++;
+		kevent_user_get(u);
+		spin_unlock_irqrestore(&l->kevent_lock, flags);
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		if (err < 0)
+			uk->ret_flags |= KEVENT_RET_BROKEN;
+		uk->ret_flags |= KEVENT_RET_DONE;
+		kevent_finish_user(k, 1, 0);
+	} 
+
+err_out_exit:
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one 
+ * and add them into appropriate kevent_storages, 
+ * e.g. sockets, inodes and so on...
+ * If something goes wrong, all events will be dequeued and 
+ * negative error will be returned. 
+ * On success zero is returned and 
+ * ctl->num will be a number of finished events, either completed or failed. 
+ * Array of finished events (struct ukevent) will be placed behind 
+ * kevent_user_control structure. User must run through that array and check 
+ * ret_flags field of each ukevent structure to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err = 0, cerr = 0, num = 0, knum = 0, i;
+	void __user *orig, *ctl_addr;
+	struct ukevent uk;
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	orig = arg;
+	ctl_addr = arg - sizeof(struct kevent_user_control);
+#if 1
+	err = -ENFILE;
+	if (u->kevent_num + ctl->num >= 1024)
+		goto err_out_remove;
+#endif
+	for (i=0; i<ctl->num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EINVAL;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+#ifdef CONFIG_KEVENT_USER_STAT
+			u->im_num++;
+#endif
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent)))
+				cerr = -EINVAL;
+			orig += sizeof(struct ukevent);
+			num++;
+		} else
+			knum++;
+	}
+
+	if (cerr < 0)
+		goto err_out_remove;
+
+	ctl->num = num;
+	if (copy_to_user(ctl_addr, ctl, sizeof(struct kevent_user_control)))
+		cerr = -EINVAL;
+
+	if (cerr)
+		err = cerr;
+	if (!err)
+		err = num;
+
+err_out_remove:
+	up(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Waits until at least ctl->ready_num events are ready or timeout and returns 
+ * number of ready events (in case of timeout) or number of requested events.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	struct kevent *k;
+	int cerr = 0, num = 0;
+	void __user *ptr = arg + sizeof(struct kevent_user_control);
+
+	if (down_interruptible(&u->ctl_mutex))
+		return -ERESTARTSYS;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		if (ctl->timeout)
+			wait_event_interruptible_timeout(u->wait, 
+				u->ready_num >= ctl->num, msecs_to_jiffies(ctl->timeout));
+		else
+			wait_event_interruptible_timeout(u->wait, 
+					u->ready_num > 0, msecs_to_jiffies(1000));
+	}
+	while (num < ctl->num && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(ptr + num*sizeof(struct ukevent), 
+					&k->event, sizeof(struct ukevent)))
+			cerr = -EINVAL;
+
+		/*
+		 * If it is one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1, 1);
+		++num;
+#ifdef CONFIG_KEVENT_USER_STAT
+		u->wait_num++;
+#endif
+	}
+
+	ctl->num = num;
+	if (copy_to_user(arg, ctl, sizeof(struct kevent_user_control)))
+		cerr = -EINVAL;
+
+	up(&u->ctl_mutex);
+
+	return (cerr)?cerr:num;
+}
+
+static int kevent_ctl_init(void)
+{
+	struct kevent_user *u;
+	struct file *file;	
+	int fd, ret;
+
+	fd = get_unused_fd();
+	if (fd < 0)
+		return fd;
+
+	file = get_empty_filp();
+	if (!file) {
+		ret = -ENFILE;
+		goto out_put_fd;
+	}
+
+	u = kevent_user_alloc();
+	if (unlikely(!u)) {
+		ret = -ENOMEM;
+		goto out_put_file;
+	}
+
+	file->f_op = &kevent_user_fops;
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dget(kevent_mnt->mnt_root);
+	file->f_mapping = file->f_dentry->d_inode->i_mapping;
+	file->f_mode = FMODE_READ;
+	file->f_flags = O_RDONLY;
+	file->private_data = u;
+	
+	fd_install(fd, file);
+
+	return fd;
+
+out_put_file:
+	put_filp(file);
+out_put_fd:
+	put_unused_fd(fd);
+	return ret;
+}
+
+static int kevent_ctl_process(struct file *file, 
+		struct kevent_user_control *ctl, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	if (!u)
+		return -EINVAL;
+
+	switch (ctl->cmd) {
+		case KEVENT_CTL_ADD:
+			err = kevent_user_ctl_add(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_REMOVE:
+			err = kevent_user_ctl_remove(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_MODIFY:
+			err = kevent_user_ctl_modify(u, ctl, 
+					arg+sizeof(struct kevent_user_control));
+			break;
+		case KEVENT_CTL_WAIT:
+			err = kevent_user_wait(file, u, ctl, arg);
+			break;
+		case KEVENT_CTL_INIT:
+			err = kevent_ctl_init();
+		default:
+			err = -EINVAL;
+			break;
+	}
+
+	return err;
+}
+
+asmlinkage long sys_kevent_ctl(int fd, void __user *arg)
+{
+	int err, fput_needed;
+	struct kevent_user_control ctl;
+	struct file *file;
+
+	if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control)))
+		return -EINVAL;
+
+	if (ctl.cmd == KEVENT_CTL_INIT)
+		return kevent_ctl_init();
+
+	file = fget_light(fd, &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = kevent_ctl_process(file, &ctl, arg);
+
+	fput_light(file, fput_needed);
+	return err;
+}
+
+static int kevent_user_ioctl(struct inode *inode, struct file *file, 
+		unsigned int cmd, unsigned long arg)
+{
+	int err = -ENODEV;
+	struct kevent_user_control ctl;
+	struct kevent_user *u = file->private_data;
+	void __user *ptr = (void __user *)arg;
+
+	if (copy_from_user(&ctl, ptr, sizeof(struct kevent_user_control)))
+		return -EINVAL;
+
+	switch (cmd) {
+		case KEVENT_USER_CTL:
+			err = kevent_ctl_process(file, &ctl, ptr);
+			break;
+		case KEVENT_USER_WAIT:
+			err = kevent_user_wait(file, u, &ctl, ptr);
+			break;
+		default:
+			break;
+	}
+
+	return err;
+}
+
+static int __devinit kevent_user_init(void)
+{
+	struct class_device *dev;
+	int err = 0;
+	
+	err = register_filesystem(&kevent_fs_type);
+	if (err)
+		panic("%s: failed to register filesystem: err=%d.\n",
+			       kevent_name, err);
+
+	kevent_mnt = kern_mount(&kevent_fs_type);
+	if (IS_ERR(kevent_mnt))
+		panic("%s: failed to mount silesystem: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_mnt));
+	
+	kevent_user_major = register_chrdev(0, kevent_name, &kevent_user_fops);
+	if (kevent_user_major < 0) {
+		printk(KERN_ERR "Failed to register \"%s\" char device: err=%d.\n", 
+				kevent_name, kevent_user_major);
+		return -ENODEV;
+	}
+
+	kevent_user_class = class_create(THIS_MODULE, "kevent");
+	if (IS_ERR(kevent_user_class)) {
+		printk(KERN_ERR "Failed to register \"%s\" class: err=%ld.\n", 
+				kevent_name, PTR_ERR(kevent_user_class));
+		err = PTR_ERR(kevent_user_class);
+		goto err_out_unregister;
+	}
+
+	dev = class_device_create(kevent_user_class, NULL, 
+			MKDEV(kevent_user_major, 0), NULL, kevent_name);
+	if (IS_ERR(dev)) {
+		printk(KERN_ERR "Failed to create %d.%d class device in \"%s\" class: err=%ld.\n", 
+				kevent_user_major, 0, kevent_name, PTR_ERR(dev));
+		err = PTR_ERR(dev);
+		goto err_out_class_destroy;
+	}
+
+	printk("KEVENT subsystem: chardev helper: major=%d.\n", kevent_user_major);
+
+	return 0;
+
+err_out_class_destroy:
+	class_destroy(kevent_user_class);
+err_out_unregister:
+	unregister_chrdev(kevent_user_major, kevent_name);
+
+	return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+	class_device_destroy(kevent_user_class, MKDEV(kevent_user_major, 0));
+	class_destroy(kevent_user_class);
+	unregister_chrdev(kevent_user_major, kevent_name);
+	mntput(kevent_mnt);
+	unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5433195..dcbacf5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -121,6 +121,11 @@ cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 
+cond_syscall(sys_aio_recv);
+cond_syscall(sys_aio_send);
+cond_syscall(sys_aio_sendfile);
+cond_syscall(sys_kevent_ctl);
+
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);
 cond_syscall(sys_msync);

-- 
	Evgeniy Polyakov

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23  7:09 ` [1/4] kevent: core files Evgeniy Polyakov
@ 2006-06-23 18:44   ` Benjamin LaHaise
  2006-06-23 19:24     ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 18:44 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote:
> This patch includes core kevent files:
>  - userspace controlling
>  - kernelspace interfaces
>  - initialisation
>  - notification state machines

We don't need yet another event mechanism in the kernel, so I don't see 
why the new syscalls should be added when they don't interoperate with 
existing solutions.  If your results are enough to sway akpm that it is 
worth taking the patches, then it would make sense to merge the code with 
the already in-tree APIs.

		-ben

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 18:44   ` Benjamin LaHaise
@ 2006-06-23 19:24     ` Evgeniy Polyakov
  2006-06-23 19:55       ` Benjamin LaHaise
  2006-06-23 20:19       ` David Miller
  0 siblings, 2 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 19:24 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 02:44:57PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote:
> > This patch includes core kevent files:
> >  - userspace controlling
> >  - kernelspace interfaces
> >  - initialisation
> >  - notification state machines
> 
> We don't need yet another event mechanism in the kernel, so I don't see 
> why the new syscalls should be added when they don't interoperate with 
> existing solutions.  If your results are enough to sway akpm that it is 
> worth taking the patches, then it would make sense to merge the code with 
> the already in-tree APIs.

What API are you talking about?
There is only epoll(), which is 40% slower than kevent, and AIO, which
works not as state machine, but as repeated call for the same work.
There is also inotify, which allocates new message each time event
occurs, which is not a good solution for every situation.

Linux just does not have unified event processing mechanism, which was
pointed to many times in AIO mail list and when epoll() was only
introduced. I would even say, that Linux does not have such mechanism at
all, since every potential user implements it's own, which can not be
used with others.

Kevent fixes that. Although implementation itself can be suboptimal for
some cases or even unacceptible at all, but it is really needed
functionality.

Every existing notification can be built on top of kevent. One can find
how easy it was to implement generic poll/select notifications (what
epoll() does) or socket notifications (which are similar to epoll(), but
are called from inside socket state machine, thus improving processing
performance).

> 		-ben

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 19:24     ` Evgeniy Polyakov
@ 2006-06-23 19:55       ` Benjamin LaHaise
  2006-06-23 20:17         ` Evgeniy Polyakov
  2006-06-23 20:19       ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 19:55 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote:
> What API are you talking about?
> There is only epoll(), which is 40% slower than kevent, and AIO, which
> works not as state machine, but as repeated call for the same work.
> There is also inotify, which allocates new message each time event
> occurs, which is not a good solution for every situation.

AIO can be implemented as a state machine.  Nothing in the API stops 
you from doing that, and in fact there was code which was implemented as 
a state machine used on 2.4 kernels.

> Linux just does not have unified event processing mechanism, which was
> pointed to many times in AIO mail list and when epoll() was only
> introduced. I would even say, that Linux does not have such mechanism at
> all, since every potential user implements it's own, which can not be
> used with others.

The epoll event API doesn't have space in the event fields for result codes 
as needed for AIO.  The AIO API does -- how is it lacking in this regard?

> Kevent fixes that. Although implementation itself can be suboptimal for
> some cases or even unacceptible at all, but it is really needed
> functionality.

At the expense of adding another API?  How is this a good thing?  Why 
not spit out events in the existing format?

> Every existing notification can be built on top of kevent. One can find
> how easy it was to implement generic poll/select notifications (what
> epoll() does) or socket notifications (which are similar to epoll(), but
> are called from inside socket state machine, thus improving processing
> performance).

So far your code is adding a lot without unifying anything.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 19:55       ` Benjamin LaHaise
@ 2006-06-23 20:17         ` Evgeniy Polyakov
  2006-06-23 20:44           ` Benjamin LaHaise
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 20:17 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 03:55:13PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote:
> > What API are you talking about?
> > There is only epoll(), which is 40% slower than kevent, and AIO, which
> > works not as state machine, but as repeated call for the same work.
> > There is also inotify, which allocates new message each time event
> > occurs, which is not a good solution for every situation.
> 
> AIO can be implemented as a state machine.  Nothing in the API stops 
> you from doing that, and in fact there was code which was implemented as 
> a state machine used on 2.4 kernels.

But now it is implemented as repeated call for the same work, which does
not look like it can be used for any other types of work.
And repeated work introduce latencies.
As far as I recall, it is you who wanted to remove thread based approach
from AIO subsystem.

> > Linux just does not have unified event processing mechanism, which was
> > pointed to many times in AIO mail list and when epoll() was only
> > introduced. I would even say, that Linux does not have such mechanism at
> > all, since every potential user implements it's own, which can not be
> > used with others.
> 
> The epoll event API doesn't have space in the event fields for result codes 
> as needed for AIO.  The AIO API does -- how is it lacking in this regard?

AIO completion approach was designed to be used with process context VFS
update. read/write approach can not cover other types of notifications,
like inode updates or timers.

> > Kevent fixes that. Although implementation itself can be suboptimal for
> > some cases or even unacceptible at all, but it is really needed
> > functionality.
> 
> At the expense of adding another API?  How is this a good thing?  Why 
> not spit out events in the existing format?

Format of the structure transferred between the objects does not matter
at all. We can create a wrapper on kevent structures or kevent can
transform data from AIO objects.
The main design goal of kevent is to provide easy connected hooks into
any state machine, which might be used by kernelspace to notify about
any kind of events without any knowledge of it's background nature.
Kevent can be used for example as notification blocks for address
changes or it can replace netlink completely (it can even emulate
event multicasting).

Kevent is queue of events, which can be transferred from any object to
any destination.

> > Every existing notification can be built on top of kevent. One can find
> > how easy it was to implement generic poll/select notifications (what
> > epoll() does) or socket notifications (which are similar to epoll(), but
> > are called from inside socket state machine, thus improving processing
> > performance).
> 
> So far your code is adding a lot without unifying anything.

Not at all!
Kevent is a mechanism, which allows to impleement AIO, network AIO, poll
and select, timer control, adaptive readhead (as example of AIO VFS
update). All the code I present shows how to use kevent, it is not part
of the kevent. One can find Makefile in kevent dir to check what is the
core of the subsystem, which allows to be used as a transport for
events.

AIO, NAIO, poll/select, socket and timer notifications are just users.
One can add it's own usage as easy as to call kevent_storage
initialization function and event generation function. All other pieces
are hidded in the implementation.

> 		-ben
> -- 
> "Time is of no importance, Mr. President, only life is important."
> Don't Email: <dont@kvack.org>.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:17         ` Evgeniy Polyakov
@ 2006-06-23 20:44           ` Benjamin LaHaise
  2006-06-23 21:08             ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 20:44 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Sat, Jun 24, 2006 at 12:17:17AM +0400, Evgeniy Polyakov wrote:
> But now it is implemented as repeated call for the same work, which does
> not look like it can be used for any other types of work.

Given an iocb, you do not have to return -EIOCBRETRY, instead return 
-EIOCBQUEUED and then from whatever context do an aio_complete() with the 
result for that iocb.

> And repeated work introduce latencies.
> As far as I recall, it is you who wanted to remove thread based approach
> from AIO subsystem.

I have essentially given up on trying to get the filesystem AIO patches 
in given that the concerns against them are "woe complexity" with no real 
recourse for inclusion being open.  If David is open to changes in the 
networking area, I'd love to see it built on top of your code.

> AIO completion approach was designed to be used with process context VFS
> update. read/write approach can not cover other types of notifications,
> like inode updates or timers.

The completion event is 100% generic and does not need to come from process 
context.  Calling aio_complete() from irq context is entirely valid.

> Format of the structure transferred between the objects does not matter
> at all. We can create a wrapper on kevent structures or kevent can
> transform data from AIO objects.

> The main design goal of kevent is to provide easy connected hooks into
> any state machine, which might be used by kernelspace to notify about
> any kind of events without any knowledge of it's background nature.
> Kevent can be used for example as notification blocks for address
> changes or it can replace netlink completely (it can even emulate
> event multicasting).
> 
> Kevent is queue of events, which can be transferred from any object to
> any destination.

And io_getevents() reads a queue of events, so I'm not sure why you need 
a new syscall.

> Not at all!
> Kevent is a mechanism, which allows to impleement AIO, network AIO, poll
> and select, timer control, adaptive readhead (as example of AIO VFS
> update). All the code I present shows how to use kevent, it is not part
> of the kevent. One can find Makefile in kevent dir to check what is the
> core of the subsystem, which allows to be used as a transport for
> events.
> 
> AIO, NAIO, poll/select, socket and timer notifications are just users.
> One can add it's own usage as easy as to call kevent_storage
> initialization function and event generation function. All other pieces
> are hidded in the implementation.

I'll look at adapting your code to use the existing syscalls.  Maybe code 
will be better at expressing my concerns.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:44           ` Benjamin LaHaise
@ 2006-06-23 21:08             ` Evgeniy Polyakov
  2006-06-23 21:31               ` Benjamin LaHaise
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 21:08 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > AIO completion approach was designed to be used with process context VFS
> > update. read/write approach can not cover other types of notifications,
> > like inode updates or timers.
> 
> The completion event is 100% generic and does not need to come from process 
> context.  Calling aio_complete() from irq context is entirely valid.

put_ioctx() can sleep.
And the whole approach is different: AIO just wakes up requesting
thread, so user must provide a lot to be able to work with AIO.
It perfectly fits VFS design, but it is not acceptible for generic event
notifications.

> > Format of the structure transferred between the objects does not matter
> > at all. We can create a wrapper on kevent structures or kevent can
> > transform data from AIO objects.
> 
> > The main design goal of kevent is to provide easy connected hooks into
> > any state machine, which might be used by kernelspace to notify about
> > any kind of events without any knowledge of it's background nature.
> > Kevent can be used for example as notification blocks for address
> > changes or it can replace netlink completely (it can even emulate
> > event multicasting).
> > 
> > Kevent is queue of events, which can be transferred from any object to
> > any destination.
> 
> And io_getevents() reads a queue of events, so I'm not sure why you need 
> a new syscall.

It is not syscall, but overall design should be analyzed.
It is possible to use existing ssycalls, kevent design does not care
about how it's data structures are delivered to the internal
"processor".

> > Not at all!
> > Kevent is a mechanism, which allows to impleement AIO, network AIO, poll
> > and select, timer control, adaptive readhead (as example of AIO VFS
> > update). All the code I present shows how to use kevent, it is not part
> > of the kevent. One can find Makefile in kevent dir to check what is the
> > core of the subsystem, which allows to be used as a transport for
> > events.
> > 
> > AIO, NAIO, poll/select, socket and timer notifications are just users.
> > One can add it's own usage as easy as to call kevent_storage
> > initialization function and event generation function. All other pieces
> > are hidded in the implementation.
> 
> I'll look at adapting your code to use the existing syscalls.  Maybe code 
> will be better at expressing my concerns.

That would be great.

> 		-ben

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 21:08             ` Evgeniy Polyakov
@ 2006-06-23 21:31               ` Benjamin LaHaise
  2006-06-23 21:43                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 21:31 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote:
> On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > > AIO completion approach was designed to be used with process context VFS
> > > update. read/write approach can not cover other types of notifications,
> > > like inode updates or timers.
> > 
> > The completion event is 100% generic and does not need to come from process 
> > context.  Calling aio_complete() from irq context is entirely valid.
> 
> put_ioctx() can sleep.

Err, no, that should definately not be the case.  If it can, someone has 
completely broken aio.

> It is not syscall, but overall design should be analyzed.
> It is possible to use existing ssycalls, kevent design does not care
> about how it's data structures are delivered to the internal
> "processor".

Okay, that's good to hear. =-)

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 21:31               ` Benjamin LaHaise
@ 2006-06-23 21:43                 ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 21:43 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 05:31:44PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote:
> > On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > > > AIO completion approach was designed to be used with process context VFS
> > > > update. read/write approach can not cover other types of notifications,
> > > > like inode updates or timers.
> > > 
> > > The completion event is 100% generic and does not need to come from process 
> > > context.  Calling aio_complete() from irq context is entirely valid.
> > 
> > put_ioctx() can sleep.
> 
> Err, no, that should definately not be the case.  If it can, someone has 
> completely broken aio.

When reference counter hits zero it flushes aio workqueue, which can
sleep.
put_ioctx() -> __put_ioctx() -> cancel_delayed_work()/flush_workqueue().

It is there at least from 2.6.15 days (it is the oldest tree I can
access using my extremely slow GPRS link).

Hang the looter!

> 		-ben

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 19:24     ` Evgeniy Polyakov
  2006-06-23 19:55       ` Benjamin LaHaise
@ 2006-06-23 20:19       ` David Miller
  2006-06-23 20:31         ` Benjamin LaHaise
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2006-06-23 20:19 UTC (permalink / raw)
  To: johnpol; +Cc: bcrl, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 23 Jun 2006 23:24:29 +0400

> Linux just does not have unified event processing mechanism, which was
> pointed to many times in AIO mail list and when epoll() was only
> introduced. I would even say, that Linux does not have such mechanism at
> all, since every potential user implements it's own, which can not be
> used with others.
> 
> Kevent fixes that. Although implementation itself can be suboptimal for
> some cases or even unacceptible at all, but it is really needed
> functionality.

I completely agree with Evgeniy here.

There is nothing in the kernel today that provides integrated event
handling.  Nothing.  So when someone says to use the "existing" stuff,
they need to have their head examined.

The existing AIO stuff stinks as a set of interfaces.  It was designed
by a standards committee, not by people truly interested in a good
performing event processing design.  It is especially poorly suited
for networking, and any networking developer understands this.

It is pretty much a foregone conclusion that we will need new
APIs to get good networking performance.  Every existing interface
has one limitation or another.

So we should be happy people like Evgeniy try to work on this stuff,
instead of discouraging them.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:19       ` David Miller
@ 2006-06-23 20:31         ` Benjamin LaHaise
  2006-06-23 20:54           ` Evgeniy Polyakov
  2006-06-23 20:54           ` David Miller
  0 siblings, 2 replies; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 20:31 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, netdev

On Fri, Jun 23, 2006 at 01:19:40PM -0700, David Miller wrote:
> I completely agree with Evgeniy here.
> 
> There is nothing in the kernel today that provides integrated event
> handling.  Nothing.  So when someone says to use the "existing" stuff,
> they need to have their head examined.

The existing AIO events are *events*, with the syscalls providing the 
reading of events.

> The existing AIO stuff stinks as a set of interfaces.  It was designed
> by a standards committee, not by people truly interested in a good
> performing event processing design.  It is especially poorly suited
> for networking, and any networking developer understands this.

I disagree.  Stuffing an event that a read or write is complete/ready is a 
good way of handling things, even more so with hardware that will perform 
the memory copies to/from user buffers.

> It is pretty much a foregone conclusion that we will need new
> APIs to get good networking performance.  Every existing interface
> has one limitation or another.

Eh?  Nobody has posted any numbers comparing the approaches yet, so this 
is pure handwaving, unless you have real concrete results?

> So we should be happy people like Evgeniy try to work on this stuff,
> instead of discouraging them.

I would like to encourage him, but at the same time I don't want to see 
creating APIs that essentially duplicate existing work and needlessly 
break compatibility.  I completely agree that the in-kernel APIs are not 
as encompassing as they should be, and within the kernel Evgeniy's work 
may well be the way to go.  What I do not agree is that we need new 
syscalls at this point.  I'm perfectly willing to accept proof that change 
is needed if we do a proper comparision between any new syscall API and the 
use of the existing syscall API, but the pain of introducing a new API is 
sufficiently large that I think it is worth looking at the numbers.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:31         ` Benjamin LaHaise
@ 2006-06-23 20:54           ` Evgeniy Polyakov
  2006-06-24  9:14             ` Robert Iakobashvili
  2006-06-23 20:54           ` David Miller
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 20:54 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 04:31:14PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> may well be the way to go.  What I do not agree is that we need new 
> syscalls at this point.  I'm perfectly willing to accept proof that change 
> is needed if we do a proper comparision between any new syscall API and the 
> use of the existing syscall API, but the pain of introducing a new API is 
> sufficiently large that I think it is worth looking at the numbers.

New syscall is just an interface. Originally kevent (and it still can)
use char device and it's ioctl method.
It is perfectly possible to create wrappers for posix aio_* calls,
although I do not see why it is needed.
No need to concentrate on end-users interface at this point - it can be
changed at any time since design allows it, we should think about
overall design and if it is ok, move forward in implementation.

Btw, new API adds only one syscall for userspace kevent processing (and
three for send/recv/sendfile for network AIO).

According to numbers: kevent compared to epoll resulted in the
folllowing numbers:
kevent: more than 2600 requests per second (trivial web server)
epoll: about 1600-1800 requests.
Number of errors for 3k bursts of connections with 30K connections total
in 10seconds:
kevent: about 2k errors.
epoll: upto 15k errors.

More detailed results can be found on project's homepage at:
tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

> 		-ben

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:54           ` Evgeniy Polyakov
@ 2006-06-24  9:14             ` Robert Iakobashvili
  0 siblings, 0 replies; 73+ messages in thread
From: Robert Iakobashvili @ 2006-06-24  9:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Benjamin LaHaise, David Miller, netdev

Hi,

> According to numbers: kevent compared to epoll resulted in the
> folllowing numbers:
> kevent: more than 2600 requests per second (trivial web server)
> epoll: about 1600-1800 requests.
> Number of errors for 3k bursts of connections with 30K connections total
> in 10seconds:
> kevent: about 2k errors.
> epoll: upto 15k errors.

If it beats the great epoll, it means a real business case for kevent.

All previous attempts in kernel as well as by glibc and other userland
emulations to provide some real AIO infrastructure and API for server
applications with performance benefits, were not too much successful.
Heavy load networking servers are normally not using AIOs on linux due
to low performance.

>From another side Windows have a very strong I/O completion ports API,
which are widely used for the most heavy load applications.

Kevent may take linux servers productivity forward in general as well as
encourage moving aio-applications from windows to linux.

-- 
Sincerely,
------------------------------------------------------------------
Robert Iakobashvili, coroberti at gmail dot com
Navigare necesse est, vivere non est necesse.
------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:31         ` Benjamin LaHaise
  2006-06-23 20:54           ` Evgeniy Polyakov
@ 2006-06-23 20:54           ` David Miller
  2006-06-23 21:53             ` Benjamin LaHaise
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2006-06-23 20:54 UTC (permalink / raw)
  To: bcrl; +Cc: johnpol, netdev

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Fri, 23 Jun 2006 16:31:14 -0400

> Eh?  Nobody has posted any numbers comparing the approaches yet, so this 
> is pure handwaving, unless you have real concrete results?

Evgeniy posts numbers and performance graphs on his kevent work all
the time.

Van Jacobson did in his LCA2006 net channel slides too, perhaps you
missed that.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 20:54           ` David Miller
@ 2006-06-23 21:53             ` Benjamin LaHaise
  2006-06-23 22:12               ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 21:53 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, netdev

On Fri, Jun 23, 2006 at 01:54:23PM -0700, David Miller wrote:
> From: Benjamin LaHaise <bcrl@kvack.org>
> Date: Fri, 23 Jun 2006 16:31:14 -0400
> 
> > Eh?  Nobody has posted any numbers comparing the approaches yet, so this 
> > is pure handwaving, unless you have real concrete results?
> 
> Evgeniy posts numbers and performance graphs on his kevent work all
> the time.

But you're argueing that the performance of something that hasn't been 
tested is worse simply by nature of it not having been tested.  That's a 
fallacy of omission, iiuc.

> Van Jacobson did in his LCA2006 net channel slides too, perhaps you
> missed that.

I have yet to be convinced that the layering violation known as net channels 
is the right way to go, mostly because it breaks horribly in a few cases -- 
think what happens during periods of CPU overcommit, in which case doing too 
much in interrupt context will kill a system (which is why softirqs are 
needed).  The effect of doing all processing in user context creates issues 
with delayed acks (due to context switching to other tasks in the system), 
which will cause excess retransmits.  The hard problems associated with 
packet filtering and security are also still unresolved, which is okay for 
a paper, but a concern in real life.

There are also a number of performance flaws in the current stack that 
show up under profiling, some of which I posted fixes for, some of which 
have yet to be fixed.  The pushf/popf pipeline stall was one of the bigger 
instances of CPU wastage that Van Jacobson noticed (it shows up as bottom 
halves using lots of CPU).  Iirc, Ingo's real time patches may avoid that 
by way of reworking the irq disable/enable mechanism, which would mean the 
results need retesting.  Using the cr8 register to enable/disable 
interrupts on x86-64 might also improve things, as that would eliminate the 
flags dependancy of cli/sti...

In short, there's a lot of work that still has to be done.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [1/4] kevent: core files.
  2006-06-23 21:53             ` Benjamin LaHaise
@ 2006-06-23 22:12               ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2006-06-23 22:12 UTC (permalink / raw)
  To: bcrl; +Cc: johnpol, netdev

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Fri, 23 Jun 2006 17:53:14 -0400

> The effect of doing all processing in user context creates issues 
> with delayed acks (due to context switching to other tasks in the system), 

The Linux TCP stack does this today.

Full TCP input protocol processing is done in the user process
context.

What you are not understanding is that process scheduling helps TCP,
it does not hinder it.  If the system is loaded, we want the senders
to pace themselves to the rate at which the kernel can schedule the
abundance of receiver work it has.  And this happens naturally when
the TCP protocol input processing operates in process context.

Your fear of cpu overcommit in interrupt handlers is also heavily
flawed.  Net channels do a socket demux, and a queue entail plus
ring a doorbell if necessary, nothing more.

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2006-09-04 14:38 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <44C66FC9.3050402@redhat.com>
2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
2006-07-25 22:55   ` Nicholas Miell
2006-07-26  6:28   ` Evgeniy Polyakov
2006-07-26  9:18     ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
2006-07-26  9:18       ` [1/4] kevent: core files Evgeniy Polyakov
2006-07-26  9:18         ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
2006-07-26  9:18           ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
2006-07-26  9:18             ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
2006-07-26 10:00             ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
2006-07-26 10:08               ` Evgeniy Polyakov
2006-07-26 10:13                 ` Christoph Hellwig
2006-07-26 10:25                   ` Evgeniy Polyakov
2006-07-26 10:04             ` Christoph Hellwig
2006-07-26 10:12               ` David Miller
2006-07-26 10:15                 ` Christoph Hellwig
2006-07-26 20:21                   ` Phillip Susi
2006-07-26 14:14                 ` Avi Kivity
2006-07-26 10:19               ` Evgeniy Polyakov
2006-07-26 10:30                 ` Christoph Hellwig
2006-07-26 14:28                   ` Ulrich Drepper
2006-07-26 16:22                     ` Badari Pulavarty
2006-07-27  6:49                       ` Sébastien Dugué
2006-07-27 15:28                         ` Badari Pulavarty
2006-07-27 18:14                           ` Zach Brown
2006-07-27 18:29                             ` Badari Pulavarty
2006-07-27 18:44                               ` Ulrich Drepper
2006-07-27 21:02                                 ` Badari Pulavarty
2006-07-28  7:31                                   ` Sébastien Dugué
2006-07-28 12:58                                   ` Sébastien Dugué
2006-08-11 19:45                                     ` Ulrich Drepper
2006-08-12 18:29                                       ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
2006-08-12 19:10                                         ` Ulrich Drepper
2006-08-12 19:28                                           ` Jakub Jelinek
2006-09-04 14:37                                             ` Sébastien Dugué
2006-08-14  7:02                                           ` Suparna Bhattacharya
2006-08-14 16:38                                             ` Ulrich Drepper
2006-08-15  2:06                                               ` Nicholas Miell
2006-09-04 14:36                                           ` Sébastien Dugué
2006-09-04 14:28                                         ` Sébastien Dugué
2006-07-28  7:29                                 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué
2006-07-31 10:11                                 ` Suparna Bhattacharya
2006-07-28  7:26                           ` Sébastien Dugué
2006-07-26 10:31         ` [1/4] kevent: core files Andrew Morton
2006-07-26 10:37           ` Evgeniy Polyakov
2006-07-26 10:44         ` Evgeniy Polyakov
2006-07-27  6:10     ` async network I/O, event channels, etc David Miller
2006-07-27  7:49       ` Evgeniy Polyakov
2006-07-27  8:02         ` David Miller
2006-07-27  8:09           ` Jens Axboe
2006-07-27  8:11             ` Jens Axboe
2006-07-27  8:20               ` David Miller
2006-07-27  8:29                 ` Jens Axboe
2006-07-27  8:37                   ` David Miller
2006-07-27  8:39                     ` Jens Axboe
2006-07-27  8:58           ` Evgeniy Polyakov
2006-07-27  9:31             ` David Miller
2006-07-27  9:37               ` Evgeniy Polyakov
2006-06-22 17:14 [1/1] Kevent subsystem Evgeniy Polyakov
2006-06-23  7:09 ` [1/4] kevent: core files Evgeniy Polyakov
2006-06-23 18:44   ` Benjamin LaHaise
2006-06-23 19:24     ` Evgeniy Polyakov
2006-06-23 19:55       ` Benjamin LaHaise
2006-06-23 20:17         ` Evgeniy Polyakov
2006-06-23 20:44           ` Benjamin LaHaise
2006-06-23 21:08             ` Evgeniy Polyakov
2006-06-23 21:31               ` Benjamin LaHaise
2006-06-23 21:43                 ` Evgeniy Polyakov
2006-06-23 20:19       ` David Miller
2006-06-23 20:31         ` Benjamin LaHaise
2006-06-23 20:54           ` Evgeniy Polyakov
2006-06-24  9:14             ` Robert Iakobashvili
2006-06-23 20:54           ` David Miller
2006-06-23 21:53             ` Benjamin LaHaise
2006-06-23 22:12               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).