* [take21 0/4] kevent: Generic event handling mechanism.
       [not found] <1154985aa0591036@2ka.mipt.ru>
@ 2006-10-27 16:10 ` Evgeniy Polyakov
  2006-10-27 16:10   ` [take21 1/4] kevent: Core files Evgeniy Polyakov
                      ` (2 more replies)
  2006-11-01 11:36   ` [take22 " Evgeniy Polyakov
                      ` (4 subsequent siblings)
  5 siblings, 3 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on the possible number of kevents

With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s over a
100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which is
quite close to wire speed if we take headers and the like into account.

Changes from 'take19' and 'take18' patchsets:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table. At least for a web server, the
   frequency of addition/deletion of new kevents is comparable with the
   number of search accesses, i.e. most of the time events are added,
   accessed only a couple of times and then removed, which justifies RB
   tree usage over AVL tree, since the latter has much slower deletion
   time (max O(log(N)) compared to 3 ops), although faster search time
   (1.44*log(N) vs. 2*log(N)). So for kevents I use an RB tree for now,
   and later, when my AVL tree implementation is ready, it will be
   possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380
req/second compared to 2200, sometimes 2500 req/second for epoll() with
a trivial web server and an httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum number
of kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
   calculation
 * export kevent_socket_notify(), since it is used in network protocols
   which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer
   API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has
   been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for
   testing)

Changes from 'take14' patchset:
 * added kevent_wait()
   This syscall waits until either the timeout expires or at least one
   event becomes ready. It also commits that @num events starting from
   @start have been processed by userspace and thus can be removed or
   rearmed (depending on their flags). It can be used to commit events
   read by userspace through the mmap interface. Example userspace code
   (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer
   if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop, which should save
   us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed over-80-column comment issues
 * added a header shared between userspace and kernelspace instead of
   embedding everything in one
 * core restructuring to remove forward declarations
 * some whitespace and coding style cleanups
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
   - use nopage() method to dynamically substitute pages
   - allocate a new page for events only when a newly added kevent
     requires it
   - do not use ugly index dereferencing, use structure instead
   - reduced amount of data in the ring (id and flags), maximum 12 pages
     on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect that an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is
   not turned on
 * do not use internal socket structures, use appropriate (exported)
   wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the
   same function, so lockdep was taught to differentiate between the
   various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in mmapped buffer implementation which would end up in
   wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback
   initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead
   of per-bucket locks
 * do not use kevent_user_ctl structure; instead provide needed arguments
   as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at helping when a lot of kevents are being
   copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr,
   unsigned int max_nr, unsigned int timeout, void __user *buf,
   unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and
   initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor
   does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
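[Editorial illustration] For readers new to the interface, the flow
described in the changelog boils down to three steps: open the control
descriptor, add ukevents with kevent_ctl(), then collect ready ones with
kevent_get_events(). The sketch below is an illustration only, not part
of the patchset: the /dev/kevent node name, the socket descriptor value
and the syscall numbers (taken from the i386 table in patch 1/4) are
assumptions; evtest.c on the project homepage is the authoritative
example.

/* kevent-sketch.c: minimal, hedged usage example */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/ukevent.h>

#define __NR_kevent_get_events	319	/* assumption: i386 numbers */
#define __NR_kevent_ctl		320

int main(void)
{
	struct ukevent uk, ready;
	int kfd, err, sock = 3;		/* assumption: an already connected socket */

	kfd = open("/dev/kevent", O_RDWR);	/* control descriptor from the miscdevice */
	if (kfd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = sock;			/* id: the socket descriptor */
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_RECV;
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* dequeue after first firing */

	/* returns the number of immediately ready/broken events copied back */
	err = syscall(__NR_kevent_ctl, kfd, KEVENT_CTL_ADD, 1, &uk);
	if (err < 0)
		return 1;
	if (err > 0)	/* fired (or failed) immediately, result is already in uk */
		printf("immediate ret_flags: 0x%x\n", uk.ret_flags);

	/* wait for at least 1 and at most 1 event; timeout is in nanoseconds */
	err = syscall(__NR_kevent_get_events, kfd, 1, 1, 1000000000ULL, &ready, 0);
	if (err > 0)
		printf("type: %u, event: 0x%x, ret_flags: 0x%x\n",
				ready.type, ready.event, ready.ret_flags);
	return 0;
}

Note that KEVENT_CTL_ADD copies failed or immediately ready events back
into the caller's buffer, so the buffer must remain writable.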
* [take21 1/4] kevent: Core files. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet 2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-11-07 11:26 ` Jeff Garzik 2 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..a9560eb 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,6 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..cf18955 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,11 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..f009677 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,13 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 322 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..c53d156 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,16 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_wait #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..125414c --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,205 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* First kevent which was not put into ring buffer due to overflow. + * It will be copied into the buffer, when first event will be removed + * from ready queue (and thus there will be an empty place in the + * ring buffer). 
+ */ + struct kevent *overflow_kevent; + /* Array of pages forming mapped ring buffer */ + struct kevent_mring **pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, mmap_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +int kevent_user_ring_add_event(struct kevent *k); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->mmap_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_mmap(struct kevent_user *u) +{ + u->mmap_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_mmap(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_KEVENT_SOCKET +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..71a758f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,7 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..daa8202 --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,163 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then dequeue. */ +#define KEVENT_REQ_ONESHOT 0x1 + +/* + * Kevent return flags. + */ +/* Kevent is broken. */ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. 
*/
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_PAGES	2
+
+/*
+ * Note that mukevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 8 bytes at the beginning of the page to store the indexes.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		kidx, uidx;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only for sanitizing kevent_wait() input data - do not allow the
+ * user to specify more events than it is possible to place into the
+ * ring buffer. This does not limit the number of events which can be
+ * put into the kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS	(KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes the total number of kevents, the number of
+	  kevents which are ready immediately at insertion time and the number
+	  of kevents which were removed through readiness completion.
+	  It will be printed each time the control kevent descriptor is closed.
+ +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..9130cad --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,4 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..25404d3 --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,227 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? 
cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before an event is added into some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, and kevent_enqueue() will fail
+ * to add this kevent into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from the ->enqueue() callback when the reference counter for the
+ * given origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue a kevent from the origin's queue.
+ * It does not decrease the origin's reference counter in any way
+ * and must be called before it, so the storage itself must be valid.
+ * It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call the kevent ready callback and queue the kevent into the ready queue
+ * if needed. If the kevent is marked as one-shot, remove it from the
+ * storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if the kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in the origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (The socket is removed from the file table at this point, for example.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..e92a1dc
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1000 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable; return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	if ((ring->kidx + 1 == ring->uidx) ||
+			((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+		if (k->user->overflow_kevent == NULL)
+			k->user->overflow_kevent = k;
+		return -EAGAIN;
+	}
+
+	pidx = ring->kidx/KEVENTS_ON_PAGE;
+	off = ring->kidx%KEVENTS_ON_PAGE;
+
+	if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+		printk(KERN_ERR "%s: kidx: %u, uidx: %u, on_page: %lu, pidx: %u.\n",
+				__func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+		return -EINVAL;
+	}
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->kidx >= KEVENT_MAX_EVENTS)
+		ring->kidx = 0;
+
+	return 0;
+}
+
+/*
+ * Initialize the mmap ring buffer.
+ * It will store ready kevents, so userspace can get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int i;
+
+	u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+		u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[i])
+			break;
+	}
+
+	if (i != KEVENT_MAX_PAGES)
+		goto err_out_free;
+
+	u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+	return 0;
+
+err_out_free:
+	for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+		if (!u->pring[i])
+			break;
+
+		free_page((unsigned long)u->pring[i]);
+	}
+
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i=0; i<KEVENT_MAX_PAGES; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+/*
+ * Mmap implementation for the ring buffer, which is created as an array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start, off = vma->vm_pgoff;
+	struct kevent_user *u = file->private_data;
+
+	if (off >= KEVENT_MAX_PAGES)
+		return -EINVAL;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[off])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+	if (left->raw_u64 > right->raw_u64)
+		return -1;
+
+	if (right->raw_u64 > left->raw_u64)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * Free the entry in the RCU callback; it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes the kevent from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free the kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove the kevent from the user's list of all events,
+ * dequeue it from storage and decrease the user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from the user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+		if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+			u->pring[0]->uidx = 0;
+
+		if (u->overflow_kevent) {
+			int err;
+
+			err = kevent_user_ring_add_event(u->overflow_kevent);
+			if (!err) {
+				if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+					u->overflow_kevent = NULL;
+				else
+					u->overflow_kevent =
+						list_entry(u->overflow_kevent->ready_entry.next,
+								struct kevent, ready_entry);
+			}
+		}
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search for a kevent inside the kevent tree for the given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+	struct rb_node *n = u->kevent_root.rb_node;
+	int cmp;
+
+	while (n) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		cmp = kevent_compare_id(&k->event.id, id);
+
+		if (cmp > 0)
+			n = n->rb_right;
+		else if (cmp < 0)
+			n = n->rb_left;
+		else {
+			ret = k;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Search for and modify the kevent according to the provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove the kevent which matches the provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k;
+	struct rb_node *n;
+
+	for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read the requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue the kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+	unsigned long flags;
+	struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+	struct kevent *k;
+	int err = 0, cmp;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (*p) {
+		parent = *p;
+		k = rb_entry(parent, struct kevent, kevent_node);
+
+		cmp = kevent_compare_id(&k->event.id, &new->event.id);
+		if (cmp > 0)
+			p = &parent->rb_right;
+		else if (cmp < 0)
+			p = &parent->rb_left;
+		else {
+			err = -EEXIST;
+			break;
+		}
+	}
+	if (likely(!err)) {
+		rb_link_node(&new->kevent_node, parent, p);
+		rb_insert_color(&new->kevent_node, &u->kevent_root);
+		new->flags |= KEVENT_USER;
+		u->kevent_num++;
+		kevent_user_get(u);
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Add a kevent from both kernel and userspace users.
+ * This function allocates and queues the kevent; it returns a negative value
+ * on error, a positive value if the kevent is ready immediately and zero
+ * if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	err = kevent_user_enqueue(u, k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+				&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is a one-shot kevent, it has already been removed from
+		 * the origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent miscdevice.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events into.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to wait until there is free space in the kevent queue
+ * and removes/requeues the requested number of events (commits them). The
+ * function returns the number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies the number of nanoseconds to wait until
+ * there is free space in the kevent queue.
+ *
+ * The ring buffer is designed in such a way that the first ready kevent will
+ * be at the @ring->uidx position, and all other ready events will be in FIFO
+ * order after it.
+ * So when we need to commit @num events, it means we should just remove the
+ * first @num kevents from the ready queue and commit them. We do not use any
+ * special locking to protect this function against simultaneous running -
+ * kevent dequeueing is atomic, and we do not care about the order in which
+ * events were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that the first thread will commit
+ * events 0 and 2 while the second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a
+ * smaller number of committed events than was requested.
+ * The ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes the race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by
+ * kevent_get_events() recently (for example when one thread looked into the
+ * ring indexes and started to commit events which were simultaneously
+ * committed by another thread through kevent_get_events()), kevent_wait()
+ * will not commit unprocessed events, but will return the number of actually
+ * committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer,
+ * but from some 'further' event.
+ *
+ * An example: if ready events use positions 2-5,
+ * it is permitted to start to commit 3 events from position 0;
+ * in this case positions 0 and 1 will be omitted, only the event in
+ * position 2 will be committed, and kevent_wait() will return 1, since only
+ * one event was actually committed.
+ * It is forbidden to try to commit from position 4; 0 will be returned.
+ * This means that if some events were committed using kevent_get_events(),
+ * they will not be counted; instead userspace should check the ring index
+ * and try to commit again.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, committed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_mring *ring;
+	unsigned int i, actual;
+	unsigned long flags;
+
+	if (num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring[0];
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	actual = (ring->kidx > ring->uidx)?
+			(ring->kidx - ring->uidx):
+			(KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+	if (actual < num)
+		num = actual;
+
+	if (start < ring->uidx) {
+		/*
+		 * Some events have been committed through kevent_get_events().
+		 *
+		 *                      ready events
+		 * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+		 *       ring->uidx                       ring->kidx
+		 *      |                  |
+		 *    start            start+num
+		 *
+		 */
+		unsigned int diff = ring->uidx - start;
+
+		if (num < diff)
+			num = 0;
+		else
+			num -= diff;
+	} else if (start > ring->uidx)
+		num = 0;
+
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	for (i=0; i<num; ++i) {
+		k = kqueue_dequeue_ready(u);
+		if (!k)
+			break;
+
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		kevent_stat_mmap(u);
+		committed++;
+	}
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= 1,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return committed;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on the given kevent queue, which is obtained through the kevent file
+ * descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to an array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the miscdevice to get control file descriptors from.
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..bc0582b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,10 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
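To make the commit protocol documented above concrete, here is a minimal userspace sketch of the uidx/kidx arithmetic followed by a kevent_wait() call. It assumes the mmap'ed struct kevent_mring layout from this patchset and a raw syscall wrapper like the _syscall4() one in evtest.c attached later in this thread; __NR_kevent_wait and the helper names are illustrative, not part of the patch.

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

static long kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
{
	/* Assumed syscall number; see the wrappers in evtest.c. */
	return syscall(__NR_kevent_wait, ctl_fd, start, num, timeout);
}

/* Ready events occupy [uidx, kidx) modulo KEVENT_MAX_EVENTS. */
static unsigned int ring_ready_events(struct kevent_mring *ring)
{
	unsigned int kidx = ring->kidx, uidx = ring->uidx;

	return (kidx >= uidx) ? (kidx - uidx) : (KEVENT_MAX_EVENTS - (uidx - kidx));
}

static long commit_ready(int ctl_fd, struct kevent_mring *ring)
{
	unsigned int start = ring->uidx;	/* commits may only begin here */
	unsigned int num = ring_ready_events(ring);

	if (!num)
		return 0;
	/*
	 * May return less than num if another thread committed some of
	 * these events concurrently through kevent_wait() or
	 * kevent_get_events(); the caller should then re-read the indexes.
	 */
	return kevent_wait(ctl_fd, start, num, 1000000000ULL /* 1 sec */);
}

As the comment block above spells out, the only supported starting position is ring->uidx itself: starting past it yields 0 committed events, and positions before it are silently skipped.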
* [take21 2/4] kevent: poll/select() notifications. 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works similarly to epoll and has the same issues (the callback is invoked not from the internal state machine of the caller, but through process wakeup, with a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..f81299f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..fb74e0f --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,222 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
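Since kevent_poll_enqueue() above takes the target descriptor from id.raw[0] and masks the file's ->poll() revents directly against the requested event mask, registering a poll-style kevent from userspace presumably looks like the sketch below. The kevent_ctl() wrapper and __NR_kevent_ctl are assumptions, following the wrappers in evtest.c attached later in this thread, and POLLIN is just an example mask.

#include <poll.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/ukevent.h>

static long kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg)
{
	/* Assumed syscall number and wrapper. */
	return syscall(__NR_kevent_ctl, fd, cmd, num, arg);
}

static long add_poll_kevent(int ctl_fd, int target_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.event = POLLIN;			/* standard poll bits, per the masking above */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* drop the kevent after first delivery */
	uk.id.raw[0] = target_fd;		/* kevent_poll_enqueue() does fget() on this */
	uk.user[0] = target_fd;			/* opaque cookie echoed back with the event */

	/* Returns the number of failed ukevents copied back, or a negative error. */
	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

On delivery, kevent_poll_callback() stores the matching revents in ret_data[0].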
* [take21 3/4] kevent: Socket notifications. 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 4/4] kevent: Timer notifications Evgeniy Polyakov 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Socket notifications. This patch includes socket send/recv/accept notifications. Using a trivial web server based on kevent and these features instead of epoll, its performance increased more than noticeably. More details about various benchmarks and the server itself (evserver_kevent.c) can be found on the project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..ff1b129 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; }
else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..c865b3e --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,129 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -ENODEV; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
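Registration from userspace follows the same pattern as the poll sketch earlier: kevent_socket_enqueue() above resolves id.raw[0] with sockfd_lookup(), and the KEVENT_SOCKET_* bits form the event mask. A hedged sketch, reusing the assumed kevent_ctl() wrapper from that earlier example:

#include <string.h>
#include <linux/ukevent.h>

static long add_socket_kevent(int ctl_fd, int sock_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT;	/* data or new connection */
	uk.id.raw[0] = sock_fd;	/* kevent_socket_enqueue() does sockfd_lookup() on this */
	uk.user[0] = sock_fd;

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

A listening socket would typically ask for KEVENT_SOCKET_ACCEPT (raised from the tcp_v4 request queueing hunk above), a connected one for KEVENT_SOCKET_RECV and/or KEVENT_SOCKET_SEND.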
* [take21 4/4] kevent: Timer notifications. 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..04acc46 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,113 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer); + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + 
struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
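Given that kevent_timer_enqueue() reads the period from id.raw[0]/id.raw[1] and kevent_timer_func() rearms the hrtimer via hrtimer_forward(), a periodic userspace timer reduces to the sketch below; KEVENT_TIMER_FIRED and the full flow appear in evtest.c, attached later in this thread, and the kevent_ctl() wrapper is again an assumption.

#include <string.h>
#include <linux/ukevent.h>

static long add_periodic_timer(int ctl_fd, unsigned int sec, unsigned int nsec)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.id.raw[0] = sec;	/* period: seconds part, fed to ktime_set() above */
	uk.id.raw[1] = nsec;	/* period: nanoseconds part */
	/* No KEVENT_REQ_ONESHOT, so hrtimer_forward() keeps rearming it. */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

On each firing, kevent_timer_callback() above puts jiffies_to_msecs(jiffies) into ret_data[0].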
* Re: [take21 2/4] kevent: poll/select() notifications. 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-10-28 10:04 ` Eric Dumazet 2006-10-28 10:08 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 10:04 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > + file = fget(k->event.id.raw[0]); > + if (!file) > + return -ENODEV; Please, do us a favor, and use EBADF instead of ENODEV. EBADF : /* Bad file number */ ENODEV : /* No such device */ You have many ENODEV uses in your patches and that really hurts. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 2/4] kevent: poll/select() notifications. 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet @ 2006-10-28 10:08 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 10:08 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 12:04:10PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > Evgeniy Polyakov a écrit : > > >+ file = fget(k->event.id.raw[0]); > >+ if (!file) > >+ return -ENODEV; > > Please, do us a favor, and use EBADF instead of ENODEV. > > EBADF : /* Bad file number */ > > ENODEV : /* No such device */ > > You have many ENODEV uses in your patches and that really hurts. Ok :) > Eric -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-10-28 10:28 ` Eric Dumazet 2006-10-28 10:53 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 10:28 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel +/* + * Called under kevent_user->ready_lock, so updates are always protected. + */ +int kevent_user_ring_add_event(struct kevent *k) +{ + unsigned int pidx, off; + struct kevent_mring *ring, *copy_ring; + + ring = k->user->pring[0]; + + if ((ring->kidx + 1 == ring->uidx) || + ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) { + if (k->user->overflow_kevent == NULL) + k->user->overflow_kevent = k; + return -EAGAIN; + } + I really dont understand how you manage to queue multiple kevents in the 'overflow list'. You just queue one kevent at most. What am I missing ? > + > + for (i=0; i<KEVENT_MAX_PAGES; ++i) { > + u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL); > + if (!u->pring[i]) > + break; > + } > + > + if (i != KEVENT_MAX_PAGES) > + goto err_out_free; Why dont you use goto directly ? if (!u->pring[i]) goto err_out_free; > + > + u->pring[0]->uidx = u->pring[0]->kidx = 0; > + > + return 0; > + > +err_out_free: > + for (i=0; i<KEVENT_MAX_PAGES; ++i) { > + if (!u->pring[i]) > + break; > + > + free_page((unsigned long)u->pring[i]); > + } > + return k; > +} > + > +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) > +{ > + int err, cerr = 0, knum = 0, rnum = 0, i; > + void __user *orig = arg; > + struct ukevent uk; > + > + mutex_lock(&u->ctl_mutex); > + > + err = -EINVAL; > + if (num > KEVENT_MIN_BUFFS_ALLOC) { > + struct ukevent *ukev; > + > + ukev = kevent_get_user(num, arg); > + if (ukev) { > + for (i = 0; i < num; ++i) { > + err = kevent_user_add_ukevent(&ukev[i], u); > + if (err) { > + kevent_stat_im(u); > + if (i != rnum) > + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); > + rnum++; > + } else > + knum++; Why are you using/counting knum ? > + } > + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) > + cerr = -EFAULT; > + kfree(ukev); > + goto out_setup; > + } > + } > + > + for (i = 0; i < num; ++i) { > + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { > + cerr = -EFAULT; > + break; > + } > + arg += sizeof(struct ukevent); > + > + err = kevent_user_add_ukevent(&uk, u); > + if (err) { > + kevent_stat_im(u); > + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { > + cerr = -EFAULT; > + break; > + } > + orig += sizeof(struct ukevent); > + rnum++; > + } else > + knum++; > + } > + > +out_setup: > + if (cerr < 0) { > + err = cerr; > + goto out_remove; > + } > + > + err = rnum; > +out_remove: > + mutex_unlock(&u->ctl_mutex); > + > + return err; > +} ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet @ 2006-10-28 10:53 ` Evgeniy Polyakov 2006-10-28 12:36 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 10:53 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > +/* > + * Called under kevent_user->ready_lock, so updates are always protected. > + */ > +int kevent_user_ring_add_event(struct kevent *k) > +{ > + unsigned int pidx, off; > + struct kevent_mring *ring, *copy_ring; > + > + ring = k->user->pring[0]; > + > + if ((ring->kidx + 1 == ring->uidx) || > + ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx > == 0)) { > + if (k->user->overflow_kevent == NULL) > + k->user->overflow_kevent = k; > + return -EAGAIN; > + } > + > > I really dont understand how you manage to queue multiple kevents in the > 'overflow list'. You just queue one kevent at most. What am I missing ? There is no overflow list - it is a pointer to the first kevent in the ready queue, which was not put into ring buffer. It is an optimisation, which allows to not search for that position each time new event should be placed into the buffer, when it starts to have an empty slot. > > >+ > >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) { > >+ u->pring[i] = (struct kevent_mring > >*)__get_free_page(GFP_KERNEL); > >+ if (!u->pring[i]) > >+ break; > >+ } > >+ > >+ if (i != KEVENT_MAX_PAGES) > >+ goto err_out_free; > > Why dont you use goto directly ? > > if (!u->pring[i]) > goto err_out_free; > I used a fallback mode here which allowed using a smaller number of pages for the kevent ring buffer, but then decided to drop it. So it is possible to use goto directly. > >+ > >+ u->pring[0]->uidx = u->pring[0]->kidx = 0; > >+ > >+ return 0; > >+ > >+err_out_free: > >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) { > >+ if (!u->pring[i]) > >+ break; > >+ > >+ free_page((unsigned long)u->pring[i]); > >+ } > >+ return k; > >+} > >+ > > > > > >+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, > >void __user *arg) > >+{ > >+ int err, cerr = 0, knum = 0, rnum = 0, i; > >+ void __user *orig = arg; > >+ struct ukevent uk; > >+ > >+ mutex_lock(&u->ctl_mutex); > >+ > >+ err = -EINVAL; > >+ if (num > KEVENT_MIN_BUFFS_ALLOC) { > >+ struct ukevent *ukev; > >+ > >+ ukev = kevent_get_user(num, arg); > >+ if (ukev) { > >+ for (i = 0; i < num; ++i) { > >+ err = kevent_user_add_ukevent(&ukev[i], u); > >+ if (err) { > >+ kevent_stat_im(u); > >+ if (i != rnum) > >+ memcpy(&ukev[rnum], > >&ukev[i], sizeof(struct ukevent)); > >+ rnum++; > >+ } else > >+ knum++; > > Why are you using/counting knum ? It should go away.
> >+ } > >+ if (copy_to_user(orig, ukev, rnum*sizeof(struct > >ukevent))) > >+ cerr = -EFAULT; > >+ kfree(ukev); > >+ goto out_setup; > >+ } > >+ } > >+ > >+ for (i = 0; i < num; ++i) { > >+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { > >+ cerr = -EFAULT; > >+ break; > >+ } > >+ arg += sizeof(struct ukevent); > >+ > >+ err = kevent_user_add_ukevent(&uk, u); > >+ if (err) { > >+ kevent_stat_im(u); > >+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) > >{ > >+ cerr = -EFAULT; > >+ break; > >+ } > >+ orig += sizeof(struct ukevent); > >+ rnum++; > >+ } else > >+ knum++; > >+ } > >+ > >+out_setup: > >+ if (cerr < 0) { > >+ err = cerr; > >+ goto out_remove; > >+ } > >+ > >+ err = rnum; > >+out_remove: > >+ mutex_unlock(&u->ctl_mutex); > >+ > >+ return err; > >+} > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 10:53 ` Evgeniy Polyakov @ 2006-10-28 12:36 ` Eric Dumazet 2006-10-28 13:03 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 12:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >> >> I really dont understand how you manage to queue multiple kevents in the >> 'overflow list'. You just queue one kevent at most. What am I missing ? > > There is no overflow list - it is a pointer to the first kevent in the > ready queue, which was not put into ring buffer. It is an optimisation, > which allows to not search for that position each time new event should > be placed into the buffer, when it starts to have an empty slot. This overflow list (you may call it differently, but still it IS a list), is not complete. I feel you add it just to make me happy, but I am not (yet :) ) For example, you make no test at kevent_finish_user_complete() time. Obviously, you can have a dangling pointer, and crash your box in certain conditions. static void kevent_finish_user_complete(struct kevent *k, int deq) { struct kevent_user *u = k->user; unsigned long flags; if (deq) kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); if (k->flags & KEVENT_READY) { + if (u->overflow_kevent == k) { + /* MUST do something to change u->overflow_kevent */ + } list_del(&k->ready_entry); k->flags &= ~KEVENT_READY; u->ready_num--; } spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); call_rcu(&k->rcu_head, kevent_free_rcu); } Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 12:36 ` Eric Dumazet @ 2006-10-28 13:03 ` Evgeniy Polyakov 2006-10-28 13:23 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:03 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > Evgeniy Polyakov a écrit : > >On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet > >(dada1@cosmosbay.com) wrote: > >> > >>I really dont understand how you manage to queue multiple kevents in the > >>'overflow list'. You just queue one kevent at most. What am I missing ? > > > >There is no overflow list - it is a pointer to the first kevent in the > >ready queue, which was not put into ring buffer. It is an optimisation, > >which allows to not search for that position each time new event should > >be placed into the buffer, when it starts to have an empty slot. > > This overflow list (you may call it differently, but still it IS a list), > is not complete. I feel you add it just to make me happy, but I am not (yet > :) ) There is no overflow list. There is a ready queue, part of which (the first several entries) is copied into the ring buffer; overflow_kevent is a pointer to the first kevent which was not copied. > For example, you make no test at kevent_finish_user_complete() time. > > Obviously, you can have a dangling pointer, and crash your box in certain > conditions. You are right, I did not put overflow_kevent check into all places which can remove kevent. Here is a patch I am about to commit into the kevent tree: diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 711a8a8..ecee668 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h } /* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } +} + +/* * Complete kevent removing - it dequeues kevent from storage list * if it is requested, removes kevent from ready list, drops userspace * control block reference counter and schedules kevent freeing through RCU.
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete( kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); - if (k->flags & KEVENT_READY) { - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - } + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea spin_lock_irqsave(&u->ready_lock, flags); if (u->ready_num && !list_empty(&u->ready_list)) { k = list_entry(u->ready_list.next, struct kevent, ready_entry); - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) - u->pring[0]->uidx = 0; - - if (u->overflow_kevent) { - int err; - - err = kevent_user_ring_add_event(u->overflow_kevent); - if (!err) { - if (u->overflow_kevent->ready_entry.next == &u->ready_list) - u->overflow_kevent = NULL; - else - u->overflow_kevent = - list_entry(u->overflow_kevent->ready_entry.next, - struct kevent, ready_entry); - } - } + kevent_remove_ready(k); } spin_unlock_irqrestore(&u->ready_lock, flags); It tries to put next kevent into the ring and thus update overflow_kevent if new kevent has been put into the buffer or kevent being removed is overflow kevent. Patch depends on committed changes of returned error numbers and unused variables cleanup, it will be included into next patchset if there are no problems with it. -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:03 ` Evgeniy Polyakov @ 2006-10-28 13:23 ` Eric Dumazet 2006-10-28 13:28 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 13:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >> Evgeniy Polyakov a écrit : >>> On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet >>> (dada1@cosmosbay.com) wrote: >>>> I really dont understand how you manage to queue multiple kevents in the >>>> 'overflow list'. You just queue one kevent at most. What am I missing ? >>> There is no overflow list - it is a pointer to the first kevent in the >>> ready queue, which was not put into ring buffer. It is an optimisation, >>> which allows to not search for that position each time new event should >>> be placed into the buffer, when it starts to have an empty slot. >> This overflow list (you may call it differently, but still it IS a list), >> is not complete. I feel you add it just to make me happy, but I am not (yet >> :) ) > > There is no overflow list. > There is a ready queue, part of which (the first several entries) is copied > into the ring buffer; overflow_kevent is a pointer to the first kevent which > was not copied. > >> For example, you make no test at kevent_finish_user_complete() time. >> >> Obviously, you can have a dangling pointer, and crash your box in certain >> conditions. > > You are right, I did not put overflow_kevent check into all places which > can remove kevent. > > Here is a patch I am about to commit into the kevent tree: > > diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c > index 711a8a8..ecee668 100644 > --- a/kernel/kevent/kevent_user.c > +++ b/kernel/kevent/kevent_user.c > @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h > } > > /* > + * Must be called under u->ready_lock. > + * This function removes kevent from ready queue and > + * tries to add new kevent into ring buffer. > + */ > +static void kevent_remove_ready(struct kevent *k) > +{ > + struct kevent_user *u = k->user; > + > + list_del(&k->ready_entry); Arg... no You cannot call list_del() , then check overflow_kevent. I you call list_del on what happens to be the kevent pointed by overflow_kevent, you loose...
> @@ -248,11 +278,8 @@ static void kevent_finish_user_complete( > kevent_dequeue(k); > > spin_lock_irqsave(&u->ready_lock, flags); > - if (k->flags & KEVENT_READY) { > - list_del(&k->ready_entry); > - k->flags &= ~KEVENT_READY; > - u->ready_num--; > - } > + if (k->flags & KEVENT_READY) > + kevent_remove_ready(k); > spin_unlock_irqrestore(&u->ready_lock, flags); > > kevent_user_put(u); > @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea > spin_lock_irqsave(&u->ready_lock, flags); > if (u->ready_num && !list_empty(&u->ready_list)) { > k = list_entry(u->ready_list.next, struct kevent, ready_entry); > - list_del(&k->ready_entry); > - k->flags &= ~KEVENT_READY; > - u->ready_num--; > - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) > - u->pring[0]->uidx = 0; > - > - if (u->overflow_kevent) { > - int err; > - > - err = kevent_user_ring_add_event(u->overflow_kevent); > - if (!err) { > - if (u->overflow_kevent->ready_entry.next == &u->ready_list) > - u->overflow_kevent = NULL; > - else > - u->overflow_kevent = > - list_entry(u->overflow_kevent->ready_entry.next, > - struct kevent, ready_entry); > - } > - } > + kevent_remove_ready(k); > } > spin_unlock_irqrestore(&u->ready_lock, flags); > ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:23 ` Eric Dumazet @ 2006-10-28 13:28 ` Evgeniy Polyakov 2006-10-28 13:34 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:28 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > >diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c > >index 711a8a8..ecee668 100644 > >--- a/kernel/kevent/kevent_user.c > >+++ b/kernel/kevent/kevent_user.c > >@@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h > > } > > > > /* > >+ * Must be called under u->ready_lock. > >+ * This function removes kevent from ready queue and > >+ * tries to add new kevent into ring buffer. > >+ */ > >+static void kevent_remove_ready(struct kevent *k) > >+{ > >+ struct kevent_user *u = k->user; > >+ > >+ list_del(&k->ready_entry); > > Arg... no > > You cannot call list_del() , then check overflow_kevent. > > I you call list_del on what happens to be the kevent pointed by > overflow_kevent, you loose... This function is always called from appropriate context, where it is guaranteed that it is safe to call list_del: 1. when kevent is removed. It is called after check, that given kevent is in the ready queue. 2. when dequeued from ready queue, which means that it can be removed from that queue. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:28 ` Evgeniy Polyakov @ 2006-10-28 13:34 ` Eric Dumazet 2006-10-28 13:47 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 13:34 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >>> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c >>> index 711a8a8..ecee668 100644 >>> --- a/kernel/kevent/kevent_user.c >>> +++ b/kernel/kevent/kevent_user.c >>> @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h >>> } >>> >>> /* >>> + * Must be called under u->ready_lock. >>> + * This function removes kevent from ready queue and >>> + * tries to add new kevent into ring buffer. >>> + */ >>> +static void kevent_remove_ready(struct kevent *k) >>> +{ >>> + struct kevent_user *u = k->user; >>> + >>> + list_del(&k->ready_entry); >> Arg... no >> >> You cannot call list_del() , then check overflow_kevent. >> >> I you call list_del on what happens to be the kevent pointed by >> overflow_kevent, you loose... > > This function is always called from appropriate context, where it is > guaranteed that it is safe to call list_del: > 1. when kevent is removed. It is called after check, that given kevent > is in the ready queue. > 2. when dequeued from ready queue, which means that it can be removed > from that queue. > Could you please check the list_del() function ? file include/linux/list.h static inline void list_del(struct list_head *entry) { __list_del(entry->prev, entry->next); entry->next = LIST_POISON1; entry->prev = LIST_POISON2; } So, after calling list_del(&k->ready_entry); next and prev are basically destroyed. So when you write later : + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } then you have a problem, since list_entry(k->ready_entry.next, struct kevent, ready_entry); will give you garbage. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:34 ` Eric Dumazet @ 2006-10-28 13:47 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:47 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 03:34:52PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > >>>+ list_del(&k->ready_entry); > >>Arg... no > >> > >>You cannot call list_del() , then check overflow_kevent. > >> > >>I you call list_del on what happens to be the kevent pointed by > >>overflow_kevent, you loose... > > > >This function is always called from appropriate context, where it is > >guaranteed that it is safe to call list_del: > >1. when kevent is removed. It is called after check, that given kevent > >is in the ready queue. > >2. when dequeued from ready queue, which means that it can be removed > >from that queue. > > > > Could you please check the list_del() function ? > > file include/linux/list.h > > static inline void list_del(struct list_head *entry) > { > __list_del(entry->prev, entry->next); > entry->next = LIST_POISON1; > entry->prev = LIST_POISON2; > } > > So, after calling list_del(&k->read_entry); > next and prev are basically destroyed. > > So when you write later : > > + if (!err || u->overflow_kevent == k) { > + if (u->overflow_kevent->ready_entry.next == &u->ready_list) > + u->overflow_kevent = NULL; > + else > + u->overflow_kevent = + > list_entry(u->overflow_kevent->ready_entry.next, + > struct kevent, ready_entry); > + } > > > then you have a problem, since > > list_entry(k->ready_entry.next, struct kevent, ready_entry); > > will give you garbage. Ok, I understand you now. To remove this issue we can delete entry from the list after all checks with overflow_kevent pointer are completed, i.e. have something like this: diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 711a8a8..f3fec9b 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h } /* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; +} + +/* * Complete kevent removing - it dequeues kevent from storage list * if it is requested, removes kevent from ready list, drops userspace * control block reference counter and schedules kevent freeing through RCU. 
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete( kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); - if (k->flags & KEVENT_READY) { - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - } + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea spin_lock_irqsave(&u->ready_lock, flags); if (u->ready_num && !list_empty(&u->ready_list)) { k = list_entry(u->ready_list.next, struct kevent, ready_entry); - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) - u->pring[0]->uidx = 0; - - if (u->overflow_kevent) { - int err; - - err = kevent_user_ring_add_event(u->overflow_kevent); - if (!err) { - if (u->overflow_kevent->ready_entry.next == &u->ready_list) - u->overflow_kevent = NULL; - else - u->overflow_kevent = - list_entry(u->overflow_kevent->ready_entry.next, - struct kevent, ready_entry); - } - } + kevent_remove_ready(k); } spin_unlock_irqrestore(&u->ready_lock, flags); Thanks. > Eric -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
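The bug Eric caught reduces to an ordering rule worth stating in isolation: list_del() poisons the entry's next/prev pointers, so any neighbour lookup has to happen before the deletion, which is exactly why the final kevent_remove_ready() above moves list_del() after the overflow_kevent update. A condensed kernel-style sketch (not from the patch; struct kevent and the ready list are the patchset's):

#include <linux/list.h>

/* BROKEN: after list_del(), k->ready_entry.next == LIST_POISON1. */
static struct kevent *next_after_del(struct kevent *k)
{
	list_del(&k->ready_entry);
	return list_entry(k->ready_entry.next, struct kevent, ready_entry); /* garbage */
}

/* CORRECT: read the neighbour while k is still linked, unlink last. */
static struct kevent *next_before_del(struct kevent *k, struct list_head *ready_list)
{
	struct kevent *next = NULL;

	if (k->ready_entry.next != ready_list)
		next = list_entry(k->ready_entry.next, struct kevent, ready_entry);
	list_del(&k->ready_entry);
	return next;
}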
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov @ 2006-10-27 16:42 ` Evgeniy Polyakov 2006-11-07 11:26 ` Jeff Garzik 2 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:42 UTC (permalink / raw) To: johnpol Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2305 bytes --] On Fri, Oct 27, 2006 at 08:10:01PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Generic event handling mechanism. > > Consider for inclusion. > > Changes from 'take20' patchset: > * new ring buffer implementation A test userspace application can be found in the archive on the project's homepage. It is also attached to this mail. Short design notes about the ring buffer implementation. The ring buffer is designed so that the first ready kevent will be at the ring->uidx position, and all other ready events will be in FIFO order after it. So when we need to commit num events, we should just remove the first num kevents from the ready queue and commit them. We do not use any special locking to protect this function against simultaneous running - kevent dequeueing is atomic, and we do not care about the order in which events were committed. An example: thread 1 and thread 2 simultaneously call kevent_wait() to commit 2 and 3 events. It is possible that the first thread will commit events 0 and 2 while the second thread will commit events 1, 3 and 4. If there were only 3 ready events, then one of the calls will return a smaller number of committed events than was requested. The ring->uidx update is atomic, since it is protected by u->ready_lock, which removes the race with kevent_user_ring_add_event(). If the user asks to commit events which have been removed by kevent_get_events() recently (for example when one thread looked into the ring indexes and started to commit events which were simultaneously committed by another thread through kevent_get_events()), kevent_wait() will not commit unprocessed events, but will return the number of actually committed events instead. It is forbidden to try to commit events not from the start of the buffer, but from some 'further' event. An example: if ready events use positions 2-5, it is permitted to start to commit 3 events from position 0, in this case positions 0 and 1 will be omitted and only the event in position 2 will be committed and kevent_wait() will return 1, since only one event was actually committed. It is forbidden to try to commit from position 4; 0 will be returned. This means that if some events were committed using kevent_get_events(), they will not be counted; instead userspace should check the ring index and try to commit again.
-- 
	Evgeniy Polyakov

[-- Attachment #2: evtest.c --]
[-- Type: text/plain, Size: 5070 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <linux/types.h>

#define PAGE_SIZE	4096

#include <linux/ukevent.h>

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name (type1 arg1, type2 arg2, type3 arg3) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3);\
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4);\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5);\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5, type6 arg6) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5, arg6);\
}

_syscall4(int, kevent_ctl, int, arg1, unsigned int, argv2, unsigned int, argv3, void *, argv4);
_syscall6(int, kevent_get_events, int, arg1, unsigned int, argv2, unsigned int, argv3, __u64, argv4, void *, argv5, unsigned, arg6);
_syscall4(int, kevent_wait, int, arg1, unsigned int, arg2, unsigned int, argv3, __u64, argv4);

#define ulog(f, a...) fprintf(stderr, "%8u: "f, (unsigned int)time(NULL), ##a)
#define ulog_err(f, a...) ulog(f ": %s [%d].\n", ##a, strerror(errno), errno)

static void usage(char *p)
{
	ulog("Usage: %s -t type -e event -o oneshot -p path -n wait_num -f kevent_file -r ready_num -h\n", p);
}

/* Map KEVENT_MAX_PAGES ring buffer pages read-only, one page per mmap(). */
static int evtest_mmap(int fd, struct kevent_mring **ring, int number)
{
	int i;
	off_t o = 0;

	for (i = 0; i < number; ++i) {
		ring[i] = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, o);
		if (ring[i] == MAP_FAILED) {
			ulog_err("Failed to mmap: i: %d, number: %u, offset: %lu",
					i, number, (unsigned long)o);
			return -ENOMEM;
		}
		printf("mmap: %d: number: %u, offset: %lu.\n", i, number, (unsigned long)o);
		o += PAGE_SIZE;
	}

	return 0;
}

int main(int argc, char *argv[])
{
	int ch, fd, err, oneshot, wait_num;
	unsigned int i, ready_num, old_idx, new_idx, tm_sec, tm_nsec;
	char *file;
	char buf[4096];
	struct ukevent *uk;
	struct mukevent *m;
	struct kevent_mring *ring[KEVENT_MAX_PAGES];
	off_t offset;

	oneshot = 0;
	wait_num = 10;
	offset = 0;
	old_idx = 0;
	file = "/dev/kevent";
	tm_sec = 2;
	tm_nsec = 0;
	ready_num = 1;

	while ((ch = getopt(argc, argv, "r:f:t:T:o:n:h")) > 0) {
		switch (ch) {
			case 'f':
				file = optarg;
				break;
			case 'r':
				ready_num = atoi(optarg);
				break;
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 't':
				tm_sec = atoi(optarg);
				break;
			case 'T':
				tm_nsec = atoi(optarg);
				break;
			case 'o':
				oneshot = atoi(optarg);
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	fd = open(file, O_RDWR);
	if (fd == -1) {
		ulog_err("Failed to create kevent control block using file %s", file);
		return -1;
	}

	err = evtest_mmap(fd, ring, KEVENT_MAX_PAGES);
	if (err)
		return err;

	/* Add ready_num timer kevents; each fires after tm_sec seconds. */
	memset(buf, 0, sizeof(buf));
	for (i = 0; i < ready_num; ++i) {
		uk = (struct ukevent *)buf;

		uk->event = KEVENT_TIMER_FIRED;
		uk->type = KEVENT_TIMER;
		if (oneshot)
			uk->req_flags |= KEVENT_REQ_ONESHOT;
		uk->user[0] = i;
		uk->id.raw[0] = tm_sec;
		uk->id.raw[1] = tm_nsec + i;

		err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, uk);
		if (err < 0) {
			ulog_err("Failed to perform control operation: oneshot: %d, sec: %u, nsec: %u",
					oneshot, tm_sec, tm_nsec);
			close(fd);
			return err;
		}
		if (err) {
			ulog("%d: %016llx: ret_flags: 0x%x, ret_data: %u %d.\n",
					i, uk->id.raw_u64, uk->ret_flags,
					uk->ret_data[0], (int)uk->ret_data[1]);
		}
	}

	old_idx = ready_num = 0;

	while (1) {
		/* kidx is the kernel's producer index, uidx the commit index. */
		new_idx = ring[0]->kidx;
		old_idx = ring[0]->uidx;

		if (new_idx != old_idx) {
			ready_num = (old_idx > new_idx) ?
				(KEVENT_MAX_EVENTS - (old_idx - new_idx)) :
				(new_idx - old_idx);
			ulog("mmap: new: %u, old: %u, ready: %u.\n", new_idx, old_idx, ready_num);

			for (i = 0; i < ready_num; ++i) {
				/* Walk each ready event in the mapped ring;
				 * pages hold KEVENTS_ON_PAGE entries apiece. */
				unsigned int idx = (old_idx + i) % KEVENT_MAX_EVENTS;

				m = &ring[idx / KEVENTS_ON_PAGE]->event[idx % KEVENTS_ON_PAGE];
				ulog("%08x: %08x.%08x - %08x\n", i, m->id.raw[0], m->id.raw[1], m->ret_flags);
			}
		}

		ulog("going to wait: old: %u, new: %u, ready_num: %u, uidx: %u, kidx: %u.\n",
				old_idx, new_idx, ready_num, ring[0]->uidx, ring[0]->kidx);

		err = kevent_wait(fd, old_idx, ready_num, 10000000000ULL);
		if (err < 0) {
			if (errno != EAGAIN) {
				ulog_err("Failed to perform control operation: oneshot: %d, sec: %u, nsec: %u",
						oneshot, tm_sec, tm_nsec);
				close(fd);
				return err;
			}
			old_idx = (old_idx + ready_num) % KEVENT_MAX_EVENTS;
			ready_num = 0;
		}
		ulog("wait: old: %u, ready: %u, ret: %d.\n", old_idx, ready_num, err);
	}

	close(fd);
	return 0;
}

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov 2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-11-07 11:26 ` Jeff Garzik 2006-11-07 11:46 ` Jeff Garzik 2006-11-07 11:51 ` Evgeniy Polyakov 2 siblings, 2 replies; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 11:26 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
> 
> Consider for inclusion.
> 
> Changes from 'take20' patchset:
>  * new ring buffer implementation
>  * removed artificial limit on possible number of kevents
> With this release and fixed userspace web server it was possible to
> achive 3960+ req/s with client connection rate of 4000 con/s
> over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> is too close to wire speed if we get into account headers and the like.

OK, now that ring buffer is here, I definitely like the direction this code is taking. I just committed the patches to a local repo for a good in-depth review.

Could you write up a simple text file, documenting (a) your proposed syscalls and (b) your ring buffer design?

Overall I have a Linux "design wish" that I hope kevent can fulfill:

To develop completely async applications (generally network servers, in Linux-land) and increase the chance of zero-copy I/O, network and file I/O submission and completion should be as async as possible.

As such, syscalls themselves have become a serializing bottleneck that isn't strictly necessary. A fully-async application should be able to submit file read, file write, and network write requests asynchronously... in batches. Network reads and file I/O completions should be received asynchronously, potentially in batches.

Even with epoll and AIO syscalls, Linux isn't quite up to the task.

So to me, the design of the userspace interface that solves this problem is a fundamental issue.

My best guess at a solution would be two classes of mmap'd ring buffers, request and response. Let the app allocate one or more. Then have two hooks, (a) kick the kernel to read the request ring, and (b) kick the app when one or more events have arrived on a ring.

But that's just thinking out loud. I welcome any solution that gives userspace a fully-async submission/completion interface for both network and file I/O.

Setting the standard for a good interface here means Linux will kick ass for decades more to come ;-) This is IMO a Big Deal(tm).

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
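To make the request/response ring idea above concrete, one possible shape of such a pair, purely as a sketch: every name and field below is invented for illustration and is not taken from any posted patch.

#include <linux/types.h>

/* Hypothetical submission entry: the app fills these into a mmap'd
 * request ring and kicks the kernel once per batch. */
struct kev_sqe {
	__u32	opcode;		/* file read/write, net send, ... */
	__u32	fd;
	__u64	off;
	__u64	len;
	__u64	user_data;	/* echoed back on completion */
};

/* Hypothetical completion entry, produced by the kernel on the
 * response ring and consumed by the app, again in batches. */
struct kev_cqe {
	__s64	res;		/* bytes transferred or -errno */
	__u64	user_data;
};

/* Shared index block at the head of the mapped area. */
struct kev_ring_pair {
	__u32	sq_head, sq_tail;	/* app produces, kernel consumes */
	__u32	cq_head, cq_tail;	/* kernel produces, app consumes */
	/* entry arrays follow in the mapped area */
};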
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:26 ` Jeff Garzik @ 2006-11-07 11:46 ` Jeff Garzik 2006-11-07 11:58 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 11:46 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

As an aside... this may be useful. Or not.

Al Viro had an interesting idea about kernel<->userspace data passing interfaces. He had suggested creating a task-specific filesystem derived from ramfs. Through the normal VFS/VM codepaths, the user can easily create [subject to resource/priv checks] a buffer that is locked into the pagecache. Using mmap, read, write, whatever they prefer. Derive from tmpfs, and the buffers are swappable.

Then it would be a simple matter to associate a file stored in "keventfs" with a ring buffer guaranteed to be pagecache-friendly.

Heck, that might make zero-copy easier in some cases, too. And using a filesystem would mean that you could do all this without adding syscalls, by using special (poll-able!) files in the filesystem for control and notification purposes.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:46 ` Jeff Garzik @ 2006-11-07 11:58 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 11:58 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 06:46:58AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> As an aside... this may be useful. Or not.
> 
> Al Viro had an interesting idea about kernel<->userspace data passing
> interfaces. He had suggested creating a task-specific filesystem
> derived from ramfs. Through the normal VFS/VM codepaths, the user can
> easily create [subject to resource/priv checks] a buffer that is locked
> into the pagecache. Using mmap, read, write, whatever they prefer.
> Derive from tmpfs, and the buffers are swappable.

It looks like Al likes filesystems more than any other part of the kernel tree...

The existing ring buffer is created in the process' memory, so it is swappable too (which is probably the most significant part of this ring buffer version), but in theory a kevent file descriptor can be obtained not from the char device but from a special filesystem (it was actually done that way in the first releases, but then I was asked to remove that functionality).

> Then it would be a simple matter to associate a file stored in
> "keventfs" with a ring buffer guaranteed to be pagecache-friendly.
> 
> Heck, that might make zero-copy easier in some cases, too. And using a
> filesystem would mean that you could do all this without adding
> syscalls, by using special (poll-able!) files in the filesystem for
> control and notification purposes.

There are many ideas about networking zero-copy, for both sending and receiving, and some of them are even implemented at different layers (from a special allocator down to splice() with an additional single allocation/copy).

> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:26 ` Jeff Garzik 2006-11-07 11:46 ` Jeff Garzik @ 2006-11-07 11:51 ` Evgeniy Polyakov 2006-11-07 12:17 ` Jeff Garzik 2006-11-07 12:32 ` Jeff Garzik 1 sibling, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 11:51 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 06:26:09AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >
> >Consider for inclusion.
> >
> >Changes from 'take20' patchset:
> > * new ring buffer implementation
> > * removed artificial limit on possible number of kevents
> >With this release and fixed userspace web server it was possible to
> >achive 3960+ req/s with client connection rate of 4000 con/s
> >over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> >is too close to wire speed if we get into account headers and the like.
> 
> OK, now that ring buffer is here, I definitely like the direction this
> code is taking. I just committed the patches to a local repo for a good
> in-depth review.

It is the third ring buffer; the fourth one will be in the next release, which should satisfy everyone.

> Could you write up a simple text file, documenting (a) your proposed
> syscalls and (b) your ring buffer design?

An initial draft describing the supported syscalls can be found on the documentation page at
http://linux-net.osdl.org/index.php/Kevent

Ring buffer background bits are pasted below (quotations from my blog; do not pay too much attention if something is occasionally out of sync).

The new ring buffer is implemented fully in userspace, in the process' memory, which means that no memory is pinned, the buffer can have almost any size, and several threads and processes can access it simultaneously. There is a new system call

int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num);

which initializes kevent's ring buffer (ctl_fd is a kevent file descriptor, ring is a userspace-allocated ring buffer, and num is the maximum number of events (struct ukevent) which can be placed into that buffer).

The ring buffer is described with the following structure:

struct kevent_ring
{
	unsigned int		ring_kidx, ring_uidx;
	struct ukevent		event[0];
};

where ring_kidx is the kernel's last position (i.e. the position just past the last kevent the kernel put into the ring buffer) and ring_uidx is the last userspace commit position (i.e. the position where the first unread kevent lives), respectively. I will release an appropriate userspace test application when tests are completed.

When a kevent is removed (not dequeued when it is ready, but just removed), it is not copied into the ring buffer even if it was ready: if it is removed, no one cares about it (otherwise the user would wait until it became ready and get it the usual way using kevent_get_events() or kevent_wait()), and thus there is no need to copy it to the ring buffer.

Dequeueing a kevent (calling kevent_get_events()) means that the user has processed the previously dequeued kevent and is ready to process a new one, so the position in the ring buffer previously occupied by that event can be reused by the currently dequeued event.
In a world where only one type of syscall is used to get events (either the usual way with kevent_get_events(), or the ring buffer with kevent_wait()) this should not be a problem, since kevent_wait() only allows marking a number of events as processed by userspace starting from the beginning (i.e. from the last processed event). But if several threads use different models, that can raise some questions: for example, one thread can start to read events from the ring buffer while another thread calls kevent_get_events(), which can rewrite those events. Actually, another thread can call kevent_wait() to commit those events (i.e. mark them as processed by userspace so the kernel can free or requeue them), so appropriate locking is required in userspace in any case.

So, to repeat: with the userspace ring buffer it is possible for events in the ring buffer to be replaced without the knowledge of the thread currently reading them (when another thread calls kevent_get_events() or kevent_wait()), so appropriate locking between threads or processes which can simultaneously access the same ring buffer is required.

Having a userspace ring buffer allows glibc to turn all kevent syscalls into so-called 'cancellation points': when a thread is cancelled in a kevent syscall, the thread can be safely removed and no events will be lost, since each syscall copies events into the special ring buffer, accessible from other threads or even processes (if shared memory is used).

> Overall I have a Linux "design wish" that I hope kevent can fulfill:
> 
> To develop completely async applications (generally network servers, in
> Linux-land) and increase the chance of zero-copy I/O, network and file
> I/O submission and completion should be as async as possible.
> 
> As such, syscalls themselves have become a serializing bottleneck that
> isn't strictly necessary. A fully-async application should be able to
> submit file read, file write, and network write requests
> asynchronously... in batches. Network reads and file I/O completions
> should be received asynchronously, potentially in batches.
> 
> Even with epoll and AIO syscalls, Linux isn't quite up to the task.
> 
> So to me, the design of the userspace interface that solves this problem
> is a fundamental issue.
> 
> My best guess at a solution would be two classes of mmap'd ring buffers,
> request and response. Let the app allocate one or more. Then have two
> hooks, (a) kick the kernel to read the request ring, and (b) kick the
> app when one or more events have arrived on a ring.

The mmap ring buffer implementation was stopped by Andrew Morton and Ulrich Drepper; process memory is used instead. copy_to_user() is slower (sometimes noticeably), but there are major advantages to such an approach.

> But that's just thinking out loud. I welcome any solution that gives
> userspace a fully-async submission/completion interface for both network
> and file I/O.

Well, kevent network and FS AIO are suspended for now (although the first patches included them all).

> Setting the standard for a good interface here means Linux will kick ass
> for decades more to come ;-) This is IMO a Big Deal(tm).
> 
> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
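A minimal consumer sketch of the interface described above, under stated assumptions: the struct kevent_ring layout and kevent_ring_init() semantics are taken from the description in this mail and may differ in the eventual patchset; kevent_wait() is assumed to be wrapped as in evtest.c, to commit from the reader's position as explained, and to treat a zero timeout as non-blocking; process_event() stands in for application code.

#include <linux/types.h>
#include <linux/ukevent.h>

struct kevent_ring {
	unsigned int	ring_kidx;	/* kernel's producer position */
	unsigned int	ring_uidx;	/* userspace commit position */
	struct ukevent	event[0];
};

void process_event(struct ukevent *uk);	/* app-defined */

/* Drain and commit all currently ready events from a ring that was
 * registered earlier with kevent_ring_init(ctl_fd, ring, num).
 * Locking between threads sharing the ring is the caller's job. */
static void drain_ring(int ctl_fd, struct kevent_ring *ring, unsigned int num)
{
	unsigned int kidx = ring->ring_kidx, uidx = ring->ring_uidx;
	unsigned int ready = (uidx > kidx) ? (num - (uidx - kidx)) : (kidx - uidx);
	unsigned int i;

	for (i = 0; i < ready; ++i)
		process_event(&ring->event[(uidx + i) % num]);

	/* Mark those slots as processed so the kernel can reuse them. */
	kevent_wait(ctl_fd, uidx, ready, 0 /* assumed: do not block */);
}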
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:51 ` Evgeniy Polyakov @ 2006-11-07 12:17 ` Jeff Garzik 2006-11-07 12:29 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 12:17 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> Well, kevent network and FS AIO are suspended for now (although first

Why?

IMO, getting async event submission right is important. It should be designed in parallel with async event reception.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 12:17 ` Jeff Garzik @ 2006-11-07 12:29 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 12:29 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 07:17:03AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Evgeniy Polyakov wrote:
> >Well, kevent network and FS AIO are suspended for now (although first
> 
> Why?
> 
> IMO, getting async event submission right is important. It should be
> designed in parallel with async event reception.

It was not only designed but also implemented, but...

FS AIO was confirmed to have a correct design, but there were minor (from my point of view) layering problems (it was all but suggested that I give myself a lobotomy after I put a get_block() callback into address_space_operations; there was also some code duplication of mpage_readpages() in async form in kevent/kevent_aio.c - I did that to separate kevent as much as possible; both changes could live in fs/ with an appropriate callback export).

I postponed network AIO for a while; looking at how hard it is to get core changes accepted, that seems like the better decision... Using Ulrich's DMA allocation API (if it existed as more than a proposal) it would be possible to speed NAIO up a bit more.

A kevent-based FS AIO patch can be found for example here (it contains the full kevent subsystem with network AIO and FS AIO):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent_full.diff.3

Network AIO homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:51 ` Evgeniy Polyakov 2006-11-07 12:17 ` Jeff Garzik @ 2006-11-07 12:32 ` Jeff Garzik 2006-11-07 19:34 ` Andrew Morton 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 12:32 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> The mmap ring buffer implementation was stopped by Andrew Morton and
> Ulrich Drepper; process memory is used instead. copy_to_user() is
> slower (sometimes noticeably), but there are major advantages to such
> an approach.

hmmmm. I say there are advantages to both.

Perhaps create a "kevent_direct_limit" resource limit for each thread. By default, each thread could mmap $n pinned pagecache pages. Sysadmin can tune certain app resource limits to permit more.

I would think that retaining the option to avoid copy_to_user() -somehow- in -some- cases would be wise.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 12:32 ` Jeff Garzik @ 2006-11-07 19:34 ` Andrew Morton 2006-11-07 20:52 ` David Miller 0 siblings, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-07 19:34 UTC (permalink / raw)
To: Jeff Garzik
Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, linux-kernel, Linus Torvalds

On Tue, 07 Nov 2006 07:32:20 -0500 Jeff Garzik <jeff@garzik.org> wrote:

> Evgeniy Polyakov wrote:
> > The mmap ring buffer implementation was stopped by Andrew Morton and
> > Ulrich Drepper; process memory is used instead. copy_to_user() is
> > slower (sometimes noticeably), but there are major advantages to such
> > an approach.
> 
> hmmmm. I say there are advantages to both.

My problem with the old mmapped ringbuffer was that it permitted each user to pin (typically) 48MB of unswappable memory. Plus this pinned-memory problem would put upper bounds on the ring size.

> Perhaps create a "kevent_direct_limit" resource limit for each thread.
> By default, each thread could mmap $n pinned pagecache pages. Sysadmin
> can tune certain app resource limits to permit more.
> 
> I would think that retaining the option to avoid copy_to_user()
> -somehow- in -some- cases would be wise.

What Evgeniy means here is that copy_to_user() is slower than memcpy() (on his machine, with his kernel config, at least).

Which is kinda weird and unexpected and is something which we should investigate independently from this project. (Rather than simply going and bypassing it!)

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 19:34 ` Andrew Morton @ 2006-11-07 20:52 ` David Miller 2006-11-07 21:38 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread
From: David Miller @ 2006-11-07 20:52 UTC (permalink / raw)
To: akpm; +Cc: jeff, johnpol, drepper, netdev, linux-kernel, torvalds

From: Andrew Morton <akpm@osdl.org>
Date: Tue, 7 Nov 2006 11:34:00 -0800

> What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> his machine, with his kernel config, at least).
> 
> Which is kinda weird and unexpected and is something which we should
> investigate independently from this project. (Rather than simply going
> and bypassing it!)

It's straightforward to me. :-)

If the kernel memcpy()'s, it uses those nice 4MB PTE mappings to the kernel pages. With copy_to_user() you run through tiny 4K or 8K PTE mappings which thrash the TLB.

The TLB is therefore able to hold more of the accessed state at a time if you touch the pages on the kernel side.

^ permalink raw reply	[flat|nested] 200+ messages in thread
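A back-of-the-envelope illustration of the point above (using the 48MB figure Andrew mentioned as the old pinned ring size; the numbers are only indicative): covering 48MB through the kernel's 4MB direct mapping costs 48/4 = 12 large-page TLB entries, while walking the same 48MB through 4KB user PTEs costs 48*1024/4 = 12288 entries, far more than a TLB of that era could hold, so the user-side walk keeps evicting and refilling entries while the kernel-side copy does not.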
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 20:52 ` David Miller @ 2006-11-07 21:38 ` Andrew Morton 0 siblings, 0 replies; 200+ messages in thread
From: Andrew Morton @ 2006-11-07 21:38 UTC (permalink / raw)
To: David Miller; +Cc: jeff, johnpol, drepper, netdev, linux-kernel, torvalds

On Tue, 07 Nov 2006 12:52:41 -0800 (PST) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@osdl.org>
> Date: Tue, 7 Nov 2006 11:34:00 -0800
> 
> > What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> > his machine, with his kernel config, at least).
> > 
> > Which is kinda weird and unexpected and is something which we should
> > investigate independently from this project. (Rather than simply going
> > and bypassing it!)
> 
> It's straightforward to me. :-)
> 
> If the kernel memcpy()'s, it uses those nice 4MB PTE mappings to
> the kernel pages. With copy_to_user() you run through tiny
> 4K or 8K PTE mappings which thrash the TLB.
> 
> The TLB is therefore able to hold more of the accessed state at
> a time if you touch the pages on the kernel side.

Maybe. Evgeniy tends to favour teeny microbenchmarks.

I'd also be suspecting the considerable setup code in the x86 uaccess functions. That would show up in a tight loop doing large numbers of small copies.

^ permalink raw reply	[flat|nested] 200+ messages in thread
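One way to investigate this independently is a microbenchmark comparing many small kernel-mediated copies against plain memcpy() of the same size. A sketch only (assumptions: ./data is a page-cache-hot file; the pread() loop also includes syscall entry cost, so one would vary SZ to separate fixed setup cost from per-byte cost; this is not the benchmark Evgeniy ran):

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	enum { SZ = 64, N = 1 << 20 };
	static char src[SZ], dst[SZ];
	int i, fd = open("./data", O_RDONLY);	/* assumed hot in page cache */
	double t;

	if (fd < 0)
		return 1;

	t = now();
	for (i = 0; i < N; ++i)
		pread(fd, dst, SZ, 0);	/* one copy_to_user() per call */
	printf("pread  x %u: %.3f sec\n", (unsigned)N, now() - t);

	t = now();
	for (i = 0; i < N; ++i)
		memcpy(dst, src, SZ);	/* pure userspace copy */
	printf("memcpy x %u: %.3f sec (%d)\n", (unsigned)N, now() - t, dst[0]);

	close(fd);
	return 0;
}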
* [take22 0/4] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov ` (3 subsequent siblings) 5 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables, whitespace and so on)
 * fixed a bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which is close to wire speed if we take into account headers and the like.

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use an RB tree instead of a hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, which justifies RB tree usage over an AVL tree, since the latter has much slower deletion (max O(log(N)) compared to 3 ops), although faster search (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second, compared to 2200 and sometimes 2500 req/second for epoll(), for a trivial web-server and httperf client on the same hardware. It is possible that the kevent limit above is due to the maximum number of kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created a special macro which is used for mmap size (number of pages) calculation
 * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been updated)
 * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
   This syscall waits until either the timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there were no registered callbacks for the given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop, which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80-line comment issues
 * added a header shared between userspace and kernelspace instead of embedding them in one
 * core restructuring to remove forward declarations
 * s o m e  w h i t e s p a c e  c o d i n g  s t y l e  c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
   - use nopage() method to dynamically substitute pages
   - allocate a new page for events only when a newly added kevent requires it
   - do not use ugly index dereferencing, use a structure instead
   - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return the number of ready kevents even if copy*user() fails
 * strict check for the number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comment fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so it was taught to differentiate between various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at helping when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
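For reference while reading the patch below, a condensed sketch of the non-mmap usage pattern these syscalls give (assumptions: the kevent_ctl()/kevent_get_events() wrappers are generated by the _syscallN macros exactly as in the evtest.c attachment earlier in this thread, and the timer-event encoding follows that example):

#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/ukevent.h>

/* kevent_ctl()/kevent_get_events() wrappers as in evtest.c above. */

int main(void)
{
	struct ukevent uk, ready[16];
	int err, fd = open("/dev/kevent", O_RDWR);

	if (fd == -1)
		return 1;

	/* Add one 2-second one-shot timer kevent. */
	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;
	uk.id.raw[0] = 2;	/* seconds, as in evtest.c */
	uk.id.raw[1] = 0;	/* nanoseconds */

	err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);

	/* Block until at least one event is ready (timeout is in
	 * nanoseconds), returning up to 16 ready events. */
	err = kevent_get_events(fd, 1, 16, 3000000000ULL, ready, 0);

	close(fd);
	return err < 0;
}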
* [take22 1/4] kevent: Core files. 2006-11-01 11:36 ` [take22 " Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 2006-11-01 11:36 ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 322
 
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..743b328
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select storage has a list of wait_queue_t containers
+	 * for each ->poll() { poll_wait()' } here.
+	 */
+	void			*priv;
+};
+
+struct kevent_user
+{
+	struct rb_root		kevent_root;
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	/* First kevent which was not put into ring buffer due to overflow.
+	 * It will be copied into the buffer, when first event will be removed
+	 * from ready queue (and thus there will be an empty place in the
+	 * ring buffer).
+	 */
+	struct kevent		*overflow_kevent;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num, mmap_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+int kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n",
+			__func__, u, u->wait_num, u->mmap_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_mmap(struct kevent_user *u)
+{
+	u->mmap_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
#else
+#define kevent_stat_print(u)	({ (void) u;})
+#define kevent_stat_init(u)	({ (void) u;})
+#define kevent_stat_im(u)	({ (void) u;})
+#define kevent_stat_wait(u)	({ (void) u;})
+#define kevent_stat_mmap(u)	({ (void) u;})
+#define kevent_stat_total(u)	({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;	/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;		/* List of queued kevents. */
+	spinlock_t		lock;		/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..71a758f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct ukevent;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru
 					    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node,
 			   struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..daa8202
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,163 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV	0x1
+#define KEVENT_SOCKET_ACCEPT	0x2
+#define KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE	0x1
+#define KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN	0x0001
+#define KEVENT_POLL_POLLPRI	0x0002
+#define KEVENT_POLL_POLLOUT	0x0004
+#define KEVENT_POLL_POLLERR	0x0008
+#define KEVENT_POLL_POLLHUP	0x0010
+#define KEVENT_POLL_POLLNVAL	0x0020
+
+#define KEVENT_POLL_POLLRDNORM	0x0040
+#define KEVENT_POLL_POLLRDBAND	0x0080
+#define KEVENT_POLL_POLLWRNORM	0x0100
+#define KEVENT_POLL_POLLWRBAND	0x0200
+#define KEVENT_POLL_POLLMSG	0x0400
+#define KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO		0x1
+
+#define KEVENT_MASK_ALL		0xffffffff	/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY	0x0		/* Empty mask of ready events. */
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_PAGES	2
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
+ * and we reuse 8 bytes at the beginning of the page to store the kidx/uidx
+ * indexes. Take that into account if you want to change the size of
+ * struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		kidx, uidx;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only for sanitizing of the kevent_wait() input data - do not
+ * allow user to specify number of events more than it is possible to place
+ * into ring buffer. This does not limit number of events which can be
+ * put into kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS	(KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables event queue mechanism.
+	  It can be used as replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes total number of kevent, number of kevents
+	  which are ready immediately at insertion time and number of kevents
+	  which were removed through readiness completion.
+	  It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of
+	  sockets operations, like new packet receiving conditions,
+	  ready for accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..25404d3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, that will remove them from storage,
+ * so storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..f3fec9b
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1004 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	if ((ring->kidx + 1 == ring->uidx) ||
+			((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+		if (k->user->overflow_kevent == NULL)
+			k->user->overflow_kevent = k;
+		return -EAGAIN;
+	}
+
+	pidx = ring->kidx/KEVENTS_ON_PAGE;
+	off = ring->kidx%KEVENTS_ON_PAGE;
+
+	if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+		printk(KERN_ERR "%s: kidx: %u, uidx: %u, on_page: %lu, pidx: %u.\n",
+				__func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+		return -EINVAL;
+	}
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->kidx >= KEVENT_MAX_EVENTS)
+		ring->kidx = 0;
+
+	return 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace could get them directly instead
+ * of using syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int i;
+
+	u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i) {
+		u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[i])
+			goto err_out_free;
+	}
+
+	u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+	return 0;
+
+err_out_free:
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i) {
+		if (!u->pring[i])
+			break;
+
+		free_page((unsigned long)u->pring[i]);
+	}
+
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+ */ +static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma) +{ + unsigned long start = vma->vm_start, off = vma->vm_pgoff / PAGE_SIZE; + struct kevent_user *u = file->private_data; + + if (off >= KEVENT_MAX_PAGES) + return -EINVAL; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vma->vm_flags |= VM_RESERVED; + vma->vm_file = file; + + if (vm_insert_page(vma, start, virt_to_page(u->pring[off]))) + return -EFAULT; + + return 0; +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + if (deq) + kevent_dequeue(k); + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); + + kevent_user_put(u); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. 
+ */ +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_remove_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + return k; +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. 
+ */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. + */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + break; + + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. 
+ */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent misc device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue
+ * and removes/requeues the requested number of events (commits them).
+ * It returns the number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - number of the first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies the number of nanoseconds to wait until
+ * 		there is free space in the kevent queue.
+ *
+ * The ring buffer is designed in a way that the first ready kevent will be at
+ * @ring->uidx position, and all other ready events will be in FIFO order after it.
+ * So when we need to commit @num events, it means we should just remove the first
+ * @num kevents from the ready queue and commit them. We do not use any special
+ * locking to protect this function against simultaneous running - kevent
+ * dequeueing is atomic, and we do not care about the order in which events
+ * were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that the first thread will commit
+ * events 0 and 2 while the second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a smaller
+ * number of committed events than was requested.
+ * The ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes the race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by
+ * kevent_get_events() recently (for example when one thread looked at the
+ * ring indexes and started to commit events which were simultaneously
+ * committed by another thread through kevent_get_events()), kevent_wait()
+ * will not commit unprocessed events, but will return the number of actually
+ * committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer,
+ * but from some 'further' event.
+ *
+ * An example: if ready events use positions 2-5,
+ * it is permitted to start to commit 3 events from position 0;
+ * in this case positions 0 and 1 will be omitted, only the event in position 2
+ * will be committed, and kevent_wait() will return 1, since only one event was
+ * actually committed. It is forbidden to try to commit from position 4; 0 will
+ * be returned. This means that if some events were committed using
+ * kevent_get_events(), they will not be counted; instead userspace should
+ * check the ring index and try to commit again.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, committed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_mring *ring;
+	unsigned int i, actual;
+	unsigned long flags;
+
+	if (num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring[0];
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	/* kidx == uidx means an empty ring, so it must take the first branch. */
+	actual = (ring->kidx >= ring->uidx)?
+			(ring->kidx - ring->uidx):
+			(KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+	if (actual < num)
+		num = actual;
+
+	if (start < ring->uidx) {
+		/*
+		 * Some events have been committed through kevent_get_events().
+		 *
+		 *                    ready events
+		 * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+		 *        ring->uidx                      ring->kidx
+		 *      |          |
+		 *    start    start+num
+		 */
+		unsigned int diff = ring->uidx - start;
+
+		if (num < diff)
+			num = 0;
+		else
+			num -= diff;
+	} else if (start > ring->uidx)
+		num = 0;
+
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	for (i=0; i<num; ++i) {
+		k = kqueue_dequeue_ready(u);
+		if (!k)
+			break;
+
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		kevent_stat_mmap(u);
+		committed++;
+	}
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+				u->ready_num >= 1,
+				clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return committed;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on a given kevent queue, which is obtained through the kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to an array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device used to obtain control file descriptors.
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..bc0582b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,10 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
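A minimal userspace sketch of the interface above, for orientation. The
syscall numbers, the /dev/kevent node name, the KEVENT_CTL_ADD value and the
struct layouts below are assumptions reconstructed from the patch text; the
shared linux/ukevent.h header and the evtest.c example on the project's
homepage are authoritative:

/*
 * Sketch of the kevent userspace API from the patch above.
 * ASSUMPTIONS: syscall numbers, /dev/kevent and the struct layouts are
 * reconstructed guesses, not copies of the real linux/ukevent.h.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define __NR_kevent_get_events	318	/* placeholder, arch dependent */
#define __NR_kevent_ctl		320	/* placeholder, arch dependent */

#define KEVENT_CTL_ADD		0	/* placeholder value */

struct kevent_id {
	union {
		__u32	raw[2];
		__u64	raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;	/* fd, timeout, ... - type specific */
	__u32	type;		/* KEVENT_POLL, KEVENT_SOCKET, KEVENT_TIMER, ... */
	__u32	event;		/* requested event mask */
	__u32	req_flags;	/* KEVENT_REQ_ONESHOT and friends */
	__u32	ret_flags;	/* KEVENT_RET_DONE / KEVENT_RET_BROKEN */
	__u32	ret_data[2];	/* event-specific return data */
	union {
		__u32	user[2];	/* opaque user data, returned untouched */
		void	*ptr;
	};
};

static int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg)
{
	return syscall(__NR_kevent_ctl, fd, cmd, num, arg);
}

static int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
		__u64 timeout_ns, struct ukevent *buf, unsigned flags)
{
	return syscall(__NR_kevent_get_events, fd, min_nr, max_nr, timeout_ns, buf, flags);
}

int main(void)
{
	struct ukevent uk[16];
	int fd, i, num;

	fd = open("/dev/kevent", O_RDWR);	/* misc device registered above */
	if (fd == -1)
		return 1;

	/* ... fill one or more ukevents and submit them: ... */
	/* kevent_ctl(fd, KEVENT_CTL_ADD, n, uk); */

	/* Wait up to one second for at least one ready event. */
	num = kevent_get_events(fd, 1, 16, 1000000000ULL, uk, 0);
	for (i = 0; i < num; ++i)
		printf("event %u: ret_flags 0x%x\n", uk[i].user[0], uk[i].ret_flags);

	close(fd);
	return 0;
}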
* [take22 2/4] kevent: poll/select() notifications.
  2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-11-01 11:36   ` Evgeniy Polyakov
  2006-11-01 11:36     ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similar to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
process wakeup, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..94facbb
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
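To make the registration path concrete: a sketch of arming a KEVENT_POLL
event from userspace, reusing the hypothetical struct ukevent and
kevent_ctl() wrapper sketched after the core patch. The KEVENT_POLL and
KEVENT_REQ_ONESHOT values are placeholders; kevent_poll_enqueue() above
resolves the target file with fget(id.raw[0]):

#include <poll.h>
#include <string.h>

#define KEVENT_POLL		3	/* placeholder value */
#define KEVENT_REQ_ONESHOT	0x1	/* placeholder value */

/* Arm a poll-style readiness event on an arbitrary file descriptor. */
static int kevent_add_poll(int ctl_fd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = watched_fd;	/* kevent_poll_enqueue() does fget(id.raw[0]) */
	uk.event = POLLIN | POLLRDNORM;	/* mask matched against f_op->poll() result */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* dequeue after first delivery */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}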
* [take22 3/4] kevent: Socket notifications.
  2006-11-01 11:36   ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-01 11:36     ` Evgeniy Polyakov
  2006-11-01 11:36       ` [take22 4/4] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead of
epoll, its performance increased more than noticeably. More details about
the various benchmarks and the server itself (evserver_kevent.c) can be
found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..5040b4c --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,129 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
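A sketch of the accept-notification path this patch adds, again reusing the
hypothetical wrappers from the note after the core patch. The KEVENT_SOCKET
and KEVENT_SOCKET_* values are placeholders; kevent_socket_enqueue() above
resolves the socket with sockfd_lookup(id.raw[0]):

#include <string.h>

#define KEVENT_SOCKET		2	/* placeholder value */
#define KEVENT_SOCKET_RECV	0x1	/* placeholder values */
#define KEVENT_SOCKET_ACCEPT	0x2
#define KEVENT_SOCKET_SEND	0x4

/* Ask for a notification when a listening socket has pending connections. */
static int kevent_watch_accept(int ctl_fd, int listen_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.id.raw[0] = listen_fd;	/* kevent_socket_enqueue() does sockfd_lookup() */
	uk.event = KEVENT_SOCKET_ACCEPT;

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

/*
 * The event loop itself stays epoll-like: each returned ukevent still
 * requires a normal accept()/recv()/send() call on the original fd.
 */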
* [take22 4/4] kevent: Timer notifications. 2006-11-01 11:36 ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..04acc46 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,113 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer); + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + 
struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
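Given the id.raw[] convention stated above (seconds in raw[0], nanoseconds in
raw[1]), arming a periodic timer might look like the sketch below, once more
reusing the hypothetical wrappers from the core-patch note; the KEVENT_TIMER
value is a placeholder. Per kevent_timer_callback() above, each firing
returns jiffies_to_msecs(jiffies) in ret_data[0], and hrtimer_forward()
re-arms the timer, so the event keeps firing until it is removed:

#include <string.h>

#define KEVENT_TIMER		4	/* placeholder value */

/* Arm a periodic 250 ms high-resolution timer. */
static int kevent_add_timer(int ctl_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.id.raw[0] = 0;			/* seconds */
	uk.id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}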
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 11:36 ` [take22 " Evgeniy Polyakov
  2006-11-01 11:36   ` [take22 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-11-01 13:06   ` Pavel Machek
  2006-11-01 13:25     ` Evgeniy Polyakov
  2006-11-01 16:07     ` James Morris
  1 sibling, 2 replies; 200+ messages in thread
From: Pavel Machek @ 2006-11-01 13:06 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Hi!

> Generic event handling mechanism.
> 
> Consider for inclusion.
> 
> Changes from 'take21' patchset:

We are not interested in how many times you spammed us, nor do we want
to know what was wrong in previous versions. It would be nice to have
a short summary of what this is good for, instead.

								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 13:06   ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek
@ 2006-11-01 13:25     ` Evgeniy Polyakov
  2006-11-01 16:05       ` Pavel Machek
  2006-11-01 16:07     ` James Morris
  1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 13:25 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

On Wed, Nov 01, 2006 at 02:06:14PM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!
> 
> > Generic event handling mechanism.
> > 
> > Consider for inclusion.
> > 
> > Changes from 'take21' patchset:
> 
> We are not interested in how many times you spammed us, nor do we want
> to know what was wrong in previous versions. It would be nice to have
> a short summary of what this is good for, instead.

Let me guess, a short explanation in the subsequent emails is not
enough... If the changelog is removed, then how will people see what
happened since the previous release?

Kevent is a generic subsystem which allows handling of event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows working with essentially any kind of events.
Events are provided to the kernel through a control syscall and can be read
back through an mmapped ring or a syscall.
Kevent update (i.e. readiness switching) happens directly from the internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).

I will put that text into the introduction message.

> Pavel
> -- 
> Thanks, Sharp!

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
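For the mmapped-ring path mentioned above, consumption could look roughly
like the sketch below. The kevent_mring layout is reconstructed from
kevent_user_ring_add_event() in the core patch (uidx/kidx indexes on page 0
plus truncated events, KEVENTS_ON_PAGE per page); the exact structure, the
KEVENTS_ON_PAGE value and the kevent_wait() syscall number are assumptions.
Note the core patch maps the ring read-only (VM_WRITE is rejected), so
committing has to go through the syscall:

#include <sys/mman.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define __NR_kevent_wait	321	/* placeholder, arch dependent */
#define KEVENTS_ON_PAGE		170	/* placeholder, derived from PAGE_SIZE */

struct mring_event {
	struct kevent_id id;		/* which kevent fired (see earlier sketch) */
	__u32	ret_flags;		/* KEVENT_RET_* for that firing */
};

struct kevent_mring {
	__u32			uidx;	/* first not yet committed ready event */
	__u32			kidx;	/* next slot the kernel will fill */
	struct mring_event	event[KEVENTS_ON_PAGE];
};

static int consume_ring(int ctl_fd)
{
	struct kevent_mring *ring;
	unsigned int start, num;

	/* Page 0 carries the indexes; later pages only hold more events. */
	ring = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, ctl_fd, 0);
	if (ring == MAP_FAILED)
		return -1;

	start = ring->uidx;
	num = (ring->kidx >= start) ? ring->kidx - start : 0; /* wrap ignored */

	/* ... process ring->event[start] through ring->event[start + num - 1] ... */

	/* Commit the processed slots so the kernel can reuse them. */
	return syscall(__NR_kevent_wait, ctl_fd, start, num, (__u64)0);
}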
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 13:25     ` Evgeniy Polyakov
@ 2006-11-01 16:05       ` Pavel Machek
  2006-11-01 16:24         ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Pavel Machek @ 2006-11-01 16:05 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Hi!

> > > Generic event handling mechanism.
> > > 
> > > Consider for inclusion.
> > > 
> > > Changes from 'take21' patchset:
> > 
> > We are not interested in how many times you spammed us, nor do we want
> > to know what was wrong in previous versions. It would be nice to have
> > a short summary of what this is good for, instead.
> 
> Let me guess, a short explanation in the subsequent emails is not
> enough...

Yes.

> Kevent is a generic subsystem which allows handling of event notifications.
> It supports both level and edge triggered events. It is similar to
> poll/epoll in some cases, but it is more scalable, it is faster and
> allows working with essentially any kind of events.

Quantifying "how much more scalable" would be nice, as would be some
example where it is useful. ("It makes my webserver twice as fast on
monster 64-cpu box").

> Events are provided to the kernel through a control syscall and can be read
> back through an mmapped ring or a syscall.
> Kevent update (i.e. readiness switching) happens directly from the internals
> of the appropriate state machine of the underlying subsystem (like
> network, filesystem, timer or any other).
> 
> I will put that text into the introduction message.

Thanks.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 16:05       ` Pavel Machek
@ 2006-11-01 16:24         ` Evgeniy Polyakov
  2006-11-01 18:13           ` Oleg Verych
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 16:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

On Wed, Nov 01, 2006 at 05:05:51PM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!

Hi Pavel.

> > Kevent is a generic subsystem which allows handling of event notifications.
> > It supports both level and edge triggered events. It is similar to
> > poll/epoll in some cases, but it is more scalable, it is faster and
> > allows working with essentially any kind of events.
> 
> Quantifying "how much more scalable" would be nice, as would be some
> example where it is useful. ("It makes my webserver twice as fast on
> monster 64-cpu box").

Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
1Gb RAM; an epoll-based one does 2200-2500 req/sec. A 100 Mbit wire is
filled almost 100% (10582.7 KB/s of data without TCP and lower-layer
headers).
More benchmarks created by me and Johann Borck can be found on the
project's homepage, along with all of my sources used in the tests.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 16:24         ` Evgeniy Polyakov
@ 2006-11-01 18:13           ` Oleg Verych
  2006-11-01 18:57             ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Oleg Verych @ 2006-11-01 18:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

Hallo, Evgeniy Polyakov.

On 2006-11-01, you wrote:
[]
>> Quantifying "how much more scalable" would be nice, as would be some
>> example where it is useful. ("It makes my webserver twice as fast on
>> monster 64-cpu box").
>
> Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
[...]

Seriously. I'm seeing those patches too. New, shiny, always ready "for
inclusion". But considering the kernel (Linux in this case) as not a
thing unto itself, I want to ask the following question.

Where's the real-life application to configure && make && make install?

There were some comments about the lack of such programs; answers were
"was in prev. e-mail", "need to update them", something like that.
The "trivial web server" source URL mentioned in the benchmark isn't
pointed to in the patch advertisement. If it were, should I actually
try that new *trivial* wheel?

Saying that, I want to give you some short examples I know.
*Linux kernel <-> userspace*:
 o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
 o Maxim Krasnyansky tun net driver <-> vtun daemon application;

*Glibc with mister Drepper* has a huge set of tests; please search for
`tst*' files in the sources.

To give you a little hint, Evgeniy: why don't you find a little
animal in the open source zoo to implement a little interface to the
proposed kernel subsystem and then show it to The Big Jury (not me)
we have here? And I cannot see how you've managed to implement
something like that having almost nothing in the test basket.
Very *suspicious* ch.

One that comes to mind is lighttpd <http://www.lighttpd.net/>.
It had a sub-interface for event systems like select, poll and epoll
when I last checked its sources. And it is mature, btw.

Cheers.

[ -*- OT -*- ]
[ I wouldn't write all this, unless I saw your opinion about the    ]
[ reportbug (part of the Debian Bug Tracking System) this week.     ]
[ While I'm nobody here, imho, the first thing about a good         ]
[ programmer must be that he is an excellent user.                  ]
____

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 18:13           ` Oleg Verych
@ 2006-11-01 18:57             ` Evgeniy Polyakov
  2006-11-02  2:12               ` Nate Diller
  2006-11-03 18:49               ` Oleg Verych
  1 sibling, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 18:57 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
	Chase Venters, Johann Borck

On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote:
> 
> Hallo, Evgeniy Polyakov.

Hello, Oleg.

> On 2006-11-01, you wrote:
> []
> >> Quantifying "how much more scalable" would be nice, as would be some
> >> example where it is useful. ("It makes my webserver twice as fast on
> >> monster 64-cpu box").
> >
> > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
> [...]
> 
> Seriously. I'm seeing those patches too. New, shiny, always ready "for
> inclusion". But considering the kernel (Linux in this case) as not a
> thing unto itself, I want to ask the following question.
> 
> Where's the real-life application to configure && make && make install?

Your real life or mine as a developer?
I fortunately do not know anything about your real life, but my real-life
applications can be found on the project's homepage.
There is a link to an archive there, where you can find plenty of sources.
You likely do not know, but it is a risky business to patch all
existing applications to show that the approach is correct while the
implementation is not complete.
You likely do not know, but after I first announced kevents in
February I changed the interfaces 4 times - and that is just interfaces,
not including numerous features added/removed by developers' requests.

> There were some comments about the lack of such programs; answers were
> "was in prev. e-mail", "need to update them", something like that.
> The "trivial web server" source URL mentioned in the benchmark isn't
> pointed to in the patch advertisement. If it were, should I actually
> try that new *trivial* wheel?

The answer is trivial - there is an archive where one can find the source
code (filenames are posted regularly). Should I create an rpm? For which
glibc version?

> Saying that, I want to give you some short examples I know.
> *Linux kernel <-> userspace*:
>  o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;

iproute documentation was way too bad when Alexey first presented
it :)

>  o Maxim Krasnyansky tun net driver <-> vtun daemon application;
> 
> *Glibc with mister Drepper* has a huge set of tests; please search for
> `tst*' files in the sources.

Btw, show me a 'shiny' splice() application? Does lighttpd use it?
Or move_pages()?

> To give you a little hint, Evgeniy: why don't you find a little
> animal in the open source zoo to implement a little interface to the
> proposed kernel subsystem and then show it to The Big Jury (not me)
> we have here? And I cannot see how you've managed to implement
> something like that having almost nothing in the test basket.
> Very *suspicious* ch.

There are always people who do not like something; what can I do about
that? I present the code, we discuss it, I ask for inclusion (since it is
the only way to get feedback), something requires changes, it is changed
and so on - it is the development process.
I created a 'little animal in the open source zoo' myself to show how
simple kevents are.

> One that comes to mind is lighttpd <http://www.lighttpd.net/>.
> It had a sub-interface for event systems like select, poll and epoll
> when I last checked its sources. And it is mature, btw.

As I have already said several times, I changed just the interfaces 4
times already, since no one seems to know what we really want and how
the interface should look.
You suggest patching lighttpd? Well, it is doable, but then I will be
asked to change apache and nginx. And then someone will suggest changing
the order of parameters. Will you help me rewrite userspace? No, you
will not. You ask for something without providing anything back (not
taking into account code, but discussion, ideas, testing time, nothing),
and you do it in an ultimatum-like manner.
Btw, kevent also supports AIO notifications - do you suggest patching
reactor/proactor for tests? It supports network AIO - do you suggest
writing support for that into apache? What about timers? It is possible
to rewrite all POSIX timer users to use them instead. There is a feature
request for userspace events and signal delivery - what to do with that?
I created trivial web servers which send a single static page and use
various event handling schemes, and I test the new subsystem with new
tools; when tests are completed and all requested features are
implemented, it will be time to work on different, more complex users.
So let's at least complete what we have right now, so that no
developer's efforts are wasted writing empty chars in various places.

> Cheers.
> 
> [ -*- OT -*- ]
> [ I wouldn't write all this, unless I saw your opinion about the    ]
> [ reportbug (part of the Debian Bug Tracking System) this week.     ]
> [ While I'm nobody here, imho, the first thing about a good         ]
> [ programmer must be that he is an excellent user.                  ]
> ____

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 18:57 ` Evgeniy Polyakov @ 2006-11-02 2:12 ` Nate Diller 2006-11-02 6:21 ` Evgeniy Polyakov ` (2 more replies) 2006-11-03 18:49 ` Oleg Verych 1 sibling, 3 replies; 200+ messages in thread From: Nate Diller @ 2006-11-02 2:12 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On 11/1/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote: > > > > Hallo, Evgeniy Polyakov. > > Hello, Oleg. > > > On 2006-11-01, you wrote: > > [] > > >> Quantifying "how much more scalable" would be nice, as would be some > > >> example where it is useful. ("It makes my webserver twice as fast on > > >> monster 64-cpu box"). > > > > > > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with > > [...] > > > > Seriously. I'm seeing those patches also. New, shiny, always ready "for > > inclusion". But considering the kernel (linux in this case) as not a thing > > for itself, i want to ask the following question. > > > > Where's a real-life application to do configure && make && make install? > > Your real life or mine as a developer? > I fortunately do not know anything about your real life, but my real-life > applications can be found on the project's homepage. > There is a link to an archive there, where you can find plenty of sources. > You likely do not know, but it is a risky business to patch all > existing applications to show that an approach is correct while the > implementation is not complete. > You likely do not know, but since I first announced kevents in > February I have changed the interfaces 4 times - and that is just the interfaces, not > counting the numerous features added/removed at developers' requests. > > > There were some comments about lacking much of such programs, answers were > > "was in prev. e-mail", "need to update them", something like that. > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > in the patch advertisement. If it was, should i actually try that new > > *trivial* wheel? > > The answer is trivial - there is an archive where one can find the source code > (filenames are posted regularly). Should I create an rpm? For which glibc > version? > > > Saying that, i want to give you some short examples, i know. > > *Linux kernel <-> userspace*: > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > iproute documentation was way too bad when Alexey presented it the first > time :) > > > o Maxim Krasnyansky tun net driver <-> vtun daemon application; > > > > *Glibc with mister Drepper* has a huge set of tests, please search for > > `tst*' files in the sources. > > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > Or move_pages(). > > > To make a little hint to you, Evgeniy, why don't you find a little > > animal in the open source zoo to implement a little interface to the > > proposed kernel subsystem and then show it to The Big Jury (not me), > > we have here? And i can not see, how you've managed to implement > > something like that having almost nothing on the test basket. > > Very *suspicious* ch. > > There are always people who do not like something - what can I do about > it? I present the code, we discuss it, I ask for inclusion (since it is > the only way to get feedback), something requires changes, it is changed, > and so on - that is the development process. 
> I created a 'little animal in the open source zoo' myself to show how > simple kevents are. > > > One that comes to mind is lighttpd <http://www.lighttpd.net/>. > > It had a sub-interface for event systems like select, poll, epoll, when i > > checked its sources last time. And it is mature, btw. > > As I already said several times, I have changed just the interfaces 4 times > already, since no one seems to know what we really want and how the > interface should look. Indecisiveness has certainly been an issue here, but I remember akpm and Ulrich both giving concrete suggestions. I was particularly interested in Andrew's request to explain and justify the differences between kevent and BSD's kqueue interface. Was there a discussion that I missed? I am very interested in seeing your work on this mechanism merged, because you've clearly emphasized performance and shown impressive results. But it seems like we lose a lot by throwing out all the applications that already use kqueue. NATE ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 2:12 ` Nate Diller @ 2006-11-02 6:21 ` Evgeniy Polyakov 2006-11-02 19:40 ` Nate Diller [not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com> 2006-11-07 12:02 ` Jeff Garzik 2 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-02 6:21 UTC (permalink / raw) To: Nate Diller Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller (nate.diller@gmail.com) wrote: > Indecisiveness has certainly been an issue here, but I remember akpm > and Ulrich both giving concrete suggestions. I was particularly > interested in Andrew's request to explain and justify the differences > between kevent and BSD's kqueue interface. Was there a discussion > that I missed? I am very interested in seeing your work on this > mechanism merged, because you've clearly emphasized performance and > shown impressive results. But it seems like we lose a lot by > throwing out all the applications that already use kqueue. It looks like you missed that discussion - the FreeBSD kevent structure has fields which have different sizes in 32-bit and 64-bit environments. > NATE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
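[Editor's illustration, not part of the original thread.] To make the size problem concrete, here is a minimal sketch: it reuses the *BSD struct kevent layout that Evgeniy quotes later in this discussion and simply prints its size. On a common ILP32 target the structure is typically 20 bytes, while on LP64 it is typically 32, because ident, data and udata all grow with the pointer size - which is exactly why a 32-bit userland cannot exchange it unchanged with a 64-bit kernel.

#include <stdio.h>
#include <stdint.h>

/* Layout copied from the BSD header quoted later in this thread. */
struct bsd_kevent {
	uintptr_t	ident;	/* identifier for this event */
	short		filter;	/* filter for event */
	unsigned short	flags;	/* action flags for kqueue */
	unsigned int	fflags;	/* filter flag value */
	intptr_t	data;	/* filter data value */
	void		*udata;	/* opaque user data identifier */
};

int main(void)
{
	/* Typically prints 20 on ILP32 builds and 32 on LP64 builds. */
	printf("sizeof(struct bsd_kevent) = %zu\n", sizeof(struct bsd_kevent));
	return 0;
}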
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 6:21 ` Evgeniy Polyakov @ 2006-11-02 19:40 ` Nate Diller 2006-11-03 8:42 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Nate Diller @ 2006-11-02 19:40 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On 11/1/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller (nate.diller@gmail.com) wrote: > > Indecisiveness has certainly been an issue here, but I remember akpm > > and Ulrich both giving concrete suggestions. I was particularly > > interested in Andrew's request to explain and justify the differences > > between kevent and BSD's kqueue interface. Was there a discussion > > that I missed? I am very interested in seeing your work on this > > mechanism merged, because you've clearly emphasized performance and > > shown impressive results. But it seems like we lose a lot by > > throwing out all the applications that already use kqueue. > > It looks like you missed that discussion - the FreeBSD kevent structure has > fields which have different sizes in 32-bit and 64-bit environments. Are you saying that the *only* reason we choose not to be source-compatible with BSD is the 32-bit userland on 64-bit arch problem? I've followed every thread that a gmail 'kqueue' search returns; which thread are you referring to? Nicholas Miell, in "The Proposed Linux kevent API" thread, seems to think that there are no advantages over kqueue to justify the incompatibility, an argument you made no effort to refute. I've also read the Kevent wiki at linux-net.osdl.org, but it too is lacking in any direct comparisons (even theoretical, let alone benchmarks) of the flexibility, performance, etc. between the two. I'm not arguing that you've done a bad design, I'm asking you to brag about the things you improved on vs. kqueue. Your emphasis on unifying all the different event types into one interface is really cool; fill me in on why that can't be effectively done with kqueue compatibility and I too will advocate for kevent inclusion. NATE ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 19:40 ` Nate Diller @ 2006-11-03 8:42 ` Evgeniy Polyakov 2006-11-03 8:57 ` Pavel Machek 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 8:42 UTC (permalink / raw) To: Nate Diller Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Thu, Nov 02, 2006 at 11:40:43AM -0800, Nate Diller (nate.diller@gmail.com) wrote: > Are you saying that the *only* reason we choose not to be > source-compatible with BSD is the 32-bit userland on 64-bit arch > problem? I've followed every thread that a gmail 'kqueue' search I.e., do you want a generic event handling mechanism that does not work on x86_64? I doubt you do. > returns; which thread are you referring to? Nicholas Miell, in "The > Proposed Linux kevent API" thread, seems to think that there are no > advantages over kqueue to justify the incompatibility, an argument you > made no effort to refute. I've also read the Kevent wiki at > linux-net.osdl.org, but it too is lacking in any direct comparisons > (even theoretical, let alone benchmarks) of the flexibility, > performance, etc. between the two. > > I'm not arguing that you've done a bad design, I'm asking you to brag > about the things you improved on vs. kqueue. Your emphasis on > unifying all the different event types into one interface is really > cool; fill me in on why that can't be effectively done with kqueue > compatibility and I too will advocate for kevent inclusion. kqueue just can not be used as-is in Linux (_maybe_ *bsd has different types, not those which I found in /usr/include on my FC5 and Debian distros). It will not work on x86_64, for example. A pointer or an unsigned long in a structure which is transferred between kernelspace and userspace is questionable enough that it is better not to go there at all... (without political correctness I would describe it in much stronger words). So, the kqueue API and structures can not be used in Linux. > NATE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:42 ` Evgeniy Polyakov @ 2006-11-03 8:57 ` Pavel Machek 2006-11-03 9:04 ` David Miller 2006-11-03 9:13 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Pavel Machek @ 2006-11-03 8:57 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Hi! > > returns; which thread are you referring to? Nicholas Miell, in "The > > Proposed Linux kevent API" thread, seems to think that there are no > > advantages over kqueue to justify the incompatibility, an argument you > > made no effort to refute. I've also read the Kevent wiki at > > linux-net.osdl.org, but it too is lacking in any direct comparisons > > (even theoretical, let alone benchmarks) of the flexibility, > > performance, etc. between the two. > > > > I'm not arguing that you've done a bad design, I'm asking you to brag > > about the things you improved on vs. kqueue. Your emphasis on > > unifying all the different event types into one interface is really > > cool; fill me in on why that can't be effectively done with kqueue > > compatibility and I too will advocate for kevent inclusion. > > kqueue just can not be used as-is in Linux (_maybe_ *bsd has different > types, not those which I found in /usr/include on my FC5 and Debian > distros). It will not work on x86_64, for example. A pointer > or an unsigned long in a structure which is transferred between kernelspace > and userspace is questionable enough that it is better not to go there > at all... (without political correctness I would > describe it in much stronger words). > So, the kqueue API and structures can not be used in Linux. Not sure what you are smoking, but "there's an unsigned long in the *bsd version, let's rewrite it from scratch" sounds like a very bad idea. What about fixing that one bit you don't like? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:57 ` Pavel Machek @ 2006-11-03 9:04 ` David Miller 2006-11-07 12:05 ` Jeff Garzik 2006-11-03 9:13 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-03 9:04 UTC (permalink / raw) To: pavel Cc: johnpol, nate.diller, linux-kernel, olecom, drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck From: Pavel Machek <pavel@ucw.cz> Date: Fri, 3 Nov 2006 09:57:12 +0100 > Not sure what you are smoking, but "there's an unsigned long in the *bsd > version, let's rewrite it from scratch" sounds like a very bad idea. What > about fixing that one bit you don't like? I disagree; it's more like: since we have to be structure-incompatible anyway, let's design something superior if we can. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 9:04 ` David Miller @ 2006-11-07 12:05 ` Jeff Garzik 0 siblings, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-07 12:05 UTC (permalink / raw) To: David Miller Cc: pavel, johnpol, nate.diller, linux-kernel, olecom, drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck David Miller wrote: > From: Pavel Machek <pavel@ucw.cz> > Date: Fri, 3 Nov 2006 09:57:12 +0100 > >> Not sure what you are smoking, but "there's an unsigned long in the *bsd >> version, let's rewrite it from scratch" sounds like a very bad idea. What >> about fixing that one bit you don't like? > > I disagree; it's more like: since we have to be structure-incompatible > anyway, let's design something superior if we can. Definitely agreed. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:57 ` Pavel Machek 2006-11-03 9:04 ` David Miller @ 2006-11-03 9:13 ` Evgeniy Polyakov 2006-11-05 11:19 ` Pavel Machek 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 9:13 UTC (permalink / raw) To: Pavel Machek Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > So, the kqueue API and structures can not be used in Linux. > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > version, let's rewrite it from scratch" sounds like a very bad idea. What > about fixing that one bit you don't like? It is not about what I like or dislike, but about what is broken and what is not. Putting a u64 instead of a long or something like that _is_ already incompatible, so why should we even use it? And, btw, what are we talking about? Is it about the whole of kevent compared to kqueue in kernelspace, or just about the structure being transferred between kernelspace and userspace? I'm sure it was some kind of joke to 'not rewrite *bsd from scratch and use kqueue in the Linux kernel as-is'. > Pavel > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 9:13 ` Evgeniy Polyakov @ 2006-11-05 11:19 ` Pavel Machek 2006-11-05 11:43 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Pavel Machek @ 2006-11-05 11:19 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Hi! On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote: > On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > > So, the kqueue API and structures can not be used in Linux. > > > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > > version, let's rewrite it from scratch" sounds like a very bad idea. What > > about fixing that one bit you don't like? > > It is not about what I like or dislike, but about what is broken and what is not. > Putting a u64 instead of a long or something like that _is_ already incompatible, > so why should we even use it? Well... u64 vs. unsigned long *is* binary incompatible, but it is similar enough that it is going to be compatible at the source level, or maybe a userland app will need *minor* ifdefs... That's better than two completely different versions... > And, btw, what are we talking about? Is it about the whole of kevent > compared to kqueue in kernelspace, or just about the structure being > transferred between kernelspace and userspace? > I'm sure it was some kind of joke to 'not rewrite *bsd from scratch > and use kqueue in the Linux kernel as-is'. No, it is probably not possible to take code from the BSD kernel and "just port it". But keeping the same or a similar userland interface would be nice. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-05 11:19 ` Pavel Machek @ 2006-11-05 11:43 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-05 11:43 UTC (permalink / raw) To: Pavel Machek Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Sun, Nov 05, 2006 at 12:19:33PM +0100, Pavel Machek (pavel@ucw.cz) wrote: > Hi! > > On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote: > > On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > > > So, the kqueue API and structures can not be used in Linux. > > > > > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > > > version, let's rewrite it from scratch" sounds like a very bad idea. What > > > about fixing that one bit you don't like? > > > > It is not about what I like or dislike, but about what is broken and what is not. > > Putting a u64 instead of a long or something like that _is_ already incompatible, > > so why should we even use it? > > Well... u64 vs. unsigned long *is* binary incompatible, but it is > similar enough that it is going to be compatible at the source level, or > maybe a userland app will need *minor* ifdefs... That's better than two > completely different versions... > > > And, btw, what are we talking about? Is it about the whole of kevent > > compared to kqueue in kernelspace, or just about the structure being > > transferred between kernelspace and userspace? > > I'm sure it was some kind of joke to 'not rewrite *bsd from scratch > > and use kqueue in the Linux kernel as-is'. > > No, it is probably not possible to take code from the BSD kernel and "just > port it". But keeping the same or a similar userland interface would be nice. It is not merely improbable - it is impossible to take the FreeBSD kqueue code and port it; such a port would be a completely different system. It is impossible to have the same event structure; one would have to write: #if defined kqueue /* fill all members of the structure */ #else if defined kevent /* fill differently named members, since Linux does not even have some of the types */ #endif *BSD kevent (the structure transferred between userspace and kernelspace): struct kevent { uintptr_t ident; /* identifier for this event */ short filter; /* filter for event */ u_short flags; /* action flags for kqueue */ u_int fflags; /* filter flag value */ intptr_t data; /* filter data value */ void *udata; /* opaque user data identifier */ }; You must fill all fields differently because of the above. Just an example: Linux kevent has an extended ID field which is grouped into type.event, while kqueue has a pointer-sized ident and a short filter. Linux kevent does not have filters; instead it has generic storages of events which can be processed in any way the origin of the storage wants (this, for example, allows the creation of aio_sendfile() (currently dropped from the patchset), which no other system in the wild has). There are too many differences. They are simply different systems. Even if both can be described by the sentence "a system which handles events", it does not mean that they are the same, can use the same structures, or even have a similar design. Kevent is not kqueue in any way (although there are certain similarities), so they can not share anything. > Pavel > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
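[Editor's sketch, not from the thread.] As an illustration of the divergence Evgeniy describes above, the following hedged example shows what the two fill paths might look like side by side. The kqueue side uses the standard EV_SET() macro; the kevent side is assumption-laden: the header name, the constants and the member names follow the take23 description posted later in this thread and may not match the real patchset headers exactly.

#include <string.h>

#ifdef USE_BSD_KQUEUE
#include <sys/event.h>

/* kqueue: the fd goes into the pointer-sized ident field and the
 * event class into the short filter field. */
static void fill_read_event(struct kevent *kev, int fd, void *udata)
{
	EV_SET(kev, fd, EVFILT_READ, EV_ADD, 0, 0, udata);
}
#else
#include <linux/ukevent.h>	/* assumed header name */

/* Linux kevent: the id is grouped into type/event instead of a single
 * pointer-sized ident, and user data is a fixed-size union rather than
 * a bare pointer. KEVENT_SOCK / SOCK_RECV follow the examples in the
 * take23 description and are illustrative, not authoritative. */
static void fill_read_event(struct ukevent *uk, int fd, void *udata)
{
	memset(uk, 0, sizeof(*uk));
	uk->id.raw[0] = fd;
	uk->type = KEVENT_SOCK;
	uk->event = SOCK_RECV;
	uk->ptr = udata;
}
#endif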
* Re: [take22 0/4] kevent: Generic event handling mechanism. [not found] ` <4549A261.9010007@cosmosbay.com> @ 2006-11-03 2:42 ` zhou drangon 2006-11-03 9:16 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: zhou drangon @ 2006-11-03 2:42 UTC (permalink / raw) To: Eric Dumazet Cc: linux-kernel, Evgeniy Polyakov, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, drangon.zhou 2006/11/2, Eric Dumazet <dada1@cosmosbay.com>: > zhou drangon a écrit : > > performance is great, and we are excited about the result. > > > > I want to know why there can be so much improvement; can we improve > > epoll too? > > Why did you remove most of the CC addresses but lkml? > Don't do that please... I seldom reply to the mailing list; sorry for this. > > Good question :) > > Hum, I think I can look into epoll and see how it can be improved (if necessary) > I have another question. For the VFS system, when we introduced the AIO mechanism, we added aio_read, aio_write, etc... to the file ops, and then made the read and write ops call aio_read and aio_write, so that only one implementation remains in the kernel. Can we do the event mechanism the same way? When kevent is robust enough, can we implement epoll/select/io_submit etc... based on kevent? In this way we could simplify the kernel, and epoll could gain improvement from kevent. > This is not to say we dont need kevent ! Please Evgeniy continue your work ! Yes! We are expecting your great work. I created a userland event-driven framework for my application, but I have to use multiple threads to receive events: epoll to wait for most events and io_getevents to wait for disk AIO events. I hope we can get a universal event mechanism to make the code elegant. > > Just to remind you that according to > http://www.xmailserver.org/linux-patches/nio-improve.html David Libenzi had to > wait 18 months before epoll was officially added into the kernel. > > At that time, many applications were using epoll, and we were patching our > kernels for that. > > > I cooked a very simple program (attached in this mail), using pipes and epoll, > and got 250.000 events received per second on an otherwise lightly loaded > machine (dual opteron 246, 2GHz, 1MB cache per cpu) with 10.000 pipes (20.000 > handles) > > It could be nice to add support for other event providers in this program > (AF_INET & AF_UNIX sockets for example), and also add support for kevent, so > that we really can compare epoll/kevent without a complex setup. > I should extend the program to also add/remove sources during its lifetime, not > only insert at setup time. > > # gcc -O2 -o epoll_pipe_bench epoll_pipe_bench.c -lpthread > # ulimit -n 1000000 > # epoll_pipe_bench -n 10000 > ^C after a while... > > oprofile results say that ep_poll_callback() and sys_epoll_wait() use 20% of > cpu time. > Even if we gain a factor of two in cpu time or cache usage, we won't eliminate > other costs... 
> > oprofile results gave : > > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit > mask of 0x00 (No unit mask) count 50000 > samples % symbol name > 2015420 11.1309 ep_poll_callback > 1867431 10.3136 pipe_writev > 1791872 9.8963 sys_epoll_wait > 1357297 7.4962 fget_light > 1277515 7.0556 pipe_readv > 998447 5.5143 current_fs_time > 801597 4.4271 __mark_inode_dirty > 755268 4.1713 __wake_up > 587065 3.2423 __write_lock_failed > 582931 3.2195 system_call > 297132 1.6410 iov_fault_in_pages_read > 296136 1.6355 sys_write > 290106 1.6022 __wake_up_common > 270692 1.4950 bad_pipe_w > 261516 1.4443 do_pipe > 257208 1.4205 tg3_start_xmit_dma_bug > 254917 1.4079 pipe_poll > 252925 1.3969 copy_user_generic_c > 234212 1.2935 generic_pipe_buf_map > 228659 1.2629 ret_from_sys_call > 212541 1.1738 sysret_check > 166529 0.9197 sys_read > 160038 0.8839 vfs_write > 151091 0.8345 pipe_ioctl > 136301 0.7528 file_update_time > 107173 0.5919 tg3_poll > 77846 0.4299 ipt_do_table > 75081 0.4147 schedule > 73059 0.4035 vfs_read > 69787 0.3854 get_task_comm > 63923 0.3530 memcpy > 60019 0.3315 touch_atime > 57490 0.3175 eventpoll_release_file > 56152 0.3101 tg3_write_flush_reg32 > 54468 0.3008 rw_verify_area > 47833 0.2642 generic_pipe_buf_unmap > 47777 0.2639 __switch_to > 44106 0.2436 bad_pipe_r > 41824 0.2310 proc_nr_files > 41319 0.2282 pipe_iov_copy_from_user > > > Eric > > > > /* > * How to stress epoll > * > * This program uses many pipes and two threads. > * First we open as many pipes we can. (see ulimit -n) > * Then we create a worker thread. > * The worker thread will send bytes to random pipes. > * The main thread uses epoll to collect ready pipes and read them. > * Each second, a number of collected bytes is printed on stderr > * > * Usage : epoll_bench [-n X] > */ > #include <pthread.h> > #include <stdlib.h> > #include <errno.h> > #include <stdio.h> > #include <string.h> > #include <sys/epoll.h> > #include <signal.h> > #include <unistd.h> > #include <sys/time.h> > > int nbpipes = 1024; > > struct pipefd { > int fd[2]; > } *tab; > > int epoll_fd; > > static int alloc_pipes() > { > int i; > > epoll_fd = epoll_create(nbpipes); > if (epoll_fd == -1) { > perror("epoll_create"); > return -1; > } > tab = malloc(sizeof(struct pipefd) * nbpipes); > if (tab ==NULL) { > perror("malloc"); > return -1; > } > for (i = 0 ; i < nbpipes ; i++) { > struct epoll_event ev; > if (pipe(tab[i].fd) == -1) > break; > ev.events = EPOLLIN | EPOLLOUT | EPOLLHUP | EPOLLPRI | EPOLLET; > ev.data.u64 = (uint64_t)i; > epoll_ctl(epoll_fd, EPOLL_CTL_ADD, tab[i].fd[0], &ev); > } > nbpipes = i; > printf("%d pipes setup\n", nbpipes); > return 0; > } > > > unsigned long nbhandled; > static void timer_func() > { > char buffer[32]; > size_t len; > static unsigned long old; > unsigned long delta = nbhandled - old; > old = nbhandled; > len = sprintf(buffer, "%lu\n", delta); > write(2, buffer, len); > } > > static void timer_setup() > { > struct itimerval it; > struct sigaction sg; > > memset(&sg, 0, sizeof(sg)); > sg.sa_handler = timer_func; > sigaction(SIGALRM, &sg, 0); > it.it_interval.tv_sec = 1; > it.it_interval.tv_usec = 0; > it.it_value.tv_sec = 1; > it.it_value.tv_usec = 0; > if (setitimer(ITIMER_REAL, &it, 0)) > perror("setitimer"); > } > > static void * worker_thread_func(void *arg) > { > int fd; > char c = 1; > for (;;) { > fd = rand() % nbpipes; > write(tab[fd].fd[1], &c, 1); > } > } > > > int main(int argc, char *argv[]) > { > char buff[1024]; > pthread_t tid; > int c; > > while ((c = getopt(argc, argv, 
"n:")) != EOF) { > if (c == 'n') nbpipes = atoi(optarg); > } > alloc_pipes(); > pthread_create(&tid, NULL, worker_thread_func, (void *)0); > timer_setup(); > > for (;;) { > struct epoll_event events[128]; > int nb = epoll_wait(epoll_fd, events, 128, 10000); > int i, fd; > for (i = 0 ; i < nb ; i++) { > fd = tab[events[i].data.u64].fd[0]; > if (read(fd, buff, 1024) > 0) > nbhandled++; > } > } > } > > > ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 2:42 ` zhou drangon @ 2006-11-03 9:16 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 9:16 UTC (permalink / raw) To: zhou drangon Cc: Eric Dumazet, linux-kernel, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, drangon.zhou On Fri, Nov 03, 2006 at 10:42:04AM +0800, zhou drangon (drangon.mail@gmail.com) wrote: > For the VFS system, when we introduced the AIO mechanism, we added aio_read, > aio_write, etc... to the file ops, and then made the read and write ops > call aio_read > and aio_write, so that only one implementation remains in the kernel. > Can we do the event mechanism the same way? > When kevent is robust enough, can we implement epoll/select/io_submit etc... > based on kevent? > In this way we could simplify the kernel, and epoll could gain > improvement from kevent. There is an AIO implementation on top of kevent; although it was confirmed that it has a good design, except for minor API layering changes, it was postponed for a while. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 2:12 ` Nate Diller 2006-11-02 6:21 ` Evgeniy Polyakov [not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com> @ 2006-11-07 12:02 ` Jeff Garzik 2 siblings, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-07 12:02 UTC (permalink / raw) To: Nate Diller Cc: Evgeniy Polyakov, LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Nate Diller wrote: > Indecisiveness has certainly been an issue here, but I remember akpm > and Ulrich both giving concrete suggestions. I was particularly > interested in Andrew's request to explain and justify the differences > between kevent and BSD's kqueue interface. Was there a discussion > that I missed? I am very interested in seeing your work on this > mechanism merged, because you've clearly emphasized performance and > shown impressive results. But it seems like we lose a lot by > throwing out all the applications that already use kqueue. kqueue looks pretty nice, the filter/note models in particular. I don't see anything about ring buffers, though. I also wonder about the asynchronous event side (send), not just the event reception side. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 18:57 ` Evgeniy Polyakov 2006-11-02 2:12 ` Nate Diller @ 2006-11-03 18:49 ` Oleg Verych 2006-11-04 10:24 ` Evgeniy Polyakov 2006-11-04 17:47 ` Evgeniy Polyakov 1 sibling, 2 replies; 200+ messages in thread From: Oleg Verych @ 2006-11-03 18:49 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Wed, Nov 01, 2006 at 09:57:46PM +0300, Evgeniy Polyakov wrote: > On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote: [] > > Where's a real-life application to do configure && make && make install? > > Your real life or mine as a developer? > I fortunately do not know anything about your real life, but my real-life To avoid shifting the conversation further in a non-technical direction, take my sentence as a question *and* as a definition. > applications can be found on the project's homepage. > There is a link to an archive there, where you can find plenty of sources. But not a single makefile. Or do CC and its options really not matter? You can easily find in your server's apache logs my visit to that archive on the day of my message (today i just confirmed my assertions): browser lynx, host flower.upol.cz. > You likely do not know, but it is a risky business to patch all > existing applications to show that an approach is correct while the > implementation is not complete. Fortunately for me, `lighttpd' is real-life *and* also in the benchmark area. Just see on that site how much was measured: different OSes, special tuning. *That* is what i'm talking about. The epoll _wrapper_ there is 3461 bytes long; your answer to _me_, 2580. People are bringing you a test bed with everything set up ready to use; if you need less code, go on, comment the needless parts out! > You likely do not know, but since I first announced kevents in > February I have changed the interfaces 4 times - and that is just the interfaces, not > counting the numerous features added/removed at developers' requests. I think that is called open source, in the linux kernel case. > > There were some comments about lacking much of such programs, answers were > > "was in prev. e-mail", "need to update them", something like that. > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > in the patch advertisement. If it was, should i actually try that new > > *trivial* wheel? > > The answer is trivial - there is an archive where one can find the source code > (filenames are posted regularly). Should I create an rpm? For which glibc > version? Hmm. Let me answer that "dup" with material from the LKML archive. It will reveal that my guesses had already been told to you by The Big Jury: [^0] Message-ID: 44CA66D8.3010404@oracle.com [^1] Message-ID: 20060818104120.GA20816@infradead.org, Message-ID: 20060816133014.GB32499@infradead.org more than 10 takes ago. > > Saying that, i want to give you some short examples, i know. > > *Linux kernel <-> userspace*: > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > iproute documentation was way too bad when Alexey presented it the first > time :) As an example: after reading some books on TCP/IP and Ethernet, the internal help of `ip' was all i needed to know. > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > Or move_pages(). You know who proposed that, and you know how many (few) releases ago. 
> > To make a little hint to you, Evgeniy, why don't you find a little > > animal in the open source zoo to implement a little interface to the > > proposed kernel subsystem and then show it to The Big Jury (not me), > > we have here? And i can not see, how you've managed to implement > > something like that having almost nothing on the test basket. > > Very *suspicious* ch. > > There are always people who do not like something - what can I do about I didn't think that my message was offensive. Also, i didn't even mention that you have not bothered to feed your code to "scripts/Lindent". [] > I created trivial web servers which send a single static page and use > various event handling schemes, and I test the new subsystem with new tools; > when the tests are completed and all requested features are implemented, it > will be time to work on different, more complex users. Please, see [^0], > So let's at least complete what we have right now, so that no developer's > effort is wasted writing empty chars in various places. and [^1]. [ Please do not answer just to answer; the cc list is big, no one from ] [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] Friendly, Oleg. ____ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 18:49 ` Oleg Verych @ 2006-11-04 10:24 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-04 10:24 UTC (permalink / raw) To: Oleg Verych Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych (olecom@flower.upol.cz) wrote: > > applications can be found on the project's homepage. > > There is a link to an archive there, where you can find plenty of sources. > > But not a single makefile. Or do CC and its options really not matter? > You can easily find in your server's apache logs my visit to that > archive on the day of my message (today i just confirmed my assertions): > browser lynx, host flower.upol.cz. If you can not compile those sources, then you should not use kevent for a while. Definitely. The options are pretty simple: -W -Wall -I$(path_to_kernel_tree)/include > > You likely do not know, but it is a risky business to patch all > > existing applications to show that an approach is correct while the > > implementation is not complete. > > Fortunately for me, `lighttpd' is real-life *and* also in the benchmark > area. Just see on that site how much was measured: different OSes, > special tuning. *That* is what i'm talking about. The epoll _wrapper_ there > is 3461 bytes long; your answer to _me_, 2580. People are bringing you a > test bed with everything set up ready to use; if you need less code, go on, > comment the needless parts out! So what? People bring me tons of various stuff, and I prefer to use my own for tests. If _you_ need it, _you_ can always patch any sources you like. > > You likely do not know, but since I first announced kevents in > > February I have changed the interfaces 4 times - and that is just the > > interfaces, not counting the numerous features added/removed at > > developers' requests. > > I think that is called open source, in the linux kernel case. You missed the point - I'm not going to patch tons of existing applications when I'm asked to change an interface once per month. When all requested features are implemented, I will definitely patch some popular web server to show how kevent is used. > > > There were some comments about lacking much of such programs, answers were > > > "was in prev. e-mail", "need to update them", something like that. > > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > > in the patch advertisement. If it was, should i actually try that new > > > *trivial* wheel? > > > > The answer is trivial - there is an archive where one can find the source code > > (filenames are posted regularly). Should I create an rpm? For which glibc > > version? > > Hmm. Let me answer that "dup" with material from the LKML archive. It > will reveal that my guesses had already been told to you by The Big Jury: > > [^0] Message-ID: 44CA66D8.3010404@oracle.com > [^1] Message-ID: 20060818104120.GA20816@infradead.org, > Message-ID: 20060816133014.GB32499@infradead.org > > more than 10 takes ago. And? Please provide a link to the archive. > > > Saying that, i want to give you some short examples, i know. > > > *Linux kernel <-> userspace*: > > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > > > iproute documentation was way too bad when Alexey presented it the first > > time :) > > As an example: after reading some books on TCP/IP and Ethernet, the internal > help of `ip' was all i needed to know. :)) i.e. 
it is ok for you to 'read some books on TCP/IP and Ethernet' to understand how a utility works, but it is not ok to work out how to compile my sources? Do not compile my sources. > > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > > Or move_pages(). > > You know who proposed that, and you know how many (few) releases ago. And why does lighttpd still not use it? You should start by blaming the authors of splice() for that. You will not? Then I can not take the words you direct at me seriously. > > > To make a little hint to you, Evgeniy, why don't you find a little > > > animal in the open source zoo to implement a little interface to the > > > proposed kernel subsystem and then show it to The Big Jury (not me), > > > we have here? And i can not see, how you've managed to implement > > > something like that having almost nothing on the test basket. > > > Very *suspicious* ch. > > > > There are always people who do not like something - what can I do about > > I didn't think that my message was offensive. Also, i didn't even mention > that you have not bothered to feed your code to "scripts/Lindent". You do not use kevent, so why do you care about the indentation of the userspace tools? > [] > > I created trivial web servers which send a single static page and use > > various event handling schemes, and I test the new subsystem with new tools; > > when the tests are completed and all requested features are implemented, it > > will be time to work on different, more complex users. > > Please, see [^0], > > > So let's at least complete what we have right now, so that no developer's > > effort is wasted writing empty chars in various places. > > and [^1]. > > [ Please do not answer just to answer; the cc list is big, no one from ] > [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] This thread is just answering for the sake of answering - there is no sense in it at all. You blame me for not creating the benchmarks you like, but I do not care about that. I created a useful patch and I test it in the way I like, because that is much more productive than spending a lot of time determining how different sources behave under appropriate loads. When there is a strong requirement to perform additional tests, I will do them. > Friendly, Oleg. > ____ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 18:49 ` Oleg Verych 2006-11-04 10:24 ` Evgeniy Polyakov @ 2006-11-04 17:47 ` Evgeniy Polyakov 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-04 17:47 UTC (permalink / raw) To: Oleg Verych Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych (olecom@flower.upol.cz) wrote: > [ Please do not answer just to answer; the cc list is big, no one from ] > [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] > > Friendly, Oleg. Just in case some misunderstanding happened: I do not want to insult anyone who is against kevent; I just do not understand cases when people demand, in a rude manner, that I do something to convince them. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 2006-11-01 13:25 ` Evgeniy Polyakov @ 2006-11-01 16:07 ` James Morris 1 sibling, 0 replies; 200+ messages in thread From: James Morris @ 2006-11-01 16:07 UTC (permalink / raw) To: Pavel Machek Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Wed, 1 Nov 2006, Pavel Machek wrote: > Hi! > > > Generic event handling mechanism. > > > > Consider for inclusion. > > > > Changes from 'take21' patchset: > > We are not interested in how many times you spammed us, nor do we want > to know what was wrong in previous versions. It would be nice to have a > short summary of what this is good for, instead. I'm interested in knowing which version the patches belong to and what has changed (geez, it's rare enough that someone actually bothers to do this with an updated patchset - and you complain about it?). - James -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take23 0/5] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-11-01 11:36 ` [take22 " Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov ` (2 subsequent siblings) 5 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem for handling event notifications. It supports both level- and edge-triggered events. It is similar to poll/epoll in some cases, but it is more scalable, it is faster, and it can work with essentially any kind of event. Events are fed into the kernel through a control syscall and can be read back through an mmapped ring or a syscall. Kevent updates (i.e. readiness switching) happen directly from the internals of the appropriate state machine of the underlying subsystem (network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent Consider for inclusion. Changes from 'take22' patchset: * new ring buffer implementation in process' memory * wakeup-one-thread flag * edge-triggered behaviour With this release an additional independent benchmark shows kevent's speed compared to epoll: Eric Dumazet created a special benchmark which creates a set of AF_INET sockets; two threads then start to simultaneously read and write data from/into them. Here are the results: epoll (no EPOLLET): 57428 events/sec kevent (no ET): 59794 events/sec epoll (with EPOLLET): 71000 events/sec kevent (with ET): 78265 events/sec Maximum (busy loop reading events): 88482 events/sec Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespace and so on) * fixed a bug in kevent removal for the case when the kevent being removed is the same as the overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed the artificial limit on the possible number of kevents With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit lan; data IO over the network was about 10582.7 KB/s, which is quite close to wire speed if we take into account headers and the like. Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use an RB tree instead of a hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. 
most of the time events are added, accessed only a couple of times and then removed, which justifies RB tree usage over an AVL tree, since the latter has a much slower deletion time (max O(log(N)) compared to 3 ops), although a faster search time (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed the readiness check for socket notifications. With both of the above changes it is possible to achieve more than 3380 req/second, compared to 2200, sometimes 2500 req/second for epoll(), for a trivial web server and an httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum number of kevents allowed in a time limit, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created a special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either the timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage. 
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not take the lock around the user data check in __kevent_search() * fail early if there were no registered callbacks for the given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use a pointer to kevent_mring instead of unsigned longs * use an aligned 64bit type in raw user data (can be used by the high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for the timeout * put the number of milliseconds into the timer's return data * move some definitions into the user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into the patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for the ready_callback() callback before the main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added a helper function for kevent_registered_callbacks * fixed 80-line comment issues * added a header shared between userspace and kernelspace instead of embedding them in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use the nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use a structure instead - reduced the amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments! 
* do not use list poisoning to detect whether an entry is in the list * return the number of ready kevents even if copy*user() fails * strict check for the number of kevents in the syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of a direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed the default 1 second timeout * removed AIO stuff from the patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed the serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught to differentiate between the various cases * remove a kevent from storage if it is marked as broken after the callback * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() into locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use an array of callbacks for each type instead of per-kevent callback initialization * changed the name of the ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use the kevent_user_ctl structure, instead provide the needed arguments as syscall parameters * various indent cleanups * added an optimisation aimed to help when a lot of kevents are being copied from userspace * mapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against the 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use the old syscall kevent_ctl for creation/removal, modification and initial kevent initialization - use mutexes instead of semaphores - added a file descriptor check and return an error if the provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take23 1/5] kevent: Description. 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Description. int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg); fd - the file descriptor referring to the kevent queue to manipulate. It is created by opening the "/dev/kevent" char device, which is created with a dynamic minor number and the major number assigned to misc devices. cmd - the requested operation. It can be one of the following: KEVENT_CTL_ADD - add event notification KEVENT_CTL_REMOVE - remove event notification KEVENT_CTL_MODIFY - modify existing notification num - number of struct ukevent in the array pointed to by arg arg - array of struct ukevent When called, kevent_ctl will carry out the operation specified in the cmd parameter. ------------------------------------------------------------------------------------- int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, __u64 timeout, struct ukevent *buf, unsigned flags) ctl_fd - file descriptor referring to the kevent queue min_nr - minimum number of completed events that kevent_get_events will block waiting for max_nr - number of struct ukevent in buf timeout - number of nanoseconds to wait before returning less than min_nr events. If this is -1, then wait forever. buf - pointer to an array of struct ukevent. flags - unused kevent_get_events will wait up to timeout nanoseconds for at least min_nr completed events, copying completed struct ukevents to buf and deleting any KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many events as possible, but not more than max_nr. In blocking mode it waits until the timeout expires or at least min_nr events are ready. ------------------------------------------------------------------------------------- int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) ctl_fd - file descriptor referring to the kevent queue num - number of processed kevents timeout - number of nanoseconds to wait until there is free space in the kevent queue This syscall waits until either the timeout expires or at least one event becomes ready. It also copies those num events into a special ring buffer and requeues or removes them (depending on their flags). ------------------------------------------------------------------------------------- int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num) ctl_fd - file descriptor referring to the kevent queue num - size of the ring buffer in events struct kevent_ring { unsigned int ring_kidx; struct ukevent event[0]; } ring_kidx - the index in the ring buffer where the kernel will put new events when kevent_wait() or kevent_get_events() is called Example userspace code (ring_buffer.c) can be found on the project's homepage. Each kevent syscall can be a so-called cancellation point in glibc, i.e. 
when a thread has been cancelled in a kevent syscall, the thread can be safely removed and no events will be lost, since each syscall (kevent_wait() or kevent_get_events()) copies events into the special ring buffer, which is accessible from other threads or even processes (if shared memory is used). When a kevent is removed (not dequeued when it is ready, but just removed), it is not copied into the ring buffer even if it was ready: if it is removed, no one cares about it (otherwise the user would have waited until it became ready and fetched it the usual way through kevent_get_events() or kevent_wait()), so there is no need to copy it to the ring buffer. With a userspace ring buffer it is possible for events in the ring buffer to be replaced without the knowledge of the thread currently reading them (when another thread calls kevent_get_events() or kevent_wait()), so appropriate locking is required between threads or processes which can simultaneously access the same ring buffer. ------------------------------------------------------------------------------------- The bulk of the interface is entirely done through the ukevent struct. It is used to add event requests, modify existing event requests, specify which event requests to remove, and return completed events. struct ukevent contains the following members: struct kevent_id id Id of this request, e.g. socket number, file descriptor and so on __u32 type Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on __u32 event Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED __u32 req_flags Per-event request flags: KEVENT_REQ_ONESHOT The event will be removed when it is ready. KEVENT_REQ_WAKEUP_ONE When several threads wait on the same kevent queue and have requested the same event, for example 'wake me up when a new client has connected, so I can call accept()', then all threads will be awakened when a new client has connected, but only one of them can process the data. This problem is known as the thundering herd problem. Events which have this flag set will not be marked as ready (and the appropriate threads will not be awakened) if at least one event has already been marked. KEVENT_REQ_ET Edge-triggered behaviour. It is an optimisation which allows a ready and dequeued (i.e. copied to userspace) event to be moved back into the set of interest for the given storage (socket, inode and so on). It is very useful for cases when the same event should be used many times (like reading from a pipe). It is similar to epoll()'s EPOLLET flag. __u32 ret_flags Per-event return flags: KEVENT_RET_BROKEN Kevent is broken. KEVENT_RET_DONE Kevent processing was finished successfully. KEVENT_RET_COPY_FAILED Kevent was not copied into the ring buffer due to some error condition. __u32 ret_data Event return data. The event originator fills it with anything it likes (for example, timer notifications put the number of milliseconds when the timer fired). union { __u32 user[2]; void *ptr; } User's data. It is not used by the kernel, just copied to/from user. The whole structure is aligned to 8 bytes already, so the last union is aligned properly. --------------------------------------------------------------------------------- Usage For KEVENT_CTL_ADD, all fields relevant to the event type must be filled (id, type, possibly event, req_flags). After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags should be checked to see if the event is already broken or done. For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be set and an existing kevent request must have matching id and user fields.
If a match is found, req_flags and event are replaced with the newly supplied values and requeueing is started, so the modified kevent can be checked and possibly marked as ready immediately. If a match can't be found, the passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set. For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing kevent request must have matching id and user fields. If a match is found, the kevent request is removed. If a match can't be found, the passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set. For kevent_get_events, the entire structure is returned. --------------------------------------------------------------------------------- Usage cases kevent_timer struct ukevent should contain the following fields: type - KEVENT_TIMER event - KEVENT_TIMER_FIRED req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once id.raw[0] - number of seconds after commit when this timer should expire id.raw[1] - number of nanoseconds in addition to the number of seconds ^ permalink raw reply [flat|nested] 200+ messages in thread
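To make the calling convention above concrete, here is a minimal, untested sketch of the timer usage case. It assumes the <linux/ukevent.h> header from this patchset is installed and hard-codes the x86_64 syscall numbers the patch assigns (280 for kevent_get_events, 281 for kevent_ctl); there are no glibc wrappers at this point, so raw syscall() is used.

/*
 * Sketch only: arm a one-shot 2-second timer and wait for it.
 * Syscall numbers below are the x86_64 assignments from this patchset.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

#define __NR_kevent_get_events 280
#define __NR_kevent_ctl        281

int main(void)
{
	struct ukevent uk;
	long err;
	int fd;

	fd = open("/dev/kevent", O_RDWR);
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* removed automatically once it fires */
	uk.id.raw[0] = 2;			/* seconds until expiration */
	uk.id.raw[1] = 0;			/* additional nanoseconds */

	/* Returns 0 when queued; immediately ready/broken events are copied back. */
	err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
	if (err != 0)
		return 1;

	/* Block for at least one event, at most 3 seconds (timeout is in ns). */
	err = syscall(__NR_kevent_get_events, fd, 1, 1, 3000000000ULL, &uk, 0);
	if (err > 0 && !(uk.ret_flags & KEVENT_RET_BROKEN))
		printf("timer fired, ret_data[0]=%u\n", uk.ret_data[0]);

	close(fd);
	return 0;
}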
* [take23 2/5] kevent: Core files. 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..fa8075b 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,7 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_ring_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..95fb252 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,12 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_ring_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..2161ef2 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,14 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_ring_init 322 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 323 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..3669c0f 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,18 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_ring_init 283 +__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ring_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..781ffa8 --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,201 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, ring_size; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..471a685 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..ee881c9 --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,153 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Wake up only when event exclusively belongs to this thread, + * for example when several threads are waiting for new client + * connection so they could perform accept() it is a good idea + * to set this flag, so only one thread of all with this flag set + * will be awakened. + * If there are events without this flags, appropriate threads will + * be awakened too. */ +#define KEVENT_REQ_WAKEUP_ONE 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 + +/* + * Kevent return flags. + */ +/* Kevent is broken. */ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 +/* Kevent was not copied into ring buffer due to some error conditions. */ +#define KEVENT_RET_COPY_FAILED 0x4 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. 
+ */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. */ + +struct kevent_id +{ + union { + __u32 raw[2]; + __u64 raw_u64 __attribute__((aligned(8))); + }; +}; + +struct ukevent +{ + /* Id of this request, e.g. socket number, file descriptor and so on... */ + struct kevent_id id; + /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 type; + /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */ + __u32 event; + /* Per-event request flags */ + __u32 req_flags; + /* Per-event return flags */ + __u32 ret_flags; + /* Event return data. Event originator fills it with anything it likes. */ + __u32 ret_data[2]; + /* User's data. It is not used, just copied to/from user. + * The whole structure is aligned to 8 bytes already, so the last union + * is aligned properly. + */ + union { + __u32 user[2]; + void *ptr; + }; +}; + +struct kevent_ring +{ + unsigned int ring_kidx; + struct ukevent event[0]; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 + +#endif /* __UKEVENT_H */ diff --git a/init/Kconfig b/init/Kconfig index d2eb7a8..c7d8250 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -201,6 +201,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. +source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index d62ec66..2d7a6dd 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..5ba8086 --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,39 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback + invocations, advanced timer notifications and other kernel + object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents + which are ready immediately at insertion time and number of kevents + which were removed through readiness completion. + It will be printed each time control kevent descriptor is closed. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. 
+ +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..9130cad --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,4 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..24ee44a --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,232 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. 
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. + * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. 
+ */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (ready_callback) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..5ebfa6d --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,913 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static const char kevent_name[] = "kevent"; +static kmem_cache_t *kevent_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success, + * -EAGAIN if there were no place for that kevent (impossible) + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. 
+ */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + if (++u->kidx >= u->ring_size) + u->kidx = 0; + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return -ENOMEM; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->ring_size = 0; + u->pring = NULL; + + file->private_data = u; + return 0; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. + * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kfree(u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. 
+ */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + if (k) + kevent_copy_ring_buffer(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. + */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. 
+ */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. 
+ */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. + */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + break; + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct miscdevice kevent_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = kevent_name, + .fops = &kevent_user_fops, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - timeout in nanoseconds to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - ununsed for now (will be used for mmap implementation). 
+ */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + __u64 timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf); +out_fput: + fput(file); + return err; +} + +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + if (u->pring) { + err = -EINVAL; + goto err_out_exit; + } + u->pring = ring; + u->ring_size = num; + mutex_unlock(&u->ring_lock); + + fput(file); + + return 0; + +err_out_exit: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in kevent queue + * and removes/requeues requested number of events (commits them). Function returns + * number of actually committed events. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @timeout - this timeout specifies number of nanoseconds to wait until there is + * free space in kevent queue. + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. + * Kevents will be copied into ring buffer in order they were placed into ready queue. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) +{ + int err = -EINVAL, committed = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num >= u->ring_size) + goto out_fput; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= 1, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + for (i=0; i<num; ++i) { + k = kqueue_dequeue_ready(u); + if (!k) + break; + kevent_complete_ready(k); + kevent_stat_ring(u); + committed++; + } + + fput(file); + + return committed; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create kevent cache and register + * filesystem to get control file descriptors from. 
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..5200583 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_ring_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
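For reference, here is a minimal, untested sketch of the userspace side of the ring buffer handled by kevent_copy_ring_buffer() and sys_kevent_wait() above (the authoritative example is ring_buffer.c on the project's homepage). It assumes the x86_64 syscall numbers from this patchset (282 for kevent_wait, 283 for kevent_ring_init). The reader keeps its own index, which wraps exactly like the kernel's kidx; concurrent readers would additionally need the locking discussed in the description patch.

#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

#define __NR_kevent_wait      282
#define __NR_kevent_ring_init 283
#define RING_SIZE             256

static struct kevent_ring *ring;
static unsigned int uidx;	/* next slot this reader will consume */

int ring_setup(int ctl_fd)
{
	ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
	if (!ring)
		return -1;
	return syscall(__NR_kevent_ring_init, ctl_fd, ring, RING_SIZE);
}

/* Commit up to @num ready events; @num must stay below RING_SIZE,
 * since sys_kevent_wait() rejects num >= ring_size. */
int ring_consume(int ctl_fd, unsigned int num, __u64 timeout_ns)
{
	long n;
	unsigned int i;

	n = syscall(__NR_kevent_wait, ctl_fd, num, timeout_ns);
	for (i = 0; i < n; ++i) {
		struct ukevent *uk = &ring->event[uidx];

		/* ... handle *uk here ... */
		if (++uidx >= RING_SIZE)
			uidx = 0;	/* wrap like the kernel's kidx */
	}
	return n;
}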
* [take23 3/5] kevent: poll/select() notifications. 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works similarly to epoll and has the same issues (the callback is invoked not from the internal state machine of the caller, but through process wakeup, a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..f81299f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..94facbb --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,222 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
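For orientation, this is roughly how a poll-style kevent would be registered from userspace, following the kevent_ctl() interface documented later in this thread; the wrapper declaration and the header name are assumptions (glibc provides no wrappers for these syscalls):

/* Sketch under assumptions: kevent_ctl() wrapper and header name are
 * hypothetical; KEVENT_POLL, KEVENT_CTL_ADD and struct ukevent come
 * from the patchset itself. */
#include <poll.h>
#include <string.h>
#include <linux/ukevent.h>	/* assumed name of the shared header */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);	/* assumed syscall wrapper */

static int add_poll_kevent(int kevent_fd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = watched_fd;	/* kevent_poll_enqueue() does fget(id.raw[0]) */
	uk.type = KEVENT_POLL;
	uk.event = POLLIN;		/* mask matched against file->f_op->poll() */

	return kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}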
* [take23 4/5] kevent: Socket notifications. 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 5/5] kevent: Timer notifications Evgeniy Polyakov 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Socket notifications. This patch includes socket send/recv/accept notifications. Using a trivial web server based on kevent with these features instead of epoll, its performance increased more than noticeably. More details about various benchmarks and the server itself (evserver_kevent.c) can be found on the project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..ff1b129 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..7f74110 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,135 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
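As a rough illustration of how a server would consume these notifications, a hypothetical accept loop over KEVENT_SOCKET_ACCEPT; the syscall wrappers and header name are assumptions (there are no glibc wrappers), and it is assumed the returned ukevent carries the id it was submitted with:

/* Hypothetical accept loop; wrappers, header and handler are assumed. */
#include <string.h>
#include <sys/socket.h>
#include <linux/ukevent.h>	/* assumed */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);			/* assumed */
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, unsigned long long timeout,
			     struct ukevent *buf, unsigned flags);	/* assumed */
extern void handle_client(int fd);				/* placeholder */

static void accept_loop(int kevent_fd, int listen_fd)
{
	struct ukevent uk, ready[16];
	int i, n;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = listen_fd;	/* kevent_socket_enqueue() does sockfd_lookup() */
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_ACCEPT;
	kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);

	for (;;) {
		/* -1 timeout: wait forever, per the take24 documentation */
		n = kevent_get_events(kevent_fd, 1, 16, ~0ULL, ready, 0);
		for (i = 0; i < n; ++i) {
			int client = accept(ready[i].id.raw[0], NULL, NULL);
			if (client >= 0)
				handle_client(client);
		}
	}
}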
* [take23 5/5] kevent: Timer notifications. 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + struct kevent_timer *t = container_of(st, struct 
kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
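Going by the id.raw[] convention stated above (seconds in raw[0], nanoseconds in raw[1]) and the hrtimer_forward()/HRTIMER_RESTART rearm in kevent_timer_func(), arming a periodic 250 ms timer from userspace would look roughly like this (the kevent_ctl() wrapper and header are the same assumptions as in the earlier sketches):

/* Sketch: periodic 250 ms KEVENT_TIMER event. The kernel side rearms
 * the hrtimer in kevent_timer_func(), so the event keeps firing; on
 * delivery, ret_data[0] carries jiffies_to_msecs(jiffies). */
#include <string.h>
#include <linux/ukevent.h>	/* assumed */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);	/* assumed wrapper */

static int add_periodic_timer(int kevent_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = 0;			/* seconds */
	uk.id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds: 250 ms period */
	uk.type = KEVENT_TIMER;

	return kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}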
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-07 22:53 ` Davide Libenzi 2006-11-08 8:45 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-07 22:53 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Tue, 7 Nov 2006, Evgeniy Polyakov wrote: > +static int kevent_poll_wait_callback(wait_queue_t *wait, > + unsigned mode, int sync, void *key) > +{ > + struct kevent_poll_wait_container *cont = > + container_of(wait, struct kevent_poll_wait_container, wait); > + struct kevent *k = cont->k; > + struct file *file = k->st->origin; > + u32 revents; > + > + revents = file->f_op->poll(file, NULL); > + > + kevent_storage_ready(k->st, NULL, revents); > + > + return 0; > +} Are you sure you can safely call file->f_op->poll() from inside a callback based wakeup? The low level driver may be calling the wakeup with one of its locks held, and during the file->f_op->poll may be trying to acquire the same lock. I remember there was a discussion about this, and assuming the above is not true made the epoll code more complex (and slower, since an extra O(R) loop was needed to fetch events). - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
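The lock recursion Davide describes, sketched with a hypothetical driver (illustration only, not code from the patchset): the wakeup runs wait-queue callbacks synchronously under the driver's lock, and ->poll() then tries to take the same lock.

/* Hypothetical driver, for illustration of the potential deadlock. */
#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct drv {
	spinlock_t lock;
	wait_queue_head_t wq;
	int rx_ready;
};

static void drv_rx_complete(struct drv *drv)
{
	spin_lock(&drv->lock);
	drv->rx_ready = 1;
	/* runs the wait queue callbacks synchronously, including
	 * kevent_poll_wait_callback(), which calls ->poll() */
	wake_up(&drv->wq);
	spin_unlock(&drv->lock);
}

static unsigned int drv_poll(struct file *file, struct poll_table_struct *wait)
{
	struct drv *drv = file->private_data;
	unsigned int mask = 0;

	spin_lock(&drv->lock);	/* deadlock: already held by drv_rx_complete() */
	if (drv->rx_ready)
		mask |= POLLIN | POLLRDNORM;
	spin_unlock(&drv->lock);
	return mask;
}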
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi @ 2006-11-08 8:45 ` Evgeniy Polyakov 2006-11-08 17:03 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:45 UTC (permalink / raw) To: Davide Libenzi Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Tue, Nov 07, 2006 at 02:53:33PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote: > On Tue, 7 Nov 2006, Evgeniy Polyakov wrote: > > > +static int kevent_poll_wait_callback(wait_queue_t *wait, > > + unsigned mode, int sync, void *key) > > +{ > > + struct kevent_poll_wait_container *cont = > > + container_of(wait, struct kevent_poll_wait_container, wait); > > + struct kevent *k = cont->k; > > + struct file *file = k->st->origin; > > + u32 revents; > > + > > + revents = file->f_op->poll(file, NULL); > > + > > + kevent_storage_ready(k->st, NULL, revents); > > + > > + return 0; > > +} > > Are you sure you can safely call file->f_op->poll() from inside a callback > based wakeup? The low level driver may be calling the wakeup with one of > its locks held, and during the file->f_op->poll may be trying to acquire > the same lock. I remember there was a discussion about this, and assuming > the above is not true made the epoll code more complex (and slower, since an > extra O(R) loop was needed to fetch events). Indeed, I have not paid too much attention to poll/select notifications in kevent actually. As far as I recall it should be called on behalf of the process doing kevent_get_event(). I will check and fix if that is not correct. Thanks Davide. > - Davide > -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-08 8:45 ` Evgeniy Polyakov @ 2006-11-08 17:03 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 17:03 UTC (permalink / raw) To: Davide Libenzi Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Wed, Nov 08, 2006 at 11:45:54AM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Are you sure you can safely call file->f_op->poll() from inside a callback > > based wakeup? The low level driver may be calling the wakeup with one of > > its locks held, and during the file->f_op->poll may be trying to acquire > > the same lock. I remember there was a discussion about this, and assuming > > the above is not true made the epoll code more complex (and slower, since an > > extra O(R) loop was needed to fetch events). > > Indeed, I have not paid too much attention to poll/select notifications in > kevent actually. As far as I recall it should be called on behalf of the process > doing kevent_get_event(). I will check and fix if that is not correct. > Thanks Davide. Indeed there was a bug. Actually the poll/select patch was broken quite noticeably - the patchset did not include major changes I made for it. I will put them all into the next release. Thanks again Davide for pointing that out. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 2/5] kevent: Core files. 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-07 22:16 ` Andrew Morton 2006-11-08 8:24 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:16 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > This patch includes core kevent files: > * userspace controlling > * kernelspace interfaces > * initialization > * notification state machines I fixed up all the rejects, but your syscall numbers changed. Please always raise patches against the latest kernel. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 2/5] kevent: Core files. 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton @ 2006-11-08 8:24 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:24 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:16:57PM -0800, Andrew Morton (akpm@osdl.org) wrote: > On Tue, 7 Nov 2006 19:50:48 +0300 > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > > > This patch includes core kevent files: > > * userspace controlling > > * kernelspace interfaces > > * initialization > > * notification state machines > > I fixed up all the rejects, but your syscall numbers changed. Please > always raise patches against the latest kernel. Will do. Numbers actually are the same, but I added a new syscall which was against the old tree. Thanks Andrew. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 1/5] kevent: Description. 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov @ 2006-11-07 22:16 ` Andrew Morton 2006-11-08 8:23 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:16 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > Description. I converted this into Documentation/kevent.txt. It looks like crap in an 80-col xterm btw. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 1/5] kevent: Description. 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton @ 2006-11-08 8:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:23 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:16:40PM -0800, Andrew Morton (akpm@osdl.org) wrote: > On Tue, 7 Nov 2006 19:50:48 +0300 > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > > > Description. > > I converted this into Documentation/kevent.txt. It looks like crap in an 80-col > xterm btw. Thanks. It was copied as is from the documentation page, so it does look like crap in a non-browser window. I'm quite sure there will be some questions about kevent, so I will update that file and fix the indentation. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov @ 2006-11-07 22:17 ` Andrew Morton 2006-11-08 8:21 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:17 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > Generic event handling mechanism. I updated the version in -mm to v23. So people can play with it and review it. It looks like a bit of work will be needed to get it to compile. It seems that most of the fixes which were added to the previous version were merged or are now irrelevant, however you lost this change: From: Andrew Morton <akpm@osdl.org> If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it will return 0, which is indistinguishable from "no events pending". It can and should return EFAULT in this case. Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> --- kernel/kevent/kevent_user.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletion(-) diff -puN kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix kernel/kevent/kevent_user.c --- a/kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix +++ a/kernel/kevent/kevent_user.c @@ -690,8 +690,11 @@ static int kevent_user_wait(struct file while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { if (copy_to_user(buf + num*sizeof(struct ukevent), - &k->event, sizeof(struct ukevent))) + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; break; + } kevent_complete_ready(k); ++num; kevent_stat_wait(u); _ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton @ 2006-11-08 8:21 ` Evgeniy Polyakov 2006-11-08 14:51 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:21 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:17:18PM -0800, Andrew Morton (akpm@osdl.org) wrote: > From: Andrew Morton <akpm@osdl.org> > > If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it > will return 0, which is indistinguishable from "no events pending". > > It can and should return EFAULT in this case. Correct, I missed that. Thanks Andrew, I will put it into my tree; -mm seems to have it already. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 8:21 ` Evgeniy Polyakov @ 2006-11-08 14:51 ` Eric Dumazet 2006-11-08 22:03 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-08 14:51 UTC (permalink / raw) To: Andrew Morton Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wednesday 08 November 2006 09:21, Evgeniy Polyakov wrote: > On Tue, Nov 07, 2006 at 02:17:18PM -0800, Andrew Morton (akpm@osdl.org) wrote: > > From: Andrew Morton <akpm@osdl.org> > > > > If kevent_user_wait() gets -EFAULT on the attempt to copy the first > > event, it will return 0, which is indistinguishable from "no events > > pending". > > > > It can and should return EFAULT in this case. > > Correct, I missed that. > Thanks Andrew, I will put it into my tree; -mm seems to have it already. I believe eventpoll has a similar problem. Not a big problem, but we can be cleaner. Normally, the access_ok() done in sys_epoll_wait() should catch a non-writeable user area, unless another thread plays VM games (the thread in sys_epoll_wait() can sleep). [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should report the count of events that were successfully copied into user space, instead of EFAULT. That would be consistent with the behavior of read/write() syscalls for example. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 @@ -1447,7 +1447,7 @@ &events[eventcnt].events) || __put_user(epi->event.data, &events[eventcnt].data)) - return -EFAULT; + return eventcnt ? eventcnt : -EFAULT; if (epi->event.events & EPOLLONESHOT) epi->event.events &= EP_PRIVATE_BITS; eventcnt++; ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 14:51 ` Eric Dumazet @ 2006-11-08 22:03 ` Andrew Morton 2006-11-08 22:44 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-08 22:03 UTC (permalink / raw) To: Eric Dumazet Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Davide Libenzi On Wed, 8 Nov 2006 15:51:13 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should > report the count of events that were successfully copied into user space, > instead of EFAULT. That would be consistent with the behavior of read/write() > syscalls for example. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > > [eventpoll.patch text/plain (424B)] > --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 > +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 > @@ -1447,7 +1447,7 @@ > &events[eventcnt].events) || > __put_user(epi->event.data, > &events[eventcnt].data)) > - return -EFAULT; > + return eventcnt ? eventcnt : -EFAULT; > if (epi->event.events & EPOLLONESHOT) > epi->event.events &= EP_PRIVATE_BITS; > eventcnt++; > Definitely a better interface, but I wonder if it's too late to change it. An app which does if (epoll_wait(...) == -1) barf(errno); else assume_all_events_were_received(); will now do the wrong thing. otoh, such an application basically _has_ to use the epoll_wait() return value to work out how many events it received, so maybe it's OK... ^ permalink raw reply [flat|nested] 200+ messages in thread
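For clarity, the count-based idiom Andrew refers to in his last paragraph, which stays correct whether or not a partial count can be returned (barf() and handle() are placeholders, not real API):

/* Return-value-as-count idiom: correct with or without Eric's change. */
#include <errno.h>
#include <sys/epoll.h>

extern void barf(int err);			/* placeholder */
extern void handle(struct epoll_event *ev);	/* placeholder */

void drain(int epfd, struct epoll_event *events, int maxevents)
{
	int i, n = epoll_wait(epfd, events, maxevents, -1);

	if (n < 0) {
		if (errno != EINTR)
			barf(errno);
		return;		/* interrupted: caller retries */
	}
	for (i = 0; i < n; ++i)
		handle(&events[i]);	/* exactly the n delivered events */
}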
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 22:03 ` Andrew Morton @ 2006-11-08 22:44 ` Davide Libenzi 2006-11-08 23:07 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-08 22:44 UTC (permalink / raw) To: Andrew Morton Cc: Eric Dumazet, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Wed, 8 Nov 2006, Andrew Morton wrote: > On Wed, 8 Nov 2006 15:51:13 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: > > > [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should > > report the count of events that were successfully copied into user space, > > instead of EFAULT. That would be consistent with the behavior of read/write() > > syscalls for example. > > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > > > > > > [eventpoll.patch text/plain (424B)] > > --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 > > +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 > > @@ -1447,7 +1447,7 @@ > > &events[eventcnt].events) || > > __put_user(epi->event.data, > > &events[eventcnt].data)) > > - return -EFAULT; > > + return eventcnt ? eventcnt : -EFAULT; > > if (epi->event.events & EPOLLONESHOT) > > epi->event.events &= EP_PRIVATE_BITS; > > eventcnt++; > > > > Definitely a better interface, but I wonder if it's too late to change it. > > An app which does > > if (epoll_wait(...) == -1) > barf(errno); > else > assume_all_events_were_received(); > > will now do the wrong thing. > > otoh, such an application basically _has_ to use the epoll_wait() > return value to work out how many events it received, so maybe it's OK... I don't care either way, but sys_poll() does the same thing epoll does right now, so I would not change epoll behaviour. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 22:44 ` Davide Libenzi @ 2006-11-08 23:07 ` Eric Dumazet 2006-11-08 23:56 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-08 23:07 UTC (permalink / raw) To: Davide Libenzi Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Davide Libenzi wrote: > On Wed, 8 Nov 2006, Andrew Morton wrote: > >> On Wed, 8 Nov 2006 15:51:13 +0100 >> Eric Dumazet <dada1@cosmosbay.com> wrote: >> >>> [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should >>> report the count of events that were successfully copied into user space, >>> instead of EFAULT. That would be consistent with the behavior of read/write() >>> syscalls for example. >>> >>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>> >>> >>> >>> [eventpoll.patch text/plain (424B)] >>> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 >>> +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 >>> @@ -1447,7 +1447,7 @@ >>> &events[eventcnt].events) || >>> __put_user(epi->event.data, >>> &events[eventcnt].data)) >>> - return -EFAULT; >>> + return eventcnt ? eventcnt : -EFAULT; >>> if (epi->event.events & EPOLLONESHOT) >>> epi->event.events &= EP_PRIVATE_BITS; >>> eventcnt++; >>> >> Definitely a better interface, but I wonder if it's too late to change it. >> >> An app which does >> >> if (epoll_wait(...) == -1) >> barf(errno); >> else >> assume_all_events_were_received(); >> >> will now do the wrong thing. >> >> otoh, such an application basically _has_ to use the epoll_wait() >> return value to work out how many events it received, so maybe it's OK... > > I don't care either way, but sys_poll() does the same thing epoll > does right now, so I would not change epoll behaviour. > Sure, poll() cannot return a partial count, since its return value is: On success, a positive number is returned, where the number returned is the number of structures which have non-zero revents fields (in other words, those descriptors with events or errors reported). poll() is non-destructive (it doesn't change any state in the kernel). Returning EFAULT in case of an error in the very last bit of the user area is mandatory. On the contrary: epoll_wait() does return a count of transferred events, and updates some state in the kernel (it consumes Edge Triggered events: they can be lost forever if not reported to the user). So epoll_wait() is much more like read(), which also updates file state in the kernel (the current file position). ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 23:07 ` Eric Dumazet @ 2006-11-08 23:56 ` Davide Libenzi 2006-11-09 7:24 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-08 23:56 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Thu, 9 Nov 2006, Eric Dumazet wrote: > Davide Libenzi wrote: > > > > I don't care either way, but sys_poll() does the same thing epoll does > > right now, so I would not change epoll behaviour. > > > > Sure, poll() cannot return a partial count, since its return value is: > > On success, a positive number is returned, where the number returned is > the number of structures which have non-zero revents fields (in other > words, those descriptors with events or errors reported). > > poll() is non-destructive (it doesn't change any state in the kernel). Returning > EFAULT in case of an error in the very last bit of the user area is mandatory. > > On the contrary: > > epoll_wait() does return a count of transferred events, and updates some state > in the kernel (it consumes Edge Triggered events: they can be lost forever if not > reported to the user). > > So epoll_wait() is much more like read(), which also updates file state in > the kernel (the current file position). Lost forever means? If there are more processes watching some fd (external events), they all get their own copy of the events in their own private epoll fd. It's not that we "steal" things out of the kernel; it is not a 1:1 producer/consumer thing (one producer, 1 queue). It's a one-producer, broadcast-to-all-listeners (consumers) thing. The only case where it'd matter is in the case of multiple threads sharing the same epoll fd. In general, I'd be more for having the userspace get its own SEGFAULT instead of letting it go with broken parameters. If I'm coding userspace, and I'm doing something wrong, I like the kernel to let me know, instead of trying to fix things for me. Also, epoll can easily be fixed (add a param to ep_reinject_items() to re-inject items in case of error/EFAULT) to leave events in the ready-list and let the EFAULT emerge. Anyone else have opinions about this? PS: Next time it'd be great if you Cc: me when posting epoll patches, so you save Andrew the job of doing it. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
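Davide's broadcast point can be checked with a few lines of standard epoll code: one socket registered in two independent epoll instances is reported by both, nothing is stolen (a self-contained sketch; sock_fd is any readable socket):

/* Two independent epoll instances watching the same fd: both see the
 * event; neither "steals" it from the other. */
#include <sys/epoll.h>

static void demo(int sock_fd)
{
	int ep1 = epoll_create(1);
	int ep2 = epoll_create(1);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };

	epoll_ctl(ep1, EPOLL_CTL_ADD, sock_fd, &ev);
	epoll_ctl(ep2, EPOLL_CTL_ADD, sock_fd, &ev);
	/* when sock_fd becomes readable, epoll_wait() on ep1 and on ep2
	 * each report it */
}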
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 23:56 ` Davide Libenzi @ 2006-11-09 7:24 ` Eric Dumazet 2006-11-09 7:52 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 7:24 UTC (permalink / raw) To: Davide Libenzi Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Davide Libenzi wrote: > On Thu, 9 Nov 2006, Eric Dumazet wrote: > >> Davide Libenzi wrote: >>> I don't care either way, but sys_poll() does the same thing epoll does >>> right now, so I would not change epoll behaviour. >>> >> Sure, poll() cannot return a partial count, since its return value is: >> >> On success, a positive number is returned, where the number returned is >> the number of structures which have non-zero revents fields (in other >> words, those descriptors with events or errors reported). >> >> poll() is non-destructive (it doesn't change any state in the kernel). Returning >> EFAULT in case of an error in the very last bit of the user area is mandatory. >> >> On the contrary: >> >> epoll_wait() does return a count of transferred events, and updates some state >> in the kernel (it consumes Edge Triggered events: they can be lost forever if not >> reported to the user). >> >> So epoll_wait() is much more like read(), which also updates file state in >> the kernel (the current file position). > > Lost forever means? If there are more processes watching some fd > (external events), they all get their own copy of the events in their own > private epoll fd. It's not that we "steal" things out of the kernel; it is > not a 1:1 producer/consumer thing (one producer, 1 queue). It's a one- > producer, broadcast-to-all-listeners (consumers) thing. The only case > where it'd matter is in the case of multiple threads sharing the same > epoll fd. In my particular epoll application, the producer is the tcp stack, and I have one consumer. If a network event is lost in the EFAULT handling, it's lost forever. In any case, my application does provide a correct user area, so this problem is only theoretical. > In general, I'd be more for having the userspace get its own SEGFAULT > instead of letting it go with broken parameters. If I'm coding userspace, > and I'm doing something wrong, I like the kernel to let me know, instead > of trying to fix things for me. > Also, epoll can easily be fixed (add a param to ep_reinject_items() to > re-inject items in case of error/EFAULT) to leave events in the ready-list > and let the EFAULT emerge. Please don't slow the hot path for what is basically a "User Error". It's already tested in the transfer function, with two conditional branches for each transferred event. > Anyone else have opinions about this? > > > > > PS: Next time it'd be great if you Cc: me when posting epoll patches, so > you save Andrew the job of doing it. Yes, but this particular patch was a followup on Andrew's own kevent patch. I have a bunch of patches for epoll I will send to you :) Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-09 7:24 ` Eric Dumazet @ 2006-11-09 7:52 ` Eric Dumazet 2006-11-09 17:12 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 7:52 UTC (permalink / raw) To: Eric Dumazet Cc: Davide Libenzi, Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Eric Dumazet wrote: > Davide Libenzi wrote: >> Lost forever means? If there are more processes watching some fd >> (external events), they all get their own copy of the events in their >> own private epoll fd. It's not that we "steal" things out of the >> kernel; it is not a 1:1 producer/consumer thing (one producer, 1 queue). >> It's a one-producer, broadcast-to-all-listeners (consumers) thing. The >> only case where it'd matter is in the case of multiple threads sharing >> the same epoll fd. > > In my particular epoll application, the producer is the tcp stack, and I > have one consumer. If a network event is lost in the EFAULT handling, > it's lost forever. In any case, my application does provide a correct user > area, so this problem is only theoretical. I realize I was not explicit, and did not answer your question ("Lost forever" means?) if (epi->revents) { if (__put_user(epi->revents, &events[eventcnt].events) || __put_user(epi->event.data, &events[eventcnt].data)) return -EFAULT; >> if (epi->event.events & EPOLLONESHOT) >> epi->event.events &= EP_PRIVATE_BITS; eventcnt++; } If one EPOLLONESHOT event is correctly copied to user space, its status is updated. If other ready events in the same epoll_wait() call cannot be transferred because of an EFAULT (we reach the real end of the user-provided area), this EPOLLONESHOT event is lost forever, because it won't be requeued in the ready list. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-09 7:52 ` Eric Dumazet @ 2006-11-09 17:12 ` Davide Libenzi 0 siblings, 0 replies; 200+ messages in thread From: Davide Libenzi @ 2006-11-09 17:12 UTC (permalink / raw) To: Eric Dumazet Cc: Davide Libenzi, Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Thu, 9 Nov 2006, Eric Dumazet wrote: > > > Lost forever means? If there are more processes watching some fd (external > > > events), they all get their own copy of the events in their own private > > > epoll fd. It's not that we "steal" things out of the kernel; it is not a 1:1 > > > producer/consumer thing (one producer, 1 queue). It's a one-producer, > > > broadcast-to-all-listeners (consumers) thing. The only case where it'd > > > matter is in the case of multiple threads sharing the same epoll fd. > > > > In my particular epoll application, the producer is the tcp stack, and I have > > one consumer. If a network event is lost in the EFAULT handling, it's lost > > forever. In any case, my application does provide a correct user area, so this > > problem is only theoretical. > > I realize I was not explicit, and did not answer your question ("Lost forever" > means?) > > if (epi->revents) { > if (__put_user(epi->revents, > &events[eventcnt].events) || > __put_user(epi->event.data, > &events[eventcnt].data)) > return -EFAULT; > >> if (epi->event.events & EPOLLONESHOT) > >> epi->event.events &= EP_PRIVATE_BITS; > eventcnt++; > } > > If one EPOLLONESHOT event is correctly copied to user space, its status is > updated. > > If other ready events in the same epoll_wait() call cannot be transferred > because of an EFAULT (we reach the real end of the user-provided area), this > EPOLLONESHOT event is lost forever, because it won't be requeued in the ready list. Your application is feeding crap to the kernel, because of programming bugs. If that happens, I want an EFAULT and not a partially filled buffer. And which buffer then? This could have been scribbled in userspace memory (the pointer), and the attempt of the kernel to mask out bugs might create even more subtle problems. Such a bug will *never* show up in case the wrong buffer is partially valid (first part valid, that is the *only* case where your fix would make a difference compared to the status quo), since in case of no ready events we'll never hit it, and in case of some events we'll always return a few of them and never EFAULT. No, the more I think about it, the more I personally disagree with the change. > Please don't slow the hot path for what is basically a "User Error". It's already > tested in the transfer function, with two conditional > branches for each transferred event. Ohh, if you think you can measure them from userspace, those can be turned into 'err |= __put_user();' with err tested only outside the loop. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
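The micro-optimisation Davide mentions at the end would look roughly as follows; this is a sketch, and the iteration over the transfer list is paraphrased rather than the literal eventpoll code:

/* Sketch: accumulate __put_user() failures and test once, instead of
 * one conditional branch per store; loop shape is illustrative only. */
int err = 0, eventcnt = 0;
struct epitem *epi;

list_for_each_entry(epi, txlist, rdllink) {	/* assumed iteration */
	if (!epi->revents)
		continue;
	err |= __put_user(epi->revents, &events[eventcnt].events);
	err |= __put_user(epi->event.data, &events[eventcnt].data);
	if (epi->event.events & EPOLLONESHOT)
		epi->event.events &= EP_PRIVATE_BITS;
	eventcnt++;
}
return err ? -EFAULT : eventcnt;

Note this keeps whole-call EFAULT semantics (Davide's preference) while removing the per-event branch; the EPOLLONESHOT loss Eric describes above would still need separate handling.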
* [take24 0/6] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> ` (2 preceding siblings ...) 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov ` (2 more replies) 2006-11-21 16:29 ` [take25 " Evgeniy Polyakov 2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov 5 siblings, 3 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem which allows handling event notifications. It supports both level and edge triggered events. It is similar to poll/epoll in some cases, but it is more scalable, it is faster and allows working with essentially any kind of event. Events are provided to the kernel through a control syscall and can be read back through a ring buffer or using the usual syscalls. Kevent updates (i.e. readiness switching) happen directly from the internals of the appropriate state machine of the underlying subsystem (like network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent Consider for inclusion. Changes from 'take23' patchset: * kevent PIPE notifications * KEVENT_REQ_LAST_CHECK flag, which allows performing a last check at dequeue time * fixed poll/select notifications (were broken due to tree manipulations) * made Documentation/kevent.txt look nice in an 80-col terminal * fix for copy_to_user() failure report for the first kevent (Andrew Morton) * minor function renames Here is the pipe result with the kevent_pipe kernel part and 2000 pipes (Eric Dumazet's application): epoll (edge-triggered): 248408 events/sec kevent (edge-triggered): 269282 events/sec Busy reading loop: 269519 events/sec Changes from 'take22' patchset: * new ring buffer implementation in the process' memory * wakeup-one-thread flag * edge-triggered behaviour With this release an additional independent benchmark shows kevent speed compared to epoll: Eric Dumazet created a special benchmark which creates a set of AF_INET sockets, and two threads start to simultaneously read and write data from/into them. Here are the results: epoll (no EPOLLET): 57428 events/sec kevent (no ET): 59794 events/sec epoll (with EPOLLET): 71000 events/sec kevent (with ET): 78265 events/sec Maximum (busy loop reading events): 88482 events/sec Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespaces and so on) * fixed a bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed artificial limit on possible number of kevents With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit lan; data IO over the network was about 10582.7 KB/s, which is quite close to wire speed if we take into account headers and the like.
Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use RB tree instead of hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, so it justifies RB tree usage over AVL tree, since the latter does have much slower deletion time (max O(log(N)) compared to 3 ops), although faster search time (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now and later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed readiness check for socket notifications. With both above changes it is possible to achieve more than 3380 req/second compared to 2200, sometimes 2500 req/second for epoll() for a trivial web server and httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum allowed kevents in a time limit, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created a special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers, this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either timeout expires or at least one event becomes ready. It also commits that @num events from @start are processed by userspace and thus can be removed or rearmed (depending on its flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not get lock around user data check in __kevent_search() * fail early if there were no registered callbacks for the given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use pointer to kevent_mring instead of unsigned longs * use aligned 64bit type in raw user data (can be used by high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for timeout * put number of milliseconds into timer's return data * move some definitions into user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for ready_callback() callback before main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added helper function for kevent_registered_callbacks * fixed 80-column comment issues * added a header shared between userspace and kernelspace instead of embedding them in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use structure instead - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments!
* do not use list poisoning for detecting that an entry is in the list * return number of ready kevents even if copy*user() fails * strict check for number of kevents in syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed default 1 second timeout * removed AIO stuff from patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so it was taught to differentiate between various cases * remove kevent from storage if it is marked as broken after callback * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() to locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use array of callbacks of each type instead of each kevent callback initialization * changed name of ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use the kevent_user_ctl structure, instead provide needed arguments as syscall parameters * various indent cleanups * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace * mapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use old syscall kevent_ctl for creation/removing, modification and initial kevent initialization - use mutexes instead of semaphores - added file descriptor check and return error if provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take24 1/6] kevent: Description. 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov 2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov 2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik

Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..ca49e4b
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,186 @@
+Description.
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is created by opening "/dev/kevent" char device, which is created with
+dynamic minor number and major number assigned for misc devices.
+
+cmd - is the requested operation. It can be one of the following:
+    KEVENT_CTL_ADD - add event notification
+    KEVENT_CTL_REMOVE - remove event notification
+    KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent *buf, unsigned flags)
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+    waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+    events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+kevent_get_events will wait timeout nanoseconds for at least min_nr completed
+events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+timeout - this timeout specifies number of nanoseconds to wait until there is
+    free space in kevent queue
+
+This syscall waits until either timeout expires or at least one event becomes
+ready. It also copies those num events into the ring buffer and requeues them
+(or removes them, depending on their flags).
+-------------------------------------------------------------------------------
+
+ int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - size of the ring buffer in events
+
+ struct kevent_ring
+ {
+	unsigned int ring_kidx;
+	struct ukevent event[0];
+ }
+
+ring_kidx - is an index in the ring buffer where kernel will put new events
+    when kevent_wait() or kevent_get_events() is called
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) will copy the event into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready, since if it is
+removed, no one cares about it (otherwise the user would wait until it became
+ready and got it the usual way using kevent_get_events() or kevent_wait())
+and thus there is no need to copy it to the ring buffer.
+
+With a userspace ring buffer it is possible that events in the ring buffer
+are replaced without the knowledge of the thread currently reading them
+(when another thread calls kevent_get_events() or kevent_wait()), so
+appropriate locking between threads or processes, which can simultaneously
+access the same ring buffer, is required.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+    Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+    Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+    Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+    Per-event request flags,
+
+    KEVENT_REQ_ONESHOT
+        event will be removed when it is ready
+
+    KEVENT_REQ_WAKEUP_ONE
+        When several threads wait on the same kevent queue and have requested
+        the same event, for example 'wake me up when a new client has
+        connected, so I can call accept()', then all threads will be awakened
+        when a new client has connected, but only one of them can process the
+        data. This problem is known as the thundering herd problem. Events
+        which have this flag set will not be marked as ready (and the
+        appropriate threads will not be awakened) if at least one event has
+        already been marked.
+
+    KEVENT_REQ_ET
+        Edge Triggered behaviour. It is an optimisation which allows a ready
+        and dequeued (i.e. copied to userspace) event to be moved back into
+        the set of interest for the given storage (socket, inode and so on).
+        It is very useful for cases when the same event should be used many
+        times (like reading from a pipe). It is similar to epoll()'s EPOLLET
+        flag.
+
+    KEVENT_REQ_LAST_CHECK
+        if set, allows to perform the last check on a kevent (call the
+        appropriate callback) when the kevent is marked as ready and has been
+        removed from the ready queue. If it is confirmed that the kevent is
+        ready (k->callbacks.callback(k) returns true) then the kevent will be
+        copied to userspace, otherwise it will be requeued back to storage.
+        The second (checking) call is performed with this bit cleared, so the
+        callback can detect whether it was called from kevent_storage_ready()
+        (bit is set) or kevent_dequeue_ready() (bit is cleared). If the kevent
+        is requeued, the bit will be set again.
+
+__u32 ret_flags
+    Per-event return flags
+
+    KEVENT_RET_BROKEN
+        Kevent is broken
+
+    KEVENT_RET_DONE
+        Kevent processing was finished successfully
+
+    KEVENT_RET_COPY_FAILED
+        Kevent was not copied into ring buffer due to some error conditions.
+
+__u32 ret_data[2]
+    Event return data. Event originator fills it with anything it likes (for
+    example timer notifications put the number of milliseconds there when the
+    timer has fired).
+union { __u32 user[2]; void *ptr; }
+    User's data. It is not used, just copied to/from user. The whole structure
+    is aligned to 8 bytes already, so the last union is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, possibly event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the passed
+in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always
+set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+    type - KEVENT_TIMER
+    event - KEVENT_TIMER_FIRED
+    req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+    id.raw[0] - number of seconds after commit when this timer should expire
+    id.raw[1] - number of nanoseconds in addition to the seconds

^ permalink raw reply related [flat|nested] 200+ messages in thread
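To make the documented calling convention concrete, the following is a minimal, untested userspace sketch of the flow above: open the "/dev/kevent" char device, add a one-shot timer with kevent_ctl(KEVENT_CTL_ADD), then collect it with kevent_get_events(). The syscall numbers are the i386 assignments made by this patchset (319/320), and the structure definitions are abbreviated from linux/ukevent.h in the core patch; both assumptions hold only on a kernel with these patches applied.

/* Untested sketch, assuming this patchset is applied. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

/* i386 syscall numbers assigned by this patchset. */
#define __NR_kevent_get_events	319
#define __NR_kevent_ctl		320

/* Constants and structures abbreviated from linux/ukevent.h. */
#define KEVENT_CTL_ADD		0
#define KEVENT_TIMER		2
#define KEVENT_TIMER_FIRED	0x1
#define KEVENT_REQ_ONESHOT	0x1

struct kevent_id {
	union {
		__u32 raw[2];
		__u64 raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;
	__u32 type, event, req_flags, ret_flags;
	__u32 ret_data[2];
	union {
		__u32 user[2];
		void *ptr;
	};
};

int main(void)
{
	struct ukevent uk;
	long err;
	int fd;

	fd = open("/dev/kevent", O_RDWR);
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;
	uk.id.raw[0] = 2;	/* seconds after commit until expiration */
	uk.id.raw[1] = 0;	/* additional nanoseconds */

	/* Returns the number of immediately ready events, negative on error. */
	err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
	if (err < 0)
		return 1;

	/*
	 * Block for one event with a 5 second timeout. Passing the 64-bit
	 * timeout through syscall(2) is simplified here; it splits into two
	 * arguments on 32-bit ABIs.
	 */
	err = syscall(__NR_kevent_get_events, fd, 1, 1, 5000000000ULL, &uk, 0);
	if (err == 1)
		printf("timer fired, ret_data[0] = %u msec\n", uk.ret_data[0]);

	close(fd);
	return 0;
}

Note that kevent_ctl(KEVENT_CTL_ADD) returning 0 simply means the timer was queued rather than immediately ready; with a ring buffer attached, kevent_wait() (sketched after the core patch below) can take the place of kevent_get_events() as the commit step.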
* [take24 2/6] kevent: Core files. 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..fa8075b 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,7 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_ring_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..95fb252 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,12 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_ring_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..2161ef2 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,14 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_ring_init 322 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 323 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..3669c0f 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,18 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_ring_init 283 +__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ring_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..f7cbf6b --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,223 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/fs.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, ring_size; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#ifdef CONFIG_KEVENT_POLL +static inline void kevent_init_file(struct file *file) +{ + kevent_storage_init(file, &file->st); +} + +static inline void kevent_cleanup_file(struct file *file) +{ + kevent_storage_fini(&file->st); +} +#else +static inline void kevent_init_file(struct file *file) {} +static inline void kevent_cleanup_file(struct file *file) {} +#endif + +#ifdef CONFIG_KEVENT_PIPE +extern void kevent_pipe_notify(struct inode *inode, u32 events); +#else +static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. 
struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. */ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..471a685 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..b14e14e --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,165 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Wake up only when event exclusively belongs to this thread, + * for example when several threads are waiting for new client + * connection so they could perform accept() it is a good idea + * to set this flag, so only one thread of all with this flag set + * will be awakened. + * If there are events without this flags, appropriate threads will + * be awakened too. */ +#define KEVENT_REQ_WAKEUP_ONE 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 +/* Perform the last check on kevent (call appropriate callback) when + * kevent is marked as ready and has been removed from ready queue. + * If it will be confirmed that kevent is ready + * (k->callbacks.callback(k) returns true) then kevent will be copied + * to userspace, otherwise it will be requeued back to storage. + * Second (checking) call is performed with this bit _cleared_ so + * callback can detect when it was called from + * kevent_storage_ready() - bit is set, or + * kevent_dequeue_ready() - bit is cleared. + * If kevent will be requeued, bit will be set again. */ +#define KEVENT_REQ_LAST_CHECK 0x8 + +/* + * Kevent return flags. + */ +/* Kevent is broken. 
*/ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 +/* Kevent was not copied into ring buffer due to some error conditions. */ +#define KEVENT_RET_COPY_FAILED 0x4 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_PIPE 6 +#define KEVENT_MAX 7 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. */ + +struct kevent_id +{ + union { + __u32 raw[2]; + __u64 raw_u64 __attribute__((aligned(8))); + }; +}; + +struct ukevent +{ + /* Id of this request, e.g. socket number, file descriptor and so on... */ + struct kevent_id id; + /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 type; + /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */ + __u32 event; + /* Per-event request flags */ + __u32 req_flags; + /* Per-event return flags */ + __u32 ret_flags; + /* Event return data. Event originator fills it with anything it likes. */ + __u32 ret_data[2]; + /* User's data. It is not used, just copied to/from user. + * The whole structure is aligned to 8 bytes already, so the last union + * is aligned properly. + */ + union { + __u32 user[2]; + void *ptr; + }; +}; + +struct kevent_ring +{ + unsigned int ring_kidx; + struct ukevent event[0]; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 + +#endif /* __UKEVENT_H */ diff --git a/init/Kconfig b/init/Kconfig index d2eb7a8..c7d8250 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -201,6 +201,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. 
+source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index d62ec66..2d7a6dd 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..267fc53 --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,45 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback + invocations, advanced timer notifications and other kernel + object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents + which are ready immediately at insertion time and number of kevents + which were removed through readiness completion. + It will be printed each time control kevent descriptor is closed. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + +config KEVENT_PIPE + bool "Kernel event notifications for pipes" + depends on KEVENT + help + This option enables notifications through KEVENT subsystem of + pipe read/write operations. diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..d4d6b68 --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,5 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o +obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..24ee44a --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,232 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. 
+ * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. + */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (ready_callback) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..00d942a --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,936 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static const char kevent_name[] = "kevent"; +static kmem_cache_t *kevent_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success, + * -EAGAIN if there were no place for that kevent (impossible) + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. + */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + if (++u->kidx >= u->ring_size) + u->kidx = 0; + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return -ENOMEM; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->ring_size = 0; + u->pring = NULL; + + file->private_data = u; + return 0; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. + * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kfree(u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. 
+ */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kevent_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + while (u->ready_num && !k) { + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) { + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + if (!k->callbacks.callback(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + k = NULL; + } + } else + break; + } + + if (k) + kevent_copy_ring_buffer(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. 
+ */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. 
+ */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. + */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. 
+ */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; + break; + } + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct miscdevice kevent_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = kevent_name, + .fops = &kevent_user_fops, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - timeout in nanoseconds to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - ununsed for now (will be used for mmap implementation). + */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + __u64 timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf); +out_fput: + fput(file); + return err; +} + +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + if (u->pring) { + err = -EINVAL; + goto err_out_exit; + } + u->pring = ring; + u->ring_size = num; + mutex_unlock(&u->ring_lock); + + fput(file); + + return 0; + +err_out_exit: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in kevent queue + * and removes/requeues requested number of events (commits them). Function returns + * number of actually committed events. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @timeout - this timeout specifies number of nanoseconds to wait until there is + * free space in kevent queue. + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. 
+ * Kevents will be copied into ring buffer in order they were placed into ready queue. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) +{ + int err = -EINVAL, committed = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num >= u->ring_size) + goto out_fput; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= 1, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + for (i=0; i<num; ++i) { + k = kevent_dequeue_ready(u); + if (!k) + break; + kevent_complete_ready(k); + kevent_stat_ring(u); + committed++; + } + + fput(file); + + return committed; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create kevent cache and register + * filesystem to get control file descriptors from. + */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..5200583 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_ring_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
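Tying sys_kevent_ring_init() and sys_kevent_wait() above together, here is an untested sketch of the ring-buffer consumption path. The syscall numbers (321/322) are again the i386 assignments from this patchset, RING_SIZE is an arbitrary illustrative choice, struct ukevent is abbreviated from linux/ukevent.h, and a fixed-size array stands in for the event[0] flexible array of the real ABI.

/* Untested ring-buffer sketch, assuming this patchset is applied. */
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/types.h>

#define __NR_kevent_wait	321	/* i386 numbers from this patchset */
#define __NR_kevent_ring_init	322
#define RING_SIZE		256	/* arbitrary illustrative size */

struct kevent_id {
	union {
		__u32 raw[2];
		__u64 raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;
	__u32 type, event, req_flags, ret_flags;
	__u32 ret_data[2];
	union {
		__u32 user[2];
		void *ptr;
	};
};

struct kevent_ring {
	unsigned int ring_kidx;		/* kernel-maintained write index */
	struct ukevent event[RING_SIZE];
};

static struct kevent_ring ring;
static unsigned int uidx;		/* userspace read index, wraps like kidx */

/* Attach the ring once per queue; the kernel refuses a second ring (-EINVAL). */
static long ring_setup(int fd)
{
	return syscall(__NR_kevent_ring_init, fd, &ring, RING_SIZE);
}

/* Wait up to 1s and commit at most 16 ready events into the ring;
 * sys_kevent_wait() requires num to be smaller than the ring size. */
static long ring_consume(int fd)
{
	long n, i;

	n = syscall(__NR_kevent_wait, fd, 16, 1000000000ULL);
	for (i = 0; i < n; ++i) {
		struct ukevent *uk = &ring.event[uidx];

		/* process uk->type / uk->event / uk->ret_data here */
		if (++uidx >= RING_SIZE)
			uidx = 0;
	}
	return n;
}

Since kevent_copy_ring_buffer() above advances the kernel's write index by one per committed event and wraps it at the ring size, userspace can mirror it with its own read index as sketched; as the documentation patch notes, callers sharing one ring between threads or processes must provide their own locking.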
* [take24 3/6] kevent: poll/select() notifications. 2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov ` (2 more replies) 0 siblings, 3 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works simialr to epoll and has the same issues (callback is invoked not from internal state machine of the caller, but through process awake, a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/file_table.c b/fs/file_table.c index bc35a40..0805547 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -20,6 +20,7 @@ #include <linux/capability.h> #include <linux/cdev.h> #include <linux/fsnotify.h> #include <linux/sysctl.h> +#include <linux/kevent.h> #include <linux/percpu_counter.h> #include <asm/atomic.h> @@ -119,6 +120,7 @@ struct file *get_empty_filp(void) f->f_uid = tsk->fsuid; f->f_gid = tsk->fsgid; eventpoll_init_file(f); + kevent_init_file(f); /* f->f_version: 0 */ return f; @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file) * in the file cleanup chain. */ eventpoll_release(file); + kevent_cleanup_file(file); locks_remove_flock(file); if (file->f_op && file->f_op->release) diff --git a/fs/inode.c b/fs/inode.c index ada7643..6745c00 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..c529723 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent_storage.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..7030d21 --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,228 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + 
return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
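To show how this notification type would be consumed from userspace, a sketch follows, reusing the placeholder syscall wiring from the earlier example. kevent_poll_enqueue() above resolves id.raw[0] with fget() and compares the requested mask against what f_op->poll() reports, so a registration looks like this:

#include <poll.h>	/* POLLIN etc.; KEVENT_POLL reuses the poll mask bits */

/* Hypothetical helper: ask for POLLIN readiness on watched_fd through
 * the kevent control descriptor cfd. */
static int add_poll_watch(int cfd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = watched_fd;	/* fd whose ->poll() is consulted */
	uk.event = POLLIN | POLLRDNORM;	/* mask compared against revents */

	return syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);
}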
* [take24 4/6] kevent: Socket notifications.
  2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-09 8:23 ` Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov
  2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
  2006-11-09 18:51 ` Davide Libenzi
  2 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead
of epoll, its performance increased more than noticeably. More details
about various benchmarks and the server itself (evserver_kevent.c) can
be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..6745c00 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6
+857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..7f74110 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,135 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
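The evserver_kevent.c server is referenced above but not included in the posting. Purely as a sketch, with the same hypothetical syscall wiring as the earlier examples, an accept loop over these notifications might look as follows; handle_client() is an assumed application function:

#include <sys/socket.h>

extern void handle_client(int fd);	/* assumed application function */

/* Hypothetical accept loop: kevent_socket_enqueue() resolves id.raw[0]
 * via sockfd_lookup(), and kevent_socket_callback() maps POLLIN on a
 * listening socket to the ACCEPT/RECV mask. */
static void accept_loop(int cfd, int listen_fd)
{
	struct ukevent uk, ev[16];
	int i, n;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.id.raw[0] = listen_fd;
	uk.event = KEVENT_SOCKET_ACCEPT;
	syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);

	for (;;) {
		/* wait up to 1s for between 1 and 16 ready events */
		n = syscall(__NR_kevent_get_events, cfd, 1, 16,
			    1000000000ULL, ev, 0);
		for (i = 0; i < n; ++i) {
			int nfd = accept(ev[i].id.raw[0], NULL, NULL);
			if (nfd >= 0)
				handle_client(nfd);
		}
	}
}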
* [take24 5/6] kevent: Timer notifications. 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 6/6] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = 
k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
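Tying the above together, a registration sketch for a periodic timer (hypothetical syscall wiring as before; cfd is the control descriptor from the first sketch). As described above, id.raw[0] carries seconds and id.raw[1] nanoseconds; kevent_timer_func() re-arms the hrtimer with the same interval, and kevent_timer_callback() stores jiffies_to_msecs(jiffies) in ret_data[0] at each expiry:

struct ukevent uk;

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_TIMER;
uk.id.raw[0] = 0;			/* seconds */
uk.id.raw[1] = 100 * 1000 * 1000;	/* plus 100 ms, in nanoseconds */
uk.event = KEVENT_TIMER_FIRED;
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);

/* ... consume the periodic KEVENT_TIMER_FIRED events; when done,
 * drop the timer again so kevent_timer_dequeue() cancels the hrtimer: */
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_REMOVE, 1, &uk);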
* [take24 6/6] kevent: Pipe notifications. 2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/pipe_fs_i.h> #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..32c6f19 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,112 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
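A consumption sketch for this type as well (hypothetical syscall wiring as in the earlier examples). Note that the patch deliberately reuses the KEVENT_SOCKET_RECV/SEND bits for pipes, as the ukevent.h comment change in the following signal patch also reflects:

int pfd[2];
struct ukevent uk;

if (pipe(pfd) < 0)
	return -1;

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_PIPE;
uk.id.raw[0] = pfd[0];		/* kevent_pipe_enqueue() does fget() on this */
uk.event = KEVENT_SOCKET_RECV;	/* "readable": nrbufs > 0 in the callback */
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);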
* Re: [take24 3/6] kevent: poll/select() notifications. 2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-09 9:08 ` Eric Dumazet 2006-11-09 9:29 ` Evgeniy Polyakov 2006-11-09 18:51 ` Davide Libenzi 2 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 9:08 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thursday 09 November 2006 09:23, Evgeniy Polyakov wrote: > poll/select() notifications. > > This patch includes generic poll/select notifications. > kevent_poll works simialr to epoll and has the same issues (callback > is invoked not from internal state machine of the caller, but through > process awake, a lot of allocations and so on). > > Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> > > diff --git a/fs/file_table.c b/fs/file_table.c > index bc35a40..0805547 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -20,6 +20,7 @@ #include <linux/capability.h> > #include <linux/cdev.h> > #include <linux/fsnotify.h> > #include <linux/sysctl.h> > +#include <linux/kevent.h> > #include <linux/percpu_counter.h> > > #include <asm/atomic.h> > @@ -119,6 +120,7 @@ struct file *get_empty_filp(void) > f->f_uid = tsk->fsuid; > f->f_gid = tsk->fsgid; > eventpoll_init_file(f); > + kevent_init_file(f); > /* f->f_version: 0 */ > return f; > > @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file) > * in the file cleanup chain. > */ > eventpoll_release(file); > + kevent_cleanup_file(file); > locks_remove_flock(file); > > if (file->f_op && file->f_op->release) > diff --git a/fs/inode.c b/fs/inode.c > index ada7643..6745c00 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -21,6 +21,7 @@ #include <linux/pagemap.h> > #include <linux/cdev.h> > #include <linux/bootmem.h> > #include <linux/inotify.h> > +#include <linux/kevent.h> > #include <linux/mount.h> > > /* > @@ -164,12 +165,18 @@ #endif > } > inode->i_private = 0; > inode->i_mapping = mapping; Here you test both KEVENT_SOCKET and KEVENT_PIPE > +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > + kevent_storage_init(inode, &inode->st); > +#endif > } > return inode; > } > > void destroy_inode(struct inode *inode) > { but here you test only KEVENT_SOCKET > +#if defined CONFIG_KEVENT_SOCKET > + kevent_storage_fini(&inode->st); > +#endif > BUG_ON(inode_has_buffers(inode)); > security_inode_free(inode); > if (inode->i_sb->s_op->destroy_inode) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 5baf3a1..c529723 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> > #include <linux/init.h> > #include <linux/sched.h> > #include <linux/mutex.h> > +#include <linux/kevent_storage.h> > > #include <asm/atomic.h> > #include <asm/semaphore.h> > @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY > struct mutex inotify_mutex; /* protects the watches list */ > #endif > Here you include a kevent_storage only if KEVENT_SOCKET > +#ifdef CONFIG_KEVENT_SOCKET > + struct kevent_storage st; > +#endif > + > unsigned long i_state; > unsigned long dirtied_when; /* jiffies of first dirtying */ > > @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL > struct list_head f_ep_links; > spinlock_t f_ep_lock; > #endif /* #ifdef CONFIG_EPOLL */ > +#ifdef CONFIG_KEVENT_POLL > + struct kevent_storage st; > +#endif > struct address_space 
*f_mapping; > }; > extern spinlock_t files_lock; ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications. 2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet @ 2006-11-09 9:29 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 9:29 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 09, 2006 at 10:08:44AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > Here you test both KEVENT_SOCKET and KEVENT_PIPE > > > +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > > + kevent_storage_init(inode, &inode->st); > > +#endif > > } > > return inode; > > } > > > > void destroy_inode(struct inode *inode) > > { > > but here you test only KEVENT_SOCKET > > > +#if defined CONFIG_KEVENT_SOCKET > > + kevent_storage_fini(&inode->st); > > +#endif Indeed, it must be #if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > > BUG_ON(inode_has_buffers(inode)); > > security_inode_free(inode); > > if (inode->i_sb->s_op->destroy_inode) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 5baf3a1..c529723 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> > > #include <linux/init.h> > > #include <linux/sched.h> > > #include <linux/mutex.h> > > +#include <linux/kevent_storage.h> > > > > #include <asm/atomic.h> > > #include <asm/semaphore.h> > > @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY > > struct mutex inotify_mutex; /* protects the watches list */ > > #endif > > > > Here you include a kevent_storage only if KEVENT_SOCKET > > > +#ifdef CONFIG_KEVENT_SOCKET > > + struct kevent_storage st; > > +#endif > > + It must be #if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
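One way to keep the two sites from drifting apart again (an editor's sketch, not code from the patchset; the helper names are illustrative) is to centralize the compound condition behind inline helpers with empty stubs, so alloc_inode(), destroy_inode(), and the guard around struct inode's st member all key off a single definition:

/* Sketch for a shared header (e.g. linux/kevent.h). */
#if defined(CONFIG_KEVENT_SOCKET) || defined(CONFIG_KEVENT_PIPE)
#define KEVENT_INODE_STORAGE 1
static inline void kevent_inode_init(struct inode *inode)
{
	kevent_storage_init(inode, &inode->st);
}
static inline void kevent_inode_fini(struct inode *inode)
{
	kevent_storage_fini(&inode->st);
}
#else
static inline void kevent_inode_init(struct inode *inode) {}
static inline void kevent_inode_fini(struct inode *inode) {}
#endif

/* fs/inode.c then calls kevent_inode_init()/kevent_inode_fini()
 * unconditionally, and fs.h guards the st member with
 * #ifdef KEVENT_INODE_STORAGE, so the three sites cannot disagree. */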
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov
  2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
@ 2006-11-09 18:51 ` Davide Libenzi
  2006-11-09 19:10 ` Evgeniy Polyakov
  2 siblings, 1 reply; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 18:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> +static int kevent_poll_callback(struct kevent *k)
> +{
> +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> +		return 1;
> +	} else {
> +		struct file *file = k->st->origin;
> +		unsigned int revents = file->f_op->poll(file, NULL);
> +
> +		k->event.ret_data[0] = revents & k->event.event;
> +
> +		return (revents & k->event.event);
> +	}
> +}

You need to be careful that file->f_op->poll is not called inside the
spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
during epoll development days) file->f_op->poll might do a simple
spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
to have a suboptimal double O(R) loop to handle LT events.

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
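A sketch of the failure mode being described here (some_lock and example_poll are illustrative names, not from the patch): spin_unlock_irq() unconditionally re-enables interrupts, so calling such a ->poll() while holding an irqsave lock silently breaks the caller's interrupt state.

static DEFINE_SPINLOCK(some_lock);

/* stand-in for an f_op->poll implementation that is legal on its own */
static unsigned int example_poll(struct file *file, poll_table *pt)
{
	spin_lock_irq(&some_lock);
	/* ... check readiness state ... */
	spin_unlock_irq(&some_lock);	/* unconditionally re-enables IRQs */
	return POLLIN;
}

/* the dangerous pattern (fragment): */
spin_lock_irqsave(&st->lock, flags);	/* caller assumes IRQs stay off  */
revents = file->f_op->poll(file, NULL);	/* IRQs are back on after return */
/* from here until the unlock we hold st->lock with interrupts enabled;
 * if an interrupt handler ever takes st->lock, this can deadlock */
spin_unlock_irqrestore(&st->lock, flags);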
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 18:51 ` Davide Libenzi
@ 2006-11-09 19:10 ` Evgeniy Polyakov
  2006-11-09 19:42 ` Davide Libenzi
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-09 19:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> 
> > +static int kevent_poll_callback(struct kevent *k)
> > +{
> > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > +		return 1;
> > +	} else {
> > +		struct file *file = k->st->origin;
> > +		unsigned int revents = file->f_op->poll(file, NULL);
> > +
> > +		k->event.ret_data[0] = revents & k->event.event;
> > +
> > +		return (revents & k->event.event);
> > +	}
> > +}
> 
> You need to be careful that file->f_op->poll is not called inside the
> spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> during epoll development days) file->f_op->poll might do a simple
> spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> to have a suboptimal double O(R) loop to handle LT events.

It is tricky - users call wake_up() from any context, which in turn ends
up calling kevent_storage_ready(), which calls kevent_poll_callback()
with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
call in the fast path. Since the callback returns 1, the kevent will be
queued into the ready queue, which is processed on behalf of syscalls -
in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
is set, will call the callback again to check if the kevent is correctly
marked, but this time without that flag (it happens in syscall context,
i.e. process context without any locks held), so the callback calls
->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
value, the kevent's data is transferred into userspace, otherwise it is
'requeued' (just removed from the ready queue).

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 19:10 ` Evgeniy Polyakov
@ 2006-11-09 19:42 ` Davide Libenzi
  2006-11-09 20:10 ` Davide Libenzi
  0 siblings, 1 reply; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 19:42 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> > 
> > > +static int kevent_poll_callback(struct kevent *k)
> > > +{
> > > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > +		return 1;
> > > +	} else {
> > > +		struct file *file = k->st->origin;
> > > +		unsigned int revents = file->f_op->poll(file, NULL);
> > > +
> > > +		k->event.ret_data[0] = revents & k->event.event;
> > > +
> > > +		return (revents & k->event.event);
> > > +	}
> > > +}
> > 
> > You need to be careful that file->f_op->poll is not called inside the
> > spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> > during epoll development days) file->f_op->poll might do a simple
> > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> > to have a suboptimal double O(R) loop to handle LT events.
> 
> It is tricky - users call wake_up() from any context, which in turn ends
> up calling kevent_storage_ready(), which calls kevent_poll_callback()
> with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
> call in the fast path. Since the callback returns 1, the kevent will be
> queued into the ready queue, which is processed on behalf of syscalls -
> in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
> is set, will call the callback again to check if the kevent is correctly
> marked, but this time without that flag (it happens in syscall context,
> i.e. process context without any locks held), so the callback calls
> ->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
> value, the kevent's data is transferred into userspace, otherwise it is
> 'requeued' (just removed from the ready queue).

Oh, mine was only a general warning. I hadn't looked at the generic code
before. But now that I poke on it, I see:

void kevent_requeue(struct kevent *k)
{
	unsigned long flags;

	spin_lock_irqsave(&k->st->lock, flags);
	__kevent_requeue(k, 0);
	spin_unlock_irqrestore(&k->st->lock, flags);
}

and then:

static int __kevent_requeue(struct kevent *k, u32 event)
{
	int ret, rem;
	unsigned long flags;

	ret = k->callbacks.callback(k);

Couldn't k->callbacks.callback() possibly end up calling f_op->poll?

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 19:42 ` Davide Libenzi
@ 2006-11-09 20:10 ` Davide Libenzi
  0 siblings, 0 replies; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 20:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Davide Libenzi wrote:

> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> 
> > On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> > > 
> > > > +static int kevent_poll_callback(struct kevent *k)
> > > > +{
> > > > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > > +		return 1;
> > > > +	} else {
> > > > +		struct file *file = k->st->origin;
> > > > +		unsigned int revents = file->f_op->poll(file, NULL);
> > > > +
> > > > +		k->event.ret_data[0] = revents & k->event.event;
> > > > +
> > > > +		return (revents & k->event.event);
> > > > +	}
> > > > +}
> > > 
> > > You need to be careful that file->f_op->poll is not called inside the
> > > spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> > > during epoll development days) file->f_op->poll might do a simple
> > > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> > > to have a suboptimal double O(R) loop to handle LT events.
> > 
> > It is tricky - users call wake_up() from any context, which in turn ends
> > up calling kevent_storage_ready(), which calls kevent_poll_callback()
> > with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
> > call in the fast path. Since the callback returns 1, the kevent will be
> > queued into the ready queue, which is processed on behalf of syscalls -
> > in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
> > is set, will call the callback again to check if the kevent is correctly
> > marked, but this time without that flag (it happens in syscall context,
> > i.e. process context without any locks held), so the callback calls
> > ->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
> > value, the kevent's data is transferred into userspace, otherwise it is
> > 'requeued' (just removed from the ready queue).
> 
> Oh, mine was only a general warning. I hadn't looked at the generic code
> before. But now that I poke on it, I see:
> 
> void kevent_requeue(struct kevent *k)
> {
> 	unsigned long flags;
> 
> 	spin_lock_irqsave(&k->st->lock, flags);
> 	__kevent_requeue(k, 0);
> 	spin_unlock_irqrestore(&k->st->lock, flags);
> }
> 
> and then:
> 
> static int __kevent_requeue(struct kevent *k, u32 event)
> {
> 	int ret, rem;
> 	unsigned long flags;
> 
> 	ret = k->callbacks.callback(k);
> 
> Couldn't k->callbacks.callback() possibly end up calling f_op->poll?

Ack, there is the check for KEVENT_REQ_LAST_CHECK inside the callback.
The problem with f_op->poll was not that it can sleep (not excluded
though) but that some f_op->poll implementations can do a simple
spin_lock_irq/spin_unlock_irq. But from a quick peek your new code seems
fine with that.

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
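Summarizing the two-phase scheme this sub-thread converged on, a simplified sketch of the control flow described above; the helper is hypothetical, not the literal patch code:

/* Phase 1 - wakeup path, possibly irq context, under st->lock:
 * kevent_poll_callback() sees KEVENT_REQ_LAST_CHECK and returns 1
 * immediately, so the kevent is only moved to the ready queue and no
 * f_op->poll() ever runs under the spinlock. */

/* Phase 2 - syscall path, process context, no locks held: */
static int last_check(struct kevent *k)
{
	int ready;

	k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
	ready = k->callbacks.callback(k);	/* now really calls ->poll() */
	k->event.req_flags |= KEVENT_REQ_LAST_CHECK;

	return ready;	/* 0: spurious wakeup, just drop from ready queue */
}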
* [take24 7/6] kevent: signal notifications. 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-11 17:36 ` Evgeniy Polyakov 2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-11 17:36 UTC (permalink / raw) To: David Miller Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Signals which were requested to be delivered through kevent subsystem must be registered through usual signal() and others syscalls, this option allows alternative delivery. With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of signals, they will not be delivered in a usual way. Kevents for appropriate signals are not copied when process forks, new process must add new kevents after fork(). Mask of signals is copied as before. Test application which registers two signal callbacks for usr1 and usr2 signals and it's deivery through kevent (the former with both callback and kevent notifications, the latter only through kevent) is called signal.c and can be found in archive on project homepage http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/kevent.h b/include/linux/kevent.h index f7cbf6b..e588ae6 100644 --- a/include/linux/kevent.h +++ b/include/linux/kevent.h @@ -28,6 +28,7 @@ #include <linux/wait.h> #include <linux/net.h> #include <linux/rcupdate.h> #include <linux/fs.h> +#include <linux/sched.h> #include <linux/kevent_storage.h> #include <linux/ukevent.h> @@ -220,4 +221,10 @@ #else static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} #endif +#ifdef CONFIG_KEVENT_SIGNAL +extern int kevent_signal_notify(struct task_struct *tsk, int sig); +#else +static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;} +#endif + #endif /* __KEVENT_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index fc4a987..ef38a3c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -80,6 +80,7 @@ #include <linux/param.h> #include <linux/resource.h> #include <linux/timer.h> #include <linux/hrtimer.h> +#include <linux/kevent_storage.h> #include <asm/processor.h> @@ -1013,6 +1014,10 @@ #endif #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif +#ifdef CONFIG_KEVENT_SIGNAL + struct kevent_storage st; + u32 kevent_signals; +#endif }; static inline pid_t process_group(struct task_struct *tsk) diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h index b14e14e..a6038eb 100644 --- a/include/linux/ukevent.h +++ b/include/linux/ukevent.h @@ -68,7 +68,8 @@ #define KEVENT_POLL 3 #define KEVENT_NAIO 4 #define KEVENT_AIO 5 #define KEVENT_PIPE 6 -#define KEVENT_MAX 7 +#define KEVENT_SIGNAL 7 +#define KEVENT_MAX 8 /* * Per-type event sets. @@ -81,7 +82,7 @@ #define KEVENT_MAX 7 #define KEVENT_TIMER_FIRED 0x1 /* - * Socket/network asynchronous IO events. + * Socket/network asynchronous IO and PIPE events. */ #define KEVENT_SOCKET_RECV 0x1 #define KEVENT_SOCKET_ACCEPT 0x2 @@ -115,10 +116,20 @@ #define KEVENT_POLL_POLLREMOVE 0x1000 */ #define KEVENT_AIO_BIO 0x1 -#define KEVENT_MASK_ALL 0xffffffff +/* + * Signal events. 
+ */ +#define KEVENT_SIGNAL_DELIVERY 0x1 + +/* If set in raw64, then given signals will not be delivered + * in a usual way through sigmask update and signal callback + * invokation. */ +#define KEVENT_SIGNAL_NOMASK 0x8000000000000000ULL + /* Mask of all possible event values. */ -#define KEVENT_MASK_EMPTY 0x0 +#define KEVENT_MASK_ALL 0xffffffff /* Empty mask of ready events. */ +#define KEVENT_MASK_EMPTY 0x0 struct kevent_id { diff --git a/kernel/fork.c b/kernel/fork.c index 1c999f3..e5b5b14 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -46,6 +46,7 @@ #include <linux/cn_proc.h> #include <linux/delayacct.h> #include <linux/taskstats_kern.h> #include <linux/random.h> +#include <linux/kevent.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc WARN_ON(atomic_read(&tsk->usage)); WARN_ON(tsk == current); +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_fini(&tsk->st); +#endif security_task_free(tsk); free_uid(tsk->user); put_group_info(tsk->group_info); @@ -1121,6 +1125,10 @@ #endif if (retval) goto bad_fork_cleanup_namespace; +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_init(p, &p->st); +#endif + p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL; /* * Clear TID on mm_release()? diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig index 267fc53..4b137ee 100644 --- a/kernel/kevent/Kconfig +++ b/kernel/kevent/Kconfig @@ -43,3 +43,18 @@ config KEVENT_PIPE help This option enables notifications through KEVENT subsystem of pipe read/write operations. + +config KEVENT_SIGNAL + bool "Kernel event notifications for signals" + depends on KEVENT + help + This option enables signal delivery through KEVENT subsystem. + Signals which were requested to be delivered through kevent + subsystem must be registered through usual signal() and others + syscalls, this option allows alternative delivery. + With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of + signals, they will not be delivered in a usual way. + Kevents for appropriate signals are not copied when process forks, + new process must add new kevents after fork(). Mask of signals + is copied as before. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile index d4d6b68..f98e0c8 100644 --- a/kernel/kevent/Makefile +++ b/kernel/kevent/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_KEVENT_TIMER) += kevent_tim obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o +obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c new file mode 100644 index 0000000..15f9d1f --- /dev/null +++ b/kernel/kevent/kevent_signal.c @@ -0,0 +1,87 @@ +/* + * kevent_signal.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k->st->origin; + int sig = k->event.id.raw[0]; + int ret = 0; + + if (sig == tsk->kevent_signals) + ret = 1; + + if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK)) + tsk->kevent_signals |= 0x80000000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(¤t->st, k); + if (err) + goto err_out_exit; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_exit: + return err; +} + +int kevent_signal_dequeue(struct kevent *k) +{ + kevent_storage_dequeue(k->st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk->kevent_signals = sig; + kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk->kevent_signals & 0x80000000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_signal_callback, + .enqueue = &kevent_signal_enqueue, + .dequeue = &kevent_signal_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SIGNAL); +} +module_init(kevent_init_signal); diff --git a/kernel/signal.c b/kernel/signal.c index fb5da6d..d3d3594 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -23,6 +23,7 @@ #include <linux/syscalls.h> #include <linux/ptrace.h> #include <linux/signal.h> #include <linux/capability.h> +#include <linux/kevent.h> #include <asm/param.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -703,6 +704,9 @@ static int send_signal(int sig, struct s { struct sigqueue * q = NULL; int ret = 0; + + if (kevent_signal_notify(t, sig)) + return 1; /* * fast-pathed signals for kernel-internal things like SIGSTOP @@ -782,6 +786,17 @@ specific_send_sig_info(int sig, struct s ret = send_signal(sig, info, t, &t->pending); if (!ret && !sigismember(&t->blocked, sig)) signal_wake_up(t, sig == SIGKILL); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif out: return ret; } @@ -971,6 +986,17 @@ __group_send_sig_info(int sig, struct si * to avoid several races. */ ret = send_signal(sig, info, p, &p->signal->shared_pending); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif if (unlikely(ret)) return ret; -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
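For illustration, a registration sketch for the interface above (hypothetical syscall wiring as in the earlier examples; sigusr2_handler is an assumed application function). Per the description, id.raw[0] carries the signal number, and setting the KEVENT_SIGNAL_NOMASK bit in id.raw_u64 (which overlays id.raw[] in the kevent_id union) suppresses the normal delivery path:

#include <signal.h>

extern void sigusr2_handler(int sig);	/* assumed application handler */

struct ukevent uk;

/* the signal is still registered the usual way, per the text above */
signal(SIGUSR2, sigusr2_handler);

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_SIGNAL;
uk.id.raw[0] = SIGUSR2;			/* signal number */
uk.id.raw_u64 |= KEVENT_SIGNAL_NOMASK;	/* deliver only through kevent */
uk.event = KEVENT_SIGNAL_DELIVERY;
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);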
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov
  2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov
@ 2006-11-11 22:28 ` Ulrich Drepper
  2006-11-13 10:54 ` Evgeniy Polyakov
  2 siblings, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-11 22:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel,
	Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
> [...]

Sorry for the delay again. Kernel work is simply not my highest
priority. I've collected my comments on some parts of the patch. I
haven't gone through every part of the patch yet. Sorry for the length.

===================

- basic ring buffer problem: the kevent_copy_ring_buffer function stores
  the event in the ring buffer with no regard for the current content.

  + if more entries are dequeued than the ring buffer has room for,
    events immediately get overwritten without ever being passed to
    userlevel

  + as with the old approach, the ring buffer is basically unusable with
    multiple threads/processes. A thread calling kevent_wait might cause
    entries another thread is still working on to be overwritten.

  Possible solution:

  a) it would be possible to have a "used" flag in each ring buffer
     entry. That's too expensive, I guess.

  b) kevent_wait needs another parameter which specifies which is the
     last (i.e., least recently added) entry in the ring buffer.
     Everything between this entry and the current head (in ->kidx) is
     occupied. If multiple threads arrive in kevent_wait the highest idx
     (with wrap-around, possibly the lowest) is used. kevent_wait will
     not try to move more entries into the ring buffer if ->kidx and the
     highest index passed in to any kevent_wait call are equal (i.e.,
     the ring buffer is full).

     There is one issue, though, and that is that a system call is
     needed to signal to the kernel that more entries in the ring buffer
     are processed and that they can be refilled. This goes against the
     kernel filling the ring buffer automatically (see below).

  Threads should be able to (not necessarily forced to) use the
  interfaces like this:

  - by default all threads are "parked" in the kevent_wait syscall.
  - if an event occurs one thread might be woken (depending on the 'num'
    parameter)
  - the woken thread(s) work on all the events in the ring buffer and
    then call kevent_wait() again.

  This requires that the threads can independently call kevent_wait()
  and that they can independently retrieve events from the ring buffer
  without fear the entry gets overwritten before it is retrieved.

  Atomically retrieving entries from the ring buffer can be implemented
  at userlevel. Either the ring buffer is writable and a field in each
  ring buffer entry can be used as a 'handled' flag. Obviously this can
  be done with atomic compare-and-exchange. If the ring buffer is not
  writable then, as part of the userlevel wrapper around the event
  handling interfaces, another array is created which contains the use
  flags for each ring buffer entry. This is less elegant and probably
  slower.

===================

- implementing the kevent_wait syscall the proposed way means we are
  missing out on one possible optimization. The ring buffer is currently
  only filled on kevent_wait calls.
  I expect that in really
  high traffic situations requests are coming in at a higher rate than
  they can be processed.  At least for periods of time.  In such
  situations it would be nice to not have to call into the kernel at
  all.  If the kernel would deliver into the ring buffer on its own
  this would be possible.

  If the argument against this is that kevent_get_event should be
  possible the answer is...

===================

- the kevent_get_event syscall is not needed at all.  All reporting
  should be done using a ring buffer.  There really is no reason to
  keep two interfaces around which serve the same purpose.  Making
  the argument that kevent_get_event is so much easier to use is not
  valid.  The exposed interface to access the ring buffer will be easy,
  too.  In the OLS paper I more or less hinted at the interfaces.  I
  think they should be like this (names are irrelevant; a sketch of a
  possible userlevel implementation follows this message):

    ec_t ec_create(unsigned flags);
    int ec_destroy(ec_t ec);
    int ec_poll_event(ec_t ec, event_data_t *d);
    int ec_wait_event(ec_t ec, event_data_t *d);
    int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);

  The latter three interfaces are the interesting ones.  We have to get
  the data out of the ring buffer as quickly as possible.  So the
  interfaces require passing in a reference to an object which can hold
  the data.  The 'poll' variant won't delay, the other two will.

  We need separate create and destroy functions since there will always
  be a userlevel component of the data structures.  The create variant
  can allocate the ring buffer and the other memory needed ('handled'
  flags, tail pointers, ...) and destroy frees all resources.

  These interfaces are fast and easy to use.  At least as easy as the
  kevent_get_event syscall.  And all transparently implemented on top of
  the ring buffer.  So, please let's drop the unneeded syscall.

===================

- another optimization I am thinking about is optimizing thread wakeup
  and ring buffer use for cache-line locality.  I.e., if we know
  an event was queued on a specific CPU then the wakeup function
  should take this into account.  I.e., if any of the threads
  waiting was/will be scheduled on the same CPU it should be
  preferred.

  With the current simple form of a ring buffer this isn't sufficient,
  though.  Reading all entries in the ring buffer until finding the
  one written by the CPU in question is not helpful.  We'd need a
  mechanism to point the thread to the entry in question.  One
  possibility to do this is to return the ring buffer entry as the
  return value of the kevent_wait() syscall.  This works fine if the
  thread only works on one event (which I guess will be 99.999% of
  all uses).  An extension could be to extend the ukevent structure to
  contain an index of the next entry written by the same CPU.

  Another problem this entails is false sharing of the ring buffer
  entries.  This would probably require padding the ukevent structure
  to 64 bytes.  It's not that much more (40 bytes so far), and it's
  also more future-safe.  The alternative is to have per-CPU
  regions in the ring buffer.  With hotplug CPUs this is just plain
  silly.

  I think this optimization has the potential to help quite a bit,
  especially for large machines.

===================

- we absolutely need an interface to signal the kernel that a thread,
  just woken from kevent_wait, cannot handle the events.  I.e., the
  events are in the ring buffer but all the other threads are in the
  kernel in their kevent_wait calls.  The new syscall would wake up
  one or more threads to handle the events.
  This syscall is for instance
  necessary if the thread calling kevent_wait is canceled.  It might
  also be needed when a thread requested more than one event and
  realizes processing an entry takes a long time and that another
  thread might work on the other items in the meantime.

  Al Viro pointed out another possible solution which also could solve
  the "handled" flag problem and concurrency in use of the ring buffer.

  The idea is to require the kevent_wait() syscall to signal which entry
  in the ring buffer is handled or not handled.  This means:

  + the kernel knows at any time which entries in the buffer are free
    and which are not

  + concurrent filling of the ring buffer is no problem anymore since
    entries are not discarded until told

  + by not waiting for an event (num parameter == 0) the syscall can be
    used to discard entries to free up the ring buffer before continuing
    to work on more entries.  And, as per the requirement above, it can
    be used to tell the kernel that certain entries are *NOT* handled
    and need to be sent to another thread.  This would be useful in the
    thread cancellation case.

  This seems like a nice approach.

===================

- why no syscall to create a kevent queue?  With dynamic /dev this might
  be a problem and it's really not much additional code.  What about
  programs which want to use these interfaces before /dev is set up?

===================

- still: the syscall should use a struct timespec* timeout parameter
  and not nanosecs.  There are at least three timeout modes which
  are wanted:

  + relative, unconditionally wait that long

  + relative, aborted in case of large enough settimeofday() or NTP
    adjustment

  + absolute timeout.  Probably even with selecting which clock to use.
    This mode requires a timespec value parameter

  We have all this code already in the futex syscall.  It just needs to
  be generalized or copied and adjusted.

===================

- still: no signal mask parameter in the kevent_wait (and get_event)
  syscall.  Regardless of what one thinks about signals, they are used
  and integrating the kevent interface into existing code requires
  this functionality.  And it's not only about receiving signals.
  The signal mask parameter can also be used to _prevent_ signals from
  being delivered in that time.

===================

- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
  would reverse the default.  I cannot see many places where you want
  all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.

===================

- there is really no reason to invent yet another timer implementation.
  We have the POSIX timers which are feature rich and nicely
  implemented.  All that is needed is to implement SIGEV_KEVENT as a
  notification mechanism.  The timer is registered as part of the
  timer_create() syscall.

===================

I haven't yet looked at the other event sources.  I think the above is
enough for now.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 200+ messages in thread
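[Aside: a minimal userlevel sketch of the ec_* wrappers referenced above, layered over a read-only mapped ring plus a separate array of 'use' flags - the less elegant variant described in the message. Everything here is illustrative: the ring layout, the kev_fill() stand-in for the wait syscall and the flag protocol are assumptions, not an existing API. Advancing the tail and telling the kernel the claimed entries may be reused is exactly the commit problem debated below, so it is left out.]

#include <string.h>

struct ukevent { unsigned char raw[40]; };	/* 40 bytes, as noted above */

typedef struct ukevent event_data_t;

typedef struct {
	const struct ukevent *ring;	/* mapped by ec_create()           */
	unsigned *used;			/* userlevel per-entry 'use' flags */
	unsigned size;			/* number of ring entries          */
	unsigned tail;			/* oldest entry not yet consumed   */
} *ec_t;

/* Stand-in for the syscall: reports the current head index, optionally
 * blocking until it moves.  Purely assumed for this sketch. */
extern int kev_fill(ec_t ec, unsigned *head, int block);

int ec_poll_event(ec_t ec, event_data_t *d)
{
	unsigned head, i;

	if (kev_fill(ec, &head, 0) < 0)
		return -1;
	for (i = ec->tail; i != head; i = (i + 1) % ec->size) {
		/* compare-and-exchange claims the entry for this thread
		 * even when several consumers scan concurrently */
		if (__sync_bool_compare_and_swap(&ec->used[i], 0, 1)) {
			memcpy(d, &ec->ring[i], sizeof(*d));
			return 0;
		}
	}
	return -1;			/* nothing unclaimed */
}

int ec_wait_event(ec_t ec, event_data_t *d)
{
	unsigned head;

	while (ec_poll_event(ec, d) != 0)
		if (kev_fill(ec, &head, 1) < 0)	/* block for new entries */
			return -1;
	return 0;
}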
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
@ 2006-11-13 10:54 ` Evgeniy Polyakov
  2006-11-13 11:16   ` Evgeniy Polyakov
  2006-11-20  0:02   ` Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-13 10:54 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Sat, Nov 11, 2006 at 02:28:53PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >[...]
>
> Sorry for the delay again.  Kernel work is simply not my highest priority.
>
> I've collected my comments on some parts of the patch.  I haven't gone
> through every part of the patch yet.  Sorry for the length.

No problem.

> ===================
>
> - basic ring buffer problem: the kevent_copy_ring_buffer function stores
>   the event in the ring buffer without regard to the current content.
>
>   + if more entries are dequeued than the ring buffer holds, events
>     immediately get overwritten without anything being passed to
>     userlevel
>
>   + as with the old approach, the ring buffer is basically unusable with
>     multiple threads/processes.  A thread calling kevent_wait might
>     cause entries another thread is still working on to be overwritten.
>
>   Possible solutions:
>
>   a) it would be possible to have a "used" flag in each ring buffer entry.
>      That's too expensive, I guess.
>
>   b) kevent_wait needs another parameter which specifies which is the
>      last (i.e., least recently added) entry in the ring buffer.
>      Everything between this entry and the current head (in ->kidx) is
>      occupied.  If multiple threads arrive in kevent_wait the highest idx
>      (with wrap around possibly lowest) is used.
>
>      kevent_wait will not try to move more entries into the ring buffer
>      if ->kidx and the highest index passed in to any kevent_wait call
>      are equal (i.e., the ring buffer is full).
>
>      There is one issue, though, and that is that a system call is needed
>      to signal to the kernel that more entries in the ring buffer are
>      processed and that they can be refilled.  This goes against the
>      kernel filling the ring buffer automatically (see below)

If a thread calls kevent_wait() it means it has processed the previous
entries; one can call kevent_wait() with the $num parameter as zero, which
means that the thread does not want any new events, so nothing will be
copied.

> Threads should be able to (not necessarily forced to) use the
> interfaces like this:
>
> - by default all threads are "parked" in the kevent_wait syscall.
>
> - If an event occurs one thread might be woken (depending on the 'num'
>   parameter)
>
> - the woken thread(s) work on all the events in the ring buffer and
>   then call kevent_wait() again.
>
> This requires that the threads can independently call kevent_wait()
> and that they can independently retrieve events from the ring buffer
> without fear that the entry gets overwritten before it is retrieved.
> Atomically retrieving entries from the ring buffer can be implemented
> at userlevel.  Either the ring buffer is writable and a field in each
> ring buffer entry can be used as a 'handled' flag.  Obviously this can
> be done with atomic compare-and-exchange.
> If the ring buffer is not
> writable then, as part of the userlevel wrapper around the event
> handling interfaces, another array is created which contains the use
> flags for each ring buffer entry.  This is less elegant and probably
> slower.

A writable ring buffer does not sound too good to me - what if one thread
overwrites the whole ring buffer so that the kernel's indexes get screwed
up?

A ring buffer processed not in FIFO order is a wrong idea - the ring buffer
can potentially be very big, and searching there for an entry which has
been marked as 'free' by userspace is not a solution at all - userspace
in that case must provide the ukevent so a fast tree search could be used,
and (although it is already possible) it requires userspace to make
additional syscalls, which is not what we want.

So the kevent ring buffer is designed in the following way: all entries
can be processed _only_ in FIFO order, i.e. they can be read in any order
threads want, but when one thread calls kevent_wait(num), the $num entries
requested from the beginning can be overwritten - the kernel does not know
how many users read those $num events from the beginning, and even if they
had some flag saying 'do not touch me, someone reads me', how and when
would those entries be reused?  The kernel does not store a bitmask or any
other type of object to show that holes in the ring buffer are free - it
works in FIFO order since that is the fastest mode.

As a solution I can create the following scheme (a sketch of the resulting
userspace loop follows this message): there are two syscalls (or one with
a switch) which get events and commit them.

kevent_wait() becomes a syscall which waits until a number of events or
one of them becomes ready and just copies them into the ring buffer and
returns.  kevent_wait() will fail with a special error code when the ring
buffer is full.

kevent_commit() frees the requested number of events _from the beginning_,
i.e. from a special index, visible from userspace.  Userspace can create
special counters for events (and even put them into the read-only ring
buffer overwriting some fields of the kevent, especially if we will
increase its size) and only call kevent_commit() when all events have a
zero usage counter.

I disagree that the possibility of having holes in the ring buffer is a
good idea at all - it requires a much more complex protocol, which will
fill and reuse those holes, and the main disadvantage - it requires
transferring much more information from userspace to kernelspace to free
the ring entry in the hole - in that case it is already possible just to
call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
approach at all.

> ===================
>
> - implementing the kevent_wait syscall the proposed way means we are
>   missing out on one possible optimization.  The ring buffer is
>   currently only filled on kevent_wait calls.  I expect that in really
>   high traffic situations requests are coming in at a higher rate than
>   they can be processed.  At least for periods of time.  In such
>   situations it would be nice to not have to call into the kernel at
>   all.  If the kernel would deliver into the ring buffer on its own
>   this would be possible.

Well, it can be done on behalf of a workqueue or a dedicated thread which
will bring up the appropriate mm context, although it means that userspace
can not handle the load it requested, which is a bad sign...

> If the argument against this is that kevent_get_event should be
> possible the answer is...
>
> ===================
>
> - the kevent_get_event syscall is not needed at all.  All reporting
>   should be done using a ring buffer.  There really is no reason to
>   keep two interfaces around which serve the same purpose.  Making
>   the argument that kevent_get_event is so much easier to use is not
>   valid.  The exposed interface to access the ring buffer will be easy,
>   too.  In the OLS paper I more or less hinted at the interfaces.  I
>   think they should be like this (names are irrelevant):

Well, kevent_get_events() _is_ much easier to use.  And actually, having
only that interface it is possible to implement a ring buffer with any
kind of protocol for controlling it - userspace can have a wrapper
which will call kevent_get_events() with a pointer to the place in the
shared ring buffer where new events should be placed; that wrapper can
handle essentially any kind of flags/parameters which are suitable for
that ring buffer implementation.

But since we started to implement the ring buffer as an additional feature
of kevent, let's find a way all people will be happy with before removing
something which was proven to work correctly.

> ec_t ec_create(unsigned flags);
> int ec_destroy(ec_t ec);
> int ec_poll_event(ec_t ec, event_data_t *d);
> int ec_wait_event(ec_t ec, event_data_t *d);
> int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);
>
> The latter three interfaces are the interesting ones.  We have to get
> the data out of the ring buffer as quickly as possible.  So the
> interfaces require passing in a reference to an object which can hold
> the data.  The 'poll' variant won't delay, the other two will.

The last three are exactly kevent_get_events() with a different set of
parameters - it is possible to get events without sleeping, it is
possible to wait until at least something is ready and it is possible to
sleep for a timeout.

> We need separate create and destroy functions since there will always
> be a userlevel component of the data structures.  The create variant
> can allocate the ring buffer and the other memory needed ('handled'
> flags, tail pointers, ...) and destroy frees all resources.
>
> These interfaces are fast and easy to use.  At least as easy as the
> kevent_get_event syscall.  And all transparently implemented on top of
> the ring buffer.  So, please let's drop the unneeded syscall.

They are all already implemented.  Just all of the above, and it was done
several months ago already.  No need to reinvent what is already there.
Even if we decide to remove kevent_get_events() in favour of a ring
buffer-only implementation, the waiting-for-event syscall will be
essentially kevent_get_events() without a pointer to the place where to
put events.
And I will not repeat that it has been possible (from the beginning, for
about 10 months already) to implement a ring buffer using
kevent_get_events().

I agree that having a special syscall to initialize kevent is a good idea,
and the initial kevent implementation had it, but it was removed due to
API cleanup work by Christoph Hellwig.
So I again see the same problem as several months ago, when many people
have opposite views on the API, and I as the author do not know who is
right...
Can we all agree that an initialization syscall is a good idea?

> ===================
>
> - another optimization I am thinking about is optimizing thread wakeup
>   and ring buffer use for cache-line locality.  I.e., if we know
>   an event was queued on a specific CPU then the wakeup function
>   should take this into account.  I.e., if any of the threads
>   waiting was/will be scheduled on the same CPU it should be
>   preferred.

Do you have _any_ kind of benchmarks with epoll() which would show that
it is feasible?
A ukevent is one cache line (well, 2 cache lines on old CPUs), which can
be set up way too far away from the time when it becomes ready, and the
CPU which originally set it up can be busy, so we will lose performance
waiting until that CPU becomes free instead of running another thread on a
different CPU.
So I'm asking: is there at least some data beyond theoretical thoughts?

> With the current simple form of a ring buffer this isn't sufficient,
> though.  Reading all entries in the ring buffer until finding the
> one written by the CPU in question is not helpful.  We'd need a
> mechanism to point the thread to the entry in question.  One
> possibility to do this is to return the ring buffer entry as the
> return value of the kevent_wait() syscall.  This works fine if the
> thread only works on one event (which I guess will be 99.999% of
> all uses).  An extension could be to extend the ukevent structure to
> contain an index of the next entry written by the same CPU.
>
> Another problem this entails is false sharing of the ring buffer
> entries.  This would probably require padding the ukevent structure
> to 64 bytes.  It's not that much more (40 bytes so far), and it's
> also more future-safe.  The alternative is to have per-CPU
> regions in the ring buffer.  With hotplug CPUs this is just plain
> silly.
>
> I think this optimization has the potential to help quite a bit,
> especially for large machines.

I think again that complete removal of the ring buffer and implementing it
in a userspace wrapper over kevent_get_events() is a good idea.
But probably I'm alone thinking in that direction, so let's think about a
ring buffer in kernelspace.

It is possible to specify a CPU id in the kevent (not in the ukevent, i.e.
not in the structure shared with userspace, but in its kernel
representation), and then check whether the currently active CPU is the
same or not, but what if it is not the same CPU?  Entry order is
important, since applications can take advantage of synchronization, so
the idea of skipping some entries is bad.

> ===================
>
> - we absolutely need an interface to signal the kernel that a thread,
>   just woken from kevent_wait, cannot handle the events.  I.e., the
>   events are in the ring buffer but all the other threads are in the
>   kernel in their kevent_wait calls.  The new syscall would wake up
>   one or more threads to handle the events.
>
>   This syscall is for instance necessary if the thread calling
>   kevent_wait is canceled.  It might also be needed when a thread
>   requested more than one event and realizes processing an entry
>   takes a long time and that another thread might work on the other
>   items in the meantime.

Hmm, send a signal to another thread when glibc cancels the given one...
This problem points me to the idea of a userspace thread implementation I
have in mind, but that is another story.

It is a management task - the kernel should not even know that someone has
died and can not process the events it requested.
Userspace can open a control pipe (and set up a kevent handler for it)
and glibc will write a byte there, thus awakening some other thread.
It can be done in userspace and should be done in userspace.
If you insist I will create userspace kevent handling - userspace will be
able to request kevents and mark them as ready.

> Al Viro pointed out another possible solution which also could solve
> the "handled" flag problem and concurrency in use of the ring buffer.
>
> The idea is to require the kevent_wait() syscall to signal which entry
> in the ring buffer is handled or not handled.
> This means:
>
> + the kernel knows at any time which entries in the buffer are free
>   and which are not
>
> + concurrent filling of the ring buffer is no problem anymore since
>   entries are not discarded until told
>
> + by not waiting for an event (num parameter == 0) the syscall can be
>   used to discard entries to free up the ring buffer before continuing
>   to work on more entries.  And, as per the requirement above, it can
>   be used to tell the kernel that certain entries are *NOT* handled
>   and need to be sent to another thread.  This would be useful in the
>   thread cancellation case.
>
> This seems like a nice approach.

But unfortunately theory and practice are different in the real world.
The kernel can have millions of entries in a _linear_ ring buffer - how do
you think they should be handled without a complex protocol between
userspace and kernelspace?  In that protocol userspace is required to
transfer some information to kernelspace so it could find the entry (i.e.
a per-entry field!), and then it should have a tree or other mechanism to
store free and used chunks of entries...
You probably did not see my network tree allocator patches I posted to the
lkml@, netdev@ and linux-mm@ lists - it is quite a big chunk of code which
handles exactly that, but you do not want to implement it in glibc, I
think...
So, do not overdesign.  And as a side note, btw - _all_ of the above can
be implemented in userspace.

> ===================
>
> - why no syscall to create a kevent queue?  With dynamic /dev this might
>   be a problem and it's really not much additional code.  What about
>   programs which want to use these interfaces before /dev is set up?

It was there - Christoph Hellwig removed it in his API cleanup patch; so
far it was not needed at all (and is not needed for now).
Such an application can create the /dev file by itself if it wants...
Just a thought.

> ===================
>
> - still: the syscall should use a struct timespec* timeout parameter
>   and not nanosecs.  There are at least three timeout modes which
>   are wanted:
>
>   + relative, unconditionally wait that long
>
>   + relative, aborted in case of large enough settimeofday() or NTP
>     adjustment
>
>   + absolute timeout.  Probably even with selecting which clock to use.
>     This mode requires a timespec value parameter
>
>   We have all this code already in the futex syscall.  It just needs to
>   be generalized or copied and adjusted.

Will we discuss it to death?

Kevent does not need an absolute timeout, because the timeout specified
there is always relative to the start of the syscall - it is a timeout
which specifies the maximum time frame the syscall can live.  All such
timeouts _ARE_ relative and should be relative, since that is correct.

> ===================
>
> - still: no signal mask parameter in the kevent_wait (and get_event)
>   syscall.  Regardless of what one thinks about signals, they are used
>   and integrating the kevent interface into existing code requires
>   this functionality.  And it's not only about receiving signals.
>   The signal mask parameter can also be used to _prevent_ signals from
>   being delivered in that time.

I created kevent_signal notifications - they allow the user to set up any
set of signals of interest before the call to kevent_get_events() and
friends.
No need to solve a problem the tactical way when there is a strategic
one - kevent signal notification is the approach which avoids workarounds
for interfaces that cannot handle types of events other than file
descriptors.
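[Aside: to make the absolute-vs-relative timeout point concrete, the userlevel fallback is the conversion below - roughly what the futex-based code in glibc does for absolute waits. It also shows what is lost: a settimeofday() between the conversion and the actual wait goes unnoticed, which is exactly the case an in-kernel absolute mode would close. Sketch only.]

#include <time.h>

/* Turn an absolute CLOCK_REALTIME deadline into the relative timeout a
 * relative-only wait primitive expects.  Returns -1 if the deadline has
 * already passed. */
static int abs_to_rel(const struct timespec *deadline, struct timespec *rel)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	rel->tv_sec = deadline->tv_sec - now.tv_sec;
	rel->tv_nsec = deadline->tv_nsec - now.tv_nsec;
	if (rel->tv_nsec < 0) {
		rel->tv_sec--;
		rel->tv_nsec += 1000000000L;
	}
	return rel->tv_sec < 0 ? -1 : 0;
}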
> ===================
>
> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
>   would reverse the default.  I cannot see many places where you want
>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.

I.e. always wake up only the first thread and, in addition, those threads
which have the specified flag set?  Ok, I will put it into the todo list
for the next release.

> ===================
>
> - there is really no reason to invent yet another timer implementation.
>   We have the POSIX timers which are feature rich and nicely
>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
>   notification mechanism.  The timer is registered as part of the
>   timer_create() syscall.

Feel free to add any interface you like - it is as simple as a call to
kevent_user_add_ukevent() in userspace.

> ===================
>
> I haven't yet looked at the other event sources.  I think the above is
> enough for now.

It looks like you generate ideas (or move them into a different
implementation layer) faster than I implement them :)
And I almost silently stand behind the fact that it is possible to
implement _all_ of the above ring buffer things in userspace with
kevent_get_events(), and this functionality has been there for almost a
year :)

Let's solve problems in order of their appearance - what do you think
about the above interface for the ring buffer?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

-- 
Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
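[Aside: the kevent_wait()/kevent_commit() scheme proposed in the message above, seen from the consuming side. The syscall signatures, error behaviour and ring layout are assumptions drawn from the description, not a settled interface.]

struct ukevent { unsigned char raw[40]; };	/* placeholder entry format */

extern void handle_event(const struct ukevent *ev);

/* Assumed: copies up to 'num' ready events into the mapped ring and
 * returns how many; fails when the ring is full; num == 0 waits without
 * copying anything. */
extern int kevent_wait(int fd, unsigned num, unsigned timeout_msec);
/* Assumed: releases 'num' entries from the beginning (the visible index). */
extern int kevent_commit(int fd, unsigned num);

static void event_loop(int fd, struct ukevent *ring, unsigned size)
{
	unsigned tail = 0;		/* mirrors the kernel's commit index */

	for (;;) {
		int i, n = kevent_wait(fd, size, 1000);
		if (n <= 0)
			continue;	/* timeout, or ring still full */

		for (i = 0; i < n; i++)
			handle_event(&ring[(tail + i) % size]);

		/* FIFO discipline: the whole batch is released from the
		 * beginning in one call, so no holes can appear. */
		kevent_commit(fd, n);
		tail = (tail + n) % size;
	}
}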
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-13 10:54 ` Evgeniy Polyakov
@ 2006-11-13 11:16 ` Evgeniy Polyakov
  2006-11-20  0:02 ` Ulrich Drepper
  1 sibling, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-13 11:16 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 13, 2006 at 01:54:58PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > ===================
> >
> > - there is really no reason to invent yet another timer implementation.
> >   We have the POSIX timers which are feature rich and nicely
> >   implemented.  All that is needed is to implement SIGEV_KEVENT as a
> >   notification mechanism.  The timer is registered as part of the
> >   timer_create() syscall.
>
> Feel free to add any interface you like - it is as simple as a call to
> kevent_user_add_ukevent() in userspace.

... in kernelspace I mean.

-- 
Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-13 10:54 ` Evgeniy Polyakov
  2006-11-13 11:16 ` Evgeniy Polyakov
@ 2006-11-20  0:02 ` Ulrich Drepper
  2006-11-20  8:25   ` Evgeniy Polyakov
  1 sibling, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 0:02 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
>> Possible solutions:
>>
>> a) it would be possible to have a "used" flag in each ring buffer entry.
>>    That's too expensive, I guess.
>>
>> b) kevent_wait needs another parameter which specifies which is the
>>    last (i.e., least recently added) entry in the ring buffer.
>>    Everything between this entry and the current head (in ->kidx) is
>>    occupied.  If multiple threads arrive in kevent_wait the highest idx
>>    (with wrap around possibly lowest) is used.
>>
>>    kevent_wait will not try to move more entries into the ring buffer
>>    if ->kidx and the highest index passed in to any kevent_wait call
>>    are equal (i.e., the ring buffer is full).
>>
>>    There is one issue, though, and that is that a system call is needed
>>    to signal to the kernel that more entries in the ring buffer are
>>    processed and that they can be refilled.  This goes against the
>>    kernel filling the ring buffer automatically (see below)
>
> If a thread calls kevent_wait() it means it has processed the previous
> entries; one can call kevent_wait() with the $num parameter as zero, which
> means that the thread does not want any new events, so nothing will be
> copied.

This doesn't solve the problem.  You could only request new events when
all previously reported events are processed.  Plus: how do you report
events if you don't allow get_event to pass them on?

> A writable ring buffer does not sound too good to me - what if one thread
> overwrites the whole ring buffer so that the kernel's indexes get screwed
> up?

Agreed, there are problems.  This is why I suggested the ring buffer can
be structured.  Parts of it might be read-only, other parts read/write.
I don't necessarily think the 'used' flag is the right way.  And the
front/tail pointer solution seems to be better.

> A ring buffer processed not in FIFO order is a wrong idea

Not necessarily, see my comments about CPU affinity in the previous mail.

> - the ring buffer
> can potentially be very big, and searching there for an entry which has
> been marked as 'free' by userspace is not a solution at all - userspace
> in that case must provide the ukevent so a fast tree search could be used,
> and (although it is already possible) it requires userspace to make
> additional syscalls, which is not what we want.

It is not necessary.  I've proposed to only have a front and tail
pointer.  The tail pointer is maintained by the application and passed
to the kernel explicitly or via shared memory.  The kernel maintains the
front pointer.  No tree needed.

> As a solution I can create the following scheme:
> there are two syscalls (or one with a switch) which get events and
> commit them.
>
> kevent_wait() becomes a syscall which waits until a number of events or
> one of them becomes ready and just copies them into the ring buffer and
> returns.  kevent_wait() will fail with a special error code when the ring
> buffer is full.
>
> kevent_commit() frees the requested number of events _from the beginning_,
> i.e. from a special index, visible from userspace.  Userspace can create
> special counters for events (and even put them into the read-only ring
> buffer overwriting some fields of the kevent, especially if we will
> increase its size) and only call kevent_commit() when all events have a
> zero usage counter.

Right, that's basically the front/tail pointer implementation.  That
would work.  You just have to make sure that the kevent_wait() call
takes the current front pointer/index as a parameter.  This way if the
buffer gets filled between the thread checking the ring buffer (and
finding it empty) and the syscall being handled the thread is not
suspended.

> I disagree that the possibility of having holes in the ring buffer is a
> good idea at all - it requires a much more complex protocol, which will
> fill and reuse those holes, and the main disadvantage - it requires
> transferring much more information from userspace to kernelspace to free
> the ring entry in the hole - in that case it is already possible just to
> call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
> approach at all.

Well, it would require more data transport if we'd use writable shared
memory.  But I agree, it's far too complicated and might not scale with
growing ring buffer sizes.

>> - implementing the kevent_wait syscall the proposed way means we are
>>   missing out on one possible optimization.  The ring buffer is
>>   currently only filled on kevent_wait calls.  I expect that in really
>>   high traffic situations requests are coming in at a higher rate than
>>   they can be processed.  At least for periods of time.  In such
>>   situations it would be nice to not have to call into the kernel at
>>   all.  If the kernel would deliver into the ring buffer on its own
>>   this would be possible.
>
> Well, it can be done on behalf of a workqueue or a dedicated thread which
> will bring up the appropriate mm context,

I think it should be done.  It's potentially a huge advantage.

> although it means that userspace
> can not handle the load it requested, which is a bad sign...

I don't understand.  What is not supposed to work?  There is nothing
which cannot work with automatic posting since the get_event() call does
nothing but copy the event data over and wake a thread.

>> - the kevent_get_event syscall is not needed at all.  All reporting
>>   should be done using a ring buffer.  There really is no reason to
>>   keep two interfaces around which serve the same purpose.  Making
>>   the argument that kevent_get_event is so much easier to use is not
>>   valid.  The exposed interface to access the ring buffer will be easy,
>>   too.  In the OLS paper I more or less hinted at the interfaces.  I
>>   think they should be like this (names are irrelevant):
>
> Well, kevent_get_events() _is_ much easier to use.  And actually, having
> only that interface it is possible to implement a ring buffer with any
> kind of protocol for controlling it - userspace can have a wrapper
> which will call kevent_get_events() with a pointer to the place in the
> shared ring buffer where new events should be placed; that wrapper can
> handle essentially any kind of flags/parameters which are suitable
> for that ring buffer implementation.

That's far too slow.  The whole point behind the ring buffer is speed.
And emulation would defeat the purpose.

> But since we started to implement the ring buffer as an additional
> feature of kevent, let's find a way all people will be happy with before
> removing something which was proven to work correctly.
The get_event interface is basically the userlevel interface the runtime
(glibc probably) would provide.  Programmers don't see the complexity.

I'm concerned about the get_event interface holding the kernel
implementation back.  For instance, automatically filling the ring
buffer.  This would not be possible if the program is free to mix
kevent_get_event and kevent_wait calls freely.  If you do away with the
get_event syscall the automatic ring buffer filling is possible and a
logical extension.

> The last three are exactly kevent_get_events() with a different set of
> parameters - it is possible to get events without sleeping, it is
> possible to wait until at least something is ready and it is possible to
> sleep for a timeout.

Exactly.  But these interfaces should be implemented at userlevel, not
at the syscall level.  It's not necessary.  The kernel interface should
be kept as small as possible and the get_event syscall is pure
duplication.

> They are all already implemented.  Just all of the above, and it was done
> several months ago already.  No need to reinvent what is already there.
> Even if we decide to remove kevent_get_events() in favour of a ring
> buffer-only implementation, the waiting-for-event syscall will be
> essentially kevent_get_events() without a pointer to the place where to
> put events.

Right, but this limitation of the interface is important.  It means the
interface of the kernel is smaller: fewer possibilities for problems and
fewer constraints if in future something should be changed (and smaller
kernel).

> I agree that having a special syscall to initialize kevent is a good
> idea, and the initial kevent implementation had it, but it was removed
> due to API cleanup work by Christoph Hellwig.

Well, he is wrong.  If, for instance, init or any of the programs which
start first wants to use the syscall it couldn't because /dev isn't
mounted.  The program might use libraries and therefore not have any
influence on whether the kevent stuff is used or not.

Yes, the /dev interface is useful for some/many other kernel interfaces.
But this is a core interface.  For the same reason epoll_create is a
syscall.

> Do you have _any_ kind of benchmarks with epoll() which would show that
> it is feasible?  A ukevent is one cache line (well, 2 cache lines on old
> CPUs), which can be set up way too far away from the time when it becomes
> ready, and the CPU which originally set it up can be busy, so we will lose
> performance waiting until that CPU becomes free instead of running another
> thread on a different CPU.

If the period between the generation of the event (e.g., incoming
network traffic or sent data) and the delivery of the event by waking a
thread is too long, it does not make much sense.  But if the L2 cache
hasn't been flushed it might be a big advantage.

I think it's reasonable to only have the last queued entry for a CPU
handled specially.  And note, this is only ever a hint.  If an event
entry was created by the kernel on one CPU but none of the threads which
wait to be woken is on that CPU, nothing has to be done.

No, I don't have a benchmark.  But it is likely quite easily possible to
create a synthetic benchmark.  Maybe with pipes.

> It is possible to specify a CPU id in the kevent (not in the ukevent,
> i.e. not in the structure shared with userspace, but in its kernel
> representation), and then check whether the currently active CPU is the
> same or not, but what if it is not the same CPU?

Nothing special.  It's up to the userlevel wrapper code.  The CPU number
would only be a hint.
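[Aside: the synthetic pipe benchmark suggested above is easy to improvise with existing interfaces. The sketch below measures average wakeup-to-read latency through epoll; pinning the writer and the waiter to the same or to different CPUs with sched_setaffinity() (omitted for brevity) would give the same-CPU vs. cross-CPU comparison the argument needs.]

#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/epoll.h>

int main(void)
{
	int p[2], ep, i;
	struct epoll_event ev = { .events = EPOLLIN };
	long long total = 0;
	const int iters = 10000;

	if (pipe(p) < 0 || (ep = epoll_create(1)) < 0)
		return 1;
	epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);

	if (fork() == 0) {			/* writer: timestamps events */
		struct timespec ts;
		for (i = 0; i < iters; i++) {
			clock_gettime(CLOCK_MONOTONIC, &ts);
			write(p[1], &ts, sizeof(ts));
			usleep(100);
		}
		_exit(0);
	}

	for (i = 0; i < iters; i++) {		/* waiter: sleeps in epoll */
		struct timespec sent, now;
		epoll_wait(ep, &ev, 1, -1);
		read(p[0], &sent, sizeof(sent));
		clock_gettime(CLOCK_MONOTONIC, &now);
		total += (now.tv_sec - sent.tv_sec) * 1000000000LL +
			 (now.tv_nsec - sent.tv_nsec);
	}
	printf("avg wakeup latency: %lld ns\n", total / iters);
	return 0;
}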
> Entry order is important, since applications can take advantage of
> synchronization, so the idea of skipping some entries is bad.

That's something the application should make a call about.  It's not
always (or even mostly) the case that the ordering of the notification
is important.  Furthermore, this would also require the kernel to
enforce an ordering.  This is expensive on SMP machines.  A locally
generated event (i.e., source and the thread reporting the event) can be
delivered faster than an event created on another CPU.

> It is a management task - the kernel should not even know that someone
> has died and can not process the events it requested.

But the kernel has to be involved.

> Userspace can open a control pipe (and set up a kevent handler for it)
> and glibc will write a byte there, thus awakening some other thread.
> It can be done in userspace and should be done in userspace.

That's invasive.  The problem is that no userlevel interface should have
to implicitly keep file descriptors open.  This would mean the
application would be influenced since suddenly a file descriptor is not
available anymore.  Yes, applications shouldn't care but they
unfortunately sometimes do.

> Will we discuss it to death?
>
> Kevent does not need an absolute timeout.

Of course it does.  Just because you don't see a need for it for your
applications right now it doesn't mean it's not a valid use.

> Because the timeout specified there is always relative to the start of
> the syscall - it is a timeout which specifies the maximum time frame the
> syscall can live.

That's your current implementation.  There is absolutely no reason
whatsoever why this couldn't be changed.

> I created kevent_signal notifications - they allow the user to set up
> any set of signals of interest before the call to kevent_get_events()
> and friends.
>
> No need to solve a problem the tactical way when there is a strategic
> one

Of course there is a need and I explained it before.  Getting signal
notifications is in no way the same as changing the signal mask
temporarily.  You cannot correctly emulate the case where you want to
block a signal while in the call and reenable it afterwards.  Receiving
the signal as an event and then artificially raising it is not the same.
Especially timing-wise, the signal kevent might not be seen until long
after the syscall returns because other entries are worked on first.

The opposite case is equally impossible to emulate: unblocking a signal
just for the duration of the syscall.  These are all possible and used
cases.

>> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
>>   would reverse the default.  I cannot see many places where you want
>>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.
>
> I.e. always wake up only the first thread and, in addition, those threads
> which have the specified flag set?  Ok, I will put it into the todo list
> for the next release.

It's a flag for an event.  So the threads won't have the flag set.  If
an event is delivered with the flag set, wake all threads.  Otherwise
just one.

>> - there is really no reason to invent yet another timer implementation.
>>   We have the POSIX timers which are feature rich and nicely
>>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
>>   notification mechanism.  The timer is registered as part of the
>>   timer_create() syscall.
>
> Feel free to add any interface you like - it is as simple as a call to
> kevent_user_add_ukevent() in userspace.

No, that's not what I mean.  There is no need for the special
timer-related part of your patch.
Instead the existing POSIX timer
syscalls should be modified to handle SIGEV_KEVENT notification.  Again,
keep the interface as small as possible.  Plus, the POSIX timer
interface is very flexible.  You don't want to duplicate all that
functionality.

> And I almost silently stand behind the fact that it is possible to
> implement _all_ of the above ring buffer things in userspace with
> kevent_get_events(), and this functionality has been there for almost a
> year :)

Again, this defeats the purpose completely.  The ring buffer is the
faster interface, especially when coupled with asynchronous filling of
the ring buffer (i.e., without a syscall).

> Let's solve problems in order of their appearance - what do you think
> about the above interface for the ring buffer?

Looks better, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 200+ messages in thread
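[Aside: what the SIGEV_KEVENT suggestion amounts to for applications, assuming the notify type were wired into the existing POSIX timer syscalls. SIGEV_KEVENT does not exist in Linux (FreeBSD has an analogous one for kqueue), so the constant and the reuse of sigev_signo to carry the kevent queue descriptor are inventions for illustration. Everything else is the stock timer_create()/timer_settime() API - which is the point: no new timer syscalls are needed.]

#include <string.h>
#include <signal.h>
#include <time.h>

#define SIGEV_KEVENT	4		/* hypothetical, not a real value */

static int arm_kevent_timer(int kevent_fd, timer_t *t)
{
	struct sigevent sev;
	struct itimerspec its = {
		.it_value    = { .tv_sec = 1 },	/* first expiry in 1s */
		.it_interval = { .tv_sec = 1 },	/* then every second  */
	};

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_KEVENT;	/* deliver via kevent queue    */
	sev.sigev_signo = kevent_fd;		/* fd smuggled here (made up)  */

	if (timer_create(CLOCK_MONOTONIC, &sev, t) < 0)
		return -1;
	return timer_settime(*t, 0, &its, NULL);
}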
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  0:02 ` Ulrich Drepper
@ 2006-11-20  8:25 ` Evgeniy Polyakov
  2006-11-20  8:43   ` Andrew Morton
  2006-11-20 20:29   ` Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 8:25 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >> Possible solutions:
> >>
> >> a) it would be possible to have a "used" flag in each ring buffer entry.
> >>    That's too expensive, I guess.
> >>
> >> b) kevent_wait needs another parameter which specifies which is the
> >>    last (i.e., least recently added) entry in the ring buffer.
> >>    Everything between this entry and the current head (in ->kidx) is
> >>    occupied.  If multiple threads arrive in kevent_wait the highest idx
> >>    (with wrap around possibly lowest) is used.
> >>
> >>    kevent_wait will not try to move more entries into the ring buffer
> >>    if ->kidx and the highest index passed in to any kevent_wait call
> >>    are equal (i.e., the ring buffer is full).
> >>
> >>    There is one issue, though, and that is that a system call is needed
> >>    to signal to the kernel that more entries in the ring buffer are
> >>    processed and that they can be refilled.  This goes against the
> >>    kernel filling the ring buffer automatically (see below)
> >
> > If a thread calls kevent_wait() it means it has processed the previous
> > entries; one can call kevent_wait() with the $num parameter as zero, which
> > means that the thread does not want any new events, so nothing will be
> > copied.
>
> This doesn't solve the problem.  You could only request new events when
> all previously reported events are processed.  Plus: how do you report
> events if you don't allow get_event to pass them on?

Userspace should itself maintain ordering and the ability to get events
in this implementation; the kernel just returns the events which were
requested.

> > A writable ring buffer does not sound too good to me - what if one thread
> > overwrites the whole ring buffer so that the kernel's indexes get screwed
> > up?
>
> Agreed, there are problems.  This is why I suggested the ring buffer can
> be structured.  Parts of it might be read-only, other parts
> read/write.  I don't necessarily think the 'used' flag is the right way.
> And the front/tail pointer solution seems to be better.
>
> > A ring buffer processed not in FIFO order is a wrong idea
>
> Not necessarily, see my comments about CPU affinity in the previous mail.
>
> > - the ring buffer
> > can potentially be very big, and searching there for an entry which has
> > been marked as 'free' by userspace is not a solution at all - userspace
> > in that case must provide the ukevent so a fast tree search could be used,
> > and (although it is already possible) it requires userspace to make
> > additional syscalls, which is not what we want.
>
> It is not necessary.  I've proposed to only have a front and tail
> pointer.  The tail pointer is maintained by the application and passed
> to the kernel explicitly or via shared memory.  The kernel maintains the
> front pointer.  No tree needed.

There was such an implementation (in a previous patchset) - since no one
commented, I changed it.

> > As a solution I can create the following scheme:
> > there are two syscalls (or one with a switch) which get events and
> > commit them.
> >
> > kevent_wait() becomes a syscall which waits until a number of events or
> > one of them becomes ready and just copies them into the ring buffer and
> > returns.  kevent_wait() will fail with a special error code when the ring
> > buffer is full.
> >
> > kevent_commit() frees the requested number of events _from the beginning_,
> > i.e. from a special index, visible from userspace.  Userspace can create
> > special counters for events (and even put them into the read-only ring
> > buffer overwriting some fields of the kevent, especially if we will
> > increase its size) and only call kevent_commit() when all events have a
> > zero usage counter.
>
> Right, that's basically the front/tail pointer implementation.  That
> would work.  You just have to make sure that the kevent_wait() call
> takes the current front pointer/index as a parameter.  This way if the
> buffer gets filled between the thread checking the ring buffer (and
> finding it empty) and the syscall being handled the thread is not
> suspended.

That is exactly how the previous ring buffer (in a mapped area though)
was implemented.
I think I need to quickly set up my slightly used (bought on ebay) but
still working mind reader; I will try to tune it to work with your brain
waves so next time I would not spend weeks changing something which could
be reused, while others keep silent :)

> > I disagree that the possibility of having holes in the ring buffer is a
> > good idea at all - it requires a much more complex protocol, which will
> > fill and reuse those holes, and the main disadvantage - it requires
> > transferring much more information from userspace to kernelspace to free
> > the ring entry in the hole - in that case it is already possible just to
> > call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
> > approach at all.
>
> Well, it would require more data transport if we'd use writable shared
> memory.  But I agree, it's far too complicated and might not scale with
> growing ring buffer sizes.
>
> >> - implementing the kevent_wait syscall the proposed way means we are
> >>   missing out on one possible optimization.  The ring buffer is
> >>   currently only filled on kevent_wait calls.  I expect that in really
> >>   high traffic situations requests are coming in at a higher rate than
> >>   they can be processed.  At least for periods of time.  In such
> >>   situations it would be nice to not have to call into the kernel at
> >>   all.  If the kernel would deliver into the ring buffer on its own
> >>   this would be possible.
> >
> > Well, it can be done on behalf of a workqueue or a dedicated thread which
> > will bring up the appropriate mm context,
>
> I think it should be done.  It's potentially a huge advantage.
>
> > although it means that userspace
> > can not handle the load it requested, which is a bad sign...
>
> I don't understand.  What is not supposed to work?  There is nothing
> which cannot work with automatic posting since the get_event() call does
> nothing but copy the event data over and wake a thread.

If userspace is too slow to get events, the dedicated thread or workqueue
will be busy doing unneeded work, although it can help smooth out peaks
in the load.

> >> - the kevent_get_event syscall is not needed at all.  All reporting
> >>   should be done using a ring buffer.  There really is no reason to
> >>   keep two interfaces around which serve the same purpose.  Making
> >>   the argument that kevent_get_event is so much easier to use is not
> >>   valid.  The exposed interface to access the ring buffer will be easy,
> >>   too.  In the OLS paper I more or less hinted at the interfaces.  I
> >>   think they should be like this (names are irrelevant):
> >
> > Well, kevent_get_events() _is_ much easier to use.  And actually, having
> > only that interface it is possible to implement a ring buffer with any
> > kind of protocol for controlling it - userspace can have a wrapper
> > which will call kevent_get_events() with a pointer to the place in the
> > shared ring buffer where new events should be placed; that wrapper can
> > handle essentially any kind of flags/parameters which are suitable
> > for that ring buffer implementation.
>
> That's far too slow.  The whole point behind the ring buffer is speed.
> And emulation would defeat the purpose.

It was an example; I do not say a ring buffer maintained in kernelspace
is a bad idea.
Actually it is possible to create several threads which will only read
events into the buffer, to be processed by some pool of 'working'
threads (a sketch follows below).  There are a lot of possibilities to
work with only one syscall and create a scalable system.

> > But since we started to implement the ring buffer as an additional
> > feature of kevent, let's find a way all people will be happy with before
> > removing something which was proven to work correctly.
>
> The get_event interface is basically the userlevel interface the runtime
> (glibc probably) would provide.  Programmers don't see the complexity.
>
> I'm concerned about the get_event interface holding the kernel
> implementation back.  For instance, automatically filling the ring
> buffer.  This would not be possible if the program is free to mix
> kevent_get_event and kevent_wait calls freely.  If you do away with the
> get_event syscall the automatic ring buffer filling is possible and a
> logical extension.

Yes, that is why only one should be used.  If there are several threads,
then the ring buffer implementation should be used, otherwise just
kevent_get_events().
In theory yes, an access library like glibc can provide a
kevent_get_events() which will read events from the ring buffer, but
there is no such call right now, so the kernel's kevent_get_events()
looks reasonable.

> > The last three are exactly kevent_get_events() with a different set of
> > parameters - it is possible to get events without sleeping, it is
> > possible to wait until at least something is ready and it is possible to
> > sleep for a timeout.
>
> Exactly.  But these interfaces should be implemented at userlevel, not
> at the syscall level.  It's not necessary.  The kernel interface should
> be kept as small as possible and the get_event syscall is pure
> duplication.

I would say that the ring-buffer manipulating syscalls are the
duplication, but it is just a matter of view :)

> > They are all already implemented.  Just all of the above, and it was done
> > several months ago already.  No need to reinvent what is already there.
> > Even if we decide to remove kevent_get_events() in favour of a ring
> > buffer-only implementation, the waiting-for-event syscall will be
> > essentially kevent_get_events() without a pointer to the place where to
> > put events.
>
> Right, but this limitation of the interface is important.  It means the
> interface of the kernel is smaller: fewer possibilities for problems and
> fewer constraints if in future something should be changed (and smaller
> kernel).

Ok, let's see the ring buffer implementation right now, and then we will
decide whether we want to remove kevent_get_events() or stay with it.
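[Aside: the 'dedicated reader thread plus pool of workers' arrangement mentioned above is ordinary producer/consumer plumbing. In this sketch the kevent_get_events() signature is assumed from the patchset documentation, and double buffering is reduced to a single shared batch for brevity.]

#include <pthread.h>

struct ukevent { unsigned char raw[40]; };	/* placeholder entry format */

extern void handle_event(const struct ukevent *ev);
/* Assumed signature, per the patchset documentation. */
extern int kevent_get_events(int fd, unsigned min_nr, unsigned max_nr,
			     unsigned long long timeout_ns,
			     struct ukevent *buf, unsigned flags);

#define BATCH 256
static struct ukevent batch[BATCH];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int avail;			/* entries workers have not taken yet */

void *reader(void *arg)			/* the one syscall-making thread */
{
	int fd = *(int *)arg;

	for (;;) {
		/* big batches amortize the syscall cost, as argued above */
		int n = kevent_get_events(fd, 1, BATCH, 1000000000ULL,
					  batch, 0);
		if (n <= 0)
			continue;
		pthread_mutex_lock(&lock);
		avail = n;
		pthread_cond_broadcast(&more);
		while (avail > 0)	/* don't refill until batch is drained */
			pthread_cond_wait(&drained, &lock);
		pthread_mutex_unlock(&lock);
	}
}

void *worker(void *arg)			/* one thread of the processing pool */
{
	(void)arg;
	for (;;) {
		struct ukevent ev;

		pthread_mutex_lock(&lock);
		while (avail == 0)
			pthread_cond_wait(&more, &lock);
		ev = batch[--avail];
		if (avail == 0)
			pthread_cond_signal(&drained);
		pthread_mutex_unlock(&lock);
		handle_event(&ev);
	}
}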
> > I agree that having a special syscall to initialize kevent is a good
> > idea, and the initial kevent implementation had it, but it was removed
> > due to API cleanup work by Christoph Hellwig.
>
> Well, he is wrong.  If, for instance, init or any of the programs which
> start first wants to use the syscall it couldn't because /dev isn't
> mounted.  The program might use libraries and therefore not have any
> influence on whether the kevent stuff is used or not.
>
> Yes, the /dev interface is useful for some/many other kernel interfaces.
> But this is a core interface.  For the same reason epoll_create is a
> syscall.

Ok, I will create an initialization syscall.

> > Do you have _any_ kind of benchmarks with epoll() which would show that
> > it is feasible?  A ukevent is one cache line (well, 2 cache lines on old
> > CPUs), which can be set up way too far away from the time when it becomes
> > ready, and the CPU which originally set it up can be busy, so we will lose
> > performance waiting until that CPU becomes free instead of running another
> > thread on a different CPU.
>
> If the period between the generation of the event (e.g., incoming
> network traffic or sent data) and the delivery of the event by waking a
> thread is too long, it does not make much sense.  But if the L2 cache
> hasn't been flushed it might be a big advantage.
>
> I think it's reasonable to only have the last queued entry for a CPU
> handled specially.  And note, this is only ever a hint.  If an event
> entry was created by the kernel on one CPU but none of the threads which
> wait to be woken is on that CPU, nothing has to be done.
>
> No, I don't have a benchmark.  But it is likely quite easily possible to
> create a synthetic benchmark.  Maybe with pipes.
>
> > It is possible to specify a CPU id in the kevent (not in the ukevent,
> > i.e. not in the structure shared with userspace, but in its kernel
> > representation), and then check whether the currently active CPU is the
> > same or not, but what if it is not the same CPU?
>
> Nothing special.  It's up to the userlevel wrapper code.  The CPU number
> would only be a hint.
>
> > Entry order is important, since applications can take advantage of
> > synchronization, so the idea of skipping some entries is bad.
>
> That's something the application should make a call about.  It's not
> always (or even mostly) the case that the ordering of the notification
> is important.  Furthermore, this would also require the kernel to
> enforce an ordering.  This is expensive on SMP machines.  A locally
> generated event (i.e., source and the thread reporting the event) can be
> delivered faster than an event created on another CPU.

How come?  If a signal was delivered before data arrived, userspace
should get the signal before the data - that is the rule.  Ordering is
maintained not for event insertion, but for marking events ready - it is
atomic, so whichever event is marked ready first will be read first from
the ready queue.

> > It is a management task - the kernel should not even know that someone
> > has died and can not process the events it requested.
>
> But the kernel has to be involved.
>
> > Userspace can open a control pipe (and set up a kevent handler for it)
> > and glibc will write a byte there, thus awakening some other thread.
> > It can be done in userspace and should be done in userspace.
>
> That's invasive.  The problem is that no userlevel interface should have
> to implicitly keep file descriptors open.  This would mean the
> application would be influenced since suddenly a file descriptor is not
> available anymore.
> Yes, applications shouldn't care but they
> unfortunately sometimes do.

Then I propose userspace notifications - each new thread can register
'wake me up when userspace event 1 is ready' and 'event 1' will be marked
as ready by glibc when it removes the thread.

> > Will we discuss it to death?
> >
> > Kevent does not need an absolute timeout.
>
> Of course it does.  Just because you don't see a need for it for your
> applications right now it doesn't mean it's not a valid use.

Please explain why glibc AIO uses relative timeouts then :)

> > Because the timeout specified there is always relative to the start of
> > the syscall - it is a timeout which specifies the maximum time frame the
> > syscall can live.
>
> That's your current implementation.  There is absolutely no reason
> whatsoever why this couldn't be changed.

It has nothing to do with the implementation - it is logic.  Something
starts and has a maximum lifetime; it is not that something starts and
should be stopped on Jan 1, 2008.  In the latter case one can set up a
timer, but it does not allow specifying a maximum lifetime.
If the glibc POSIX sleeping functions convert relative AIO timeouts into
absolute ones, it does not mean everything should do it.  It is just not
needed.

> > I created kevent_signal notifications - they allow the user to set up
> > any set of signals of interest before the call to kevent_get_events()
> > and friends.
> >
> > No need to solve a problem the tactical way when there is a strategic
> > one
>
> Of course there is a need and I explained it before.  Getting signal
> notifications is in no way the same as changing the signal mask
> temporarily.  You cannot correctly emulate the case where you want to
> block a signal while in the call and reenable it afterwards.  Receiving
> the signal as an event and then artificially raising it is not the same.
> Especially timing-wise, the signal kevent might not be seen until long
> after the syscall returns because other entries are worked on first.
>
> The opposite case is equally impossible to emulate: unblocking a signal
> just for the duration of the syscall.  These are all possible and used
> cases.

Add and remove the appropriate kevent - it is as simple as a call to one
function.

> >> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
> >>   would reverse the default.  I cannot see many places where you want
> >>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.
> >
> > I.e. always wake up only the first thread and, in addition, those threads
> > which have the specified flag set?  Ok, I will put it into the todo list
> > for the next release.
>
> It's a flag for an event.  So the threads won't have the flag set.  If
> an event is delivered with the flag set, wake all threads.  Otherwise
> just one.

Ok.

> >> - there is really no reason to invent yet another timer implementation.
> >>   We have the POSIX timers which are feature rich and nicely
> >>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
> >>   notification mechanism.  The timer is registered as part of the
> >>   timer_create() syscall.
> >
> > Feel free to add any interface you like - it is as simple as a call to
> > kevent_user_add_ukevent() in userspace.
>
> No, that's not what I mean.  There is no need for the special
> timer-related part of your patch.  Instead the existing POSIX timer
> syscalls should be modified to handle SIGEV_KEVENT notification.  Again,
> keep the interface as small as possible.  Plus, the POSIX timer
> interface is very flexible.  You don't want to duplicate all that
> functionality.
The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks - I have not invented new interfaces, just reused the existing generic kevent facilities. It is possible to add timer events from any other place.

> >And I almost silently stand behind the fact that it is possible to implement _all_ of the above ring buffer things in userspace with kevent_get_events(), and this functionality has been there for almost a year :)
>
> Again, this defeats the purpose completely. The ring buffer is the faster interface, especially when coupled with asynchronous filling of the ring buffer (i.e., without a syscall).

It is still possible to have a very scalable system with it, for example with one thread dedicated to syscall reading (with a big number of events transferred in one shot, the syscall overhead becomes negligible) and a pool of working threads. It is not about 'let's remove kernelspace ring buffer management', but about the possibilities and flexibility of the existing model.

> >Let's solve problems in order of their appearance - what do you think about the above interface for the ring buffer?
>
> Looks better, yes.

Ok, I will implement this new (old) ring buffer and present it in the next release. I will also schedule userspace notifications, the 'wake-up-one-thread' flag changes and other small updates for it.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
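A minimal sketch, in C, of the dispatcher-plus-worker-pool scheme described above. The kevent_get_events() prototype, the struct ukevent layout and the enqueue_for_workers() helper below are assumptions for illustration only; the real definitions live in the kevent patches.

    #include <pthread.h>
    #include <stdint.h>

    struct ukevent { char payload[40]; };          /* placeholder; real layout is in the patch */
    void enqueue_for_workers(struct ukevent *ev);  /* hypothetical userspace work queue */

    /* assumed wrapper around the kevent syscall; the real prototype may differ */
    int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
                          uint64_t timeout_ns, struct ukevent *buf, unsigned int flags);

    #define BATCH 256

    /* one dedicated reader thread (started via pthread_create()): a single
     * syscall moves up to BATCH events, so the per-event syscall overhead
     * becomes negligible; workers drain the userspace queue without ever
     * entering the kernel themselves */
    static void *dispatcher(void *arg)
    {
        int kfd = *(int *)arg;
        static struct ukevent batch[BATCH];

        for (;;) {
            int n = kevent_get_events(kfd, 1, BATCH, ~0ULL, batch, 0);
            for (int i = 0; i < n; i++)
                enqueue_for_workers(&batch[i]);
        }
        return NULL;
    }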
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:25 ` Evgeniy Polyakov
@ 2006-11-20  8:43 ` Andrew Morton
  2006-11-20  8:51 ` Evgeniy Polyakov
  2006-11-20 20:29 ` Ulrich Drepper
  1 sibling, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-20 8:43 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, 20 Nov 2006 11:25:01 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> > Evgeniy Polyakov wrote:
> > >>Possible solution:
> > >>
> > >>a) it would be possible to have a "used" flag in each ring buffer entry. That's too expensive, I guess.
> > >>
> > >>b) kevent_wait needs another parameter which specifies which is the last (i.e., least recently added) entry in the ring buffer. Everything between this entry and the current head (in ->kidx) is occupied. If multiple threads arrive in kevent_wait, the highest idx (with wraparound, possibly the lowest) is used.
> > >>
> > >> kevent_wait will not try to move more entries into the ring buffer if ->kidx and the highest index passed in to any kevent_wait call are equal (i.e., the ring buffer is full).
> > >>
> > >> There is one issue, though, and that is that a system call is needed to signal to the kernel that more entries in the ring buffer are processed and that they can be refilled. This goes against the kernel filling the ring buffer automatically (see below)
> > >
> > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> >
> > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
>
> Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.

That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.

> > >They are all already implemented. Just all of the above, and it was done several months ago already. No need to reinvent what is already there. Even if we decide to remove kevent_get_events() in favour of a ring-buffer-only implementation, the waiting-for-event syscall will be essentially kevent_get_events() without a pointer to the place where to put events.
> >
> > Right, but this limitation of the interface is important. It means the interface of the kernel is smaller: fewer possibilities for problems and fewer constraints if in future something should be changed (and a smaller kernel).
>
> Ok, let's look at the ring buffer implementation right now, and then we will decide whether we want to remove or stay with the kevent_get_events() syscall.

I agree that kevent_get_events() is duplicative and we shouldn't need it. Better to concentrate all our development effort on the single and most flexible means of delivery.

^ permalink raw reply [flat|nested] 200+ messages in thread
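For concreteness, the commit-style consumption in proposal b) might look like the following userlevel loop. The ring header, the index names and the kevent_wait() prototype here are assumptions pieced together from the thread, not the patch's actual definitions.

    #include <stdint.h>

    struct ukevent { char payload[40]; };      /* placeholder; real layout in the patch */

    /* assumed ring header: the kernel advances kidx as it produces events */
    struct kevent_ring {
        unsigned int kidx;
        struct ukevent event[];
    };

    /* assumed prototype: commit @num processed entries starting at @start,
     * then wait for more events or a timeout */
    int kevent_wait(int fd, unsigned int start, unsigned int num, uint64_t timeout_ns);

    void process_event(struct ukevent *ev);    /* application code, not shown */

    static void consume(int kfd, struct kevent_ring *ring, unsigned int ring_size,
                        unsigned int *uidx /* userspace consumer index */)
    {
        unsigned int kidx = ring->kidx;        /* snapshot the producer index */
        unsigned int start = *uidx, num = 0;

        while (*uidx != kidx) {                /* everything in [uidx, kidx) is occupied */
            process_event(&ring->event[*uidx]);
            *uidx = (*uidx + 1) % ring_size;
            num++;
        }
        /* tell the kernel these slots may be refilled, and block for more */
        kevent_wait(kfd, start, num, 1000000000ULL);
    }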
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:43 ` Andrew Morton
@ 2006-11-20  8:51 ` Evgeniy Polyakov
  2006-11-20  9:15 ` Andrew Morton
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 8:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > >
> > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> >
> > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
>
> That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.

Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.

> > > >They are all already implemented. Just all of the above, and it was done several months ago already. No need to reinvent what is already there. Even if we decide to remove kevent_get_events() in favour of a ring-buffer-only implementation, the waiting-for-event syscall will be essentially kevent_get_events() without a pointer to the place where to put events.
> > >
> > > Right, but this limitation of the interface is important. It means the interface of the kernel is smaller: fewer possibilities for problems and fewer constraints if in future something should be changed (and a smaller kernel).
> >
> > Ok, let's look at the ring buffer implementation right now, and then we will decide whether we want to remove or stay with the kevent_get_events() syscall.
>
> I agree that kevent_get_events() is duplicative and we shouldn't need it. Better to concentrate all our development effort on the single and most flexible means of delivery.

Let's wait for the ring buffer implementation first :)

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:51 ` Evgeniy Polyakov
@ 2006-11-20  9:15 ` Andrew Morton
  2006-11-20  9:19 ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-20 9:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, 20 Nov 2006 11:51:59 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > > >
> > > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> > >
> > > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
> >
> > That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.
>
> Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.

eh? That's not a ringbuffer, and it sounds awfully complex.

I don't know if this (new?) proposal resolves the events-get-lost-due-to-thread-cancellation problem? Would need to see considerably more detail.

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  9:15 ` Andrew Morton
@ 2006-11-20  9:19 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 9:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 01:15:16AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> On Mon, 20 Nov 2006 11:51:59 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > > > >
> > > > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> > > >
> > > > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
> > >
> > > That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.
> >
> > Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.
>
> eh? That's not a ringbuffer, and it sounds awfully complex.
>
> I don't know if this (new?) proposal resolves the events-get-lost-due-to-thread-cancellation problem? Would need to see considerably more detail.

It does - the event is copied into the shared buffer, but the place (or index in the ring buffer) is selected by userspace (a wrapper, glibc, anything). It is simple and (from my point of view) elegant, but it will not be used - I surrender and will implement kernelspace ring buffer management right now. I just said that it is possible to implement any kind of ring buffer in userspace with the old kevent_get_events() syscall only.

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:25 ` Evgeniy Polyakov
  2006-11-20  8:43 ` Andrew Morton
@ 2006-11-20 20:29 ` Ulrich Drepper
  2006-11-20 21:46 ` Jeff Garzik
  2006-11-21  9:53 ` Evgeniy Polyakov
  1 sibling, 2 replies; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 20:29 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
> It is exactly how the previous ring buffer (in a mapped area though) was implemented.

Not any of those I saw. The one I looked at always started again at index 0 to fill the ring buffer. I'll wait for the next implementation.

>> That's something the application should make the call about. It's not always (or even mostly) the case that the ordering of the notification is important. Furthermore, this would also require the kernel to enforce an ordering. This is expensive on SMP machines. A locally generated event (i.e., the source and the thread reporting the event) can be delivered faster than an event created on another CPU.
>
> How come? If a signal was delivered before data arrived, userspace should get the signal before the data - that is the rule. Ordering is maintained not for event insertion but for marking events ready - it is atomic, so whichever event is marked ready first will be read first from the ready queue.

This is as far as the kernel is concerned. Queue them in the order they arrive.

I'm talking about the userlevel side. *If* (and it needs to be verified that this has an advantage) a CPU creates an event, e.g., a read event, then a number of threads could be notified about the event. When the kernel has to wake up a thread it'll look whether any thread is scheduled on the same CPU which generated the event. Then the thread, upon waking up, can be told about the entry in the ring buffer which is best accessed first (due to caching). This entry need not be the first available in the ring buffer, but that's a problem the userlevel code has to worry about.

> Then I propose userspace notifications - each new thread can register 'wake me up when userspace event 1 is ready' and 'event 1' will be marked as ready by glibc when it removes the thread.

You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.

>> Of course it does. Just because you don't see a need for it for your applications right now it doesn't mean it's not a valid use.
>
> Please explain why glibc AIO uses relative timeouts then :)

You are still completely focused on AIO. We are talking here about new generic event handling. It is not tied to AIO. We will add all kinds of events, e.g., hopefully futex support and many others. And even for AIO it's relevant.

As I said, relative timeouts are unable to cope with settimeofday calls or ntp adjustments. AIO is certainly usable in situations where timeouts are related to wall clock time.

> It has nothing to do with the implementation - it is logic. Something starts and it has its maximum lifetime; it is not that something starts and should be stopped on Jan 1, 2008.

It is an implementation detail. Look at the PI futex support.
It has timeouts which can be cut short (or increased) due to wall clock changes.

>> The opposite case is equally impossible to emulate: unblocking a signal just for the duration of the syscall. These are all possible and used cases.
>
> Add and remove the appropriate kevent - it is as simple as a call to one function.

No, it's not. The kevent stuff handles only the kevent handler (i.e., the replacement for calling the signal handler). It cannot set signal masks. I am talking about signal masks here. And don't suggest "I can add another kevent feature where I can register signal masks". This would be ridiculous since it's not an event source. Just add the parameter and every base is covered and, at least equally important, we have symmetry between the event handling interfaces.

>>- there is really no reason to invent yet another timer implementation. We have the POSIX timers which are feature rich and nicely implemented. All that is needed is to implement SIGEV_KEVENT as a notification mechanism. The timer is registered as part of the timer_create() syscall.
>
> The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks

New multiplexer cases are additional syscalls. This is unnecessary code, an increased kernel interface and such. We have the POSIX timer interfaces which are feature-rich and standardized *and* can be trivially extended (at least from the userlevel interface POV) to use event queues. If you don't want to do this, fine, I'll try to get it made. But drop the timer part of your patches.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-20 20:29 ` Ulrich Drepper @ 2006-11-20 21:46 ` Jeff Garzik 2006-11-20 21:52 ` Ulrich Drepper 2006-11-21 9:53 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-20 21:46 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Ulrich Drepper wrote: > Evgeniy Polyakov wrote: >> It is exactly how previous ring buffer (in mapped area though) was >> implemented. > > Not any of those I saw. The one I looked at always started again at > index 0 to fill the ring buffer. I'll wait for the next implementation. I like the two-pointer ring buffer approach, one pointer for the consumer and one for the producer. > You don't want to have a channel like this. The userlevel code doesn't > know which threads are waiting in the kernel on the event queue. And it Agreed. > You are still completely focused on AIO. We are talking here about a > new generic event handling. It is not tied to AIO. We will add all Agreed. > As I said, relative timeouts are unable to cope with settimeofday calls > or ntp adjustments. AIO is certainly usable in situations where > timeouts are related to wall clock time. I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
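The two-pointer scheme Jeff mentions is the classic producer/consumer ring. A generic sketch in C (not kevent code, and without the memory barriers a real SMP implementation would need):

    struct ukevent { char payload[40]; };    /* stand-in for the event payload */

    /* head advances on produce, tail on consume; with size a power of two,
     * the ring is empty when head == tail and full when head - tail == size
     * (unsigned wraparound keeps the difference correct) */
    struct ring {
        unsigned int head;                   /* producer index */
        unsigned int tail;                   /* consumer index */
        unsigned int size;                   /* power of two */
        struct ukevent *ev;
    };

    static int ring_put(struct ring *r, const struct ukevent *e)
    {
        if (r->head - r->tail == r->size)
            return -1;                       /* full */
        r->ev[r->head & (r->size - 1)] = *e;
        r->head++;
        return 0;
    }

    static int ring_get(struct ring *r, struct ukevent *e)
    {
        if (r->head == r->tail)
            return -1;                       /* empty */
        *e = r->ev[r->tail & (r->size - 1)];
        r->tail++;
        return 0;
    }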
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:46 ` Jeff Garzik
@ 2006-11-20 21:52 ` Ulrich Drepper
  2006-11-21  9:09 ` Ingo Oeser
  2006-11-22 11:38 ` Michael Tokarev
  0 siblings, 2 replies; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 21:52 UTC (permalink / raw)
To: Jeff Garzik
Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Jeff Garzik wrote:
> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.

I'm not talking about always using absolute timeouts.

I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply [flat|nested] 200+ messages in thread
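clock_nanosleep(2) already works exactly this way and shows the pattern: one struct timespec parameter, with a flag selecting absolute or relative interpretation.

    #include <time.h>

    static void wait_until_deadline(void)
    {
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;                /* absolute deadline: now + 5 seconds */

        /* TIMER_ABSTIME makes the timespec an absolute CLOCK_REALTIME time,
         * so a settimeofday() or NTP step mid-wait is honored; passing
         * flags == 0 instead treats the very same timespec as relative */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, NULL);
    }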
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:52 ` Ulrich Drepper
@ 2006-11-21  9:09 ` Ingo Oeser
  0 siblings, 0 replies; 200+ messages in thread
From: Ingo Oeser @ 2006-11-21 9:09 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Hi,

Ulrich Drepper wrote:
> Jeff Garzik wrote:
> > I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

I agree here. And while you are at it: Have it say "not before" vs. "not after".

<rant>
And if you call an "absolute timeout" an "alarm" or "deadline" everyone will agree that this is useful. Timeout means "I ran OUT of TIME to do it" and this is by definition relative to a starting point. A "deadline" is an absolute point in (wall) time where something has to be ready, and an "alarm" is an absolute point in (wall) time where something is triggered (e.g. a bell rings on your "ALARM clock"). I don't know who established that nonsense nomenclature about relative and absolute timeouts.
</rant>

Regards

Ingo Oeser

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:52 ` Ulrich Drepper
  2006-11-21  9:09 ` Ingo Oeser
@ 2006-11-22 11:38 ` Michael Tokarev
  2006-11-22 11:47 ` Evgeniy Polyakov
  2006-11-22 12:33 ` Jeff Garzik
  1 sibling, 2 replies; 200+ messages in thread
From: Michael Tokarev @ 2006-11-22 11:38 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Ulrich Drepper wrote:
> Jeff Garzik wrote:
>> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

Can't the argument be something like u64 instead of struct timespec, regardless of this discussion (relative vs absolute)?

Compare:

  void mysleep(int msec) {
      struct timeval tv;
      tv.tv_sec = msec / 1000;
      tv.tv_usec = (msec % 1000) * 1000;  /* tv_usec counts microseconds */
      select(0, 0, 0, 0, &tv);
  }

with

  void mysleep(int msec) {
      poll(0, 0, msec);  /* poll(2) takes the timeout in milliseconds directly */
  }

That is to say: struct time{spec,val,whatever} is more difficult to use than plain numbers. But yes... the existing struct timespec has the advantage of already existing. Oh well.

/mjt

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-22 11:38 ` Michael Tokarev
@ 2006-11-22 11:47 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-22 11:47 UTC (permalink / raw)
To: Michael Tokarev
Cc: Ulrich Drepper, Jeff Garzik, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

On Wed, Nov 22, 2006 at 02:38:50PM +0300, Michael Tokarev (mjt@tls.msk.ru) wrote:
> Ulrich Drepper wrote:
> > Jeff Garzik wrote:
> >> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
> >
> > I'm not talking about always using absolute timeouts.
> >
> > I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.
>
> Can't the argument be something like u64 instead of struct timespec, regardless of this discussion (relative vs absolute)?

It is right now :)

> /mjt

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 11:38 ` Michael Tokarev 2006-11-22 11:47 ` Evgeniy Polyakov @ 2006-11-22 12:33 ` Jeff Garzik 1 sibling, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-22 12:33 UTC (permalink / raw) To: Michael Tokarev Cc: Ulrich Drepper, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Michael Tokarev wrote: > Can't the argument be something like u64 instead of struct timespec, > regardless of this discussion (relative vs absolute)? Newer syscalls (ppoll, pselect) take struct timespec, which is a reasonable, modern form of the timeout argument... Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 20:29 ` Ulrich Drepper
  2006-11-20 21:46 ` Jeff Garzik
@ 2006-11-21  9:53 ` Evgeniy Polyakov
  2006-11-21 16:58 ` Ulrich Drepper
  1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 9:53 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 12:29:31PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >It is exactly how the previous ring buffer (in a mapped area though) was implemented.
>
> Not any of those I saw. The one I looked at always started again at index 0 to fill the ring buffer. I'll wait for the next implementation.

That is what I'm talking about - there are at least 4 (!) different ring buffer implementations; most of them were not even looked at. But the new version is ready; I will complete the testing stage and will release 'take25' later today. For those who like 'real-world benchmarks and so on' I created a patch for the latest stable lighttpd version and tested it with kevent.

> >>That's something the application should make the call about. It's not always (or even mostly) the case that the ordering of the notification is important. Furthermore, this would also require the kernel to enforce an ordering. This is expensive on SMP machines. A locally generated event (i.e., the source and the thread reporting the event) can be delivered faster than an event created on another CPU.
> >
> >How come? If a signal was delivered before data arrived, userspace should get the signal before the data - that is the rule. Ordering is maintained not for event insertion but for marking events ready - it is atomic, so whichever event is marked ready first will be read first from the ready queue.
>
> This is as far as the kernel is concerned. Queue them in the order they arrive.
>
> I'm talking about the userlevel side. *If* (and it needs to be verified that this has an advantage) a CPU creates an event, e.g., a read event, then a number of threads could be notified about the event. When the kernel has to wake up a thread it'll look whether any thread is scheduled on the same CPU which generated the event. Then the thread, upon waking up, can be told about the entry in the ring buffer which is best accessed first (due to caching). This entry need not be the first available in the ring buffer, but that's a problem the userlevel code has to worry about.

Ok, I've understood.

> >Then I propose userspace notifications - each new thread can register 'wake me up when userspace event 1 is ready' and 'event 1' will be marked as ready by glibc when it removes the thread.
>
> You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.

The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e.
when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.

How should it know which syscall should be interrupted when the special syscall is called?

> >>Of course it does. Just because you don't see a need for it for your applications right now it doesn't mean it's not a valid use.
> >
> >Please explain why glibc AIO uses relative timeouts then :)
>
> You are still completely focused on AIO. We are talking here about new generic event handling. It is not tied to AIO. We will add all kinds of events, e.g., hopefully futex support and many others. And even for AIO it's relevant.
>
> As I said, relative timeouts are unable to cope with settimeofday calls or ntp adjustments. AIO is certainly usable in situations where timeouts are related to wall clock time.

No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!

I still do not understand how you will use absolute timeout values there. Please explain.

> >It has nothing to do with the implementation - it is logic. Something starts and it has its maximum lifetime; it is not that something starts and should be stopped on Jan 1, 2008.
>
> It is an implementation detail. Look at the PI futex support. It has timeouts which can be cut short (or increased) due to wall clock changes.

futex_wait() uses relative timeouts:

  static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)

The kernel uses relative timeouts. Only special syscalls which work with absolute time have absolute timeouts (like settimeofday).

> >>The opposite case is equally impossible to emulate: unblocking a signal just for the duration of the syscall. These are all possible and used cases.
> >
> >Add and remove the appropriate kevent - it is as simple as a call to one function.
>
> No, it's not. The kevent stuff handles only the kevent handler (i.e., the replacement for calling the signal handler). It cannot set signal masks. I am talking about signal masks here. And don't suggest "I can add another kevent feature where I can register signal masks". This would be ridiculous since it's not an event source. Just add the parameter and every base is covered and, at least equally important, we have symmetry between the event handling interfaces.

We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.

You request to get events - and you get them. You request not to get events during a syscall - you remove the events.

Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement. What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?

> >>No, that's not what I mean. There is no need for the special timer-related part of your patch. Instead the existing POSIX timer syscalls should be modified to handle SIGEV_KEVENT notification. Again, keep the interface as small as possible. Plus, the POSIX timer interface is very flexible. You don't want to duplicate all that functionality.
> >The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks
>
> New multiplexer cases are additional syscalls. This is unnecessary code, an increased kernel interface and such. We have the POSIX timer interfaces which are feature-rich and standardized *and* can be trivially extended (at least from the userlevel interface POV) to use event queues. If you don't want to do this, fine, I'll try to get it made. But drop the timer part of your patches.

There are _no_ additional syscalls. I just introduced a new case for an event type. You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.

I did the work; one can create any interfaces (additional syscalls or anything else) on top of that. Because kevent was designed as a generic event handling mechanism, it is possible to work with all types of events using the same interface, which was created 10 months ago: kevent add, remove and so on... There is nothing special for timers there - it is a separate file which does _not_ have any interfaces accessible outside the kevent core (i.e. syscalls or exported symbols).

Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
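The callback trio Evgeniy describes has roughly this shape in C; the structure and member names below are modelled on his description and may not match the patch exactly.

    struct kevent;                            /* kernel-side event, defined in the patch */

    struct kevent_callbacks {
        /* attach the kevent to its origin object (socket, inode, timer, ...) */
        int (*enqueue)(struct kevent *k);
        /* detach it from that object again */
        int (*dequeue)(struct kevent *k);
        /* invoked when the origin has something to report; a positive
         * return value marks the kevent ready */
        int (*callback)(struct kevent *k);
    };

    /* the timer case is trivial: if the timer fired, the event is ready,
     * which is why kevent_timer_callback() can simply return 1 */
    static int kevent_timer_callback(struct kevent *k)
    {
        return 1;
    }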
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21  9:53 ` Evgeniy Polyakov
@ 2006-11-21 16:58 ` Ulrich Drepper
  2006-11-21 17:43 ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-21 16:58 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
>> You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.
>
> The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e. when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.
>
> How should it know which syscall should be interrupted when the special syscall is called?

It's not about interrupting any threads.

The issue is that the wakeup of a thread from the kevent_wait call constitutes an "event notification". If, as it should be, only one thread is woken, then this information mustn't get lost. If the woken thread cannot work on the events it got notified for, then it must tell the kernel about it so that, *if* there are other threads waiting in kevent_wait, one of those other threads can be woken.

What is needed is a simple "wake another thread waiting on this event queue" syscall. Yes, in theory we could open an additional pipe with each event queue and use it for waking threads, but this is influencing the ABI through the use of a file descriptor. It's much better to have an explicit way to do this.

> No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!
>
> I still do not understand how you will use absolute timeout values there. Please explain.

What is there to explain? If you are waiting for events which must coincide with real-world events you'll naturally want to formulate something like "wait for X until 10:15h". You cannot formulate this correctly with relative timeouts since the realtime clock might be adjusted.

> futex_wait() uses relative timeouts:
>
>   static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
>
> The kernel uses relative timeouts.

Look again. This time at the implementation. For FUTEX_LOCK_PI the timeout is an absolute timeout.

> We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.
>
> You request to get events - and you get them. You request not to get events during a syscall - you remove the events.

None of this matches what I'm talking about. If you want to block a signal for the duration of the kevent_wait call this is nothing you can do by registering an event.

Registering events has nothing to do with signal masks. They are not modified. It is the program's responsibility to set the mask up correctly.
Just like sigwaitinfo() etc. expect all signals which are waited on to be blocked. The signal mask handling is orthogonal to all this and must be explicit. In some cases explicit pthread_sigmask/sigprocmask calls. But this is not atomic if a signal must be masked/unmasked for the *_wait call. This is why we have variants like pselect/ppoll/epoll_pwait which explicitly and *atomically* change the signal mask for the duration of the call.

> Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement.

Don't search for epoll_pwait, it's not widely used yet. Search for pselect, which is standardized. You'll find plenty of uses of that interface. The number is certainly depressed at the moment since until recently there was no correct implementation on Linux. And the interface is mostly used in real-time contexts where signals are more commonly used.

> What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?

Why is this even a question? Have you done programming with signals? Your hatred of signals makes me think this isn't the case.

You might want to unblock a signal on a *_wait call if it can be used to interrupt the wait, but you don't want this to happen while the thread is working on a request. You might want to block a signal, for instance, around a sigwaitinfo call or, in this case, a kevent_wait call where the signal might be delivered to the queue. There are countless possibilities. Signals are very flexible.

> There are _no_ additional syscalls. I just introduced a new case for an event type.

Which is a new syscall. All demultiplexer cases are new syscalls.

Which, BTW, implies that unrecognized types should actually cause an ENOSYS return value (this affects kevent_break). We've been over this many times. If EINVAL is returned this case cannot be distinguished from invalid parameters. This is crucial for future extensions where userland (esp. glibc) needs to be able to determine whether a new feature is supported on the system.

> You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.

I don't question that. But there is no need to add the callback. It extends the kernel ABI/API. And for what? A vastly inferior timer implementation compared to the POSIX timers. And this while all that needs to be done is to extend the POSIX timer code slightly to handle SIGEV_KEVENT in addition to the other notification methods currently used. If you do it right then the code can be shared with the file AIO code which currently is circulated as well and which uses parts of the POSIX timer infrastructure.

> Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?

I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union. Something like

  struct {
      int kevent_fd;
      void *data;
  } _sigev_kevent;

Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately.
I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 16:58 ` Ulrich Drepper
@ 2006-11-21 17:43 ` Evgeniy Polyakov
  2006-11-21 18:46 ` Evgeniy Polyakov
  2006-11-22  7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 17:43 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Tue, Nov 21, 2006 at 08:58:49AM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >>You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.
> >
> >The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e. when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.
> >
> >How should it know which syscall should be interrupted when the special syscall is called?
>
> It's not about interrupting any threads.
>
> The issue is that the wakeup of a thread from the kevent_wait call constitutes an "event notification". If, as it should be, only one thread is woken, then this information mustn't get lost. If the woken thread cannot work on the events it got notified for, then it must tell the kernel about it so that, *if* there are other threads waiting in kevent_wait, one of those other threads can be woken.
>
> What is needed is a simple "wake another thread waiting on this event queue" syscall. Yes, in theory we could open an additional pipe with each event queue and use it for waking threads, but this is influencing the ABI through the use of a file descriptor. It's much better to have an explicit way to do this.

Threads are parked in syscalls - which one should be interrupted? And what if there were no threads waiting in syscalls?

> >No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!
> >
> >I still do not understand how you will use absolute timeout values there. Please explain.
>
> What is there to explain? If you are waiting for events which must coincide with real-world events you'll naturally want to formulate something like "wait for X until 10:15h". You cannot formulate this correctly with relative timeouts since the realtime clock might be adjusted.

It has nothing to do with the syscall. You register a timer to wait until 10:15, that is all. You do not ask to sleep in read() until some time, because read() has nothing in common with that time and event.

But actually this is becoming a stupid discussion, don't you think? What do you think about putting a timespec there and a small warning in dmesg about absolute timeouts? When someone reports it, I will publicly say that you were right and that it is correct to have the possibility of absolute timeouts for syscalls?
:)

> >futex_wait() uses relative timeouts:
> >
> >  static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
> >
> >The kernel uses relative timeouts.
>
> Look again. This time at the implementation. For FUTEX_LOCK_PI the timeout is an absolute timeout.

How come? It just uses a timespec.

> >We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.
> >
> >You request to get events - and you get them. You request not to get events during a syscall - you remove the events.
>
> None of this matches what I'm talking about. If you want to block a signal for the duration of the kevent_wait call this is nothing you can do by registering an event.
>
> Registering events has nothing to do with signal masks. They are not modified. It is the program's responsibility to set the mask up correctly. Just like sigwaitinfo() etc. expect all signals which are waited on to be blocked.
>
> The signal mask handling is orthogonal to all this and must be explicit. In some cases explicit pthread_sigmask/sigprocmask calls. But this is not atomic if a signal must be masked/unmasked for the *_wait call. This is why we have variants like pselect/ppoll/epoll_pwait which explicitly and *atomically* change the signal mask for the duration of the call.

You probably missed the kevent signal patch - the signal will not be delivered (in special cases) since it will not be copied into the signal mask. The system just will not know that it happened. Completely. Like putting it into the blocked mask.

> >Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement.
>
> Don't search for epoll_pwait, it's not widely used yet. Search for pselect, which is standardized. You'll find plenty of uses of that interface. The number is certainly depressed at the moment since until recently there was no correct implementation on Linux. And the interface is mostly used in real-time contexts where signals are more commonly used.

I found this:

  ... document a pselect() call intended to remove the race condition that is present when one wants to wait on either a signal or some file descriptor. (See also Stevens, Unix Network Programming, Volume 1, 2nd Ed., 1998, p. 168 and the pselect.2 man page released today.) Glibc 2.0 has a bad version (wrong number of parameters) and glibc 2.1 a better version, but the whole purpose of pselect is to avoid the race, and glibc cannot do that, one needs kernel support.

But it is completely irrelevant to kevent signals - there is no race in that case, since the signal is delivered through a file descriptor.

> >What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?
>
> Why is this even a question? Have you done programming with signals? Your hatred of signals makes me think this isn't the case.

It is much better not to know how a thing works than to be unable to understand how new things can work.

> You might want to unblock a signal on a *_wait call if it can be used to interrupt the wait, but you don't want this to happen while the thread is working on a request.

Add a kevent signal and do not process that event.
> You might want to block a signal, for instance, around a sigwaitinfo call or, in this case, a kevent_wait call where the signal might be delivered to the queue.

Having a special type of kevent signal is the same as putting the signal into the blocked mask, but the signal event will be marked as ready - to indicate that the condition was there. There will not be any race in that case.

> There are countless possibilities. Signals are very flexible.

That is why we want to get them through a synchronous queue? :)

> >There are _no_ additional syscalls. I just introduced a new case for an event type.
>
> Which is a new syscall. All demultiplexer cases are new syscalls.

I think I am a bit blind - probably parts of the Leonids are still getting into my brain - but there is one syscall called kevent_ctl() which adds different events, including timers, signals, sockets and others.

> Which, BTW, implies that unrecognized types should actually cause an ENOSYS return value (this affects kevent_break). We've been over this many times. If EINVAL is returned this case cannot be distinguished from invalid parameters. This is crucial for future extensions where userland (esp. glibc) needs to be able to determine whether a new feature is supported on the system.

I can replace it with -ENOSYS if you like.

> >You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.
>
> I don't question that. But there is no need to add the callback. It

No one asked or paid me to create kevent, but it is done. Probably not the way some people wanted, but that is how it always happens; it is really not that bad.

The kevent subsystem operates on structures which can be added to completely different objects in the system - inodes, files - anything. And to tell that object about new events there are special callbacks - enqueue and dequeue. The callback with the extremely unusual name 'callback' is invoked when the object the event is linked to has something to report - new data, a fired alarm or anything else; the object calls kevent's ->callback and if the return value is positive, the kevent is marked as ready. It allows having events with different sets of interests for the same type of main object - for example, a socket can have read and write callbacks. So you must have them. As you probably saw, kevent_timer_callback() just returns 1.

> extends the kernel ABI/API. And for what? A vastly inferior timer implementation compared to the POSIX timers. And this while all that needs to be done is to extend the POSIX timer code slightly to handle SIGEV_KEVENT in addition to the other notification methods currently used. If you do it right then the code can be shared with the file AIO code which currently is circulated as well and which uses parts of the POSIX timer infrastructure.

Ulrich, tell me the truth, will you kill me if I say that I have an entry in my TODO to implement a different AIO design (details for interested readers can be found in my blog), and then present it to the community? :))

> >Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?
>
> I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union.
> Something like
>
>   struct {
>       int kevent_fd;
>       void *data;
>   } _sigev_kevent;
>
> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà.

Ok, I see. It is doable and simple. I will try to implement it tomorrow.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 17:43 ` Evgeniy Polyakov
@ 2006-11-21 18:46 ` Evgeniy Polyakov
  2006-11-21 20:01 ` Jeff Garzik
  ` (2 more replies)
  2006-11-22  7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
  1 sibling, 3 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 18:46 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Tue, Nov 21, 2006 at 08:43:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union. Something like
> >
> >   struct {
> >       int kevent_fd;
> >       void *data;
> >   } _sigev_kevent;
> >
> > Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà.
>
> Ok, I see. It is doable and simple. I will try to implement it tomorrow.

I've checked the code. Since it will be a union, it is impossible to use _sigev_thread, and it becomes just the SIGEV_SIGNAL case with a different delivery mechanism. Is that what you want?

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov @ 2006-11-21 20:01 ` Jeff Garzik 2006-11-22 10:41 ` Evgeniy Polyakov 2006-11-21 20:19 ` Jeff Garzik 2006-11-22 7:38 ` Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-21 20:01 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro nitpick: in ring_buffer.c (example app), I would use posix_memalign(3) rather than malloc(3) Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 20:01 ` Jeff Garzik @ 2006-11-22 10:41 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:41 UTC (permalink / raw) To: Jeff Garzik Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro On Tue, Nov 21, 2006 at 03:01:45PM -0500, Jeff Garzik (jeff@garzik.org) wrote: > nitpick: in ring_buffer.c (example app), I would use posix_memalign(3) > rather than malloc(3) Yes, it can be done. > Jeff -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
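Concretely, Jeff's suggestion: a mapped ring wants page-aligned memory, which malloc(3) does not guarantee but posix_memalign(3) does.

    #include <stdlib.h>
    #include <unistd.h>

    static void *alloc_ring(size_t sz)
    {
        void *p;

        /* posix_memalign() returns 0 on success and an errno value on
         * failure; page-size alignment is what mmap-backed access to
         * the ring requires */
        if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), sz) != 0)
            return NULL;
        return p;
    }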
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov 2006-11-21 20:01 ` Jeff Garzik @ 2006-11-21 20:19 ` Jeff Garzik 2006-11-22 10:39 ` Evgeniy Polyakov 2006-11-22 7:38 ` Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-21 20:19 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Another: pass a 'flags' argument to kevent_init(2). I guarantee you will need it eventually. It IMO would help with later binary compatibility, if nothing else. You wouldn't need a new syscall to introduce struct kevent_ring_v2. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 20:19 ` Jeff Garzik
@ 2006-11-22 10:39 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-22 10:39 UTC (permalink / raw)
To: Jeff Garzik
Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

On Tue, Nov 21, 2006 at 03:19:05PM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Another: pass a 'flags' argument to kevent_init(2). I guarantee you will need it eventually. It IMO would help with later binary compatibility, if nothing else. You wouldn't need a new syscall to introduce struct kevent_ring_v2.

Yep, I will add a 'flags' field there.

> Jeff

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
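A sketch of what Jeff is asking for, kernel-side; kevent_init() and its argument list are still in flux in this thread, so the prototype below is illustrative only.

    /* reject unknown flag bits up front, so a later extension (say, a
     * kevent_ring_v2 layout) can be selected by a new flag instead of
     * a new syscall */
    #define KEVENT_INIT_VALID_FLAGS 0U       /* no flags defined yet */

    asmlinkage long sys_kevent_init(struct kevent_ring __user *ring,
                                    unsigned int num, unsigned int flags)
    {
        if (flags & ~KEVENT_INIT_VALID_FLAGS)
            return -EINVAL;
        /* ... existing initialization ... */
        return 0;
    }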
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov 2006-11-21 20:01 ` Jeff Garzik 2006-11-21 20:19 ` Jeff Garzik @ 2006-11-22 7:38 ` Ulrich Drepper 2006-11-22 10:44 ` Evgeniy Polyakov 2 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 7:38 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > I've checked the code. > Since it will be a union, it is impossible to use _sigev_thread and it > becomes just SIGEV_SIGNAL case with different delivery mechanism. > Is it what you want? struct sigevent is defined like this: typedef struct sigevent { sigval_t sigev_value; int sigev_signo; int sigev_notify; union { int _pad[SIGEV_PAD_SIZE]; int _tid; struct { void (*_function)(sigval_t); void *_attribute; /* really pthread_attr_t */ } _sigev_thread; } _sigev_un; } sigevent_t; For the SIGEV_KEVENT case: sigev_notify is set to SIGEV_KEVENT (obviously) sigev_value can be used for the void* data passed along with the signal, just like in the case of a signal delivery Now you need a way to specify the kevent descriptor. Just add int _kevent; inside the union and if you want #define sigev_kevent_descr _sigev_un._kevent That should be all. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
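For illustration, userspace code against such a patched kernel might look roughly like this. SIGEV_KEVENT and the sigev_kevent_descr accessor are the proposed additions from this thread, not part of any released kernel or glibc headers, so the value used here is a placeholder:

	#include <signal.h>
	#include <string.h>
	#include <time.h>

	#ifndef SIGEV_KEVENT
	#define SIGEV_KEVENT 3		/* hypothetical placeholder value */
	#endif

	int create_kevent_timer(int kevent_fd, void *cookie, timer_t *out)
	{
		struct sigevent ev;
		memset(&ev, 0, sizeof(ev));

		ev.sigev_notify = SIGEV_KEVENT;		/* deliver via the kevent queue */
		ev.sigev_value.sival_ptr = cookie;	/* surfaced with the event */
		/* the proposed union member, absent from stock headers:
		 * ev.sigev_kevent_descr = kevent_fd; */

		return timer_create(CLOCK_MONOTONIC, &ev, out);
	}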
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:38 ` Ulrich Drepper @ 2006-11-22 10:44 ` Evgeniy Polyakov 2006-11-22 21:02 ` Ulrich Drepper 2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:44 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:38:25PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >I've checked the code. > >Since it will be a union, it is impossible to use _sigev_thread and it > >becomes just SIGEV_SIGNAL case with different delivery mechanism. > >Is it what you want? > > struct sigevent is defined like this: > > typedef struct sigevent { > sigval_t sigev_value; > int sigev_signo; > int sigev_notify; > union { > int _pad[SIGEV_PAD_SIZE]; > int _tid; > > struct { > void (*_function)(sigval_t); > void *_attribute; /* really pthread_attr_t */ > } _sigev_thread; > } _sigev_un; > } sigevent_t; > > > For the SIGEV_KEVENT case: > > sigev_notify is set to SIGEV_KEVENT (obviously) > > sigev_value can be used for the void* data passed along with the > signal, just like in the case of a signal delivery > > Now you need a way to specify the kevent descriptor. Just add > > int _kevent; > > inside the union and if you want > > #define sigev_kevent_descr _sigev_un._kevent > > That should be all. That is what I implemented. But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT at the same time; it will be just the same as SIGEV_SIGNAL but with a different delivery mechanism. Is that what you expect? > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 10:44 ` Evgeniy Polyakov @ 2006-11-22 21:02 ` Ulrich Drepper 2006-11-23 12:23 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 21:02 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT > at the same time; it will be just the same as SIGEV_SIGNAL but with > a different delivery mechanism. Is that what you expect? Yes, that's expected. The event is for the queue, not directed to a specific thread. If in the future we want to think about preferentially waking a specific thread we can think about it then. But I doubt that'll be beneficial. The thread-specific part in the signal handling is only used to implement the SIGEV_THREAD notification. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 21:02 ` Ulrich Drepper @ 2006-11-23 12:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:23 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 01:02:00PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >But in this case it will be impossible to have SIGEV_THREAD and > >SIGEV_KEVENT > >at the same time; it will be just the same as SIGEV_SIGNAL but with > >a different delivery mechanism. Is that what you expect? > > Yes, that's expected. The event is for the queue, not directed to a > specific thread. > > If in the future we want to think about preferentially waking a specific thread > we can think about it then. But I doubt that'll be beneficial. The > thread-specific part in the signal handling is only used to implement > the SIGEV_THREAD notification. Ok, so please review the patch I sent; if it is ok from a design point of view, I will run some tests here. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Kevent POSIX timers support. 2006-11-22 10:44 ` Evgeniy Polyakov 2006-11-22 21:02 ` Ulrich Drepper @ 2006-11-23 8:52 ` Evgeniy Polyakov 2006-11-23 20:26 ` Ulrich Drepper 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 8:52 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 01:44:16PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > That is what I implemented. > But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT > at the same time; it will be just the same as SIGEV_SIGNAL but with > a different delivery mechanism. Is that what you expect? Something like this morning's hack (compile-tested only). If my thoughts are correct, I will create some simple application and test if it works. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index a7dd38f..4b9deb4 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -4,6 +4,7 @@ #include <linux/spinlock.h> #include <linux/list.h> #include <linux/sched.h> +#include <linux/kevent_storage.h> union cpu_time_count { cputime_t cpu; @@ -49,6 +50,9 @@ struct k_itimer { sigval_t it_sigev_value; /* value word of sigevent struct */ struct task_struct *it_process; /* process to send signal to */ struct sigqueue *sigq; /* signal queue entry. */ +#ifdef CONFIG_KEVENT_TIMER + struct kevent_storage st; +#endif union { struct { struct hrtimer timer; diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index e5ebcc1..148a9f9 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -48,6 +48,8 @@ #include <linux/wait.h> #include <linux/workqueue.h> #include <linux/module.h> +#include <linux/kevent.h> +#include <linux/file.h> /* * Management arrays for POSIX timers.
Timers are kept in slab memory @@ -224,6 +226,95 @@ static int posix_ktime_get_ts(clockid_t return 0; } +#ifdef CONFIG_KEVENT_TIMER +static int posix_kevent_enqueue(struct kevent *k) +{ + struct k_itimer *tmr = k->event.ptr; + return kevent_storage_enqueue(&tmr->st, k); +} +static int posix_kevent_dequeue(struct kevent *k) +{ + struct k_itimer *tmr = k->event.ptr; + kevent_storage_dequeue(&tmr->st, k); + return 0; +} +static int posix_kevent_callback(struct kevent *k) +{ + return 1; +} +static int posix_kevent_init(void) +{ + struct kevent_callbacks tc = { + .callback = &posix_kevent_callback, + .enqueue = &posix_kevent_enqueue, + .dequeue = &posix_kevent_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER); +} + +extern struct file_operations kevent_user_fops; + +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + struct ukevent uk; + struct file *file; + struct kevent_user *u; + int err; + + file = fget(fd); + if (!file) { + err = -EBADF; + goto err_out; + } + + if (file->f_op != &kevent_user_fops) { + err = -EINVAL; + goto err_out_fput; + } + + u = file->private_data; + + memset(&uk, 0, sizeof(struct ukevent)); + + uk.type = KEVENT_POSIX_TIMER; + uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */ + uk.ptr = tmr; + + tmr->it_sigev_value.sival_ptr = file; + + err = kevent_user_add_ukevent(&uk, u); + if (err) + goto err_out_fput; + + fput(file); + + return 0; + +err_out_fput: + fput(file); +err_out: + return err; +} + +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ + kevent_storage_fini(&tmr->st); +} +#else +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + return -ENOSYS; +} +static int posix_kevent_init(void) +{ + return 0; +} +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ +} +#endif + + /* * Initialize everything, well, just everything in Posix clocks/timers ;) */ @@ -241,6 +332,11 @@ static __init int init_posix_timers(void register_posix_clock(CLOCK_REALTIME, &clock_realtime); register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic); + if (posix_kevent_init()) { + printk(KERN_ERR "Failed to initialize kevent posix timers.\n"); + BUG(); + } + posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof (struct k_itimer), 0, 0, NULL, NULL); idr_init(&posix_timers_id); @@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer timr = container_of(timer, struct k_itimer, it.real.timer); spin_lock_irqsave(&timr->it_lock, flags); + + if (timr->it_sigev_notify & SIGEV_KEVENT) { + kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); + } else { + if (timr->it.real.interval.tv64 != 0) + si_private = ++timr->it_requeue_pending; - if (timr->it.real.interval.tv64 != 0) - si_private = ++timr->it_requeue_pending; - - if (posix_timer_event(timr, si_private)) { - /* - * signal was not sent because of sig_ignor - * we will not get a call back to restart it AND - * it should be restarted. - */ - if (timr->it.real.interval.tv64 != 0) { - timr->it_overrun += - hrtimer_forward(timer, - timer->base->softirq_time, - timr->it.real.interval); - ret = HRTIMER_RESTART; - ++timr->it_requeue_pending; + if (posix_timer_event(timr, si_private)) { + /* + * signal was not sent because of sig_ignor + * we will not get a call back to restart it AND + * it should be restarted. 
+ */ + if (timr->it.real.interval.tv64 != 0) { + timr->it_overrun += + hrtimer_forward(timer, + timer->base->softirq_time, + timr->it.real.interval); + ret = HRTIMER_RESTART; + ++timr->it_requeue_pending; + } } } @@ -407,6 +507,9 @@ static struct k_itimer * alloc_posix_tim kmem_cache_free(posix_timers_cache, tmr); tmr = NULL; } +#ifdef CONFIG_KEVENT_TIMER + kevent_storage_init(tmr, &tmr->st); +#endif return tmr; } @@ -424,6 +527,7 @@ static void release_posix_timer(struct k if (unlikely(tmr->it_process) && tmr->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) put_task_struct(tmr->it_process); + posix_kevent_fini_timer(tmr); kmem_cache_free(posix_timers_cache, tmr); } @@ -496,40 +600,52 @@ sys_timer_create(const clockid_t which_c new_timer->it_sigev_signo = event.sigev_signo; new_timer->it_sigev_value = event.sigev_value; - read_lock(&tasklist_lock); - if ((process = good_sigevent(&event))) { - /* - * We may be setting up this process for another - * thread. It may be exiting. To catch this - * case the we check the PF_EXITING flag. If - * the flag is not set, the siglock will catch - * him before it is too late (in exit_itimers). - * - * The exec case is a bit more invloved but easy - * to code. If the process is in our thread - * group (and it must be or we would not allow - * it here) and is doing an exec, it will cause - * us to be killed. In this case it will wait - * for us to die which means we can finish this - * linkage with our last gasp. I.e. no code :) - */ + if (event.sigev_notify & SIGEV_KEVENT) { + error = posix_kevent_init_timer(new_timer, event._sigev_un.kevent_fd); + if (error) + goto out; + + process = current->group_leader; spin_lock_irqsave(&process->sighand->siglock, flags); - if (!(process->flags & PF_EXITING)) { - new_timer->it_process = process; - list_add(&new_timer->list, - &process->signal->posix_timers); - spin_unlock_irqrestore(&process->sighand->siglock, flags); - if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) - get_task_struct(process); - } else { - spin_unlock_irqrestore(&process->sighand->siglock, flags); - process = NULL; + new_timer->it_process = process; + list_add(&new_timer->list, &process->signal->posix_timers); + spin_unlock_irqrestore(&process->sighand->siglock, flags); + } else { + read_lock(&tasklist_lock); + if ((process = good_sigevent(&event))) { + /* + * We may be setting up this process for another + * thread. It may be exiting. To catch this + * case the we check the PF_EXITING flag. If + * the flag is not set, the siglock will catch + * him before it is too late (in exit_itimers). + * + * The exec case is a bit more invloved but easy + * to code. If the process is in our thread + * group (and it must be or we would not allow + * it here) and is doing an exec, it will cause + * us to be killed. In this case it will wait + * for us to die which means we can finish this + * linkage with our last gasp. I.e. 
no code :) + */ + spin_lock_irqsave(&process->sighand->siglock, flags); + if (!(process->flags & PF_EXITING)) { + new_timer->it_process = process; + list_add(&new_timer->list, + &process->signal->posix_timers); + spin_unlock_irqrestore(&process->sighand->siglock, flags); + if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) + get_task_struct(process); + } else { + spin_unlock_irqrestore(&process->sighand->siglock, flags); + process = NULL; + } + } + read_unlock(&tasklist_lock); + if (!process) { + error = -EINVAL; + goto out; } - } - read_unlock(&tasklist_lock); - if (!process) { - error = -EINVAL; - goto out; } } else { new_timer->it_sigev_notify = SIGEV_SIGNAL; -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov @ 2006-11-23 20:26 ` Ulrich Drepper 2006-11-24 9:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:26 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > +static int posix_kevent_init(void) > +{ > + struct kevent_callbacks tc = { > + .callback = &posix_kevent_callback, > + .enqueue = &posix_kevent_enqueue, > + .dequeue = &posix_kevent_dequeue}; How do we prevent somebody from trying to register a POSIX timer event source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible from sys_timer_create and nowhere else. Can you add a parameter to kevent_enqueue indicating this is a call from inside the kernel and then ignore certain enqueue callbacks? > @@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer > > timr = container_of(timer, struct k_itimer, it.real.timer); > spin_lock_irqsave(&timr->it_lock, flags); > + > + if (timr->it_sigev_notify & SIGEV_KEVENT) { > + kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); > + } else { We need to pass the data in the sigev_value member of the struct sigevent structure passed to timer_create to the caller. I don't see it being done here nor when the timer is created. Am I missing something? The sigev_value value should be stored in the user/ptr member of struct ukevent. > + if (event.sigev_notify & SIGEV_KEVENT) { Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case. Just define SIGEV_KEVENT to 3 and replace the tests like the one cited above with if (timr->it_sigev_notify == SIGEV_KEVENT) -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
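To see why the bit test misfires once SIGEV_KEVENT is a plain enumeration value: Linux defines SIGEV_SIGNAL as 0, SIGEV_NONE as 1, SIGEV_THREAD as 2 and SIGEV_THREAD_ID as 4, so a SIGEV_KEVENT of 3 shares bits with two of them. A small standalone demonstration (the SIGEV_KEVENT value is the one proposed above, not a released constant):

	#include <stdio.h>

	#define SIGEV_SIGNAL	0
	#define SIGEV_NONE	1
	#define SIGEV_THREAD	2
	#define SIGEV_THREAD_ID	4
	#define SIGEV_KEVENT	3	/* proposed in this thread */

	int main(void)
	{
		int notify = SIGEV_THREAD;	/* an ordinary thread timer */

		/* 2 & 3 == 2, so the '&' test wrongly matches plain SIGEV_THREAD: */
		printf("bit test:      %s\n",
		       (notify & SIGEV_KEVENT) ? "matches (wrong)" : "no match");

		/* equality only matches the real thing: */
		printf("equality test: %s\n",
		       (notify == SIGEV_KEVENT) ? "matches" : "no match (right)");
		return 0;
	}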
* Re: Kevent POSIX timers support. 2006-11-23 20:26 ` Ulrich Drepper @ 2006-11-24 9:50 ` Evgeniy Polyakov 2006-11-27 18:20 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 9:50 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 12:26:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+static int posix_kevent_init(void) > >+{ > >+ struct kevent_callbacks tc = { > >+ .callback = &posix_kevent_callback, > >+ .enqueue = &posix_kevent_enqueue, > >+ .dequeue = &posix_kevent_dequeue}; > > How do we prevent somebody from trying to register a POSIX timer event > source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible > from sys_timer_create and nowhere else. > > Can you add a parameter to kevent_enqueue indicating this is a call from > inside the kernel and then ignore certain enqueue callbacks? I think we need some set of flags for callbacks - where they can be called from, maybe even from which context and so on. So userspace will not be allowed to create such timers through the kevent API. Will do it for the release. > >@@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer > > > > timr = container_of(timer, struct k_itimer, it.real.timer); > > spin_lock_irqsave(&timr->it_lock, flags); > >+ > >+ if (timr->it_sigev_notify & SIGEV_KEVENT) { > >+ kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); > >+ } else { > > We need to pass the data in the sigev_value member of the struct > sigevent structure passed to timer_create to the caller. I don't see it > being done here nor when the timer is created. Am I missing something? > The sigev_value value should be stored in the user/ptr member of struct > ukevent. sigev_value was stored in the k_itimer structure; I just do not know where to put it in the ukevent provided to userspace - it can be placed in the pointer value if you like. > >+ if (event.sigev_notify & SIGEV_KEVENT) { > > Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with > SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case. > > Just define SIGEV_KEVENT to 3 and replace the tests like the one cited > above with > > if (timr->it_sigev_notify == SIGEV_KEVENT) Ok. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-24 9:50 ` Evgeniy Polyakov @ 2006-11-27 18:20 ` Ulrich Drepper 2006-11-27 18:24 ` David Miller 2006-11-28 9:16 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: >> We need to pass the data in the sigev_value member of the struct >> sigevent structure passed to timer_create to the caller. I don't see it >> being done here nor when the timer is created. Am I missing something? >> The sigev_value value should be stored in the user/ptr member of struct >> ukevent. > > sigev_value was stored in the k_itimer structure; I just do not know where > to put it in the ukevent provided to userspace - it can be placed in > the pointer value if you like. sigev_value is a union and the largest element is a pointer. So, transporting the pointer value is sufficient and it should be passed up to the user in the ptr member of struct ukevent. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:20 ` Ulrich Drepper @ 2006-11-27 18:24 ` David Miller 2006-11-27 18:36 ` Ulrich Drepper 2006-11-28 9:16 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-27 18:24 UTC (permalink / raw) To: drepper Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Ulrich Drepper <drepper@redhat.com> Date: Mon, 27 Nov 2006 10:20:50 -0800 > Evgeniy Polyakov wrote: > >> We need to pass the data in the sigev_value member of the struct > >> sigevent structure passed to timer_create to the caller. I don't see it > >> being done here nor when the timer is created. Am I missing something? > >> The sigev_value value should be stored in the user/ptr member of struct > >> ukevent. > > > > sigev_value was stored in the k_itimer structure; I just do not know where > > to put it in the ukevent provided to userspace - it can be placed in > > the pointer value if you like. > > sigev_value is a union and the largest element is a pointer. So, > transporting the pointer value is sufficient and it should be passed up > to the user in the ptr member of struct ukevent. Now we'll have to have a compat layer for 32-bit/64-bit environments thanks to POSIX timers, which is ridiculous. This is exactly the kind of thing I was hoping we could avoid when designing these data structures. No pointers, no non-fixed-size types, only types which are identically sized and aligned between 32-bit and 64-bit environments. It's OK to have these problems for things designed a long time ago before 32-bit/64-bit compat issues existed, but for new stuff no way. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:24 ` David Miller @ 2006-11-27 18:36 ` Ulrich Drepper 2006-11-27 18:49 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:36 UTC (permalink / raw) To: David Miller Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro David Miller wrote: > Now we'll have to have a compat layer for 32-bit/64-bit environments > thanks to POSIX timers, which is ridiculous. We already have compat_sys_timer_create. It should be sufficient just to add the conversion (if anything new is needed) there. The pointer value can be passed to userland in one or two int fields, I don't really care. When reporting the event to the user code we cannot just point into the ring buffer anyway. So while copying the data we can rewrite it if necessary. I see no need to complicate the code more than it already is. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:36 ` Ulrich Drepper @ 2006-11-27 18:49 ` David Miller 2006-11-28 9:16 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-27 18:49 UTC (permalink / raw) To: drepper Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Ulrich Drepper <drepper@redhat.com> Date: Mon, 27 Nov 2006 10:36:06 -0800 > David Miller wrote: > > Now we'll have to have a compat layer for 32-bit/64-bit environments > > thanks to POSIX timers, which is ridiculous. > > We already have compat_sys_timer_create. It should be sufficient just > to add the conversion (if anything new is needed) there. The pointer > value can be passed to userland in one or two int fields, I don't really > care. When reporting the event to the user code we cannot just point > into the ring buffer anyway. So while copying the data we can rewrite > it if necessary. I see no need to complicate the code more than it > already is. Ok, as long as that thing doesn't end up in the ring buffer entry data structure, that's where the real troubles would be. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:49 ` David Miller @ 2006-11-28 9:16 ` Evgeniy Polyakov 2006-11-28 19:13 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 9:16 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Mon, Nov 27, 2006 at 10:49:55AM -0800, David Miller (davem@davemloft.net) wrote: > From: Ulrich Drepper <drepper@redhat.com> > Date: Mon, 27 Nov 2006 10:36:06 -0800 > > > David Miller wrote: > > > Now we'll have to have a compat layer for 32-bit/64-bit environments > > > thanks to POSIX timers, which is ridiculous. > > > > We already have compat_sys_timer_create. It should be sufficient just > > to add the conversion (if anything new is needed) there. The pointer > > value can be passed to userland in one or two int fields, I don't really > > care. When reporting the event to the user code we cannot just point > > into the ring buffer anyway. So while copying the data we can rewrite > > it if necessary. I see no need to complicate the code more than it > > already is. > > Ok, as long as that thing doesn't end up in the ring buffer entry > data structure, that's where the real troubles would be. Although ukevent has a pointer embedded, it is unioned with a u64, so there should be no problems until a 128-bit arch appears, which is not likely to happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent posix timers patch, which can store sigval's value too. But it is absolutely certain that ukevent does not and will not in any way have a variable size. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-28 9:16 ` Evgeniy Polyakov @ 2006-11-28 19:13 ` David Miller 2006-11-28 19:22 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-28 19:13 UTC (permalink / raw) To: johnpol Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 28 Nov 2006 12:16:02 +0300 > Although ukevent has a pointer embedded, it is unioned with a u64, so there > should be no problems until a 128-bit arch appears, which is not likely to > happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent > posix timers patch, which can store sigval's value too. > > But it is absolutely certain that ukevent does not and will not in any way > have a variable size. I believe that in order to be 100% safe you will need to use the special aligned_u64 type, as that takes care of a crucial difference between x86 and x86_64 API, namely that u64 needs 8-byte alignment on x86_64 but not on x86. You probably know this already :-) ^ permalink raw reply [flat|nested] 200+ messages in thread
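The difference is easy to demonstrate: the i386 ABI aligns 64-bit integers to 4 bytes inside structures while x86_64 aligns them to 8, so the same declaration yields two different layouts. Compiling the sketch below with -m32 and then -m64 shows offsetof(struct plain, b) as 4 vs. 8, while the forced variant is 8 in both cases:

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>

	/* same idea as the kernel's aligned_u64 */
	typedef uint64_t my_aligned_u64 __attribute__((aligned(8)));

	struct plain  { uint32_t a; uint64_t b; };
	struct forced { uint32_t a; my_aligned_u64 b; };

	int main(void)
	{
		printf("plain:  offsetof(b) = %zu, sizeof = %zu\n",
		       offsetof(struct plain, b), sizeof(struct plain));
		printf("forced: offsetof(b) = %zu, sizeof = %zu\n",
		       offsetof(struct forced, b), sizeof(struct forced));
		return 0;
	}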
* Re: Kevent POSIX timers support. 2006-11-28 19:13 ` David Miller @ 2006-11-28 19:22 ` Evgeniy Polyakov 2006-12-12 1:36 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 19:22 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Tue, Nov 28, 2006 at 11:13:00AM -0800, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 28 Nov 2006 12:16:02 +0300 > > > Although ukevent has a pointer embedded, it is unioned with a u64, so there > > should be no problems until a 128-bit arch appears, which is not likely to > > happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent > > posix timers patch, which can store sigval's value too. > > > > But it is absolutely certain that ukevent does not and will not in any way > > have a variable size. > > I believe that in order to be 100% safe you will need to use the > special aligned_u64 type, as that takes care of a crucial difference > between x86 and x86_64 API, namely that u64 needs 8-byte alignment on > x86_64 but not on x86. > > You probably know this already :-) Yep :) So I put it at the end, where the structure is already correctly aligned, so there is no need for special alignment. And, btw, last time I checked, aligned_u64 was not exported to userspace. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-28 19:22 ` Evgeniy Polyakov @ 2006-12-12 1:36 ` David Miller 2006-12-12 5:31 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-12-12 1:36 UTC (permalink / raw) To: johnpol Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 28 Nov 2006 22:22:36 +0300 > And, btw, last time I checked, aligned_u64 was not exported to > userspace. It is in linux/types.h and not protected by __KERNEL__ ifdefs. Perhaps you mean something else? ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-12-12 1:36 ` David Miller @ 2006-12-12 5:31 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-12-12 5:31 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Mon, Dec 11, 2006 at 05:36:44PM -0800, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 28 Nov 2006 22:22:36 +0300 > > > And, btw, last time I checked, aligned_u64 was not exported to > > userspace. > > It is in linux/types.h and not protected by __KERNEL__ ifdefs. > Perhaps you mean something else? It looks like I checked the wrong #ifdef __KERNEL__/#endif pair. It is indeed there. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:20 ` Ulrich Drepper 2006-11-27 18:24 ` David Miller @ 2006-11-28 9:16 ` Evgeniy Polyakov 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 9:16 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 10:20:50AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > sigev_value is a union and the largest element is a pointer. So, > transporting the pointer value is sufficient and it should be passed up > to the user in the ptr member of struct ukevent. That is where I've put it in the current version. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 17:43 ` Evgeniy Polyakov 2006-11-21 18:46 ` Evgeniy Polyakov @ 2006-11-22 7:33 ` Ulrich Drepper 2006-11-22 10:38 ` Evgeniy Polyakov 2006-11-22 12:09 ` Evgeniy Polyakov 1 sibling, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 7:33 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > Threads are parked in syscalls - which one should be interrupted? It doesn't matter, use the same policy you use when waking a thread in case of an event. This is not about waking a specific thread, it's about not dropping the event notification. > And what if there were no threads waiting in syscalls? This is fine, do nothing. It means that the other threads are about to read the ring buffer and will pick up the event. The case which must be avoided is that of all threads being in the kernel, one thread gets woken, and then is canceled. Without notifying the kernel about the cancellation and in the absence of further event notifications the process is deadlocked. A second case which should be avoided is that there is a thread waiting when a thread gets canceled and there are one or more additional threads around, but not in the kernel. But those other threads might not get to the ring buffer anytime soon, so handling the event is unnecessarily delayed. > It has completely nothing to do with the syscall. > You register a timer to wait until 10:15, that is all. That's a nonsense argument. In this case you would not add any timeout parameter at all. Of course nobody would want that since it's simply too slow. Stop thinking about the absolute timeout as an exceptional case, it might very well not be for some problems. Besides, I've already mentioned another case where a struct timespec* parameter is needed. There are even two different relative timeouts: using the monotonic clock or using the realtime clock. The latter is affected by gettimeofday and ntp. >>> The kernel uses relative timeouts. >> Look again. This time at the implementation. For FUTEX_LOCK_PI the >> timeout is an absolute timeout. > > How come? It just uses timespec. Correct, it's using the value passed in. >> The signal mask handling is orthogonal to all this and must be explicit. >> In some cases explicit pthread_sigmask/sigprocmask calls. But this is >> not atomic if a signal must be masked/unmasked for the *_wait call. >> This is why we have variants like pselect/ppoll/epoll_pwait which >> explicitly and *atomically* change the signal mask for the duration of >> the call. > > You probably missed the kevent signal patch - the signal will not be delivered > (in special cases) since it will not be copied into the signal mask. The system > just will not know that it happened. Completely. Like putting it into > the blocked mask. I don't really understand what you want to say here. I looked over the patch and I don't think I miss anything. You just deliver the signal as an event. No signal mask handling at all. This is exactly the problem. > But it is completely irrelevant with kevent signals - there is no race > for that case when the signal is delivered through a file descriptor. Of course there is a race. You might not want the signal delivered. This is what the signal mask is for. Or the other way around, as I've said before. > It is much better to not know how a thing works than to not be able > to understand how new things can work. Well, this explains why you don't understand signal masks at all. > Add kevent signal and do not process that event. That's not only a horrible hack, it does not work. If I want to ignore a signal for the duration of the call, while you have it occasionally blocked for the rest of the program, you would have to register the kevent for the signal, unblock the signal, make the kevent_wait call, reset the mask, remove the kevent for the signal... Otherwise it would not be delivered to be ignored. And then you have a race, the same race pselect is designed to prevent. In fact, you have two races. There are other scenarios like this. Fact is, signal mask handling is necessary and it cannot be folded into the event handling, it's orthogonal. > Having a special type of kevent signal is the same as putting the signal into > the blocked mask, but the signal event will be marked as ready - to indicate > that the condition was there. > There will not be any race in that case. Nonsense on all counts. > I think I am a bit blind, probably parts of Leonids are still getting > into my brain, but there is one syscall called kevent_ctl() which adds > different events, including timer, signal, socket and others. You are searching for callbacks and if none is found you return EINVAL. This is exactly the same as if you'd create separate syscalls. Perhaps even worse, I really don't like demultiplexers, separate syscalls are much cleaner. Avoiding these callbacks would help reduce the kernel interface, especially for this timer implementation, which is useless since it is inferior. > I can replace with -ENOSYS if you like. It's necessary since we must be able to distinguish the errors. > No one asked or paid me to create kevent, but it is done. > Probably not the way some people wanted, but that always happens; > it is really not that bad. Nobody says that the work isn't appreciated. But if you don't want it to be critiqued, don't publish it. If you don't want to make any more changes, fine, say so. I'll find somebody else to do it or will do it myself. I claim that I know a thing or two about the interfaces that runtime programs expect to use. And I know POSIX and the way the interfaces are designed and how they interact. > Ulrich, tell me the truth, will you kill me if I say that I have an entry > in TODO to implement different AIO design (details for interested readers > can be found in my blog), and then present it to the community? :)) I don't care about the kernel implementation as long as the interface is compatible with what I need for the POSIX AIO implementation. The currently proposed code is going in that direction. Any implementation which, like Ben's old one, does not allow POSIX AIO to be implemented I will of course oppose. >> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ >> values. In the code which handles setup of timers (the timer_create >> syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., >> call into the code to register the event source, just like you'd do with >> the current interface. Then add the code to post an event to the event >> queue where currently signals would be sent et voilà. > > Ok, I see. > It is doable and simple. > I will try to implement it tomorrow. Thanks, that's progress. And yes, I imagine it's not hard which is why the currently proposed timer interface is so unnecessary. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
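For readers following along, the race being argued about here is the classic one pselect() was added for, and it exists with any "unblock, wait, reblock" sequence done in separate steps; a minimal sketch using only standard POSIX calls (signal handler installation omitted):

	#include <signal.h>
	#include <sys/select.h>

	volatile sig_atomic_t got_sig;	/* set by a signal handler elsewhere */

	/* Racy: the signal can arrive after the check but before select()
	 * sleeps; select() then blocks even though got_sig is already set. */
	int wait_racy(int fd, fd_set *rfds)
	{
		if (got_sig)
			return -1;
		return select(fd + 1, rfds, NULL, NULL, NULL);
	}

	/* Race-free: keep the signal blocked except while sleeping;
	 * pselect() installs 'unblocked' and starts sleeping atomically,
	 * so a pending signal always interrupts the call with EINTR. */
	int wait_safe(int fd, fd_set *rfds, const sigset_t *unblocked)
	{
		if (got_sig)
			return -1;
		return pselect(fd + 1, rfds, NULL, NULL, NULL, unblocked);
	}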
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper @ 2006-11-22 10:38 ` Evgeniy Polyakov 2006-11-22 22:22 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:38 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Threads are parked in syscalls - which one should be interrupted? > > It doesn't matter, use the same policy you use when waking a thread in > case of an event. This is not about waking a specific thread, it's > about not dropping the event notification. The event notification is not dropped - the thread was awakened, the kernel's task is completed. The kernel does not know, and should not know, that the selected thread was not good enough. If you want to wake up another thread - create another event; that is why I proposed userspace notifications, which I actually do not like. > >And what if there were no threads waiting in syscalls? > > This is fine, do nothing. It means that the other threads are about to > read the ring buffer and will pick up the event. > > > The case which must be avoided is that of all threads being in the > kernel, one thread gets woken, and then is canceled. Without notifying > the kernel about the cancellation and in the absence of further event > notifications the process is deadlocked. > > A second case which should be avoided is that there is a thread waiting > when a thread gets canceled and there are one or more additional threads > around, but not in the kernel. But those other threads might not get to > the ring buffer anytime soon, so handling the event is unnecessarily > delayed. If those threads are not in the kernel, the kernel cannot wake them up. But if there is an event like 'wake me up when a thread has died', then when new threads try to sleep in the syscall, they will be immediately awakened, since that event will be ready. > >It has completely nothing to do with the syscall. > >You register a timer to wait until 10:15, that is all. > > That's a nonsense argument. In this case you would not add any timeout > parameter at all. Of course nobody would want that since it's simply > too slow. Stop thinking about the absolute timeout as an exceptional > case, it might very well not be for some problems. I repeat - the timeout is needed to tell the kernel the maximum possible timeframe the syscall can live. When you tell me why you want the syscall to be interrupted when some absolute time is on the clock instead of having a special event for that, then ok. I think I know why you want absolute time there - because glibc converts most of the timeouts to absolute time since the POSIX waiting function pthread_cond_timedwait() works only with it. > Besides, I've already mentioned another case where a struct timespec* > parameter is needed. There are even two different relative timeouts: > using the monotonic clock or using the realtime clock. The latter is > affected by gettimeofday and ntp. Kevent converts it to jiffies since it uses wait_event() and friends; jiffies do not carry information about which clock to use. > >>>The kernel uses relative timeouts. >> Look again. This time at the implementation. For FUTEX_LOCK_PI the > >>timeout is an absolute timeout. > > > >How come? It just uses timespec. > > Correct, it's using the value passed in. > > > >>The signal mask handling is orthogonal to all this and must be explicit. > >> In some cases explicit pthread_sigmask/sigprocmask calls. But this is > >>not atomic if a signal must be masked/unmasked for the *_wait call. > >>This is why we have variants like pselect/ppoll/epoll_pwait which > >>explicitly and *atomically* change the signal mask for the duration of > >>the call. > > > >You probably missed the kevent signal patch - the signal will not be delivered > >(in special cases) since it will not be copied into the signal mask. The system > >just will not know that it happened. Completely. Like putting it into > >the blocked mask. > > > I don't really understand what you want to say here. > > I looked over the patch and I don't think I miss anything. You just > deliver the signal as an event. No signal mask handling at all. This > is exactly the problem. Have you seen specific_send_sig_info(): /* Short-circuit ignored signals. */ if (sig_ignored(p, sig)) { ret = 1; goto out; } Almost the same happens when a signal is delivered using kevent (special case) - the pending mask is not updated. > >But it is completely irrelevant with kevent signals - there is no race > >for that case when the signal is delivered through a file descriptor. > > Of course there is a race. You might not want the signal delivered. > This is what the signal mask is for. Or the other way around, as I've > said before. Then ignore that event - there is no race between signal delivery and reading other descriptors, and there _is_ one when the signal is delivered not through the same queue but asynchronously with a mask update. > >It is much better to not know how a thing works than to not be able > >to understand how new things can work. > > Well, this explains why you don't understand signal masks at all. Nice :) I at least try to do something to solve this problem, instead of blindly saying the same again and again without even trying to hear and understand what others say. > >Add kevent signal and do not process that event. > > That's not only a horrible hack, it does not work. If I want to ignore > a signal for the duration of the call, while you have it occasionally > blocked for the rest of the program, you would have to register the > kevent for the signal, unblock the signal, make the kevent_wait call, reset > the mask, remove the kevent for the signal... Otherwise it would not be > delivered to be ignored. And then you have a race, the same race > pselect is designed to prevent. In fact, you have two races. > > There are other scenarios like this. Fact is, signal mask handling is > necessary and it cannot be folded into the event handling, it's orthogonal. You take too narrow a view. Look broader - pselect() has a signal mask to prevent the race between async signal delivery and file descriptor readiness. With kevent both of those events are delivered through the same queue, so there is no race, so kevent syscalls do not need that workaround for a 20-year-old design which cannot handle events other than fds. > >Having a special type of kevent signal is the same as putting the signal into > >the blocked mask, but the signal event will be marked as ready - to indicate > >that the condition was there. > >There will not be any race in that case. > > Nonsense on all counts. > > > >I think I am a bit blind, probably parts of Leonids are still getting > >into my brain, but there is one syscall called kevent_ctl() which adds > >different events, including timer, signal, socket and others. > > You are searching for callbacks and if none is found you return EINVAL. > This is exactly the same as if you'd create separate syscalls. > Perhaps even worse, I really don't like demultiplexers, separate > syscalls are much cleaner. > > Avoiding these callbacks would help reduce the kernel interface, > especially for this timer implementation, which is useless since it is inferior. You completely do not want to understand how kevents work and why they are needed; if you tried to accept that there are opinions different from yours, then probably we could make some progress. Those callbacks are needed to support different types of objects, which can produce events, with the same interface. > >I can replace with -ENOSYS if you like. > > It's necessary since we must be able to distinguish the errors. And what if the user requests a bogus event type - is it an invalid condition, or normal but not handled (thus enosys)? > >No one asked or paid me to create kevent, but it is done. > >Probably not the way some people wanted, but that always happens; > >it is really not that bad. > > Nobody says that the work isn't appreciated. But if you don't want it > to be critiqued, don't publish it. If you don't want to make any more > changes, fine, say so. I'll find somebody else to do it or will do it > myself. I greatly appreciate criticism, really. But when it comes to 'this sucks because it sucks; no matter that it is done a completely different way, it still sucks because others sucked there too' I cannot call it criticism, it becomes nonsense. > I claim that I know a thing or two about the interfaces that runtime > programs expect to use. And I know POSIX and the way the interfaces are > designed and how they interact. Well, then I claim that I do not know 'a thing or two about the interfaces that runtime programs expect to use', but instead I write those programs and I know my needs. And POSIX interfaces are the last ones I would prefer to use. We are in different positions - theoretical thoughts about world happiness, and practical usage. I do not say that only one of those approaches must exist, they both can live together, but it requires that people on both sides not just say that the other side is stupid and does not know something, but instead try to listen and take that into account. > >Ulrich, tell me the truth, will you kill me if I say that I have an entry > >in TODO to implement different AIO design (details for interested readers > >can be found in my blog), and then present it to the community? :)) > > I don't care about the kernel implementation as long as the interface is > compatible with what I need for the POSIX AIO implementation. The > currently proposed code is going in that direction. Any implementation > which, like Ben's old one, does not allow POSIX AIO to be implemented I > will of course oppose. What if it is not called POSIX AIO, but instead some kind of 'true AIO' or 'real AIO' or maybe 'alternative AIO'? :) It is quite certain that POSIX AIO interfaces are unlikely to apply there... > >>Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ > >>values. In the code which handles setup of timers (the timer_create > >>syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., > >>call into the code to register the event source, just like you'd do with > >>the current interface. Then add the code to post an event to the event > >>queue where currently signals would be sent et voilà. > > > >Ok, I see. > >It is doable and simple. > >I will try to implement it tomorrow. > > Thanks, that's progress. And yes, I imagine it's not hard which is why > the currently proposed timer interface is so unnecessary. It is the first technical rather than political problem we have caught in this endless discussion; I have already separated it into a different subthread. Let's try to think more about it there. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 10:38 ` Evgeniy Polyakov @ 2006-11-22 22:22 ` Ulrich Drepper 2006-11-23 12:18 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 22:22 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > The event notification is not dropped - [...] Since you said you added the new syscall I'll leave this alone. > I repeat - the timeout is needed to tell the kernel the maximum possible > timeframe the syscall can live. When you tell me why you want the syscall > to be interrupted when some absolute time is on the clock instead of > having a special event for that, then ok. This goes together with... > I think I know why you want absolute time there - because glibc converts > most of the timeouts to absolute time since the POSIX waiting function > pthread_cond_timedwait() works only with it. I did not make the decision to use absolute timeouts/deadlines. This is what is needed in many situations. It's the more general way to specify delays. These are real-world requirements which were taken into account when designing the interfaces. For most cases I would agree that when doing AIO you need relative timeouts. But the event handling is not about AIO alone. It's all kinds of events and some/many are wall clock related. And it is definitely necessary in some situations to be able to interrupt if the clock jumps ahead. If a program deals with devices in the real world this can be crucial. The new event handling must be generic enough to accommodate all these uses and using struct timespec* plus possibly flags does not add any measurable overhead so there is no reason to not do it right. > Kevent converts it to jiffies since it uses wait_event() and friends; > jiffies do not carry information about which clock to use. Then this points to a place in the implementation which needs changing. The interface cannot be restricted just because this is all the current implementation allows. > /* Short-circuit ignored signals. */ > if (sig_ignored(p, sig)) { > ret = 1; > goto out; > } > > Almost the same happens when a signal is delivered using kevent (special > case) - the pending mask is not updated. Yes, and how do you set the signal mask atomically with respect to registering and unregistering signals with kevent and the syscall itself? You cannot. But this is exactly what is resolved by adding the signal mask parameter. Programs which don't need the functionality simply pass a NULL pointer and the cost is once again not measurable. But don't restrict the functionality just because you don't see a use for this in your small world. Yes, we could (later again) add new syscalls. But this is plain stupid. I would love to never have the epoll_wait or select syscall and just have epoll_pwait and pselect since the functionality is a superset. As it is, we have a larger kernel ABI. Here we can stop making the same mistake again. For the userlevel side we might even have separate interfaces, one with and one without the signal mask parameter. But that's userlevel, both functions would use the same syscall. >> There are other scenarios like this. Fact is, signal mask handling is >> necessary and it cannot be folded into the event handling, it's orthogonal. > > You take too narrow a view. > Look broader - pselect() has a signal mask to prevent the race between async > signal delivery and file descriptor readiness. With kevent both of those > events are delivered through the same queue, so there is no race, so > kevent syscalls do not need that workaround for a 20-year-old design > which cannot handle events other than fds. Your failure to understand the signal model leads to wrong conclusions. There are races, several of them, and you cannot do anything without signal mask parameters. I've explained this before. >> Avoiding these callbacks would help reduce the kernel interface, >> especially for this timer implementation, which is useless since it is inferior. > > You completely do not want to understand how kevents work and why they > are needed; if you tried to accept that there are opinions different from > yours, then probably we could make some progress. I think I know very well how they work by now. > Those callbacks are needed to support different types of objects, which > can produce events, with the same interface. Yes, but it is not necessary to expose all the different types in the userlevel APIs. That's the issue. Reduce the exposure of kernel functionality to userlevel APIs. If you integrate the timer handling into the POSIX timer syscalls the callbacks in your timer patch might not need to be there. At least the enqueue callback, if I remember correctly. All enqueue operations are initiated by timer_create calls which can call the function directly. Removing the callback from the list used by add_ctl will reduce the exposed interface. >>> I can replace with -ENOSYS if you like. >> It's necessary since we must be able to distinguish the errors. > > And what if the user requests a bogus event type - is it an invalid condition, or > normal but not handled (thus enosys)? It's ENOSYS. Just like for system calls. You cannot distinguish completely invalid values from values which are correct only on later kernels. But: the first use is a bug while the latter is not a bug and is needed to write robust and well-performing apps. The former's problems therefore are unimportant. > Well, then I claim that I do not know 'a thing or two about the interfaces that > runtime programs expect to use', but instead I write those programs and I know my > needs. And POSIX interfaces are the last ones I would prefer to use. Well, there it is. You look out for yourself while I make sure that all the bases I can think of are covered. Again, if you don't want to work on the generalization, fine. That's your right. Nobody will think badly of you for doing this. But don't expect that a) I'll not try to change it and b) I'll not object to the changes being accepted as they are. > What if it is not called POSIX AIO, but instead some kind of 'true > AIO' or 'real AIO' or maybe 'alternative AIO'? :) > It is quite certain that POSIX AIO interfaces are unlikely to apply > there... Programmers don't like specialized OS-specific interfaces. AIO users who put up with libaio are rare. The same will happen with any other approach. The Samba use is symptomatic: they need portability even if this costs a minute percentage of performance compared to a highly specialized implementation. There might be some aspects of POSIX AIO which could be implemented better on Linux. But the important part in the name is the 'P'. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
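One concrete reason deadlines are specified as absolute times in interfaces like pthread_cond_timedwait(): a wait that is interrupted and restarted with a relative timeout silently stretches the total time unless the caller recomputes the remainder, while an absolute deadline survives any number of restarts. A small illustration using clock_nanosleep(), which already takes exactly the struct timespec* plus flags shape argued for above:

	#include <time.h>
	#include <errno.h>

	/* Sleep until an absolute CLOCK_MONOTONIC deadline.  Signal
	 * interruptions restart the call with the same deadline, so the
	 * total wait never grows - the property a relative timeout loses
	 * unless the caller recomputes it on every EINTR. */
	int sleep_until(const struct timespec *deadline)
	{
		int err;
		do {
			err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
					      deadline, NULL);
		} while (err == EINTR);
		return err;
	}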
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 22:22 ` Ulrich Drepper @ 2006-11-23 12:18 ` Evgeniy Polyakov 2006-11-23 22:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:18 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > >I repeate - timeout is needed to tell kernel the maximum possible > >timeframe syscall can live. When you will tell me why you want syscall > >to be interrupted when some absolute time is on the clock instead of > >having special event for that, then ok. > > This goes together with... > > > >I think I know why you want absolute time there - because glibc converts > >most of the timeouts to absolute time since posix waiting > >pthread_cond_timedwait() works only with it. > > I did not make the decision to use absolute timeouts/deadlines. This is > what is needed in many situations. It's the more general way to specify > delays. These are real-world requirements which were taken into account > when designing the interfaces. > > For most cases I would agree that when doing AIO you need relative > timeouts. But the event handling is not about AIO alone. It's all > kinds of events and some/many are wall clock related. And it is > definitely necessary in some situations to be able to interrupt if the > clock jumps ahead. If a program deals with devices in the real world > this be crucial. The new event handling must be generic enough to > accommodate all these uses and using struct timespec* plus eventually > flags does not add any measurable overhead so there is no reason to not > do it right. Timeouts are not about AIO or any other event types (there are a lot of them already as you can see), it is only about syscall itself. Please point me to _any_ syscall out there which uses absolute time (except settimeofday() and similar syscalls). > >Kevent convert it to jiffies since it uses wait_event() and friends, > >jiffies do not carry information about clocks to be used. > > Then this points to a place in the implementation which needs changing. > The interface cannot be restricted just because the implementation > currently allow this to be implemented. Btw, do you propose to change all users of wait_event()? Interface is not restricted, it is just different from what you want it to be, and you did not show why it requires changes. Btw, there are _no_ interfaces similar to 'wait event with absolute times' in kernel. > > /* Short-circuit ignored signals. */ > > if (sig_ignored(p, sig)) { > > ret = 1; > > goto out; > > } > > > >almost the same happens when signal is delivered using kevent (special > >case) - pending mask is not updated. > > Yes, and how do you set the signal mask atomically wrt to registering > and unregistering signals with kevent and the syscall itself? You > cannot. But this is exactly which is resolved by adding the signal mask > parameter. kevent signal registering is atomic with respect to other kevent syscalls: control syscalls are protected by mutex and waiting syscalls work with queue, which is protected by appropriate lock. > Programs which don't need the functionality simply pass a NULL pointer > and the cost is once again not measurable. But don't restrict the > functionality just because you don't see a use for this in your small world. 
> > Yes, we could (later again) add new syscalls. But this is plain stupid. > I would love to never have had the epoll_wait or select syscall and just > have epoll_pwait and pselect since the functionality is a superset. We > have a larger kernel ABI. Here we can stop making the same mistake again. > > For the userlevel side we might even have separate interfaces, one with > and one without the signal mask parameter. But that's userlevel; both functions > would use the same syscall. Let me formulate the signal problem here; please tell me if it is correct or not. A user registers some async signal notifications and calls poll() waiting for some file descriptors to become ready. When it is interrupted there is no knowledge of what really happened first - the signal was delivered or the file descriptor was ready. Is that correct? In case it is, let me explain why this situation cannot happen with kevent: since signals are not delivered in the old way, but instead are queued into the same queue where file descriptors are, and queueing is atomic, and the pending signal mask is not updated, the user will only read one event after another, which automatically (since delivery is atomic) means that what was read first is what happened first. So why, in the latter situation, do we need to specify a signal mask which will block some signals from _async_ delivery, when there is _no_ async delivery? > >>There are other scenarios like this. Fact is, signal mask handling is > >>necessary and it cannot be folded into the event handling, it's > >>orthogonal. > > > >You have too narrow a look. > >Look broader - pselect() has a signal mask to prevent a race between async > >signal delivery and file descriptor readiness. With kevent both these > >events are delivered through the same queue, so there is no race, so > >kevent syscalls do not need that workaround for a 20-year-old design, > >which cannot handle events other than fd events. > > Your failure to understand the signal model leads to wrong conclusions. > There are races, several of them, and you cannot do anything without > signal mask parameters. I've explained this before. Please refer to my explanation above and point me, in that example, to what we are talking about. It seems we do not understand each other. > >>Avoiding these callbacks would help reduce the kernel interface, > >>especially for this useless, since inferior, timer implementation. > > > >You completely do not want to understand how kevent works and why they > >are needed; if you would try to think that there are opinions different > >from yours, then probably we could have some progress. > > I think I know very well how they work meanwhile. If that were true, I would be very happy. Definitely. > >Those callbacks are needed to support different types of objects, which > >can produce events, with the same interface. > > Yes, but it is not necessary to expose all the different types in the > userlevel APIs. That's the issue. Reduce the exposure of kernel > functionality to userlevel APIs. > > If you integrate the timer handling into the POSIX timer syscalls the > callbacks in your timer patch might not need to be there. At least the > enqueue callback, if I remember correctly. All enqueue operations are > initiated by timer_create calls which can call the function directly. > Removing the callback from the list used by add_ctl will reduce the > exposed interface. I posted a patch to implement kevent support for posix timers; it is quite simple in the existing model.
No need to remove anything; that allows flexibility and the creation of different usage models other than what is required by a fairly small part of the users. > >>>I can replace with -ENOSYS if you like. > >>It's necessary since we must be able to distinguish the errors. > > > >And what if a user requests a bogus event type - is it an invalid condition or > >normal, but not handled (thus ENOSYS)? > > It's ENOSYS. Just like for system calls. You cannot distinguish > completely invalid values from values which are correct only on later > kernels. But: the first use is a bug while the latter is not a bug and is > needed to write robust and well performing apps. The former's problems > therefore are unimportant. I implemented it to return -ENOSYS for the case when the event type is smaller than the maximum allowed and no subsystem is registered, and -EINVAL for the case when the requested type is higher. > >Well, then I claim that I do not know 'a thing or two about interfaces > >the runtime programs expect to use', but instead I write those programs > >and I know my needs. And POSIX interfaces are the last ones I prefer to > >use. > > Well, there it is. You look out for yourself while I make sure that all > the bases I can think of are covered. > > Again, if you don't want to work on the generalization, fine. That's > your right. Nobody will think badly of you for doing this. But don't > expect that a) I'll not try to change it and b) I'll not object to the > changes being accepted as they are. It is not about generalization, but about those who do practical work and those who prefer to spread theoretical thoughts, which results in several months of useless empty discussions. > >What if it will not be called POSIX AIO, but instead some kind of 'true > >AIO' or 'real AIO' or maybe 'alternative AIO'? :) > >It is quite sure that POSIX AIO interfaces are unlikely to be applied > >there... > > Programmers don't like specialized OS-specific interfaces. AIO users > who put up with libaio are rare. The same will happen with any other > approach. The Samba use is symptomatic: they need portability even if > this costs a minute percentage of performance compared to a highly > specialized implementation. Do not speak for everyone - it is not some kind of feudalism with only one opinion allowed - respect those who do not like or do not want what you propose they use. > There might be some aspects of POSIX AIO which could be implemented > better on Linux. But the important part in the name is the 'P'. I will create a completely different model; POSIX is simply not designed for that. That model allows specifying a set of tasks to be performed on an object completely asynchronously to the user before the object is returned - for example, specify a destination socket and filename, so an async sendfile will asynchronously open the file, transfer it to the remote destination and probably even close it (or return the file descriptor). The same can be applied to AIO read/write. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
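For context, the race both sides keep referring to is the classic one that pselect() was added to close; a minimal textbook illustration (standard POSIX usage, not kevent code) follows:

#include <signal.h>
#include <sys/select.h>

/* The racy version: a signal delivered between the sigprocmask()
 * and the select() runs its handler before select() sleeps, so the
 * notification is consumed and select() blocks even though work is
 * pending. */
void racy_wait(int nfds, fd_set *rfds, sigset_t *mask)
{
	sigset_t oldmask;

	sigprocmask(SIG_UNBLOCK, mask, &oldmask);
	/* <-- a signal delivered right here is lost to the wait */
	select(nfds, rfds, NULL, NULL, NULL);
	sigprocmask(SIG_SETMASK, &oldmask, NULL);
}

/* pselect() installs the mask and starts waiting atomically, so
 * the signal can only be delivered while the call sleeps: */
void atomic_wait(int nfds, fd_set *rfds, sigset_t *mask)
{
	pselect(nfds, rfds, NULL, NULL, NULL, mask);
}

Evgeniy's counter-argument above is that a kevent-delivered signal never runs a handler at all; it is queued like any other event, so this particular window does not exist for it. Drepper's reply below is that the mask matters for signals which are not delivered through kevent.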
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 12:18 ` Evgeniy Polyakov @ 2006-11-23 22:23 ` Ulrich Drepper 2006-11-24 10:57 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Timeouts are not about AIO or any other event types (there are a lot of > them already as you can see), they are only about the syscall itself. > Please point me to _any_ syscall out there which uses absolute time > (except settimeofday() and similar syscalls). futex(FUTEX_LOCK_PI). > Btw, do you propose to change all users of wait_event()? Which users? > The interface is not restricted, it is just different from what you want it > to be, and you did not show why it requires changes. No, it is restricted because I cannot express something like an absolute timeout/deadline. If the parameter were a struct timespec* then at any time we could implement relative timeouts w/ and w/out observance of settimeofday/ntp as well as absolute timeouts. This is what makes the interface generic and unrestricted while your current version cannot be used for the latter. > kevent signal registering is atomic with respect to other kevent > syscalls: control syscalls are protected by a mutex and waiting syscalls > work with the queue, which is protected by an appropriate lock. It is about atomicity wrt the signal mask manipulation which would have to precede the kevent_wait call and the call itself (and registering a signal for kevent delivery). This is not atomic. > Let me formulate the signal problem here; please tell me if it is correct > or not. There are a myriad of different scenarios; it makes no sense to pick one. The interface must be generic to cover them all, I don't know how often I have to repeat this. > A user registers some async signal notifications and calls poll() waiting > for some file descriptors to become ready. When it is interrupted there > is no knowledge of what really happened first - the signal was delivered > or the file descriptor was ready. The order is unimportant. You change the signal mask, for instance, if the time when a thread is waiting in poll() is the only time when a signal can be handled. Or vice versa, it's the time when signals are not wanted. And these are per-thread decisions. Signal handlers and kevent registrations for signals are process-wide decisions. And furthermore: with kevent-delivered signals there is no signal mask anymore (at least you seem to not check it). Even if this were done it wouldn't change the fact that you cannot use signals the way many programs want to. Fact is that without a signal queue you cannot implement the above cases. You cannot block/unblock a signal for a specific thread. You also cannot work together with signals which cannot be delivered through kevent. This is the case for existing code in a program which happens to also use kevent and it is the case if there is more than one possible recipient. With kevent, signals can be attached to one kevent queue only but the recipients (different threads or only different parts of a program) need not use the same kevent queue. I've said from the start that you cannot possibly expect that programs are not using signal delivery in the current form.
And the complete loss of blocking signals for individual threads makes the kevent-based signal delivery incomplete (in a non-fixable form) anyway. > In case it is, let me explain why this situation cannot happen with > kevent: since signals are not delivered in the old way, but instead are > queued into the same queue where file descriptors are, and queueing > is atomic, and the pending signal mask is not updated, the user will only read > one event after another, which automatically (since delivery is atomic) > means that what was read first is what happened first. This really has nothing to do with the problem. > I posted a patch to implement kevent support for posix timers; it is > quite simple in the existing model. No need to remove anything, Surely you don't suggest keeping your original timer patch? > I implemented it to return -ENOSYS for the case when the event type is > smaller than the maximum allowed and no subsystem is registered, and -EINVAL > for the case when the requested type is higher. What is the "maximum allowed"? ENOSYS must be returned for all values which could potentially in future be used as a valid type value. If you limit the values which are treated this way you are setting a fixed upper limit for the type values which can _ever_ be used. > It is not about generalization, but about those who do practical work > and those who prefer to spread theoretical thoughts, which results in > several months of useless empty discussions. I've told you: then don't work on these parts. I'll get the changes I think are needed implemented by somebody else or I'll do it myself. If you say that only those who implement something have a say in the way this is done then this is fine with me. But you have to realize that you're not the one who will make all the final decisions. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 22:23 ` Ulrich Drepper @ 2006-11-24 10:57 ` Evgeniy Polyakov 2006-11-27 19:12 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 10:57 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 02:23:12PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper > >(drepper@redhat.com) wrote: > >Timeouts are not about AIO or any other event types (there are a lot of > >them already as you can see), they are only about the syscall itself. > >Please point me to _any_ syscall out there which uses absolute time > >(except settimeofday() and similar syscalls). > > futex(FUTEX_LOCK_PI). It just sets an hrtimer with absolute time and sleeps - it can achieve the same goals using a mechanism similar to wait_event(). > >Btw, do you propose to change all users of wait_event()? > > Which users? Any users which use wait_event() or schedule_timeout(). Futex, for example - it lives perfectly OK with relative timeouts provided to schedule_timeout() - the same (roughly speaking, of course) is done in kevent. > >The interface is not restricted, it is just different from what you want it > >to be, and you did not show why it requires changes. > > No, it is restricted because I cannot express something like an absolute > timeout/deadline. If the parameter were a struct timespec* then at > any time we could implement relative timeouts w/ and w/out > observance of settimeofday/ntp as well as absolute timeouts. This is what > makes the interface generic and unrestricted while your current version > cannot be used for the latter. I think I have said several times already that absolute timeouts are not related to the syscall execution process. But you seem not to hear me and insist. Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct timespec' as the timeout parameter. A special bit in flags will result in an additional timer setup which will fire at the absolute timeout and will wake up those who wait... > >kevent signal registering is atomic with respect to other kevent > >syscalls: control syscalls are protected by a mutex and waiting syscalls > >work with the queue, which is protected by an appropriate lock. > > It is about atomicity wrt the signal mask manipulation which would > have to precede the kevent_wait call and the call itself (and > registering a signal for kevent delivery). This is not atomic. If the signal mask is updated from userspace it should be done through kevent - adding/removing different kevent signals. The pending signal mask is not updated for special kevent signals. > >Let me formulate the signal problem here; please tell me if it is correct > >or not. > > There are a myriad of different scenarios; it makes no sense to pick > one. The interface must be generic to cover them all, I don't know how > often I have to repeat this. The whole signal mask was added by POSIX exactly for that single practical race in the event dispatching mechanism, which cannot handle other types of events like signals. > >A user registers some async signal notifications and calls poll() waiting > >for some file descriptors to become ready. When it is interrupted there > >is no knowledge of what really happened first - the signal was delivered > >or the file descriptor was ready. > > The order is unimportant.
You change the signal mask, for instance, if > the time when a thread is waiting in poll() is the only time when a > signal can be handled. Or vice versa, it's the time when signals are > not wanted. And these are per-thread decisions. > > Signal handlers and kevent registrations for signals are process-wide > decisions. And furthermore: with kevent-delivered signals there is no > signal mask anymore (at least you seem to not check it). Even if this > were done it wouldn't change the fact that you cannot use signals the > way many programs want to. There is a major contradiction here - you say that programmers will use old-style signal delivery and want me to add a signal mask to prevent that delivery, so signals would be in the blocked mask; when I say that the current kevent signal delivery does not update the pending signal mask, which is the same as putting signals into the blocked mask, you say that it is not what is required. > Fact is that without a signal queue you cannot implement the above > cases. You cannot block/unblock a signal for a specific thread. You > also cannot work together with signals which cannot be delivered through > kevent. This is the case for existing code in a program which happens > to also use kevent and it is the case if there is more than one possible > recipient. With kevent, signals can be attached to one kevent queue only > but the recipients (different threads or only different parts of a > program) need not use the same kevent queue. The signal queue is replaced with the kevent queue, and it is in sync with all other kevents. Programmers who want to use kevents will use kevents (if a miracle happens and we agree that kevent is good for inclusion), and programmers will know how kevent signal delivery works. > I've said from the start that you cannot possibly expect that programs > are not using signal delivery in the current form. And the complete > loss of blocking signals for individual threads makes the kevent-based > signal delivery incomplete (in a non-fixable form) anyway. Having a sigmask parameter is the same as creating kevent signal delivery. And, btw, programmers can change the signal mask before calling the syscall, since in the syscall there is a gap between the start and the sigprocmask() call. > >In case it is, let me explain why this situation cannot happen with > >kevent: since signals are not delivered in the old way, but instead are > >queued into the same queue where file descriptors are, and queueing > >is atomic, and the pending signal mask is not updated, the user will only read > >one event after another, which automatically (since delivery is atomic) > >means that what was read first is what happened first. > > This really has nothing to do with the problem. It is the only practical example of the need for that signal mask. And it can be perfectly handled by kevent. > >I posted a patch to implement kevent support for posix timers; it is > >quite simple in the existing model. No need to remove anything, > > Surely you don't suggest keeping your original timer patch? Of course not - kevent timers are more scalable than posix timers (the latter uses idr, which is slower than a balanced binary tree, since it looks like it uses a radix-tree-like algorithm), and the POSIX interface is much, much more inconvenient to use than simple add/wait. > >I implemented it to return -ENOSYS for the case when the event type is > >smaller than the maximum allowed and no subsystem is registered, and -EINVAL > >for the case when the requested type is higher. > > What is the "maximum allowed"?
ENOSYS must be returned for all values > which could potentially in future be used as a valid type value. If you > limit the values which are treated this way you are setting a fixed > upper limit for the type values which can _ever_ be used. The upper limit is for the current version - when a new type is added the limit is increased - just like the maximum number of syscalls. Ok, I will use -ENOSYS for all cases. > >It is not about generalization, but about those who do practical work > >and those who prefer to spread theoretical thoughts, which results in > >several months of useless empty discussions. > > I've told you: then don't work on these parts. I'll get the changes I > think are needed implemented by somebody else or I'll do it myself. If > you say that only those who implement something have a say in the way > this is done then this is fine with me. But you have to realize that > you're not the one who will make all the final decisions. Because our empty discussion seems to never end, which puts kevent into a hung state - I definitely prefer the final word about inclusion or rejection of kevent to be said by the kernel maintainers, but they keep silent since they look not only for my decision as the author, but also for the different opinions of the potential users. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
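A rough sketch of what the agreed-upon timeout change could look like (the flag name and both prototypes are invented here for illustration; the actual take25 code may spell this differently):

/* Hypothetical extended wait syscall: NULL timeout means block
 * forever, a plain timespec is a relative timeout, and an assumed
 * KEVENT_FLAGS_ABSTIME bit turns it into an absolute deadline. */
#define KEVENT_FLAGS_ABSTIME	0x1

asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int min_nr,
		unsigned int max_nr, struct timespec __user *timeout,
		unsigned int flags);

/* Userspace, relative: wait up to half a second for one event. */
struct timespec ts = { .tv_sec = 0, .tv_nsec = 500000000 };
kevent_wait(fd, 1, 32, &ts, 0);

/* Userspace, absolute: wait until a CLOCK_REALTIME deadline,
 * interrupted if the clock is set past it. */
clock_gettime(CLOCK_REALTIME, &ts);
ts.tv_sec += 5;
kevent_wait(fd, 1, 32, &ts, KEVENT_FLAGS_ABSTIME);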
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-24 10:57 ` Evgeniy Polyakov @ 2006-11-27 19:12 ` Ulrich Drepper 2006-11-28 11:00 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:12 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > It just sets an hrtimer with absolute time and sleeps - it can achieve the same > goals using a mechanism similar to wait_event(). I don't follow. Of course it is somehow possible to wait until an absolute deadline. But it's not part of the parameter list and hence not easily and _quickly_ usable. >>> Btw, do you propose to change all users of wait_event()? >> Which users? > > Any users which use wait_event() or schedule_timeout(). Futex, for > example - it lives perfectly OK with relative timeouts provided to > schedule_timeout() - the same (roughly speaking, of course) is done in kevent. No, it does not live perfectly OK with relative timeouts. The userlevel implementation is actually wrong because of this in subtle ways. Some futex interfaces take absolute timeouts and they have to be interrupted if the realtime clock is set forward. Also, the calls are complicated and slow because the userlevel wrapper has to call clock_gettime/gettimeofday before each futex syscall. If the kernel accepted absolute timeouts as well we would save a syscall and actually have a correct implementation. > I think I have said several times already that absolute timeouts are not > related to the syscall execution process. But you seem not to hear me and > insist. Because you're wrong. For your use cases it might not be but it's not true in general. And your interface is preventing it from being implemented forever. > Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct > timespec' as the timeout parameter. A special bit in flags will result in an > additional timer setup which will fire at the absolute timeout and will > wake up those who wait... Thanks a lot. >>> kevent signal registering is atomic with respect to other kevent >>> syscalls: control syscalls are protected by a mutex and waiting syscalls >>> work with the queue, which is protected by an appropriate lock. >> It is about atomicity wrt the signal mask manipulation which would >> have to precede the kevent_wait call and the call itself (and >> registering a signal for kevent delivery). This is not atomic. > > If the signal mask is updated from userspace it should be done through > kevent - adding/removing different kevent signals. Indeed, this is what I've been saying and why ppoll/pselect/epoll_pwait take the sigset_t parameter. Adding the signal mask to the queued events (e.g., the signal events) does not work. First of all it's slow, you'd have to find and combine all masks at least every time a signal event is added/removed. Then how do you combine them, OR or AND? Not all threads might want/need the same signal mask. These are just some of the usability problems. The only clean and usable solution is really to OPTIONALLY pass in the signal mask. Nobody forces anybody to use this feature. Pass a NULL pointer and nothing happens; this is how the other syscalls also work. > The whole signal mask was added by POSIX exactly for that single > practical race in the event dispatching mechanism, which cannot handle > other types of events like signals. No. How should this argument make sense?
Signals cannot be used in the current event handling and are therefore used for something completely different. And they will have to be used like this for many applications (e.g., thread cancellation, setuid/setgid implementation, etc.). The fact that the new event handling can handle signals is orthogonal (and good). But it does not supersede the old signal use, it's something new. The old uses are still valid. BTW: there is a little design decision which has to be made: if a signal is registered with kevent and this signal is sent to a specific thread instead of the process (tkill and tgkill), what should happen? I'm currently leaning toward failing the tkill/tgkill syscall if delivery of the signal requires posting to an event queue. > There is a major contradiction here - you say that programmers will use > old-style signal delivery and want me to add a signal mask to prevent that > delivery, so signals would be in the blocked mask, That's one thing you can do. You also can unblock signals. > when I say that the current kevent > signal delivery does not update the pending signal mask, which is the same as > putting signals into the blocked mask, you say that it is not what is > required. First, what is the "pending signal mask"? There is one signal mask per thread. And "pending" refers to delivery (either per-process or per-thread), which is not the signal mask (well, for non-RT signals it can be a bitmap but this still is not a mask). Second, I'm not talking about signal delivery. Yes, sigaction allows specifying how the signal mask is to be changed when a signal is delivered. But this is not what I'm talking about. I'm talking about the signal mask used for the duration of the kevent_wait syscall, regardless of whether signals are waited for or delivered. > The signal queue is replaced with the kevent queue, and it is in sync with all > other kevents. But the signal mask is something completely different and completely independent from the signal queue. There is nothing in the kevent interface to replace that functionality. Nor should this be possible with the events; only a sigset_t parameter to kevent_wait makes sense. > Having a sigmask parameter is the same as creating kevent signal delivery. No, no, no. Not at all. >> Surely you don't suggest keeping your original timer patch? > > Of course not - kevent timers are more scalable than posix timers (the > latter uses idr, which is slower than a balanced binary tree, since it > looks like it uses a radix-tree-like algorithm), and the POSIX interface is > much, much more inconvenient to use than simple add/wait. I assume you misread the question. You agree to drop the patch and then go on listing reasons why you think it's better to keep it. I don't think these arguments are in any way sufficient. The interface is already too big and this is 100% duplicate functionality. If there are performance problems with the POSIX timer implementation (and I have yet to see indications) they should be fixed instead of worked around. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
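Drepper's point about wrapper overhead can be made concrete with a simplified sketch of what userlevel code must do when the kernel takes only relative timeouts (illustrative only, not the actual glibc code):

#include <errno.h>
#include <time.h>

/* Convert an absolute CLOCK_REALTIME deadline into a relative
 * timeout.  The clock_gettime() is an extra syscall on every wait,
 * and if the realtime clock is set forward after it returns, the
 * computed relative timeout is silently too long - exactly the
 * correctness problem absolute kernel timeouts would remove. */
static int abs_to_rel(const struct timespec *abstime, struct timespec *rel)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	rel->tv_sec = abstime->tv_sec - now.tv_sec;
	rel->tv_nsec = abstime->tv_nsec - now.tv_nsec;
	if (rel->tv_nsec < 0) {
		rel->tv_nsec += 1000000000;
		rel->tv_sec--;
	}
	if (rel->tv_sec < 0)
		return ETIMEDOUT;	/* deadline already passed */
	return 0;	/* caller now issues the relative-timeout syscall */
}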
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-27 19:12 ` Ulrich Drepper @ 2006-11-28 11:00 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 11:00 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 11:12:21AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >It just sets an hrtimer with absolute time and sleeps - it can achieve the same > >goals using a mechanism similar to wait_event(). > > I don't follow. Of course it is somehow possible to wait until an > absolute deadline. But it's not part of the parameter list and hence > not easily and _quickly_ usable. I just described how it is implemented in futex. I will create the same approach - an hrtimer which will wake up wait_event() with an infinite timeout. > >>>Btw, do you propose to change all users of wait_event()? > >>Which users? > > > >Any users which use wait_event() or schedule_timeout(). Futex, for > >example - it lives perfectly OK with relative timeouts provided to > >schedule_timeout() - the same (roughly speaking, of course) is done in kevent. > > No, it does not live perfectly OK with relative timeouts. The userlevel > implementation is actually wrong because of this in subtle ways. Some > futex interfaces take absolute timeouts and they have to be interrupted > if the realtime clock is set forward. > > Also, the calls are complicated and slow because the userlevel wrapper > has to call clock_gettime/gettimeofday before each futex syscall. If > the kernel accepted absolute timeouts as well we would save a > syscall and actually have a correct implementation. It is only done for the LOCK_PI case, which was specially created to have an absolute timeout, i.e. futex does not need it, but there is the option. I will extend the waiting syscalls to take a timespec and an absolute timeout; I just want to stop this (I hope you agree) stupid endless arguing about a completely unimportant thing. > >I think I have said several times already that absolute timeouts are not > >related to the syscall execution process. But you seem not to hear me and > >insist. > > Because you're wrong. For your use cases it might not be but it's not > true in general. And your interface is preventing it from being > implemented forever. Because I'm right and it will not be used :) Well, it does not matter anymore, right? > >Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct > >timespec' as the timeout parameter. A special bit in flags will result in an > >additional timer setup which will fire at the absolute timeout and will > >wake up those who wait... > > Thanks a lot. No problem - I always like to spend a couple of months arguing about taste and 'right-from-my-point-of-view' theories - isn't it the best way to waste time? ... > >Having a sigmask parameter is the same as creating kevent signal delivery. > > No, no, no. Not at all. I've dropped a lot, but let me describe the signal mask problem in a few words: the signal mask provided to sys_pselect() and friends is a mask of signals which will be put into the blocked mask in the task structure in the kernel. When a new signal is about to be delivered, the signal number is checked against the blocked mask, and if it is there, the signal is not put into the pending mask of signals, which ends up with it not being delivered to userspace.
Kevent (with a special flag) does exactly the same - but it does not update the blocked mask; instead it adds another check for whether the signal is in the kevent set of requests, in which case the signal is delivered to userspace through the kevent queue. It is _exactly_ the same behaviour from the userspace point of view concerning the race of signal delivery versus file descriptor readiness. Exactly. Here is a code snippet:

specific_send_sig_info()
{
	...
	/* Short-circuit ignored signals. */
	if (sig_ignored(t, sig))
		goto out;
	...
	ret = send_signal(sig, info, t, &t->pending);
	if (!ret && !sigismember(&t->blocked, sig))
		signal_wake_up(t, sig == SIGKILL);
#ifdef CONFIG_KEVENT_SIGNAL
	/*
	 * Kevent allows delivering signals through the kevent queue;
	 * it is possible to set up kevent to not deliver a signal
	 * through the usual way, in which case send_signal()
	 * returns 1 and the signal is delivered only through the
	 * kevent queue. We simulate successful delivery notification
	 * through this hack:
	 */
	if (ret == 1)
		ret = 0;
#endif
out:
	return ret;
}

> >>Surely you don't suggest keeping your original timer patch? > > > >Of course not - kevent timers are more scalable than posix timers (the > >latter uses idr, which is slower than a balanced binary tree, since it > >looks like it uses a radix-tree-like algorithm), and the POSIX interface is > >much, much more inconvenient to use than simple add/wait. > > I assume you misread the question. You agree to drop the patch and then > go on listing reasons why you think it's better to keep it. I don't > think these arguments are in any way sufficient. The interface is > already too big and this is 100% duplicate functionality. If there are > performance problems with the POSIX timer implementation (and I have yet > to see indications) they should be fixed instead of worked around. I do _not_ agree to drop the kevent timer patch (not the posix timer one), since from my point of view it is a much more convenient interface, it is more scalable, and it is generic enough to be used with other kevent methods. But anyway, we can spend an awful lot of time arguing about taste, which is definitely _NOT_ what we want. So, there are two worlds - posix timers and usual timers, accessible from userspace, the first through timer_create() and friends, the second through the kevent interface. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
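The hrtimer approach Evgeniy describes could look roughly like this (a sketch under assumed names: the 'deadline' timer field and the callback are invented, and the constant spellings follow current kernels rather than the 2006 tree; need_exit and the wait queue are from the posted patch):

/* Arm an hrtimer at the absolute deadline; its callback sets the
 * existing need_exit flag and wakes the kevent wait queue, so the
 * wait below can otherwise sleep with no timeout at all. */
static enum hrtimer_restart kevent_deadline_fn(struct hrtimer *timer)
{
	struct kevent_user *u = container_of(timer, struct kevent_user, deadline);

	u->need_exit = 1;
	wake_up(&u->wait);
	return HRTIMER_NORESTART;
}

hrtimer_init(&u->deadline, CLOCK_REALTIME, HRTIMER_MODE_ABS);
u->deadline.function = kevent_deadline_fn;
hrtimer_start(&u->deadline, timespec_to_ktime(ts), HRTIMER_MODE_ABS);
wait_event_interruptible(u->wait, u->ready_num >= min_nr || u->need_exit);
hrtimer_cancel(&u->deadline);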
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2006-11-22 10:38 ` Evgeniy Polyakov @ 2006-11-22 12:09 ` Evgeniy Polyakov 2006-11-22 12:15 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 12:09 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Threads are parked in syscalls - which one should be interrupted? > > It doesn't matter, use the same policy you use when waking a thread in > case of an event. This is not about waking a specific thread, it's > about not dropping the event notification. > > > >And what if there were no threads waiting in syscalls? > > This is fine, do nothing. It means that the other threads are about to > read the ring buffer and will pick up the event. > > > The case which must be avoided is that of all threads being in the > kernel, one thread gets woken, and then is canceled. Without notifying > the kernel about the cancellation and in the absence of further event > notifications the process is deadlocked. > > A second case which should be avoided is that there is a thread waiting > when a thread gets canceled and there are one or more additional threads > around, but not in the kernel. But those other threads might not get to > the ring buffer anytime soon, so handling the event is unnecessarily > delayed. Ok, to solve the problem in a way which should be good for both, I decided to implement an additional syscall which will allow marking any event as ready and thus waking up the appropriate threads. If userspace requests zero events to be marked as ready, the syscall will just interrupt/wake up one of the listeners parked in the syscall. Peace? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:09 ` Evgeniy Polyakov @ 2006-11-22 12:15 ` Evgeniy Polyakov 2006-11-22 13:46 ` Evgeniy Polyakov 2006-11-22 22:24 ` Ulrich Drepper 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 12:15 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > Ok, to solve the problem in a way which should be good for both, I > decided to implement an additional syscall which will allow marking any > event as ready and thus waking up the appropriate threads. If userspace > requests zero events to be marked as ready, the syscall will just > interrupt/wake up one of the listeners parked in the syscall. Btw, what about putting an additional multiplexer into the add/remove/modify switch? There would be a logical 'ready' add-on. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:15 ` Evgeniy Polyakov @ 2006-11-22 13:46 ` Evgeniy Polyakov 2006-11-22 22:24 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 13:46 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 03:15:16PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Ok, to solve the problem in the way which should be good for both I > > decided to implement additional syscall which will allow to mark any > > event as ready and thus wake up appropriate threads. If userspace will > > request zero events to be marked as ready, syscall will just > > interrupt/wakeup one of the listeners parked in syscall. > > Btw, what about putting aditional multiplexer into add/remove/modify > switch? There will be logical 'ready' addon? Something like this. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/kevent.h b/include/linux/kevent.h index c909c62..7afb3d6 100644 --- a/include/linux/kevent.h +++ b/include/linux/kevent.h @@ -99,6 +99,8 @@ struct kevent_user struct mutex ctl_mutex; /* Wait until some events are ready. */ wait_queue_head_t wait; + /* Exit from syscall if someone wants us to do it */ + int need_exit; /* Reference counter, increased for each new kevent. */ atomic_t refcnt; @@ -132,6 +134,8 @@ void kevent_storage_fini(struct kevent_s int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); +void kevent_ready(struct kevent *k, int ret); + int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); #ifdef CONFIG_KEVENT_POLL diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h index 0680fdf..6bc0c79 100644 --- a/include/linux/ukevent.h +++ b/include/linux/ukevent.h @@ -174,5 +174,6 @@ struct kevent_ring #define KEVENT_CTL_ADD 0 #define KEVENT_CTL_REMOVE 1 #define KEVENT_CTL_MODIFY 2 +#define KEVENT_CTL_READY 3 #endif /* __UKEVENT_H */ diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c index 4d2d878..d1770a1 100644 --- a/kernel/kevent/kevent.c +++ b/kernel/kevent/kevent.c @@ -91,10 +91,10 @@ int kevent_init(struct kevent *k) spin_lock_init(&k->ulock); k->flags = 0; - if (unlikely(k->event.type >= KEVENT_MAX) + if (unlikely(k->event.type >= KEVENT_MAX)) return kevent_break(k); - if (!kevent_registered_callbacks[k->event.type].callback)) { + if (!kevent_registered_callbacks[k->event.type].callback) { kevent_break(k); return -ENOSYS; } @@ -142,16 +142,10 @@ void kevent_storage_dequeue(struct keven spin_unlock_irqrestore(&st->lock, flags); } -/* - * Call kevent ready callback and queue it into ready queue if needed. - * If kevent is marked as one-shot, then remove it from storage queue. - */ -static int __kevent_requeue(struct kevent *k, u32 event) +void kevent_ready(struct kevent *k, int ret) { - int ret, rem; unsigned long flags; - - ret = k->callbacks.callback(k); + int rem; spin_lock_irqsave(&k->ulock, flags); if (ret > 0) @@ -178,6 +172,19 @@ static int __kevent_requeue(struct keven spin_unlock_irqrestore(&k->user->ready_lock, flags); wake_up(&k->user->wait); } +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. 
+ * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret; + + ret = k->callbacks.callback(k); + + kevent_ready(k, ret); return ret; } diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 2cd8c99..3d1ea6b 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -47,8 +47,9 @@ static unsigned int kevent_user_poll(str poll_wait(file, &u->wait, wait); mask = 0; - if (u->ready_num) + if (u->ready_num || u->need_exit) mask |= POLLIN | POLLRDNORM; + u->need_exit = 0; return mask; } @@ -136,6 +137,7 @@ static struct kevent_user *kevent_user_a mutex_init(&u->ctl_mutex); init_waitqueue_head(&u->wait); + u->need_exit = 0; atomic_set(&u->refcnt, 1); @@ -487,6 +489,97 @@ static struct ukevent *kevent_get_user(u return ukev; } +static int kevent_mark_ready(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->st->lock); + kevent_ready(k, 1); + spin_unlock(&k->st->lock); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Mark appropriate kevents as ready. + * If number of events is zero just wake up one listener. + */ +static int kevent_user_ctl_ready(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = -EINVAL, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + if (num > u->kevent_num) + return err; + + if (!num) { + u->need_exit = 1; + wake_up(&u->wait); + return 0; + } + + mutex_lock(&u->ctl_mutex); + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_mark_ready(&ukev[i], u); + if (err) { + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_mark_ready(&uk, u); + if (err) { + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = num - rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + /* * Read from userspace all ukevents and modify appropriate kevents. 
* If provided number of ukevents is more that threshold, it is faster @@ -779,9 +872,10 @@ static int kevent_user_wait(struct file if (!(file->f_flags & O_NONBLOCK)) { wait_event_interruptible_timeout(u->wait, - u->ready_num >= min_nr, + (u->ready_num >= min_nr) || u->need_exit, clock_t_to_jiffies(nsec_to_clock_t(timeout))); } + u->need_exit = 0; while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { if (copy_to_user(buf + num*sizeof(struct ukevent), @@ -819,6 +913,9 @@ static int kevent_ctl_process(struct fil case KEVENT_CTL_MODIFY: err = kevent_user_ctl_modify(u, num, arg); break; + case KEVENT_CTL_READY: + err = kevent_user_ctl_ready(u, num, arg); + break; default: err = -EINVAL; break; @@ -994,9 +1091,10 @@ asmlinkage long sys_kevent_wait(int ctl_ if (!(file->f_flags & O_NONBLOCK)) { wait_event_interruptible_timeout(u->wait, - ((u->ready_num >= 1) && (kevent_ring_space(u))), + ((u->ready_num >= 1) && kevent_ring_space(u)) || u->need_exit, clock_t_to_jiffies(nsec_to_clock_t(timeout))); } + u->need_exit = 0; for (i=0; i<num; ++i) { k = kevent_dequeue_ready_ring(u); -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
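To make the intended use of the new multiplexer concrete, here is a hypothetical userspace view (the kevent_ctl() wrapper signature and the id variables are assumed for illustration, not taken from a posted header):

/* Wake one listener parked in kevent_wait()/kevent_get_events()
 * without marking anything ready (the num == 0 path above): */
kevent_ctl(kevent_fd, KEVENT_CTL_READY, 0, NULL);

/* Mark two previously added events as ready; entries that could
 * not be found are copied back into the buffer, and the return
 * value is the number successfully marked (num - rnum above): */
struct ukevent uk[2];
memset(uk, 0, sizeof(uk));
uk[0].id = first_id;	/* ids saved from KEVENT_CTL_ADD (assumed) */
uk[1].id = second_id;
int marked = kevent_ctl(kevent_fd, KEVENT_CTL_READY, 2, uk);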
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:15 ` Evgeniy Polyakov 2006-11-22 13:46 ` Evgeniy Polyakov @ 2006-11-22 22:24 ` Ulrich Drepper 2006-11-23 12:22 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 22:24 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: >> Ok, to solve the problem in a way which should be good for both, I >> decided to implement an additional syscall which will allow marking any >> event as ready and thus waking up the appropriate threads. If userspace >> requests zero events to be marked as ready, the syscall will just >> interrupt/wake up one of the listeners parked in the syscall. I'll wait for the new code drop to comment. > Btw, what about putting an additional multiplexer into the add/remove/modify > switch? There would be a logical 'ready' add-on. Is it needed? Usually this is done with a *_wait call with a timeout of zero. That code path might have to be optimized but it should already be there. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 22:24 ` Ulrich Drepper @ 2006-11-23 12:22 ` Evgeniy Polyakov 2006-11-23 20:34 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:22 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 02:24:00PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov > >(johnpol@2ka.mipt.ru) wrote: > >>Ok, to solve the problem in a way which should be good for both, I > >>decided to implement an additional syscall which will allow marking any > >>event as ready and thus waking up the appropriate threads. If userspace > >>requests zero events to be marked as ready, the syscall will just > >>interrupt/wake up one of the listeners parked in the syscall. > > I'll wait for the new code drop to comment. I posted it. > >Btw, what about putting an additional multiplexer into the add/remove/modify > >switch? There would be a logical 'ready' add-on. > > Is it needed? Usually this is done with a *_wait call with a timeout of > zero. That code path might have to be optimized but it should already > be there. It does not allow marking events as ready. And the current interfaces wake up either when the timeout is zero (in this case the thread itself does not sleep and can process events), or when there is _new_ work - since there is no _new_ work when the thread awakened to process it was killed, the kernel does not think that something is wrong. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 12:22 ` Evgeniy Polyakov @ 2006-11-23 20:34 ` Ulrich Drepper 2006-11-24 10:58 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:34 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: >>> Btw, what about putting an additional multiplexer into the add/remove/modify >>> switch? There would be a logical 'ready' add-on. >> Is it needed? Usually this is done with a *_wait call with a timeout of >> zero. That code path might have to be optimized but it should already >> be there. > > It does not allow marking events as ready. > And the current interfaces wake up either when the timeout is zero (in this case > the thread itself does not sleep and can process events), or when there is > _new_ work - since there is no _new_ work when the thread awakened to > process it was killed, the kernel does not think that something is wrong. Rather than mark an existing entry as ready, how about a call to inject a new ready event? This would be useful to implement functionality at userlevel and still use an event queue to announce the availability. Without this type of functionality we'd need to use indirect notification via signal or pipe or something like that. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 20:34 ` Ulrich Drepper @ 2006-11-24 10:58 ` Evgeniy Polyakov 2006-11-27 18:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 10:58 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 12:34:50PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>>Btw, what about putting an additional multiplexer into the add/remove/modify > >>>switch? There would be a logical 'ready' add-on. > >>Is it needed? Usually this is done with a *_wait call with a timeout of > >>zero. That code path might have to be optimized but it should already > >>be there. > > > >It does not allow marking events as ready. > >And the current interfaces wake up either when the timeout is zero (in this case > >the thread itself does not sleep and can process events), or when there is > >_new_ work - since there is no _new_ work when the thread awakened to > >process it was killed, the kernel does not think that something is wrong. > > Rather than mark an existing entry as ready, how about a call to inject > a new ready event? > > This would be useful to implement functionality at userlevel and still > use an event queue to announce the availability. Without this type of > functionality we'd need to use indirect notification via signal or pipe > or something like that. With the provided patch it is possible to wake up 'for free' - just call kevent_ctl(ready) with a zero number of ready events, so a thread will be awakened if it was in poll(kevent_fd), kevent_wait() or kevent_get_events(). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-24 10:58 ` Evgeniy Polyakov @ 2006-11-27 18:23 ` Ulrich Drepper 2006-11-28 10:13 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > > With provided patch it is possible to wakeup 'for-free' - just call > kevent_ctl(ready) with zero number of ready events, so thread will be > awakened if it was in poll(kevent_fd), kevent_wait() or > kevent_get_events(). Yes, I realize that. But I wrote something else: >> Rather than mark an existing entry as ready, how about a call to >> inject a new ready event? >> >> This would be useful to implement functionality at userlevel and >> still use an event queue to announce the availability. Without this >> type of functionality we'd need to use indirect notification via >> signal or pipe or something like that. This is still something which is wanted. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-27 18:23 ` Ulrich Drepper @ 2006-11-28 10:13 ` Evgeniy Polyakov 2006-12-27 20:45 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 10:13 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 10:23:39AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > > > >With the provided patch it is possible to wake up 'for free' - just call > >kevent_ctl(ready) with a zero number of ready events, so a thread will be > >awakened if it was in poll(kevent_fd), kevent_wait() or > >kevent_get_events(). > > Yes, I realize that. But I wrote something else: > > >> Rather than mark an existing entry as ready, how about a call to > >> inject a new ready event? > >> > >> This would be useful to implement functionality at userlevel and > >> still use an event queue to announce the availability. Without this > >> type of functionality we'd need to use indirect notification via > >> signal or pipe or something like that. > > This is still something which is wanted. Why do we want to inject a _ready_ event, when it is possible to mark an event as ready and wake up a thread parked in a syscall? > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-28 10:13 ` Evgeniy Polyakov @ 2006-12-27 20:45 ` Ulrich Drepper 2006-12-28 9:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-12-27 20:45 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > Why do we want to inject a _ready_ event, when it is possible to mark > an event as ready and wake up a thread parked in a syscall? Going back to this old one: How do you want to mark an event ready if you don't want to introduce yet another layer of data structures? The event notification happens through entries in the ring buffer. Userlevel code should never add anything to the ring buffer directly, this would mean huge synchronization problems. Yes, one could add additional data structures accompanying the ring buffer which can specify userlevel-generated events. But this is a) clumsy and b) a pain to use when the same ring buffer is used in multiple threads (you'd have to have another shared memory segment). It's much cleaner if the userlevel code can get the kernel to inject a userlevel-generated event. This is the equivalent of userlevel code generating a signal with kill(). -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-12-27 20:45 ` Ulrich Drepper @ 2006-12-28 9:50 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-12-28 9:50 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Dec 27, 2006 at 12:45:50PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > > Why do we want to inject a _ready_ event, when it is possible to mark > > an event as ready and wake up a thread parked in a syscall? > > Going back to this old one: > > How do you want to mark an event ready if you don't want to introduce > yet another layer of data structures? The event notification happens > through entries in the ring buffer. Userlevel code should never add > anything to the ring buffer directly, this would mean huge > synchronization problems. Yes, one could add additional data structures > accompanying the ring buffer which can specify userlevel-generated > events. But this is a) clumsy and b) a pain to use when the same ring > buffer is used in multiple threads (you'd have to have another shared > memory segment). > > It's much cleaner if the userlevel code can get the kernel to inject a > userlevel-generated event. This is the equivalent of userlevel code > generating a signal with kill(). The existing possibility to mark an event as ready works the following way: an event is queued into a storage queue (socket, inode or some other queue); when the readiness condition becomes true, the event is queued into the ready queue (although it is still in the storage queue). This happens completely asynchronously to _any_ kind of userspace processing. When userspace calls the appropriate syscall, the event is copied into the ring buffer. Thus userspace readiness marking will just mark the event as ready, i.e. queue the event into the ready queue, so that later userspace will call a syscall to actually get the event. When one thread is parked in the syscall and there are _no_ events which should be marked as ready (for example only sockets are there, and it is not a good idea to wake up the whole socket processing state machine), then there is no possibility to receive such an event (although it is possible to interrupt and break the syscall). So, as for injecting ready events, it can be done - just the addition of a special flag which will force the kevent core to move an event into the ready queue immediately. In this case userspace can even prepare a needed event (like a signal event) and deliver it to a process, so it will think (from the kevent point of view only) that a real signal has arrived. I will also add a special type of events - userspace events - which will not have empty callbacks and which will be intended for use in a user-defined way (i.e. for inter-thread communication). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ > -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
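A sketch of how the flag-based injection described above might look from userspace (both the event type and the flag name are invented here - take24 has neither, and the later patchsets may spell them differently; the ukevent field names follow the posted ukevent.h layout as far as it is visible in this thread):

/* Add an event that the kernel queues as ready immediately, so a
 * waiter observes a purely userspace-generated notification - the
 * kevent analogue of raising a signal with kill(): */
struct ukevent uk;
memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_UNOTIFY;                    /* assumed user-event type */
uk.req_flags = KEVENT_REQ_READY_IMMEDIATELY; /* assumed flag */
uk.user[0] = 0xdeadbeef;                     /* opaque token echoed back */
kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);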
* [take25 0/6] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> ` (3 preceding siblings ...) 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov 5 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem for handling event notifications. It supports both level- and edge-triggered events. It is similar to poll/epoll in some cases, but it is more scalable, faster, and works with essentially any kind of event. Events are provided to the kernel through a control syscall and can be read back through a ring buffer or using the usual syscalls. A kevent update (i.e. readiness switching) happens directly from the internals of the appropriate state machine of the underlying subsystem (like network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent I installed a slightly used, but still functional, remote mind reader (bought on ebay), and set it up to read Ulrich's alpha brain waves (I hope he agrees that it is a good decision), which took me the whole week. So I think the latest ring buffer implementation is what we all wanted. Details in the documentation part. Changes from 'take24' patchset: * new (old (new)) ring buffer implementation with kernel and user indexes * added initialization syscall instead of opening /dev/kevent * kevent_commit() syscall to commit ring buffer entries * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; kevent always wakes only the first thread if that flag is not set * KEVENT_REQ_ALWAYS_QUEUE flag. If set, a kevent which is ready immediately at addition time will be queued into the ready queue instead of being copied back to userspace. * lighttpd patch (Hail!
Although nothing really outstanding compared to epoll) Changes from 'take23' patchset: * kevent PIPE notifications * KEVENT_REQ_LAST_CHECK flag, which allows performing a last check at dequeueing time * fixed poll/select notifications (were broken due to tree manipulations) * made Documentation/kevent.txt look nice in an 80-col terminal * fix for copy_to_user() failure report for the first kevent (Andrew Morton) * minor function renames Changes from 'take22' patchset: * new ring buffer implementation in process' memory * wakeup-one-thread flag * edge-triggered behaviour Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespaces and so on) * fixed bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed artificial limit on possible number of kevents Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use RB tree instead of hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, which justifies using an RB tree over an AVL tree: the latter has much slower deletion (up to O(log(N)) operations compared to at most 3), although faster search (height 1.44*log(N) vs. 2*log(N)). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed readiness check for socket notifications. With both above changes it is possible to achieve more than 3380 req/second compared to 2200, sometimes 2500 req/second for epoll() for a trivial web server and an httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum number of kevents allowed at a time, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added warning in kevent_add_ukevent() if ring has broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not take the lock around the user data check in __kevent_search() * fail early if there were no registered callbacks for given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use pointer to kevent_mring instead of unsigned longs * use aligned 64bit type in raw user data (can be used by high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for timeout * put number of milliseconds into timer's return data * move some definitions into user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for ready_callback() callback before main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added helper function for kevent_registered_callbacks * fixed 80-column comment issues * added a header shared between userspace and kernelspace instead of embedding the definitions in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use a structure instead - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments!
* do not use list poisoning to detect whether an entry is in the list * return number of ready kevents even if copy*user() fails * strict check for number of kevents in syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed default 1 second timeout * removed AIO stuff from patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught to differentiate between the various cases * remove kevent from storage if it is marked as broken after callback * fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() into locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use an array of callbacks for each type instead of initializing callbacks in each kevent * changed name of ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use kevent_user_ctl structure, instead provide needed arguments as syscall parameters * various indent cleanups * added optimisation aimed to help when a lot of kevents are being copied from userspace * mmapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use old syscall kevent_ctl for creation/removal, modification and initial kevent setup - use mutexes instead of semaphores - added file descriptor check and return error if provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 " Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov ` (3 more replies) 0 siblings, 4 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..49e1cc2
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,230 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size);
+
+ring - pointer to allocated ring buffer
+ring_size - size of the ring buffer in events
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+  unsigned int ring_kidx, ring_uidx, ring_over;
+  struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+ when kevent_wait() or kevent_get_events() is called
+ring_uidx - index of the first entry userspace can start reading from
+ring_over - number of overflows of ring_uidx that happened since the start.
+ The overflow counter is used to prevent a situation where two threads
+ are going to free the same events, but one of them was scheduled
+ away for too long, so the ring indexes wrapped around, and when that
+ thread is finally awakened, it would free events other than those it
+ was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into the special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready; if it is being
+removed, no one cares about it (otherwise the user would have waited until it
+became ready and fetched it the usual way, using kevent_get_events() or
+kevent_wait()), and thus there is no need to copy it to the ring buffer.
+
+-------------------------------------------------------------------------------
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - the file descriptor referring to the kevent queue to manipulate,
+ as returned by kevent_init().
+
+cmd - the requested operation. It can be one of the following:
+ KEVENT_CTL_ADD - add event notification
+ KEVENT_CTL_REMOVE - remove event notification
+ KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
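+An abbreviated setup sketch (error handling omitted; the wrappers are thin
+syscall(2) wrappers around the new syscalls, using the i386 numbers from
+this patchset, since there is no glibc support yet):
+
+	#include <stdlib.h>
+	#include <string.h>
+	#include <unistd.h>
+	#include <sys/syscall.h>
+	#include <sys/socket.h>
+	#include <linux/ukevent.h>
+
+	#define RING_SIZE	4096
+
+	static int kevent_init(struct kevent_ring *ring, unsigned int ring_size)
+	{
+		return syscall(323, ring, ring_size);	/* __NR_kevent_init */
+	}
+
+	static int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
+			      struct ukevent *arg)
+	{
+		return syscall(320, fd, cmd, num, arg);	/* __NR_kevent_ctl */
+	}
+
+	int main(void)
+	{
+		struct kevent_ring *ring;
+		struct ukevent uk;
+		int fd, sock;
+
+		/* Ring header followed by RING_SIZE event slots. */
+		ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
+
+		/* Returns the kevent control file descriptor. */
+		fd = kevent_init(ring, RING_SIZE);
+
+		/* Register interest in incoming data on a socket. */
+		sock = socket(AF_INET, SOCK_STREAM, 0);
+		memset(&uk, 0, sizeof(uk));
+		uk.id.raw[0] = sock;
+		uk.type = KEVENT_SOCKET;
+		uk.event = KEVENT_SOCKET_RECV;
+		kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);
+
+		return 0;
+	}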
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+   __u64 timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+ waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+ events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait up to 'timeout' nanoseconds for at least min_nr
+completed events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function also copies events into the ring buffer if it was initialized;
+if the ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the
+ret_flags field.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+timeout - number of nanoseconds to wait until there is free space
+ in the kevent queue
+
+Return value:
+ number of events copied into the ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready. It also copies events into the special ring buffer. If the
+ring buffer is full, it waits until there are ready events and then returns.
+If a kevent is a one-shot kevent, it is removed in this syscall.
+If a kevent is edge-triggered (KEVENT_REQ_ET flag is set in 'req_flags'), it
+is requeued in this syscall for performance reasons.
+-------------------------------------------------------------------------------
+
+ int kevent_commit(int ctl_fd, unsigned int start,
+   unsigned int num, unsigned int over);
+
+ctl_fd - file descriptor referring to the kevent queue
+start - index of the first entry in the ring buffer to commit from
+num - number of kevents to commit
+over - overflow count for the given start value
+
+Return value:
+ number of committed kevents or negative error value.
+
+This function commits, i.e. marks as empty, slots in the ring buffer, so
+they can be reused once userspace has completed processing those entries.
+
+The overflow counter is used to prevent a situation where two threads are
+going to free the same events, but one of them was scheduled away for too
+long, so the ring indexes wrapped around, and when that thread is finally
+awakened, it would free events other than those it was supposed to free.
+
+The returned number of committed events may be smaller than the requested
+number - this happens when several threads try to commit the same events.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+ Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+ Event type, e.g.
KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+ Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+ Per-event request flags,
+
+ KEVENT_REQ_ONESHOT
+  event will be removed when it is ready
+
+ KEVENT_REQ_WAKEUP_ALL
+  By default a kevent wakes up only the first thread interested in the
+  given event; if this flag is set, all interested threads are woken up.
+
+ KEVENT_REQ_ET
+  Edge-triggered behaviour. It is an optimisation which moves a ready
+  and dequeued (i.e. copied to userspace) event back into the set of
+  interest for the given storage (socket, inode and so on). It is very
+  useful for cases when the same event should be used many times (like
+  reading from a pipe). It is similar to epoll()'s EPOLLET flag.
+
+ KEVENT_REQ_LAST_CHECK
+  If set, allows performing a last check on the kevent (calling the
+  appropriate callback) when the kevent is marked as ready and has been
+  removed from the ready queue. If it is confirmed that the kevent is
+  ready (k->callbacks.callback(k) returns true), the kevent is copied
+  to userspace; otherwise it is requeued back to its storage. The
+  second (checking) call is performed with this bit cleared, so the
+  callback can detect whether it was called from kevent_storage_ready()
+  - bit is set - or kevent_dequeue_ready() - bit is cleared. If the
+  kevent is requeued, the bit is set again.
+
+ KEVENT_REQ_ALWAYS_QUEUE
+  If this flag is set, a kevent which is ready at enqueue time is
+  queued into the ready queue; without the flag such a kevent is copied
+  back to userspace immediately and is not queued into the storage.
+
+__u32 ret_flags
+ Per-event return flags
+
+ KEVENT_RET_BROKEN
+  Kevent is broken
+
+ KEVENT_RET_DONE
+  Kevent processing was finished successfully
+
+ KEVENT_RET_COPY_FAILED
+  Kevent was not copied into the ring buffer due to some error condition.
+
+__u32 ret_data
+ Event return data. The event originator fills it with anything it likes
+ (for example, timer notifications put a number of milliseconds there
+ when the timer fires).
+union { __u32 user[2]; void *ptr; }
+ User's data. It is not used by the kernel, just copied to/from user. The
+ whole structure is aligned to 8 bytes already, so the last union is
+ aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the
+passed in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
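+Continuing the setup sketch above, the intended consumption loop looks
+roughly as follows (a sketch only: the interpretation of kevent_wait()'s
+'num' argument and the overflow bookkeeping are simplified, and the
+wrappers are again raw syscall(2) calls, numbers 321/322 on i386):
+
+	static int kevent_wait(int fd, unsigned int num, __u64 timeout)
+	{
+		return syscall(321, fd, num, timeout);		/* __NR_kevent_wait */
+	}
+
+	static int kevent_commit(int fd, unsigned int start, unsigned int num,
+				 unsigned int over)
+	{
+		return syscall(322, fd, start, num, over);	/* __NR_kevent_commit */
+	}
+
+	static void event_loop(int fd, struct kevent_ring *ring)
+	{
+		unsigned int uidx = 0, over = 0;
+		int i, num;
+
+		for (;;) {
+			/* Returns the number of events copied into the ring. */
+			num = kevent_wait(fd, RING_SIZE, (__u64)-1);
+			if (num <= 0)
+				continue;
+
+			for (i = 0; i < num; ++i) {
+				struct ukevent *uk = &ring->event[(uidx + i) % RING_SIZE];
+
+				if (uk->ret_flags & KEVENT_RET_BROKEN) {
+					/* Broken request - drop or log it. */
+					continue;
+				}
+				/* Process uk->id, uk->event, uk->ret_data here. */
+			}
+
+			/* Mark the consumed slots as free for the kernel to reuse. */
+			kevent_commit(fd, uidx, num, over);
+			if (uidx + num >= RING_SIZE)
+				over++;
+			uidx = (uidx + num) % RING_SIZE;
+		}
+	}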
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+ type - KEVENT_TIMER
+ event - KEVENT_TIMER_FIRED
+ req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+ id.raw[0] - number of seconds after commit when this timer should expire
+ id.raw[1] - number of nanoseconds in addition to the seconds above
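+For instance, arming a one-shot 2.5 second timer with the fields above
+(kevent_ctl() as in the earlier sketches):
+
+	static int add_timer_event(int fd)
+	{
+		struct ukevent uk;
+
+		memset(&uk, 0, sizeof(uk));
+		uk.type = KEVENT_TIMER;
+		uk.event = KEVENT_TIMER_FIRED;
+		uk.req_flags = KEVENT_REQ_ONESHOT;	/* fire once, then auto-remove */
+		uk.id.raw[0] = 2;			/* seconds */
+		uk.id.raw[1] = 500000000;		/* plus nanoseconds */
+
+		return kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);
+	}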
* [take25 2/6] kevent: Core files. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper ` (2 subsequent siblings) 3 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..a6221c2 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,8 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_commit + .long sys_kevent_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..dda2168 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,13 @@ ia32_sys_call_table: .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_commit + .quad sys_kevent_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..57a6b8c 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,15 @@ #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_commit 322 +#define __NR_kevent_init 323 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 324 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..17d750d 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,20 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_commit 283 +__SYSCALL(__NR_kevent_commit, sys_kevent_commit) +#define __NR_kevent_init 284 +__SYSCALL(__NR_kevent_init, sys_kevent_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..c909c62 --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,230 @@ +/* + * 2006 Copyright (c) Evgeniy 
Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/fs.h> +#include <linux/sched.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, uidx, ring_size, ring_over, full; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#ifdef CONFIG_KEVENT_POLL +static inline void kevent_init_file(struct file *file) +{ + kevent_storage_init(file, &file->st); +} + +static inline void kevent_cleanup_file(struct file *file) +{ + kevent_storage_fini(&file->st); +} +#else +static inline void kevent_init_file(struct file *file) {} +static inline void kevent_cleanup_file(struct file *file) {} +#endif + +#ifdef CONFIG_KEVENT_PIPE +extern void kevent_pipe_notify(struct inode *inode, u32 events); +#else +static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} +#endif + +#ifdef CONFIG_KEVENT_SIGNAL +extern int kevent_signal_notify(struct task_struct *tsk, int sig); +#else +static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;} +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ 
b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. */ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..1317a18 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,10 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over); +asmlinkage long sys_kevent_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..0680fdf --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,178 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +#include <linux/types.h> + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Kevent wakes up only first thread interested in given event, + * or all threads if this flag is set. + */ +#define KEVENT_REQ_WAKEUP_ALL 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 +/* Perform the last check on kevent (call appropriate callback) when + * kevent is marked as ready and has been removed from ready queue. + * If it will be confirmed that kevent is ready + * (k->callbacks.callback(k) returns true) then kevent will be copied + * to userspace, otherwise it will be requeued back to storage. + * Second (checking) call is performed with this bit _cleared_ so + * callback can detect when it was called from + * kevent_storage_ready() - bit is set, or + * kevent_dequeue_ready() - bit is cleared. + * If kevent will be requeued, bit will be set again. */ +#define KEVENT_REQ_LAST_CHECK 0x8 +/* + * Always queue kevent even if it is immediately ready. 
+ */
+#define KEVENT_REQ_ALWAYS_QUEUE	0x10
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED	0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_PIPE		6
+#define KEVENT_SIGNAL		7
+#define KEVENT_MAX		8
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must exactly match the number of
+ * kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO and PIPE events.
+ */
+#define KEVENT_SOCKET_RECV	0x1
+#define KEVENT_SOCKET_ACCEPT	0x2
+#define KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE	0x1
+#define KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN	0x0001
+#define KEVENT_POLL_POLLPRI	0x0002
+#define KEVENT_POLL_POLLOUT	0x0004
+#define KEVENT_POLL_POLLERR	0x0008
+#define KEVENT_POLL_POLLHUP	0x0010
+#define KEVENT_POLL_POLLNVAL	0x0020
+
+#define KEVENT_POLL_POLLRDNORM	0x0040
+#define KEVENT_POLL_POLLRDBAND	0x0080
+#define KEVENT_POLL_POLLWRNORM	0x0100
+#define KEVENT_POLL_POLLWRBAND	0x0200
+#define KEVENT_POLL_POLLMSG	0x0400
+#define KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO		0x1
+
+/*
+ * Signal events.
+ */
+#define KEVENT_SIGNAL_DELIVERY	0x1
+
+/* If set in id.raw_u64, then the given signals will not be delivered
+ * in the usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK	0x8000000000000000ULL
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used by the kernel, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct kevent_ring
+{
+	unsigned int		ring_kidx, ring_uidx, ring_over;
+	struct ukevent		event[0];
+};
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..4b137ee
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,60 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism. It can be used
+	  as a replacement for poll()/select(), for AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option turns kevent_user statistics collection on.
+	  Statistics include the total number of kevents, the number of
+	  kevents which were ready immediately at insertion time, and the
+	  number of kevents which were removed through readiness completion.
+	  They are printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications of socket operations through
+	  the KEVENT subsystem, like new packet reception conditions,
+	  ready-for-accept conditions and so on.
+
+config KEVENT_PIPE
+	bool "Kernel event notifications for pipes"
+	depends on KEVENT
+	help
+	  This option enables notifications of pipe read/write operations
+	  through the KEVENT subsystem.
+
+config KEVENT_SIGNAL
+	bool "Kernel event notifications for signals"
+	depends on KEVENT
+	help
+	  This option enables signal delivery through the KEVENT subsystem.
+	  Signals which are requested to be delivered through the kevent
+	  subsystem must still be registered through the usual signal() and
+	  related syscalls; this option allows alternative delivery.
+	  With the KEVENT_SIGNAL_NOMASK flag set in a kevent for a set of
+	  signals, they will not be delivered in the usual way.
+	  Kevents for the appropriate signals are not copied when a process
+	  forks; the new process must add new kevents after fork(). The
+	  mask of signals is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..f98e0c8
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..8cf756c
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,232 @@
+/*
+ * 	2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. 
+ * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. + * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. + */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (unlikely(ready_callback)) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if ((k->event.req_flags & KEVENT_REQ_WAKEUP_ALL) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..2cd8c99 --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,1181 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static kmem_cache_t *kevent_cache __read_mostly; +static kmem_cache_t *kevent_user_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static inline unsigned int kevent_ring_space(struct kevent_user *u) +{ + if (u->full) + return 0; + + return (u->uidx > u->kidx)? + (u->uidx - u->kidx): + (u->ring_size - (u->kidx - u->uidx)); +} + +static inline int kevent_ring_index_inc(unsigned int *pidx, unsigned int size) +{ + unsigned int idx = *pidx; + + if (++idx >= size) + idx = 0; + *pidx = idx; + return (idx == 0); +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success or if ring buffer is not used + * -EAGAIN if there were no place for that kevent + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. + */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (!kevent_ring_space(u)) + return -EAGAIN; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + kevent_ring_index_inc(&u->kidx, u->ring_size); + + if (u->kidx == u->uidx) + u->full = 1; + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static struct kevent_user *kevent_user_alloc(struct kevent_ring __user *ring, unsigned int num) +{ + struct kevent_user *u; + + u = kmem_cache_alloc(kevent_user_cache, GFP_KERNEL); + if (!u) + return NULL; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->uidx = u->ring_over = u->full = 0; + + u->pring = ring; + u->ring_size = num; + + return u; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. 
+ * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kmem_cache_free(kevent_user_cache, u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. 
+ */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +static struct kevent *__kevent_dequeue_ready_one(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + if (u->ready_num) { + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + } + + return k; +} + +static struct kevent *kevent_dequeue_ready_one(struct kevent_user *u) +{ + struct kevent *k = NULL; + + while (u->ready_num && !k) { + k = __kevent_dequeue_ready_one(u); + + if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) { + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + if (!k->callbacks.callback(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + k = NULL; + } + } else + break; + } + + return k; +} + +static inline void kevent_copy_ring(struct kevent *k) +{ + unsigned long flags; + + if (!k) + return; + + if (kevent_copy_ring_buffer(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kevent_dequeue_ready(struct kevent_user *u) +{ + struct kevent *k; + + mutex_lock(&u->ring_lock); + k = kevent_dequeue_ready_one(u); + kevent_copy_ring(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +/* + * Dequeue one entry from user's ready queue if there is space in ring buffer. + */ +static struct kevent *kevent_dequeue_ready_ring(struct kevent_user *u) +{ + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + if (kevent_ring_space(u)) { + k = kevent_dequeue_ready_one(u); + kevent_copy_ring(k); + } + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. + */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. 
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k;
+	struct rb_node *n;
+
+	for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+	unsigned long flags;
+	struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+	struct kevent *k;
+	int err = 0, cmp;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (*p) {
+		parent = *p;
+		k = rb_entry(parent, struct kevent, kevent_node);
+
+		cmp = kevent_compare_id(&k->event.id, &new->event.id);
+		if (cmp > 0)
+			p = &parent->rb_right;
+		else if (cmp < 0)
+			p = &parent->rb_left;
+		else {
+			err = -EEXIST;
+			break;
+		}
+	}
+	if (likely(!err)) {
+		rb_link_node(&new->kevent_node, parent, p);
+		rb_insert_color(&new->kevent_node, &u->kevent_root);
+		new->flags |= KEVENT_USER;
+		u->kevent_num++;
+		kevent_user_get(u);
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	err = kevent_user_enqueue(u, k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				}
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		}
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+				&k->event, sizeof(struct ukevent))) {
+			if (num == 0)
+				num = -EFAULT;
+			break;
+		}
+		kevent_complete_ready(k);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_init().
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+static struct vfsmount *kevent_mnt __read_mostly;
+
+static int kevent_get_sb(struct file_system_type *fs_type, int flags,
+		const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	return get_sb_pseudo(fs_type, "kevent", NULL, 0xaabbccdd, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= "keventfs",
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static int keventfs_delete_dentry(struct dentry *dentry)
+{
+	return 1;
+}
+
+static struct dentry_operations keventfs_dentry_operations = {
+	.d_delete	= keventfs_delete_dentry,
+};
+
+asmlinkage long sys_kevent_init(struct kevent_ring __user *ring, unsigned int num)
+{
+	struct qstr this;
+	char name[32];
+	struct dentry *dentry;
+	struct inode *inode;
+	struct file *file;
+	int err = -ENFILE, fd;
+	struct kevent_user *u;
+
+	if ((ring && !num) || (!ring && num) || (num == 1))
+		return -EINVAL;
+
+	file = get_empty_filp();
+	if (!file)
+		goto err_out_exit;
+
+	inode = new_inode(kevent_mnt->mnt_sb);
+	if (!inode)
+		goto err_out_fput;
+
+	inode->i_fop = &kevent_user_fops;
+
+	inode->i_state = I_DIRTY;
+	inode->i_mode = S_IRUSR | S_IWUSR;
+	inode->i_uid = current->fsuid;
+	inode->i_gid = current->fsgid;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+	err = get_unused_fd();
+	if (err < 0)
+		goto err_out_iput;
+	fd = err;
+
+	err = -ENOMEM;
+	u = kevent_user_alloc(ring, num);
+	if (!u)
+		goto err_out_put_fd;
+
+	sprintf(name, "[%lu]", inode->i_ino);
+	this.name = name;
+	this.len = strlen(name);
+	this.hash = inode->i_ino;
+	dentry = d_alloc(kevent_mnt->mnt_sb->s_root, &this);
+	if (!dentry)
+		goto err_out_free;
+	dentry->d_op = &keventfs_dentry_operations;
+	d_add(dentry, inode);
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dentry;
+	file->f_mapping = inode->i_mapping;
+	file->f_pos = 0;
+	file->f_flags = O_RDONLY;
+	file->f_op = &kevent_user_fops;
+	file->f_mode = FMODE_READ;
+	file->f_version = 0;
+	file->private_data = u;
+
+	fd_install(fd, file);
+
+	return fd;
+
+err_out_free:
+	kmem_cache_free(kevent_user_cache, u);
+err_out_put_fd:
+	put_unused_fd(fd);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	put_filp(file);
+err_out_exit:
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the ring
+ * buffer and then copies ready events into it.
+ * It returns the number of ready events actually copied into the ring buffer.
+ * After this function completes, userspace ring->ring_kidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @num - number of kevents to process.
+ * @timeout - number of nanoseconds to wait until there is
+ * 	free space in the kevent queue.
+ *
+ * When @num events are to be processed, the first @num kevents are simply
+ * removed from the ready queue and copied into the buffer.
+ * Kevents are copied into the ring buffer in the order they were placed into the ready queue.
+ * One-shot kevents are removed here, since there is no way they can be reused.
+ * Edge-triggered events will be requeued here for better performance.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, copied = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_ring __user *ring;
+	unsigned int i;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring;
+	if (!ring || num > u->ring_size)
+		goto out_fput;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			((u->ready_num >= 1) && (kevent_ring_space(u))),
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	for (i=0; i<num; ++i) {
+		k = kevent_dequeue_ready_ring(u);
+		if (!k)
+			break;
+		kevent_complete_ready(k);
+
+		if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+			break;
+		kevent_stat_ring(u);
+		copied++;
+	}
+
+	fput(file);
+
+	return copied;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to commit events in the ring buffer, i.e. mark appropriate
+ * entries as unused by userspace so subsequent kevent_wait() could overwrite them.
+ * This function returns the actual number of kevents which were committed.
+ * After this function is completed userspace ring->ring_uidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent to be committed.
+ * @num - number of kevents to commit.
+ * @over - number of overflows the given queue had.
+ *
+ * If several threads commit the same events, and one of them has committed
+ * while another was scheduled away for so long that the ring indexes have
+ * wrapped, it is possible that an incorrect commit would be performed;
+ * the @over counter lets the kernel detect such stale commits and
+ * return -EOVERFLOW instead.
+ */
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over)
+{
+	int err = -EINVAL, comm = 0, i, over_changed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent_ring __user *ring;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+	ring = u->pring;
+
+	if (!ring || num > u->ring_size)
+		goto out_fput;
+
+	err = -EOVERFLOW;
+	mutex_lock(&u->ring_lock);
+	if (over != u->ring_over+1 && over != u->ring_over)
+		goto err_out_unlock;
+
+	if (start > u->uidx) {
+		if (over != u->ring_over+1) {
+			if (over == u->ring_over)
+				err = -EINVAL;
+			goto err_out_unlock;
+		} else {
+			/*
+			 * To be or not to be, that is a question:
+			 * Whether it is nobler in the mind to suffer...
+			 * Stop. Not.
+			 * To optimize 'the modulo' or not, that is a question:
+			 * Are there many CPUs, which still being in the world production
+			 * And suffer badly from that stuff in it.
+ */ + unsigned int mod = (start + num) % u->ring_size; + + if (mod >= u->uidx) + comm = mod - u->uidx; + } + } else { + if (over != u->ring_over) + goto err_out_unlock; + + if (start + num >= u->uidx) + comm = start + num - u->uidx; + } + + if (comm) + u->full = 0; + + for (i=0; i<comm; ++i) { + if (kevent_ring_index_inc(&u->uidx, u->ring_size)) { + u->ring_over++; + over_changed = 1; + } + } + + if (over_changed) { + if (put_user(u->ring_over, &ring->ring_over)) { + err = -EFAULT; + goto err_out_unlock; + } + } + + if (put_user(u->uidx, &ring->ring_uidx)) { + err = -EFAULT; + goto err_out_unlock; + } + mutex_unlock(&u->ring_lock); + + fput(file); + + return comm; + +err_out_unlock: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create caches and register + * filesystem to get control file descriptors from. + */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + kevent_user_cache = kmem_cache_create("kevent_user_cache", + sizeof(struct kevent_user), 0, SLAB_PANIC, NULL, NULL); + + err = register_filesystem(&kevent_fs_type); + if (err) + goto err_out_exit; + + kevent_mnt = kern_mount(&kevent_fs_type); + err = PTR_ERR(kevent_mnt); + if (IS_ERR(kevent_mnt)) + goto err_out_unreg; + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_unreg: + unregister_filesystem(&kevent_fs_type); +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..3b7d35f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,12 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_commit); +cond_syscall(sys_kevent_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
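[Editorial note: taken together, sys_kevent_init(), sys_kevent_ctl(), sys_kevent_wait() and sys_kevent_commit() above define a producer/consumer protocol over the userspace-visible ring. The following is a minimal, hedged userspace sketch of that loop. The __NR_kevent_* numbers are placeholders, and struct ukevent, struct kevent_ring and KEVENT_CTL_ADD are assumed to come from the patchset's shared userspace header, which is not shown in this posting.]

/*
 * Hedged sketch of the ring-buffer event loop described above.
 * Replace the placeholder syscall numbers with the ones assigned
 * by the patched kernel.
 */
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_kevent_init	(-1)	/* placeholder */
#define __NR_kevent_ctl		(-1)	/* placeholder */
#define __NR_kevent_wait	(-1)	/* placeholder */
#define __NR_kevent_commit	(-1)	/* placeholder */

#define RING_NUM	256	/* any value but 1, see sys_kevent_init() */

static int kevent_ring_loop(struct kevent_ring *ring, struct ukevent *uk)
{
	unsigned int start = 0;
	long n, i;
	int fd;

	/* Create a kevent queue with a userspace-allocated ring buffer. */
	fd = syscall(__NR_kevent_init, ring, RING_NUM);
	if (fd < 0)
		return -1;

	/* Register one event source, e.g. a socket, timer or pipe. */
	if (syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, uk) < 0)
		goto err;

	for (;;) {
		/* Sleep (up to 1 second, timeout is in nanoseconds) until
		 * there are ready events and ring space, then let the
		 * kernel copy up to RING_NUM of them into the ring. */
		n = syscall(__NR_kevent_wait, fd, RING_NUM, 1000000000ULL);
		if (n < 0)
			break;

		for (i = 0; i < n; ++i) {
			/* Process ring entry (start + i) % RING_NUM here;
			 * ring->ring_kidx tells how far the kernel got. */
		}

		/* Return the consumed slots to the kernel; @over must match
		 * the overflow counter the kernel last published, otherwise
		 * -EOVERFLOW is returned. */
		if (syscall(__NR_kevent_commit, fd, start, n, ring->ring_over) < 0)
			break;
		start = (start + n) % RING_NUM;
	}
err:
	close(fd);
	return -1;
}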
* [take25 3/6] kevent: poll/select() notifications.
  2006-11-21 16:29           ` [take25 2/6] kevent: Core files Evgeniy Polyakov
@ 2006-11-21 16:29             ` Evgeniy Polyakov
  2006-11-21 16:29               ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similar to epoll and has the same issues (callback
is invoked not from the internal state machine of the caller, but through
process wakeup, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 
 	if (file->f_op && file->f_op->release)
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ struct file {
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + } else { + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); 
+err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
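[Editorial note: as a concrete illustration of how this origin is driven from userspace - kevent_poll_enqueue() above takes the target file descriptor from id.raw[0] and matches ->event against the poll mask. The sketch below fills a ukevent accordingly; the struct ukevent layout, the type field dispatching to the registered callbacks and the KEVENT_* constants are assumed from the patchset's shared header.]

/*
 * Hedged sketch: a poll-style readiness request for an arbitrary fd.
 * id.raw[0] carries the fd handed to fget(), and ->event carries the
 * poll mask checked against f_op->poll().
 */
#include <poll.h>
#include <string.h>

static void fill_poll_ukevent(struct ukevent *uk, int fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_POLL;			/* dispatch to the callbacks above */
	uk->event = POLLIN | POLLRDNORM;	/* readiness bits of interest */
	uk->id.raw[0] = fd;			/* target descriptor */
	uk->req_flags = KEVENT_REQ_ONESHOT;	/* drop after first delivery */
}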
* [take25 4/6] kevent: Socket notifications.
  2006-11-21 16:29             ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-21 16:29               ` Evgeniy Polyakov
  2006-11-21 16:29                 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead
of epoll, its performance increased noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if
(skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..9c24b5b --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,142 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + if (events & (POLLERR | POLLHUP)) + return -1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ void sock_init_data(struct socket *sock, sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/jhash.h> #include 
<linux/init.h> #include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ int tcp_v4_conn_request(struct sock *sk, reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
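[Editorial note: for an accept-driven server, kevent_socket_enqueue() above resolves id.raw[0] with sockfd_lookup(), and kevent_socket_callback() maps KEVENT_SOCKET_ACCEPT onto POLLIN|POLLRDNORM on the listening socket. A hedged sketch of the corresponding ukevent setup follows; the struct ukevent layout and the KEVENT_SOCKET* constants are assumed from the patchset's shared header.]

/*
 * Hedged sketch: arming accept notifications on a listening socket.
 * id.raw[0] is the value kevent_socket_enqueue() passes to
 * sockfd_lookup(); no ONESHOT flag, so the kevent stays armed and
 * fires for every new connection.
 */
#include <string.h>

static void fill_accept_ukevent(struct ukevent *uk, int listen_fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_SOCKET;
	uk->event = KEVENT_SOCKET_ACCEPT;	/* mapped to POLLIN|POLLRDNORM */
	uk->id.raw[0] = listen_fd;
}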
* [take25 5/6] kevent: Timer notifications. 2006-11-21 16:29 ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage 
*st = k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
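[Editorial note: since the raw[] convention above - seconds in id.raw[0], nanoseconds in id.raw[1] - is easy to get wrong, here is a hedged sketch of a periodic timer request. The struct ukevent layout and KEVENT_TIMER are assumed from the patchset's shared header.]

/*
 * Hedged sketch: a periodic 250 ms timer kevent. The timer re-arms
 * itself via hrtimer_forward() in kevent_timer_func() above, so no
 * ONESHOT flag is set; each expiration returns jiffies_to_msecs()
 * in ret_data[0].
 */
#include <string.h>

static void fill_timer_ukevent(struct ukevent *uk)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_TIMER;
	uk->id.raw[0] = 0;			/* seconds */
	uk->id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds: 250 ms period */
}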
* [take25 6/6] kevent: Pipe notifications. 2006-11-21 16:29 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-22 11:20 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..5080642 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,117 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
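[Editorial note: pipe readiness reuses the KEVENT_SOCKET_RECV/SEND event bits, as the kevent_pipe_notify() calls in the fs/pipe.c hunks above show. A hedged sketch of watching the read side of a pipe follows; the struct ukevent layout and KEVENT_PIPE are assumed from the patchset's shared header.]

/*
 * Hedged sketch: wait for data on the read end of a pipe. id.raw[0]
 * is the value kevent_pipe_enqueue() passes to fget(); the event bits
 * are the socket ones, which kevent_pipe_callback() checks against
 * pipe->nrbufs.
 */
#include <string.h>

static void fill_pipe_ukevent(struct ukevent *uk, int pipe_read_fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_PIPE;
	uk->event = KEVENT_SOCKET_RECV;	/* data available to read */
	uk->id.raw[0] = pipe_read_fd;
}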
* Re: [take25 6/6] kevent: Pipe notifications. 2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov @ 2006-11-22 11:20 ` Eric Dumazet 2006-11-22 11:30 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-22 11:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote: > Pipe notifications. > +int kevent_pipe_enqueue(struct kevent *k) > +{ > + struct file *pipe; > + int err = -EBADF; > + struct inode *inode; > + > + pipe = fget(k->event.id.raw[0]); > + if (!pipe) > + goto err_out_exit; > + > + inode = igrab(pipe->f_dentry->d_inode); > + if (!inode) > + goto err_out_fput; > + Well... How can you be sure 'pipe/inode' really refers to a pipe/fifo here ? Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share the same location. (check pipe_info() in fs/splice.c) So I guess you need : err = -EINVAL; if (!S_ISFIFO(inode->i_mode)) goto err_out_iput; Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 6/6] kevent: Pipe notifications. 2006-11-22 11:20 ` Eric Dumazet @ 2006-11-22 11:30 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 11:30 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 12:20:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote: > > Pipe notifications. > > > +int kevent_pipe_enqueue(struct kevent *k) > > +{ > > + struct file *pipe; > > + int err = -EBADF; > > + struct inode *inode; > > + > > + pipe = fget(k->event.id.raw[0]); > > + if (!pipe) > > + goto err_out_exit; > > + > > + inode = igrab(pipe->f_dentry->d_inode); > > + if (!inode) > > + goto err_out_fput; > > + > > Well... > > How can you be sure 'pipe/inode' really refers to a pipe/fifo here ? > > Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share > the same location. (check pipe_info() in fs/splice.c) > > So I guess you need : > > err = -EINVAL; > if (!S_ISFIFO(inode->i_mode)) > goto err_out_iput; You are correct, I did not perform that check, since all pipe open functions do rely on the i_pipe, which can not be block device at that point, but with kevent file descriptor can be anything, so that check must be performed. I will put it into the tree, thanks Eric. > Eric -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov @ 2006-11-22 23:46 ` Ulrich Drepper 2006-11-23 11:52 ` Evgeniy Polyakov 2006-11-22 23:52 ` Ulrich Drepper 2006-11-23 22:33 ` Ulrich Drepper 3 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 23:46 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); > + > +ctl_fd - file descriptor referring to the kevent queue > +num - number of processed kevents > +timeout - this timeout specifies number of nanoseconds to wait until there is > + free space in kevent queue > + > +Return value: > + number of events copied into ring buffer or negative error value. This is not quite sufficient. What we also need is a parameter which specifies which ring buffer the code assumes is currently active. This is just like the EWOULDBLOCK error in the futex. I.e., the kernel doesn't move the thread on the wait list if the index has changed. Otherwise asynchronous ring buffer filling is impossible. Assume this thread kernel get current ring buffer idx front and tail pointer the same add new entry to ring buffer bump front pointer call kevent_wait() With the interface above this leads to a deadlock. The kernel delivered the event and is done with it. If the kevent_wait() syscall gets an additional parameter which specifies the expected front pointer the kernel wouldn't put the thread to sleep since, in this case, the front pointer changed since last checked. The kernel cannot and should not check the ring buffer is empty. Userlevel should maintain the tail pointer all by itself. And even if the tail pointer is available to the kernel, the program might want to handle the queued events differently. The above also comes to bear without asynchronous queuing if a thread waits for more than one event and it is possible to handle both events concurrently in two threads. Passing in the expected front pointer value is flexible and efficient. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
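[Editorial note: the interleaving Ulrich describes above is the classic lost-wakeup race, and the fix he proposes mirrors the futex val check. The sketch below illustrates it in userspace C; kevent_wait_idx() is a hypothetical variant of kevent_wait() taking the front index the caller last observed - it does not exist in the posted patches.]

#include <linux/types.h>

/* Hypothetical syscall wrapper - not part of the posted patches. */
extern long kevent_wait_idx(int ctl_fd, unsigned int num,
			    unsigned int old_idx, __u64 timeout);

static long wait_for_events(int ctl_fd, struct kevent_ring *ring,
			    unsigned int num, unsigned int tail, __u64 timeout)
{
	unsigned int idx = ring->ring_kidx;	/* 1: sample the front pointer */

	if (idx != tail)
		return 0;	/* events already pending, no need to sleep */

	/*
	 * 2: between the check above and the sleep below, the kernel may
	 * post an event and bump ring_kidx; a plain kevent_wait() would
	 * then sleep on an already delivered event. Passing the observed
	 * idx lets the kernel compare it against the current ring_kidx
	 * and, like futex with a changed value, return immediately
	 * instead of sleeping when they differ - so the wakeup cannot
	 * be lost.
	 */
	return kevent_wait_idx(ctl_fd, num, idx, timeout);
}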
* Re: [take25 1/6] kevent: Description. 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper @ 2006-11-23 11:52 ` Evgeniy Polyakov 2006-11-23 19:45 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 11:52 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 03:46:42PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); > >+ > >+ctl_fd - file descriptor referring to the kevent queue > >+num - number of processed kevents > >+timeout - this timeout specifies number of nanoseconds to wait until > >there is + free space in kevent queue > >+ > >+Return value: > >+ number of events copied into ring buffer or negative error value. > > This is not quite sufficient. What we also need is a parameter which > specifies which ring buffer the code assumes is currently active. This > is just like the EWOULDBLOCK error in the futex. I.e., the kernel > doesn't move the thread on the wait list if the index has changed. > Otherwise asynchronous ring buffer filling is impossible. Assume this > > thread kernel > > get current ring buffer idx > > front and tail pointer the same > > add new entry to ring buffer > > bump front pointer > > call kevent_wait() > > > With the interface above this leads to a deadlock. The kernel delivered > the event and is done with it. Kernel does not put there a new entry, it is only done inside kevent_wait(). Entries are put into queue (in any context), where they can be obtained from only kevent_wait() or kevent_get_events(). -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 11:52 ` Evgeniy Polyakov @ 2006-11-23 19:45 ` Ulrich Drepper 2006-11-24 11:01 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 19:45 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > Kernel does not put there a new entry, it is only done inside > kevent_wait(). Entries are put into queue (in any context), where they can be obtained > from only kevent_wait() or kevent_get_events(). I know this is how it's done now. But it is not where it has to end. IMO we have to get to a solution where new events are posted to the ring buffer asynchronously, i.e., without a thread calling kevent_wait. And then you need the extra parameter and verification. Even if it's today not needed we have to future-proof the interface since it cannot be changed once in use. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 19:45 ` Ulrich Drepper @ 2006-11-24 11:01 ` Evgeniy Polyakov 2006-11-24 16:06 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:01 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 11:45:36AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Kernel does not put there a new entry, it is only done inside > >kevent_wait(). Entries are put into queue (in any context), where they can > >be obtained > >from only kevent_wait() or kevent_get_events(). > > I know this is how it's done now. But it is not where it has to end. > IMO we have to get to a solution where new events are posted to the ring > buffer asynchronously, i.e., without a thread calling kevent_wait. And > then you need the extra parameter and verification. Even if it's today > not needed we have to future-proof the interface since it cannot be > changed once in use. There is a special flag in kevent_user to wake it if there are no ready events - kernel thread which has added new events will set it and thus subsequent kevent_wait() will return with updated indexes - userspace must check indexes after kevent_wait(). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:01 ` Evgeniy Polyakov @ 2006-11-24 16:06 ` Ulrich Drepper 2006-11-24 16:14 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:06 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: >> I know this is how it's done now. But it is not where it has to end. >> IMO we have to get to a solution where new events are posted to the ring >> buffer asynchronously, i.e., without a thread calling kevent_wait. And >> then you need the extra parameter and verification. Even if it's not needed >> today we have to future-proof the interface since it cannot be >> changed once in use. > > There is a special flag in kevent_user to wake the waiter even if there are no ready > events - a kernel thread which has added new events will set it, and thus > a subsequent kevent_wait() will return with updated indexes - userspace > must check the indexes after kevent_wait(). You misunderstand. I don't want to return without waiting unconditionally. There is a race which has to be closed. It's exactly the same as in the futex syscall. I've shown the interaction between the kernel and the thread in the previous mail. There is inevitably a time difference between the thread checking whether the ring buffer is empty and the kernel putting the thread to sleep in the kevent_wait call. This is no problem with the current kevent_wait implementation since the ring buffer is not filled asynchronously. But if/when it will be the kernel might add something to the ring buffer _after_ the thread checks for an empty ring buffer and _before_ it enters the kernel in the kevent_wait syscall. The kevent_wait syscall will only wake the thread when a new event is posted. We do not in general want it to be woken when the ring buffer is non-empty. This would create far too many unnecessary wakeups if there is more than one thread working on the queue. With the additional parameters for kevent_wait indicating when the calling thread last checked the ring buffer the kernel can find out whether the decision to call kevent_wait was made based on outdated information or not. Outdated in the case a new event has been posted. In this case the thread is not put to sleep but instead returns. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
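The race is easiest to see spelled out. A minimal userspace sketch follows; the guarded form passes the last ring index the thread observed, matching the old_uidx semantics that take26 eventually adopted, while the exact prototype and the my_uidx variable are assumptions here:

	/* Userspace side of the check-then-sleep race (a sketch). */
	unsigned int seen = ring->ring_kidx;    /* snapshot the front pointer */

	if (seen == my_uidx) {                  /* ring looks empty...        */
		/* ...but the kernel may post an event and bump ring_kidx
		 * right here, before the syscall below is entered.
		 *
		 * Racy form: kevent_wait(ctl_fd, 0, timeout) would now
		 * sleep even though an event already sits in the ring.
		 *
		 * Guarded form: pass the snapshot; the kernel returns
		 * immediately if the index has moved, exactly like
		 * FUTEX_WAIT with an expected value. */
		kevent_wait(ctl_fd, 0, seen, &timeout);
	}
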
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:06 ` Ulrich Drepper @ 2006-11-24 16:14 ` Evgeniy Polyakov 2006-11-24 16:31 ` Evgeniy Polyakov 2006-11-27 19:20 ` Ulrich Drepper 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:14 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 08:06:59AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>I know this is how it's done now. But it is not where it has to end. > >>IMO we have to get to a solution where new events are posted to the ring > >>buffer asynchronously, i.e., without a thread calling kevent_wait. And > >>then you need the extra parameter and verification. Even if it's not needed > >>today we have to future-proof the interface since it cannot be > >>changed once in use. > > > >There is a special flag in kevent_user to wake the waiter even if there are no ready > >events - a kernel thread which has added new events will set it, and thus > >a subsequent kevent_wait() will return with updated indexes - userspace > >must check the indexes after kevent_wait(). > > You misunderstand. I don't want to return without waiting unconditionally. > > There is a race which has to be closed. It's exactly the same as in the > futex syscall. I've shown the interaction between the kernel and the > thread in the previous mail. There is inevitably a time difference > between the thread checking whether the ring buffer is empty and the > kernel putting the thread to sleep in the kevent_wait call. > > This is no problem with the current kevent_wait implementation since the > ring buffer is not filled asynchronously. But if/when it will be the > kernel might add something to the ring buffer _after_ the thread checks > for an empty ring buffer and _before_ it enters the kernel in the > kevent_wait syscall. > > The kevent_wait syscall will only wake the thread when a new event is > posted. We do not in general want it to be woken when the ring buffer > is non-empty. This would create far too many unnecessary wakeups if > there is more than one thread working on the queue. > > With the additional parameters for kevent_wait indicating when the calling > thread last checked the ring buffer the kernel can find out whether the > decision to call kevent_wait was made based on outdated information or > not. Outdated in the case a new event has been posted. In this case > the thread is not put to sleep but instead returns. Read my mail again. If the kernel has put data asynchronously it will set a special flag, thus kevent_wait() will not sleep and will return, so the thread will check the new entries and process them. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:14 ` Evgeniy Polyakov @ 2006-11-24 16:31 ` Evgeniy Polyakov 2006-11-27 19:20 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:31 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 07:14:06PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > If the kernel has put data asynchronously it will set a special flag, thus > kevent_wait() will not sleep and will return, so the thread will check the new > entries and process them. For clarification - only kevent_wait() updates the index; userspace will not detect that it has changed after a kernel thread has put new data there. If a kernel thread is to update the index too, you are correct: kevent_wait() should get the index as a parameter. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:14 ` Evgeniy Polyakov 2006-11-24 16:31 ` Evgeniy Polyakov @ 2006-11-27 19:20 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > If the kernel has put data asynchronously it will set a special flag, thus > kevent_wait() will not sleep and will return, so the thread will check the new > entries and process them. This is not sufficient. The userlevel code does not commit the events until they are processed. So assume two threads at userlevel, one event is asynchronously posted. The first thread picks it up, the second calls kevent_wait. With your scheme it will not be put to sleep and unnecessarily return to userlevel. What I propose, and what has been proven to work in many situations, is to make the information "I am aware of all events up to XX; wake me only if anything beyond that is added" part of the kevent_wait syscall. Please take a look at how futexes work, it's really the same concept. And it's really also simpler for the implementation. Having such a flag is much more complicated than adding a simple index comparison before going to sleep. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
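On the kernel side the "simple index comparison before going to sleep" is just that; a minimal sketch, with names assumed, following the semantics the take26 changelog later describes (kevent_wait() has an old_uidx parameter which, if not equal to u->uidx, results in immediate wakeup):

	/* Kernel side of a futex-style guarded wait (hypothetical names). */
	static int kevent_wait_guarded(struct kevent_user *u,
	                               unsigned int old_uidx,
	                               struct timespec *timeout)
	{
		if (old_uidx != u->uidx)
			return 0;  /* caller's view is stale: do not sleep */

		/* Assumed helper: sleep until an event is posted or the
		 * timeout expires. */
		return kevent_wait_for_events(u, timeout);
	}
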
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper @ 2006-11-22 23:52 ` Ulrich Drepper 2006-11-23 11:55 ` Evgeniy Polyakov 2006-11-23 22:33 ` Ulrich Drepper 3 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 23:52 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + struct kevent_ring > + { > + unsigned int ring_kidx, ring_uidx, ring_over; > + struct ukevent event[0]; > + } > + [...] > +ring_uidx - index of the first entry userspace can start reading from Do we need this value in the structure? Userlevel cannot and should not be able to modify it. So, userland has in any case to track the tail pointer itself. Why then have this value at all? After kevent_init() the tail pointer is implicitly assumed to be 0. Since the front pointer (well index) is also zero nothing is available for reading. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-22 23:52 ` Ulrich Drepper @ 2006-11-23 11:55 ` Evgeniy Polyakov 2006-11-23 20:00 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 11:55 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 03:52:11PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ struct kevent_ring > >+ { > >+ unsigned int ring_kidx, ring_uidx, ring_over; > >+ struct ukevent event[0]; > >+ } > >+ [...] > >+ring_uidx - index of the first entry userspace can start reading from > > Do we need this value in the structure? Userlevel cannot and should not > be able to modify it. So, userland has in any case to track the tail > pointer itself. Why then have this value at all? > > After kevent_init() the tail pointer is implicitly assumed to be 0. > Since the front pointer (well index) is also zero nothing is available > for reading. uidx is the index starting from which there are unread entries. It is updated by userspace when it commits entries, so it is the 'consumer' pointer, while kidx is the index where the kernel will put new entries, i.e. the 'producer' index. We definitely need them both. Userspace can only update (implicitly by calling kevent_commit()) uidx. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
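With kidx as the producer index and uidx as the consumer index, the number of ready-but-unread entries is plain ring arithmetic. A sketch; the ring_size parameter and the wrap-around behaviour are assumptions, since the posted description does not spell them out:

	/* Entries the consumer may still read (wrap-around assumed). */
	static inline unsigned int kevent_ready(unsigned int kidx,
	                                        unsigned int uidx,
	                                        unsigned int ring_size)
	{
		return (kidx >= uidx) ? kidx - uidx
		                      : ring_size - uidx + kidx;
	}
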
* Re: [take25 1/6] kevent: Description. 2006-11-23 11:55 ` Evgeniy Polyakov @ 2006-11-23 20:00 ` Ulrich Drepper 2006-11-23 21:49 ` Hans Henrik Happe 2006-11-24 11:46 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:00 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > uidx is the index starting from which there are unread entries. It is > updated by userspace when it commits entries, so it is the 'consumer' > pointer, while kidx is the index where the kernel will put new entries, i.e. > the 'producer' index. We definitely need them both. > Userspace can only update (implicitly by calling kevent_commit()) uidx. Right, which is why exporting this entry is not needed. Keep the interface as small as possible. Userlevel has to maintain its own index. Just assume kevent_wait returns 10 new entries and you have multiple threads. In this case all threads take their turns and pick an entry from the ring buffer. This basically has to be done with something like this (I ignore wrap-arounds here to simplify the example): int getidx() { unsigned int idx; while ((idx = uidx) < kidx) if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) return idx; return -1; } Very much simplified but it should show that we need a writable copy of the uidx. And this value at any time must be consistent with the index the kernel assumes. The current ring_uidx value can at best be used to reinitialize the userlevel uidx value after each kevent_wait call but this is unnecessary at best (since uidx must already have this value) and racy in problem cases (what if more than one thread gets woken concurrently with uidx having the same value and one thread stores the uidx value and immediately increments it to get an index; the second store would overwrite the increment). I can assure you that any implementation I write would not use the ring_uidx value. Only trivial, single-threaded examples like your ring_buffer.c could ever take advantage of this value. It's not worth it. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
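A self-contained version of that claim loop, with the copy step the next mails discuss, might read as follows. GCC's __sync builtin stands in for whatever atomic primitive the runtime provides, uidx is the process-local consumer index the mail argues for (the kernel's committed index only moves at kevent_commit time), and wrap-around is still ignored:

	#include <limits.h>

	static unsigned int uidx;	/* shared userlevel consumer index */

	/* Claim one entry and copy it out; until kevent_commit() advances
	 * the kernel's tail, the slot cannot be overwritten, so the copy
	 * after a successful claim is safe.  Returns UINT_MAX when empty. */
	static unsigned int get_event(struct kevent_ring *ring,
	                              unsigned int kidx,
	                              struct ukevent *out)
	{
		unsigned int old;

		while ((old = uidx) < kidx)
			if (__sync_val_compare_and_swap(&uidx, old, old + 1) == old) {
				*out = ring->event[old];  /* copy before commit */
				return old;
			}
		return UINT_MAX;	/* ring is empty */
	}
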
* Re: [take25 1/6] kevent: Description. 2006-11-23 20:00 ` Ulrich Drepper @ 2006-11-23 21:49 ` Hans Henrik Happe 2006-11-23 22:34 ` Ulrich Drepper 2006-11-24 11:46 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Hans Henrik Happe @ 2006-11-23 21:49 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thursday 23 November 2006 21:00, Ulrich Drepper wrote: > Evgeniy Polyakov wrote: > > uidx is the index starting from which there are unread entries. It is > > updated by userspace when it commits entries, so it is the 'consumer' > > pointer, while kidx is the index where the kernel will put new entries, i.e. > > the 'producer' index. We definitely need them both. > > Userspace can only update (implicitly by calling kevent_commit()) uidx. > > Right, which is why exporting this entry is not needed. Keep the > interface as small as possible. > > Userlevel has to maintain its own index. Just assume kevent_wait > returns 10 new entries and you have multiple threads. In this case all > threads take their turns and pick an entry from the ring buffer. This > basically has to be done with something like this (I ignore wrap-arounds > here to simplify the example): > > int getidx() { > unsigned int idx; > while ((idx = uidx) < kidx) > if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) > return idx; > return -1; > } I don't know if this falls under the simplification, but wouldn't there be a race when reading/copying the event data? I guess this could be solved with an extra user index. -- Hans Henrik Happe ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 21:49 ` Hans Henrik Happe @ 2006-11-23 22:34 ` Ulrich Drepper 2006-11-24 11:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:34 UTC (permalink / raw) To: Hans Henrik Happe Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Hans Henrik Happe wrote: > I don't know if this falls under the simplification, but wouldn't there be a > race when reading/copying the event data? I guess this could be solved with > an extra user index. That's what I said, reading the value from the ring buffer structure's head would be racy. All this can only work for single threaded code. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:34 ` Ulrich Drepper @ 2006-11-24 11:50 ` Evgeniy Polyakov 2006-11-24 16:17 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:50 UTC (permalink / raw) To: Ulrich Drepper Cc: Hans Henrik Happe, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 02:34:46PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Hans Henrik Happe wrote: > >I don't know if this falls under the simplification, but wouldn't there be > >a race when reading/copying the event data? I guess this could be solved > >with an extra user index. > > That's what I said, reading the value from the ring buffer structure's > head would be racy. All this can only work for single threaded code. The value in the userspace ring is updated each time it is changed in the kernel (when userspace calls kevent_commit()); when userspace has read its old value it is guaranteed that the requested number of events _is_ there (although it is possible that there are more than that value). Ulrich, why didn't you comment on the previous interface, which had exactly _one_ index exported to userspace - it is only required to add implicit uidx and (if you prefer that way) an additional syscall, since in the previous interface both waiting and commit were handled by kevent_wait() with different parameters. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:50 ` Evgeniy Polyakov @ 2006-11-24 16:17 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:17 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Hans Henrik Happe, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > Ulrich, why didn't you comment on the previous interface, which had exactly > _one_ index exported to userspace - it is only required to add implicit > uidx and (if you prefer that way) an additional syscall, since in the previous > interface both waiting and commit were handled by kevent_wait() with > different parameters. If you read my old mails you'll find that I'm pretty consistent wrt the ring buffer interface. The old code had other problems, not the missing exposure of the uidx value. There is really not much disagreement here. I just don't like making the interface unnecessarily and misleadingly large by exposing the uidx value which is not useful to the userlevel code. Just remove the element and stuff it into a kernel-internal struct for the queue and you're done. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 20:00 ` Ulrich Drepper 2006-11-23 21:49 ` Hans Henrik Happe @ 2006-11-24 11:46 ` Evgeniy Polyakov 2006-11-24 16:30 ` Ulrich Drepper 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:46 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 12:00:45PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >uidx is the index starting from which there are unread entries. It is > >updated by userspace when it commits entries, so it is the 'consumer' > >pointer, while kidx is the index where the kernel will put new entries, i.e. > >the 'producer' index. We definitely need them both. > >Userspace can only update (implicitly by calling kevent_commit()) uidx. > > Right, which is why exporting this entry is not needed. Keep the > interface as small as possible. If there are several callers of kevent_commit(), uidx can be changed farther than the first caller expects, so there should be a possibility to check that value. It is thus exported into the shared ring buffer structure. > Userlevel has to maintain its own index. Just assume kevent_wait > returns 10 new entries and you have multiple threads. In this case all > threads take their turns and pick an entry from the ring buffer. This > basically has to be done with something like this (I ignore wrap-arounds > here to simplify the example): > > int getidx() { > unsigned int idx; > while ((idx = uidx) < kidx) > if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) > return idx; > return -1; > } > > Very much simplified but it should show that we need a writable copy of > the uidx. And this value at any time must be consistent with the index > the kernel assumes. I seriously doubt it is simpler than having index provided by kernel. > The current ring_uidx value can at best be used to reinitialize the > userlevel uidx value after each kevent_wait call but this is unnecessary > at best (since uidx must already have this value) and racy in problem > cases (what if more than one thread gets woken concurrently with uidx > having the same value and one thread stores the uidx value and > immediately increments it to get an index; the second store would > overwrite the increment). > > I can assure you that any implementation I write would not use the > ring_uidx value. Only trivial, single-threaded examples like your > ring_buffer.c could ever take advantage of this value. It's not worth it. You propose to make uidx shared local variable - it is doable, but it is not required - userspace can use kernel's variable, since it is updated exactly in the places where that index is changed. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:46 ` Evgeniy Polyakov @ 2006-11-24 16:30 ` Ulrich Drepper 2006-11-24 16:49 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:30 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: >> Very much simplified but it should show that we need a writable copy of >> the uidx. And this value at any time must be consistent with the index >> the kernel assumes. > > I seriously doubt it is simpler than having index provided by kernel. What has simpler to do with it? The userlevel code should not modify the ring buffer structure at all. If we'd do this then all operations, at least on the uidx field, would have to be atomic operations. This is currently not the case for the kernel side since it's protected by a lock for the event queue. Using the uidx field from userlevel would therefore just make things slower. And for what? Changing the uidx value would make the commit syscall unnecessary. This might be an argument but it sounds too dangerous. IMO the value should be protected by the kernel. And in any case, the uidx value cannot be updated until the event actually has been processed. But the threads still need to coordinate distributing the events from the ring buffer amongst themselves. This will in any case require a second variable. So, if you want to do away with the commit syscall, keep the uidx value. This also requires that the ring buffer head will always be writable (something I'd like to avoid making part of the interface but I'm flexible on this). Otherwise, the ring_uidx element can go away, it's not needed and will only make people think about wrong approaches to use it. > You propose to make uidx shared local variable - it is doable, but it > is not required - userspace can use kernel's variable, since it is > updated exactly in the places where that index is changed. As said above, we always need another variable and uidx is only a replacement for the commit call. Until the event is processed the uidx cannot be incremented since otherwise the ring buffer entry might be overwritten. And kernel people of all should be happy to limit the exposure of the implementation. So, leave the problem of keeping track of the tail pointer to the userlevel code. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:30 ` Ulrich Drepper @ 2006-11-24 16:49 ` Evgeniy Polyakov 2006-11-27 19:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:49 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 08:30:14AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>Very much simplified but it should show that we need a writable copy of > >>the uidx. And this value at any time must be consistent with the index > >>the kernel assumes. > > > >I seriously doubt it is simpler than having index provided by kernel. > > What has simpler to do with it? The userlevel code should not modify > the ring buffer structure at all. If we'd do this then all operations, > at least on the uidx field, would have to be atomic operations. This is > currently not the case for the kernel side since it's protected by a > lock for the event queue. Using the uidx field from userlevel would > therefore just make things slower. That index is provided by kernel for userspace so that userspace could determine where indexes are - of course userspace can maintain it itself, but it can also use provided by kernel. It is not written explicitly, but only through kevent_commit(). > And for what? Changing the uidx value would make the commit syscall > unnecessary. This might be an argument but it sounds too dangerous. > IMO the value should be protected by the kernel. > > And in any case, the uidx value cannot be updated until the event > actually has been processed. But the threads still need to coordinate > distributing the events from the ring buffer amongst themselves. This > will in any case require a second variable. > > So, if you want to do away with the commit syscall, keep the uidx value. > This also requires that the ring buffer head will always be writable > (something I'd like to avoid making part of the interface but I'm > flexible on this). Otherwise, the ring_uidx element can go away, it's > not needed and will only make people think about wrong approaches to use it. No, the head will not be writable - absolutely not. I do not care actually about that index, but as you have probably noticed, there was such an interface already, and I changed it. So, this will be the last change of the interface. You think it should not be exported - fine, it will not be. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:49 ` Evgeniy Polyakov @ 2006-11-27 19:23 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > That index is provided by kernel for userspace so that userspace could > determine where indexes are - of course userspace can maintain it > itself, but it can also use provided by kernel. Indeed. That's what I said. But I also pointed out that the field is only useful in simple minded programs and certainly not in the wrappers the runtime (glibc) will provide. As you said yourself, there is no real need for the value being there, userland can keep track of it by itself. So, let's reduce the interface. > I do not care actually about that index, but as you have probably noticed, > there was such an interface already, and I changed it. So, this will be the > last change of the interface. You think it should not be exported - > fine, it will not be. Thanks. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov ` (2 preceding siblings ...) 2006-11-22 23:52 ` Ulrich Drepper @ 2006-11-23 22:33 ` Ulrich Drepper 2006-11-23 22:48 ` Jeff Garzik 2006-11-24 12:05 ` Evgeniy Polyakov 3 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:33 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + int kevent_commit(int ctl_fd, unsigned int start, > + unsigned int num, unsigned int over); I think we can simplify this interface: int kevent_commit(int ctl_fd, unsigned int new_tail, unsigned int over); The kernel sets the ring_uidx value to the 'new_tail' value if the tail pointer would be incremented (modulo wrap-around) and is not higher than the current front pointer. The test will be a bit complicated but not more so than what the current code has to do to check for mistakes. This approach has the advantage that the commit calls don't have to be synchronized. If one thread sets the tail pointer to, say, 10 and another to 12, then it does not matter whether the first thread is delayed. If it is eventually executed the result is simply a no-op, since the second thread's action supersedes it. Maybe the current form is even impossible to use with explicit locking at userlevel. What if one thread, which is about to call kevent_commit, is indefinitely delayed? Then this commit request's value is never taken into account and the tail pointer is always short of what it should be. There is one more thing to consider. Oftentimes the commit request will be immediately followed by a kevent_wait call. It would be good to merge this pair of calls. The two parameters new_tail and over could also be passed to the kevent_wait call and the commit can happen before the thread looks for new events and eventually goes to sleep. If this can be implemented then the kevent_commit syscall by itself might not be needed at all. Instead you'd call kevent_wait() and make the maximum number of events which can be returned zero. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
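The forward-only update that makes stale commits harmless is small. A sketch with assumed structure and field names; wrap-around and the 'over' generation counter are elided:

	struct kevent_queue {           /* assumed, kernel-internal       */
		unsigned int uidx;      /* tail: first uncommitted entry  */
		unsigned int kidx;      /* front: next slot kernel fills  */
	};

	/* Only move the tail forward, never past the front pointer.  A
	 * delayed thread committing a stale new_tail becomes a no-op,
	 * superseded by the later commit - so callers need no locking. */
	static void commit_tail(struct kevent_queue *q, unsigned int new_tail)
	{
		if (new_tail > q->uidx && new_tail <= q->kidx)
			q->uidx = new_tail;
	}
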
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:33 ` Ulrich Drepper @ 2006-11-23 22:48 ` Jeff Garzik 2006-11-23 23:45 ` Ulrich Drepper 2006-11-24 0:14 ` Hans Henrik Happe 1 sibling, 2 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-23 22:48 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Ulrich Drepper wrote: > Evgeniy Polyakov wrote: >> + int kevent_commit(int ctl_fd, unsigned int start, + unsigned int >> num, unsigned int over); > > I think we can simplify this interface: > > int kevent_commit(int ctl_fd, unsigned int new_tail, > unsigned int over); > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > pointer would be incremented (modulo wrap-around) and is not higher than > the current front pointer. The test will be a bit complicated but not > more so than what the current code has to do to check for mistakes. > > This approach has the advantage that the commit calls don't have to be > synchronized. If one thread sets the tail pointer to, say, 10 and > another to 12, then it does not matter whether the first thread is > delayed. If it is eventually executed the result is simply a no-op, > since the second thread's action supersedes it. > > Maybe the current form is even impossible to use with explicit locking > at userlevel. What if one thread, which is about to call kevent_commit, > is indefinitely delayed? Then this commit request's value is never > taken into account and the tail pointer is always short of what it > should be. I'm really wondering whether designing for N-threads-to-1-ring is the wisest choice. Considering current designs, it seems more likely that a single thread polls for socket activity, then dispatches work. How often do you really see in userland multiple threads polling the same set of fds, then fighting to decide who will handle raised events? More likely, you will see "prefork" (start N threads, each with its own ring) or a worker pool (single thread receives events, then dispatches to multiple threads for execution) or even one-thread-per-fd (single thread receives events, then starts new thread for handling). If you have multiple threads accessing the same ring -- a poor design choice -- I would think the burden should be on the application, to provide proper synchronization. If the desire is to have the kernel distribute events directly to multiple threads, then the app should dup(2) the fd to be watched, and create a ring buffer for each separate thread. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:48 ` Jeff Garzik @ 2006-11-23 23:45 ` Ulrich Drepper 2006-11-24 0:48 ` Eric Dumazet 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 23:45 UTC (permalink / raw) To: Jeff Garzik Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Jeff Garzik wrote: > Considering current designs, it seems more likely that a single thread > polls for socket activity, then dispatches work. How often do you > really see in userland multiple threads polling the same set of fds, > then fighting to decide who will handle raised events? > > More likely, you will see "prefork" (start N threads, each with its own > ring) or a worker pool (single thread receives events, then dispatches > to multiple threads for execution) or even one-thread-per-fd (single > thread receives events, then starts new thread for handling). No, absolutely not. This is exactly not what should/is/will happen. You create worker threads to handle the work for the entire program. Look at something like a web server. When creating several queues, how do you distribute all the connections to the different queues? To ensure every connection is handled as quickly as possible you stuff them all in the same queue and then have all threads use this one queue. Whenever an event is posted a thread is woken. _One_ thread. If two events are posted, two threads are woken. In this situation we have a few atomic ops at userlevel to make sure that the two threads don't pick the same event but that's all there is wrt "fighting". The alternative is the sorry state we have now. In nscd, for instance, we have one single thread waiting for incoming connections and it then has to wake up a worker thread to handle the processing. This is done because we cannot "park" all threads in the accept() call since when a new connection is announced _all_ the threads are woken. With the new event handling this wouldn't be the case, one thread only is woken and we don't have to wake worker threads. All threads can be worker threads. > If you have multiple threads accessing the same ring -- a poor design > choice To the contrary. It is the perfect means to distribute the workload to multiple threads. Besides, how would you implement asynchronous filling of the ring buffer to avoid unnecessary syscalls if you have many different queues? > -- I would think the burden should be on the application, to > provide proper synchronization. Sure, as much as possible. But there is no reason to design the commit interface in the way which requires expensive synchronization when there is another design which can do exactly the same work but does not require synchronization. The currently proposed kevent_commit and my proposed variant are functionally equivalent. > If the desire is to have the kernel distributes events directly to > multiple threads, then the app should dup(2) the fd to be watched, and > create a ring buffer for each separate thread. And how would you synchronize the file descriptor use across the threads? The event would be sent to all the event queues so that you would a) unnecessarily wake all threads and b) have all but one thread see the operation (say, read or write on a socket) fail with EWOULDBLOCK. That's just silly, we can have that today and continue to waste precious CPU cycles. 
If you say that you post exactly one event per file description (not handle) then what do you do if the programmer wants the opposite? And again, what do you do for asynchronous ring buffer filling? Which queue do you pick? Pick the wrong one and the event might sit in the ring buffer for a long time while another thread handling another queue is idle. Using a single central queue is the perfect means to distribute the load to a number of threads. Nobody is forcing you to do it, you're free to use separate queues if you want. But the model should not enforce this. Overall, I cannot see at all where your problem is. I agree that the synchronization of the access to the ring buffer must be done at userlevel. This is why the uidx exposure isn't needed. The wakeup in any case has to take threads into account. The only change I proposed to enable better multi-thread handling is the revised commit interface and this change in no way hinders single-threaded users. The interface is not hindered in any way or form by the use of threads. Oh, and when I say "threads" I should have said "threads or processes". The whole also applies to multi-process applications. They can share event queues by placing them in shared memory. And I hope that everyone agrees that programs have to go in the direction of having more than one execution context to take advantage of increased CPU power in the future. CMP is only becoming more and more important. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
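The single-shared-queue worker pool Drepper describes reduces to one loop that every thread runs. A sketch only: get_event() is the claim loop sketched earlier in the thread, while ctl_fd, uidx, timeout and handle_event() are assumptions, and tail commitment is deliberately left out since it needs the ordering bookkeeping the earlier mails discuss:

	#include <limits.h>

	static void *worker(void *arg)
	{
		struct kevent_ring *ring = arg;
		struct ukevent ev;

		for (;;) {
			unsigned int idx = get_event(ring, ring->ring_kidx, &ev);

			if (idx == UINT_MAX) {	/* ring drained            */
				/* guarded wait: one thread woken per event */
				kevent_wait(ctl_fd, 0, uidx, &timeout);
				continue;	/* re-check after wakeup   */
			}
			handle_event(&ev);	/* real work, no dispatcher */
		}
		return NULL;
	}
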
* Re: [take25 1/6] kevent: Description. 2006-11-23 23:45 ` Ulrich Drepper @ 2006-11-24 0:48 ` Eric Dumazet 2006-11-24 8:14 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-24 0:48 UTC (permalink / raw) To: Ulrich Drepper Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Ulrich Drepper wrote: > > You create worker threads to handle the work for the entire program. Look > at something like a web server. When creating several queues, how do > you distribute all the connections to the different queues? To ensure > every connection is handled as quickly as possible you stuff them all in > the same queue and then have all threads use this one queue. Whenever an > event is posted a thread is woken. _One_ thread. If two events are > posted, two threads are woken. In this situation we have a few atomic > ops at userlevel to make sure that the two threads don't pick the same > event but that's all there is wrt "fighting". > > The alternative is the sorry state we have now. In nscd, for instance, > we have one single thread waiting for incoming connections and it then > has to wake up a worker thread to handle the processing. This is done > because we cannot "park" all threads in the accept() call since when a > new connection is announced _all_ the threads are woken. With the new > event handling this wouldn't be the case, one thread only is woken and > we don't have to wake worker threads. All threads can be worker threads. Having one specialized thread handling the distribution of work to worker threads is better most of the time. This thread can be a worker thread by itself (to avoid context switches), but can decide to wake up 'slave threads' if it believes it has to (for example if it notices that a *lot* of requests are pending). This is because with moderate load, it's better to have only one CPU running 80% of its time, keeping its cache hot, than to 'distribute' the work on four CPUs that would be used 25% of their time, but with lots of cache line ping-pongs and poor cache reuse. If you let 'kevent'/'dumb kernel dispatcher'/'futex'/'whatever' decide to wake up one thread for each new event, you *may* have lower performance, because of higher system overhead (system means: system scheduler/internals, but also bus traffic). Only the application writer can have a clue about the average use of its worker threads, and can decide to dynamically adjust parameters if needed to handle load spikes. SMP machines are nice, but for many workloads, it's better to avoid spreading a working set on several CPUs that fight for common resources (memory). Back to 'kevent': ----------------- I think that having a syscall to commit events should not be mandatory. A syscall is needed only to wait for new events if the ring is empty. But then maybe we don't need yet another new syscall to perform a wait: We already have nice synchronisation primitives (futex for example). 
The user program should be able to update a 'uidx' in user space (using atomic ops only if multi-threaded), and could just use the futex infrastructure if the ring buffer is empty (uidx == kidx), calling FUTEX_WAIT(&kidx, current value = uidx). I think I already gave my opinion on a ring buffer, but let me just rephrase it: One part should be read/write for the application (to be able to change uidx) (or the user app just gives the kernel, at init time, the address of a futex in its vm space). One part could be read-only for the application (but could be read/write: we don't care if the user application is stupid): the kernel writes its kidx (or a copy of it) and the events. For best performance, uidx and kidx should be on different cache lines (basic isolation of producer / consumer). When the kernel wants to queue a new event in a ring buffer it can: - See if the user program has consumed some events since the last invocation (the kernel fetches uidx and compares it with its own uidx value: no syscall needed). - Check if a slot is available in the ring buffer. - Copy the event into the ring buffer, perform a memory barrier, then increment kidx. - Call futex_wake(&kidx, 1 thread). The user application is free to have one thread/process or several threads/processes waiting for new events (or even no thread at all :) ) Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
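On the user side Dumazet's scheme needs no kevent-specific wait syscall at all; a sketch with the raw futex syscall, exactly as he describes it (the kernel would pair each posted event with futex_wake on &kidx):

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Sleep only while the ring is empty: FUTEX_WAIT blocks solely if
	 * *kidx still equals the uidx we saw, so a post between the check
	 * and the syscall makes the kernel return immediately - the same
	 * race-free guarantee as the guarded kevent_wait. */
	static void wait_for_events(unsigned int *kidx, unsigned int uidx)
	{
		while (*kidx == uidx)
			syscall(SYS_futex, kidx, FUTEX_WAIT, uidx,
			        NULL, NULL, 0);
	}
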
* Re: [take25 1/6] kevent: Description. 2006-11-24 0:48 ` Eric Dumazet @ 2006-11-24 8:14 ` Andrew Morton 2006-11-24 8:33 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-24 8:14 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Fri, 24 Nov 2006 01:48:32 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > > The alternative is the sorry state we have now. In nscd, for instance, > > we have one single thread waiting for incoming connections and it then > > has to wake up a worker thread to handle the processing. This is done > > because we cannot "park" all threads in the accept() call since when a > > new connection is announced _all_ the threads are woken. With the new > > event handling this wouldn't be the case, one thread only is woken and > > we don't have to wake worker threads. All threads can be worker threads. > > Having one specialized thread handling the distribution of work to worker > threads is better most of the time. It might be now. Think "commodity 128-way". Your single distribution thread will run out of steam. What Ulrich is proposing is faster. This is a new interface. Let's design it to be fast. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 8:14 ` Andrew Morton @ 2006-11-24 8:33 ` Eric Dumazet 2006-11-24 15:26 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-24 8:33 UTC (permalink / raw) To: Andrew Morton Cc: Ulrich Drepper, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Andrew Morton wrote: > On Fri, 24 Nov 2006 01:48:32 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: > >>> The alternative is the sorry state we have now. In nscd, for instance, >>> we have one single thread waiting for incoming connections and it then >>> has to wake up a worker thread to handle the processing. This is done >>> because we cannot "park" all threads in the accept() call since when a >>> new connection is announced _all_ the threads are woken. With the new >>> event handling this wouldn't be the case, one thread only is woken and >>> we don't have to wake worker threads. All threads can be worker threads. >> Having one specialized thread handling the distribution of work to worker >> threads is better most of the time. > > It might be now. Think "commodity 128-way". Your single distribution thread > will run out of steam. > > What Ulrich is proposing is faster. This is a new interface. Let's design > it to be fast. Hum... I guess you didn't read my mail... I basically agree with Ulrich. I just wanted to say that a fast application cannot rely only on a "let's park N threads waiting for a single event in this queue" approach, and hope the kernel will be smart for us. Even with 128 ways, you still hit a central point of coordination (it can be a mutex in kevent code, an atomic uidx in userland, or whatever) for a 'kevent queue'. Once you have paid for the cache-line ping-pong, you won't be *fast*. I hope *you* don't think of kevent as only dispatching trivial HTTP 1.0 web requests. Being able to direct a particular request on a particular CPU is certainly something that cannot be hardcoded in 'the new kevent interface'. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 8:33 ` Eric Dumazet @ 2006-11-24 15:26 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 15:26 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Eric Dumazet wrote: > Being able to direct a particular request on a particular CPU is > certainly something that cannot be hardcoded in 'the new kevent interface'. Nobody is proposing this. Although I have proposed that if the kernel knows which CPU can best service a request it might hint as much. But in general, you're free to decentralize as much as you want. But this does not mean it should not also be possible to use a number of threads in the same loop and the same kevent queue. That's the part which needs designing, the separate queues will always be possible. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:48 ` Jeff Garzik 2006-11-23 23:45 ` Ulrich Drepper @ 2006-11-24 0:14 ` Hans Henrik Happe 1 sibling, 0 replies; 200+ messages in thread From: Hans Henrik Happe @ 2006-11-24 0:14 UTC (permalink / raw) To: Jeff Garzik Cc: Ulrich Drepper, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Thursday 23 November 2006 23:48, Jeff Garzik wrote: > I'm really wondering whether designing for N-threads-to-1-ring is the wisest > choice. > > Considering current designs, it seems more likely that a single thread > polls for socket activity, then dispatches work. How often do you > really see in userland multiple threads polling the same set of fds, > then fighting to decide who will handle raised events? They should not fight, but gently divide event handling work. > More likely, you will see "prefork" (start N threads, each with its own > ring) One ring could be more busy than others, leaving all the work to one thread. > or a worker pool (single thread receives events, then dispatches > to multiple threads for execution) or even one-thread-per-fd (single > thread receives events, then starts new thread for handling). This is more like fighting :-) It adds context switches and therefore extra latency for event handling. > If you have multiple threads accessing the same ring -- a poor design > choice -- I would think the burden should be on the application, to > provide proper synchronization. Coming from the HPC world I do not agree. Context switches should be avoided. This paper is a good example from the HPC world: http://cobweb.ecn.purdue.edu/~vpai/Publications/majumder-lacsi04.pdf. The latency problems introduced by context switches in this work calls for even more functionality in event handling. I will not go into details now. There are enough problems with kevent's current feature set and I believe these extra features can be added later without breaking the API. -- Hans Henrik Happe ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:33 ` Ulrich Drepper 2006-11-23 22:48 ` Jeff Garzik @ 2006-11-24 12:05 ` Evgeniy Polyakov 2006-11-24 12:13 ` Evgeniy Polyakov 2006-11-27 19:43 ` Ulrich Drepper 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 12:05 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ int kevent_commit(int ctl_fd, unsigned int start, > >+ unsigned int num, unsigned int over); > > I think we can simplify this interface: > > int kevent_commit(int ctl_fd, unsigned int new_tail, > unsigned int over); > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > pointer would be incremented (modulo wrap-around) and is not higher than > the current front pointer. The test will be a bit complicated but not > more so than what the current code has to do to check for mistakes. > > This approach has the advantage that the commit calls don't have to be > synchronized. If one thread sets the tail pointer to, say, 10 and > another to 12, then it does not matter whether the first thread is > delayed. If it is eventually executed the result is simply a no-op, > since the second thread's action supersedes it. > > Maybe the current form is even impossible to use with explicit locking > at userlevel. What if one thread, which is about to call kevent_commit, > is indefinitely delayed? Then this commit request's value is never > taken into account and the tail pointer is always short of what it > should be. I like this interface, although the current one does not allow special synchronization in userspace, since it calculates whether a new commit is in the area where a previous commit was. Will change for the next release. > There is one more thing to consider. Oftentimes the commit request will > be immediately followed by a kevent_wait call. It would be good to > merge this pair of calls. The two parameters new_tail and over could > also be passed to the kevent_wait call and the commit can happen before > the thread looks for new events and eventually goes to sleep. If this > can be implemented then the kevent_commit syscall by itself might not be > needed at all. Instead you'd call kevent_wait() and make the maximum > number of events which can be returned zero. It _IS_ how the previous interface worked. EXACTLY! There was one syscall which committed the requested number of events and waited until there were new ready events. The only thing it missed was the userspace index (it assumed that if userspace waits for something, then all previous work is done). Ulrich, I'm not going to think for other people all over the world and blindly implement ideas which in a day or two will be commented on as redundant, because the flow of mind has changed and people did not have enough time to check the previous version. I will wait for some time until you and other people have made your comments on the interfaces, and release a final version in about a week; now I will go to hack netchannels. NO INTERFACE CHANGES AFTER THAT DAY. COMPLETELY. So, feel free to think about a perfect interface anyone will be happy with. But please release your thoughts not in the form of abstract words, but more precisely, at least like in this e-mail, so I can understand what _you_ want from _your_ interface. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 12:05 ` Evgeniy Polyakov @ 2006-11-24 12:13 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 12:13 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 03:05:31PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > > Evgeniy Polyakov wrote: > > >+ int kevent_commit(int ctl_fd, unsigned int start, > > >+ unsigned int num, unsigned int over); > > > > I think we can simplify this interface: > > > > int kevent_commit(int ctl_fd, unsigned int new_tail, > > unsigned int over); > > > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > > pointer would be incremented (modulo wrap-around) and is not higher than > > the current front pointer. The test will be a bit complicated but not > > more so than what the current code has to do to check for mistakes. > > > > This approach has the advantage that the commit calls don't have to be > > synchronized. If one thread sets the tail pointer to, say, 10 and > > another to 12, then it does not matter whether the first thread is > > delayed. If it is eventually executed the result is simply a no-op, > > since the second thread's action supersedes it. > > > > Maybe the current form is even impossible to use with explicit locking > > at userlevel. What if one thread, which is about to call kevent_commit, > > is indefinitely delayed? Then this commit request's value is never > > taken into account and the tail pointer is always short of what it > > should be. > > I like this interface, although the current one does not allow special ...does not require... > synchronization in userspace, since it calculates whether a new commit is in > the area where a previous commit was. > Will change for the next release. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 12:05 ` Evgeniy Polyakov 2006-11-24 12:13 ` Evgeniy Polyakov @ 2006-11-27 19:43 ` Ulrich Drepper 2006-11-28 10:26 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:43 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > It _IS_ how the previous interface worked. > > EXACTLY! No, the old interface committed everything, not only up to a given index. This is the huge difference which makes or breaks it. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-27 19:43 ` Ulrich Drepper @ 2006-11-28 10:26 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 10:26 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Mon, Nov 27, 2006 at 11:43:46AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >It _IS_ how the previous interface worked. > > > > EXACTLY! > > No, the old interface committed everything, not only up to a given index. > This is the huge difference which makes or breaks it. The interface was the same - the logic behind it was different; the only thing required was to add the consumer's index - that is all, no need to change a lot of declarations, userspace and so on - just use the existing interface and extend its functionality. But it does not matter anymore; later this week I will collect all proposed changes and implement (hopefully) the last release, which will close most of the questions regarding userspace interfaces (except the signal mask, which is still in flux), so we can concentrate on internals and/or new kernel users. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take26 0/8] kevent: Generic event handling mechanism.
       [not found] <1154985aa0591036@2ka.mipt.ru>
                   ` (4 preceding siblings ...)
  2006-11-21 16:29 ` [take25 " Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14   ` [take26 1/8] kevent: Description Evgeniy Polyakov
  5 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications. It supports
both level- and edge-triggered events. It is similar to poll/epoll in some
cases, but it is more scalable, faster and works with essentially any kind
of event. Events are provided to the kernel through a control syscall and
can be read back through a ring buffer or using the usual syscalls. A kevent
update (i.e. readiness switching) happens directly from the internals of the
appropriate state machine of the underlying subsystem (like network,
filesystem, timer or any other). A minimal consumer loop illustrating this
flow is sketched after the changelog below.

Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Documentation page (will update Dec 1): http://linux-net.osdl.org/index.php/Kevent

I installed a slightly used but still functional remote mind reader (bought
on eBay) and set it up to read Ulrich's alpha brain waves (I hope he agrees
that it is a good decision), which took me the whole week. So I think the
last ring buffer implementation is what we all wanted. Details are in the
documentation part. It seems the setup was correct and we finally found what
we wanted from the interface part.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
	results in immediate wakeup (useful for the case when entries are
	added asynchronously from the kernel (not supported for now)).
 * added interface to mark any event as ready.
 * POSIX timers event support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by
	Eric Dumazet).
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with the lighttpd patch
	can be found in the blog).

Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; kevent
	always wakes only the first thread if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, the kevent will be queued into
	the ready queue instead of being copied back to userspace when it
	is ready immediately at addition time.
 * lighttpd patch (Hail!
Although nothing really outstanding compared to epoll)

Changes from 'take23' patchset:
 * kevent PIPE notifications
 * KEVENT_REQ_LAST_CHECK flag, which allows performing the last check at
	dequeueing time
 * fixed poll/select notifications (were broken due to tree manipulations)
 * made Documentation/kevent.txt look nice in an 80-col terminal
 * fix for copy_to_user() failure report for the first kevent (Andrew Morton)
 * minor function renames

Changes from 'take22' patchset:
 * new ring buffer implementation in process' memory
 * wakeup-one-thread flag
 * edge-triggered behaviour

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables,
	whitespace and so on)
 * fixed bug in kevent removal in the case when the kevent being removed
	is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on the possible number of kevents

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table. At least for a web server, the
	frequency of addition/deletion of new kevents is comparable with the
	number of search accesses, i.e. most of the time events are added,
	accessed only a couple of times and then removed, which justifies RB
	tree usage over AVL tree, since the latter has much slower deletion
	time (max O(log(N)) compared to 3 ops), although faster search time
	(1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for
	now and later, when my AVL tree implementation is ready, it will be
	possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second
compared to 2200, sometimes 2500 req/second for epoll() for a trivial
web-server and httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum allowed
kevents in a time limit, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
	calculation
 * export kevent_socket_notify(), since it is used in network protocols
	which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer
	API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has
	been updated)
 * added warning in kevent_add_ukevent() if ring has broken index
	(for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
	This syscall waits until either the timeout expires or at least one
	event becomes ready. It also commits that @num events from @start
	are processed by userspace and thus can be removed or rearmed
	(depending on their flags). It can be used to commit events read by
	userspace through the mmap interface. Example userspace code
	(evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there are no registered callbacks for the given type
	of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer
	if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into the user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before the main loop, which should
	save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80-column comment issues
 * added a header shared between userspace and kernelspace instead of
	embedding them in one
 * core restructuring to remove forward declarations
 * s o m e   w h i t e s p a c e   c o d y n g   s t y l e   c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate a new page for events only when a newly added kevent
	  requires it
	- do not use ugly index dereferencing, use a structure instead
	- reduced amount of data in the ring (id and flags), maximum
	  12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect that an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is
	not turned on
 * do not use internal socket structures, use appropriate (exported)
	wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comment fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the
	same function, so it was taught to differentiate between the
	various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in the mmaped buffer implementation which would end up
	in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback
	initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead
	of per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments
	as syscall parameters
 * various indent cleanups
 * added optimisation aimed to help when a lot of kevents are being
	copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr,
	unsigned int max_nr, unsigned int timeout, void __user *buf,
	unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and
	initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor
	does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
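The consumer loop referenced in the announcement above might look as follows.
Everything in this sketch is an assumption for illustration: the syscall
numbers are the x86-64 ones assigned by this patchset (282-284), the
structures come from the patch's linux/ukevent.h, glibc has no wrappers so
raw syscall(2) is used, and since the patch declares the kevent_wait()
timeout as a struct timespec passed by value, the sketch passes tv_sec and
tv_nsec as two register arguments. It has not been tested against the patch.

	/* Minimal kevent ring-buffer consumer; a sketch only. */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdlib.h>
	#include <time.h>
	#include <linux/ukevent.h>	/* from this patchset, not mainline */

	#define __NR_kevent_wait	282	/* x86-64 numbers from this patch */
	#define __NR_kevent_commit	283
	#define __NR_kevent_init	284

	#define RING_SIZE		256

	int main(void)
	{
		struct kevent_ring *ring;
		unsigned int uidx = 0, over = 0;
		struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
		int fd;

		/* Ring buffer lives in our memory: header plus RING_SIZE events. */
		ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
		if (!ring)
			return 1;

		fd = syscall(__NR_kevent_init, ring, RING_SIZE, 0);
		if (fd < 0)
			return 1;

		/* ... add events here with kevent_ctl(fd, KEVENT_CTL_ADD, ...) ... */

		for (;;) {
			/* Kernel fills ring->event[] and advances ring->ring_kidx;
			 * the timespec is passed by value, i.e. in two registers
			 * here (an assumption of this sketch). */
			syscall(__NR_kevent_wait, fd, 0, uidx, ts.tv_sec, ts.tv_nsec, 0);

			/* Consume everything between our index and the kernel's. */
			while (uidx != ring->ring_kidx) {
				struct ukevent *e = &ring->event[uidx];

				/* ... dispatch on e->type / e->event / e->user ... */
				(void)e;

				if (++uidx == RING_SIZE) {
					uidx = 0;
					over++;	/* mirrors ring->ring_over */
				}
			}

			/* Mark the consumed slots reusable. */
			syscall(__NR_kevent_commit, fd, uidx, over);
		}
	}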
* [take26 1/8] kevent: Description.
  2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-11-30 19:14   ` Evgeniy Polyakov
  2006-11-30 19:14     ` [take26 2/8] kevent: Core files Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..2e03a3f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,240 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size,
+	unsigned int flags);
+
+ring - pointer to allocated ring buffer
+ring_size - size of the ring buffer in events
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+	when kevent_wait() or kevent_get_events() is called
+ring_over - number of overflows of ring_uidx that have happened since the
+	start. The overflow counter is used to prevent the situation where
+	two threads are going to free the same events, but one of them was
+	scheduled away for so long that the ring indexes wrapped; when that
+	thread is awakened, it would otherwise free events other than the
+	ones it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready, since if it is
+removed, no one cares about it (otherwise the user would wait until it
+becomes ready and get it the usual way using kevent_get_events() or
+kevent_wait()), and thus there is no need to copy it to the ring buffer.
+
+-------------------------------------------------------------------------------
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is created by the kevent_init() syscall.
+
+cmd - is the requested operation. It can be one of the following:
+    KEVENT_CTL_ADD - add event notification
+    KEVENT_CTL_REMOVE - remove event notification
+    KEVENT_CTL_MODIFY - modify existing notification
+    KEVENT_CTL_READY - mark existing events as ready; if the number of
+	events is zero, it just wakes up a thread parked in a kevent syscall
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+	struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+	waiting for
+max_nr - number of struct ukevent in buf
+timeout - time to wait before returning with less than min_nr events.
+	If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait up to the given timeout for at least min_nr
+completed events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function copies events into the ring buffer if it was initialized; if
+the ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the
+ret_flags field.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+	struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+old_uidx - the last index the user is aware of
+timeout - time to wait until there is free space in the kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into the ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready. It also copies events into the special ring buffer. If the
+ring buffer is full, it waits until there are ready events and then returns.
+If a kevent is a one-shot kevent, it is removed in this syscall.
+If a kevent is edge-triggered (KEVENT_REQ_ET flag is set in 'req_flags'), it
+is requeued in this syscall for performance reasons.
+-------------------------------------------------------------------------------
+
+ int kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);
+
+ctl_fd - file descriptor referring to the kevent queue
+new_uidx - the last committed kevent
+over - overflow count for the given new_uidx value
+
+Return value:
+ number of committed kevents or negative error value.
+
+This function commits, i.e. marks as empty, slots in the ring buffer, so
+they can be reused after userspace has completed processing those entries.
+
+The overflow counter is used to prevent the situation where two threads are
+going to free the same events, but one of them was scheduled away for so long
+that the ring indexes wrapped; when that thread is awakened, it would
+otherwise free events other than the ones it was supposed to free.
+
+It is possible that the returned number of committed events is smaller than
+the requested number - this can happen when several threads try to commit
+the same events.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+    Id of this request, e.g.
socket number, file descriptor and so on
+__u32 type
+    Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+    Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+    Per-event request flags,
+
+    KEVENT_REQ_ONESHOT
+	event will be removed when it is ready
+
+    KEVENT_REQ_WAKEUP_ALL
+	Kevent wakes up only the first thread interested in the given event,
+	or all threads if this flag is set.
+
+    KEVENT_REQ_ET
+	Edge Triggered behaviour. It is an optimisation which allows a ready
+	and dequeued (i.e. copied to userspace) event to be moved back into
+	the set of interest for the given storage (socket, inode and so on).
+	It is very useful for cases when the same event should be used many
+	times (like reading from a pipe). It is similar to epoll()'s
+	EPOLLET flag.
+
+    KEVENT_REQ_LAST_CHECK
+	if set, allows performing the last check on a kevent (calling the
+	appropriate callback) when the kevent is marked as ready and has
+	been removed from the ready queue. If it is confirmed that the
+	kevent is ready (k->callbacks.callback(k) returns true) then the
+	kevent will be copied to userspace, otherwise it will be requeued
+	back to the storage. The second (checking) call is performed with
+	this bit cleared, so the callback can detect whether it was called
+	from kevent_storage_ready() - bit is set, or
+	kevent_dequeue_ready() - bit is cleared. If the kevent is requeued,
+	the bit is set again.
+
+    KEVENT_REQ_ALWAYS_QUEUE
+	If this flag is set, a kevent that is ready at enqueue time will be
+	queued into the ready queue; otherwise such a kevent is copied back
+	to userspace immediately and is not queued into the storage.
+
+__u32 ret_flags
+    Per-event return flags
+
+    KEVENT_RET_BROKEN
+	Kevent is broken
+
+    KEVENT_RET_DONE
+	Kevent processing was finished successfully
+
+    KEVENT_RET_COPY_FAILED
+	Kevent was not copied into the ring buffer due to some error
+	condition.
+
+__u32 ret_data
+    Event return data. The event originator fills it with anything it likes
+    (for example, timer notifications put there the number of milliseconds
+    elapsed when the timer fired).
+
+union { __u32 user[2]; void *ptr; }
+    User's data. It is not used by the kernel, just copied to/from user.
+    The whole structure is aligned to 8 bytes already, so the last union
+    is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Kevent waiting syscall flags.
+
+KEVENT_FLAGS_ABSTIME - the provided timespec parameter contains absolute time,
+	for example Aug 27, 2194, or time(NULL) + 10.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the
+passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed-in
+ukevent's ret_flags has KEVENT_RET_BROKEN set.
KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+    type - KEVENT_TIMER
+    event - KEVENT_TIMER_FIRED
+    req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+    id.raw[0] - number of seconds after commit when this timer should expire
+    id.raw[1] - number of nanoseconds in addition to the seconds

^ permalink raw reply related	[flat|nested] 200+ messages in thread
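For concreteness, the kevent_timer usage case above maps onto a ukevent like
this. The helper name and the chosen values are illustrative only, and
<linux/ukevent.h> refers to the header added by this patchset:

	/* Fill a ukevent requesting a one-shot timer firing after
	 * 2 s + 500000000 ns; a sketch following the usage case above. */
	#include <string.h>
	#include <linux/ukevent.h>	/* from this patchset, not mainline */

	static void fill_timer_request(struct ukevent *uk)
	{
		memset(uk, 0, sizeof(*uk));
		uk->type = KEVENT_TIMER;
		uk->event = KEVENT_TIMER_FIRED;
		uk->req_flags = KEVENT_REQ_ONESHOT;	/* fire once, then remove */
		uk->id.raw[0] = 2;			/* seconds */
		uk->id.raw[1] = 500000000;		/* plus nanoseconds */
		uk->user[0] = 42;			/* opaque cookie, copied back */
	}

The request would then be queued with kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk),
and uk.ret_flags checked for KEVENT_RET_BROKEN / KEVENT_RET_DONE on return.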
* [take26 2/8] kevent: Core files.
  2006-11-30 19:14   ` [take26 1/8] kevent: Description Evgeniy Polyakov
@ 2006-11-30 19:14     ` Evgeniy Polyakov
  2006-11-30 19:14       ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links
from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a6221c2 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,8 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
+	.long sys_kevent_commit
+	.long sys_kevent_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..dda2168 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,13 @@ ia32_sys_call_table:
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
+	.quad sys_kevent_commit
+	.quad sys_kevent_init
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..57a6b8c 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,15 @@
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
+#define __NR_kevent_commit	322
+#define __NR_kevent_init	323
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 324
 
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..17d750d 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,20 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_commit	283
+__SYSCALL(__NR_kevent_commit, sys_kevent_commit)
+#define __NR_kevent_init	284
+__SYSCALL(__NR_kevent_init, sys_kevent_init)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_init
 
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..3469435
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,238 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select storage has a list of wait_queue_t containers
+	 * for each ->poll() { poll_wait()' } here.
+	 */
+	void			*priv;
+};
+
+struct kevent_user
+{
+	struct rb_root		kevent_root;
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+	/* Exit from syscall if someone wants us to do it */
+	int			need_exit;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	/* Mutex protecting userspace ring buffer. */
+	struct mutex		ring_lock;
+	/* Kernel index and size of the userspace ring buffer. */
+	unsigned int		kidx, uidx, ring_size, ring_over, full;
+	/* Pointer to userspace ring buffer. */
+	struct kevent_ring __user	*pring;
+
+	/* Is used for absolute waiting times.
	 */
+	struct hrtimer		timer;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num, ring_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+void kevent_ready(struct kevent *k, int ret);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = u->ring_num = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n",
+			__func__, u, u->wait_num, u->ring_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_ring(struct kevent_user *u)
+{
+	u->ring_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)	({ (void) u;})
+#define kevent_stat_init(u)	({ (void) u;})
+#define kevent_stat_im(u)	({ (void) u;})
+#define kevent_stat_wait(u)	({ (void) u;})
+#define kevent_stat_ring(u)	({ (void) u;})
+#define kevent_stat_total(u)	({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk)	sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_POLL
+static inline void kevent_init_file(struct file *file)
+{
+	kevent_storage_init(file, &file->st);
+}
+
+static inline void kevent_cleanup_file(struct file *file)
+{
+	kevent_storage_fini(&file->st);
+}
+#else
+static inline void kevent_init_file(struct file *file) {}
+static inline void kevent_cleanup_file(struct file *file) {}
+#endif
+
+#ifdef CONFIG_KEVENT_PIPE
+extern void kevent_pipe_notify(struct inode *inode, u32 events);
+#else
+static inline void kevent_pipe_notify(struct inode *inode, u32 events) {}
+#endif
+
+#ifdef CONFIG_KEVENT_SIGNAL
+extern int kevent_signal_notify(struct task_struct *tsk, int sig);
+#else
+static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;}
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;	/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;		/* List of queued kevents. */
+	spinlock_t		lock;		/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..7574ec3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,8 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct ukevent;
+struct kevent_ring;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -599,4 +601,11 @@ asmlinkage long sys_set_robust_list(stru
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node,
 			   struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		struct timespec timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+		struct timespec timeout, unsigned int flags);
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);
+asmlinkage long sys_kevent_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num, unsigned int flags);
 
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..5201bc4
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,183 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+#include <linux/types.h>
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then remove it. */
+#define KEVENT_REQ_ONESHOT	0x1
+/* Kevent wakes up only first thread interested in given event,
+ * or all threads if this flag is set.
+ */
+#define KEVENT_REQ_WAKEUP_ALL	0x2
+/* Edge Triggered behaviour. */
+#define KEVENT_REQ_ET		0x4
+/* Perform the last check on kevent (call appropriate callback) when
+ * kevent is marked as ready and has been removed from ready queue.
+ * If it will be confirmed that kevent is ready
+ * (k->callbacks.callback(k) returns true) then kevent will be copied
+ * to userspace, otherwise it will be requeued back to storage.
+ * Second (checking) call is performed with this bit _cleared_ so
+ * callback can detect when it was called from
+ * kevent_storage_ready() - bit is set, or
+ * kevent_dequeue_ready() - bit is cleared.
+ * If kevent will be requeued, bit will be set again.
 */
+#define KEVENT_REQ_LAST_CHECK	0x8
+/*
+ * Always queue kevent even if it is immediately ready.
+ */
+#define KEVENT_REQ_ALWAYS_QUEUE	0x10
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED	0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_PIPE		6
+#define KEVENT_SIGNAL		7
+#define KEVENT_POSIX_TIMER	8
+#define KEVENT_MAX		9
+
+/*
+ * Per-type event sets.
+ * The number of per-event sets should exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO and PIPE events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/*
+ * Signal events.
+ */
+#define	KEVENT_SIGNAL_DELIVERY	0x1
+
+/* If set in raw64, then given signals will not be delivered
+ * in a usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK	0x8000000000000000ULL
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct kevent_ring
+{
+	unsigned int		ring_kidx, ring_over;
+	struct ukevent		event[0];
+};
+
+#define	KEVENT_CTL_ADD		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+#define	KEVENT_CTL_READY	3
+
+/* Provided timespec parameter uses absolute time, i.e. 'wait until Aug 27, 2194' */
+#define	KEVENT_FLAGS_ABSTIME	1
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..4b137ee
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,60 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes the total number of kevents, the number of
+	  kevents which are ready immediately at insertion time and the
+	  number of kevents which were removed through readiness completion.
+	  It will be printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through the KEVENT subsystem of
+	  socket operations, like new packet receiving conditions,
+	  ready-for-accept conditions and so on.
+
+config KEVENT_PIPE
+	bool "Kernel event notifications for pipes"
+	depends on KEVENT
+	help
+	  This option enables notifications through the KEVENT subsystem of
+	  pipe read/write operations.
+
+config KEVENT_SIGNAL
+	bool "Kernel event notifications for signals"
+	depends on KEVENT
+	help
+	  This option enables signal delivery through the KEVENT subsystem.
+	  Signals which were requested to be delivered through the kevent
+	  subsystem must be registered through the usual signal() and other
+	  syscalls; this option allows alternative delivery.
+	  With the KEVENT_SIGNAL_NOMASK flag set in a kevent for a set of
+	  signals, they will not be delivered in the usual way.
+	  Kevents for the appropriate signals are not copied when a process
+	  forks; the new process must add new kevents after fork(). The mask
+	  of signals is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..f98e0c8
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..b0adcdc
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,247 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into the appropriate origin's queue.
+ * Returns a positive value if this event is ready immediately,
+ * a negative value in case of error and zero if the event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before the event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() would fail
+ * to add this kevent into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX)) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	if (!kevent_registered_callbacks[k->event.type].callback) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break)) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when the reference counter for the given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way and must be
+ * called before the counter is decreased, so the storage itself is still
+ * valid. It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+void kevent_ready(struct kevent *k, int ret)
+{
+	unsigned long flags;
+	int rem;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Call kevent's ready callback and queue it into the ready queue if needed.
+ * If the kevent is marked as one-shot, remove it from the storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret;
+
+	ret = k->callbacks.callback(k);
+
+	kevent_ready(k, ret);
+
+	return ret;
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+	int wake_num = 0;
+
+	rcu_read_lock();
+	if (unlikely(ready_callback))
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry) {
+		if (event & k->event.event)
+			if ((k->event.req_flags & KEVENT_REQ_WAKEUP_ALL) || wake_num == 0)
+				if (__kevent_requeue(k, event))
+					wake_num++;
+	}
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (The socket is removed from the file table at this point, for example.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..3fc2daa
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1344 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static kmem_cache_t *kevent_cache __read_mostly;
+static kmem_cache_t *kevent_user_cache __read_mostly;
+
+static int kevent_debug_abstime;
+
+/*
+ * kevents are pollable; return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num || u->need_exit)
+		mask |= POLLIN | POLLRDNORM;
+	u->need_exit = 0;
+
+	return mask;
+}
+
+static inline unsigned int kevent_ring_space(struct kevent_user *u)
+{
+	if (u->full)
+		return 0;
+
+	return (u->uidx > u->kidx)?
+		(u->uidx - u->kidx):
+		(u->ring_size - (u->kidx - u->uidx));
+}
+
+static inline int kevent_ring_index_inc(unsigned int *pidx, unsigned int size)
+{
+	unsigned int idx = *pidx;
+
+	if (++idx >= size)
+		idx = 0;
+	*pidx = idx;
+	return (idx == 0);
+}
+
+/*
+ * Copies a kevent into the userspace ring buffer if it was initialized.
+ * Returns
+ *  0 on success or if the ring buffer is not used
+ *  -EAGAIN if there was no room for that kevent
+ *  -EFAULT if copy_to_user() failed.
+ *
+ * Must be called with kevent_user->ring_lock held.
+ */
+static int kevent_copy_ring_buffer(struct kevent *k)
+{
+	struct kevent_ring __user *ring;
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+	int err;
+
+	ring = u->pring;
+	if (!ring)
+		return 0;
+
+	if (!kevent_ring_space(u))
+		return -EAGAIN;
+
+	if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) {
+		err = -EFAULT;
+		goto err_out_exit;
+	}
+
+	kevent_ring_index_inc(&u->kidx, u->ring_size);
+
+	if (u->kidx == u->uidx)
+		u->full = 1;
+
+	if (put_user(u->kidx, &ring->ring_kidx)) {
+		err = -EFAULT;
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return err;
+}
+
+static struct kevent_user *kevent_user_alloc(struct kevent_ring __user *ring, unsigned int num)
+{
+	struct kevent_user *u;
+
+	u = kmem_cache_alloc(kevent_user_cache, GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+	u->need_exit = 0;
+
+	atomic_set(&u->refcnt, 1);
+
+	mutex_init(&u->ring_lock);
+	u->kidx = u->uidx = u->ring_over = u->full = 0;
+
+	u->pring = ring;
+	u->ring_size = num;
+
+	hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS);
+
+	return u;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file descriptor
+ * is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		hrtimer_cancel(&u->timer);
+		kmem_cache_free(kevent_user_cache, u);
+	}
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+	if (left->raw_u64 > right->raw_u64)
+		return -1;
+
+	if (right->raw_u64 > left->raw_u64)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued from all
+ * lists by that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called under u->ready_lock.
+ * This function unlinks kevent from the ready queue.
+ */
+static inline void kevent_unlink_ready(struct kevent *k)
+{
+	list_del(&k->ready_entry);
+	k->flags &= ~KEVENT_READY;
+	k->user->ready_num--;
+}
+
+static void kevent_remove_ready(struct kevent *k)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY)
+		kevent_unlink_ready(k);
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes the kevent from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	if (deq)
+		kevent_dequeue(k);
+
+	kevent_remove_ready(k);
+
+	kevent_user_put(k->user);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_node removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+static struct kevent *__kevent_dequeue_ready_one(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	if (u->ready_num) {
+		spin_lock_irqsave(&u->ready_lock, flags);
+		if (u->ready_num && !list_empty(&u->ready_list)) {
+			k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+			kevent_unlink_ready(k);
+		}
+		spin_unlock_irqrestore(&u->ready_lock, flags);
+	}
+
+	return k;
+}
+
+static struct kevent *kevent_dequeue_ready_one(struct kevent_user *u)
+{
+	struct kevent *k = NULL;
+
+	while (u->ready_num && !k) {
+		k = __kevent_dequeue_ready_one(u);
+
+		if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&k->ulock, flags);
+			k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
+			spin_unlock_irqrestore(&k->ulock, flags);
+
+			if (!k->callbacks.callback(k)) {
+				spin_lock_irqsave(&k->ulock, flags);
+				k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+				k->event.ret_flags = 0;
+				k->event.ret_data[0] = k->event.ret_data[1] = 0;
+				spin_unlock_irqrestore(&k->ulock, flags);
+				k = NULL;
+			}
+		} else
+			break;
+	}
+
+	return k;
+}
+
+static inline void kevent_copy_ring(struct kevent *k)
+{
+	unsigned long flags;
+
+	if (!k)
+		return;
+
+	if (kevent_copy_ring_buffer(k)) {
+		spin_lock_irqsave(&k->ulock, flags);
+		k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+		spin_unlock_irqrestore(&k->ulock, flags);
+	}
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kevent_dequeue_ready(struct kevent_user *u)
+{
+	struct kevent *k;
+
+	mutex_lock(&u->ring_lock);
+	k = kevent_dequeue_ready_one(u);
+	kevent_copy_ring(k);
+	mutex_unlock(&u->ring_lock);
+
+	return k;
+}
+
+/*
+ * Dequeue one entry from user's ready queue if there is space in the ring buffer.
+ */
+static struct kevent *kevent_dequeue_ready_ring(struct kevent_user *u)
+{
+	struct kevent *k = NULL;
+
+	mutex_lock(&u->ring_lock);
+	if (kevent_ring_space(u)) {
+		k = kevent_dequeue_ready_one(u);
+		kevent_copy_ring(k);
+	}
+	mutex_unlock(&u->ring_lock);
+
+	return k;
+}
+
+static void kevent_complete_ready(struct kevent *k)
+{
+	if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+		/*
+		 * If it is a one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		kevent_finish_user(k, 1);
+	else if (k->event.req_flags & KEVENT_REQ_ET) {
+		unsigned long flags;
+
+		/*
+		 * Edge-triggered behaviour: clear the event's return data
+		 * so it behaves as a fresh new one.
+ */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +static int kevent_mark_ready(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->st->lock); + kevent_ready(k, 1); + spin_unlock(&k->st->lock); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Mark appropriate kevents as ready. + * If number of events is zero just wake up one listener. 
+ */ +static int kevent_user_ctl_ready(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = -EINVAL, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + if (num > u->kevent_num) + return err; + + if (!num) { + u->need_exit = 1; + wake_up(&u->wait); + return 0; + } + + mutex_lock(&u->ctl_mutex); + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_mark_ready(&ukev[i], u); + if (err) { + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_mark_ready(&uk, u); + if (err) { + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = num - rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. + */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* Used to wakeup waiting syscalls in case high-resolution timer is used. */ +static int kevent_user_wake(struct hrtimer *timer) +{ + struct kevent_user *u = container_of(timer, struct kevent_user, timer); + + u->need_exit = 1; + wake_up(&u->wait); + + return HRTIMER_NORESTART; +} + + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, struct timespec timeout, + void __user *buf, unsigned int flags) +{ + struct kevent *k; + int num = 0; + long tm = MAX_SCHEDULE_TIMEOUT; + + if (!(file->f_flags & O_NONBLOCK)) { + if (!timespec_valid(&timeout)) + return -EINVAL; + + if (flags & KEVENT_FLAGS_ABSTIME) { + hrtimer_cancel(&u->timer); + hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS); + u->timer.expires = ktime_set(timeout.tv_sec, timeout.tv_nsec); + u->timer.function = &kevent_user_wake; + hrtimer_start(&u->timer, u->timer.expires, HRTIMER_ABS); + if (unlikely(kevent_debug_abstime == 0)) { + printk(KERN_INFO "kevent: author was wrong, " + "someone uses absolute time in %s, " + "please report to remove this warning.\n", __func__); + kevent_debug_abstime = 1; + } + } else { + tm = timespec_to_jiffies(&timeout); + } + + wait_event_interruptible_timeout(u->wait, + ((u->ready_num >= 1) && kevent_ring_space(u)) || u->need_exit, tm); + } + u->need_exit = 0; + + while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; + break; + } + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +struct file_operations kevent_user_fops = { + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; 
+ case KEVENT_CTL_READY: + err = kevent_user_ctl_ready(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - time to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - various flags (see include/linux/ukevent.h KEVENT_FLAGS_*). + */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + struct timespec timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf, flags); +out_fput: + fput(file); + return err; +} + +static struct vfsmount *kevent_mnt __read_mostly; + +static int kevent_get_sb(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data, struct vfsmount *mnt) +{ + return get_sb_pseudo(fs_type, "kevent", NULL, 0xaabbccdd, mnt); +} + +static struct file_system_type kevent_fs_type = { + .name = "keventfs", + .get_sb = kevent_get_sb, + .kill_sb = kill_anon_super, +}; + +static int keventfs_delete_dentry(struct dentry *dentry) +{ + return 1; +} + +static struct dentry_operations keventfs_dentry_operations = { + .d_delete = keventfs_delete_dentry, +}; + +asmlinkage long sys_kevent_init(struct kevent_ring __user *ring, unsigned int num, unsigned int flags) +{ + struct qstr this; + char name[32]; + struct dentry *dentry; + struct inode *inode; + struct file *file; + int err = -ENFILE, fd; + struct kevent_user *u; + + if ((ring && !num) || (!ring && num) || (num == 1)) + return -EINVAL; + + file = get_empty_filp(); + if (!file) + goto err_out_exit; + + inode = new_inode(kevent_mnt->mnt_sb); + if (!inode) + goto err_out_fput; + + inode->i_fop = &kevent_user_fops; + + inode->i_state = I_DIRTY; + inode->i_mode = S_IRUSR | S_IWUSR; + inode->i_uid = current->fsuid; + inode->i_gid = current->fsgid; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + + err = get_unused_fd(); + if (err < 0) + goto err_out_iput; + fd = err; + + err = -ENOMEM; + u = kevent_user_alloc(ring, num); + if (!u) + goto err_out_put_fd; + + sprintf(name, "[%lu]", inode->i_ino); + this.name = name; + this.len = strlen(name); + this.hash = inode->i_ino; + dentry = d_alloc(kevent_mnt->mnt_sb->s_root, &this); + if (!dentry) + goto err_out_free; + dentry->d_op = &keventfs_dentry_operations; + d_add(dentry, inode); + file->f_vfsmnt = mntget(kevent_mnt); + file->f_dentry = dentry; + file->f_mapping = inode->i_mapping; + file->f_pos = 0; + file->f_flags = O_RDONLY; + file->f_op = &kevent_user_fops; + file->f_mode = FMODE_READ; + file->f_version = 0; + file->private_data = u; + + fd_install(fd, file); + + return fd; + +err_out_free: + kmem_cache_free(kevent_user_cache, u); +err_out_put_fd: + put_unused_fd(fd); +err_out_iput: + iput(inode); +err_out_fput: + put_filp(file); +err_out_exit: + return err; +} + +/* + * Commits user's index (consumer index). + * Must be called under u->ring_lock mutex held. 
+ */ +static int __kevent_user_commit(struct kevent_user *u, unsigned int new_uidx, unsigned int over) +{ + int err = -EOVERFLOW, comm = 0; + struct kevent_ring __user *ring = u->pring; + + if (!ring) { + err = 0; + goto err_out_exit; + } + + if (new_uidx >= u->ring_size) { + err = -EINVAL; + goto err_out_exit; + } + + if ((over != u->ring_over - 1) && (over != u->ring_over)) + goto err_out_exit; + + if (u->uidx < u->kidx && new_uidx > u->kidx) { + err = -EINVAL; + goto err_out_exit; + } + + if (new_uidx > u->uidx) { + if (over != u->ring_over) + goto err_out_exit; + + comm = new_uidx - u->uidx; + u->uidx = new_uidx; + u->full = 0; + } else if (new_uidx < u->uidx) { + comm = u->ring_size - (u->uidx - new_uidx); + u->uidx = new_uidx; + u->full = 0; + u->ring_over++; + + if (put_user(u->ring_over, &ring->ring_over)) { + err = -EFAULT; + goto err_out_exit; + } + } + + return comm; + +err_out_exit: + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in the ring + * buffer, in that case some events will be copied there. + * Function returns number of actually copied ready events in ring buffer. + * After this function is completed userspace ring->ring_kidx will be updated. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @old_uidx - the last index user is aware of. + * @timeout - time to wait until there is free space in kevent queue. + * @flags - various flags (see include/linux/ukevent.h KEVENT_FLAGS_*). + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. + * Kevents will be copied into ring buffer in order they were placed into ready queue. + * One-shot kevents will be removed here, since there is no way they can be reused. + * Edge-triggered events will be requeued here for better performance. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx, + struct timespec timeout, unsigned int flags) +{ + int err = -EINVAL, copied = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + long tm = MAX_SCHEDULE_TIMEOUT; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num > u->ring_size) + goto out_fput; +#if 0 + /* + * Allow to immediately update ring index, but it is not supported, + * since syscall() has limited number of arguments which is actually + * a good idea - use kevent_commit() instead. 
+ */ + if ((u->uidx != new_uidx) && (new_uidx != 0xffffffff)) { + mutex_lock(&u->ring_lock); + __kevent_user_commit(u, new_uidx, over); + mutex_unlock(&u->ring_lock); + } +#endif + + if (!(file->f_flags & O_NONBLOCK)) { + if (!timespec_valid(&timeout)) + goto out_fput; + + if (flags & KEVENT_FLAGS_ABSTIME) { + hrtimer_cancel(&u->timer); + hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS); + u->timer.expires = ktime_set(timeout.tv_sec, timeout.tv_nsec); + u->timer.function = &kevent_user_wake; + hrtimer_start(&u->timer, u->timer.expires, HRTIMER_ABS); + if (unlikely(kevent_debug_abstime == 0)) { + printk(KERN_INFO "kevent: author was wrong, " + "someone uses absolute time in %s, " + "please report to remove this warning.\n", __func__); + kevent_debug_abstime = 1; + } + } else { + tm = timespec_to_jiffies(&timeout); + } + + wait_event_interruptible_timeout(u->wait, + ((u->ready_num >= 1) && kevent_ring_space(u)) || + u->need_exit || old_uidx != u->uidx, + tm); + } + u->need_exit = 0; + + for (i=0; i<num; ++i) { + k = kevent_dequeue_ready_ring(u); + if (!k) + break; + kevent_complete_ready(k); + + if (k->event.ret_flags & KEVENT_RET_COPY_FAILED) + break; + kevent_stat_ring(u); + copied++; + } + + fput(file); + + return copied; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to commit events in ring buffer, i.e. mark appropriate + * entries as unused by userspace so subsequent kevent_wait() could overwrite them. + * This fucntion returns actual number of kevents which were committed. + * After this function is completed userspace ring->ring_over can be updated. + * + * @ctl_fd - kevent file descriptor. + * @new_uidx - the last committed kevent. + * @over - number of overflows given queue had. + */ +asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over) +{ + int err = -EINVAL, comm = 0; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + err = __kevent_user_commit(u, new_uidx, over); + if (err < 0) + goto err_out_unlock; + comm = err; + mutex_unlock(&u->ring_lock); + + fput(file); + + return comm; + +err_out_unlock: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create caches and register + * filesystem to get control file descriptors from. 
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + kevent_user_cache = kmem_cache_create("kevent_user_cache", + sizeof(struct kevent_user), 0, SLAB_PANIC, NULL, NULL); + + err = register_filesystem(&kevent_fs_type); + if (err) + goto err_out_exit; + + kevent_mnt = kern_mount(&kevent_fs_type); + err = PTR_ERR(kevent_mnt); + if (IS_ERR(kevent_mnt)) + goto err_out_unreg; + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_unreg: + unregister_filesystem(&kevent_fs_type); +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..3b7d35f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,12 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_commit); +cond_syscall(sys_kevent_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
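The ring protocol above is easiest to see from the consumer side. Below is a minimal userspace sketch, not part of the patch: it assumes the series' include/linux/ukevent.h is visible to userspace (for struct ukevent, struct kevent_ring and the KEVENT_* constants), and it assumes thin kevent_wait()/kevent_commit() syscall wrappers of the kind the project's evtest.c example provides, since glibc knows nothing about the new syscalls. The kernel copies ready events into ring->event[] at its producer index (kidx, published as ring->ring_kidx); the consumer walks its own index (uidx) behind it and hands slots back with kevent_commit().

#include <linux/ukevent.h>	/* struct ukevent, struct kevent_ring, KEVENT_* */
#include <stdio.h>
#include <time.h>

/* Assumed thin wrappers around the new syscalls; a real program goes
 * through syscall(2) with the arch-specific __NR_kevent_* numbers. */
extern int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
		       struct timespec timeout, unsigned int flags);
extern int kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);

/* Consumer index into ring->event[]; trails the kernel's kidx. */
static unsigned int uidx;

int consume(int ctl_fd, struct kevent_ring *ring, unsigned int ring_size)
{
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	int i, num;

	/* Sleep until the kernel has copied ready events into the ring;
	 * the return value is how many it placed there. */
	num = kevent_wait(ctl_fd, ring_size, uidx, ts, 0);
	if (num <= 0)
		return num;

	for (i = 0; i < num; ++i) {
		struct ukevent *uk = &ring->event[uidx];

		if (!(uk->ret_flags & KEVENT_RET_BROKEN))
			printf("type %u fired, ret_data[0]=%u\n",
			       uk->type, uk->ret_data[0]);

		if (++uidx == ring_size)
			uidx = 0;	/* wrap exactly like the kernel's kidx */
	}

	/* Hand the consumed slots back.  The overflow count must match what
	 * the kernel last published, so reuse ring->ring_over here. */
	return kevent_commit(ctl_fd, uidx, ring->ring_over);
}

Because kevent_dequeue_ready_ring() refuses to copy an event when there is no ring space, a consumer that stops committing merely stalls the queue; ready events are not lost.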
* [take26 3/8] kevent: poll/select() notifications.
  2006-11-30 19:14 ` [take26 2/8] kevent: Core files Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck,
	linux-kernel, Jeff Garzik

poll/select() notifications.

This patch includes generic poll/select() notifications. kevent_poll works
similarly to epoll and has the same issues (the callback is invoked not from
the caller's internal state machine but through a process wakeup, there are
a lot of allocations, and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 	if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
 	struct mutex inotify_mutex;	/* protects the watches list */
 #endif
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage st;
+#endif
+
 	unsigned long i_state;
 	unsigned long dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ struct file {
 	struct list_head f_ep_links;
 	spinlock_t f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage st;
+#endif
 	struct address_space *f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + } else { + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + 
remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
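From userspace the poll notification is driven like this; a sketch under the same assumptions as above (the series' include/linux/ukevent.h plus a hypothetical kevent_ctl() syscall wrapper). kevent_poll_enqueue() resolves its target with fget(id.raw[0]), so the file descriptor goes into the id field and an ordinary poll mask into 'event':

#include <linux/ukevent.h>
#include <poll.h>
#include <string.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

int watch_fd_readable(int ctl_fd, int fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = fd;		/* which file to poll */
	uk.event = POLLIN | POLLRDNORM;	/* wake when readable */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* auto-remove after it fires */

	/* Returns > 0 if some events were ready immediately; per-event
	 * status comes back in ret_flags (KEVENT_RET_DONE/BROKEN). */
	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}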
* [take26 4/8] kevent: Socket notifications. 2006-11-30 19:14 ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Socket notifications. This patch includes socket send/recv/accept notifications. Using trivial web server based on kevent and this features instead of epoll it's performance increased more than noticebly. More details about various benchmarks and server itself (evserver_kevent.c) can be found on project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..2740617 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if 
(skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..9c24b5b --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,142 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + if (events & (POLLERR | POLLHUP)) + return -1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ void sock_init_data(struct socket *sock, sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/jhash.h> #include 
<linux/init.h> #include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ int tcp_v4_conn_request(struct sock *sk, reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
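To show the accept path these hooks enable, here is a sketch in the spirit of the evserver_kevent.c example referenced above, with the same header and wrapper assumptions as the earlier sketches: the listener arms a KEVENT_SOCKET_ACCEPT event once and then calls accept() only when the queue reports a pending connection.

#include <linux/ukevent.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Assumed wrappers, as before. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, struct timespec timeout,
			     struct ukevent *buf, unsigned int flags);

int accept_loop(int ctl_fd, int listen_fd)
{
	struct timespec ts = { .tv_sec = 5, .tv_nsec = 0 };
	struct ukevent uk[16];
	int i, num;

	memset(&uk[0], 0, sizeof(uk[0]));
	uk[0].type = KEVENT_SOCKET;
	uk[0].event = KEVENT_SOCKET_ACCEPT;
	uk[0].id.raw[0] = listen_fd;	/* sockfd_lookup() key in the kernel */

	if (kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, uk) < 0)
		return -1;

	for (;;) {
		num = kevent_get_events(ctl_fd, 1, 16, ts, uk, 0);
		if (num < 0)
			return num;

		for (i = 0; i < num; ++i) {
			if (uk[i].ret_flags & KEVENT_RET_BROKEN)
				continue;
			/* Readiness means a completed connection is queued,
			 * so this accept() never blocks. */
			accept(uk[i].id.raw[0], NULL, NULL);
		}
	}
}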
* [take26 5/8] kevent: Timer notifications. 2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage 
*st = k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
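From userspace, the period is carried in the id field, seconds in raw[0] and nanoseconds in raw[1], exactly as kevent_timer_enqueue() reads them; kevent_timer_callback() stamps ret_data[0] with jiffies_to_msecs(jiffies) at expiration. A 100ms periodic tick, under the same header and wrapper assumptions as the earlier sketches:

#include <linux/ukevent.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Assumed wrappers, as before. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, struct timespec timeout,
			     struct ukevent *buf, unsigned int flags);

int timer_tick(int ctl_fd)
{
	struct timespec ts = { .tv_sec = 10, .tv_nsec = 0 };
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_MASK_ALL;
	uk.id.raw[0] = 0;			/* period: seconds */
	uk.id.raw[1] = 100 * 1000 * 1000;	/* period: nanoseconds */

	if (kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk) < 0)
		return -1;

	for (;;) {
		/* Not one-shot: the hrtimer re-forwards itself, so the
		 * event fires again on every period. */
		if (kevent_get_events(ctl_fd, 1, 1, ts, &uk, 0) < 1)
			break;
		printf("tick at %u ms\n", uk.ret_data[0]);
	}

	return 0;
}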
* [take26 6/8] kevent: Pipe notifications. 2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..d529fa9 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,121 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = -EINVAL; + if (!S_ISFIFO(inode->i_mode)) + goto err_out_iput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
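Userspace view of the above, as a sketch with the usual header and wrapper assumptions. Note the event bits: the pipe notifier deliberately reuses KEVENT_SOCKET_RECV/SEND rather than introducing pipe-specific masks, so "readable" (nrbufs > 0) and "writable" (free buffer slots) are expressed with the socket constants.

#include <linux/ukevent.h>
#include <string.h>
#include <unistd.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

int watch_pipe(int ctl_fd)
{
	struct ukevent uk;
	int fds[2];

	if (pipe(fds))
		return -1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_PIPE;
	uk.event = KEVENT_SOCKET_RECV;	/* readable: pipe->nrbufs > 0 */
	uk.id.raw[0] = fds[0];		/* fget() key in kevent_pipe_enqueue() */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}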
* [take26 7/8] kevent: Signal notifications.
  2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14 ` [take26 8/8] kevent: Kevent posix timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck,
	linux-kernel, Jeff Garzik

Signal notifications.

This type of notification allows signals to be delivered through the kevent
queue. An example application, signal.c, can be found on the project homepage.

If the KEVENT_SIGNAL_NOMASK bit is set in the raw_u64 id, the signal is
delivered only through the queue; otherwise both delivery types are used -
the old one, through an update of the mask of pending signals, and the queue.
If a signal is delivered only through the kevent queue, the mask of pending
signals is not updated at all, which is equivalent to putting the signal into
the blocked mask, except that the signal is still delivered through the
kevent queue.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc4a987..ef38a3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include <linux/resource.h>
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/processor.h>
@@ -1013,6 +1014,10 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+#ifdef CONFIG_KEVENT_SIGNAL
+	struct kevent_storage st;
+	u32 kevent_signals;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index 1c999f3..e5b5b14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -46,6 +46,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/kevent.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_fini(&tsk->st);
+#endif
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
@@ -1121,6 +1125,10 @@ static struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespace;
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_init(p, &p->st);
+#endif
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 0000000..0edd2e4
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,92 @@
+/*
+ * kevent_signal.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k->st->origin; + int sig = k->event.id.raw[0]; + int ret = 0; + + if (sig == tsk->kevent_signals) + ret = 1; + + if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK)) + tsk->kevent_signals |= 0x80000000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(¤t->st, k); + if (err) + goto err_out_exit; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_exit: + return err; +} + +int kevent_signal_dequeue(struct kevent *k) +{ + kevent_storage_dequeue(k->st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk->kevent_signals = sig; + kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk->kevent_signals & 0x80000000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_signal_callback, + .enqueue = &kevent_signal_enqueue, + .dequeue = &kevent_signal_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SIGNAL); +} +module_init(kevent_init_signal); diff --git a/kernel/signal.c b/kernel/signal.c index fb5da6d..d3d3594 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -23,6 +23,7 @@ #include <linux/ptrace.h> #include <linux/signal.h> #include <linux/capability.h> +#include <linux/kevent.h> #include <asm/param.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -703,6 +704,9 @@ static int send_signal(int sig, struct s { struct sigqueue * q = NULL; int ret = 0; + + if (kevent_signal_notify(t, sig)) + return 1; /* * fast-pathed signals for kernel-internal things like SIGSTOP @@ -782,6 +786,17 @@ specific_send_sig_info(int sig, struct s ret = send_signal(sig, info, t, &t->pending); if (!ret && !sigismember(&t->blocked, sig)) signal_wake_up(t, sig == SIGKILL); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif out: return ret; } @@ -971,6 +986,17 @@ __group_send_sig_info(int sig, struct si * to avoid several races. */ ret = send_signal(sig, info, p, &p->signal->shared_pending); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif if (unlikely(ret)) return ret; ^ permalink raw reply related [flat|nested] 200+ messages in thread
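A userspace sketch of queue-only signal delivery, under the usual header and wrapper assumptions. kevent_signal_callback() compares the signal number against id.raw[0], and the NOMASK test is done on the whole raw_u64, so this sketch additionally assumes the KEVENT_SIGNAL_NOMASK flag occupies bits outside the raw[0] word.

#include <linux/ukevent.h>
#include <signal.h>
#include <string.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

/* Route SIGUSR1 exclusively through the kevent queue: with
 * KEVENT_SIGNAL_NOMASK set, send_signal() short-circuits and the
 * pending-signal mask is never touched. */
int queue_sigusr1(int ctl_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SIGNAL;
	uk.event = KEVENT_SIGNAL_DELIVERY;
	uk.id.raw[0] = SIGUSR1;		/* signal number checked by the callback */
	uk.id.raw_u64 |= KEVENT_SIGNAL_NOMASK;	/* suppress normal delivery */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}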
* [take26 8/8] kevent: Kevent posix timer notifications.
  2006-11-30 19:14             ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov
@ 2006-11-30 19:14               ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Kevent posix timer notifications.

A simple extension to POSIX timers that allows the timer expiration
notification to be delivered through the kevent queue. The example
application posix_timer.c can be found in the archive on the project
homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE	1	/* other notification: meaningless */
 #define SIGEV_THREAD	2	/* deliver via thread creation */
 #define SIGEV_THREAD_ID 4	/* deliver to thread */
+#define SIGEV_KEVENT	8	/* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
 			void (*_function)(sigval_t);
 			void *_attribute;	/* really pthread_attr_t */
 		} _sigev_thread;
+
+		int kevent_fd;
 	} _sigev_un;
 } sigevent_t;
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/kevent_storage.h>
 
 union cpu_time_count {
 	cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
 	sigval_t it_sigev_value;	/* value word of sigevent struct */
 	struct task_struct *it_process;	/* process to send signal to */
 	struct sigqueue *sigq;		/* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+	struct kevent_storage st;
+#endif
 	union {
 		struct {
 			struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index e5ebcc1..8d0e7a3 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>
 
 /*
  * Management arrays for POSIX timers. Timers are kept in slab memory
@@ -224,6 +226,99 @@ static int posix_ktime_get_ts(clockid_t
 	return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+	/*
+	 * It is not ugly: there is no pointer member in the id field union,
+	 * but its size is 64 bits, which is enough for any known pointer size.
+	 */
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	kevent_storage_dequeue(&tmr->st, k);
+	return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+	return 1;
+}
+static int posix_kevent_init(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &posix_kevent_callback,
+		.enqueue = &posix_kevent_enqueue,
+		.dequeue = &posix_kevent_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	struct ukevent uk;
+	struct file *file;
+	struct kevent_user *u;
+	int err;
+
+	file = fget(fd);
+	if (!file) {
+		err = -EBADF;
+		goto err_out;
+	}
+
+	if (file->f_op != &kevent_user_fops) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	u = file->private_data;
+
+	memset(&uk, 0, sizeof(struct ukevent));
+
+	uk.event = KEVENT_MASK_ALL;
+	uk.type = KEVENT_POSIX_TIMER;
+	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+	uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+	err = kevent_user_add_ukevent(&uk, u);
+	if (err)
+		goto err_out_fput;
+
+	fput(file);
+
+	return 0;
+
+err_out_fput:
+	fput(file);
+err_out:
+	return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+	kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+	return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +336,11 @@ static __init int init_posix_timers(void
 	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
 	register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);
 
+	if (posix_kevent_init()) {
+		printk(KERN_ERR "Failed to initialize kevent posix timers.\n");
+		BUG();
+	}
+
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
 					sizeof (struct k_itimer), 0, 0, NULL, NULL);
 	idr_init(&posix_timers_id);
@@ -343,23 +443,27 @@ static int posix_timer_fn(struct hrtimer
 	timr = container_of(timer, struct k_itimer, it.real.timer);
 	spin_lock_irqsave(&timr->it_lock, flags);
 
+	if (timr->it_sigev_notify == SIGEV_KEVENT) {
+		kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL);
+	} else {
+		if (timr->it.real.interval.tv64 != 0)
+			si_private = ++timr->it_requeue_pending;
-	if (timr->it.real.interval.tv64 != 0)
-		si_private = ++timr->it_requeue_pending;
-
-	if (posix_timer_event(timr, si_private)) {
-		/*
-		 * signal was not sent because of sig_ignor
-		 * we will not get a call back to restart it AND
-		 * it should be restarted.
-		 */
-		if (timr->it.real.interval.tv64 != 0) {
-			timr->it_overrun +=
-				hrtimer_forward(timer,
-					timer->base->softirq_time,
-					timr->it.real.interval);
-			ret = HRTIMER_RESTART;
-			++timr->it_requeue_pending;
+		if (posix_timer_event(timr, si_private)) {
+			/*
+			 * signal was not sent because of sig_ignor
+			 * we will not get a call back to restart it AND
+			 * it should be restarted.
+			 */
+			if (timr->it.real.interval.tv64 != 0) {
+				timr->it_overrun +=
+					hrtimer_forward(timer,
+						timer->base->softirq_time,
+						timr->it.real.interval);
+				ret = HRTIMER_RESTART;
+				++timr->it_requeue_pending;
+			}
 		}
 	}
 
@@ -407,6 +511,9 @@ static struct k_itimer * alloc_posix_tim
 		kmem_cache_free(posix_timers_cache, tmr);
 		tmr = NULL;
 	}
+#ifdef CONFIG_KEVENT_TIMER
+	kevent_storage_init(tmr, &tmr->st);
+#endif
 	return tmr;
 }
 
@@ -424,6 +531,7 @@ static void release_posix_timer(struct k
 	if (unlikely(tmr->it_process) &&
 	    tmr->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
 		put_task_struct(tmr->it_process);
+	posix_kevent_fini_timer(tmr);
 
 	kmem_cache_free(posix_timers_cache, tmr);
 }
@@ -496,40 +604,52 @@ sys_timer_create(const clockid_t which_c
 		new_timer->it_sigev_signo = event.sigev_signo;
 		new_timer->it_sigev_value = event.sigev_value;
 
-		read_lock(&tasklist_lock);
-		if ((process = good_sigevent(&event))) {
-			/*
-			 * We may be setting up this process for another
-			 * thread.  It may be exiting.  To catch this
-			 * case the we check the PF_EXITING flag.  If
-			 * the flag is not set, the siglock will catch
-			 * him before it is too late (in exit_itimers).
-			 *
-			 * The exec case is a bit more invloved but easy
-			 * to code.  If the process is in our thread
-			 * group (and it must be or we would not allow
-			 * it here) and is doing an exec, it will cause
-			 * us to be killed.  In this case it will wait
-			 * for us to die which means we can finish this
-			 * linkage with our last gasp. I.e. no code :)
-			 */
+		if (event.sigev_notify == SIGEV_KEVENT) {
+			error = posix_kevent_init_timer(new_timer, event._sigev_un.kevent_fd);
+			if (error)
+				goto out;
+
+			process = current->group_leader;
 			spin_lock_irqsave(&process->sighand->siglock, flags);
-			if (!(process->flags & PF_EXITING)) {
-				new_timer->it_process = process;
-				list_add(&new_timer->list,
-					 &process->signal->posix_timers);
-				spin_unlock_irqrestore(&process->sighand->siglock, flags);
-				if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
-					get_task_struct(process);
-			} else {
-				spin_unlock_irqrestore(&process->sighand->siglock, flags);
-				process = NULL;
+			new_timer->it_process = process;
+			list_add(&new_timer->list, &process->signal->posix_timers);
+			spin_unlock_irqrestore(&process->sighand->siglock, flags);
+		} else {
+			read_lock(&tasklist_lock);
+			if ((process = good_sigevent(&event))) {
+				/*
+				 * We may be setting up this process for another
+				 * thread.  It may be exiting.  To catch this
+				 * case the we check the PF_EXITING flag.  If
+				 * the flag is not set, the siglock will catch
+				 * him before it is too late (in exit_itimers).
+				 *
+				 * The exec case is a bit more invloved but easy
+				 * to code.  If the process is in our thread
+				 * group (and it must be or we would not allow
+				 * it here) and is doing an exec, it will cause
+				 * us to be killed.  In this case it will wait
+				 * for us to die which means we can finish this
+				 * linkage with our last gasp. I.e. no code :)
+				 */
+				spin_lock_irqsave(&process->sighand->siglock, flags);
+				if (!(process->flags & PF_EXITING)) {
+					new_timer->it_process = process;
+					list_add(&new_timer->list,
						 &process->signal->posix_timers);
+					spin_unlock_irqrestore(&process->sighand->siglock, flags);
+					if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
+						get_task_struct(process);
+				} else {
+					spin_unlock_irqrestore(&process->sighand->siglock, flags);
+					process = NULL;
+				}
+			}
+			read_unlock(&tasklist_lock);
+			if (!process) {
+				error = -EINVAL;
+				goto out;
+			}
 		}
-		}
-		read_unlock(&tasklist_lock);
-		if (!process) {
-			error = -EINVAL;
-			goto out;
-		}
 	} else {
 		new_timer->it_sigev_notify = SIGEV_SIGNAL;

^ permalink raw reply related	[flat|nested] 200+ messages in thread
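For illustration, a minimal userspace sketch of SIGEV_KEVENT usage, mirroring
the sys_timer_create() path above: the timer expiration is queued on an
existing kevent descriptor instead of raising a signal. SIGEV_KEVENT and the
kevent_fd member exist only in the patched headers, and the /dev/kevent node
name is an assumption; link with -lrt:

	#include <fcntl.h>
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	int main(void)
	{
		struct sigevent sev;
		struct itimerspec its = {
			.it_value    = { .tv_sec = 1 },
			.it_interval = { .tv_sec = 1 },
		};
		timer_t tid;
		int kfd = open("/dev/kevent", O_RDWR);	/* queue that receives expirations (assumed node) */

		if (kfd < 0) {
			perror("open");
			return 1;
		}

		memset(&sev, 0, sizeof(sev));
		sev.sigev_notify = SIGEV_KEVENT;	/* new notify type from this patch */
		sev._sigev_un.kevent_fd = kfd;		/* checked against kevent_user_fops in-kernel */
		sev.sigev_value.sival_ptr = &tid;	/* comes back in the ready ukevent's ptr field */

		if (timer_create(CLOCK_REALTIME, &sev, &tid)) {
			perror("timer_create");
			return 1;
		}
		timer_settime(tid, 0, &its, NULL);

		/* Each expiration now shows up as a KEVENT_POSIX_TIMER ukevent on
		 * kfd (registered with KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE),
		 * read via kevent_get_events()/kevent_wait(). */
		return 0;
	}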
end of thread, other threads:[~2006-12-28 9:52 UTC | newest]
Thread overview: 200+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1154985aa0591036@2ka.mipt.ru>
2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov
2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-10-27 16:10 ` [take21 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet
2006-10-28 10:08 ` Evgeniy Polyakov
2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet
2006-10-28 10:53 ` Evgeniy Polyakov
2006-10-28 12:36 ` Eric Dumazet
2006-10-28 13:03 ` Evgeniy Polyakov
2006-10-28 13:23 ` Eric Dumazet
2006-10-28 13:28 ` Evgeniy Polyakov
2006-10-28 13:34 ` Eric Dumazet
2006-10-28 13:47 ` Evgeniy Polyakov
2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-11-07 11:26 ` Jeff Garzik
2006-11-07 11:46 ` Jeff Garzik
2006-11-07 11:58 ` Evgeniy Polyakov
2006-11-07 11:51 ` Evgeniy Polyakov
2006-11-07 12:17 ` Jeff Garzik
2006-11-07 12:29 ` Evgeniy Polyakov
2006-11-07 12:32 ` Jeff Garzik
2006-11-07 19:34 ` Andrew Morton
2006-11-07 20:52 ` David Miller
2006-11-07 21:38 ` Andrew Morton
2006-11-01 11:36 ` [take22 " Evgeniy Polyakov
2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov
2006-11-01 11:36 ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-01 11:36 ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-11-01 11:36 ` [take22 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek
2006-11-01 13:25 ` Evgeniy Polyakov
2006-11-01 16:05 ` Pavel Machek
2006-11-01 16:24 ` Evgeniy Polyakov
2006-11-01 18:13 ` Oleg Verych
2006-11-01 18:57 ` Evgeniy Polyakov
2006-11-02 2:12 ` Nate Diller
2006-11-02 6:21 ` Evgeniy Polyakov
2006-11-02 19:40 ` Nate Diller
2006-11-03 8:42 ` Evgeniy Polyakov
2006-11-03 8:57 ` Pavel Machek
2006-11-03 9:04 ` David Miller
2006-11-07 12:05 ` Jeff Garzik
2006-11-03 9:13 ` Evgeniy Polyakov
2006-11-05 11:19 ` Pavel Machek
2006-11-05 11:43 ` Evgeniy Polyakov
[not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com>
[not found] ` <aaf959cb0611011830j1ca3e469tc4a6af3a2a010fa@mail.gmail.com>
[not found] ` <4549A261.9010007@cosmosbay.com>
2006-11-03 2:42 ` zhou drangon
2006-11-03 9:16 ` Evgeniy Polyakov
2006-11-07 12:02 ` Jeff Garzik
2006-11-03 18:49 ` Oleg Verych
2006-11-04 10:24 ` Evgeniy Polyakov
2006-11-04 17:47 ` Evgeniy Polyakov
2006-11-01 16:07 ` James Morris
2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov
2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov
2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov
2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov
2006-11-07 16:50 ` [take23 5/5] kevent: Timer notifications Evgeniy Polyakov
2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi
2006-11-08 8:45 ` Evgeniy Polyakov
2006-11-08 17:03 ` Evgeniy Polyakov
2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton
2006-11-08 8:24 ` Evgeniy Polyakov
2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton
2006-11-08 8:23 ` Evgeniy Polyakov
2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton
2006-11-08 8:21 ` Evgeniy Polyakov
2006-11-08 14:51 ` Eric Dumazet
2006-11-08 22:03 ` Andrew Morton
2006-11-08 22:44 ` Davide Libenzi
2006-11-08 23:07 ` Eric Dumazet
2006-11-08 23:56 ` Davide Libenzi
2006-11-09 7:24 ` Eric Dumazet
2006-11-09 7:52 ` Eric Dumazet
2006-11-09 17:12 ` Davide Libenzi
2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov
2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov
2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov
2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 6/6] kevent: Pipe notifications Evgeniy Polyakov
2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
2006-11-09 9:29 ` Evgeniy Polyakov
2006-11-09 18:51 ` Davide Libenzi
2006-11-09 19:10 ` Evgeniy Polyakov
2006-11-09 19:42 ` Davide Libenzi
2006-11-09 20:10 ` Davide Libenzi
2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov
2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
2006-11-13 10:54 ` Evgeniy Polyakov
2006-11-13 11:16 ` Evgeniy Polyakov
2006-11-20 0:02 ` Ulrich Drepper
2006-11-20 8:25 ` Evgeniy Polyakov
2006-11-20 8:43 ` Andrew Morton
2006-11-20 8:51 ` Evgeniy Polyakov
2006-11-20 9:15 ` Andrew Morton
2006-11-20 9:19 ` Evgeniy Polyakov
2006-11-20 20:29 ` Ulrich Drepper
2006-11-20 21:46 ` Jeff Garzik
2006-11-20 21:52 ` Ulrich Drepper
2006-11-21 9:09 ` Ingo Oeser
2006-11-22 11:38 ` Michael Tokarev
2006-11-22 11:47 ` Evgeniy Polyakov
2006-11-22 12:33 ` Jeff Garzik
2006-11-21 9:53 ` Evgeniy Polyakov
2006-11-21 16:58 ` Ulrich Drepper
2006-11-21 17:43 ` Evgeniy Polyakov
2006-11-21 18:46 ` Evgeniy Polyakov
2006-11-21 20:01 ` Jeff Garzik
2006-11-22 10:41 ` Evgeniy Polyakov
2006-11-21 20:19 ` Jeff Garzik
2006-11-22 10:39 ` Evgeniy Polyakov
2006-11-22 7:38 ` Ulrich Drepper
2006-11-22 10:44 ` Evgeniy Polyakov
2006-11-22 21:02 ` Ulrich Drepper
2006-11-23 12:23 ` Evgeniy Polyakov
2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov
2006-11-23 20:26 ` Ulrich Drepper
2006-11-24 9:50 ` Evgeniy Polyakov
2006-11-27 18:20 ` Ulrich Drepper
2006-11-27 18:24 ` David Miller
2006-11-27 18:36 ` Ulrich Drepper
2006-11-27 18:49 ` David Miller
2006-11-28 9:16 ` Evgeniy Polyakov
2006-11-28 19:13 ` David Miller
2006-11-28 19:22 ` Evgeniy Polyakov
2006-12-12 1:36 ` David Miller
2006-12-12 5:31 ` Evgeniy Polyakov
2006-11-28 9:16 ` Evgeniy Polyakov
2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
2006-11-22 10:38 ` Evgeniy Polyakov
2006-11-22 22:22 ` Ulrich Drepper
2006-11-23 12:18 ` Evgeniy Polyakov
2006-11-23 22:23 ` Ulrich Drepper
2006-11-24 10:57 ` Evgeniy Polyakov
2006-11-27 19:12 ` Ulrich Drepper
2006-11-28 11:00 ` Evgeniy Polyakov
2006-11-22 12:09 ` Evgeniy Polyakov
2006-11-22 12:15 ` Evgeniy Polyakov
2006-11-22 13:46 ` Evgeniy Polyakov
2006-11-22 22:24 ` Ulrich Drepper
2006-11-23 12:22 ` Evgeniy Polyakov
2006-11-23 20:34 ` Ulrich Drepper
2006-11-24 10:58 ` Evgeniy Polyakov
2006-11-27 18:23 ` Ulrich Drepper
2006-11-28 10:13 ` Evgeniy Polyakov
2006-12-27 20:45 ` Ulrich Drepper
2006-12-28 9:50 ` Evgeniy Polyakov
2006-11-21 16:29 ` [take25 " Evgeniy Polyakov
2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov
2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov
2006-11-21 16:29 ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov
2006-11-22 11:20 ` Eric Dumazet
2006-11-22 11:30 ` Evgeniy Polyakov
2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper
2006-11-23 11:52 ` Evgeniy Polyakov
2006-11-23 19:45 ` Ulrich Drepper
2006-11-24 11:01 ` Evgeniy Polyakov
2006-11-24 16:06 ` Ulrich Drepper
2006-11-24 16:14 ` Evgeniy Polyakov
2006-11-24 16:31 ` Evgeniy Polyakov
2006-11-27 19:20 ` Ulrich Drepper
2006-11-22 23:52 ` Ulrich Drepper
2006-11-23 11:55 ` Evgeniy Polyakov
2006-11-23 20:00 ` Ulrich Drepper
2006-11-23 21:49 ` Hans Henrik Happe
2006-11-23 22:34 ` Ulrich Drepper
2006-11-24 11:50 ` Evgeniy Polyakov
2006-11-24 16:17 ` Ulrich Drepper
2006-11-24 11:46 ` Evgeniy Polyakov
2006-11-24 16:30 ` Ulrich Drepper
2006-11-24 16:49 ` Evgeniy Polyakov
2006-11-27 19:23 ` Ulrich Drepper
2006-11-23 22:33 ` Ulrich Drepper
2006-11-23 22:48 ` Jeff Garzik
2006-11-23 23:45 ` Ulrich Drepper
2006-11-24 0:48 ` Eric Dumazet
2006-11-24 8:14 ` Andrew Morton
2006-11-24 8:33 ` Eric Dumazet
2006-11-24 15:26 ` Ulrich Drepper
2006-11-24 0:14 ` Hans Henrik Happe
2006-11-24 12:05 ` Evgeniy Polyakov
2006-11-24 12:13 ` Evgeniy Polyakov
2006-11-27 19:43 ` Ulrich Drepper
2006-11-28 10:26 ` Evgeniy Polyakov
2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-11-30 19:14 ` [take26 1/8] kevent: Description Evgeniy Polyakov
2006-11-30 19:14 ` [take26 2/8] kevent: Core files Evgeniy Polyakov
2006-11-30 19:14 ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 8/8] kevent: Kevent posix timer notifications Evgeniy Polyakov