* [take21 0/4] kevent: Generic event handling mechanism.
       [not found] <1154985aa0591036@2ka.mipt.ru>
@ 2006-10-27 16:10 ` Evgeniy Polyakov
  2006-10-27 16:10   ` [take21 1/4] kevent: Core files Evgeniy Polyakov
                      ` (2 more replies)
  2006-11-01 11:36   ` [take22 " Evgeniy Polyakov
                      ` (4 subsequent siblings)
  5 siblings, 3 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on the possible number of kevents

With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s over a
100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which is
quite close to wire speed if we take headers and the like into account.

Changes from 'take19' and 'take18' patchsets:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table. At least for a web server, the
   frequency of addition/deletion of new kevents is comparable with the
   number of search accesses, i.e. most of the time events are added,
   accessed only a couple of times and then removed, which justifies RB
   tree usage over AVL tree, since the latter has much slower deletion
   time (max O(log(N)) compared to 3 ops), although faster search time
   (1.44*log(N) vs. 2*log(N)). So for kevents I use an RB tree for now,
   and later, when my AVL tree implementation is ready, it will be
   possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380
req/second compared to 2200, sometimes 2500 req/second for epoll() with
a trivial web server and an httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum number
of kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
   calculation
 * export kevent_socket_notify(), since it is used in network protocols
   which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer
   API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has
   been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for
   testing)

Changes from 'take14' patchset:
 * added kevent_wait()
   This syscall waits until either the timeout expires or at least one
   event becomes ready. It also commits that @num events starting from
   @start have been processed by userspace and thus can be removed or
   rearmed (depending on their flags). It can be used to commit events
   read by userspace through the mmap interface. Example userspace code
   (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer
   if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop, which should save
   us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed over-80-column comment issues
 * added a header shared between userspace and kernelspace instead of
   embedding everything in one
 * core restructuring to remove forward declarations
 * some whitespace and coding style cleanups
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
   - use nopage() method to dynamically substitute pages
   - allocate a new page for events only when a newly added kevent
     requires it
   - do not use ugly index dereferencing, use structure instead
   - reduced amount of data in the ring (id and flags), maximum 12 pages
     on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect that an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is
   not turned on
 * do not use internal socket structures, use appropriate (exported)
   wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comments fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the
   same function, so lockdep was taught to differentiate between the
   various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in mmapped buffer implementation which would end up in
   wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback
   initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead
   of per-bucket locks
 * do not use kevent_user_ctl structure; instead provide needed arguments
   as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at helping when a lot of kevents are being
   copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr,
   unsigned int max_nr, unsigned int timeout, void __user *buf,
   unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and
   initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor
   does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
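[Editorial illustration] For readers new to the interface, the flow
described in the changelog boils down to three steps: open the control
descriptor, add ukevents with kevent_ctl(), then collect ready ones with
kevent_get_events(). The sketch below is an illustration only, not part
of the patchset: the /dev/kevent node name, the socket descriptor value
and the syscall numbers (taken from the i386 table in patch 1/4) are
assumptions; evtest.c on the project homepage is the authoritative
example.

/* kevent-sketch.c: minimal, hedged usage example */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/ukevent.h>

#define __NR_kevent_get_events	319	/* assumption: i386 numbers */
#define __NR_kevent_ctl		320

int main(void)
{
	struct ukevent uk, ready;
	int kfd, err, sock = 3;		/* assumption: an already connected socket */

	kfd = open("/dev/kevent", O_RDWR);	/* control descriptor from the miscdevice */
	if (kfd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = sock;			/* id: the socket descriptor */
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_RECV;
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* dequeue after first firing */

	/* returns the number of immediately ready/broken events copied back */
	err = syscall(__NR_kevent_ctl, kfd, KEVENT_CTL_ADD, 1, &uk);
	if (err < 0)
		return 1;
	if (err > 0)	/* fired (or failed) immediately, result is already in uk */
		printf("immediate ret_flags: 0x%x\n", uk.ret_flags);

	/* wait for at least 1 and at most 1 event; timeout is in nanoseconds */
	err = syscall(__NR_kevent_get_events, kfd, 1, 1, 1000000000ULL, &ready, 0);
	if (err > 0)
		printf("type: %u, event: 0x%x, ret_flags: 0x%x\n",
				ready.type, ready.event, ready.ret_flags);
	return 0;
}

Note that KEVENT_CTL_ADD copies failed or immediately ready events back
into the caller's buffer, so the buffer must remain writable.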
* [take21 1/4] kevent: Core files. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet 2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-11-07 11:26 ` Jeff Garzik 2 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..a9560eb 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,6 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..cf18955 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,11 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..f009677 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,13 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 322 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..c53d156 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,16 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_wait #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..125414c --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,205 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* First kevent which was not put into ring buffer due to overflow. + * It will be copied into the buffer, when first event will be removed + * from ready queue (and thus there will be an empty place in the + * ring buffer). 
+ */ + struct kevent *overflow_kevent; + /* Array of pages forming mapped ring buffer */ + struct kevent_mring **pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, mmap_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +int kevent_user_ring_add_event(struct kevent *k); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->mmap_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_mmap(struct kevent_user *u) +{ + u->mmap_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_mmap(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_KEVENT_SOCKET +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..71a758f 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,7 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..daa8202 --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,163 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then dequeue. */ +#define KEVENT_REQ_ONESHOT 0x1 + +/* + * Kevent return flags. + */ +/* Kevent is broken. */ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. 
*/
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_PAGES	2
+
+/*
+ * Note that mukevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 8 bytes at the beginning of the page to store the indexes.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		kidx, uidx;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only for sanitizing kevent_wait() input data - do not allow the
+ * user to specify more events than it is possible to place into the
+ * ring buffer. This does not limit the number of events which can be
+ * put into the kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS	(KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes the total number of kevents, the number of
+	  kevents which are ready immediately at insertion time and the number
+	  of kevents which were removed through readiness completion.
+	  It will be printed each time the control kevent descriptor is closed.
+ +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..9130cad --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,4 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..25404d3 --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,227 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? 
cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before an event is added into some origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, and kevent_enqueue() will fail
+ * to add this kevent into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from the ->enqueue() callback when the reference counter for the
+ * given origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue a kevent from the origin's queue.
+ * It does not decrease the origin's reference counter in any way
+ * and must be called before it, so the storage itself must be valid.
+ * It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call the kevent ready callback and queue the kevent into the ready queue
+ * if needed. If the kevent is marked as one-shot, remove it from the
+ * storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if the kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in the origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (The socket is removed from the file table at this point, for example.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..e92a1dc
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1000 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable; return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	if ((ring->kidx + 1 == ring->uidx) ||
+			((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+		if (k->user->overflow_kevent == NULL)
+			k->user->overflow_kevent = k;
+		return -EAGAIN;
+	}
+
+	pidx = ring->kidx/KEVENTS_ON_PAGE;
+	off = ring->kidx%KEVENTS_ON_PAGE;
+
+	if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+		printk(KERN_ERR "%s: kidx: %u, uidx: %u, on_page: %lu, pidx: %u.\n",
+				__func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+		return -EINVAL;
+	}
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->kidx >= KEVENT_MAX_EVENTS)
+		ring->kidx = 0;
+
+	return 0;
+}
+
+/*
+ * Initialize the mmap ring buffer.
+ * It will store ready kevents, so userspace can get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int i;
+
+	u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+		u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[i])
+			break;
+	}
+
+	if (i != KEVENT_MAX_PAGES)
+		goto err_out_free;
+
+	u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+	return 0;
+
+err_out_free:
+	for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+		if (!u->pring[i])
+			break;
+
+		free_page((unsigned long)u->pring[i]);
+	}
+
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i=0; i<KEVENT_MAX_PAGES; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+/*
+ * Mmap implementation for the ring buffer, which is created as an array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start, off = vma->vm_pgoff;
+	struct kevent_user *u = file->private_data;
+
+	if (off >= KEVENT_MAX_PAGES)
+		return -EINVAL;
+
+	if (vma->vm_flags & VM_WRITE)
+		return -EPERM;
+
+	vma->vm_flags |= VM_RESERVED;
+	vma->vm_file = file;
+
+	if (vm_insert_page(vma, start, virt_to_page(u->pring[off])))
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+	if (left->raw_u64 > right->raw_u64)
+		return -1;
+
+	if (right->raw_u64 > left->raw_u64)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * Free the entry in the RCU callback; it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes the kevent from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	if (deq)
+		kevent_dequeue(k);
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY) {
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	kevent_user_put(u);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free the kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove the kevent from the user's list of all events,
+ * dequeue it from storage and decrease the user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from the user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (u->ready_num && !list_empty(&u->ready_list)) {
+		k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+		list_del(&k->ready_entry);
+		k->flags &= ~KEVENT_READY;
+		u->ready_num--;
+		if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+			u->pring[0]->uidx = 0;
+
+		if (u->overflow_kevent) {
+			int err;
+
+			err = kevent_user_ring_add_event(u->overflow_kevent);
+			if (!err) {
+				if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+					u->overflow_kevent = NULL;
+				else
+					u->overflow_kevent =
+						list_entry(u->overflow_kevent->ready_entry.next,
+								struct kevent, ready_entry);
+			}
+		}
+	}
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	return k;
+}
+
+/*
+ * Search for a kevent inside the kevent tree for the given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+	struct kevent *k, *ret = NULL;
+	struct rb_node *n = u->kevent_root.rb_node;
+	int cmp;
+
+	while (n) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		cmp = kevent_compare_id(&k->event.id, id);
+
+		if (cmp > 0)
+			n = n->rb_right;
+		else if (cmp < 0)
+			n = n->rb_left;
+		else {
+			ret = k;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+/*
+ * Search for and modify the kevent according to the provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove the kevent which matches the provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k;
+	struct rb_node *n;
+
+	for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read the requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue the kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+	unsigned long flags;
+	struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+	struct kevent *k;
+	int err = 0, cmp;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (*p) {
+		parent = *p;
+		k = rb_entry(parent, struct kevent, kevent_node);
+
+		cmp = kevent_compare_id(&k->event.id, &new->event.id);
+		if (cmp > 0)
+			p = &parent->rb_right;
+		else if (cmp < 0)
+			p = &parent->rb_left;
+		else {
+			err = -EEXIST;
+			break;
+		}
+	}
+	if (likely(!err)) {
+		rb_link_node(&new->kevent_node, parent, p);
+		rb_insert_color(&new->kevent_node, &u->kevent_root);
+		new->flags |= KEVENT_USER;
+		u->kevent_num++;
+		kevent_user_get(u);
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Add a kevent from both kernel and userspace users.
+ * This function allocates and queues the kevent; it returns a negative value
+ * on error, a positive value if the kevent is ready immediately and zero
+ * if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	err = kevent_user_enqueue(u, k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, knum = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				} else
+					knum++;
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		} else
+			knum++;
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+				&k->event, sizeof(struct ukevent)))
+			break;
+
+		/*
+		 * If it is a one-shot kevent, it has already been removed from
+		 * the origin's queue, so we can easily free it here.
+		 */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent miscdevice.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events into.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to wait until there is free space in the kevent queue
+ * and removes/requeues the requested number of events (commits them). The
+ * function returns the number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies the number of nanoseconds to wait until
+ * there is free space in the kevent queue.
+ *
+ * The ring buffer is designed in such a way that the first ready kevent will
+ * be at the @ring->uidx position, and all other ready events will be in FIFO
+ * order after it.
+ * So when we need to commit @num events, it means we should just remove the
+ * first @num kevents from the ready queue and commit them. We do not use any
+ * special locking to protect this function against simultaneous running -
+ * kevent dequeueing is atomic, and we do not care about the order in which
+ * events were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that the first thread will commit
+ * events 0 and 2 while the second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a
+ * smaller number of committed events than was requested.
+ * The ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes the race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by
+ * kevent_get_events() recently (for example when one thread looked into the
+ * ring indexes and started to commit events which were simultaneously
+ * committed by another thread through kevent_get_events()), kevent_wait()
+ * will not commit unprocessed events, but will return the number of actually
+ * committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer,
+ * but from some 'further' event.
+ *
+ * An example: if ready events use positions 2-5,
+ * it is permitted to start to commit 3 events from position 0;
+ * in this case positions 0 and 1 will be omitted, only the event in
+ * position 2 will be committed, and kevent_wait() will return 1, since only
+ * one event was actually committed.
+ * It is forbidden to try to commit from position 4; 0 will be returned.
+ * This means that if some events were committed using kevent_get_events(),
+ * they will not be counted; instead userspace should check the ring index
+ * and try to commit again.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, committed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_mring *ring;
+	unsigned int i, actual;
+	unsigned long flags;
+
+	if (num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring[0];
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	actual = (ring->kidx > ring->uidx)?
+			(ring->kidx - ring->uidx):
+			(KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+	if (actual < num)
+		num = actual;
+
+	if (start < ring->uidx) {
+		/*
+		 * Some events have been committed through kevent_get_events().
+		 *
+		 *                      ready events
+		 * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+		 *       ring->uidx                       ring->kidx
+		 *      |                  |
+		 *    start            start+num
+		 *
+		 */
+		unsigned int diff = ring->uidx - start;
+
+		if (num < diff)
+			num = 0;
+		else
+			num -= diff;
+	} else if (start > ring->uidx)
+		num = 0;
+
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	for (i=0; i<num; ++i) {
+		k = kqueue_dequeue_ready(u);
+		if (!k)
+			break;
+
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		kevent_stat_mmap(u);
+		committed++;
+	}
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= 1,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return committed;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on the given kevent queue, which is obtained through the kevent file
+ * descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to an array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -ENODEV;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the miscdevice to get control file descriptors from.
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..bc0582b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,10 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
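To make the commit protocol documented above concrete, here is a minimal userspace sketch of the uidx/kidx arithmetic followed by a kevent_wait() call. It assumes the mmap'ed struct kevent_mring layout from this patchset and a raw syscall wrapper like the _syscall4() one in evtest.c attached later in this thread; __NR_kevent_wait and the helper names are illustrative, not part of the patch.

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

static long kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
{
	/* Assumed syscall number; see the wrappers in evtest.c. */
	return syscall(__NR_kevent_wait, ctl_fd, start, num, timeout);
}

/* Ready events occupy [uidx, kidx) modulo KEVENT_MAX_EVENTS. */
static unsigned int ring_ready_events(struct kevent_mring *ring)
{
	unsigned int kidx = ring->kidx, uidx = ring->uidx;

	return (kidx >= uidx) ? (kidx - uidx) : (KEVENT_MAX_EVENTS - (uidx - kidx));
}

static long commit_ready(int ctl_fd, struct kevent_mring *ring)
{
	unsigned int start = ring->uidx;	/* commits may only begin here */
	unsigned int num = ring_ready_events(ring);

	if (!num)
		return 0;
	/*
	 * May return less than num if another thread committed some of
	 * these events concurrently through kevent_wait() or
	 * kevent_get_events(); the caller should then re-read the indexes.
	 */
	return kevent_wait(ctl_fd, start, num, 1000000000ULL /* 1 sec */);
}

As the comment block above spells out, the only supported starting position is ring->uidx itself: starting past it yields 0 committed events, and positions before it are silently skipped.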
* [take21 2/4] kevent: poll/select() notifications. 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works similarly to epoll and has the same issues (the callback is invoked not from the internal state machine of the caller, but through process wakeup, with a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..f81299f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..fb74e0f --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,222 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
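Since kevent_poll_enqueue() above takes the target descriptor from id.raw[0] and masks the file's ->poll() revents directly against the requested event mask, registering a poll-style kevent from userspace presumably looks like the sketch below. The kevent_ctl() wrapper and __NR_kevent_ctl are assumptions, following the wrappers in evtest.c attached later in this thread, and POLLIN is just an example mask.

#include <poll.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/ukevent.h>

static long kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg)
{
	/* Assumed syscall number and wrapper. */
	return syscall(__NR_kevent_ctl, fd, cmd, num, arg);
}

static long add_poll_kevent(int ctl_fd, int target_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.event = POLLIN;			/* standard poll bits, per the masking above */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* drop the kevent after first delivery */
	uk.id.raw[0] = target_fd;		/* kevent_poll_enqueue() does fget() on this */
	uk.user[0] = target_fd;			/* opaque cookie echoed back with the event */

	/* Returns the number of failed ukevents copied back, or a negative error. */
	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

On delivery, kevent_poll_callback() stores the matching revents in ret_data[0].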
* [take21 3/4] kevent: Socket notifications. 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 2006-10-27 16:10 ` [take21 4/4] kevent: Timer notifications Evgeniy Polyakov 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Socket notifications. This patch includes socket send/recv/accept notifications. Using a trivial web server based on kevent and these features instead of epoll, its performance increased more than noticeably. More details about various benchmarks and the server itself (evserver_kevent.c) can be found on the project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..ff1b129 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; }
else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..c865b3e --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,129 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -ENODEV; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
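Registration from userspace follows the same pattern as the poll sketch earlier: kevent_socket_enqueue() above resolves id.raw[0] with sockfd_lookup(), and the KEVENT_SOCKET_* bits form the event mask. A hedged sketch, reusing the assumed kevent_ctl() wrapper from that earlier example:

#include <string.h>
#include <linux/ukevent.h>

static long add_socket_kevent(int ctl_fd, int sock_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT;	/* data or new connection */
	uk.id.raw[0] = sock_fd;	/* kevent_socket_enqueue() does sockfd_lookup() on this */
	uk.user[0] = sock_fd;

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

A listening socket would typically ask for KEVENT_SOCKET_ACCEPT (raised from the tcp_v4 request queueing hunk above), a connected one for KEVENT_SOCKET_RECV and/or KEVENT_SOCKET_SEND.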
* [take21 4/4] kevent: Timer notifications. 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-10-27 16:10 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:10 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..04acc46 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,113 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer); + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + 
struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
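Given that kevent_timer_enqueue() reads the period from id.raw[0]/id.raw[1] and kevent_timer_func() rearms the hrtimer via hrtimer_forward(), a periodic userspace timer reduces to the sketch below; KEVENT_TIMER_FIRED and the full flow appear in evtest.c, attached later in this thread, and the kevent_ctl() wrapper is again an assumption.

#include <string.h>
#include <linux/ukevent.h>

static long add_periodic_timer(int ctl_fd, unsigned int sec, unsigned int nsec)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.id.raw[0] = sec;	/* period: seconds part, fed to ktime_set() above */
	uk.id.raw[1] = nsec;	/* period: nanoseconds part */
	/* No KEVENT_REQ_ONESHOT, so hrtimer_forward() keeps rearming it. */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

On each firing, kevent_timer_callback() above puts jiffies_to_msecs(jiffies) into ret_data[0].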
* Re: [take21 2/4] kevent: poll/select() notifications. 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-10-28 10:04 ` Eric Dumazet 2006-10-28 10:08 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 10:04 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > + file = fget(k->event.id.raw[0]); > + if (!file) > + return -ENODEV; Please, do us a favor, and use EBADF instead of ENODEV. EBADF : /* Bad file number */ ENODEV : /* No such device */ You have many ENODEV uses in your patches and that really hurts. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 2/4] kevent: poll/select() notifications. 2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet @ 2006-10-28 10:08 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 10:08 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 12:04:10PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > Evgeniy Polyakov a écrit : > > >+ file = fget(k->event.id.raw[0]); > >+ if (!file) > >+ return -ENODEV; > > Please, do us a favor, and use EBADF instead of ENODEV. > > EBADF : /* Bad file number */ > > ENODEV : /* No such device */ > > You have many ENODEV uses in your patches and that really hurts. Ok :) > Eric -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov 2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-10-28 10:28 ` Eric Dumazet 2006-10-28 10:53 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 10:28 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel +/* + * Called under kevent_user->ready_lock, so updates are always protected. + */ +int kevent_user_ring_add_event(struct kevent *k) +{ + unsigned int pidx, off; + struct kevent_mring *ring, *copy_ring; + + ring = k->user->pring[0]; + + if ((ring->kidx + 1 == ring->uidx) || + ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) { + if (k->user->overflow_kevent == NULL) + k->user->overflow_kevent = k; + return -EAGAIN; + } + I really dont understand how you manage to queue multiple kevents in the 'overflow list'. You just queue one kevent at most. What am I missing ? > + > + for (i=0; i<KEVENT_MAX_PAGES; ++i) { > + u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL); > + if (!u->pring[i]) > + break; > + } > + > + if (i != KEVENT_MAX_PAGES) > + goto err_out_free; Why dont you use goto directly ? if (!u->pring[i]) goto err_out_free; > + > + u->pring[0]->uidx = u->pring[0]->kidx = 0; > + > + return 0; > + > +err_out_free: > + for (i=0; i<KEVENT_MAX_PAGES; ++i) { > + if (!u->pring[i]) > + break; > + > + free_page((unsigned long)u->pring[i]); > + } > + return k; > +} > + > +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) > +{ > + int err, cerr = 0, knum = 0, rnum = 0, i; > + void __user *orig = arg; > + struct ukevent uk; > + > + mutex_lock(&u->ctl_mutex); > + > + err = -EINVAL; > + if (num > KEVENT_MIN_BUFFS_ALLOC) { > + struct ukevent *ukev; > + > + ukev = kevent_get_user(num, arg); > + if (ukev) { > + for (i = 0; i < num; ++i) { > + err = kevent_user_add_ukevent(&ukev[i], u); > + if (err) { > + kevent_stat_im(u); > + if (i != rnum) > + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); > + rnum++; > + } else > + knum++; Why are you using/counting knum ? > + } > + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) > + cerr = -EFAULT; > + kfree(ukev); > + goto out_setup; > + } > + } > + > + for (i = 0; i < num; ++i) { > + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { > + cerr = -EFAULT; > + break; > + } > + arg += sizeof(struct ukevent); > + > + err = kevent_user_add_ukevent(&uk, u); > + if (err) { > + kevent_stat_im(u); > + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { > + cerr = -EFAULT; > + break; > + } > + orig += sizeof(struct ukevent); > + rnum++; > + } else > + knum++; > + } > + > +out_setup: > + if (cerr < 0) { > + err = cerr; > + goto out_remove; > + } > + > + err = rnum; > +out_remove: > + mutex_unlock(&u->ctl_mutex); > + > + return err; > +} ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet @ 2006-10-28 10:53 ` Evgeniy Polyakov 2006-10-28 12:36 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 10:53 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > +/* > + * Called under kevent_user->ready_lock, so updates are always protected. > + */ > +int kevent_user_ring_add_event(struct kevent *k) > +{ > + unsigned int pidx, off; > + struct kevent_mring *ring, *copy_ring; > + > + ring = k->user->pring[0]; > + > + if ((ring->kidx + 1 == ring->uidx) || > + ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx > == 0)) { > + if (k->user->overflow_kevent == NULL) > + k->user->overflow_kevent = k; > + return -EAGAIN; > + } > + > > I really dont understand how you manage to queue multiple kevents in the > 'overflow list'. You just queue one kevent at most. What am I missing ? There is no overflow list - it is a pointer to the first kevent in the ready queue, which was not put into ring buffer. It is an optimisation, which allows to not search for that position each time new event should be placed into the buffer, when it starts to have an empty slot. > > >+ > >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) { > >+ u->pring[i] = (struct kevent_mring > >*)__get_free_page(GFP_KERNEL); > >+ if (!u->pring[i]) > >+ break; > >+ } > >+ > >+ if (i != KEVENT_MAX_PAGES) > >+ goto err_out_free; > > Why dont you use goto directly ? > > if (!u->pring[i]) > goto err_out_free; > I used a fallback mode here which allowed using a smaller number of pages for the kevent ring buffer, but then decided to drop it. So it is possible to use goto directly. > >+ > >+ u->pring[0]->uidx = u->pring[0]->kidx = 0; > >+ > >+ return 0; > >+ > >+err_out_free: > >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) { > >+ if (!u->pring[i]) > >+ break; > >+ > >+ free_page((unsigned long)u->pring[i]); > >+ } > >+ return k; > >+} > >+ > > > > > >+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, > >void __user *arg) > >+{ > >+ int err, cerr = 0, knum = 0, rnum = 0, i; > >+ void __user *orig = arg; > >+ struct ukevent uk; > >+ > >+ mutex_lock(&u->ctl_mutex); > >+ > >+ err = -EINVAL; > >+ if (num > KEVENT_MIN_BUFFS_ALLOC) { > >+ struct ukevent *ukev; > >+ > >+ ukev = kevent_get_user(num, arg); > >+ if (ukev) { > >+ for (i = 0; i < num; ++i) { > >+ err = kevent_user_add_ukevent(&ukev[i], u); > >+ if (err) { > >+ kevent_stat_im(u); > >+ if (i != rnum) > >+ memcpy(&ukev[rnum], > >&ukev[i], sizeof(struct ukevent)); > >+ rnum++; > >+ } else > >+ knum++; > > Why are you using/counting knum ? It should go away.
> >+ } > >+ if (copy_to_user(orig, ukev, rnum*sizeof(struct > >ukevent))) > >+ cerr = -EFAULT; > >+ kfree(ukev); > >+ goto out_setup; > >+ } > >+ } > >+ > >+ for (i = 0; i < num; ++i) { > >+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { > >+ cerr = -EFAULT; > >+ break; > >+ } > >+ arg += sizeof(struct ukevent); > >+ > >+ err = kevent_user_add_ukevent(&uk, u); > >+ if (err) { > >+ kevent_stat_im(u); > >+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) > >{ > >+ cerr = -EFAULT; > >+ break; > >+ } > >+ orig += sizeof(struct ukevent); > >+ rnum++; > >+ } else > >+ knum++; > >+ } > >+ > >+out_setup: > >+ if (cerr < 0) { > >+ err = cerr; > >+ goto out_remove; > >+ } > >+ > >+ err = rnum; > >+out_remove: > >+ mutex_unlock(&u->ctl_mutex); > >+ > >+ return err; > >+} > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 10:53 ` Evgeniy Polyakov @ 2006-10-28 12:36 ` Eric Dumazet 2006-10-28 13:03 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 12:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >> >> I really dont understand how you manage to queue multiple kevents in the >> 'overflow list'. You just queue one kevent at most. What am I missing ? > > There is no overflow list - it is a pointer to the first kevent in the > ready queue, which was not put into ring buffer. It is an optimisation, > which allows to not search for that position each time new event should > be placed into the buffer, when it starts to have an empty slot. This overflow list (you may call it differently, but still it IS a list), is not complete. I feel you add it just to make me happy, but I am not (yet :) ) For example, you make no test at kevent_finish_user_complete() time. Obviously, you can have a dangling pointer, and crash your box in certain conditions. static void kevent_finish_user_complete(struct kevent *k, int deq) { struct kevent_user *u = k->user; unsigned long flags; if (deq) kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); if (k->flags & KEVENT_READY) { + if (u->overflow_kevent == k) { + /* MUST do something to change u->overflow_kevent */ + } list_del(&k->ready_entry); k->flags &= ~KEVENT_READY; u->ready_num--; } spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); call_rcu(&k->rcu_head, kevent_free_rcu); } Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 12:36 ` Eric Dumazet @ 2006-10-28 13:03 ` Evgeniy Polyakov 2006-10-28 13:23 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:03 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > Evgeniy Polyakov a écrit : > >On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet > >(dada1@cosmosbay.com) wrote: > >> > >>I really dont understand how you manage to queue multiple kevents in the > >>'overflow list'. You just queue one kevent at most. What am I missing ? > > > >There is no overflow list - it is a pointer to the first kevent in the > >ready queue, which was not put into ring buffer. It is an optimisation, > >which allows to not search for that position each time new event should > >be placed into the buffer, when it starts to have an empty slot. > > This overflow list (you may call it differently, but still it IS a list), > is not complete. I feel you add it just to make me happy, but I am not (yet > :) ) There is no overflow list. There is a ready queue, part of which (the first several entries) is copied into the ring buffer; overflow_kevent is a pointer to the first kevent which was not copied. > For example, you make no test at kevent_finish_user_complete() time. > > Obviously, you can have a dangling pointer, and crash your box in certain > conditions. You are right, I did not put overflow_kevent check into all places which can remove kevent. Here is a patch I am about to commit into the kevent tree: diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 711a8a8..ecee668 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h } /* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } +} + +/* * Complete kevent removing - it dequeues kevent from storage list * if it is requested, removes kevent from ready list, drops userspace * control block reference counter and schedules kevent freeing through RCU.
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete( kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); - if (k->flags & KEVENT_READY) { - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - } + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea spin_lock_irqsave(&u->ready_lock, flags); if (u->ready_num && !list_empty(&u->ready_list)) { k = list_entry(u->ready_list.next, struct kevent, ready_entry); - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) - u->pring[0]->uidx = 0; - - if (u->overflow_kevent) { - int err; - - err = kevent_user_ring_add_event(u->overflow_kevent); - if (!err) { - if (u->overflow_kevent->ready_entry.next == &u->ready_list) - u->overflow_kevent = NULL; - else - u->overflow_kevent = - list_entry(u->overflow_kevent->ready_entry.next, - struct kevent, ready_entry); - } - } + kevent_remove_ready(k); } spin_unlock_irqrestore(&u->ready_lock, flags); It tries to put next kevent into the ring and thus update overflow_kevent if new kevent has been put into the buffer or kevent being removed is overflow kevent. Patch depends on committed changes of returned error numbers and unused variables cleanup, it will be included into next patchset if there are no problems with it. -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:03 ` Evgeniy Polyakov @ 2006-10-28 13:23 ` Eric Dumazet 2006-10-28 13:28 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 13:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >> Evgeniy Polyakov a écrit : >>> On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet >>> (dada1@cosmosbay.com) wrote: >>>> I really dont understand how you manage to queue multiple kevents in the >>>> 'overflow list'. You just queue one kevent at most. What am I missing ? >>> There is no overflow list - it is a pointer to the first kevent in the >>> ready queue, which was not put into ring buffer. It is an optimisation, >>> which allows to not search for that position each time new event should >>> be placed into the buffer, when it starts to have an empty slot. >> This overflow list (you may call it differently, but still it IS a list), >> is not complete. I feel you add it just to make me happy, but I am not (yet >> :) ) > > There is no overflow list. > There is a ready queue, part of which (the first several entries) is copied > into the ring buffer; overflow_kevent is a pointer to the first kevent which > was not copied. > >> For example, you make no test at kevent_finish_user_complete() time. >> >> Obviously, you can have a dangling pointer, and crash your box in certain >> conditions. > > You are right, I did not put overflow_kevent check into all places which > can remove kevent. > > Here is a patch I am about to commit into the kevent tree: > > diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c > index 711a8a8..ecee668 100644 > --- a/kernel/kevent/kevent_user.c > +++ b/kernel/kevent/kevent_user.c > @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h > } > > /* > + * Must be called under u->ready_lock. > + * This function removes kevent from ready queue and > + * tries to add new kevent into ring buffer. > + */ > +static void kevent_remove_ready(struct kevent *k) > +{ > + struct kevent_user *u = k->user; > + > + list_del(&k->ready_entry); Arg... no You cannot call list_del() , then check overflow_kevent. I you call list_del on what happens to be the kevent pointed by overflow_kevent, you loose...
> @@ -248,11 +278,8 @@ static void kevent_finish_user_complete( > kevent_dequeue(k); > > spin_lock_irqsave(&u->ready_lock, flags); > - if (k->flags & KEVENT_READY) { > - list_del(&k->ready_entry); > - k->flags &= ~KEVENT_READY; > - u->ready_num--; > - } > + if (k->flags & KEVENT_READY) > + kevent_remove_ready(k); > spin_unlock_irqrestore(&u->ready_lock, flags); > > kevent_user_put(u); > @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea > spin_lock_irqsave(&u->ready_lock, flags); > if (u->ready_num && !list_empty(&u->ready_list)) { > k = list_entry(u->ready_list.next, struct kevent, ready_entry); > - list_del(&k->ready_entry); > - k->flags &= ~KEVENT_READY; > - u->ready_num--; > - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) > - u->pring[0]->uidx = 0; > - > - if (u->overflow_kevent) { > - int err; > - > - err = kevent_user_ring_add_event(u->overflow_kevent); > - if (!err) { > - if (u->overflow_kevent->ready_entry.next == &u->ready_list) > - u->overflow_kevent = NULL; > - else > - u->overflow_kevent = > - list_entry(u->overflow_kevent->ready_entry.next, > - struct kevent, ready_entry); > - } > - } > + kevent_remove_ready(k); > } > spin_unlock_irqrestore(&u->ready_lock, flags); > ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:23 ` Eric Dumazet @ 2006-10-28 13:28 ` Evgeniy Polyakov 2006-10-28 13:34 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:28 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > >diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c > >index 711a8a8..ecee668 100644 > >--- a/kernel/kevent/kevent_user.c > >+++ b/kernel/kevent/kevent_user.c > >@@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h > > } > > > > /* > >+ * Must be called under u->ready_lock. > >+ * This function removes kevent from ready queue and > >+ * tries to add new kevent into ring buffer. > >+ */ > >+static void kevent_remove_ready(struct kevent *k) > >+{ > >+ struct kevent_user *u = k->user; > >+ > >+ list_del(&k->ready_entry); > > Arg... no > > You cannot call list_del() , then check overflow_kevent. > > I you call list_del on what happens to be the kevent pointed by > overflow_kevent, you loose... This function is always called from appropriate context, where it is guaranteed that it is safe to call list_del: 1. when kevent is removed. It is called after check, that given kevent is in the ready queue. 2. when dequeued from ready queue, which means that it can be removed from that queue. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:28 ` Evgeniy Polyakov @ 2006-10-28 13:34 ` Eric Dumazet 2006-10-28 13:47 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-10-28 13:34 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Evgeniy Polyakov a écrit : > On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: >>> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c >>> index 711a8a8..ecee668 100644 >>> --- a/kernel/kevent/kevent_user.c >>> +++ b/kernel/kevent/kevent_user.c >>> @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h >>> } >>> >>> /* >>> + * Must be called under u->ready_lock. >>> + * This function removes kevent from ready queue and >>> + * tries to add new kevent into ring buffer. >>> + */ >>> +static void kevent_remove_ready(struct kevent *k) >>> +{ >>> + struct kevent_user *u = k->user; >>> + >>> + list_del(&k->ready_entry); >> Arg... no >> >> You cannot call list_del() , then check overflow_kevent. >> >> I you call list_del on what happens to be the kevent pointed by >> overflow_kevent, you loose... > > This function is always called from appropriate context, where it is > guaranteed that it is safe to call list_del: > 1. when kevent is removed. It is called after check, that given kevent > is in the ready queue. > 2. when dequeued from ready queue, which means that it can be removed > from that queue. > Could you please check the list_del() function ? file include/linux/list.h static inline void list_del(struct list_head *entry) { __list_del(entry->prev, entry->next); entry->next = LIST_POISON1; entry->prev = LIST_POISON2; } So, after calling list_del(&k->ready_entry); next and prev are basically destroyed. So when you write later : + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } then you have a problem, since list_entry(k->ready_entry.next, struct kevent, ready_entry); will give you garbage. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take21 1/4] kevent: Core files. 2006-10-28 13:34 ` Eric Dumazet @ 2006-10-28 13:47 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-28 13:47 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Sat, Oct 28, 2006 at 03:34:52PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote: > >>>+ list_del(&k->ready_entry); > >>Arg... no > >> > >>You cannot call list_del() , then check overflow_kevent. > >> > >>I you call list_del on what happens to be the kevent pointed by > >>overflow_kevent, you loose... > > > >This function is always called from appropriate context, where it is > >guaranteed that it is safe to call list_del: > >1. when kevent is removed. It is called after check, that given kevent > >is in the ready queue. > >2. when dequeued from ready queue, which means that it can be removed > >from that queue. > > > > Could you please check the list_del() function ? > > file include/linux/list.h > > static inline void list_del(struct list_head *entry) > { > __list_del(entry->prev, entry->next); > entry->next = LIST_POISON1; > entry->prev = LIST_POISON2; > } > > So, after calling list_del(&k->read_entry); > next and prev are basically destroyed. > > So when you write later : > > + if (!err || u->overflow_kevent == k) { > + if (u->overflow_kevent->ready_entry.next == &u->ready_list) > + u->overflow_kevent = NULL; > + else > + u->overflow_kevent = + > list_entry(u->overflow_kevent->ready_entry.next, + > struct kevent, ready_entry); > + } > > > then you have a problem, since > > list_entry(k->ready_entry.next, struct kevent, ready_entry); > > will give you garbage. Ok, I understand you now. To remove this issue we can delete entry from the list after all checks with overflow_kevent pointer are completed, i.e. have something like this: diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 711a8a8..f3fec9b 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h } /* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; +} + +/* * Complete kevent removing - it dequeues kevent from storage list * if it is requested, removes kevent from ready list, drops userspace * control block reference counter and schedules kevent freeing through RCU. 
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete( kevent_dequeue(k); spin_lock_irqsave(&u->ready_lock, flags); - if (k->flags & KEVENT_READY) { - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - } + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); spin_unlock_irqrestore(&u->ready_lock, flags); kevent_user_put(u); @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea spin_lock_irqsave(&u->ready_lock, flags); if (u->ready_num && !list_empty(&u->ready_list)) { k = list_entry(u->ready_list.next, struct kevent, ready_entry); - list_del(&k->ready_entry); - k->flags &= ~KEVENT_READY; - u->ready_num--; - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) - u->pring[0]->uidx = 0; - - if (u->overflow_kevent) { - int err; - - err = kevent_user_ring_add_event(u->overflow_kevent); - if (!err) { - if (u->overflow_kevent->ready_entry.next == &u->ready_list) - u->overflow_kevent = NULL; - else - u->overflow_kevent = - list_entry(u->overflow_kevent->ready_entry.next, - struct kevent, ready_entry); - } - } + kevent_remove_ready(k); } spin_unlock_irqrestore(&u->ready_lock, flags); Thanks. > Eric -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
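The bug Eric caught reduces to an ordering rule worth stating in isolation: list_del() poisons the entry's next/prev pointers, so any neighbour lookup has to happen before the deletion, which is exactly why the final kevent_remove_ready() above moves list_del() after the overflow_kevent update. A condensed kernel-style sketch (not from the patch; struct kevent and the ready list are the patchset's):

#include <linux/list.h>

/* BROKEN: after list_del(), k->ready_entry.next == LIST_POISON1. */
static struct kevent *next_after_del(struct kevent *k)
{
	list_del(&k->ready_entry);
	return list_entry(k->ready_entry.next, struct kevent, ready_entry); /* garbage */
}

/* CORRECT: read the neighbour while k is still linked, unlink last. */
static struct kevent *next_before_del(struct kevent *k, struct list_head *ready_list)
{
	struct kevent *next = NULL;

	if (k->ready_entry.next != ready_list)
		next = list_entry(k->ready_entry.next, struct kevent, ready_entry);
	list_del(&k->ready_entry);
	return next;
}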
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov @ 2006-10-27 16:42 ` Evgeniy Polyakov 2006-11-07 11:26 ` Jeff Garzik 2 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-10-27 16:42 UTC (permalink / raw) To: johnpol Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2305 bytes --] On Fri, Oct 27, 2006 at 08:10:01PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Generic event handling mechanism. > > Consider for inclusion. > > Changes from 'take20' patchset: > * new ring buffer implementation A test userspace application can be found in the archive on the project's homepage. It is also attached to this mail. Short design notes about the ring buffer implementation. The ring buffer is designed so that the first ready kevent will be at the ring->uidx position, and all other ready events will be in FIFO order after it. So when we need to commit num events, we should just remove the first num kevents from the ready queue and commit them. We do not use any special locking to protect this function against simultaneous running - kevent dequeueing is atomic, and we do not care about the order in which events were committed. An example: thread 1 and thread 2 simultaneously call kevent_wait() to commit 2 and 3 events. It is possible that the first thread will commit events 0 and 2 while the second thread will commit events 1, 3 and 4. If there were only 3 ready events, then one of the calls will return a smaller number of committed events than was requested. The ring->uidx update is atomic, since it is protected by u->ready_lock, which removes the race with kevent_user_ring_add_event(). If the user asks to commit events which have been removed by kevent_get_events() recently (for example when one thread looked into the ring indexes and started to commit events which were simultaneously committed by another thread through kevent_get_events()), kevent_wait() will not commit unprocessed events, but will return the number of actually committed events instead. It is forbidden to try to commit events not from the start of the buffer, but from some 'further' event. An example: if ready events use positions 2-5, it is permitted to start to commit 3 events from position 0, in this case positions 0 and 1 will be omitted and only the event in position 2 will be committed and kevent_wait() will return 1, since only one event was actually committed. It is forbidden to try to commit from position 4; 0 will be returned. This means that if some events were committed using kevent_get_events(), they will not be counted; instead userspace should check the ring index and try to commit again.
-- 
	Evgeniy Polyakov

[-- Attachment #2: evtest.c --]
[-- Type: text/plain, Size: 5070 bytes --]

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#include <linux/unistd.h>
#include <linux/types.h>

#define PAGE_SIZE	4096

#include <linux/ukevent.h>

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name (type1 arg1, type2 arg2, type3 arg3) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3);\
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4);\
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5);\
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5, type6 arg6) \
{\
	return syscall(__NR_##name, arg1, arg2, arg3, arg4, arg5, arg6);\
}

_syscall4(int, kevent_ctl, int, arg1, unsigned int, argv2, unsigned int, argv3, void *, argv4);
_syscall6(int, kevent_get_events, int, arg1, unsigned int, argv2, unsigned int, argv3, __u64, argv4, void *, argv5, unsigned, arg6);
_syscall4(int, kevent_wait, int, arg1, unsigned int, arg2, unsigned int, argv3, __u64, argv4);

#define ulog(f, a...) fprintf(stderr, "%8u: "f, (unsigned int)time(NULL), ##a)
#define ulog_err(f, a...) ulog(f ": %s [%d].\n", ##a, strerror(errno), errno)

static void usage(char *p)
{
	ulog("Usage: %s -t type -e event -o oneshot -p path -n wait_num -f kevent_file -r ready_num -h\n", p);
}

/* Map KEVENT_MAX_PAGES ring buffer pages read-only, one page per mmap(). */
static int evtest_mmap(int fd, struct kevent_mring **ring, int number)
{
	int i;
	off_t o = 0;

	for (i = 0; i < number; ++i) {
		ring[i] = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, o);
		if (ring[i] == MAP_FAILED) {
			ulog_err("Failed to mmap: i: %d, number: %u, offset: %lu",
					i, number, (unsigned long)o);
			return -ENOMEM;
		}
		printf("mmap: %d: number: %u, offset: %lu.\n", i, number, (unsigned long)o);
		o += PAGE_SIZE;
	}

	return 0;
}

int main(int argc, char *argv[])
{
	int ch, fd, err, oneshot, wait_num;
	unsigned int i, ready_num, old_idx, new_idx, tm_sec, tm_nsec;
	char *file;
	char buf[4096];
	struct ukevent *uk;
	struct mukevent *m;
	struct kevent_mring *ring[KEVENT_MAX_PAGES];
	off_t offset;

	oneshot = 0;
	wait_num = 10;
	offset = 0;
	old_idx = 0;
	file = "/dev/kevent";
	tm_sec = 2;
	tm_nsec = 0;
	ready_num = 1;

	while ((ch = getopt(argc, argv, "r:f:t:T:o:n:h")) > 0) {
		switch (ch) {
			case 'f':
				file = optarg;
				break;
			case 'r':
				ready_num = atoi(optarg);
				break;
			case 'n':
				wait_num = atoi(optarg);
				break;
			case 't':
				tm_sec = atoi(optarg);
				break;
			case 'T':
				tm_nsec = atoi(optarg);
				break;
			case 'o':
				oneshot = atoi(optarg);
				break;
			default:
				usage(argv[0]);
				return -1;
		}
	}

	fd = open(file, O_RDWR);
	if (fd == -1) {
		ulog_err("Failed to create kevent control block using file %s", file);
		return -1;
	}

	err = evtest_mmap(fd, ring, KEVENT_MAX_PAGES);
	if (err)
		return err;

	/* Add ready_num timer kevents; each fires after tm_sec seconds. */
	memset(buf, 0, sizeof(buf));
	for (i = 0; i < ready_num; ++i) {
		uk = (struct ukevent *)buf;

		uk->event = KEVENT_TIMER_FIRED;
		uk->type = KEVENT_TIMER;
		if (oneshot)
			uk->req_flags |= KEVENT_REQ_ONESHOT;
		uk->user[0] = i;
		uk->id.raw[0] = tm_sec;
		uk->id.raw[1] = tm_nsec + i;

		err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, uk);
		if (err < 0) {
			ulog_err("Failed to perform control operation: oneshot: %d, sec: %u, nsec: %u",
					oneshot, tm_sec, tm_nsec);
			close(fd);
			return err;
		}
		if (err) {
			ulog("%d: %016llx: ret_flags: 0x%x, ret_data: %u %d.\n",
					i, uk->id.raw_u64, uk->ret_flags,
					uk->ret_data[0], (int)uk->ret_data[1]);
		}
	}

	old_idx = ready_num = 0;

	while (1) {
		/* kidx is the kernel's producer index, uidx the commit index. */
		new_idx = ring[0]->kidx;
		old_idx = ring[0]->uidx;

		if (new_idx != old_idx) {
			ready_num = (old_idx > new_idx) ?
				(KEVENT_MAX_EVENTS - (old_idx - new_idx)) :
				(new_idx - old_idx);
			ulog("mmap: new: %u, old: %u, ready: %u.\n", new_idx, old_idx, ready_num);

			for (i = 0; i < ready_num; ++i) {
				/* Walk each ready event in the mapped ring;
				 * pages hold KEVENTS_ON_PAGE entries apiece. */
				unsigned int idx = (old_idx + i) % KEVENT_MAX_EVENTS;

				m = &ring[idx / KEVENTS_ON_PAGE]->event[idx % KEVENTS_ON_PAGE];
				ulog("%08x: %08x.%08x - %08x\n", i, m->id.raw[0], m->id.raw[1], m->ret_flags);
			}
		}

		ulog("going to wait: old: %u, new: %u, ready_num: %u, uidx: %u, kidx: %u.\n",
				old_idx, new_idx, ready_num, ring[0]->uidx, ring[0]->kidx);

		err = kevent_wait(fd, old_idx, ready_num, 10000000000ULL);
		if (err < 0) {
			if (errno != EAGAIN) {
				ulog_err("Failed to perform control operation: oneshot: %d, sec: %u, nsec: %u",
						oneshot, tm_sec, tm_nsec);
				close(fd);
				return err;
			}
			old_idx = (old_idx + ready_num) % KEVENT_MAX_EVENTS;
			ready_num = 0;
		}
		ulog("wait: old: %u, ready: %u, ret: %d.\n", old_idx, ready_num, err);
	}

	close(fd);
	return 0;
}

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov 2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-11-07 11:26 ` Jeff Garzik 2006-11-07 11:46 ` Jeff Garzik 2006-11-07 11:51 ` Evgeniy Polyakov 2 siblings, 2 replies; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 11:26 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
> 
> Consider for inclusion.
> 
> Changes from 'take20' patchset:
>  * new ring buffer implementation
>  * removed artificial limit on possible number of kevents
> With this release and fixed userspace web server it was possible to
> achive 3960+ req/s with client connection rate of 4000 con/s
> over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> is too close to wire speed if we get into account headers and the like.

OK, now that ring buffer is here, I definitely like the direction this code is taking. I just committed the patches to a local repo for a good in-depth review.

Could you write up a simple text file, documenting (a) your proposed syscalls and (b) your ring buffer design?

Overall I have a Linux "design wish" that I hope kevent can fulfill:

To develop completely async applications (generally network servers, in Linux-land) and increase the chance of zero-copy I/O, network and file I/O submission and completion should be as async as possible.

As such, syscalls themselves have become a serializing bottleneck that isn't strictly necessary. A fully-async application should be able to submit file read, file write, and network write requests asynchronously... in batches. Network reads and file I/O completions should be received asynchronously, potentially in batches.

Even with epoll and AIO syscalls, Linux isn't quite up to the task.

So to me, the design of the userspace interface that solves this problem is a fundamental issue.

My best guess at a solution would be two classes of mmap'd ring buffers, request and response. Let the app allocate one or more. Then have two hooks, (a) kick the kernel to read the request ring, and (b) kick the app when one or more events have arrived on a ring.

But that's just thinking out loud. I welcome any solution that gives userspace a fully-async submission/completion interface for both network and file I/O.

Setting the standard for a good interface here means Linux will kick ass for decades more to come ;-) This is IMO a Big Deal(tm).

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
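To make the request/response ring idea above concrete, one possible shape of such a pair, purely as a sketch: every name and field below is invented for illustration and is not taken from any posted patch.

#include <linux/types.h>

/* Hypothetical submission entry: the app fills these into a mmap'd
 * request ring and kicks the kernel once per batch. */
struct kev_sqe {
	__u32	opcode;		/* file read/write, net send, ... */
	__u32	fd;
	__u64	off;
	__u64	len;
	__u64	user_data;	/* echoed back on completion */
};

/* Hypothetical completion entry, produced by the kernel on the
 * response ring and consumed by the app, again in batches. */
struct kev_cqe {
	__s64	res;		/* bytes transferred or -errno */
	__u64	user_data;
};

/* Shared index block at the head of the mapped area. */
struct kev_ring_pair {
	__u32	sq_head, sq_tail;	/* app produces, kernel consumes */
	__u32	cq_head, cq_tail;	/* kernel produces, app consumes */
	/* entry arrays follow in the mapped area */
};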
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:26 ` Jeff Garzik @ 2006-11-07 11:46 ` Jeff Garzik 2006-11-07 11:58 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 11:46 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

As an aside... this may be useful. Or not.

Al Viro had an interesting idea about kernel<->userspace data passing interfaces. He had suggested creating a task-specific filesystem derived from ramfs. Through the normal VFS/VM codepaths, the user can easily create [subject to resource/priv checks] a buffer that is locked into the pagecache. Using mmap, read, write, whatever they prefer. Derive from tmpfs, and the buffers are swappable.

Then it would be a simple matter to associate a file stored in "keventfs" with a ring buffer guaranteed to be pagecache-friendly.

Heck, that might make zero-copy easier in some cases, too. And using a filesystem would mean that you could do all this without adding syscalls, by using special (poll-able!) files in the filesystem for control and notification purposes.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:46 ` Jeff Garzik @ 2006-11-07 11:58 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 11:58 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 06:46:58AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> As an aside... this may be useful. Or not.
> 
> Al Viro had an interesting idea about kernel<->userspace data passing
> interfaces. He had suggested creating a task-specific filesystem
> derived from ramfs. Through the normal VFS/VM codepaths, the user can
> easily create [subject to resource/priv checks] a buffer that is locked
> into the pagecache. Using mmap, read, write, whatever they prefer.
> Derive from tmpfs, and the buffers are swappable.

It looks like Al likes filesystems more than any other part of the kernel tree...

The existing ring buffer is created in the process' memory, so it is swappable too (which is probably the most significant part of this ring buffer version), but in theory a kevent file descriptor can be obtained not from the char device but from a special filesystem (it was actually done that way in the first releases, but then I was asked to remove that functionality).

> Then it would be a simple matter to associate a file stored in
> "keventfs" with a ring buffer guaranteed to be pagecache-friendly.
> 
> Heck, that might make zero-copy easier in some cases, too. And using a
> filesystem would mean that you could do all this without adding
> syscalls, by using special (poll-able!) files in the filesystem for
> control and notification purposes.

There are many ideas about networking zero-copy, for both sending and receiving, and some of them are even implemented at different layers (from a special allocator down to splice() with an additional single allocation/copy).

> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:26 ` Jeff Garzik 2006-11-07 11:46 ` Jeff Garzik @ 2006-11-07 11:51 ` Evgeniy Polyakov 2006-11-07 12:17 ` Jeff Garzik 2006-11-07 12:32 ` Jeff Garzik 1 sibling, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 11:51 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 06:26:09AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >
> >Consider for inclusion.
> >
> >Changes from 'take20' patchset:
> > * new ring buffer implementation
> > * removed artificial limit on possible number of kevents
> >With this release and fixed userspace web server it was possible to
> >achive 3960+ req/s with client connection rate of 4000 con/s
> >over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> >is too close to wire speed if we get into account headers and the like.
> 
> OK, now that ring buffer is here, I definitely like the direction this
> code is taking. I just committed the patches to a local repo for a good
> in-depth review.

It is the third ring buffer; the fourth one will be in the next release, which should satisfy everyone.

> Could you write up a simple text file, documenting (a) your proposed
> syscalls and (b) your ring buffer design?

An initial draft describing the supported syscalls can be found on the documentation page at
http://linux-net.osdl.org/index.php/Kevent

Ring buffer background bits are pasted below (quotations from my blog; do not pay too much attention if something is occasionally out of sync).

The new ring buffer is implemented fully in userspace, in the process' memory, which means that no memory is pinned, the buffer can have almost any size, and several threads and processes can access it simultaneously. There is a new system call

int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num);

which initializes kevent's ring buffer (ctl_fd is a kevent file descriptor, ring is a userspace-allocated ring buffer, and num is the maximum number of events (struct ukevent) which can be placed into that buffer).

The ring buffer is described with the following structure:

struct kevent_ring
{
	unsigned int		ring_kidx, ring_uidx;
	struct ukevent		event[0];
};

where ring_kidx is the kernel's last position (i.e. the position just past the last kevent the kernel put into the ring buffer) and ring_uidx is the last userspace commit position (i.e. the position where the first unread kevent lives), respectively. I will release an appropriate userspace test application when tests are completed.

When a kevent is removed (not dequeued when it is ready, but just removed), it is not copied into the ring buffer even if it was ready: if it is removed, no one cares about it (otherwise the user would wait until it became ready and get it the usual way using kevent_get_events() or kevent_wait()), and thus there is no need to copy it to the ring buffer.

Dequeueing a kevent (calling kevent_get_events()) means that the user has processed the previously dequeued kevent and is ready to process a new one, so the position in the ring buffer previously occupied by that event can be reused by the currently dequeued event.
In a world where only one type of syscall is used to get events (either the usual way with kevent_get_events(), or the ring buffer with kevent_wait()) this should not be a problem, since kevent_wait() only allows marking a number of events as processed by userspace starting from the beginning (i.e. from the last processed event). But if several threads use different models, that can raise some questions: for example, one thread can start to read events from the ring buffer while another thread calls kevent_get_events(), which can rewrite those events. Actually, another thread can call kevent_wait() to commit those events (i.e. mark them as processed by userspace so the kernel can free or requeue them), so appropriate locking is required in userspace in any case.

So, to repeat: with the userspace ring buffer it is possible for events in the ring buffer to be replaced without the knowledge of the thread currently reading them (when another thread calls kevent_get_events() or kevent_wait()), so appropriate locking between threads or processes which can simultaneously access the same ring buffer is required.

Having a userspace ring buffer allows glibc to turn all kevent syscalls into so-called 'cancellation points': when a thread is cancelled in a kevent syscall, the thread can be safely removed and no events will be lost, since each syscall copies events into the special ring buffer, accessible from other threads or even processes (if shared memory is used).

> Overall I have a Linux "design wish" that I hope kevent can fulfill:
> 
> To develop completely async applications (generally network servers, in
> Linux-land) and increase the chance of zero-copy I/O, network and file
> I/O submission and completion should be as async as possible.
> 
> As such, syscalls themselves have become a serializing bottleneck that
> isn't strictly necessary. A fully-async application should be able to
> submit file read, file write, and network write requests
> asynchronously... in batches. Network reads and file I/O completions
> should be received asynchronously, potentially in batches.
> 
> Even with epoll and AIO syscalls, Linux isn't quite up to the task.
> 
> So to me, the design of the userspace interface that solves this problem
> is a fundamental issue.
> 
> My best guess at a solution would be two classes of mmap'd ring buffers,
> request and response. Let the app allocate one or more. Then have two
> hooks, (a) kick the kernel to read the request ring, and (b) kick the
> app when one or more events have arrived on a ring.

The mmap ring buffer implementation was stopped by Andrew Morton and Ulrich Drepper; process memory is used instead. copy_to_user() is slower (sometimes noticeably), but there are major advantages to such an approach.

> But that's just thinking out loud. I welcome any solution that gives
> userspace a fully-async submission/completion interface for both network
> and file I/O.

Well, kevent network and FS AIO are suspended for now (although the first patches included them all).

> Setting the standard for a good interface here means Linux will kick ass
> for decades more to come ;-) This is IMO a Big Deal(tm).
> 
> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
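A minimal consumer sketch of the interface described above, under stated assumptions: the struct kevent_ring layout and kevent_ring_init() semantics are taken from the description in this mail and may differ in the eventual patchset; kevent_wait() is assumed to be wrapped as in evtest.c, to commit from the reader's position as explained, and to treat a zero timeout as non-blocking; process_event() stands in for application code.

#include <linux/types.h>
#include <linux/ukevent.h>

struct kevent_ring {
	unsigned int	ring_kidx;	/* kernel's producer position */
	unsigned int	ring_uidx;	/* userspace commit position */
	struct ukevent	event[0];
};

void process_event(struct ukevent *uk);	/* app-defined */

/* Drain and commit all currently ready events from a ring that was
 * registered earlier with kevent_ring_init(ctl_fd, ring, num).
 * Locking between threads sharing the ring is the caller's job. */
static void drain_ring(int ctl_fd, struct kevent_ring *ring, unsigned int num)
{
	unsigned int kidx = ring->ring_kidx, uidx = ring->ring_uidx;
	unsigned int ready = (uidx > kidx) ? (num - (uidx - kidx)) : (kidx - uidx);
	unsigned int i;

	for (i = 0; i < ready; ++i)
		process_event(&ring->event[(uidx + i) % num]);

	/* Mark those slots as processed so the kernel can reuse them. */
	kevent_wait(ctl_fd, uidx, ready, 0 /* assumed: do not block */);
}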
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:51 ` Evgeniy Polyakov @ 2006-11-07 12:17 ` Jeff Garzik 2006-11-07 12:29 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 12:17 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> Well, kevent network and FS AIO are suspended for now (although first

Why?

IMO, getting async event submission right is important. It should be designed in parallel with async event reception.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 12:17 ` Jeff Garzik @ 2006-11-07 12:29 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-07 12:29 UTC (permalink / raw)
To: Jeff Garzik
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

On Tue, Nov 07, 2006 at 07:17:03AM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Evgeniy Polyakov wrote:
> >Well, kevent network and FS AIO are suspended for now (although first
> 
> Why?
> 
> IMO, getting async event submission right is important. It should be
> designed in parallel with async event reception.

It was not only designed but also implemented, but...

FS AIO was confirmed to have a correct design, but there were minor (from my point of view) layering problems (it was all but suggested that I give myself a lobotomy after I put a get_block() callback into address_space_operations; there was also some code duplication of mpage_readpages() in async form in kevent/kevent_aio.c - I did that to separate kevent as much as possible; both changes could live in fs/ with an appropriate callback export).

I postponed network AIO for a while; looking at how hard it is to get core changes accepted, that seems like the better decision... Using Ulrich's DMA allocation API (if it existed as more than a proposal) it would be possible to speed NAIO up a bit more.

A kevent-based FS AIO patch can be found for example here (it contains the full kevent subsystem with network AIO and FS AIO):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent_full.diff.3

Network AIO homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

> Jeff

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 11:51 ` Evgeniy Polyakov 2006-11-07 12:17 ` Jeff Garzik @ 2006-11-07 12:32 ` Jeff Garzik 2006-11-07 19:34 ` Andrew Morton 1 sibling, 1 reply; 200+ messages in thread
From: Jeff Garzik @ 2006-11-07 12:32 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, linux-kernel, Linus Torvalds

Evgeniy Polyakov wrote:
> The mmap ring buffer implementation was stopped by Andrew Morton and
> Ulrich Drepper; process memory is used instead. copy_to_user() is
> slower (sometimes noticeably), but there are major advantages to such
> an approach.

hmmmm. I say there are advantages to both.

Perhaps create a "kevent_direct_limit" resource limit for each thread. By default, each thread could mmap $n pinned pagecache pages. Sysadmin can tune certain app resource limits to permit more.

I would think that retaining the option to avoid copy_to_user() -somehow- in -some- cases would be wise.

	Jeff

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 12:32 ` Jeff Garzik @ 2006-11-07 19:34 ` Andrew Morton 2006-11-07 20:52 ` David Miller 0 siblings, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-07 19:34 UTC (permalink / raw)
To: Jeff Garzik
Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, linux-kernel, Linus Torvalds

On Tue, 07 Nov 2006 07:32:20 -0500 Jeff Garzik <jeff@garzik.org> wrote:

> Evgeniy Polyakov wrote:
> > The mmap ring buffer implementation was stopped by Andrew Morton and
> > Ulrich Drepper; process memory is used instead. copy_to_user() is
> > slower (sometimes noticeably), but there are major advantages to such
> > an approach.
> 
> hmmmm. I say there are advantages to both.

My problem with the old mmapped ringbuffer was that it permitted each user to pin (typically) 48MB of unswappable memory. Plus this pinned-memory problem would put upper bounds on the ring size.

> Perhaps create a "kevent_direct_limit" resource limit for each thread.
> By default, each thread could mmap $n pinned pagecache pages. Sysadmin
> can tune certain app resource limits to permit more.
> 
> I would think that retaining the option to avoid copy_to_user()
> -somehow- in -some- cases would be wise.

What Evgeniy means here is that copy_to_user() is slower than memcpy() (on his machine, with his kernel config, at least).

Which is kinda weird and unexpected and is something which we should investigate independently from this project. (Rather than simply going and bypassing it!)

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 19:34 ` Andrew Morton @ 2006-11-07 20:52 ` David Miller 2006-11-07 21:38 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread
From: David Miller @ 2006-11-07 20:52 UTC (permalink / raw)
To: akpm; +Cc: jeff, johnpol, drepper, netdev, linux-kernel, torvalds

From: Andrew Morton <akpm@osdl.org>
Date: Tue, 7 Nov 2006 11:34:00 -0800

> What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> his machine, with his kernel config, at least).
> 
> Which is kinda weird and unexpected and is something which we should
> investigate independently from this project. (Rather than simply going
> and bypassing it!)

It's straightforward to me. :-)

If the kernel memcpy()'s, it uses those nice 4MB PTE mappings to the kernel pages. With copy_to_user() you run through tiny 4K or 8K PTE mappings which thrash the TLB.

The TLB is therefore able to hold more of the accessed state at a time if you touch the pages on the kernel side.

^ permalink raw reply	[flat|nested] 200+ messages in thread
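A back-of-the-envelope illustration of the point above (using the 48MB figure Andrew mentioned as the old pinned ring size; the numbers are only indicative): covering 48MB through the kernel's 4MB direct mapping costs 48/4 = 12 large-page TLB entries, while walking the same 48MB through 4KB user PTEs costs 48*1024/4 = 12288 entries, far more than a TLB of that era could hold, so the user-side walk keeps evicting and refilling entries while the kernel-side copy does not.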
* Re: [take21 0/4] kevent: Generic event handling mechanism. 2006-11-07 20:52 ` David Miller @ 2006-11-07 21:38 ` Andrew Morton 0 siblings, 0 replies; 200+ messages in thread
From: Andrew Morton @ 2006-11-07 21:38 UTC (permalink / raw)
To: David Miller; +Cc: jeff, johnpol, drepper, netdev, linux-kernel, torvalds

On Tue, 07 Nov 2006 12:52:41 -0800 (PST) David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@osdl.org>
> Date: Tue, 7 Nov 2006 11:34:00 -0800
> 
> > What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> > his machine, with his kernel config, at least).
> > 
> > Which is kinda weird and unexpected and is something which we should
> > investigate independently from this project. (Rather than simply going
> > and bypassing it!)
> 
> It's straightforward to me. :-)
> 
> If the kernel memcpy()'s, it uses those nice 4MB PTE mappings to
> the kernel pages. With copy_to_user() you run through tiny
> 4K or 8K PTE mappings which thrash the TLB.
> 
> The TLB is therefore able to hold more of the accessed state at
> a time if you touch the pages on the kernel side.

Maybe. Evgeniy tends to favour teeny microbenchmarks.

I'd also be suspecting the considerable setup code in the x86 uaccess functions. That would show up in a tight loop doing large numbers of small copies.

^ permalink raw reply	[flat|nested] 200+ messages in thread
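One way to investigate this independently is a microbenchmark comparing many small kernel-mediated copies against plain memcpy() of the same size. A sketch only (assumptions: ./data is a page-cache-hot file; the pread() loop also includes syscall entry cost, so one would vary SZ to separate fixed setup cost from per-byte cost; this is not the benchmark Evgeniy ran):

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	enum { SZ = 64, N = 1 << 20 };
	static char src[SZ], dst[SZ];
	int i, fd = open("./data", O_RDONLY);	/* assumed hot in page cache */
	double t;

	if (fd < 0)
		return 1;

	t = now();
	for (i = 0; i < N; ++i)
		pread(fd, dst, SZ, 0);	/* one copy_to_user() per call */
	printf("pread  x %u: %.3f sec\n", (unsigned)N, now() - t);

	t = now();
	for (i = 0; i < N; ++i)
		memcpy(dst, src, SZ);	/* pure userspace copy */
	printf("memcpy x %u: %.3f sec (%d)\n", (unsigned)N, now() - t, dst[0]);

	close(fd);
	return 0;
}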
* [take22 0/4] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov ` (3 subsequent siblings) 5 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Generic event handling mechanism.

Consider for inclusion.

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables, whitespace and so on)
 * fixed a bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which is close to wire speed if we take into account headers and the like.

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use an RB tree instead of a hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, which justifies RB tree usage over an AVL tree, since the latter has much slower deletion (max O(log(N)) compared to 3 ops), although faster search (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second, compared to 2200 and sometimes 2500 req/second for epoll(), for a trivial web-server and httperf client on the same hardware. It is possible that the kevent limit above is due to the maximum number of kevents allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created a special macro which is used for mmap size (number of pages) calculation
 * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has been updated)
 * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
   This syscall waits until either the timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there were no registered callbacks for the given type of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop, which should save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80-line comment issues
 * added a header shared between userspace and kernelspace instead of embedding them in one
 * core restructuring to remove forward declarations
 * s o m e  w h i t e s p a c e  c o d i n g  s t y l e  c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
   - use nopage() method to dynamically substitute pages
   - allocate a new page for events only when a newly added kevent requires it
   - do not use ugly index dereferencing, use a structure instead
   - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect whether an entry is in the list
 * return the number of ready kevents even if copy*user() fails
 * strict check for the number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turned on
 * do not use internal socket structures, use appropriate (exported) wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comment fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the same function, so it was taught to differentiate between various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments as syscall parameters
 * various indent cleanups
 * added an optimisation aimed at helping when a lot of kevents are being copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
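For reference while reading the patch below, a condensed sketch of the non-mmap usage pattern these syscalls give (assumptions: the kevent_ctl()/kevent_get_events() wrappers are generated by the _syscallN macros exactly as in the evtest.c attachment earlier in this thread, and the timer-event encoding follows that example):

#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <linux/ukevent.h>

/* kevent_ctl()/kevent_get_events() wrappers as in evtest.c above. */

int main(void)
{
	struct ukevent uk, ready[16];
	int err, fd = open("/dev/kevent", O_RDWR);

	if (fd == -1)
		return 1;

	/* Add one 2-second one-shot timer kevent. */
	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;
	uk.id.raw[0] = 2;	/* seconds, as in evtest.c */
	uk.id.raw[1] = 0;	/* nanoseconds */

	err = kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);

	/* Block until at least one event is ready (timeout is in
	 * nanoseconds), returning up to 16 ready events. */
	err = kevent_get_events(fd, 1, 16, 3000000000ULL, ready, 0);

	close(fd);
	return err < 0;
}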
* [take22 1/4] kevent: Core files. 2006-11-01 11:36 ` [take22 " Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 2006-11-01 11:36 ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 322
 
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice		278
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..743b328
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select storage has a list of wait_queue_t containers
+	 * for each ->poll() { poll_wait()' } here.
+	 */
+	void			*priv;
+};
+
+struct kevent_user
+{
+	struct rb_root		kevent_root;
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	/* First kevent which was not put into ring buffer due to overflow.
+	 * It will be copied into the buffer, when first event will be removed
+	 * from ready queue (and thus there will be an empty place in the
+	 * ring buffer).
+	 */
+	struct kevent		*overflow_kevent;
+	/* Array of pages forming mapped ring buffer */
+	struct kevent_mring	**pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num, mmap_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+int kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n",
+			__func__, u, u->wait_num, u->mmap_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_mmap(struct kevent_user *u)
+{
+	u->mmap_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
#else
+#define kevent_stat_print(u)	({ (void) u;})
+#define kevent_stat_init(u)	({ (void) u;})
+#define kevent_stat_im(u)	({ (void) u;})
+#define kevent_stat_wait(u)	({ (void) u;})
+#define kevent_stat_mmap(u)	({ (void) u;})
+#define kevent_stat_total(u)	({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;	/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;		/* List of queued kevents. */
+	spinlock_t		lock;		/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..71a758f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct ukevent;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru
 					    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node,
 			   struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..daa8202
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,163 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT	0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_MAX		6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV	0x1
+#define KEVENT_SOCKET_ACCEPT	0x2
+#define KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE	0x1
+#define KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN	0x0001
+#define KEVENT_POLL_POLLPRI	0x0002
+#define KEVENT_POLL_POLLOUT	0x0004
+#define KEVENT_POLL_POLLERR	0x0008
+#define KEVENT_POLL_POLLHUP	0x0010
+#define KEVENT_POLL_POLLNVAL	0x0020
+
+#define KEVENT_POLL_POLLRDNORM	0x0040
+#define KEVENT_POLL_POLLRDBAND	0x0080
+#define KEVENT_POLL_POLLWRNORM	0x0100
+#define KEVENT_POLL_POLLWRBAND	0x0200
+#define KEVENT_POLL_POLLMSG	0x0400
+#define KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO		0x1
+
+#define KEVENT_MASK_ALL		0xffffffff	/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY	0x0		/* Empty mask of ready events. */
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct mukevent
+{
+	struct kevent_id	id;
+	__u32			ret_flags;
+};
+
+#define KEVENT_MAX_PAGES	2
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
+ * and we reuse 8 bytes at the beginning of the page to store the kidx/uidx
+ * indexes. Take that into account if you want to change the size of
+ * struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+	unsigned int		kidx, uidx;
+	struct mukevent		event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only for sanitizing of the kevent_wait() input data - do not
+ * allow user to specify number of events more than it is possible to place
+ * into ring buffer. This does not limit number of events which can be
+ * put into kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS	(KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables event queue mechanism.
+	  It can be used as replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes total number of kevent, number of kevents
+	  which are ready immediately at insertion time and number of kevents
+	  which were removed through readiness completion.
+	  It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows to use kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through KEVENT subsystem of
+	  sockets operations, like new packet receiving conditions,
+	  ready for accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..25404d3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX ||
+			!kevent_registered_callbacks[k->event.type].callback))
+		return kevent_break(k);
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break))
+		return kevent_break(k);
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret, rem;
+	unsigned long flags;
+
+	ret = k->callbacks.callback(k);
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			kevent_user_ring_add_event(k);
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+
+	rcu_read_lock();
+	if (ready_callback)
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry)
+		if (event & k->event.event)
+			__kevent_requeue(k, event);
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken, that will remove them from storage,
+ * so storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..f3fec9b
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1004 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num)
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+	unsigned int pidx, off;
+	struct kevent_mring *ring, *copy_ring;
+
+	ring = k->user->pring[0];
+
+	if ((ring->kidx + 1 == ring->uidx) ||
+			((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+		if (k->user->overflow_kevent == NULL)
+			k->user->overflow_kevent = k;
+		return -EAGAIN;
+	}
+
+	pidx = ring->kidx/KEVENTS_ON_PAGE;
+	off = ring->kidx%KEVENTS_ON_PAGE;
+
+	if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+		printk(KERN_ERR "%s: kidx: %u, uidx: %u, on_page: %lu, pidx: %u.\n",
+				__func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+		return -EINVAL;
+	}
+
+	copy_ring = k->user->pring[pidx];
+
+	copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+	copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+	copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+	if (++ring->kidx >= KEVENT_MAX_EVENTS)
+		ring->kidx = 0;
+
+	return 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace could get them directly instead
+ * of using syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+	int i;
+
+	u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+	if (!u->pring)
+		return -ENOMEM;
+
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i) {
+		u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+		if (!u->pring[i])
+			goto err_out_free;
+	}
+
+	u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+	return 0;
+
+err_out_free:
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i) {
+		if (!u->pring[i])
+			break;
+
+		free_page((unsigned long)u->pring[i]);
+	}
+
+	kfree(u->pring);
+
+	return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+	int i;
+
+	for (i = 0; i < KEVENT_MAX_PAGES; ++i)
+		free_page((unsigned long)u->pring[i]);
+
+	kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u;
+
+	u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+	if (!u)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+
+	atomic_set(&u->refcnt, 1);
+
+	if (unlikely(kevent_user_ring_init(u))) {
+		kfree(u);
+		return -ENOMEM;
+	}
+
+	file->private_data = u;
+	return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		kevent_user_ring_fini(u);
+		kfree(u);
+	}
+}
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+ */ +static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma) +{ + unsigned long start = vma->vm_start, off = vma->vm_pgoff / PAGE_SIZE; + struct kevent_user *u = file->private_data; + + if (off >= KEVENT_MAX_PAGES) + return -EINVAL; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vma->vm_flags |= VM_RESERVED; + vma->vm_file = file; + + if (vm_insert_page(vma, start, virt_to_page(u->pring[off]))) + return -EFAULT; + + return 0; +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function removes kevent from ready queue and + * tries to add new kevent into ring buffer. + */ +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS) + u->pring[0]->uidx = 0; + + if (u->overflow_kevent) { + int err; + + err = kevent_user_ring_add_event(u->overflow_kevent); + if (!err || u->overflow_kevent == k) { + if (u->overflow_kevent->ready_entry.next == &u->ready_list) + u->overflow_kevent = NULL; + else + u->overflow_kevent = + list_entry(u->overflow_kevent->ready_entry.next, + struct kevent, ready_entry); + } + } + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + u->ready_num--; +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + if (deq) + kevent_dequeue(k); + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_remove_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); + + kevent_user_put(u); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. 
+ */ +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_remove_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + return k; +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. 
+ */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. + */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + break; + + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. 
+ */
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.mmap		= kevent_user_mmap,
+	.open		= kevent_user_open,
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = kevent_name,
+	.fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent misc device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue
+ * and removes/requeues the requested number of events (commits them).
+ * It returns the number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - number of the first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies the number of nanoseconds to wait until
+ * 		there is free space in the kevent queue.
+ *
+ * The ring buffer is designed in a way that the first ready kevent will be at
+ * @ring->uidx position, and all other ready events will be in FIFO order after it.
+ * So when we need to commit @num events, it means we should just remove the first
+ * @num kevents from the ready queue and commit them. We do not use any special
+ * locking to protect this function against simultaneous running - kevent
+ * dequeueing is atomic, and we do not care about the order in which events
+ * were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that the first thread will commit
+ * events 0 and 2 while the second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a smaller
+ * number of committed events than was requested.
+ * The ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes the race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by
+ * kevent_get_events() recently (for example when one thread looked at the
+ * ring indexes and started to commit events which were simultaneously
+ * committed by another thread through kevent_get_events()), kevent_wait()
+ * will not commit unprocessed events, but will return the number of actually
+ * committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer,
+ * but from some 'further' event.
+ *
+ * An example: if ready events use positions 2-5,
+ * it is permitted to start to commit 3 events from position 0;
+ * in this case positions 0 and 1 will be omitted, only the event in position 2
+ * will be committed, and kevent_wait() will return 1, since only one event was
+ * actually committed. It is forbidden to try to commit from position 4; 0 will
+ * be returned. This means that if some events were committed using
+ * kevent_get_events(), they will not be counted; instead userspace should
+ * check the ring index and try to commit again.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, committed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_mring *ring;
+	unsigned int i, actual;
+	unsigned long flags;
+
+	if (num >= KEVENT_MAX_EVENTS)
+		return -EINVAL;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring[0];
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	/* kidx == uidx means an empty ring, so it must take the first branch. */
+	actual = (ring->kidx >= ring->uidx)?
+			(ring->kidx - ring->uidx):
+			(KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+	if (actual < num)
+		num = actual;
+
+	if (start < ring->uidx) {
+		/*
+		 * Some events have been committed through kevent_get_events().
+		 *
+		 *                    ready events
+		 * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+		 *        ring->uidx                      ring->kidx
+		 *      |          |
+		 *    start    start+num
+		 */
+		unsigned int diff = ring->uidx - start;
+
+		if (num < diff)
+			num = 0;
+		else
+			num -= diff;
+	} else if (start > ring->uidx)
+		num = 0;
+
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+
+	for (i=0; i<num; ++i) {
+		k = kqueue_dequeue_ready(u);
+		if (!k)
+			break;
+
+		if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+			kevent_finish_user(k, 1);
+		kevent_stat_mmap(u);
+		committed++;
+	}
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+				u->ready_num >= 1,
+				clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	fput(file);
+
+	return committed;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on a given kevent queue, which is obtained through the kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to an array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+	int err = -EINVAL;
+	struct file *file;
+
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+
+	err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device used to obtain control file descriptors.
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..bc0582b 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,10 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
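A minimal userspace sketch of the interface above, for orientation. The
syscall numbers, the /dev/kevent node name, the KEVENT_CTL_ADD value and the
struct layouts below are assumptions reconstructed from the patch text; the
shared linux/ukevent.h header and the evtest.c example on the project's
homepage are authoritative:

/*
 * Sketch of the kevent userspace API from the patch above.
 * ASSUMPTIONS: syscall numbers, /dev/kevent and the struct layouts are
 * reconstructed guesses, not copies of the real linux/ukevent.h.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define __NR_kevent_get_events	318	/* placeholder, arch dependent */
#define __NR_kevent_ctl		320	/* placeholder, arch dependent */

#define KEVENT_CTL_ADD		0	/* placeholder value */

struct kevent_id {
	union {
		__u32	raw[2];
		__u64	raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;	/* fd, timeout, ... - type specific */
	__u32	type;		/* KEVENT_POLL, KEVENT_SOCKET, KEVENT_TIMER, ... */
	__u32	event;		/* requested event mask */
	__u32	req_flags;	/* KEVENT_REQ_ONESHOT and friends */
	__u32	ret_flags;	/* KEVENT_RET_DONE / KEVENT_RET_BROKEN */
	__u32	ret_data[2];	/* event-specific return data */
	union {
		__u32	user[2];	/* opaque user data, returned untouched */
		void	*ptr;
	};
};

static int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg)
{
	return syscall(__NR_kevent_ctl, fd, cmd, num, arg);
}

static int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
		__u64 timeout_ns, struct ukevent *buf, unsigned flags)
{
	return syscall(__NR_kevent_get_events, fd, min_nr, max_nr, timeout_ns, buf, flags);
}

int main(void)
{
	struct ukevent uk[16];
	int fd, i, num;

	fd = open("/dev/kevent", O_RDWR);	/* misc device registered above */
	if (fd == -1)
		return 1;

	/* ... fill one or more ukevents and submit them: ... */
	/* kevent_ctl(fd, KEVENT_CTL_ADD, n, uk); */

	/* Wait up to one second for at least one ready event. */
	num = kevent_get_events(fd, 1, 16, 1000000000ULL, uk, 0);
	for (i = 0; i < num; ++i)
		printf("event %u: ret_flags 0x%x\n", uk[i].user[0], uk[i].ret_flags);

	close(fd);
	return 0;
}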
* [take22 2/4] kevent: poll/select() notifications.
  2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-11-01 11:36   ` Evgeniy Polyakov
  2006-11-01 11:36     ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similar to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
process wakeup, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_SOCKET
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..94facbb
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
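To make the registration path concrete: a sketch of arming a KEVENT_POLL
event from userspace, reusing the hypothetical struct ukevent and
kevent_ctl() wrapper sketched after the core patch. The KEVENT_POLL and
KEVENT_REQ_ONESHOT values are placeholders; kevent_poll_enqueue() above
resolves the target file with fget(id.raw[0]):

#include <poll.h>
#include <string.h>

#define KEVENT_POLL		3	/* placeholder value */
#define KEVENT_REQ_ONESHOT	0x1	/* placeholder value */

/* Arm a poll-style readiness event on an arbitrary file descriptor. */
static int kevent_add_poll(int ctl_fd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = watched_fd;	/* kevent_poll_enqueue() does fget(id.raw[0]) */
	uk.event = POLLIN | POLLRDNORM;	/* mask matched against f_op->poll() result */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* dequeue after first delivery */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}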
* [take22 3/4] kevent: Socket notifications.
  2006-11-01 11:36   ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-01 11:36     ` Evgeniy Polyakov
  2006-11-01 11:36       ` [take22 4/4] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead of
epoll, its performance increased more than noticeably. More details about
the various benchmarks and the server itself (evserver_kevent.c) can be
found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..5040b4c --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,129 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
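A sketch of the accept-notification path this patch adds, again reusing the
hypothetical wrappers from the note after the core patch. The KEVENT_SOCKET
and KEVENT_SOCKET_* values are placeholders; kevent_socket_enqueue() above
resolves the socket with sockfd_lookup(id.raw[0]):

#include <string.h>

#define KEVENT_SOCKET		2	/* placeholder value */
#define KEVENT_SOCKET_RECV	0x1	/* placeholder values */
#define KEVENT_SOCKET_ACCEPT	0x2
#define KEVENT_SOCKET_SEND	0x4

/* Ask for a notification when a listening socket has pending connections. */
static int kevent_watch_accept(int ctl_fd, int listen_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.id.raw[0] = listen_fd;	/* kevent_socket_enqueue() does sockfd_lookup() */
	uk.event = KEVENT_SOCKET_ACCEPT;

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}

/*
 * The event loop itself stays epoll-like: each returned ukevent still
 * requires a normal accept()/recv()/send() call on the original fd.
 */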
* [take22 4/4] kevent: Timer notifications. 2006-11-01 11:36 ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-01 11:36 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-01 11:36 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..04acc46 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,113 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer); + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + 
struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
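Given the id.raw[] convention stated above (seconds in raw[0], nanoseconds in
raw[1]), arming a periodic timer might look like the sketch below, once more
reusing the hypothetical wrappers from the core-patch note; the KEVENT_TIMER
value is a placeholder. Per kevent_timer_callback() above, each firing
returns jiffies_to_msecs(jiffies) in ret_data[0], and hrtimer_forward()
re-arms the timer, so the event keeps firing until it is removed:

#include <string.h>

#define KEVENT_TIMER		4	/* placeholder value */

/* Arm a periodic 250 ms high-resolution timer. */
static int kevent_add_timer(int ctl_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.id.raw[0] = 0;			/* seconds */
	uk.id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}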
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 11:36 ` [take22 " Evgeniy Polyakov
  2006-11-01 11:36   ` [take22 1/4] kevent: Core files Evgeniy Polyakov
@ 2006-11-01 13:06   ` Pavel Machek
  2006-11-01 13:25     ` Evgeniy Polyakov
  2006-11-01 16:07     ` James Morris
  1 sibling, 2 replies; 200+ messages in thread
From: Pavel Machek @ 2006-11-01 13:06 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Hi!

> Generic event handling mechanism.
> 
> Consider for inclusion.
> 
> Changes from 'take21' patchset:

We are not interested in how many times you spammed us, nor do we want
to know what was wrong in previous versions. It would be nice to have
a short summary of what this is good for, instead.

								Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 13:06   ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek
@ 2006-11-01 13:25     ` Evgeniy Polyakov
  2006-11-01 16:05       ` Pavel Machek
  2006-11-01 16:07     ` James Morris
  1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 13:25 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

On Wed, Nov 01, 2006 at 02:06:14PM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!
> 
> > Generic event handling mechanism.
> > 
> > Consider for inclusion.
> > 
> > Changes from 'take21' patchset:
> 
> We are not interested in how many times you spammed us, nor do we want
> to know what was wrong in previous versions. It would be nice to have
> a short summary of what this is good for, instead.

Let me guess, a short explanation in the subsequent emails is not
enough... If the changelog is removed, then how will people see what
happened since the previous release?

Kevent is a generic subsystem which allows handling of event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows working with essentially any kind of events.
Events are provided to the kernel through a control syscall and can be read
back through an mmapped ring or a syscall.
Kevent update (i.e. readiness switching) happens directly from the internals
of the appropriate state machine of the underlying subsystem (like
network, filesystem, timer or any other).

I will put that text into the introduction message.

> Pavel
> -- 
> Thanks, Sharp!

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
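For the mmapped-ring path mentioned above, consumption could look roughly
like the sketch below. The kevent_mring layout is reconstructed from
kevent_user_ring_add_event() in the core patch (uidx/kidx indexes on page 0
plus truncated events, KEVENTS_ON_PAGE per page); the exact structure, the
KEVENTS_ON_PAGE value and the kevent_wait() syscall number are assumptions.
Note the core patch maps the ring read-only (VM_WRITE is rejected), so
committing has to go through the syscall:

#include <sys/mman.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

#define __NR_kevent_wait	321	/* placeholder, arch dependent */
#define KEVENTS_ON_PAGE		170	/* placeholder, derived from PAGE_SIZE */

struct mring_event {
	struct kevent_id id;		/* which kevent fired (see earlier sketch) */
	__u32	ret_flags;		/* KEVENT_RET_* for that firing */
};

struct kevent_mring {
	__u32			uidx;	/* first not yet committed ready event */
	__u32			kidx;	/* next slot the kernel will fill */
	struct mring_event	event[KEVENTS_ON_PAGE];
};

static int consume_ring(int ctl_fd)
{
	struct kevent_mring *ring;
	unsigned int start, num;

	/* Page 0 carries the indexes; later pages only hold more events. */
	ring = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, ctl_fd, 0);
	if (ring == MAP_FAILED)
		return -1;

	start = ring->uidx;
	num = (ring->kidx >= start) ? ring->kidx - start : 0; /* wrap ignored */

	/* ... process ring->event[start] through ring->event[start + num - 1] ... */

	/* Commit the processed slots so the kernel can reuse them. */
	return syscall(__NR_kevent_wait, ctl_fd, start, num, (__u64)0);
}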
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 13:25     ` Evgeniy Polyakov
@ 2006-11-01 16:05       ` Pavel Machek
  2006-11-01 16:24         ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Pavel Machek @ 2006-11-01 16:05 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

Hi!

> > > Generic event handling mechanism.
> > > 
> > > Consider for inclusion.
> > > 
> > > Changes from 'take21' patchset:
> > 
> > We are not interested in how many times you spammed us, nor do we want
> > to know what was wrong in previous versions. It would be nice to have
> > a short summary of what this is good for, instead.
> 
> Let me guess, a short explanation in the subsequent emails is not
> enough...

Yes.

> Kevent is a generic subsystem which allows handling of event notifications.
> It supports both level and edge triggered events. It is similar to
> poll/epoll in some cases, but it is more scalable, it is faster and
> allows working with essentially any kind of events.

Quantifying "how much more scalable" would be nice, as would be some
example where it is useful. ("It makes my webserver twice as fast on
monster 64-cpu box").

> Events are provided to the kernel through a control syscall and can be read
> back through an mmapped ring or a syscall.
> Kevent update (i.e. readiness switching) happens directly from the internals
> of the appropriate state machine of the underlying subsystem (like
> network, filesystem, timer or any other).
> 
> I will put that text into the introduction message.

Thanks.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 16:05       ` Pavel Machek
@ 2006-11-01 16:24         ` Evgeniy Polyakov
  2006-11-01 18:13           ` Oleg Verych
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 16:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel

On Wed, Nov 01, 2006 at 05:05:51PM +0100, Pavel Machek (pavel@ucw.cz) wrote:
> Hi!

Hi Pavel.

> > Kevent is a generic subsystem which allows handling of event notifications.
> > It supports both level and edge triggered events. It is similar to
> > poll/epoll in some cases, but it is more scalable, it is faster and
> > allows working with essentially any kind of events.
> 
> Quantifying "how much more scalable" would be nice, as would be some
> example where it is useful. ("It makes my webserver twice as fast on
> monster 64-cpu box").

Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
1Gb RAM; an epoll-based one does 2200-2500 req/sec. A 100 Mbit wire is
filled almost 100% (10582.7 KB/s of data without TCP and lower-layer
headers).
More benchmarks created by me and Johann Borck can be found on the
project's homepage, along with all of my sources used in the tests.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 16:24         ` Evgeniy Polyakov
@ 2006-11-01 18:13           ` Oleg Verych
  2006-11-01 18:57             ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Oleg Verych @ 2006-11-01 18:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev

Hallo, Evgeniy Polyakov.

On 2006-11-01, you wrote:
[]
>> Quantifying "how much more scalable" would be nice, as would be some
>> example where it is useful. ("It makes my webserver twice as fast on
>> monster 64-cpu box").
>
> Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
[...]

Seriously. I'm seeing those patches too. New, shiny, always ready "for
inclusion". But considering the kernel (Linux in this case) as not a
thing unto itself, I want to ask the following question.

Where's the real-life application to configure && make && make install?

There were some comments about the lack of such programs; answers were
"was in prev. e-mail", "need to update them", something like that.
The "trivial web server" source URL mentioned in the benchmark isn't
pointed to in the patch advertisement. If it were, should I actually
try that new *trivial* wheel?

Saying that, I want to give you some short examples I know.
*Linux kernel <-> userspace*:
 o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
 o Maxim Krasnyansky tun net driver <-> vtun daemon application;

*Glibc with mister Drepper* has a huge set of tests; please search for
`tst*' files in the sources.

To give you a little hint, Evgeniy: why don't you find a little
animal in the open source zoo to implement a little interface to the
proposed kernel subsystem and then show it to The Big Jury (not me)
we have here? And I cannot see how you've managed to implement
something like that having almost nothing in the test basket.
Very *suspicious* ch.

One that comes to mind is lighttpd <http://www.lighttpd.net/>.
It had a sub-interface for event systems like select, poll and epoll
when I last checked its sources. And it is mature, btw.

Cheers.

[ -*- OT -*- ]
[ I wouldn't write all this, unless I saw your opinion about the    ]
[ reportbug (part of the Debian Bug Tracking System) this week.     ]
[ While I'm nobody here, imho, the first thing about a good         ]
[ programmer must be that he is an excellent user.                  ]
____

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism.
  2006-11-01 18:13           ` Oleg Verych
@ 2006-11-01 18:57             ` Evgeniy Polyakov
  2006-11-02  2:12               ` Nate Diller
  2006-11-03 18:49               ` Oleg Verych
  1 sibling, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-01 18:57 UTC (permalink / raw)
  To: LKML
  Cc: Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper,
	Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
	Chase Venters, Johann Borck

On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote:
> 
> Hallo, Evgeniy Polyakov.

Hello, Oleg.

> On 2006-11-01, you wrote:
> []
> >> Quantifying "how much more scalable" would be nice, as would be some
> >> example where it is useful. ("It makes my webserver twice as fast on
> >> monster 64-cpu box").
> >
> > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
> [...]
> 
> Seriously. I'm seeing those patches too. New, shiny, always ready "for
> inclusion". But considering the kernel (Linux in this case) as not a
> thing unto itself, I want to ask the following question.
> 
> Where's the real-life application to configure && make && make install?

Your real life or mine as a developer?
I fortunately do not know anything about your real life, but my real-life
applications can be found on the project's homepage.
There is a link to an archive there, where you can find plenty of sources.
You likely do not know, but it is a risky business to patch all
existing applications to show that the approach is correct while the
implementation is not complete.
You likely do not know, but after I first announced kevents in
February I changed the interfaces 4 times - and that is just interfaces,
not including numerous features added/removed by developers' requests.

> There were some comments about the lack of such programs; answers were
> "was in prev. e-mail", "need to update them", something like that.
> The "trivial web server" source URL mentioned in the benchmark isn't
> pointed to in the patch advertisement. If it were, should I actually
> try that new *trivial* wheel?

The answer is trivial - there is an archive where one can find the source
code (filenames are posted regularly). Should I create an rpm? For which
glibc version?

> Saying that, I want to give you some short examples I know.
> *Linux kernel <-> userspace*:
>  o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;

iproute documentation was way too bad when Alexey first presented
it :)

>  o Maxim Krasnyansky tun net driver <-> vtun daemon application;
> 
> *Glibc with mister Drepper* has a huge set of tests; please search for
> `tst*' files in the sources.

Btw, show me a 'shiny' splice() application? Does lighttpd use it?
Or move_pages()?

> To give you a little hint, Evgeniy: why don't you find a little
> animal in the open source zoo to implement a little interface to the
> proposed kernel subsystem and then show it to The Big Jury (not me)
> we have here? And I cannot see how you've managed to implement
> something like that having almost nothing in the test basket.
> Very *suspicious* ch.

There are always people who do not like something; what can I do about
that? I present the code, we discuss it, I ask for inclusion (since it is
the only way to get feedback), something requires changes, it is changed
and so on - it is the development process.
I created a 'little animal in the open source zoo' myself to show how
simple kevents are.

> One that comes to mind is lighttpd <http://www.lighttpd.net/>.
> It had a sub-interface for event systems like select, poll and epoll
> when I last checked its sources. And it is mature, btw.

As I have already said several times, I changed just the interfaces 4
times already, since no one seems to know what we really want and how
the interface should look.
You suggest patching lighttpd? Well, it is doable, but then I will be
asked to change apache and nginx. And then someone will suggest changing
the order of parameters. Will you help me rewrite userspace? No, you
will not. You ask for something without providing anything back (not
taking into account code, but discussion, ideas, testing time, nothing),
and you do it in an ultimatum-like manner.
Btw, kevent also supports AIO notifications - do you suggest patching
reactor/proactor for tests? It supports network AIO - do you suggest
writing support for that into apache? What about timers? It is possible
to rewrite all POSIX timer users to use them instead. There is a feature
request for userspace events and signal delivery - what to do with that?
I created trivial web servers which send a single static page and use
various event handling schemes, and I test the new subsystem with new
tools; when tests are completed and all requested features are
implemented, it will be time to work on different, more complex users.
So let's at least complete what we have right now, so that no
developer's efforts are wasted writing empty chars in various places.

> Cheers.
> 
> [ -*- OT -*- ]
> [ I wouldn't write all this, unless I saw your opinion about the    ]
> [ reportbug (part of the Debian Bug Tracking System) this week.     ]
> [ While I'm nobody here, imho, the first thing about a good         ]
> [ programmer must be that he is an excellent user.                  ]
> ____

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 18:57 ` Evgeniy Polyakov @ 2006-11-02 2:12 ` Nate Diller 2006-11-02 6:21 ` Evgeniy Polyakov ` (2 more replies) 2006-11-03 18:49 ` Oleg Verych 1 sibling, 3 replies; 200+ messages in thread From: Nate Diller @ 2006-11-02 2:12 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On 11/1/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote: > > > > Hallo, Evgeniy Polyakov. > > Hello, Oleg. > > > On 2006-11-01, you wrote: > > [] > > >> Quantifying "how much more scalable" would be nice, as would be some > > >> example where it is useful. ("It makes my webserver twice as fast on > > >> monster 64-cpu box"). > > > > > > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with > > [...] > > > > Seriously. I'm seeing those patches also. New, shiny, always ready "for > > inclusion". But considering the kernel (linux in this case) as not a thing > > for itself, i want to ask the following question. > > > > Where's a real-life application to do configure && make && make install? > > Your real life or mine as a developer? > I fortunately do not know anything about your real life, but my real-life > applications can be found on the project's homepage. > There is a link to an archive there, where you can find plenty of sources. > You likely do not know, but it is a risky business to patch all > existing applications to show that an approach is correct while the > implementation is not complete. > You likely do not know, but since I first announced kevents in > February I have changed the interfaces 4 times - and that is just the interfaces, not > counting the numerous features added/removed at developers' requests. > > > There were some comments about lacking much of such programs, answers were > > "was in prev. e-mail", "need to update them", something like that. > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > in the patch advertisement. If it was, should i actually try that new > > *trivial* wheel? > > The answer is trivial - there is an archive where one can find the source code > (filenames are posted regularly). Should I create an rpm? For which glibc > version? > > > Saying that, i want to give you some short examples, i know. > > *Linux kernel <-> userspace*: > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > iproute documentation was way too bad when Alexey presented it the first > time :) > > > o Maxim Krasnyansky tun net driver <-> vtun daemon application; > > > > *Glibc with mister Drepper* has a huge set of tests, please search for > > `tst*' files in the sources. > > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > Or move_pages(). > > > To make a little hint to you, Evgeniy, why don't you find a little > > animal in the open source zoo to implement a little interface to the > > proposed kernel subsystem and then show it to The Big Jury (not me), > > we have here? And i can not see, how you've managed to implement > > something like that having almost nothing on the test basket. > > Very *suspicious* ch. > > There are always people who do not like something - what can I do about > it? I present the code, we discuss it, I ask for inclusion (since it is > the only way to get feedback), something requires changes, it is changed, > and so on - that is the development process. 
> I created a 'little animal in the open source zoo' myself to show how > simple kevents are. > > > One that comes to mind is lighttpd <http://www.lighttpd.net/>. > > It had a sub-interface for event systems like select, poll, epoll, when i > > checked its sources last time. And it is mature, btw. > > As I already said several times, I have changed just the interfaces 4 times > already, since no one seems to know what we really want and how the > interface should look. Indecisiveness has certainly been an issue here, but I remember akpm and Ulrich both giving concrete suggestions. I was particularly interested in Andrew's request to explain and justify the differences between kevent and BSD's kqueue interface. Was there a discussion that I missed? I am very interested in seeing your work on this mechanism merged, because you've clearly emphasized performance and shown impressive results. But it seems like we lose a lot by throwing out all the applications that already use kqueue. NATE ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 2:12 ` Nate Diller @ 2006-11-02 6:21 ` Evgeniy Polyakov 2006-11-02 19:40 ` Nate Diller [not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com> 2006-11-07 12:02 ` Jeff Garzik 2 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-02 6:21 UTC (permalink / raw) To: Nate Diller Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller (nate.diller@gmail.com) wrote: > Indecisiveness has certainly been an issue here, but I remember akpm > and Ulrich both giving concrete suggestions. I was particularly > interested in Andrew's request to explain and justify the differences > between kevent and BSD's kqueue interface. Was there a discussion > that I missed? I am very interested in seeing your work on this > mechanism merged, because you've clearly emphasized performance and > shown impressive results. But it seems like we lose a lot by > throwing out all the applications that already use kqueue. It looks like you missed that discussion - the FreeBSD kevent structure has fields which have different sizes in 32-bit and 64-bit environments. > NATE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
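[Editor's illustration, not part of the original thread.] To make the size problem concrete, here is a minimal sketch: it reuses the *BSD struct kevent layout that Evgeniy quotes later in this discussion and simply prints its size. On a common ILP32 target the structure is typically 20 bytes, while on LP64 it is typically 32, because ident, data and udata all grow with the pointer size - which is exactly why a 32-bit userland cannot exchange it unchanged with a 64-bit kernel.

#include <stdio.h>
#include <stdint.h>

/* Layout copied from the BSD header quoted later in this thread. */
struct bsd_kevent {
	uintptr_t	ident;	/* identifier for this event */
	short		filter;	/* filter for event */
	unsigned short	flags;	/* action flags for kqueue */
	unsigned int	fflags;	/* filter flag value */
	intptr_t	data;	/* filter data value */
	void		*udata;	/* opaque user data identifier */
};

int main(void)
{
	/* Typically prints 20 on ILP32 builds and 32 on LP64 builds. */
	printf("sizeof(struct bsd_kevent) = %zu\n", sizeof(struct bsd_kevent));
	return 0;
}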
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 6:21 ` Evgeniy Polyakov @ 2006-11-02 19:40 ` Nate Diller 2006-11-03 8:42 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Nate Diller @ 2006-11-02 19:40 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On 11/1/06, Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller (nate.diller@gmail.com) wrote: > > Indecisiveness has certainly been an issue here, but I remember akpm > > and Ulrich both giving concrete suggestions. I was particularly > > interested in Andrew's request to explain and justify the differences > > between kevent and BSD's kqueue interface. Was there a discussion > > that I missed? I am very interested in seeing your work on this > > mechanism merged, because you've clearly emphasized performance and > > shown impressive results. But it seems like we lose a lot by > > throwing out all the applications that already use kqueue. > > It looks like you missed that discussion - the FreeBSD kevent structure has > fields which have different sizes in 32-bit and 64-bit environments. Are you saying that the *only* reason we choose not to be source-compatible with BSD is the 32-bit userland on 64-bit arch problem? I've followed every thread that a gmail 'kqueue' search returns; which thread are you referring to? Nicholas Miell, in "The Proposed Linux kevent API" thread, seems to think that there are no advantages over kqueue to justify the incompatibility, an argument you made no effort to refute. I've also read the Kevent wiki at linux-net.osdl.org, but it too is lacking in any direct comparisons (even theoretical, let alone benchmarks) of the flexibility, performance, etc. between the two. I'm not arguing that you've done a bad design, I'm asking you to brag about the things you improved on vs. kqueue. Your emphasis on unifying all the different event types into one interface is really cool; fill me in on why that can't be effectively done with kqueue compatibility and I too will advocate for kevent inclusion. NATE ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 19:40 ` Nate Diller @ 2006-11-03 8:42 ` Evgeniy Polyakov 2006-11-03 8:57 ` Pavel Machek 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 8:42 UTC (permalink / raw) To: Nate Diller Cc: LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Thu, Nov 02, 2006 at 11:40:43AM -0800, Nate Diller (nate.diller@gmail.com) wrote: > Are you saying that the *only* reason we choose not to be > source-compatible with BSD is the 32-bit userland on 64-bit arch > problem? I've followed every thread that a gmail 'kqueue' search I.e., do you want a generic event handling mechanism that does not work on x86_64? I doubt you do. > returns; which thread are you referring to? Nicholas Miell, in "The > Proposed Linux kevent API" thread, seems to think that there are no > advantages over kqueue to justify the incompatibility, an argument you > made no effort to refute. I've also read the Kevent wiki at > linux-net.osdl.org, but it too is lacking in any direct comparisons > (even theoretical, let alone benchmarks) of the flexibility, > performance, etc. between the two. > > I'm not arguing that you've done a bad design, I'm asking you to brag > about the things you improved on vs. kqueue. Your emphasis on > unifying all the different event types into one interface is really > cool; fill me in on why that can't be effectively done with kqueue > compatibility and I too will advocate for kevent inclusion. kqueue just can not be used as-is in Linux (_maybe_ *bsd has different types, not those which I found in /usr/include on my FC5 and Debian distros). It will not work on x86_64, for example. A pointer or an unsigned long in a structure which is transferred between kernelspace and userspace is questionable enough that it is better not to go there at all... (without political correctness I would describe it in much stronger words). So, the kqueue API and structures can not be used in Linux. > NATE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:42 ` Evgeniy Polyakov @ 2006-11-03 8:57 ` Pavel Machek 2006-11-03 9:04 ` David Miller 2006-11-03 9:13 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Pavel Machek @ 2006-11-03 8:57 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Hi! > > returns; which thread are you referring to? Nicholas Miell, in "The > > Proposed Linux kevent API" thread, seems to think that there are no > > advantages over kqueue to justify the incompatibility, an argument you > > made no effort to refute. I've also read the Kevent wiki at > > linux-net.osdl.org, but it too is lacking in any direct comparisons > > (even theoretical, let alone benchmarks) of the flexibility, > > performance, etc. between the two. > > > > I'm not arguing that you've done a bad design, I'm asking you to brag > > about the things you improved on vs. kqueue. Your emphasis on > > unifying all the different event types into one interface is really > > cool; fill me in on why that can't be effectively done with kqueue > > compatibility and I too will advocate for kevent inclusion. > > kqueue just can not be used as-is in Linux (_maybe_ *bsd has different > types, not those which I found in /usr/include on my FC5 and Debian > distros). It will not work on x86_64, for example. A pointer > or an unsigned long in a structure which is transferred between kernelspace > and userspace is questionable enough that it is better not to go there > at all... (without political correctness I would > describe it in much stronger words). > So, the kqueue API and structures can not be used in Linux. Not sure what you are smoking, but "there's an unsigned long in the *bsd version, let's rewrite it from scratch" sounds like a very bad idea. What about fixing that one bit you don't like? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:57 ` Pavel Machek @ 2006-11-03 9:04 ` David Miller 2006-11-07 12:05 ` Jeff Garzik 2006-11-03 9:13 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-03 9:04 UTC (permalink / raw) To: pavel Cc: johnpol, nate.diller, linux-kernel, olecom, drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck From: Pavel Machek <pavel@ucw.cz> Date: Fri, 3 Nov 2006 09:57:12 +0100 > Not sure what you are smoking, but "there's an unsigned long in the *bsd > version, let's rewrite it from scratch" sounds like a very bad idea. What > about fixing that one bit you don't like? I disagree; it's more like: since we have to be structure-incompatible anyway, let's design something superior if we can. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 9:04 ` David Miller @ 2006-11-07 12:05 ` Jeff Garzik 0 siblings, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-07 12:05 UTC (permalink / raw) To: David Miller Cc: pavel, johnpol, nate.diller, linux-kernel, olecom, drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck David Miller wrote: > From: Pavel Machek <pavel@ucw.cz> > Date: Fri, 3 Nov 2006 09:57:12 +0100 > >> Not sure what you are smoking, but "there's an unsigned long in the *bsd >> version, let's rewrite it from scratch" sounds like a very bad idea. What >> about fixing that one bit you don't like? > > I disagree; it's more like: since we have to be structure-incompatible > anyway, let's design something superior if we can. Definitely agreed. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 8:57 ` Pavel Machek 2006-11-03 9:04 ` David Miller @ 2006-11-03 9:13 ` Evgeniy Polyakov 2006-11-05 11:19 ` Pavel Machek 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 9:13 UTC (permalink / raw) To: Pavel Machek Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > So, the kqueue API and structures can not be used in Linux. > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > version, let's rewrite it from scratch" sounds like a very bad idea. What > about fixing that one bit you don't like? It is not about what I like or dislike, but about what is broken and what is not. Putting a u64 instead of a long or something like that _is_ already incompatible, so why should we even use it? And, btw, what are we talking about? Is it about the whole of kevent compared to kqueue in kernelspace, or just about the structure being transferred between kernelspace and userspace? I'm sure it was some kind of joke to 'not rewrite *bsd from scratch and use kqueue in the Linux kernel as-is'. > Pavel > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 9:13 ` Evgeniy Polyakov @ 2006-11-05 11:19 ` Pavel Machek 2006-11-05 11:43 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Pavel Machek @ 2006-11-05 11:19 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Hi! On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote: > On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > > So, the kqueue API and structures can not be used in Linux. > > > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > > version, let's rewrite it from scratch" sounds like a very bad idea. What > > about fixing that one bit you don't like? > > It is not about what I like or dislike, but about what is broken and what is not. > Putting a u64 instead of a long or something like that _is_ already incompatible, > so why should we even use it? Well... u64 vs. unsigned long *is* binary incompatible, but it is similar enough that it is going to be compatible at the source level, or maybe a userland app will need *minor* ifdefs... That's better than two completely different versions... > And, btw, what are we talking about? Is it about the whole of kevent > compared to kqueue in kernelspace, or just about the structure being > transferred between kernelspace and userspace? > I'm sure it was some kind of joke to 'not rewrite *bsd from scratch > and use kqueue in the Linux kernel as-is'. No, it is probably not possible to take code from the BSD kernel and "just port it". But keeping the same or a similar userland interface would be nice. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-05 11:19 ` Pavel Machek @ 2006-11-05 11:43 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-05 11:43 UTC (permalink / raw) To: Pavel Machek Cc: Nate Diller, LKML, Oleg Verych, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Sun, Nov 05, 2006 at 12:19:33PM +0100, Pavel Machek (pavel@ucw.cz) wrote: > Hi! > > On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote: > > On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek (pavel@ucw.cz) wrote: > > > > So, the kqueue API and structures can not be used in Linux. > > > > > > Not sure what you are smoking, but "there's an unsigned long in the *bsd > > > version, let's rewrite it from scratch" sounds like a very bad idea. What > > > about fixing that one bit you don't like? > > > > It is not about what I like or dislike, but about what is broken and what is not. > > Putting a u64 instead of a long or something like that _is_ already incompatible, > > so why should we even use it? > > Well... u64 vs. unsigned long *is* binary incompatible, but it is > similar enough that it is going to be compatible at the source level, or > maybe a userland app will need *minor* ifdefs... That's better than two > completely different versions... > > > And, btw, what are we talking about? Is it about the whole of kevent > > compared to kqueue in kernelspace, or just about the structure being > > transferred between kernelspace and userspace? > > I'm sure it was some kind of joke to 'not rewrite *bsd from scratch > > and use kqueue in the Linux kernel as-is'. > > No, it is probably not possible to take code from the BSD kernel and "just > port it". But keeping the same or a similar userland interface would be nice. It is not merely improbable - it is impossible to take the FreeBSD kqueue code and port it; such a port would be a completely different system. It is impossible to have the same event structure; one would have to write: #if defined kqueue /* fill all members of the structure */ #else if defined kevent /* fill differently named members, since Linux does not even have some of the types */ #endif *BSD kevent (the structure transferred between userspace and kernelspace): struct kevent { uintptr_t ident; /* identifier for this event */ short filter; /* filter for event */ u_short flags; /* action flags for kqueue */ u_int fflags; /* filter flag value */ intptr_t data; /* filter data value */ void *udata; /* opaque user data identifier */ }; You must fill all fields differently because of the above. Just an example: Linux kevent has an extended ID field which is grouped into type.event, while kqueue has a pointer-sized ident and a short filter. Linux kevent does not have filters; instead it has generic storages of events which can be processed in any way the origin of the storage wants (this, for example, allows the creation of aio_sendfile() (currently dropped from the patchset), which no other system in the wild has). There are too many differences. They are simply different systems. Even if both can be described by the sentence "a system which handles events", it does not mean that they are the same, can use the same structures, or even have a similar design. Kevent is not kqueue in any way (although there are certain similarities), so they can not share anything. > Pavel > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
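[Editor's sketch, not from the thread.] As an illustration of the divergence Evgeniy describes above, the following hedged example shows what the two fill paths might look like side by side. The kqueue side uses the standard EV_SET() macro; the kevent side is assumption-laden: the header name, the constants and the member names follow the take23 description posted later in this thread and may not match the real patchset headers exactly.

#include <string.h>

#ifdef USE_BSD_KQUEUE
#include <sys/event.h>

/* kqueue: the fd goes into the pointer-sized ident field and the
 * event class into the short filter field. */
static void fill_read_event(struct kevent *kev, int fd, void *udata)
{
	EV_SET(kev, fd, EVFILT_READ, EV_ADD, 0, 0, udata);
}
#else
#include <linux/ukevent.h>	/* assumed header name */

/* Linux kevent: the id is grouped into type/event instead of a single
 * pointer-sized ident, and user data is a fixed-size union rather than
 * a bare pointer. KEVENT_SOCK / SOCK_RECV follow the examples in the
 * take23 description and are illustrative, not authoritative. */
static void fill_read_event(struct ukevent *uk, int fd, void *udata)
{
	memset(uk, 0, sizeof(*uk));
	uk->id.raw[0] = fd;
	uk->type = KEVENT_SOCK;
	uk->event = SOCK_RECV;
	uk->ptr = udata;
}
#endif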
* Re: [take22 0/4] kevent: Generic event handling mechanism. [not found] ` <4549A261.9010007@cosmosbay.com> @ 2006-11-03 2:42 ` zhou drangon 2006-11-03 9:16 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: zhou drangon @ 2006-11-03 2:42 UTC (permalink / raw) To: Eric Dumazet Cc: linux-kernel, Evgeniy Polyakov, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, drangon.zhou 2006/11/2, Eric Dumazet <dada1@cosmosbay.com>: > zhou drangon a écrit : > > performance is great, and we are excited about the result. > > > > I want to know why there can be so much improvement; can we improve > > epoll too? > > Why did you remove most of the CC addresses but lkml? > Don't do that please... I seldom reply to the mailing list; sorry for this. > > Good question :) > > Hum, I think I can look into epoll and see how it can be improved (if necessary) > I have another question. For the VFS system, when we introduced the AIO mechanism, we added aio_read, aio_write, etc... to the file ops, and then made the read and write ops call aio_read and aio_write, so that only one implementation remains in the kernel. Can we do the event mechanism the same way? When kevent is robust enough, can we implement epoll/select/io_submit etc... based on kevent? In this way we could simplify the kernel, and epoll could gain improvement from kevent. > This is not to say we dont need kevent ! Please Evgeniy continue your work ! Yes! We are expecting your great work. I created a userland event-driven framework for my application, but I have to use multiple threads to receive events: epoll to wait for most events and io_getevents to wait for disk AIO events. I hope we can get a universal event mechanism to make the code elegant. > > Just to remind you that according to > http://www.xmailserver.org/linux-patches/nio-improve.html David Libenzi had to > wait 18 months before epoll was officially added into the kernel. > > At that time, many applications were using epoll, and we were patching our > kernels for that. > > > I cooked a very simple program (attached in this mail), using pipes and epoll, > and got 250.000 events received per second on an otherwise lightly loaded > machine (dual opteron 246, 2GHz, 1MB cache per cpu) with 10.000 pipes (20.000 > handles) > > It could be nice to add support for other event providers in this program > (AF_INET & AF_UNIX sockets for example), and also add support for kevent, so > that we really can compare epoll/kevent without a complex setup. > I should extend the program to also add/remove sources during its lifetime, not > only insert at setup time. > > # gcc -O2 -o epoll_pipe_bench epoll_pipe_bench.c -lpthread > # ulimit -n 1000000 > # epoll_pipe_bench -n 10000 > ^C after a while... > > oprofile results say that ep_poll_callback() and sys_epoll_wait() use 20% of > cpu time. > Even if we gain a factor of two in cpu time or cache usage, we won't eliminate > other costs... 
> > oprofile results gave : > > Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit > mask of 0x00 (No unit mask) count 50000 > samples % symbol name > 2015420 11.1309 ep_poll_callback > 1867431 10.3136 pipe_writev > 1791872 9.8963 sys_epoll_wait > 1357297 7.4962 fget_light > 1277515 7.0556 pipe_readv > 998447 5.5143 current_fs_time > 801597 4.4271 __mark_inode_dirty > 755268 4.1713 __wake_up > 587065 3.2423 __write_lock_failed > 582931 3.2195 system_call > 297132 1.6410 iov_fault_in_pages_read > 296136 1.6355 sys_write > 290106 1.6022 __wake_up_common > 270692 1.4950 bad_pipe_w > 261516 1.4443 do_pipe > 257208 1.4205 tg3_start_xmit_dma_bug > 254917 1.4079 pipe_poll > 252925 1.3969 copy_user_generic_c > 234212 1.2935 generic_pipe_buf_map > 228659 1.2629 ret_from_sys_call > 212541 1.1738 sysret_check > 166529 0.9197 sys_read > 160038 0.8839 vfs_write > 151091 0.8345 pipe_ioctl > 136301 0.7528 file_update_time > 107173 0.5919 tg3_poll > 77846 0.4299 ipt_do_table > 75081 0.4147 schedule > 73059 0.4035 vfs_read > 69787 0.3854 get_task_comm > 63923 0.3530 memcpy > 60019 0.3315 touch_atime > 57490 0.3175 eventpoll_release_file > 56152 0.3101 tg3_write_flush_reg32 > 54468 0.3008 rw_verify_area > 47833 0.2642 generic_pipe_buf_unmap > 47777 0.2639 __switch_to > 44106 0.2436 bad_pipe_r > 41824 0.2310 proc_nr_files > 41319 0.2282 pipe_iov_copy_from_user > > > Eric > > > > /* > * How to stress epoll > * > * This program uses many pipes and two threads. > * First we open as many pipes we can. (see ulimit -n) > * Then we create a worker thread. > * The worker thread will send bytes to random pipes. > * The main thread uses epoll to collect ready pipes and read them. > * Each second, a number of collected bytes is printed on stderr > * > * Usage : epoll_bench [-n X] > */ > #include <pthread.h> > #include <stdlib.h> > #include <errno.h> > #include <stdio.h> > #include <string.h> > #include <sys/epoll.h> > #include <signal.h> > #include <unistd.h> > #include <sys/time.h> > > int nbpipes = 1024; > > struct pipefd { > int fd[2]; > } *tab; > > int epoll_fd; > > static int alloc_pipes() > { > int i; > > epoll_fd = epoll_create(nbpipes); > if (epoll_fd == -1) { > perror("epoll_create"); > return -1; > } > tab = malloc(sizeof(struct pipefd) * nbpipes); > if (tab ==NULL) { > perror("malloc"); > return -1; > } > for (i = 0 ; i < nbpipes ; i++) { > struct epoll_event ev; > if (pipe(tab[i].fd) == -1) > break; > ev.events = EPOLLIN | EPOLLOUT | EPOLLHUP | EPOLLPRI | EPOLLET; > ev.data.u64 = (uint64_t)i; > epoll_ctl(epoll_fd, EPOLL_CTL_ADD, tab[i].fd[0], &ev); > } > nbpipes = i; > printf("%d pipes setup\n", nbpipes); > return 0; > } > > > unsigned long nbhandled; > static void timer_func() > { > char buffer[32]; > size_t len; > static unsigned long old; > unsigned long delta = nbhandled - old; > old = nbhandled; > len = sprintf(buffer, "%lu\n", delta); > write(2, buffer, len); > } > > static void timer_setup() > { > struct itimerval it; > struct sigaction sg; > > memset(&sg, 0, sizeof(sg)); > sg.sa_handler = timer_func; > sigaction(SIGALRM, &sg, 0); > it.it_interval.tv_sec = 1; > it.it_interval.tv_usec = 0; > it.it_value.tv_sec = 1; > it.it_value.tv_usec = 0; > if (setitimer(ITIMER_REAL, &it, 0)) > perror("setitimer"); > } > > static void * worker_thread_func(void *arg) > { > int fd; > char c = 1; > for (;;) { > fd = rand() % nbpipes; > write(tab[fd].fd[1], &c, 1); > } > } > > > int main(int argc, char *argv[]) > { > char buff[1024]; > pthread_t tid; > int c; > > while ((c = getopt(argc, argv, 
"n:")) != EOF) { > if (c == 'n') nbpipes = atoi(optarg); > } > alloc_pipes(); > pthread_create(&tid, NULL, worker_thread_func, (void *)0); > timer_setup(); > > for (;;) { > struct epoll_event events[128]; > int nb = epoll_wait(epoll_fd, events, 128, 10000); > int i, fd; > for (i = 0 ; i < nb ; i++) { > fd = tab[events[i].data.u64].fd[0]; > if (read(fd, buff, 1024) > 0) > nbhandled++; > } > } > } > > > ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 2:42 ` zhou drangon @ 2006-11-03 9:16 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-03 9:16 UTC (permalink / raw) To: zhou drangon Cc: Eric Dumazet, linux-kernel, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, drangon.zhou On Fri, Nov 03, 2006 at 10:42:04AM +0800, zhou drangon (drangon.mail@gmail.com) wrote: > For the VFS system, when we introduced the AIO mechanism, we added aio_read, > aio_write, etc... to the file ops, and then made the read and write ops > call aio_read > and aio_write, so that only one implementation remains in the kernel. > Can we do the event mechanism the same way? > When kevent is robust enough, can we implement epoll/select/io_submit etc... > based on kevent? > In this way we could simplify the kernel, and epoll could gain > improvement from kevent. There is an AIO implementation on top of kevent; although it was confirmed that it has a good design, except for minor API layering changes, it was postponed for a while. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-02 2:12 ` Nate Diller 2006-11-02 6:21 ` Evgeniy Polyakov [not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com> @ 2006-11-07 12:02 ` Jeff Garzik 2 siblings, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-07 12:02 UTC (permalink / raw) To: Nate Diller Cc: Evgeniy Polyakov, LKML, Oleg Verych, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck Nate Diller wrote: > Indecisiveness has certainly been an issue here, but I remember akpm > and Ulrich both giving concrete suggestions. I was particularly > interested in Andrew's request to explain and justify the differences > between kevent and BSD's kqueue interface. Was there a discussion > that I missed? I am very interested in seeing your work on this > mechanism merged, because you've clearly emphasized performance and > shown impressive results. But it seems like we lose a lot by > throwing out all the applications that already use kqueue. kqueue looks pretty nice, the filter/note models in particular. I don't see anything about ring buffers, though. I also wonder about the asynchronous event side (send), not just the event reception side. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 18:57 ` Evgeniy Polyakov 2006-11-02 2:12 ` Nate Diller @ 2006-11-03 18:49 ` Oleg Verych 2006-11-04 10:24 ` Evgeniy Polyakov 2006-11-04 17:47 ` Evgeniy Polyakov 1 sibling, 2 replies; 200+ messages in thread From: Oleg Verych @ 2006-11-03 18:49 UTC (permalink / raw) To: Evgeniy Polyakov Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Wed, Nov 01, 2006 at 09:57:46PM +0300, Evgeniy Polyakov wrote: > On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych (olecom@flower.upol.cz) wrote: [] > > Where's a real-life application to do configure && make && make install? > > Your real life or mine as a developer? > I fortunately do not know anything about your real life, but my real-life To avoid shifting the conversation further in a non-technical direction, take my sentence as a question *and* as a definition. > applications can be found on the project's homepage. > There is a link to an archive there, where you can find plenty of sources. But not a single makefile. Or do CC and its options really not matter? You can easily find in your server's apache logs my visit to that archive on the day of my message (today i just confirmed my assertions): browser lynx, host flower.upol.cz. > You likely do not know, but it is a risky business to patch all > existing applications to show that an approach is correct while the > implementation is not complete. Fortunately for me, `lighttpd' is real-life *and* also in the benchmark area. Just see on that site how much was measured: different OSes, special tuning. *That* is what i'm talking about. The epoll _wrapper_ there is 3461 bytes long; your answer to _me_, 2580. People are bringing you a test bed with everything set up ready to use; if you need less code, go on, comment the needless parts out! > You likely do not know, but since I first announced kevents in > February I have changed the interfaces 4 times - and that is just the interfaces, not > counting the numerous features added/removed at developers' requests. I think that is called open source, in the linux kernel case. > > There were some comments about lacking much of such programs, answers were > > "was in prev. e-mail", "need to update them", something like that. > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > in the patch advertisement. If it was, should i actually try that new > > *trivial* wheel? > > The answer is trivial - there is an archive where one can find the source code > (filenames are posted regularly). Should I create an rpm? For which glibc > version? Hmm. Let me answer that "dup" with material from the LKML archive. It will reveal that my guesses had already been told to you by The Big Jury: [^0] Message-ID: 44CA66D8.3010404@oracle.com [^1] Message-ID: 20060818104120.GA20816@infradead.org, Message-ID: 20060816133014.GB32499@infradead.org more than 10 takes ago. > > Saying that, i want to give you some short examples, i know. > > *Linux kernel <-> userspace*: > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > iproute documentation was way too bad when Alexey presented it the first > time :) As an example: after reading some books on TCP/IP and Ethernet, the internal help of `ip' was all i needed to know. > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > Or move_pages(). You know who proposed that, and you know how many (few) releases ago. 
> > To make a little hint to you, Evgeniy, why don't you find a little > > animal in the open source zoo to implement a little interface to the > > proposed kernel subsystem and then show it to The Big Jury (not me), > > we have here? And i can not see, how you've managed to implement > > something like that having almost nothing on the test basket. > > Very *suspicious* ch. > > There are always people who do not like something - what can I do about I didn't think that my message was offensive. Also, i didn't even mention that you have not bothered to feed your code to "scripts/Lindent". [] > I created trivial web servers which send a single static page and use > various event handling schemes, and I test the new subsystem with new tools; > when the tests are completed and all requested features are implemented, it > will be time to work on different, more complex users. Please, see [^0], > So let's at least complete what we have right now, so that no developer's > effort is wasted writing empty chars in various places. and [^1]. [ Please do not answer just to answer; the cc list is big, no one from ] [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] Friendly, Oleg. ____ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 18:49 ` Oleg Verych @ 2006-11-04 10:24 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-04 10:24 UTC (permalink / raw) To: Oleg Verych Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych (olecom@flower.upol.cz) wrote: > > applications can be found on the project's homepage. > > There is a link to an archive there, where you can find plenty of sources. > > But not a single makefile. Or do CC and its options really not matter? > You can easily find in your server's apache logs my visit to that > archive on the day of my message (today i just confirmed my assertions): > browser lynx, host flower.upol.cz. If you can not compile those sources, then you should not use kevent for a while. Definitely. The options are pretty simple: -W -Wall -I$(path_to_kernel_tree)/include > > You likely do not know, but it is a risky business to patch all > > existing applications to show that an approach is correct while the > > implementation is not complete. > > Fortunately for me, `lighttpd' is real-life *and* also in the benchmark > area. Just see on that site how much was measured: different OSes, > special tuning. *That* is what i'm talking about. The epoll _wrapper_ there > is 3461 bytes long; your answer to _me_, 2580. People are bringing you a > test bed with everything set up ready to use; if you need less code, go on, > comment the needless parts out! So what? People bring me tons of various stuff, and I prefer to use my own for tests. If _you_ need it, _you_ can always patch any sources you like. > > You likely do not know, but since I first announced kevents in > > February I have changed the interfaces 4 times - and that is just the > > interfaces, not counting the numerous features added/removed at > > developers' requests. > > I think that is called open source, in the linux kernel case. You missed the point - I'm not going to patch tons of existing applications when I'm asked to change an interface once per month. When all requested features are implemented, I will definitely patch some popular web server to show how kevent is used. > > > There were some comments about lacking much of such programs, answers were > > > "was in prev. e-mail", "need to update them", something like that. > > > "Trivial web server" sources url, mentioned in benchmark, isn't pointed to > > > in the patch advertisement. If it was, should i actually try that new > > > *trivial* wheel? > > > > The answer is trivial - there is an archive where one can find the source code > > (filenames are posted regularly). Should I create an rpm? For which glibc > > version? > > Hmm. Let me answer that "dup" with material from the LKML archive. It > will reveal that my guesses had already been told to you by The Big Jury: > > [^0] Message-ID: 44CA66D8.3010404@oracle.com > [^1] Message-ID: 20060818104120.GA20816@infradead.org, > Message-ID: 20060816133014.GB32499@infradead.org > > more than 10 takes ago. And? Please provide a link to the archive. > > > Saying that, i want to give you some short examples, i know. > > > *Linux kernel <-> userspace*: > > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities; > > > > iproute documentation was way too bad when Alexey presented it the first > > time :) > > As an example: after reading some books on TCP/IP and Ethernet, the internal > help of `ip' was all i needed to know. :)) i.e. 
it is ok for you to 'read some books on TCP/IP and Ethernet' to understand how a utility works, but it is not ok to work out how to compile my sources? Do not compile my sources. > > Btw, show me a 'shiny' splice() application? Does lighttpd use it? > > Or move_pages(). > > You know who proposed that, and you know how many (few) releases ago. And why does lighttpd still not use it? You should start by blaming the authors of splice() for that. You will not? Then I can not take the words you direct at me seriously. > > > To make a little hint to you, Evgeniy, why don't you find a little > > > animal in the open source zoo to implement a little interface to the > > > proposed kernel subsystem and then show it to The Big Jury (not me), > > > we have here? And i can not see, how you've managed to implement > > > something like that having almost nothing on the test basket. > > > Very *suspicious* ch. > > > > There are always people who do not like something - what can I do about > > I didn't think that my message was offensive. Also, i didn't even mention > that you have not bothered to feed your code to "scripts/Lindent". You do not use kevent, so why do you care about the indentation of the userspace tools? > [] > > I created trivial web servers which send a single static page and use > > various event handling schemes, and I test the new subsystem with new tools; > > when the tests are completed and all requested features are implemented, it > > will be time to work on different, more complex users. > > Please, see [^0], > > > So let's at least complete what we have right now, so that no developer's > > effort is wasted writing empty chars in various places. > > and [^1]. > > [ Please do not answer just to answer; the cc list is big, no one from ] > [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] This thread is just answering for the sake of answering - there is no sense in it at all. You blame me for not creating the benchmarks you like, but I do not care about that. I created a useful patch and I test it in the way I like, because that is much more productive than spending a lot of time determining how different sources behave under appropriate loads. When there is a strong requirement to perform additional tests, I will do them. > Friendly, Oleg. > ____ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-03 18:49 ` Oleg Verych 2006-11-04 10:24 ` Evgeniy Polyakov @ 2006-11-04 17:47 ` Evgeniy Polyakov 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-04 17:47 UTC (permalink / raw) To: Oleg Verych Cc: LKML, Pavel Machek, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych (olecom@flower.upol.cz) wrote: > [ Please do not answer just to answer; the cc list is big, no one from ] > [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ] > > Friendly, Oleg. Just in case some misunderstanding happened: I do not want to insult anyone who is against kevent; I just do not understand cases when people demand, in a rude manner, that I do something to convince them. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take22 0/4] kevent: Generic event handling mechanism. 2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek 2006-11-01 13:25 ` Evgeniy Polyakov @ 2006-11-01 16:07 ` James Morris 1 sibling, 0 replies; 200+ messages in thread From: James Morris @ 2006-11-01 16:07 UTC (permalink / raw) To: Pavel Machek Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Wed, 1 Nov 2006, Pavel Machek wrote: > Hi! > > > Generic event handling mechanism. > > > > Consider for inclusion. > > > > Changes from 'take21' patchset: > > We are not interested in how many times you spammed us, nor do we want > to know what was wrong in previous versions. It would be nice to have a > short summary of what this is good for, instead. I'm interested in knowing which version the patches belong to and what has changed (geez, it's rare enough that someone actually bothers to do this with an updated patchset - and you complain about it?). - James -- James Morris <jmorris@namei.org> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take23 0/5] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> 2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov 2006-11-01 11:36 ` [take22 " Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov ` (2 subsequent siblings) 5 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem for handling event notifications. It supports both level- and edge-triggered events. It is similar to poll/epoll in some cases, but it is more scalable, it is faster, and it can work with essentially any kind of event. Events are fed into the kernel through a control syscall and can be read back through an mmapped ring or a syscall. Kevent updates (i.e. readiness switching) happen directly from the internals of the appropriate state machine of the underlying subsystem (network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent Consider for inclusion. Changes from 'take22' patchset: * new ring buffer implementation in process' memory * wakeup-one-thread flag * edge-triggered behaviour With this release an additional independent benchmark shows kevent's speed compared to epoll: Eric Dumazet created a special benchmark which creates a set of AF_INET sockets; two threads then start to simultaneously read and write data from/into them. Here are the results: epoll (no EPOLLET): 57428 events/sec kevent (no ET): 59794 events/sec epoll (with EPOLLET): 71000 events/sec kevent (with ET): 78265 events/sec Maximum (busy loop reading events): 88482 events/sec Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespace and so on) * fixed a bug in kevent removal for the case when the kevent being removed is the same as the overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed the artificial limit on the possible number of kevents With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit lan; data IO over the network was about 10582.7 KB/s, which is quite close to wire speed if we take into account headers and the like. Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use an RB tree instead of a hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. 
most of the time events are added, accessed only a couple of times and then removed, which justifies RB tree usage over an AVL tree, since the latter has a much slower deletion time (max O(log(N)) compared to 3 ops), although a faster search time (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed the readiness check for socket notifications. With both of the above changes it is possible to achieve more than 3380 req/second, compared to 2200, sometimes 2500 req/second for epoll(), for a trivial web server and an httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum number of kevents allowed in a time limit, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created a special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either the timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage. 
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not take the lock around the user data check in __kevent_search() * fail early if there were no registered callbacks for the given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use a pointer to kevent_mring instead of unsigned longs * use an aligned 64bit type in raw user data (can be used by the high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for the timeout * put the number of milliseconds into the timer's return data * move some definitions into the user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into the patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for the ready_callback() callback before the main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added a helper function for kevent_registered_callbacks * fixed 80-line comment issues * added a header shared between userspace and kernelspace instead of embedding them in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use the nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use a structure instead - reduced the amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments! 
* do not use list poisoning to detect whether an entry is in the list * return the number of ready kevents even if copy*user() fails * strict check for the number of kevents in the syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of a direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed the default 1 second timeout * removed AIO stuff from the patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed the serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught to differentiate between the various cases * remove a kevent from storage if it is marked as broken after the callback * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() into locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use an array of callbacks for each type instead of per-kevent callback initialization * changed the name of the ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use the kevent_user_ctl structure, instead provide the needed arguments as syscall parameters * various indent cleanups * added an optimisation aimed to help when a lot of kevents are being copied from userspace * mapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against the 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use the old syscall kevent_ctl for creation/removal, modification and initial kevent initialization - use mutexes instead of semaphores - added a file descriptor check and return an error if the provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take23 1/5] kevent: Description. 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Description. int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg); fd - the file descriptor referring to the kevent queue to manipulate. It is created by opening the "/dev/kevent" char device, which is created with a dynamic minor number and the major number assigned to misc devices. cmd - the requested operation. It can be one of the following: KEVENT_CTL_ADD - add event notification KEVENT_CTL_REMOVE - remove event notification KEVENT_CTL_MODIFY - modify existing notification num - number of struct ukevent in the array pointed to by arg arg - array of struct ukevent When called, kevent_ctl will carry out the operation specified in the cmd parameter. ------------------------------------------------------------------------------------- int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, __u64 timeout, struct ukevent *buf, unsigned flags) ctl_fd - file descriptor referring to the kevent queue min_nr - minimum number of completed events that kevent_get_events will block waiting for max_nr - number of struct ukevent in buf timeout - number of nanoseconds to wait before returning less than min_nr events. If this is -1, then wait forever. buf - pointer to an array of struct ukevent. flags - unused kevent_get_events will wait up to timeout nanoseconds for at least min_nr completed events, copying completed struct ukevents to buf and deleting any KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many events as possible, but not more than max_nr. In blocking mode it waits until the timeout expires or at least min_nr events are ready. ------------------------------------------------------------------------------------- int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) ctl_fd - file descriptor referring to the kevent queue num - number of processed kevents timeout - number of nanoseconds to wait until there is free space in the kevent queue This syscall waits until either the timeout expires or at least one event becomes ready. It also copies those num events into a special ring buffer and requeues or removes them (depending on their flags). ------------------------------------------------------------------------------------- int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num) ctl_fd - file descriptor referring to the kevent queue num - size of the ring buffer in events struct kevent_ring { unsigned int ring_kidx; struct ukevent event[0]; } ring_kidx - the index in the ring buffer where the kernel will put new events when kevent_wait() or kevent_get_events() is called Example userspace code (ring_buffer.c) can be found on the project's homepage. Each kevent syscall can be a so-called cancellation point in glibc, i.e. 
when a thread has been cancelled in a kevent syscall, the thread can be safely removed and no events will be lost, since each syscall (kevent_wait() or kevent_get_events()) copies events into the special ring buffer, which is accessible from other threads or even processes (if shared memory is used). When a kevent is removed (not dequeued when it is ready, but just removed), it is not copied into the ring buffer even if it was ready: if it is removed, no one cares about it (otherwise the user would have waited until it became ready and fetched it the usual way through kevent_get_events() or kevent_wait()), so there is no need to copy it to the ring buffer. With a userspace ring buffer it is possible for events in the ring buffer to be replaced without the knowledge of the thread currently reading them (when another thread calls kevent_get_events() or kevent_wait()), so appropriate locking is required between threads or processes which can simultaneously access the same ring buffer. ------------------------------------------------------------------------------------- The bulk of the interface is entirely done through the ukevent struct. It is used to add event requests, modify existing event requests, specify which event requests to remove, and return completed events. struct ukevent contains the following members: struct kevent_id id Id of this request, e.g. socket number, file descriptor and so on __u32 type Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on __u32 event Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED __u32 req_flags Per-event request flags: KEVENT_REQ_ONESHOT The event will be removed when it is ready. KEVENT_REQ_WAKEUP_ONE When several threads wait on the same kevent queue and have requested the same event, for example 'wake me up when a new client has connected, so I can call accept()', then all threads will be awakened when a new client has connected, but only one of them can process the data. This problem is known as the thundering herd problem. Events which have this flag set will not be marked as ready (and the appropriate threads will not be awakened) if at least one event has already been marked. KEVENT_REQ_ET Edge-triggered behaviour. It is an optimisation which allows a ready and dequeued (i.e. copied to userspace) event to be moved back into the set of interest for the given storage (socket, inode and so on). It is very useful for cases when the same event should be used many times (like reading from a pipe). It is similar to epoll()'s EPOLLET flag. __u32 ret_flags Per-event return flags: KEVENT_RET_BROKEN Kevent is broken. KEVENT_RET_DONE Kevent processing was finished successfully. KEVENT_RET_COPY_FAILED Kevent was not copied into the ring buffer due to some error condition. __u32 ret_data Event return data. The event originator fills it with anything it likes (for example, timer notifications put the number of milliseconds when the timer fired). union { __u32 user[2]; void *ptr; } User's data. It is not used by the kernel, just copied to/from user. The whole structure is aligned to 8 bytes already, so the last union is aligned properly. --------------------------------------------------------------------------------- Usage For KEVENT_CTL_ADD, all fields relevant to the event type must be filled (id, type, possibly event, req_flags). After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags should be checked to see if the event is already broken or done. For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be set and an existing kevent request must have matching id and user fields.
If a match is found, req_flags and event are replaced with the newly supplied values and requeueing is started, so the modified kevent can be checked and possibly marked as ready immediately. If a match can't be found, the passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set. For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing kevent request must have matching id and user fields. If a match is found, the kevent request is removed. If a match can't be found, the passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set. For kevent_get_events, the entire structure is returned. --------------------------------------------------------------------------------- Usage cases kevent_timer struct ukevent should contain the following fields: type - KEVENT_TIMER event - KEVENT_TIMER_FIRED req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once id.raw[0] - number of seconds after commit when this timer should expire id.raw[1] - number of nanoseconds in addition to the number of seconds ^ permalink raw reply [flat|nested] 200+ messages in thread
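To make the calling convention above concrete, here is a minimal, untested sketch of the timer usage case. It assumes the <linux/ukevent.h> header from this patchset is installed and hard-codes the x86_64 syscall numbers the patch assigns (280 for kevent_get_events, 281 for kevent_ctl); there are no glibc wrappers at this point, so raw syscall() is used.

/*
 * Sketch only: arm a one-shot 2-second timer and wait for it.
 * Syscall numbers below are the x86_64 assignments from this patchset.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

#define __NR_kevent_get_events 280
#define __NR_kevent_ctl        281

int main(void)
{
	struct ukevent uk;
	long err;
	int fd;

	fd = open("/dev/kevent", O_RDWR);
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* removed automatically once it fires */
	uk.id.raw[0] = 2;			/* seconds until expiration */
	uk.id.raw[1] = 0;			/* additional nanoseconds */

	/* Returns 0 when queued; immediately ready/broken events are copied back. */
	err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
	if (err != 0)
		return 1;

	/* Block for at least one event, at most 3 seconds (timeout is in ns). */
	err = syscall(__NR_kevent_get_events, fd, 1, 1, 3000000000ULL, &uk, 0);
	if (err > 0 && !(uk.ret_flags & KEVENT_RET_BROKEN))
		printf("timer fired, ret_data[0]=%u\n", uk.ret_data[0]);

	close(fd);
	return 0;
}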
* [take23 2/5] kevent: Core files. 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..fa8075b 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,7 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_ring_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..95fb252 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,12 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_ring_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..2161ef2 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,14 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_ring_init 322 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 323 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..3669c0f 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,18 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_ring_init 283 +__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ring_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..781ffa8 --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,201 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, ring_size; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..471a685 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..ee881c9 --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,153 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Wake up only when event exclusively belongs to this thread, + * for example when several threads are waiting for new client + * connection so they could perform accept() it is a good idea + * to set this flag, so only one thread of all with this flag set + * will be awakened. + * If there are events without this flags, appropriate threads will + * be awakened too. */ +#define KEVENT_REQ_WAKEUP_ONE 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 + +/* + * Kevent return flags. + */ +/* Kevent is broken. */ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 +/* Kevent was not copied into ring buffer due to some error conditions. */ +#define KEVENT_RET_COPY_FAILED 0x4 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. 
+ */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. */ + +struct kevent_id +{ + union { + __u32 raw[2]; + __u64 raw_u64 __attribute__((aligned(8))); + }; +}; + +struct ukevent +{ + /* Id of this request, e.g. socket number, file descriptor and so on... */ + struct kevent_id id; + /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 type; + /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */ + __u32 event; + /* Per-event request flags */ + __u32 req_flags; + /* Per-event return flags */ + __u32 ret_flags; + /* Event return data. Event originator fills it with anything it likes. */ + __u32 ret_data[2]; + /* User's data. It is not used, just copied to/from user. + * The whole structure is aligned to 8 bytes already, so the last union + * is aligned properly. + */ + union { + __u32 user[2]; + void *ptr; + }; +}; + +struct kevent_ring +{ + unsigned int ring_kidx; + struct ukevent event[0]; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 + +#endif /* __UKEVENT_H */ diff --git a/init/Kconfig b/init/Kconfig index d2eb7a8..c7d8250 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -201,6 +201,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. +source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index d62ec66..2d7a6dd 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..5ba8086 --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,39 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback + invocations, advanced timer notifications and other kernel + object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents + which are ready immediately at insertion time and number of kevents + which were removed through readiness completion. + It will be printed each time control kevent descriptor is closed. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. 
+ +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..9130cad --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,4 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..24ee44a --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,232 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. 
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. + * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. 
+ */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (ready_callback) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..5ebfa6d --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,913 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static const char kevent_name[] = "kevent"; +static kmem_cache_t *kevent_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success, + * -EAGAIN if there were no place for that kevent (impossible) + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. 
+ */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + if (++u->kidx >= u->ring_size) + u->kidx = 0; + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return -ENOMEM; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->ring_size = 0; + u->pring = NULL; + + file->private_data = u; + return 0; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. + * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kfree(u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. 
+ */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + if (k) + kevent_copy_ring_buffer(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. + */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. 
+ */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. 
+ */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. + */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + break; + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct miscdevice kevent_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = kevent_name, + .fops = &kevent_user_fops, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - timeout in nanoseconds to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - ununsed for now (will be used for mmap implementation). 
+ */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + __u64 timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf); +out_fput: + fput(file); + return err; +} + +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + if (u->pring) { + err = -EINVAL; + goto err_out_exit; + } + u->pring = ring; + u->ring_size = num; + mutex_unlock(&u->ring_lock); + + fput(file); + + return 0; + +err_out_exit: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in kevent queue + * and removes/requeues requested number of events (commits them). Function returns + * number of actually committed events. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @timeout - this timeout specifies number of nanoseconds to wait until there is + * free space in kevent queue. + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. + * Kevents will be copied into ring buffer in order they were placed into ready queue. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) +{ + int err = -EINVAL, committed = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num >= u->ring_size) + goto out_fput; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= 1, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + for (i=0; i<num; ++i) { + k = kqueue_dequeue_ready(u); + if (!k) + break; + kevent_complete_ready(k); + kevent_stat_ring(u); + committed++; + } + + fput(file); + + return committed; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create kevent cache and register + * filesystem to get control file descriptors from. 
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..5200583 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_ring_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
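For reference, here is a minimal, untested sketch of the userspace side of the ring buffer handled by kevent_copy_ring_buffer() and sys_kevent_wait() above (the authoritative example is ring_buffer.c on the project's homepage). It assumes the x86_64 syscall numbers from this patchset (282 for kevent_wait, 283 for kevent_ring_init). The reader keeps its own index, which wraps exactly like the kernel's kidx; concurrent readers would additionally need the locking discussed in the description patch.

#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>
#include <linux/ukevent.h>

#define __NR_kevent_wait      282
#define __NR_kevent_ring_init 283
#define RING_SIZE             256

static struct kevent_ring *ring;
static unsigned int uidx;	/* next slot this reader will consume */

int ring_setup(int ctl_fd)
{
	ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
	if (!ring)
		return -1;
	return syscall(__NR_kevent_ring_init, ctl_fd, ring, RING_SIZE);
}

/* Commit up to @num ready events; @num must stay below RING_SIZE,
 * since sys_kevent_wait() rejects num >= ring_size. */
int ring_consume(int ctl_fd, unsigned int num, __u64 timeout_ns)
{
	long n;
	unsigned int i;

	n = syscall(__NR_kevent_wait, ctl_fd, num, timeout_ns);
	for (i = 0; i < n; ++i) {
		struct ukevent *uk = &ring->event[uidx];

		/* ... handle *uk here ... */
		if (++uidx >= RING_SIZE)
			uidx = 0;	/* wrap like the kernel's kidx */
	}
	return n;
}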
* [take23 3/5] kevent: poll/select() notifications. 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works similarly to epoll and has the same issues (the callback is invoked not from the internal state machine of the caller, but through process wakeup, a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..f81299f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..94facbb --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,222 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + 
kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
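For orientation, this is roughly how a poll-style kevent would be registered from userspace, following the kevent_ctl() interface documented later in this thread; the wrapper declaration and the header name are assumptions (glibc provides no wrappers for these syscalls):

/* Sketch under assumptions: kevent_ctl() wrapper and header name are
 * hypothetical; KEVENT_POLL, KEVENT_CTL_ADD and struct ukevent come
 * from the patchset itself. */
#include <poll.h>
#include <string.h>
#include <linux/ukevent.h>	/* assumed name of the shared header */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);	/* assumed syscall wrapper */

static int add_poll_kevent(int kevent_fd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = watched_fd;	/* kevent_poll_enqueue() does fget(id.raw[0]) */
	uk.type = KEVENT_POLL;
	uk.event = POLLIN;		/* mask matched against file->f_op->poll() */

	return kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}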
* [take23 4/5] kevent: Socket notifications. 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 2006-11-07 16:50 ` [take23 5/5] kevent: Timer notifications Evgeniy Polyakov 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Socket notifications. This patch includes socket send/recv/accept notifications. Using a trivial web server based on kevent with these features instead of epoll, its performance increased more than noticeably. More details about various benchmarks and the server itself (evserver_kevent.c) can be found on the project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..ff1b129 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/lockdep.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..7f74110 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,135 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
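As a rough illustration of how a server would consume these notifications, a hypothetical accept loop over KEVENT_SOCKET_ACCEPT; the syscall wrappers and header name are assumptions (there are no glibc wrappers), and it is assumed the returned ukevent carries the id it was submitted with:

/* Hypothetical accept loop; wrappers, header and handler are assumed. */
#include <string.h>
#include <sys/socket.h>
#include <linux/ukevent.h>	/* assumed */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);			/* assumed */
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, unsigned long long timeout,
			     struct ukevent *buf, unsigned flags);	/* assumed */
extern void handle_client(int fd);				/* placeholder */

static void accept_loop(int kevent_fd, int listen_fd)
{
	struct ukevent uk, ready[16];
	int i, n;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = listen_fd;	/* kevent_socket_enqueue() does sockfd_lookup() */
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_ACCEPT;
	kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);

	for (;;) {
		/* -1 timeout: wait forever, per the take24 documentation */
		n = kevent_get_events(kevent_fd, 1, 16, ~0ULL, ready, 0);
		for (i = 0; i < n; ++i) {
			int client = accept(ready[i].id.raw[0], NULL, NULL);
			if (client >= 0)
				handle_client(client);
		}
	}
}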
* [take23 5/5] kevent: Timer notifications. 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-07 16:50 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-07 16:50 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + struct kevent_timer *t = container_of(st, struct 
kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
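Going by the id.raw[] convention stated above (seconds in raw[0], nanoseconds in raw[1]) and the hrtimer_forward()/HRTIMER_RESTART rearm in kevent_timer_func(), arming a periodic 250 ms timer from userspace would look roughly like this (the kevent_ctl() wrapper and header are the same assumptions as in the earlier sketches):

/* Sketch: periodic 250 ms KEVENT_TIMER event. The kernel side rearms
 * the hrtimer in kevent_timer_func(), so the event keeps firing; on
 * delivery, ret_data[0] carries jiffies_to_msecs(jiffies). */
#include <string.h>
#include <linux/ukevent.h>	/* assumed */

extern int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);	/* assumed wrapper */

static int add_periodic_timer(int kevent_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = 0;			/* seconds */
	uk.id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds: 250 ms period */
	uk.type = KEVENT_TIMER;

	return kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}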
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-07 22:53 ` Davide Libenzi 2006-11-08 8:45 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-07 22:53 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Tue, 7 Nov 2006, Evgeniy Polyakov wrote: > +static int kevent_poll_wait_callback(wait_queue_t *wait, > + unsigned mode, int sync, void *key) > +{ > + struct kevent_poll_wait_container *cont = > + container_of(wait, struct kevent_poll_wait_container, wait); > + struct kevent *k = cont->k; > + struct file *file = k->st->origin; > + u32 revents; > + > + revents = file->f_op->poll(file, NULL); > + > + kevent_storage_ready(k->st, NULL, revents); > + > + return 0; > +} Are you sure you can safely call file->f_op->poll() from inside a callback based wakeup? The low level driver may be calling the wakeup with one of its locks held, and during the file->f_op->poll may be trying to acquire the same lock. I remember there was a discussion about this, and assuming the above is not true made the epoll code more complex (and slower, since an extra O(R) loop was needed to fetch events). - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
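The lock recursion Davide describes, sketched with a hypothetical driver (illustration only, not code from the patchset): the wakeup runs wait-queue callbacks synchronously under the driver's lock, and ->poll() then tries to take the same lock.

/* Hypothetical driver, for illustration of the potential deadlock. */
#include <linux/fs.h>
#include <linux/poll.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct drv {
	spinlock_t lock;
	wait_queue_head_t wq;
	int rx_ready;
};

static void drv_rx_complete(struct drv *drv)
{
	spin_lock(&drv->lock);
	drv->rx_ready = 1;
	/* runs the wait queue callbacks synchronously, including
	 * kevent_poll_wait_callback(), which calls ->poll() */
	wake_up(&drv->wq);
	spin_unlock(&drv->lock);
}

static unsigned int drv_poll(struct file *file, struct poll_table_struct *wait)
{
	struct drv *drv = file->private_data;
	unsigned int mask = 0;

	spin_lock(&drv->lock);	/* deadlock: already held by drv_rx_complete() */
	if (drv->rx_ready)
		mask |= POLLIN | POLLRDNORM;
	spin_unlock(&drv->lock);
	return mask;
}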
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi @ 2006-11-08 8:45 ` Evgeniy Polyakov 2006-11-08 17:03 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:45 UTC (permalink / raw) To: Davide Libenzi Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Tue, Nov 07, 2006 at 02:53:33PM -0800, Davide Libenzi (davidel@xmailserver.org) wrote: > On Tue, 7 Nov 2006, Evgeniy Polyakov wrote: > > > +static int kevent_poll_wait_callback(wait_queue_t *wait, > > + unsigned mode, int sync, void *key) > > +{ > > + struct kevent_poll_wait_container *cont = > > + container_of(wait, struct kevent_poll_wait_container, wait); > > + struct kevent *k = cont->k; > > + struct file *file = k->st->origin; > > + u32 revents; > > + > > + revents = file->f_op->poll(file, NULL); > > + > > + kevent_storage_ready(k->st, NULL, revents); > > + > > + return 0; > > +} > > Are you sure you can safely call file->f_op->poll() from inside a callback > based wakeup? The low level driver may be calling the wakeup with one of > its locks held, and during the file->f_op->poll may be trying to acquire > the same lock. I remember there was a discussion about this, and assuming > the above is not true made the epoll code more complex (and slower, since an > extra O(R) loop was needed to fetch events). Indeed, I have not paid too much attention to poll/select notifications in kevent actually. As far as I recall it should be called on behalf of the process doing kevent_get_event(). I will check and fix if that is not correct. Thanks Davide. > - Davide > -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 3/5] kevent: poll/select() notifications. 2006-11-08 8:45 ` Evgeniy Polyakov @ 2006-11-08 17:03 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 17:03 UTC (permalink / raw) To: Davide Libenzi Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Wed, Nov 08, 2006 at 11:45:54AM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Are you sure you can safely call file->f_op->poll() from inside a callback > > based wakeup? The low level driver may be calling the wakeup with one of > > its locks held, and during the file->f_op->poll may be trying to acquire > > the same lock. I remember there was a discussion about this, and assuming > > the above is not true made the epoll code more complex (and slower, since an > > extra O(R) loop was needed to fetch events). > > Indeed, I have not paid too much attention to poll/select notifications in > kevent actually. As far as I recall it should be called on behalf of the process > doing kevent_get_event(). I will check and fix if that is not correct. > Thanks Davide. Indeed there was a bug. Actually the poll/select patch was broken quite noticeably - the patchset did not include major changes I made for it. I will put them all into the next release. Thanks again Davide for pointing that out. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 2/5] kevent: Core files. 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov 2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-07 22:16 ` Andrew Morton 2006-11-08 8:24 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:16 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > This patch includes core kevent files: > * userspace controlling > * kernelspace interfaces > * initialization > * notification state machines I fixed up all the rejects, but your syscall numbers changed. Please always raise patches against the latest kernel. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 2/5] kevent: Core files. 2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton @ 2006-11-08 8:24 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:24 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:16:57PM -0800, Andrew Morton (akpm@osdl.org) wrote: > On Tue, 7 Nov 2006 19:50:48 +0300 > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > > > This patch includes core kevent files: > > * userspace controlling > > * kernelspace interfaces > > * initialization > > * notification state machines > > I fixed up all the rejects, but your syscall numbers changed. Please > always raise patches against the latest kernel. Will do. Numbers actually are the same, but I added a new syscall which was against the old tree. Thanks Andrew. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 1/5] kevent: Description. 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov 2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov @ 2006-11-07 22:16 ` Andrew Morton 2006-11-08 8:23 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:16 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > Description. I converted this into Documentation/kevent.txt. It looks like crap in an 80-col xterm btw. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 1/5] kevent: Description. 2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton @ 2006-11-08 8:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:23 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:16:40PM -0800, Andrew Morton (akpm@osdl.org) wrote: > On Tue, 7 Nov 2006 19:50:48 +0300 > Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > > > Description. > > I converted this into Documentation/kevent.txt. It looks like crap in an 80-col > xterm btw. Thanks. It was copied as is from the documentation page, so it does look like crap in a non-browser window. I'm quite sure there will be some questions about kevent, so I will update that file and fix the indentation. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov 2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov @ 2006-11-07 22:17 ` Andrew Morton 2006-11-08 8:21 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-07 22:17 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, 7 Nov 2006 19:50:48 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > Generic event handling mechanism. I updated the version in -mm to v23. So people can play with it and review it. It looks like a bit of work will be needed to get it to compile. It seems that most of the fixes which were added to the previous version were merged or are now irrelevant, however you lost this change: From: Andrew Morton <akpm@osdl.org> If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it will return 0, which is indistinguishable from "no events pending". It can and should return EFAULT in this case. Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> --- kernel/kevent/kevent_user.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletion(-) diff -puN kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix kernel/kevent/kevent_user.c --- a/kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix +++ a/kernel/kevent/kevent_user.c @@ -690,8 +690,11 @@ static int kevent_user_wait(struct file while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) { if (copy_to_user(buf + num*sizeof(struct ukevent), - &k->event, sizeof(struct ukevent))) + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; break; + } kevent_complete_ready(k); ++num; kevent_stat_wait(u); _ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton @ 2006-11-08 8:21 ` Evgeniy Polyakov 2006-11-08 14:51 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-08 8:21 UTC (permalink / raw) To: Andrew Morton Cc: David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tue, Nov 07, 2006 at 02:17:18PM -0800, Andrew Morton (akpm@osdl.org) wrote: > From: Andrew Morton <akpm@osdl.org> > > If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it > will return 0, which is indistinguishable from "no events pending". > > It can and should return EFAULT in this case. Correct, I missed that. Thanks Andrew, I will put it into my tree; -mm seems to have it already. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 8:21 ` Evgeniy Polyakov @ 2006-11-08 14:51 ` Eric Dumazet 2006-11-08 22:03 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-08 14:51 UTC (permalink / raw) To: Andrew Morton Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wednesday 08 November 2006 09:21, Evgeniy Polyakov wrote: > On Tue, Nov 07, 2006 at 02:17:18PM -0800, Andrew Morton (akpm@osdl.org) wrote: > > From: Andrew Morton <akpm@osdl.org> > > > > If kevent_user_wait() gets -EFAULT on the attempt to copy the first > > event, it will return 0, which is indistinguishable from "no events > > pending". > > > > It can and should return EFAULT in this case. > > Correct, I missed that. > Thanks Andrew, I will put it into my tree; -mm seems to have it already. I believe eventpoll has a similar problem. Not a big problem, but we can be cleaner. Normally, the access_ok() done in sys_epoll_wait() should catch a non-writeable user area, unless another thread plays VM games (the thread in sys_epoll_wait() can sleep). [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should report the count of events that were successfully copied into user space, instead of EFAULT. That would be consistent with the behavior of read/write() syscalls for example. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 @@ -1447,7 +1447,7 @@ &events[eventcnt].events) || __put_user(epi->event.data, &events[eventcnt].data)) - return -EFAULT; + return eventcnt ? eventcnt : -EFAULT; if (epi->event.events & EPOLLONESHOT) epi->event.events &= EP_PRIVATE_BITS; eventcnt++; ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 14:51 ` Eric Dumazet @ 2006-11-08 22:03 ` Andrew Morton 2006-11-08 22:44 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-08 22:03 UTC (permalink / raw) To: Eric Dumazet Cc: Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Davide Libenzi On Wed, 8 Nov 2006 15:51:13 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should > report the count of events that were successfully copied into user space, > instead of EFAULT. That would be consistent with the behavior of read/write() > syscalls for example. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > > [eventpoll.patch text/plain (424B)] > --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 > +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 > @@ -1447,7 +1447,7 @@ > &events[eventcnt].events) || > __put_user(epi->event.data, > &events[eventcnt].data)) > - return -EFAULT; > + return eventcnt ? eventcnt : -EFAULT; > if (epi->event.events & EPOLLONESHOT) > epi->event.events &= EP_PRIVATE_BITS; > eventcnt++; > Definitely a better interface, but I wonder if it's too late to change it. An app which does if (epoll_wait(...) == -1) barf(errno); else assume_all_events_were_received(); will now do the wrong thing. otoh, such an application basically _has_ to use the epoll_wait() return value to work out how many events it received, so maybe it's OK... ^ permalink raw reply [flat|nested] 200+ messages in thread
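For clarity, the count-based idiom Andrew refers to in his last paragraph, which stays correct whether or not a partial count can be returned (barf() and handle() are placeholders, not real API):

/* Return-value-as-count idiom: correct with or without Eric's change. */
#include <errno.h>
#include <sys/epoll.h>

extern void barf(int err);			/* placeholder */
extern void handle(struct epoll_event *ev);	/* placeholder */

void drain(int epfd, struct epoll_event *events, int maxevents)
{
	int i, n = epoll_wait(epfd, events, maxevents, -1);

	if (n < 0) {
		if (errno != EINTR)
			barf(errno);
		return;		/* interrupted: caller retries */
	}
	for (i = 0; i < n; ++i)
		handle(&events[i]);	/* exactly the n delivered events */
}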
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 22:03 ` Andrew Morton @ 2006-11-08 22:44 ` Davide Libenzi 2006-11-08 23:07 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-08 22:44 UTC (permalink / raw) To: Andrew Morton Cc: Eric Dumazet, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Wed, 8 Nov 2006, Andrew Morton wrote: > On Wed, 8 Nov 2006 15:51:13 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: > > > [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should > > report the count of events that were successfully copied into user space, > > instead of EFAULT. That would be consistent with the behavior of read/write() > > syscalls for example. > > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > > > > > > [eventpoll.patch text/plain (424B)] > > --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 > > +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 > > @@ -1447,7 +1447,7 @@ > > &events[eventcnt].events) || > > __put_user(epi->event.data, > > &events[eventcnt].data)) > > - return -EFAULT; > > + return eventcnt ? eventcnt : -EFAULT; > > if (epi->event.events & EPOLLONESHOT) > > epi->event.events &= EP_PRIVATE_BITS; > > eventcnt++; > > > > Definitely a better interface, but I wonder if it's too late to change it. > > An app which does > > if (epoll_wait(...) == -1) > barf(errno); > else > assume_all_events_were_received(); > > will now do the wrong thing. > > otoh, such an application basically _has_ to use the epoll_wait() > return value to work out how many events it received, so maybe it's OK... I don't care either way, but sys_poll() does the same thing epoll does right now, so I would not change epoll behaviour. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 22:44 ` Davide Libenzi @ 2006-11-08 23:07 ` Eric Dumazet 2006-11-08 23:56 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-08 23:07 UTC (permalink / raw) To: Davide Libenzi Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Davide Libenzi wrote: > On Wed, 8 Nov 2006, Andrew Morton wrote: > >> On Wed, 8 Nov 2006 15:51:13 +0100 >> Eric Dumazet <dada1@cosmosbay.com> wrote: >> >>> [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should >>> report the count of events that were successfully copied into user space, >>> instead of EFAULT. That would be consistent with the behavior of read/write() >>> syscalls for example. >>> >>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>> >>> >>> >>> [eventpoll.patch text/plain (424B)] >>> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100 >>> +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100 >>> @@ -1447,7 +1447,7 @@ >>> &events[eventcnt].events) || >>> __put_user(epi->event.data, >>> &events[eventcnt].data)) >>> - return -EFAULT; >>> + return eventcnt ? eventcnt : -EFAULT; >>> if (epi->event.events & EPOLLONESHOT) >>> epi->event.events &= EP_PRIVATE_BITS; >>> eventcnt++; >>> >> Definitely a better interface, but I wonder if it's too late to change it. >> >> An app which does >> >> if (epoll_wait(...) == -1) >> barf(errno); >> else >> assume_all_events_were_received(); >> >> will now do the wrong thing. >> >> otoh, such an application basically _has_ to use the epoll_wait() >> return value to work out how many events it received, so maybe it's OK... > > I don't care either way, but sys_poll() does the same thing epoll > does right now, so I would not change epoll behaviour. > Sure, poll() cannot return a partial count, since its return value is: On success, a positive number is returned, where the number returned is the number of structures which have non-zero revents fields (in other words, those descriptors with events or errors reported). poll() is non-destructive (it doesn't change any state in the kernel). Returning EFAULT in case of an error in the very last bit of the user area is mandatory. On the contrary: epoll_wait() does return a count of transferred events, and updates some state in the kernel (it consumes Edge Triggered events: they can be lost forever if not reported to the user). So epoll_wait() is much more like read(), which also updates file state in the kernel (the current file position). ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 23:07 ` Eric Dumazet @ 2006-11-08 23:56 ` Davide Libenzi 2006-11-09 7:24 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Davide Libenzi @ 2006-11-08 23:56 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Thu, 9 Nov 2006, Eric Dumazet wrote: > Davide Libenzi wrote: > > > > I don't care either way, but sys_poll() does the same thing epoll does > > right now, so I would not change epoll behaviour. > > > > Sure, poll() cannot return a partial count, since its return value is: > > On success, a positive number is returned, where the number returned is > the number of structures which have non-zero revents fields (in other > words, those descriptors with events or errors reported). > > poll() is non-destructive (it doesn't change any state in the kernel). Returning > EFAULT in case of an error in the very last bit of the user area is mandatory. > > On the contrary: > > epoll_wait() does return a count of transferred events, and updates some state > in the kernel (it consumes Edge Triggered events: they can be lost forever if not > reported to the user). > > So epoll_wait() is much more like read(), which also updates file state in > the kernel (the current file position). Lost forever means? If there are more processes watching some fd (external events), they all get their own copy of the events in their own private epoll fd. It's not that we "steal" things out of the kernel; it is not a 1:1 producer/consumer thing (one producer, 1 queue). It's a one-producer, broadcast-to-all-listeners (consumers) thing. The only case where it'd matter is in the case of multiple threads sharing the same epoll fd. In general, I'd be more for having the userspace get its own SEGFAULT instead of letting it go with broken parameters. If I'm coding userspace, and I'm doing something wrong, I like the kernel to let me know, instead of trying to fix things for me. Also, epoll can easily be fixed (add a param to ep_reinject_items() to re-inject items in case of error/EFAULT) to leave events in the ready-list and let the EFAULT emerge. Anyone else have opinions about this? PS: Next time it'd be great if you Cc: me when posting epoll patches, so you save Andrew the job of doing it. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
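Davide's broadcast point can be checked with a few lines of standard epoll code: one socket registered in two independent epoll instances is reported by both, nothing is stolen (a self-contained sketch; sock_fd is any readable socket):

/* Two independent epoll instances watching the same fd: both see the
 * event; neither "steals" it from the other. */
#include <sys/epoll.h>

static void demo(int sock_fd)
{
	int ep1 = epoll_create(1);
	int ep2 = epoll_create(1);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };

	epoll_ctl(ep1, EPOLL_CTL_ADD, sock_fd, &ev);
	epoll_ctl(ep2, EPOLL_CTL_ADD, sock_fd, &ev);
	/* when sock_fd becomes readable, epoll_wait() on ep1 and on ep2
	 * each report it */
}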
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-08 23:56 ` Davide Libenzi @ 2006-11-09 7:24 ` Eric Dumazet 2006-11-09 7:52 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 7:24 UTC (permalink / raw) To: Davide Libenzi Cc: Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Davide Libenzi wrote: > On Thu, 9 Nov 2006, Eric Dumazet wrote: > >> Davide Libenzi wrote: >>> I don't care either way, but sys_poll() does the same thing epoll does >>> right now, so I would not change epoll behaviour. >>> >> Sure, poll() cannot return a partial count, since its return value is: >> >> On success, a positive number is returned, where the number returned is >> the number of structures which have non-zero revents fields (in other >> words, those descriptors with events or errors reported). >> >> poll() is non-destructive (it doesn't change any state in the kernel). Returning >> EFAULT in case of an error in the very last bit of the user area is mandatory. >> >> On the contrary: >> >> epoll_wait() does return a count of transferred events, and updates some state >> in the kernel (it consumes Edge Triggered events: they can be lost forever if not >> reported to the user). >> >> So epoll_wait() is much more like read(), which also updates file state in >> the kernel (the current file position). > > Lost forever means? If there are more processes watching some fd > (external events), they all get their own copy of the events in their own > private epoll fd. It's not that we "steal" things out of the kernel; it is > not a 1:1 producer/consumer thing (one producer, 1 queue). It's a one- > producer, broadcast-to-all-listeners (consumers) thing. The only case > where it'd matter is in the case of multiple threads sharing the same > epoll fd. In my particular epoll application, the producer is the tcp stack, and I have one consumer. If a network event is lost in the EFAULT handling, it's lost forever. In any case, my application does provide a correct user area, so this problem is only theoretical. > In general, I'd be more for having the userspace get its own SEGFAULT > instead of letting it go with broken parameters. If I'm coding userspace, > and I'm doing something wrong, I like the kernel to let me know, instead > of trying to fix things for me. > Also, epoll can easily be fixed (add a param to ep_reinject_items() to > re-inject items in case of error/EFAULT) to leave events in the ready-list > and let the EFAULT emerge. Please don't slow the hot path for what is basically a "User Error". It's already tested in the transfer function, with two conditional branches for each transferred event. > Anyone else have opinions about this? > > > > > PS: Next time it'd be great if you Cc: me when posting epoll patches, so > you save Andrew the job of doing it. Yes, but this particular patch was a followup on Andrew's own kevent patch. I have a bunch of patches for epoll I will send to you :) Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-09 7:24 ` Eric Dumazet @ 2006-11-09 7:52 ` Eric Dumazet 2006-11-09 17:12 ` Davide Libenzi 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 7:52 UTC (permalink / raw) To: Eric Dumazet Cc: Davide Libenzi, Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik Eric Dumazet wrote: > Davide Libenzi wrote: >> Lost forever means? If there are more processes watching some fd >> (external events), they all get their own copy of the events in their >> own private epoll fd. It's not that we "steal" things out of the >> kernel; it is not a 1:1 producer/consumer thing (one producer, 1 queue). >> It's a one-producer, broadcast-to-all-listeners (consumers) thing. The >> only case where it'd matter is in the case of multiple threads sharing >> the same epoll fd. > > In my particular epoll application, the producer is the tcp stack, and I > have one consumer. If a network event is lost in the EFAULT handling, > it's lost forever. In any case, my application does provide a correct user > area, so this problem is only theoretical. I realize I was not explicit, and did not answer your question ("Lost forever" means?) if (epi->revents) { if (__put_user(epi->revents, &events[eventcnt].events) || __put_user(epi->event.data, &events[eventcnt].data)) return -EFAULT; >> if (epi->event.events & EPOLLONESHOT) >> epi->event.events &= EP_PRIVATE_BITS; eventcnt++; } If one EPOLLONESHOT event is correctly copied to user space, its status is updated. If other ready events in the same epoll_wait() call cannot be transferred because of an EFAULT (we reach the real end of the user-provided area), this EPOLLONESHOT event is lost forever, because it won't be requeued in the ready list. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take23 0/5] kevent: Generic event handling mechanism. 2006-11-09 7:52 ` Eric Dumazet @ 2006-11-09 17:12 ` Davide Libenzi 0 siblings, 0 replies; 200+ messages in thread From: Davide Libenzi @ 2006-11-09 17:12 UTC (permalink / raw) To: Eric Dumazet Cc: Davide Libenzi, Andrew Morton, Evgeniy Polyakov, David Miller, Ulrich Drepper, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, Linux Kernel Mailing List, Jeff Garzik On Thu, 9 Nov 2006, Eric Dumazet wrote: > > > Lost forever means? If there are more processes watching some fd (external > > > events), they all get their own copy of the events in their own private > > > epoll fd. It's not that we "steal" things out of the kernel; it is not a 1:1 > > > producer/consumer thing (one producer, 1 queue). It's a one-producer, > > > broadcast-to-all-listeners (consumers) thing. The only case where it'd > > > matter is in the case of multiple threads sharing the same epoll fd. > > > > In my particular epoll application, the producer is the tcp stack, and I have > > one consumer. If a network event is lost in the EFAULT handling, it's lost > > forever. In any case, my application does provide a correct user area, so this > > problem is only theoretical. > > I realize I was not explicit, and did not answer your question ("Lost forever" > means?) > > if (epi->revents) { > if (__put_user(epi->revents, > &events[eventcnt].events) || > __put_user(epi->event.data, > &events[eventcnt].data)) > return -EFAULT; > >> if (epi->event.events & EPOLLONESHOT) > >> epi->event.events &= EP_PRIVATE_BITS; > eventcnt++; > } > > If one EPOLLONESHOT event is correctly copied to user space, its status is > updated. > > If other ready events in the same epoll_wait() call cannot be transferred > because of an EFAULT (we reach the real end of the user-provided area), this > EPOLLONESHOT event is lost forever, because it won't be requeued in the ready list. Your application is feeding crap to the kernel, because of programming bugs. If that happens, I want an EFAULT and not a partially filled buffer. And which buffer then? This could have been scribbled in userspace memory (the pointer), and the attempt of the kernel to mask out bugs might create even more subtle problems. Such a bug will *never* show up in case the wrong buffer is partially valid (first part valid, that is the *only* case where your fix would make a difference compared to the status quo), since in case of no ready events we'll never hit it, and in case of some events we'll always return a few of them and never EFAULT. No, the more I think about it, the more I personally disagree with the change. > Please don't slow the hot path for what is basically a "User Error". It's already > tested in the transfer function, with two conditional > branches for each transferred event. Ohh, if you think you can measure them from userspace, those can be turned into 'err |= __put_user();' with err tested only outside the loop. - Davide ^ permalink raw reply [flat|nested] 200+ messages in thread
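The micro-optimisation Davide mentions at the end would look roughly as follows; this is a sketch, and the iteration over the transfer list is paraphrased rather than the literal eventpoll code:

/* Sketch: accumulate __put_user() failures and test once, instead of
 * one conditional branch per store; loop shape is illustrative only. */
int err = 0, eventcnt = 0;
struct epitem *epi;

list_for_each_entry(epi, txlist, rdllink) {	/* assumed iteration */
	if (!epi->revents)
		continue;
	err |= __put_user(epi->revents, &events[eventcnt].events);
	err |= __put_user(epi->event.data, &events[eventcnt].data);
	if (epi->event.events & EPOLLONESHOT)
		epi->event.events &= EP_PRIVATE_BITS;
	eventcnt++;
}
return err ? -EFAULT : eventcnt;

Note this keeps whole-call EFAULT semantics (Davide's preference) while removing the per-event branch; the EPOLLONESHOT loss Eric describes above would still need separate handling.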
* [take24 0/6] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> ` (2 preceding siblings ...) 2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov ` (2 more replies) 2006-11-21 16:29 ` [take25 " Evgeniy Polyakov 2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov 5 siblings, 3 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem which allows handling event notifications. It supports both level and edge triggered events. It is similar to poll/epoll in some cases, but it is more scalable, it is faster and allows working with essentially any kind of event. Events are provided to the kernel through a control syscall and can be read back through a ring buffer or using the usual syscalls. Kevent updates (i.e. readiness switching) happen directly from the internals of the appropriate state machine of the underlying subsystem (like network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent Consider for inclusion. Changes from 'take23' patchset: * kevent PIPE notifications * KEVENT_REQ_LAST_CHECK flag, which allows performing a last check at dequeue time * fixed poll/select notifications (were broken due to tree manipulations) * made Documentation/kevent.txt look nice in an 80-col terminal * fix for copy_to_user() failure report for the first kevent (Andrew Morton) * minor function renames Here is the pipe result with the kevent_pipe kernel part and 2000 pipes (Eric Dumazet's application): epoll (edge-triggered): 248408 events/sec kevent (edge-triggered): 269282 events/sec Busy reading loop: 269519 events/sec Changes from 'take22' patchset: * new ring buffer implementation in the process' memory * wakeup-one-thread flag * edge-triggered behaviour With this release an additional independent benchmark shows kevent speed compared to epoll: Eric Dumazet created a special benchmark which creates a set of AF_INET sockets, and two threads start to simultaneously read and write data from/into them. Here are the results: epoll (no EPOLLET): 57428 events/sec kevent (no ET): 59794 events/sec epoll (with EPOLLET): 71000 events/sec kevent (with ET): 78265 events/sec Maximum (busy loop reading events): 88482 events/sec Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespaces and so on) * fixed a bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed artificial limit on possible number of kevents With this release and a fixed userspace web server it was possible to achieve 3960+ req/s with a client connection rate of 4000 con/s over 100 Mbit lan; data IO over the network was about 10582.7 KB/s, which is quite close to wire speed if we take into account headers and the like.
Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use RB tree instead of hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, so it justifies RB tree usage over AVL tree, since the latter does have much slower deletion time (max O(log(N)) compared to 3 ops), although faster search time (1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for now and later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed readiness check for socket notifications. With both above changes it is possible to achieve more than 3380 req/second compared to 2200, sometimes 2500 req/second for epoll() for a trivial web server and httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum allowed kevents in a time limit, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created a special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers, this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added a warning in kevent_add_ukevent() if the ring has a broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either timeout expires or at least one event becomes ready. It also commits that @num events from @start are processed by userspace and thus can be removed or rearmed (depending on its flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not get lock around user data check in __kevent_search() * fail early if there were no registered callbacks for the given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use pointer to kevent_mring instead of unsigned longs * use aligned 64bit type in raw user data (can be used by high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for timeout * put number of milliseconds into timer's return data * move some definitions into user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for ready_callback() callback before main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added helper function for kevent_registered_callbacks * fixed 80-column comment issues * added a header shared between userspace and kernelspace instead of embedding them in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use structure instead - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments!
* do not use list poisoning for detecting that an entry is in the list * return number of ready kevents even if copy*user() fails * strict check for number of kevents in syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed default 1 second timeout * removed AIO stuff from patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so it was taught to differentiate between various cases * remove kevent from storage if it is marked as broken after callback * fixed a typo in the mmaped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() to locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use array of callbacks of each type instead of each kevent callback initialization * changed name of ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use the kevent_user_ctl structure, instead provide needed arguments as syscall parameters * various indent cleanups * added optimisation, which is aimed to help when a lot of kevents are being copied from userspace * mapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use old syscall kevent_ctl for creation/removing, modification and initial kevent initialization - use mutexes instead of semaphores - added file descriptor check and return error if provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take24 1/6] kevent: Description. 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov 2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov 2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik

Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..ca49e4b
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,186 @@
+Description.
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is created by opening "/dev/kevent" char device, which is created with
+dynamic minor number and major number assigned for misc devices.
+
+cmd - is the requested operation. It can be one of the following:
+    KEVENT_CTL_ADD - add event notification
+    KEVENT_CTL_REMOVE - remove event notification
+    KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent *buf, unsigned flags)
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+    waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+    events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+kevent_get_events will wait timeout nanoseconds for at least min_nr completed
+events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+timeout - this timeout specifies number of nanoseconds to wait until there is
+    free space in kevent queue
+
+This syscall waits until either timeout expires or at least one event becomes
+ready. It also copies those num events into the ring buffer and requeues them
+(or removes them, depending on their flags).
+-------------------------------------------------------------------------------
+
+ int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - size of the ring buffer in events
+
+ struct kevent_ring
+ {
+	unsigned int ring_kidx;
+	struct ukevent event[0];
+ }
+
+ring_kidx - is an index in the ring buffer where kernel will put new events
+    when kevent_wait() or kevent_get_events() is called
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) will copy the event into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready, since if it is
+removed, no one cares about it (otherwise the user would wait until it became
+ready and got it the usual way using kevent_get_events() or kevent_wait())
+and thus there is no need to copy it to the ring buffer.
+
+With a userspace ring buffer it is possible that events in the ring buffer
+are replaced without the knowledge of the thread currently reading them
+(when another thread calls kevent_get_events() or kevent_wait()), so
+appropriate locking between threads or processes, which can simultaneously
+access the same ring buffer, is required.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+    Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+    Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+    Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+    Per-event request flags,
+
+    KEVENT_REQ_ONESHOT
+        event will be removed when it is ready
+
+    KEVENT_REQ_WAKEUP_ONE
+        When several threads wait on the same kevent queue and have requested
+        the same event, for example 'wake me up when a new client has
+        connected, so I can call accept()', then all threads will be awakened
+        when a new client has connected, but only one of them can process the
+        data. This problem is known as the thundering herd problem. Events
+        which have this flag set will not be marked as ready (and the
+        appropriate threads will not be awakened) if at least one event has
+        already been marked.
+
+    KEVENT_REQ_ET
+        Edge Triggered behaviour. It is an optimisation which allows a ready
+        and dequeued (i.e. copied to userspace) event to be moved back into
+        the set of interest for the given storage (socket, inode and so on).
+        It is very useful for cases when the same event should be used many
+        times (like reading from a pipe). It is similar to epoll()'s EPOLLET
+        flag.
+
+    KEVENT_REQ_LAST_CHECK
+        if set, allows to perform the last check on a kevent (call the
+        appropriate callback) when the kevent is marked as ready and has been
+        removed from the ready queue. If it is confirmed that the kevent is
+        ready (k->callbacks.callback(k) returns true) then the kevent will be
+        copied to userspace, otherwise it will be requeued back to storage.
+        The second (checking) call is performed with this bit cleared, so the
+        callback can detect whether it was called from kevent_storage_ready()
+        (bit is set) or kevent_dequeue_ready() (bit is cleared). If the kevent
+        is requeued, the bit will be set again.
+
+__u32 ret_flags
+    Per-event return flags
+
+    KEVENT_RET_BROKEN
+        Kevent is broken
+
+    KEVENT_RET_DONE
+        Kevent processing was finished successfully
+
+    KEVENT_RET_COPY_FAILED
+        Kevent was not copied into ring buffer due to some error conditions.
+
+__u32 ret_data[2]
+    Event return data. Event originator fills it with anything it likes (for
+    example timer notifications put the number of milliseconds there when the
+    timer has fired).
+union { __u32 user[2]; void *ptr; }
+    User's data. It is not used, just copied to/from user. The whole structure
+    is aligned to 8 bytes already, so the last union is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, possibly event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the passed
+in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always
+set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+    type - KEVENT_TIMER
+    event - KEVENT_TIMER_FIRED
+    req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+    id.raw[0] - number of seconds after commit when this timer should expire
+    id.raw[1] - number of nanoseconds in addition to the seconds

^ permalink raw reply related [flat|nested] 200+ messages in thread
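To make the documented calling convention concrete, the following is a minimal, untested userspace sketch of the flow above: open the "/dev/kevent" char device, add a one-shot timer with kevent_ctl(KEVENT_CTL_ADD), then collect it with kevent_get_events(). The syscall numbers are the i386 assignments made by this patchset (319/320), and the structure definitions are abbreviated from linux/ukevent.h in the core patch; both assumptions hold only on a kernel with these patches applied.

/* Untested sketch, assuming this patchset is applied. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

/* i386 syscall numbers assigned by this patchset. */
#define __NR_kevent_get_events	319
#define __NR_kevent_ctl		320

/* Constants and structures abbreviated from linux/ukevent.h. */
#define KEVENT_CTL_ADD		0
#define KEVENT_TIMER		2
#define KEVENT_TIMER_FIRED	0x1
#define KEVENT_REQ_ONESHOT	0x1

struct kevent_id {
	union {
		__u32 raw[2];
		__u64 raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;
	__u32 type, event, req_flags, ret_flags;
	__u32 ret_data[2];
	union {
		__u32 user[2];
		void *ptr;
	};
};

int main(void)
{
	struct ukevent uk;
	long err;
	int fd;

	fd = open("/dev/kevent", O_RDWR);
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.req_flags = KEVENT_REQ_ONESHOT;
	uk.id.raw[0] = 2;	/* seconds after commit until expiration */
	uk.id.raw[1] = 0;	/* additional nanoseconds */

	/* Returns the number of immediately ready events, negative on error. */
	err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
	if (err < 0)
		return 1;

	/*
	 * Block for one event with a 5 second timeout. Passing the 64-bit
	 * timeout through syscall(2) is simplified here; it splits into two
	 * arguments on 32-bit ABIs.
	 */
	err = syscall(__NR_kevent_get_events, fd, 1, 1, 5000000000ULL, &uk, 0);
	if (err == 1)
		printf("timer fired, ret_data[0] = %u msec\n", uk.ret_data[0]);

	close(fd);
	return 0;
}

Note that kevent_ctl(KEVENT_CTL_ADD) returning 0 simply means the timer was queued rather than immediately ready; with a ring buffer attached, kevent_wait() (sketched after the core patch below) can take the place of kevent_get_events() as the commit step.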
* [take24 2/6] kevent: Core files. 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..fa8075b 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,7 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_ring_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..95fb252 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,12 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_ring_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..2161ef2 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,14 @@ #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_ring_init 322 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 323 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..3669c0f 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,18 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_ring_init 283 +__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_ring_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..f7cbf6b --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,223 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/fs.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, ring_size; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#ifdef CONFIG_KEVENT_POLL +static inline void kevent_init_file(struct file *file) +{ + kevent_storage_init(file, &file->st); +} + +static inline void kevent_cleanup_file(struct file *file) +{ + kevent_storage_fini(&file->st); +} +#else +static inline void kevent_init_file(struct file *file) {} +static inline void kevent_cleanup_file(struct file *file) {} +#endif + +#ifdef CONFIG_KEVENT_PIPE +extern void kevent_pipe_notify(struct inode *inode, u32 events); +#else +static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. 
struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. */ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..471a685 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..b14e14e --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,165 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Wake up only when event exclusively belongs to this thread, + * for example when several threads are waiting for new client + * connection so they could perform accept() it is a good idea + * to set this flag, so only one thread of all with this flag set + * will be awakened. + * If there are events without this flags, appropriate threads will + * be awakened too. */ +#define KEVENT_REQ_WAKEUP_ONE 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 +/* Perform the last check on kevent (call appropriate callback) when + * kevent is marked as ready and has been removed from ready queue. + * If it will be confirmed that kevent is ready + * (k->callbacks.callback(k) returns true) then kevent will be copied + * to userspace, otherwise it will be requeued back to storage. + * Second (checking) call is performed with this bit _cleared_ so + * callback can detect when it was called from + * kevent_storage_ready() - bit is set, or + * kevent_dequeue_ready() - bit is cleared. + * If kevent will be requeued, bit will be set again. */ +#define KEVENT_REQ_LAST_CHECK 0x8 + +/* + * Kevent return flags. + */ +/* Kevent is broken. 
*/ +#define KEVENT_RET_BROKEN 0x1 +/* Kevent processing was finished successfully. */ +#define KEVENT_RET_DONE 0x2 +/* Kevent was not copied into ring buffer due to some error conditions. */ +#define KEVENT_RET_COPY_FAILED 0x4 + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_PIPE 6 +#define KEVENT_MAX 7 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff +/* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 +/* Empty mask of ready events. */ + +struct kevent_id +{ + union { + __u32 raw[2]; + __u64 raw_u64 __attribute__((aligned(8))); + }; +}; + +struct ukevent +{ + /* Id of this request, e.g. socket number, file descriptor and so on... */ + struct kevent_id id; + /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 type; + /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */ + __u32 event; + /* Per-event request flags */ + __u32 req_flags; + /* Per-event return flags */ + __u32 ret_flags; + /* Event return data. Event originator fills it with anything it likes. */ + __u32 ret_data[2]; + /* User's data. It is not used, just copied to/from user. + * The whole structure is aligned to 8 bytes already, so the last union + * is aligned properly. + */ + union { + __u32 user[2]; + void *ptr; + }; +}; + +struct kevent_ring +{ + unsigned int ring_kidx; + struct ukevent event[0]; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 + +#endif /* __UKEVENT_H */ diff --git a/init/Kconfig b/init/Kconfig index d2eb7a8..c7d8250 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -201,6 +201,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. 
+source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index d62ec66..2d7a6dd 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..267fc53 --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,45 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback + invocations, advanced timer notifications and other kernel + object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents + which are ready immediately at insertion time and number of kevents + which were removed through readiness completion. + It will be printed each time control kevent descriptor is closed. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() + notifications. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, + ready for accept conditions and so on. + +config KEVENT_PIPE + bool "Kernel event notifications for pipes" + depends on KEVENT + help + This option enables notifications through KEVENT subsystem of + pipe read/write operations. diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..d4d6b68 --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,5 @@ +obj-y := kevent.o kevent_user.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o +obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..24ee44a --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,232 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. 
+ * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. + */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (ready_callback) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..00d942a --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,936 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static const char kevent_name[] = "kevent"; +static kmem_cache_t *kevent_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success, + * -EAGAIN if there were no place for that kevent (impossible) + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. + */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + if (++u->kidx >= u->ring_size) + u->kidx = 0; + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return -ENOMEM; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->ring_size = 0; + u->pring = NULL; + + file->private_data = u; + return 0; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. + * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kfree(u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. 
+ */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kevent_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + while (u->ready_num && !k) { + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) { + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + if (!k->callbacks.callback(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + k = NULL; + } + } else + break; + } + + if (k) + kevent_copy_ring_buffer(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. 
+ */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. 
+ */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. + */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. 
+ */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, __u64 timeout, + void __user *buf) +{ + struct kevent *k; + int num = 0; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= min_nr, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; + break; + } + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct miscdevice kevent_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = kevent_name, + .fops = &kevent_user_fops, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - timeout in nanoseconds to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - ununsed for now (will be used for mmap implementation). + */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + __u64 timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf); +out_fput: + fput(file); + return err; +} + +asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + if (u->pring) { + err = -EINVAL; + goto err_out_exit; + } + u->pring = ring; + u->ring_size = num; + mutex_unlock(&u->ring_lock); + + fput(file); + + return 0; + +err_out_exit: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in kevent queue + * and removes/requeues requested number of events (commits them). Function returns + * number of actually committed events. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @timeout - this timeout specifies number of nanoseconds to wait until there is + * free space in kevent queue. + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. 
+ * Kevents will be copied into ring buffer in order they were placed into ready queue. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout) +{ + int err = -EINVAL, committed = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num >= u->ring_size) + goto out_fput; + + if (!(file->f_flags & O_NONBLOCK)) { + wait_event_interruptible_timeout(u->wait, + u->ready_num >= 1, + clock_t_to_jiffies(nsec_to_clock_t(timeout))); + } + + for (i=0; i<num; ++i) { + k = kevent_dequeue_ready(u); + if (!k) + break; + kevent_complete_ready(k); + kevent_stat_ring(u); + committed++; + } + + fput(file); + + return committed; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create kevent cache and register + * filesystem to get control file descriptors from. + */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + err = misc_register(&kevent_miscdev); + if (err) { + printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err); + goto err_out_exit; + } + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..5200583 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_ring_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
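Tying sys_kevent_ring_init() and sys_kevent_wait() above together, here is an untested sketch of the ring-buffer consumption path. The syscall numbers (321/322) are again the i386 assignments from this patchset, RING_SIZE is an arbitrary illustrative choice, struct ukevent is abbreviated from linux/ukevent.h, and a fixed-size array stands in for the event[0] flexible array of the real ABI.

/* Untested ring-buffer sketch, assuming this patchset is applied. */
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/types.h>

#define __NR_kevent_wait	321	/* i386 numbers from this patchset */
#define __NR_kevent_ring_init	322
#define RING_SIZE		256	/* arbitrary illustrative size */

struct kevent_id {
	union {
		__u32 raw[2];
		__u64 raw_u64 __attribute__((aligned(8)));
	};
};

struct ukevent {
	struct kevent_id id;
	__u32 type, event, req_flags, ret_flags;
	__u32 ret_data[2];
	union {
		__u32 user[2];
		void *ptr;
	};
};

struct kevent_ring {
	unsigned int ring_kidx;		/* kernel-maintained write index */
	struct ukevent event[RING_SIZE];
};

static struct kevent_ring ring;
static unsigned int uidx;		/* userspace read index, wraps like kidx */

/* Attach the ring once per queue; the kernel refuses a second ring (-EINVAL). */
static long ring_setup(int fd)
{
	return syscall(__NR_kevent_ring_init, fd, &ring, RING_SIZE);
}

/* Wait up to 1s and commit at most 16 ready events into the ring;
 * sys_kevent_wait() requires num to be smaller than the ring size. */
static long ring_consume(int fd)
{
	long n, i;

	n = syscall(__NR_kevent_wait, fd, 16, 1000000000ULL);
	for (i = 0; i < n; ++i) {
		struct ukevent *uk = &ring.event[uidx];

		/* process uk->type / uk->event / uk->ret_data here */
		if (++uidx >= RING_SIZE)
			uidx = 0;
	}
	return n;
}

Since kevent_copy_ring_buffer() above advances the kernel's write index by one per committed event and wraps it at the ring size, userspace can mirror it with its own read index as sketched; as the documentation patch notes, callers sharing one ring between threads or processes must provide their own locking.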
* [take24 3/6] kevent: poll/select() notifications. 2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov ` (2 more replies) 0 siblings, 3 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works simialr to epoll and has the same issues (callback is invoked not from internal state machine of the caller, but through process awake, a lot of allocations and so on). Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/file_table.c b/fs/file_table.c index bc35a40..0805547 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -20,6 +20,7 @@ #include <linux/capability.h> #include <linux/cdev.h> #include <linux/fsnotify.h> #include <linux/sysctl.h> +#include <linux/kevent.h> #include <linux/percpu_counter.h> #include <asm/atomic.h> @@ -119,6 +120,7 @@ struct file *get_empty_filp(void) f->f_uid = tsk->fsuid; f->f_gid = tsk->fsgid; eventpoll_init_file(f); + kevent_init_file(f); /* f->f_version: 0 */ return f; @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file) * in the file cleanup chain. */ eventpoll_release(file); + kevent_cleanup_file(file); locks_remove_flock(file); if (file->f_op && file->f_op->release) diff --git a/fs/inode.c b/fs/inode.c index ada7643..6745c00 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/pagemap.h> #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ #endif } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 5baf3a1..c529723 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/sched.h> #include <linux/mutex.h> +#include <linux/kevent_storage.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY struct mutex inotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL struct list_head f_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space *f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 0000000..7030d21 --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,228 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + 
return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
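To show how this notification type would be consumed from userspace, a sketch follows, reusing the placeholder syscall wiring from the earlier example. kevent_poll_enqueue() above resolves id.raw[0] with fget() and compares the requested mask against what f_op->poll() reports, so a registration looks like this:

#include <poll.h>	/* POLLIN etc.; KEVENT_POLL reuses the poll mask bits */

/* Hypothetical helper: ask for POLLIN readiness on watched_fd through
 * the kevent control descriptor cfd. */
static int add_poll_watch(int cfd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = watched_fd;	/* fd whose ->poll() is consulted */
	uk.event = POLLIN | POLLRDNORM;	/* mask compared against revents */

	return syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);
}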
* [take24 4/6] kevent: Socket notifications.
  2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-09 8:23 ` Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov
  2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
  2006-11-09 18:51 ` Davide Libenzi
  2 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead
of epoll, its performance increased more than noticeably. More details
about various benchmarks and the server itself (evserver_kevent.c) can
be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..6745c00 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ #endif
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
 
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6
+857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..7f74110 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,135 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ #endif sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/cache.h> #include <linux/jhash.h> #include <linux/init.h> 
#include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ #endif reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/compat.h> #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
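The evserver_kevent.c server is referenced above but not included in the posting. Purely as a sketch, with the same hypothetical syscall wiring as the earlier examples, an accept loop over these notifications might look as follows; handle_client() is an assumed application function:

#include <sys/socket.h>

extern void handle_client(int fd);	/* assumed application function */

/* Hypothetical accept loop: kevent_socket_enqueue() resolves id.raw[0]
 * via sockfd_lookup(), and kevent_socket_callback() maps POLLIN on a
 * listening socket to the ACCEPT/RECV mask. */
static void accept_loop(int cfd, int listen_fd)
{
	struct ukevent uk, ev[16];
	int i, n;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SOCKET;
	uk.id.raw[0] = listen_fd;
	uk.event = KEVENT_SOCKET_ACCEPT;
	syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);

	for (;;) {
		/* wait up to 1s for between 1 and 16 ready events */
		n = syscall(__NR_kevent_get_events, cfd, 1, 16,
			    1000000000ULL, ev, 0);
		for (i = 0; i < n; ++i) {
			int nfd = accept(ev[i].id.raw[0], NULL, NULL);
			if (nfd >= 0)
				handle_client(nfd);
		}
	}
}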
* [take24 5/6] kevent: Timer notifications. 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 2006-11-09 8:23 ` [take24 6/6] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = 
k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
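Tying the above together, a registration sketch for a periodic timer (hypothetical syscall wiring as before; cfd is the control descriptor from the first sketch). As described above, id.raw[0] carries seconds and id.raw[1] nanoseconds; kevent_timer_func() re-arms the hrtimer with the same interval, and kevent_timer_callback() stores jiffies_to_msecs(jiffies) in ret_data[0] at each expiry:

struct ukevent uk;

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_TIMER;
uk.id.raw[0] = 0;			/* seconds */
uk.id.raw[1] = 100 * 1000 * 1000;	/* plus 100 ms, in nanoseconds */
uk.event = KEVENT_TIMER_FIRED;
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);

/* ... consume the periodic KEVENT_TIMER_FIRED events; when done,
 * drop the timer again so kevent_timer_dequeue() cancels the hrtimer: */
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_REMOVE, 1, &uk);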
* [take24 6/6] kevent: Pipe notifications. 2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-09 8:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 8:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/pipe_fs_i.h> #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..32c6f19 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,112 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
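A consumption sketch for this type as well (hypothetical syscall wiring as in the earlier examples). Note that the patch deliberately reuses the KEVENT_SOCKET_RECV/SEND bits for pipes, as the ukevent.h comment change in the following signal patch also reflects:

int pfd[2];
struct ukevent uk;

if (pipe(pfd) < 0)
	return -1;

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_PIPE;
uk.id.raw[0] = pfd[0];		/* kevent_pipe_enqueue() does fget() on this */
uk.event = KEVENT_SOCKET_RECV;	/* "readable": nrbufs > 0 in the callback */
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);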
* Re: [take24 3/6] kevent: poll/select() notifications. 2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-09 9:08 ` Eric Dumazet 2006-11-09 9:29 ` Evgeniy Polyakov 2006-11-09 18:51 ` Davide Libenzi 2 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-09 9:08 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thursday 09 November 2006 09:23, Evgeniy Polyakov wrote: > poll/select() notifications. > > This patch includes generic poll/select notifications. > kevent_poll works simialr to epoll and has the same issues (callback > is invoked not from internal state machine of the caller, but through > process awake, a lot of allocations and so on). > > Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> > > diff --git a/fs/file_table.c b/fs/file_table.c > index bc35a40..0805547 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -20,6 +20,7 @@ #include <linux/capability.h> > #include <linux/cdev.h> > #include <linux/fsnotify.h> > #include <linux/sysctl.h> > +#include <linux/kevent.h> > #include <linux/percpu_counter.h> > > #include <asm/atomic.h> > @@ -119,6 +120,7 @@ struct file *get_empty_filp(void) > f->f_uid = tsk->fsuid; > f->f_gid = tsk->fsgid; > eventpoll_init_file(f); > + kevent_init_file(f); > /* f->f_version: 0 */ > return f; > > @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file) > * in the file cleanup chain. > */ > eventpoll_release(file); > + kevent_cleanup_file(file); > locks_remove_flock(file); > > if (file->f_op && file->f_op->release) > diff --git a/fs/inode.c b/fs/inode.c > index ada7643..6745c00 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -21,6 +21,7 @@ #include <linux/pagemap.h> > #include <linux/cdev.h> > #include <linux/bootmem.h> > #include <linux/inotify.h> > +#include <linux/kevent.h> > #include <linux/mount.h> > > /* > @@ -164,12 +165,18 @@ #endif > } > inode->i_private = 0; > inode->i_mapping = mapping; Here you test both KEVENT_SOCKET and KEVENT_PIPE > +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > + kevent_storage_init(inode, &inode->st); > +#endif > } > return inode; > } > > void destroy_inode(struct inode *inode) > { but here you test only KEVENT_SOCKET > +#if defined CONFIG_KEVENT_SOCKET > + kevent_storage_fini(&inode->st); > +#endif > BUG_ON(inode_has_buffers(inode)); > security_inode_free(inode); > if (inode->i_sb->s_op->destroy_inode) > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 5baf3a1..c529723 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> > #include <linux/init.h> > #include <linux/sched.h> > #include <linux/mutex.h> > +#include <linux/kevent_storage.h> > > #include <asm/atomic.h> > #include <asm/semaphore.h> > @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY > struct mutex inotify_mutex; /* protects the watches list */ > #endif > Here you include a kevent_storage only if KEVENT_SOCKET > +#ifdef CONFIG_KEVENT_SOCKET > + struct kevent_storage st; > +#endif > + > unsigned long i_state; > unsigned long dirtied_when; /* jiffies of first dirtying */ > > @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL > struct list_head f_ep_links; > spinlock_t f_ep_lock; > #endif /* #ifdef CONFIG_EPOLL */ > +#ifdef CONFIG_KEVENT_POLL > + struct kevent_storage st; > +#endif > struct address_space 
*f_mapping; > }; > extern spinlock_t files_lock; ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications. 2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet @ 2006-11-09 9:29 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-09 9:29 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 09, 2006 at 10:08:44AM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > Here you test both KEVENT_SOCKET and KEVENT_PIPE > > > +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > > + kevent_storage_init(inode, &inode->st); > > +#endif > > } > > return inode; > > } > > > > void destroy_inode(struct inode *inode) > > { > > but here you test only KEVENT_SOCKET > > > +#if defined CONFIG_KEVENT_SOCKET > > + kevent_storage_fini(&inode->st); > > +#endif Indeed, it must be #if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE > > BUG_ON(inode_has_buffers(inode)); > > security_inode_free(inode); > > if (inode->i_sb->s_op->destroy_inode) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 5baf3a1..c529723 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -276,6 +276,7 @@ #include <linux/prio_tree.h> > > #include <linux/init.h> > > #include <linux/sched.h> > > #include <linux/mutex.h> > > +#include <linux/kevent_storage.h> > > > > #include <asm/atomic.h> > > #include <asm/semaphore.h> > > @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY > > struct mutex inotify_mutex; /* protects the watches list */ > > #endif > > > > Here you include a kevent_storage only if KEVENT_SOCKET > > > +#ifdef CONFIG_KEVENT_SOCKET > > + struct kevent_storage st; > > +#endif > > + It must be #if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
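One way to keep the two sites from drifting apart again (an editor's sketch, not code from the patchset; the helper names are illustrative) is to centralize the compound condition behind inline helpers with empty stubs, so alloc_inode(), destroy_inode(), and the guard around struct inode's st member all key off a single definition:

/* Sketch for a shared header (e.g. linux/kevent.h). */
#if defined(CONFIG_KEVENT_SOCKET) || defined(CONFIG_KEVENT_PIPE)
#define KEVENT_INODE_STORAGE 1
static inline void kevent_inode_init(struct inode *inode)
{
	kevent_storage_init(inode, &inode->st);
}
static inline void kevent_inode_fini(struct inode *inode)
{
	kevent_storage_fini(&inode->st);
}
#else
static inline void kevent_inode_init(struct inode *inode) {}
static inline void kevent_inode_fini(struct inode *inode) {}
#endif

/* fs/inode.c then calls kevent_inode_init()/kevent_inode_fini()
 * unconditionally, and fs.h guards the st member with
 * #ifdef KEVENT_INODE_STORAGE, so the three sites cannot disagree. */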
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov
  2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
@ 2006-11-09 18:51 ` Davide Libenzi
  2006-11-09 19:10 ` Evgeniy Polyakov
  2 siblings, 1 reply; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 18:51 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> +static int kevent_poll_callback(struct kevent *k)
> +{
> +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> +		return 1;
> +	} else {
> +		struct file *file = k->st->origin;
> +		unsigned int revents = file->f_op->poll(file, NULL);
> +
> +		k->event.ret_data[0] = revents & k->event.event;
> +
> +		return (revents & k->event.event);
> +	}
> +}

You need to be careful that file->f_op->poll is not called inside the
spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
during epoll development days) file->f_op->poll might do a simple
spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
to have a suboptimal double O(R) loop to handle LT events.

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
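A sketch of the failure mode being described here (some_lock and example_poll are illustrative names, not from the patch): spin_unlock_irq() unconditionally re-enables interrupts, so calling such a ->poll() while holding an irqsave lock silently breaks the caller's interrupt state.

static DEFINE_SPINLOCK(some_lock);

/* stand-in for an f_op->poll implementation that is legal on its own */
static unsigned int example_poll(struct file *file, poll_table *pt)
{
	spin_lock_irq(&some_lock);
	/* ... check readiness state ... */
	spin_unlock_irq(&some_lock);	/* unconditionally re-enables IRQs */
	return POLLIN;
}

/* the dangerous pattern (fragment): */
spin_lock_irqsave(&st->lock, flags);	/* caller assumes IRQs stay off  */
revents = file->f_op->poll(file, NULL);	/* IRQs are back on after return */
/* from here until the unlock we hold st->lock with interrupts enabled;
 * if an interrupt handler ever takes st->lock, this can deadlock */
spin_unlock_irqrestore(&st->lock, flags);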
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 18:51 ` Davide Libenzi
@ 2006-11-09 19:10 ` Evgeniy Polyakov
  2006-11-09 19:42 ` Davide Libenzi
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-09 19:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> 
> > +static int kevent_poll_callback(struct kevent *k)
> > +{
> > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > +		return 1;
> > +	} else {
> > +		struct file *file = k->st->origin;
> > +		unsigned int revents = file->f_op->poll(file, NULL);
> > +
> > +		k->event.ret_data[0] = revents & k->event.event;
> > +
> > +		return (revents & k->event.event);
> > +	}
> > +}
> 
> You need to be careful that file->f_op->poll is not called inside the
> spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> during epoll development days) file->f_op->poll might do a simple
> spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> to have a suboptimal double O(R) loop to handle LT events.

It is tricky - users call wake_up() from any context, which in turn ends
up calling kevent_storage_ready(), which calls kevent_poll_callback()
with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
call in the fast path. Since the callback returns 1, the kevent will be
queued into the ready queue, which is processed on behalf of syscalls -
in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
is set, will call the callback again to check if the kevent is correctly
marked, but this time without that flag (it happens in syscall context,
i.e. process context without any locks held), so the callback calls
->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
value, the kevent's data is transferred into userspace, otherwise it is
'requeued' (just removed from the ready queue).

> - Davide
> 

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 19:10 ` Evgeniy Polyakov
@ 2006-11-09 19:42 ` Davide Libenzi
  2006-11-09 20:10 ` Davide Libenzi
  0 siblings, 1 reply; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 19:42 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> > 
> > > +static int kevent_poll_callback(struct kevent *k)
> > > +{
> > > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > +		return 1;
> > > +	} else {
> > > +		struct file *file = k->st->origin;
> > > +		unsigned int revents = file->f_op->poll(file, NULL);
> > > +
> > > +		k->event.ret_data[0] = revents & k->event.event;
> > > +
> > > +		return (revents & k->event.event);
> > > +	}
> > > +}
> > 
> > You need to be careful that file->f_op->poll is not called inside the
> > spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> > during epoll development days) file->f_op->poll might do a simple
> > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> > to have a suboptimal double O(R) loop to handle LT events.
> 
> It is tricky - users call wake_up() from any context, which in turn ends
> up calling kevent_storage_ready(), which calls kevent_poll_callback()
> with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
> call in the fast path. Since the callback returns 1, the kevent will be
> queued into the ready queue, which is processed on behalf of syscalls -
> in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
> is set, will call the callback again to check if the kevent is correctly
> marked, but this time without that flag (it happens in syscall context,
> i.e. process context without any locks held), so the callback calls
> ->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
> value, the kevent's data is transferred into userspace, otherwise it is
> 'requeued' (just removed from the ready queue).

Oh, mine was only a general warning. I hadn't looked at the generic code
before. But now that I poke on it, I see:

void kevent_requeue(struct kevent *k)
{
	unsigned long flags;

	spin_lock_irqsave(&k->st->lock, flags);
	__kevent_requeue(k, 0);
	spin_unlock_irqrestore(&k->st->lock, flags);
}

and then:

static int __kevent_requeue(struct kevent *k, u32 event)
{
	int ret, rem;
	unsigned long flags;

	ret = k->callbacks.callback(k);

Couldn't k->callbacks.callback() possibly end up calling f_op->poll?

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 3/6] kevent: poll/select() notifications.
  2006-11-09 19:42 ` Davide Libenzi
@ 2006-11-09 20:10 ` Davide Libenzi
  0 siblings, 0 replies; 200+ messages in thread
From: Davide Libenzi @ 2006-11-09 20:10 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck,
	Linux Kernel Mailing List, Jeff Garzik

On Thu, 9 Nov 2006, Davide Libenzi wrote:

> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> 
> > On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi (davidel@xmailserver.org) wrote:
> > > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> > > 
> > > > +static int kevent_poll_callback(struct kevent *k)
> > > > +{
> > > > +	if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > > +		return 1;
> > > > +	} else {
> > > > +		struct file *file = k->st->origin;
> > > > +		unsigned int revents = file->f_op->poll(file, NULL);
> > > > +
> > > > +		k->event.ret_data[0] = revents & k->event.event;
> > > > +
> > > > +		return (revents & k->event.event);
> > > > +	}
> > > > +}
> > > 
> > > You need to be careful that file->f_op->poll is not called inside the
> > > spin_lock_irqsave/spin_unlock_irqrestore pair, since (even this came up
> > > during epoll development days) file->f_op->poll might do a simple
> > > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll
> > > to have a suboptimal double O(R) loop to handle LT events.
> > 
> > It is tricky - users call wake_up() from any context, which in turn ends
> > up calling kevent_storage_ready(), which calls kevent_poll_callback()
> > with the KEVENT_REQ_LAST_CHECK bit set, which becomes an almost empty
> > call in the fast path. Since the callback returns 1, the kevent will be
> > queued into the ready queue, which is processed on behalf of syscalls -
> > in that case kevent will check the flag and, since KEVENT_REQ_LAST_CHECK
> > is set, will call the callback again to check if the kevent is correctly
> > marked, but this time without that flag (it happens in syscall context,
> > i.e. process context without any locks held), so the callback calls
> > ->poll(), which can sleep, but it is safe. If ->poll() returns a 'ready'
> > value, the kevent's data is transferred into userspace, otherwise it is
> > 'requeued' (just removed from the ready queue).
> 
> Oh, mine was only a general warning. I hadn't looked at the generic code
> before. But now that I poke on it, I see:
> 
> void kevent_requeue(struct kevent *k)
> {
> 	unsigned long flags;
> 
> 	spin_lock_irqsave(&k->st->lock, flags);
> 	__kevent_requeue(k, 0);
> 	spin_unlock_irqrestore(&k->st->lock, flags);
> }
> 
> and then:
> 
> static int __kevent_requeue(struct kevent *k, u32 event)
> {
> 	int ret, rem;
> 	unsigned long flags;
> 
> 	ret = k->callbacks.callback(k);
> 
> Couldn't k->callbacks.callback() possibly end up calling f_op->poll?

Ack, there is the check for KEVENT_REQ_LAST_CHECK inside the callback.
The problem with f_op->poll was not that it can sleep (not excluded
though) but that some f_op->poll implementations can do a simple
spin_lock_irq/spin_unlock_irq. But from a quick peek your new code seems
fine with that.

- Davide

^ permalink raw reply	[flat|nested] 200+ messages in thread
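Summarizing the two-phase scheme this sub-thread converged on, a simplified sketch of the control flow described above; the helper is hypothetical, not the literal patch code:

/* Phase 1 - wakeup path, possibly irq context, under st->lock:
 * kevent_poll_callback() sees KEVENT_REQ_LAST_CHECK and returns 1
 * immediately, so the kevent is only moved to the ready queue and no
 * f_op->poll() ever runs under the spinlock. */

/* Phase 2 - syscall path, process context, no locks held: */
static int last_check(struct kevent *k)
{
	int ready;

	k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
	ready = k->callbacks.callback(k);	/* now really calls ->poll() */
	k->event.req_flags |= KEVENT_REQ_LAST_CHECK;

	return ready;	/* 0: spurious wakeup, just drop from ready queue */
}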
* [take24 7/6] kevent: signal notifications. 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov 2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-11 17:36 ` Evgeniy Polyakov 2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-11 17:36 UTC (permalink / raw) To: David Miller Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Signals which were requested to be delivered through kevent subsystem must be registered through usual signal() and others syscalls, this option allows alternative delivery. With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of signals, they will not be delivered in a usual way. Kevents for appropriate signals are not copied when process forks, new process must add new kevents after fork(). Mask of signals is copied as before. Test application which registers two signal callbacks for usr1 and usr2 signals and it's deivery through kevent (the former with both callback and kevent notifications, the latter only through kevent) is called signal.c and can be found in archive on project homepage http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/kevent.h b/include/linux/kevent.h index f7cbf6b..e588ae6 100644 --- a/include/linux/kevent.h +++ b/include/linux/kevent.h @@ -28,6 +28,7 @@ #include <linux/wait.h> #include <linux/net.h> #include <linux/rcupdate.h> #include <linux/fs.h> +#include <linux/sched.h> #include <linux/kevent_storage.h> #include <linux/ukevent.h> @@ -220,4 +221,10 @@ #else static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} #endif +#ifdef CONFIG_KEVENT_SIGNAL +extern int kevent_signal_notify(struct task_struct *tsk, int sig); +#else +static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;} +#endif + #endif /* __KEVENT_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index fc4a987..ef38a3c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -80,6 +80,7 @@ #include <linux/param.h> #include <linux/resource.h> #include <linux/timer.h> #include <linux/hrtimer.h> +#include <linux/kevent_storage.h> #include <asm/processor.h> @@ -1013,6 +1014,10 @@ #endif #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif +#ifdef CONFIG_KEVENT_SIGNAL + struct kevent_storage st; + u32 kevent_signals; +#endif }; static inline pid_t process_group(struct task_struct *tsk) diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h index b14e14e..a6038eb 100644 --- a/include/linux/ukevent.h +++ b/include/linux/ukevent.h @@ -68,7 +68,8 @@ #define KEVENT_POLL 3 #define KEVENT_NAIO 4 #define KEVENT_AIO 5 #define KEVENT_PIPE 6 -#define KEVENT_MAX 7 +#define KEVENT_SIGNAL 7 +#define KEVENT_MAX 8 /* * Per-type event sets. @@ -81,7 +82,7 @@ #define KEVENT_MAX 7 #define KEVENT_TIMER_FIRED 0x1 /* - * Socket/network asynchronous IO events. + * Socket/network asynchronous IO and PIPE events. */ #define KEVENT_SOCKET_RECV 0x1 #define KEVENT_SOCKET_ACCEPT 0x2 @@ -115,10 +116,20 @@ #define KEVENT_POLL_POLLREMOVE 0x1000 */ #define KEVENT_AIO_BIO 0x1 -#define KEVENT_MASK_ALL 0xffffffff +/* + * Signal events. 
+ */ +#define KEVENT_SIGNAL_DELIVERY 0x1 + +/* If set in raw64, then given signals will not be delivered + * in a usual way through sigmask update and signal callback + * invokation. */ +#define KEVENT_SIGNAL_NOMASK 0x8000000000000000ULL + /* Mask of all possible event values. */ -#define KEVENT_MASK_EMPTY 0x0 +#define KEVENT_MASK_ALL 0xffffffff /* Empty mask of ready events. */ +#define KEVENT_MASK_EMPTY 0x0 struct kevent_id { diff --git a/kernel/fork.c b/kernel/fork.c index 1c999f3..e5b5b14 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -46,6 +46,7 @@ #include <linux/cn_proc.h> #include <linux/delayacct.h> #include <linux/taskstats_kern.h> #include <linux/random.h> +#include <linux/kevent.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc WARN_ON(atomic_read(&tsk->usage)); WARN_ON(tsk == current); +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_fini(&tsk->st); +#endif security_task_free(tsk); free_uid(tsk->user); put_group_info(tsk->group_info); @@ -1121,6 +1125,10 @@ #endif if (retval) goto bad_fork_cleanup_namespace; +#ifdef CONFIG_KEVENT_SIGNAL + kevent_storage_init(p, &p->st); +#endif + p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL; /* * Clear TID on mm_release()? diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig index 267fc53..4b137ee 100644 --- a/kernel/kevent/Kconfig +++ b/kernel/kevent/Kconfig @@ -43,3 +43,18 @@ config KEVENT_PIPE help This option enables notifications through KEVENT subsystem of pipe read/write operations. + +config KEVENT_SIGNAL + bool "Kernel event notifications for signals" + depends on KEVENT + help + This option enables signal delivery through KEVENT subsystem. + Signals which were requested to be delivered through kevent + subsystem must be registered through usual signal() and others + syscalls, this option allows alternative delivery. + With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of + signals, they will not be delivered in a usual way. + Kevents for appropriate signals are not copied when process forks, + new process must add new kevents after fork(). Mask of signals + is copied as before. + diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile index d4d6b68..f98e0c8 100644 --- a/kernel/kevent/Makefile +++ b/kernel/kevent/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_KEVENT_TIMER) += kevent_tim obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o +obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c new file mode 100644 index 0000000..15f9d1f --- /dev/null +++ b/kernel/kevent/kevent_signal.c @@ -0,0 +1,87 @@ +/* + * kevent_signal.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k->st->origin; + int sig = k->event.id.raw[0]; + int ret = 0; + + if (sig == tsk->kevent_signals) + ret = 1; + + if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK)) + tsk->kevent_signals |= 0x80000000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(¤t->st, k); + if (err) + goto err_out_exit; + + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_exit: + return err; +} + +int kevent_signal_dequeue(struct kevent *k) +{ + kevent_storage_dequeue(k->st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk->kevent_signals = sig; + kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk->kevent_signals & 0x80000000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_signal_callback, + .enqueue = &kevent_signal_enqueue, + .dequeue = &kevent_signal_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SIGNAL); +} +module_init(kevent_init_signal); diff --git a/kernel/signal.c b/kernel/signal.c index fb5da6d..d3d3594 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -23,6 +23,7 @@ #include <linux/syscalls.h> #include <linux/ptrace.h> #include <linux/signal.h> #include <linux/capability.h> +#include <linux/kevent.h> #include <asm/param.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -703,6 +704,9 @@ static int send_signal(int sig, struct s { struct sigqueue * q = NULL; int ret = 0; + + if (kevent_signal_notify(t, sig)) + return 1; /* * fast-pathed signals for kernel-internal things like SIGSTOP @@ -782,6 +786,17 @@ specific_send_sig_info(int sig, struct s ret = send_signal(sig, info, t, &t->pending); if (!ret && !sigismember(&t->blocked, sig)) signal_wake_up(t, sig == SIGKILL); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif out: return ret; } @@ -971,6 +986,17 @@ __group_send_sig_info(int sig, struct si * to avoid several races. */ ret = send_signal(sig, info, p, &p->signal->shared_pending); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif if (unlikely(ret)) return ret; -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
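For illustration, a registration sketch for the interface above (hypothetical syscall wiring as in the earlier examples; sigusr2_handler is an assumed application function). Per the description, id.raw[0] carries the signal number, and setting the KEVENT_SIGNAL_NOMASK bit in id.raw_u64 (which overlays id.raw[] in the kevent_id union) suppresses the normal delivery path:

#include <signal.h>

extern void sigusr2_handler(int sig);	/* assumed application handler */

struct ukevent uk;

/* the signal is still registered the usual way, per the text above */
signal(SIGUSR2, sigusr2_handler);

memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_SIGNAL;
uk.id.raw[0] = SIGUSR2;			/* signal number */
uk.id.raw_u64 |= KEVENT_SIGNAL_NOMASK;	/* deliver only through kevent */
uk.event = KEVENT_SIGNAL_DELIVERY;
syscall(__NR_kevent_ctl, cfd, KEVENT_CTL_ADD, 1, &uk);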
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov
  2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov
  2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov
@ 2006-11-11 22:28 ` Ulrich Drepper
  2006-11-13 10:54 ` Evgeniy Polyakov
  2 siblings, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-11 22:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Andrew Morton, netdev, Zach Brown,
	Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel,
	Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
> [...]

Sorry for the delay again. Kernel work is simply not my highest
priority. I've collected my comments on some parts of the patch. I
haven't gone through every part of the patch yet. Sorry for the length.

===================

- basic ring buffer problem: the kevent_copy_ring_buffer function stores
  the event in the ring buffer with no regard for the current content.

  + if more entries are dequeued than the ring buffer has room for,
    events immediately get overwritten without ever being passed to
    userlevel

  + as with the old approach, the ring buffer is basically unusable with
    multiple threads/processes. A thread calling kevent_wait might cause
    entries another thread is still working on to be overwritten.

  Possible solution:

  a) it would be possible to have a "used" flag in each ring buffer
     entry. That's too expensive, I guess.

  b) kevent_wait needs another parameter which specifies which is the
     last (i.e., least recently added) entry in the ring buffer.
     Everything between this entry and the current head (in ->kidx) is
     occupied. If multiple threads arrive in kevent_wait the highest idx
     (with wrap-around, possibly the lowest) is used. kevent_wait will
     not try to move more entries into the ring buffer if ->kidx and the
     highest index passed in to any kevent_wait call are equal (i.e.,
     the ring buffer is full).

     There is one issue, though, and that is that a system call is
     needed to signal to the kernel that more entries in the ring buffer
     are processed and that they can be refilled. This goes against the
     kernel filling the ring buffer automatically (see below).

  Threads should be able to (not necessarily forced to) use the
  interfaces like this:

  - by default all threads are "parked" in the kevent_wait syscall.
  - if an event occurs one thread might be woken (depending on the 'num'
    parameter)
  - the woken thread(s) work on all the events in the ring buffer and
    then call kevent_wait() again.

  This requires that the threads can independently call kevent_wait()
  and that they can independently retrieve events from the ring buffer
  without fear the entry gets overwritten before it is retrieved.

  Atomically retrieving entries from the ring buffer can be implemented
  at userlevel. Either the ring buffer is writable and a field in each
  ring buffer entry can be used as a 'handled' flag. Obviously this can
  be done with atomic compare-and-exchange. If the ring buffer is not
  writable then, as part of the userlevel wrapper around the event
  handling interfaces, another array is created which contains the use
  flags for each ring buffer entry. This is less elegant and probably
  slower.

===================

- implementing the kevent_wait syscall the proposed way means we are
  missing out on one possible optimization. The ring buffer is currently
  only filled on kevent_wait calls.
  I expect that in really
  high traffic situations requests are coming in at a higher rate than
  they can be processed.  At least for periods of time.  In such
  situations it would be nice to not have to call into the kernel at
  all.  If the kernel would deliver into the ring buffer on its own
  this would be possible.

  If the argument against this is that kevent_get_event should be
  possible the answer is...

===================

- the kevent_get_event syscall is not needed at all.  All reporting
  should be done using a ring buffer.  There really is no reason to
  keep two interfaces around which serve the same purpose.  Making
  the argument that kevent_get_event is so much easier to use is not
  valid.  The exposed interface to access the ring buffer will be easy,
  too.  In the OLS paper I more or less hinted at the interfaces.  I
  think they should be like this (names are irrelevant; a sketch of a
  possible userlevel implementation follows this message):

    ec_t ec_create(unsigned flags);
    int ec_destroy(ec_t ec);
    int ec_poll_event(ec_t ec, event_data_t *d);
    int ec_wait_event(ec_t ec, event_data_t *d);
    int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);

  The latter three interfaces are the interesting ones.  We have to get
  the data out of the ring buffer as quickly as possible.  So the
  interfaces require passing in a reference to an object which can hold
  the data.  The 'poll' variant won't delay, the other two will.

  We need separate create and destroy functions since there will always
  be a userlevel component of the data structures.  The create variant
  can allocate the ring buffer and the other memory needed ('handled'
  flags, tail pointers, ...) and destroy frees all resources.

  These interfaces are fast and easy to use.  At least as easy as the
  kevent_get_event syscall.  And all transparently implemented on top of
  the ring buffer.  So, please let's drop the unneeded syscall.

===================

- another optimization I am thinking about is optimizing thread wakeup
  and ring buffer use for cache-line locality.  I.e., if we know
  an event was queued on a specific CPU then the wakeup function
  should take this into account.  I.e., if any of the threads
  waiting was/will be scheduled on the same CPU it should be
  preferred.

  With the current simple form of a ring buffer this isn't sufficient,
  though.  Reading all entries in the ring buffer until finding the
  one written by the CPU in question is not helpful.  We'd need a
  mechanism to point the thread to the entry in question.  One
  possibility to do this is to return the ring buffer entry as the
  return value of the kevent_wait() syscall.  This works fine if the
  thread only works on one event (which I guess will be 99.999% of
  all uses).  An extension could be to extend the ukevent structure to
  contain an index of the next entry written by the same CPU.

  Another problem this entails is false sharing of the ring buffer
  entries.  This would probably require padding the ukevent structure
  to 64 bytes.  It's not that much more (40 bytes so far), and it's
  also more future-safe.  The alternative is to have per-CPU
  regions in the ring buffer.  With hotplug CPUs this is just plain
  silly.

  I think this optimization has the potential to help quite a bit,
  especially for large machines.

===================

- we absolutely need an interface to signal the kernel that a thread,
  just woken from kevent_wait, cannot handle the events.  I.e., the
  events are in the ring buffer but all the other threads are in the
  kernel in their kevent_wait calls.  The new syscall would wake up
  one or more threads to handle the events.
  This syscall is for instance
  necessary if the thread calling kevent_wait is canceled.  It might
  also be needed when a thread requested more than one event and
  realizes processing an entry takes a long time and that another
  thread might work on the other items in the meantime.

  Al Viro pointed out another possible solution which also could solve
  the "handled" flag problem and concurrency in use of the ring buffer.

  The idea is to require the kevent_wait() syscall to signal which entry
  in the ring buffer is handled or not handled.  This means:

  + the kernel knows at any time which entries in the buffer are free
    and which are not

  + concurrent filling of the ring buffer is no problem anymore since
    entries are not discarded until told

  + by not waiting for an event (num parameter == 0) the syscall can be
    used to discard entries to free up the ring buffer before continuing
    to work on more entries.  And, as per the requirement above, it can
    be used to tell the kernel that certain entries are *NOT* handled
    and need to be sent to another thread.  This would be useful in the
    thread cancellation case.

  This seems like a nice approach.

===================

- why no syscall to create a kevent queue?  With dynamic /dev this might
  be a problem and it's really not much additional code.  What about
  programs which want to use these interfaces before /dev is set up?

===================

- still: the syscall should use a struct timespec* timeout parameter
  and not nanosecs.  There are at least three timeout modes which
  are wanted:

  + relative, unconditionally wait that long

  + relative, aborted in case of large enough settimeofday() or NTP
    adjustment

  + absolute timeout.  Probably even with selecting which clock to use.
    This mode requires a timespec value parameter

  We have all this code already in the futex syscall.  It just needs to
  be generalized or copied and adjusted.

===================

- still: no signal mask parameter in the kevent_wait (and get_event)
  syscall.  Regardless of what one thinks about signals, they are used
  and integrating the kevent interface into existing code requires
  this functionality.  And it's not only about receiving signals.
  The signal mask parameter can also be used to _prevent_ signals from
  being delivered in that time.

===================

- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
  would reverse the default.  I cannot see many places where you want
  all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.

===================

- there is really no reason to invent yet another timer implementation.
  We have the POSIX timers which are feature rich and nicely
  implemented.  All that is needed is to implement SIGEV_KEVENT as a
  notification mechanism.  The timer is registered as part of the
  timer_create() syscall.

===================

I haven't yet looked at the other event sources.  I think the above is
enough for now.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 200+ messages in thread
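[Aside: a minimal userlevel sketch of the ec_* wrappers referenced above, layered over a read-only mapped ring plus a separate array of 'use' flags - the less elegant variant described in the message. Everything here is illustrative: the ring layout, the kev_fill() stand-in for the wait syscall and the flag protocol are assumptions, not an existing API. Advancing the tail and telling the kernel the claimed entries may be reused is exactly the commit problem debated below, so it is left out.]

#include <string.h>

struct ukevent { unsigned char raw[40]; };	/* 40 bytes, as noted above */

typedef struct ukevent event_data_t;

typedef struct {
	const struct ukevent *ring;	/* mapped by ec_create()           */
	unsigned *used;			/* userlevel per-entry 'use' flags */
	unsigned size;			/* number of ring entries          */
	unsigned tail;			/* oldest entry not yet consumed   */
} *ec_t;

/* Stand-in for the syscall: reports the current head index, optionally
 * blocking until it moves.  Purely assumed for this sketch. */
extern int kev_fill(ec_t ec, unsigned *head, int block);

int ec_poll_event(ec_t ec, event_data_t *d)
{
	unsigned head, i;

	if (kev_fill(ec, &head, 0) < 0)
		return -1;
	for (i = ec->tail; i != head; i = (i + 1) % ec->size) {
		/* compare-and-exchange claims the entry for this thread
		 * even when several consumers scan concurrently */
		if (__sync_bool_compare_and_swap(&ec->used[i], 0, 1)) {
			memcpy(d, &ec->ring[i], sizeof(*d));
			return 0;
		}
	}
	return -1;			/* nothing unclaimed */
}

int ec_wait_event(ec_t ec, event_data_t *d)
{
	unsigned head;

	while (ec_poll_event(ec, d) != 0)
		if (kev_fill(ec, &head, 1) < 0)	/* block for new entries */
			return -1;
	return 0;
}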
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
@ 2006-11-13 10:54 ` Evgeniy Polyakov
  2006-11-13 11:16   ` Evgeniy Polyakov
  2006-11-20  0:02   ` Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-13 10:54 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Sat, Nov 11, 2006 at 02:28:53PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >[...]
>
> Sorry for the delay again.  Kernel work is simply not my highest priority.
>
> I've collected my comments on some parts of the patch.  I haven't gone
> through every part of the patch yet.  Sorry for the length.

No problem.

> ===================
>
> - basic ring buffer problem: the kevent_copy_ring_buffer function stores
>   the event in the ring buffer without regard to the current content.
>
>   + if more entries are dequeued than the ring buffer holds, events
>     immediately get overwritten without anything being passed to
>     userlevel
>
>   + as with the old approach, the ring buffer is basically unusable with
>     multiple threads/processes.  A thread calling kevent_wait might
>     cause entries another thread is still working on to be overwritten.
>
>   Possible solutions:
>
>   a) it would be possible to have a "used" flag in each ring buffer entry.
>      That's too expensive, I guess.
>
>   b) kevent_wait needs another parameter which specifies which is the
>      last (i.e., least recently added) entry in the ring buffer.
>      Everything between this entry and the current head (in ->kidx) is
>      occupied.  If multiple threads arrive in kevent_wait the highest idx
>      (with wrap around possibly lowest) is used.
>
>      kevent_wait will not try to move more entries into the ring buffer
>      if ->kidx and the highest index passed in to any kevent_wait call
>      are equal (i.e., the ring buffer is full).
>
>      There is one issue, though, and that is that a system call is needed
>      to signal to the kernel that more entries in the ring buffer are
>      processed and that they can be refilled.  This goes against the
>      kernel filling the ring buffer automatically (see below)

If a thread calls kevent_wait() it means it has processed the previous
entries; one can call kevent_wait() with the $num parameter as zero, which
means that the thread does not want any new events, so nothing will be
copied.

> Threads should be able to (not necessarily forced to) use the
> interfaces like this:
>
> - by default all threads are "parked" in the kevent_wait syscall.
>
> - If an event occurs one thread might be woken (depending on the 'num'
>   parameter)
>
> - the woken thread(s) work on all the events in the ring buffer and
>   then call kevent_wait() again.
>
> This requires that the threads can independently call kevent_wait()
> and that they can independently retrieve events from the ring buffer
> without fear that the entry gets overwritten before it is retrieved.
> Atomically retrieving entries from the ring buffer can be implemented
> at userlevel.  Either the ring buffer is writable and a field in each
> ring buffer entry can be used as a 'handled' flag.  Obviously this can
> be done with atomic compare-and-exchange.
> If the ring buffer is not
> writable then, as part of the userlevel wrapper around the event
> handling interfaces, another array is created which contains the use
> flags for each ring buffer entry.  This is less elegant and probably
> slower.

A writable ring buffer does not sound too good to me - what if one thread
overwrites the whole ring buffer so that the kernel's indexes get screwed
up?

A ring buffer processed not in FIFO order is a wrong idea - the ring buffer
can potentially be very big, and searching there for an entry which has
been marked as 'free' by userspace is not a solution at all - userspace
in that case must provide the ukevent so a fast tree search could be used,
and (although it is already possible) it requires userspace to make
additional syscalls, which is not what we want.

So the kevent ring buffer is designed in the following way: all entries
can be processed _only_ in FIFO order, i.e. they can be read in any order
threads want, but when one thread calls kevent_wait(num), the $num entries
requested from the beginning can be overwritten - the kernel does not know
how many users read those $num events from the beginning, and even if they
had some flag saying 'do not touch me, someone reads me', how and when
would those entries be reused?  The kernel does not store a bitmask or any
other type of object to show that holes in the ring buffer are free - it
works in FIFO order since that is the fastest mode.

As a solution I can create the following scheme (a sketch of the resulting
userspace loop follows this message): there are two syscalls (or one with
a switch) which get events and commit them.

kevent_wait() becomes a syscall which waits until a number of events or
one of them becomes ready and just copies them into the ring buffer and
returns.  kevent_wait() will fail with a special error code when the ring
buffer is full.

kevent_commit() frees the requested number of events _from the beginning_,
i.e. from a special index, visible from userspace.  Userspace can create
special counters for events (and even put them into the read-only ring
buffer overwriting some fields of the kevent, especially if we will
increase its size) and only call kevent_commit() when all events have a
zero usage counter.

I disagree that the possibility of having holes in the ring buffer is a
good idea at all - it requires a much more complex protocol, which will
fill and reuse those holes, and the main disadvantage - it requires
transferring much more information from userspace to kernelspace to free
the ring entry in the hole - in that case it is already possible just to
call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
approach at all.

> ===================
>
> - implementing the kevent_wait syscall the proposed way means we are
>   missing out on one possible optimization.  The ring buffer is
>   currently only filled on kevent_wait calls.  I expect that in really
>   high traffic situations requests are coming in at a higher rate than
>   they can be processed.  At least for periods of time.  In such
>   situations it would be nice to not have to call into the kernel at
>   all.  If the kernel would deliver into the ring buffer on its own
>   this would be possible.

Well, it can be done on behalf of a workqueue or a dedicated thread which
will bring up the appropriate mm context, although it means that userspace
can not handle the load it requested, which is a bad sign...

> If the argument against this is that kevent_get_event should be
> possible the answer is...
>
> ===================
>
> - the kevent_get_event syscall is not needed at all.  All reporting
>   should be done using a ring buffer.  There really is no reason to
>   keep two interfaces around which serve the same purpose.  Making
>   the argument that kevent_get_event is so much easier to use is not
>   valid.  The exposed interface to access the ring buffer will be easy,
>   too.  In the OLS paper I more or less hinted at the interfaces.  I
>   think they should be like this (names are irrelevant):

Well, kevent_get_events() _is_ much easier to use.  And actually, having
only that interface it is possible to implement a ring buffer with any
kind of protocol for controlling it - userspace can have a wrapper
which will call kevent_get_events() with a pointer to the place in the
shared ring buffer where new events should be placed; that wrapper can
handle essentially any kind of flags/parameters which are suitable for
that ring buffer implementation.

But since we started to implement the ring buffer as an additional feature
of kevent, let's find a way all people will be happy with before removing
something which was proven to work correctly.

> ec_t ec_create(unsigned flags);
> int ec_destroy(ec_t ec);
> int ec_poll_event(ec_t ec, event_data_t *d);
> int ec_wait_event(ec_t ec, event_data_t *d);
> int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);
>
> The latter three interfaces are the interesting ones.  We have to get
> the data out of the ring buffer as quickly as possible.  So the
> interfaces require passing in a reference to an object which can hold
> the data.  The 'poll' variant won't delay, the other two will.

The last three are exactly kevent_get_events() with a different set of
parameters - it is possible to get events without sleeping, it is
possible to wait until at least something is ready and it is possible to
sleep for a timeout.

> We need separate create and destroy functions since there will always
> be a userlevel component of the data structures.  The create variant
> can allocate the ring buffer and the other memory needed ('handled'
> flags, tail pointers, ...) and destroy frees all resources.
>
> These interfaces are fast and easy to use.  At least as easy as the
> kevent_get_event syscall.  And all transparently implemented on top of
> the ring buffer.  So, please let's drop the unneeded syscall.

They are all already implemented.  Just all of the above, and it was done
several months ago already.  No need to reinvent what is already there.
Even if we decide to remove kevent_get_events() in favour of a ring
buffer-only implementation, the waiting-for-event syscall will be
essentially kevent_get_events() without a pointer to the place where to
put events.
And I will not repeat that it has been possible (from the beginning, for
about 10 months already) to implement a ring buffer using
kevent_get_events().

I agree that having a special syscall to initialize kevent is a good idea,
and the initial kevent implementation had it, but it was removed due to
API cleanup work by Christoph Hellwig.
So I again see the same problem as several months ago, when many people
have opposite views on the API, and I as the author do not know who is
right...
Can we all agree that an initialization syscall is a good idea?

> ===================
>
> - another optimization I am thinking about is optimizing thread wakeup
>   and ring buffer use for cache-line locality.  I.e., if we know
>   an event was queued on a specific CPU then the wakeup function
>   should take this into account.  I.e., if any of the threads
>   waiting was/will be scheduled on the same CPU it should be
>   preferred.

Do you have _any_ kind of benchmarks with epoll() which would show that
it is feasible?
A ukevent is one cache line (well, 2 cache lines on old CPUs), which can
be set up way too far away from the time when it becomes ready, and the
CPU which originally set it up can be busy, so we will lose performance
waiting until that CPU becomes free instead of running another thread on a
different CPU.
So I'm asking: is there at least some data beyond theoretical thoughts?

> With the current simple form of a ring buffer this isn't sufficient,
> though.  Reading all entries in the ring buffer until finding the
> one written by the CPU in question is not helpful.  We'd need a
> mechanism to point the thread to the entry in question.  One
> possibility to do this is to return the ring buffer entry as the
> return value of the kevent_wait() syscall.  This works fine if the
> thread only works on one event (which I guess will be 99.999% of
> all uses).  An extension could be to extend the ukevent structure to
> contain an index of the next entry written by the same CPU.
>
> Another problem this entails is false sharing of the ring buffer
> entries.  This would probably require padding the ukevent structure
> to 64 bytes.  It's not that much more (40 bytes so far), and it's
> also more future-safe.  The alternative is to have per-CPU
> regions in the ring buffer.  With hotplug CPUs this is just plain
> silly.
>
> I think this optimization has the potential to help quite a bit,
> especially for large machines.

I think again that complete removal of the ring buffer and implementing it
in a userspace wrapper over kevent_get_events() is a good idea.
But probably I'm alone thinking in that direction, so let's think about a
ring buffer in kernelspace.

It is possible to specify a CPU id in the kevent (not in the ukevent, i.e.
not in the structure shared with userspace, but in its kernel
representation), and then check whether the currently active CPU is the
same or not, but what if it is not the same CPU?  Entry order is
important, since applications can take advantage of synchronization, so
the idea of skipping some entries is bad.

> ===================
>
> - we absolutely need an interface to signal the kernel that a thread,
>   just woken from kevent_wait, cannot handle the events.  I.e., the
>   events are in the ring buffer but all the other threads are in the
>   kernel in their kevent_wait calls.  The new syscall would wake up
>   one or more threads to handle the events.
>
>   This syscall is for instance necessary if the thread calling
>   kevent_wait is canceled.  It might also be needed when a thread
>   requested more than one event and realizes processing an entry
>   takes a long time and that another thread might work on the other
>   items in the meantime.

Hmm, send a signal to another thread when glibc cancels the given one...
This problem points me to the idea of a userspace thread implementation I
have in mind, but that is another story.

It is a management task - the kernel should not even know that someone has
died and can not process the events it requested.
Userspace can open a control pipe (and set up a kevent handler for it)
and glibc will write a byte there, thus awakening some other thread.
It can be done in userspace and should be done in userspace.
If you insist I will create userspace kevent handling - userspace will be
able to request kevents and mark them as ready.

> Al Viro pointed out another possible solution which also could solve
> the "handled" flag problem and concurrency in use of the ring buffer.
>
> The idea is to require the kevent_wait() syscall to signal which entry
> in the ring buffer is handled or not handled.
> This means:
>
> + the kernel knows at any time which entries in the buffer are free
>   and which are not
>
> + concurrent filling of the ring buffer is no problem anymore since
>   entries are not discarded until told
>
> + by not waiting for an event (num parameter == 0) the syscall can be
>   used to discard entries to free up the ring buffer before continuing
>   to work on more entries.  And, as per the requirement above, it can
>   be used to tell the kernel that certain entries are *NOT* handled
>   and need to be sent to another thread.  This would be useful in the
>   thread cancellation case.
>
> This seems like a nice approach.

But unfortunately theory and practice are different in the real world.
The kernel can have millions of entries in a _linear_ ring buffer - how do
you think they should be handled without a complex protocol between
userspace and kernelspace?  In that protocol userspace is required to
transfer some information to kernelspace so it could find the entry (i.e.
a per-entry field!), and then it should have a tree or other mechanism to
store free and used chunks of entries...
You probably did not see my network tree allocator patches I posted to the
lkml@, netdev@ and linux-mm@ lists - it is quite a big chunk of code which
handles exactly that, but you do not want to implement it in glibc, I
think...
So, do not overdesign.  And as a side note, btw - _all_ of the above can
be implemented in userspace.

> ===================
>
> - why no syscall to create a kevent queue?  With dynamic /dev this might
>   be a problem and it's really not much additional code.  What about
>   programs which want to use these interfaces before /dev is set up?

It was there - Christoph Hellwig removed it in his API cleanup patch; so
far it was not needed at all (and is not needed for now).
Such an application can create the /dev file by itself if it wants...
Just a thought.

> ===================
>
> - still: the syscall should use a struct timespec* timeout parameter
>   and not nanosecs.  There are at least three timeout modes which
>   are wanted:
>
>   + relative, unconditionally wait that long
>
>   + relative, aborted in case of large enough settimeofday() or NTP
>     adjustment
>
>   + absolute timeout.  Probably even with selecting which clock to use.
>     This mode requires a timespec value parameter
>
>   We have all this code already in the futex syscall.  It just needs to
>   be generalized or copied and adjusted.

Will we discuss it to death?

Kevent does not need an absolute timeout, because the timeout specified
there is always relative to the start of the syscall - it is a timeout
which specifies the maximum time frame the syscall can live.  All such
timeouts _ARE_ relative and should be relative, since that is correct.

> ===================
>
> - still: no signal mask parameter in the kevent_wait (and get_event)
>   syscall.  Regardless of what one thinks about signals, they are used
>   and integrating the kevent interface into existing code requires
>   this functionality.  And it's not only about receiving signals.
>   The signal mask parameter can also be used to _prevent_ signals from
>   being delivered in that time.

I created kevent_signal notifications - they allow the user to set up any
set of signals of interest before the call to kevent_get_events() and
friends.
No need to solve a problem the tactical way when there is a strategic
one - kevent signal notification is the approach which avoids workarounds
for interfaces that cannot handle types of events other than file
descriptors.
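[Aside: to make the absolute-vs-relative timeout point concrete, the userlevel fallback is the conversion below - roughly what the futex-based code in glibc does for absolute waits. It also shows what is lost: a settimeofday() between the conversion and the actual wait goes unnoticed, which is exactly the case an in-kernel absolute mode would close. Sketch only.]

#include <time.h>

/* Turn an absolute CLOCK_REALTIME deadline into the relative timeout a
 * relative-only wait primitive expects.  Returns -1 if the deadline has
 * already passed. */
static int abs_to_rel(const struct timespec *deadline, struct timespec *rel)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	rel->tv_sec = deadline->tv_sec - now.tv_sec;
	rel->tv_nsec = deadline->tv_nsec - now.tv_nsec;
	if (rel->tv_nsec < 0) {
		rel->tv_sec--;
		rel->tv_nsec += 1000000000L;
	}
	return rel->tv_sec < 0 ? -1 : 0;
}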
> ===================
>
> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
>   would reverse the default.  I cannot see many places where you want
>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.

I.e. always wake up only the first thread and, in addition, those threads
which have the specified flag set?  Ok, I will put it into the todo list
for the next release.

> ===================
>
> - there is really no reason to invent yet another timer implementation.
>   We have the POSIX timers which are feature rich and nicely
>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
>   notification mechanism.  The timer is registered as part of the
>   timer_create() syscall.

Feel free to add any interface you like - it is as simple as a call to
kevent_user_add_ukevent() in userspace.

> ===================
>
> I haven't yet looked at the other event sources.  I think the above is
> enough for now.

It looks like you generate ideas (or move them into a different
implementation layer) faster than I implement them :)
And I almost silently stand behind the fact that it is possible to
implement _all_ of the above ring buffer things in userspace with
kevent_get_events(), and this functionality has been there for almost a
year :)

Let's solve problems in order of their appearance - what do you think
about the above interface for the ring buffer?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

-- 
Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
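[Aside: the kevent_wait()/kevent_commit() scheme proposed in the message above, seen from the consuming side. The syscall signatures, error behaviour and ring layout are assumptions drawn from the description, not a settled interface.]

struct ukevent { unsigned char raw[40]; };	/* placeholder entry format */

extern void handle_event(const struct ukevent *ev);

/* Assumed: copies up to 'num' ready events into the mapped ring and
 * returns how many; fails when the ring is full; num == 0 waits without
 * copying anything. */
extern int kevent_wait(int fd, unsigned num, unsigned timeout_msec);
/* Assumed: releases 'num' entries from the beginning (the visible index). */
extern int kevent_commit(int fd, unsigned num);

static void event_loop(int fd, struct ukevent *ring, unsigned size)
{
	unsigned tail = 0;		/* mirrors the kernel's commit index */

	for (;;) {
		int i, n = kevent_wait(fd, size, 1000);
		if (n <= 0)
			continue;	/* timeout, or ring still full */

		for (i = 0; i < n; i++)
			handle_event(&ring[(tail + i) % size]);

		/* FIFO discipline: the whole batch is released from the
		 * beginning in one call, so no holes can appear. */
		kevent_commit(fd, n);
		tail = (tail + n) % size;
	}
}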
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-13 10:54 ` Evgeniy Polyakov
@ 2006-11-13 11:16 ` Evgeniy Polyakov
  2006-11-20  0:02 ` Ulrich Drepper
  1 sibling, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-13 11:16 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 13, 2006 at 01:54:58PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > ===================
> >
> > - there is really no reason to invent yet another timer implementation.
> >   We have the POSIX timers which are feature rich and nicely
> >   implemented.  All that is needed is to implement SIGEV_KEVENT as a
> >   notification mechanism.  The timer is registered as part of the
> >   timer_create() syscall.
>
> Feel free to add any interface you like - it is as simple as a call to
> kevent_user_add_ukevent() in userspace.

... in kernelspace I mean.

-- 
Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-13 10:54 ` Evgeniy Polyakov
  2006-11-13 11:16 ` Evgeniy Polyakov
@ 2006-11-20  0:02 ` Ulrich Drepper
  2006-11-20  8:25   ` Evgeniy Polyakov
  1 sibling, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 0:02 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
>> Possible solutions:
>>
>> a) it would be possible to have a "used" flag in each ring buffer entry.
>>    That's too expensive, I guess.
>>
>> b) kevent_wait needs another parameter which specifies which is the
>>    last (i.e., least recently added) entry in the ring buffer.
>>    Everything between this entry and the current head (in ->kidx) is
>>    occupied.  If multiple threads arrive in kevent_wait the highest idx
>>    (with wrap around possibly lowest) is used.
>>
>>    kevent_wait will not try to move more entries into the ring buffer
>>    if ->kidx and the highest index passed in to any kevent_wait call
>>    are equal (i.e., the ring buffer is full).
>>
>>    There is one issue, though, and that is that a system call is needed
>>    to signal to the kernel that more entries in the ring buffer are
>>    processed and that they can be refilled.  This goes against the
>>    kernel filling the ring buffer automatically (see below)
>
> If a thread calls kevent_wait() it means it has processed the previous
> entries; one can call kevent_wait() with the $num parameter as zero, which
> means that the thread does not want any new events, so nothing will be
> copied.

This doesn't solve the problem.  You could only request new events when
all previously reported events are processed.  Plus: how do you report
events if you don't allow get_event to pass them on?

> A writable ring buffer does not sound too good to me - what if one thread
> overwrites the whole ring buffer so that the kernel's indexes get screwed
> up?

Agreed, there are problems.  This is why I suggested the ring buffer can
be structured.  Parts of it might be read-only, other parts read/write.
I don't necessarily think the 'used' flag is the right way.  And the
front/tail pointer solution seems to be better.

> A ring buffer processed not in FIFO order is a wrong idea

Not necessarily, see my comments about CPU affinity in the previous mail.

> - the ring buffer
> can potentially be very big, and searching there for an entry which has
> been marked as 'free' by userspace is not a solution at all - userspace
> in that case must provide the ukevent so a fast tree search could be used,
> and (although it is already possible) it requires userspace to make
> additional syscalls, which is not what we want.

It is not necessary.  I've proposed to only have a front and tail
pointer.  The tail pointer is maintained by the application and passed
to the kernel explicitly or via shared memory.  The kernel maintains the
front pointer.  No tree needed.

> As a solution I can create the following scheme:
> there are two syscalls (or one with a switch) which get events and
> commit them.
>
> kevent_wait() becomes a syscall which waits until a number of events or
> one of them becomes ready and just copies them into the ring buffer and
> returns.  kevent_wait() will fail with a special error code when the ring
> buffer is full.
>
> kevent_commit() frees the requested number of events _from the beginning_,
> i.e. from a special index, visible from userspace.  Userspace can create
> special counters for events (and even put them into the read-only ring
> buffer overwriting some fields of the kevent, especially if we will
> increase its size) and only call kevent_commit() when all events have a
> zero usage counter.

Right, that's basically the front/tail pointer implementation.  That
would work.  You just have to make sure that the kevent_wait() call
takes the current front pointer/index as a parameter.  This way if the
buffer gets filled between the thread checking the ring buffer (and
finding it empty) and the syscall being handled the thread is not
suspended.

> I disagree that the possibility of having holes in the ring buffer is a
> good idea at all - it requires a much more complex protocol, which will
> fill and reuse those holes, and the main disadvantage - it requires
> transferring much more information from userspace to kernelspace to free
> the ring entry in the hole - in that case it is already possible just to
> call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
> approach at all.

Well, it would require more data transport if we'd use writable shared
memory.  But I agree, it's far too complicated and might not scale with
growing ring buffer sizes.

>> - implementing the kevent_wait syscall the proposed way means we are
>>   missing out on one possible optimization.  The ring buffer is
>>   currently only filled on kevent_wait calls.  I expect that in really
>>   high traffic situations requests are coming in at a higher rate than
>>   they can be processed.  At least for periods of time.  In such
>>   situations it would be nice to not have to call into the kernel at
>>   all.  If the kernel would deliver into the ring buffer on its own
>>   this would be possible.
>
> Well, it can be done on behalf of a workqueue or a dedicated thread which
> will bring up the appropriate mm context,

I think it should be done.  It's potentially a huge advantage.

> although it means that userspace
> can not handle the load it requested, which is a bad sign...

I don't understand.  What is not supposed to work?  There is nothing
which cannot work with automatic posting since the get_event() call does
nothing but copy the event data over and wake a thread.

>> - the kevent_get_event syscall is not needed at all.  All reporting
>>   should be done using a ring buffer.  There really is no reason to
>>   keep two interfaces around which serve the same purpose.  Making
>>   the argument that kevent_get_event is so much easier to use is not
>>   valid.  The exposed interface to access the ring buffer will be easy,
>>   too.  In the OLS paper I more or less hinted at the interfaces.  I
>>   think they should be like this (names are irrelevant):
>
> Well, kevent_get_events() _is_ much easier to use.  And actually, having
> only that interface it is possible to implement a ring buffer with any
> kind of protocol for controlling it - userspace can have a wrapper
> which will call kevent_get_events() with a pointer to the place in the
> shared ring buffer where new events should be placed; that wrapper can
> handle essentially any kind of flags/parameters which are suitable
> for that ring buffer implementation.

That's far too slow.  The whole point behind the ring buffer is speed.
And emulation would defeat the purpose.

> But since we started to implement the ring buffer as an additional
> feature of kevent, let's find a way all people will be happy with before
> removing something which was proven to work correctly.
The get_event interface is basically the userlevel interface the runtime
(glibc probably) would provide.  Programmers don't see the complexity.

I'm concerned about the get_event interface holding the kernel
implementation back.  For instance, automatically filling the ring
buffer.  This would not be possible if the program is free to mix
kevent_get_event and kevent_wait calls freely.  If you do away with the
get_event syscall the automatic ring buffer filling is possible and a
logical extension.

> The last three are exactly kevent_get_events() with a different set of
> parameters - it is possible to get events without sleeping, it is
> possible to wait until at least something is ready and it is possible to
> sleep for a timeout.

Exactly.  But these interfaces should be implemented at userlevel, not
at the syscall level.  It's not necessary.  The kernel interface should
be kept as small as possible and the get_event syscall is pure
duplication.

> They are all already implemented.  Just all of the above, and it was done
> several months ago already.  No need to reinvent what is already there.
> Even if we decide to remove kevent_get_events() in favour of a ring
> buffer-only implementation, the waiting-for-event syscall will be
> essentially kevent_get_events() without a pointer to the place where to
> put events.

Right, but this limitation of the interface is important.  It means the
interface of the kernel is smaller: fewer possibilities for problems and
fewer constraints if in future something should be changed (and smaller
kernel).

> I agree that having a special syscall to initialize kevent is a good
> idea, and the initial kevent implementation had it, but it was removed
> due to API cleanup work by Christoph Hellwig.

Well, he is wrong.  If, for instance, init or any of the programs which
start first wants to use the syscall it couldn't because /dev isn't
mounted.  The program might use libraries and therefore not have any
influence on whether the kevent stuff is used or not.

Yes, the /dev interface is useful for some/many other kernel interfaces.
But this is a core interface.  For the same reason epoll_create is a
syscall.

> Do you have _any_ kind of benchmarks with epoll() which would show that
> it is feasible?  A ukevent is one cache line (well, 2 cache lines on old
> CPUs), which can be set up way too far away from the time when it becomes
> ready, and the CPU which originally set it up can be busy, so we will lose
> performance waiting until that CPU becomes free instead of running another
> thread on a different CPU.

If the period between the generation of the event (e.g., incoming
network traffic or sent data) and the delivery of the event by waking a
thread is too long, it does not make much sense.  But if the L2 cache
hasn't been flushed it might be a big advantage.

I think it's reasonable to only have the last queued entry for a CPU
handled specially.  And note, this is only ever a hint.  If an event
entry was created by the kernel on one CPU but none of the threads which
wait to be woken is on that CPU, nothing has to be done.

No, I don't have a benchmark.  But it is likely quite easily possible to
create a synthetic benchmark.  Maybe with pipes.

> It is possible to specify a CPU id in the kevent (not in the ukevent,
> i.e. not in the structure shared with userspace, but in its kernel
> representation), and then check whether the currently active CPU is the
> same or not, but what if it is not the same CPU?

Nothing special.  It's up to the userlevel wrapper code.  The CPU number
would only be a hint.
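[Aside: the synthetic pipe benchmark suggested above is easy to improvise with existing interfaces. The sketch below measures average wakeup-to-read latency through epoll; pinning the writer and the waiter to the same or to different CPUs with sched_setaffinity() (omitted for brevity) would give the same-CPU vs. cross-CPU comparison the argument needs.]

#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/epoll.h>

int main(void)
{
	int p[2], ep, i;
	struct epoll_event ev = { .events = EPOLLIN };
	long long total = 0;
	const int iters = 10000;

	if (pipe(p) < 0 || (ep = epoll_create(1)) < 0)
		return 1;
	epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);

	if (fork() == 0) {			/* writer: timestamps events */
		struct timespec ts;
		for (i = 0; i < iters; i++) {
			clock_gettime(CLOCK_MONOTONIC, &ts);
			write(p[1], &ts, sizeof(ts));
			usleep(100);
		}
		_exit(0);
	}

	for (i = 0; i < iters; i++) {		/* waiter: sleeps in epoll */
		struct timespec sent, now;
		epoll_wait(ep, &ev, 1, -1);
		read(p[0], &sent, sizeof(sent));
		clock_gettime(CLOCK_MONOTONIC, &now);
		total += (now.tv_sec - sent.tv_sec) * 1000000000LL +
			 (now.tv_nsec - sent.tv_nsec);
	}
	printf("avg wakeup latency: %lld ns\n", total / iters);
	return 0;
}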
> Entry order is important, since applications can take advantage of
> synchronization, so the idea of skipping some entries is bad.

That's something the application should make a call about.  It's not
always (or even mostly) the case that the ordering of the notification
is important.  Furthermore, this would also require the kernel to
enforce an ordering.  This is expensive on SMP machines.  A locally
generated event (i.e., source and the thread reporting the event) can be
delivered faster than an event created on another CPU.

> It is a management task - the kernel should not even know that someone
> has died and can not process the events it requested.

But the kernel has to be involved.

> Userspace can open a control pipe (and set up a kevent handler for it)
> and glibc will write a byte there, thus awakening some other thread.
> It can be done in userspace and should be done in userspace.

That's invasive.  The problem is that no userlevel interface should have
to implicitly keep file descriptors open.  This would mean the
application would be influenced since suddenly a file descriptor is not
available anymore.  Yes, applications shouldn't care but they
unfortunately sometimes do.

> Will we discuss it to death?
>
> Kevent does not need an absolute timeout.

Of course it does.  Just because you don't see a need for it for your
applications right now it doesn't mean it's not a valid use.

> Because the timeout specified there is always relative to the start of
> the syscall - it is a timeout which specifies the maximum time frame the
> syscall can live.

That's your current implementation.  There is absolutely no reason
whatsoever why this couldn't be changed.

> I created kevent_signal notifications - they allow the user to set up
> any set of signals of interest before the call to kevent_get_events()
> and friends.
>
> No need to solve a problem the tactical way when there is a strategic
> one

Of course there is a need and I explained it before.  Getting signal
notifications is in no way the same as changing the signal mask
temporarily.  You cannot correctly emulate the case where you want to
block a signal while in the call and reenable it afterwards.  Receiving
the signal as an event and then artificially raising it is not the same.
Especially timing-wise, the signal kevent might not be seen until long
after the syscall returns because other entries are worked on first.

The opposite case is equally impossible to emulate: unblocking a signal
just for the duration of the syscall.  These are all possible and used
cases.

>> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
>>   would reverse the default.  I cannot see many places where you want
>>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.
>
> I.e. always wake up only the first thread and, in addition, those threads
> which have the specified flag set?  Ok, I will put it into the todo list
> for the next release.

It's a flag for an event.  So the threads won't have the flag set.  If
an event is delivered with the flag set, wake all threads.  Otherwise
just one.

>> - there is really no reason to invent yet another timer implementation.
>>   We have the POSIX timers which are feature rich and nicely
>>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
>>   notification mechanism.  The timer is registered as part of the
>>   timer_create() syscall.
>
> Feel free to add any interface you like - it is as simple as a call to
> kevent_user_add_ukevent() in userspace.

No, that's not what I mean.  There is no need for the special
timer-related part of your patch.
Instead the existing POSIX timer
syscalls should be modified to handle SIGEV_KEVENT notification.  Again,
keep the interface as small as possible.  Plus, the POSIX timer
interface is very flexible.  You don't want to duplicate all that
functionality.

> And I almost silently stand behind the fact that it is possible to
> implement _all_ of the above ring buffer things in userspace with
> kevent_get_events(), and this functionality has been there for almost a
> year :)

Again, this defeats the purpose completely.  The ring buffer is the
faster interface, especially when coupled with asynchronous filling of
the ring buffer (i.e., without a syscall).

> Let's solve problems in order of their appearance - what do you think
> about the above interface for the ring buffer?

Looks better, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply	[flat|nested] 200+ messages in thread
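[Aside: what the SIGEV_KEVENT suggestion amounts to for applications, assuming the notify type were wired into the existing POSIX timer syscalls. SIGEV_KEVENT does not exist in Linux (FreeBSD has an analogous one for kqueue), so the constant and the reuse of sigev_signo to carry the kevent queue descriptor are inventions for illustration. Everything else is the stock timer_create()/timer_settime() API - which is the point: no new timer syscalls are needed.]

#include <string.h>
#include <signal.h>
#include <time.h>

#define SIGEV_KEVENT	4		/* hypothetical, not a real value */

static int arm_kevent_timer(int kevent_fd, timer_t *t)
{
	struct sigevent sev;
	struct itimerspec its = {
		.it_value    = { .tv_sec = 1 },	/* first expiry in 1s */
		.it_interval = { .tv_sec = 1 },	/* then every second  */
	};

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_KEVENT;	/* deliver via kevent queue    */
	sev.sigev_signo = kevent_fd;		/* fd smuggled here (made up)  */

	if (timer_create(CLOCK_MONOTONIC, &sev, t) < 0)
		return -1;
	return timer_settime(*t, 0, &its, NULL);
}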
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  0:02 ` Ulrich Drepper
@ 2006-11-20  8:25 ` Evgeniy Polyakov
  2006-11-20  8:43   ` Andrew Morton
  2006-11-20 20:29   ` Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 8:25 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig,
    Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >> Possible solutions:
> >>
> >> a) it would be possible to have a "used" flag in each ring buffer entry.
> >>    That's too expensive, I guess.
> >>
> >> b) kevent_wait needs another parameter which specifies which is the
> >>    last (i.e., least recently added) entry in the ring buffer.
> >>    Everything between this entry and the current head (in ->kidx) is
> >>    occupied.  If multiple threads arrive in kevent_wait the highest idx
> >>    (with wrap around possibly lowest) is used.
> >>
> >>    kevent_wait will not try to move more entries into the ring buffer
> >>    if ->kidx and the highest index passed in to any kevent_wait call
> >>    are equal (i.e., the ring buffer is full).
> >>
> >>    There is one issue, though, and that is that a system call is needed
> >>    to signal to the kernel that more entries in the ring buffer are
> >>    processed and that they can be refilled.  This goes against the
> >>    kernel filling the ring buffer automatically (see below)
> >
> > If a thread calls kevent_wait() it means it has processed the previous
> > entries; one can call kevent_wait() with the $num parameter as zero, which
> > means that the thread does not want any new events, so nothing will be
> > copied.
>
> This doesn't solve the problem.  You could only request new events when
> all previously reported events are processed.  Plus: how do you report
> events if you don't allow get_event to pass them on?

Userspace should itself maintain ordering and the ability to get events
in this implementation; the kernel just returns the events which were
requested.

> > A writable ring buffer does not sound too good to me - what if one thread
> > overwrites the whole ring buffer so that the kernel's indexes get screwed
> > up?
>
> Agreed, there are problems.  This is why I suggested the ring buffer can
> be structured.  Parts of it might be read-only, other parts
> read/write.  I don't necessarily think the 'used' flag is the right way.
> And the front/tail pointer solution seems to be better.
>
> > A ring buffer processed not in FIFO order is a wrong idea
>
> Not necessarily, see my comments about CPU affinity in the previous mail.
>
> > - the ring buffer
> > can potentially be very big, and searching there for an entry which has
> > been marked as 'free' by userspace is not a solution at all - userspace
> > in that case must provide the ukevent so a fast tree search could be used,
> > and (although it is already possible) it requires userspace to make
> > additional syscalls, which is not what we want.
>
> It is not necessary.  I've proposed to only have a front and tail
> pointer.  The tail pointer is maintained by the application and passed
> to the kernel explicitly or via shared memory.  The kernel maintains the
> front pointer.  No tree needed.

There was such an implementation (in a previous patchset) - since no one
commented, I changed it.

> > As a solution I can create the following scheme:
> > there are two syscalls (or one with a switch) which get events and
> > commit them.
> >
> > kevent_wait() becomes a syscall which waits until a number of events or
> > one of them becomes ready and just copies them into the ring buffer and
> > returns.  kevent_wait() will fail with a special error code when the ring
> > buffer is full.
> >
> > kevent_commit() frees the requested number of events _from the beginning_,
> > i.e. from a special index, visible from userspace.  Userspace can create
> > special counters for events (and even put them into the read-only ring
> > buffer overwriting some fields of the kevent, especially if we will
> > increase its size) and only call kevent_commit() when all events have a
> > zero usage counter.
>
> Right, that's basically the front/tail pointer implementation.  That
> would work.  You just have to make sure that the kevent_wait() call
> takes the current front pointer/index as a parameter.  This way if the
> buffer gets filled between the thread checking the ring buffer (and
> finding it empty) and the syscall being handled the thread is not
> suspended.

That is exactly how the previous ring buffer (in a mapped area though)
was implemented.
I think I need to quickly set up my slightly used (bought on ebay) but
still working mind reader; I will try to tune it to work with your brain
waves so next time I would not spend weeks changing something which could
be reused, while others keep silent :)

> > I disagree that the possibility of having holes in the ring buffer is a
> > good idea at all - it requires a much more complex protocol, which will
> > fill and reuse those holes, and the main disadvantage - it requires
> > transferring much more information from userspace to kernelspace to free
> > the ring entry in the hole - in that case it is already possible just to
> > call kevent_ctl(KEVENT_REMOVE) and not wash the brain with a new
> > approach at all.
>
> Well, it would require more data transport if we'd use writable shared
> memory.  But I agree, it's far too complicated and might not scale with
> growing ring buffer sizes.
>
> >> - implementing the kevent_wait syscall the proposed way means we are
> >>   missing out on one possible optimization.  The ring buffer is
> >>   currently only filled on kevent_wait calls.  I expect that in really
> >>   high traffic situations requests are coming in at a higher rate than
> >>   they can be processed.  At least for periods of time.  In such
> >>   situations it would be nice to not have to call into the kernel at
> >>   all.  If the kernel would deliver into the ring buffer on its own
> >>   this would be possible.
> >
> > Well, it can be done on behalf of a workqueue or a dedicated thread which
> > will bring up the appropriate mm context,
>
> I think it should be done.  It's potentially a huge advantage.
>
> > although it means that userspace
> > can not handle the load it requested, which is a bad sign...
>
> I don't understand.  What is not supposed to work?  There is nothing
> which cannot work with automatic posting since the get_event() call does
> nothing but copy the event data over and wake a thread.

If userspace is too slow to get events, the dedicated thread or workqueue
will be busy doing unneeded work, although it can help smooth out peaks
in the load.

> >> - the kevent_get_event syscall is not needed at all.  All reporting
> >>   should be done using a ring buffer.  There really is no reason to
> >>   keep two interfaces around which serve the same purpose.  Making
> >>   the argument that kevent_get_event is so much easier to use is not
> >>   valid.  The exposed interface to access the ring buffer will be easy,
> >>   too.  In the OLS paper I more or less hinted at the interfaces.  I
> >>   think they should be like this (names are irrelevant):
> >
> > Well, kevent_get_events() _is_ much easier to use.  And actually, having
> > only that interface it is possible to implement a ring buffer with any
> > kind of protocol for controlling it - userspace can have a wrapper
> > which will call kevent_get_events() with a pointer to the place in the
> > shared ring buffer where new events should be placed; that wrapper can
> > handle essentially any kind of flags/parameters which are suitable
> > for that ring buffer implementation.
>
> That's far too slow.  The whole point behind the ring buffer is speed.
> And emulation would defeat the purpose.

It was an example; I do not say a ring buffer maintained in kernelspace
is a bad idea.
Actually it is possible to create several threads which will only read
events into the buffer, to be processed by some pool of 'working'
threads (a sketch follows below).  There are a lot of possibilities to
work with only one syscall and create a scalable system.

> > But since we started to implement the ring buffer as an additional
> > feature of kevent, let's find a way all people will be happy with before
> > removing something which was proven to work correctly.
>
> The get_event interface is basically the userlevel interface the runtime
> (glibc probably) would provide.  Programmers don't see the complexity.
>
> I'm concerned about the get_event interface holding the kernel
> implementation back.  For instance, automatically filling the ring
> buffer.  This would not be possible if the program is free to mix
> kevent_get_event and kevent_wait calls freely.  If you do away with the
> get_event syscall the automatic ring buffer filling is possible and a
> logical extension.

Yes, that is why only one should be used.  If there are several threads,
then the ring buffer implementation should be used, otherwise just
kevent_get_events().
In theory yes, an access library like glibc can provide a
kevent_get_events() which will read events from the ring buffer, but
there is no such call right now, so the kernel's kevent_get_events()
looks reasonable.

> > The last three are exactly kevent_get_events() with a different set of
> > parameters - it is possible to get events without sleeping, it is
> > possible to wait until at least something is ready and it is possible to
> > sleep for a timeout.
>
> Exactly.  But these interfaces should be implemented at userlevel, not
> at the syscall level.  It's not necessary.  The kernel interface should
> be kept as small as possible and the get_event syscall is pure
> duplication.

I would say that the ring-buffer manipulating syscalls are the
duplication, but it is just a matter of view :)

> > They are all already implemented.  Just all of the above, and it was done
> > several months ago already.  No need to reinvent what is already there.
> > Even if we decide to remove kevent_get_events() in favour of a ring
> > buffer-only implementation, the waiting-for-event syscall will be
> > essentially kevent_get_events() without a pointer to the place where to
> > put events.
>
> Right, but this limitation of the interface is important.  It means the
> interface of the kernel is smaller: fewer possibilities for problems and
> fewer constraints if in future something should be changed (and smaller
> kernel).

Ok, let's see the ring buffer implementation right now, and then we will
decide whether we want to remove kevent_get_events() or stay with it.
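[Aside: the 'dedicated reader thread plus pool of workers' arrangement mentioned above is ordinary producer/consumer plumbing. In this sketch the kevent_get_events() signature is assumed from the patchset documentation, and double buffering is reduced to a single shared batch for brevity.]

#include <pthread.h>

struct ukevent { unsigned char raw[40]; };	/* placeholder entry format */

extern void handle_event(const struct ukevent *ev);
/* Assumed signature, per the patchset documentation. */
extern int kevent_get_events(int fd, unsigned min_nr, unsigned max_nr,
			     unsigned long long timeout_ns,
			     struct ukevent *buf, unsigned flags);

#define BATCH 256
static struct ukevent batch[BATCH];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int avail;			/* entries workers have not taken yet */

void *reader(void *arg)			/* the one syscall-making thread */
{
	int fd = *(int *)arg;

	for (;;) {
		/* big batches amortize the syscall cost, as argued above */
		int n = kevent_get_events(fd, 1, BATCH, 1000000000ULL,
					  batch, 0);
		if (n <= 0)
			continue;
		pthread_mutex_lock(&lock);
		avail = n;
		pthread_cond_broadcast(&more);
		while (avail > 0)	/* don't refill until batch is drained */
			pthread_cond_wait(&drained, &lock);
		pthread_mutex_unlock(&lock);
	}
}

void *worker(void *arg)			/* one thread of the processing pool */
{
	(void)arg;
	for (;;) {
		struct ukevent ev;

		pthread_mutex_lock(&lock);
		while (avail == 0)
			pthread_cond_wait(&more, &lock);
		ev = batch[--avail];
		if (avail == 0)
			pthread_cond_signal(&drained);
		pthread_mutex_unlock(&lock);
		handle_event(&ev);
	}
}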
> > I agree that having a special syscall to initialize kevent is a good
> > idea, and the initial kevent implementation had it, but it was removed
> > due to API cleanup work by Christoph Hellwig.
>
> Well, he is wrong.  If, for instance, init or any of the programs which
> start first wants to use the syscall it couldn't because /dev isn't
> mounted.  The program might use libraries and therefore not have any
> influence on whether the kevent stuff is used or not.
>
> Yes, the /dev interface is useful for some/many other kernel interfaces.
> But this is a core interface.  For the same reason epoll_create is a
> syscall.

Ok, I will create an initialization syscall.

> > Do you have _any_ kind of benchmarks with epoll() which would show that
> > it is feasible?  A ukevent is one cache line (well, 2 cache lines on old
> > CPUs), which can be set up way too far away from the time when it becomes
> > ready, and the CPU which originally set it up can be busy, so we will lose
> > performance waiting until that CPU becomes free instead of running another
> > thread on a different CPU.
>
> If the period between the generation of the event (e.g., incoming
> network traffic or sent data) and the delivery of the event by waking a
> thread is too long, it does not make much sense.  But if the L2 cache
> hasn't been flushed it might be a big advantage.
>
> I think it's reasonable to only have the last queued entry for a CPU
> handled specially.  And note, this is only ever a hint.  If an event
> entry was created by the kernel on one CPU but none of the threads which
> wait to be woken is on that CPU, nothing has to be done.
>
> No, I don't have a benchmark.  But it is likely quite easily possible to
> create a synthetic benchmark.  Maybe with pipes.
>
> > It is possible to specify a CPU id in the kevent (not in the ukevent,
> > i.e. not in the structure shared with userspace, but in its kernel
> > representation), and then check whether the currently active CPU is the
> > same or not, but what if it is not the same CPU?
>
> Nothing special.  It's up to the userlevel wrapper code.  The CPU number
> would only be a hint.
>
> > Entry order is important, since applications can take advantage of
> > synchronization, so the idea of skipping some entries is bad.
>
> That's something the application should make a call about.  It's not
> always (or even mostly) the case that the ordering of the notification
> is important.  Furthermore, this would also require the kernel to
> enforce an ordering.  This is expensive on SMP machines.  A locally
> generated event (i.e., source and the thread reporting the event) can be
> delivered faster than an event created on another CPU.

How come?  If a signal was delivered before data arrived, userspace
should get the signal before the data - that is the rule.  Ordering is
maintained not for event insertion, but for marking events ready - it is
atomic, so whichever event is marked ready first will be read first from
the ready queue.

> > It is a management task - the kernel should not even know that someone
> > has died and can not process the events it requested.
>
> But the kernel has to be involved.
>
> > Userspace can open a control pipe (and set up a kevent handler for it)
> > and glibc will write a byte there, thus awakening some other thread.
> > It can be done in userspace and should be done in userspace.
>
> That's invasive.  The problem is that no userlevel interface should have
> to implicitly keep file descriptors open.  This would mean the
> application would be influenced since suddenly a file descriptor is not
> available anymore.
> Yes, applications shouldn't care but they
> unfortunately sometimes do.

Then I propose userspace notifications - each new thread can register
'wake me up when userspace event 1 is ready' and 'event 1' will be marked
as ready by glibc when it removes the thread.

> > Will we discuss it to death?
> >
> > Kevent does not need an absolute timeout.
>
> Of course it does.  Just because you don't see a need for it for your
> applications right now it doesn't mean it's not a valid use.

Please explain why glibc AIO uses relative timeouts then :)

> > Because the timeout specified there is always relative to the start of
> > the syscall - it is a timeout which specifies the maximum time frame the
> > syscall can live.
>
> That's your current implementation.  There is absolutely no reason
> whatsoever why this couldn't be changed.

It has nothing to do with the implementation - it is logic.  Something
starts and has a maximum lifetime; it is not that something starts and
should be stopped on Jan 1, 2008.  In the latter case one can set up a
timer, but it does not allow specifying a maximum lifetime.
If the glibc POSIX sleeping functions convert relative AIO timeouts into
absolute ones, it does not mean everything should do it.  It is just not
needed.

> > I created kevent_signal notifications - they allow the user to set up
> > any set of signals of interest before the call to kevent_get_events()
> > and friends.
> >
> > No need to solve a problem the tactical way when there is a strategic
> > one
>
> Of course there is a need and I explained it before.  Getting signal
> notifications is in no way the same as changing the signal mask
> temporarily.  You cannot correctly emulate the case where you want to
> block a signal while in the call and reenable it afterwards.  Receiving
> the signal as an event and then artificially raising it is not the same.
> Especially timing-wise, the signal kevent might not be seen until long
> after the syscall returns because other entries are worked on first.
>
> The opposite case is equally impossible to emulate: unblocking a signal
> just for the duration of the syscall.  These are all possible and used
> cases.

Add and remove the appropriate kevent - it is as simple as a call to one
function.

> >> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
> >>   would reverse the default.  I cannot see many places where you want
> >>   all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.
> >
> > I.e. always wake up only the first thread and, in addition, those threads
> > which have the specified flag set?  Ok, I will put it into the todo list
> > for the next release.
>
> It's a flag for an event.  So the threads won't have the flag set.  If
> an event is delivered with the flag set, wake all threads.  Otherwise
> just one.

Ok.

> >> - there is really no reason to invent yet another timer implementation.
> >>   We have the POSIX timers which are feature rich and nicely
> >>   implemented.  All that is needed is to implement SIGEV_KEVENT as a
> >>   notification mechanism.  The timer is registered as part of the
> >>   timer_create() syscall.
> >
> > Feel free to add any interface you like - it is as simple as a call to
> > kevent_user_add_ukevent() in userspace.
>
> No, that's not what I mean.  There is no need for the special
> timer-related part of your patch.  Instead the existing POSIX timer
> syscalls should be modified to handle SIGEV_KEVENT notification.  Again,
> keep the interface as small as possible.  Plus, the POSIX timer
> interface is very flexible.  You don't want to duplicate all that
> functionality.
The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks - I have not invented new interfaces, just reused the existing generic kevent facilities. It is possible to add timer events from any other place.

> >And I almost silently stand behind the fact that it is possible to implement _all_ of the above ring buffer things in userspace with kevent_get_events(), and this functionality has been there for almost a year :)
>
> Again, this defeats the purpose completely. The ring buffer is the faster interface, especially when coupled with asynchronous filling of the ring buffer (i.e., without a syscall).

It is still possible to have a very scalable system with it, for example with one thread dedicated to syscall reading (with a big number of events transferred in one shot, the syscall overhead becomes negligible) and a pool of working threads. It is not about 'let's remove kernelspace ring buffer management', but about the possibilities and flexibility of the existing model.

> >Let's solve problems in order of their appearance - what do you think about the above interface for the ring buffer?
>
> Looks better, yes.

Ok, I will implement this new (old) ring buffer and present it in the next release. I will also schedule userspace notifications, the 'wake-up-one-thread' flag changes and other small updates for it.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
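A minimal sketch, in C, of the dispatcher-plus-worker-pool scheme described above. The kevent_get_events() prototype, the struct ukevent layout and the enqueue_for_workers() helper below are assumptions for illustration only; the real definitions live in the kevent patches.

    #include <pthread.h>
    #include <stdint.h>

    struct ukevent { char payload[40]; };          /* placeholder; real layout is in the patch */
    void enqueue_for_workers(struct ukevent *ev);  /* hypothetical userspace work queue */

    /* assumed wrapper around the kevent syscall; the real prototype may differ */
    int kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
                          uint64_t timeout_ns, struct ukevent *buf, unsigned int flags);

    #define BATCH 256

    /* one dedicated reader thread (started via pthread_create()): a single
     * syscall moves up to BATCH events, so the per-event syscall overhead
     * becomes negligible; workers drain the userspace queue without ever
     * entering the kernel themselves */
    static void *dispatcher(void *arg)
    {
        int kfd = *(int *)arg;
        static struct ukevent batch[BATCH];

        for (;;) {
            int n = kevent_get_events(kfd, 1, BATCH, ~0ULL, batch, 0);
            for (int i = 0; i < n; i++)
                enqueue_for_workers(&batch[i]);
        }
        return NULL;
    }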
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:25 ` Evgeniy Polyakov
@ 2006-11-20  8:43 ` Andrew Morton
  2006-11-20  8:51 ` Evgeniy Polyakov
  2006-11-20 20:29 ` Ulrich Drepper
  1 sibling, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-20 8:43 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, 20 Nov 2006 11:25:01 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> > Evgeniy Polyakov wrote:
> > >>Possible solution:
> > >>
> > >>a) it would be possible to have a "used" flag in each ring buffer entry. That's too expensive, I guess.
> > >>
> > >>b) kevent_wait needs another parameter which specifies which is the last (i.e., least recently added) entry in the ring buffer. Everything between this entry and the current head (in ->kidx) is occupied. If multiple threads arrive in kevent_wait, the highest idx (with wraparound, possibly the lowest) is used.
> > >>
> > >> kevent_wait will not try to move more entries into the ring buffer if ->kidx and the highest index passed in to any kevent_wait call are equal (i.e., the ring buffer is full).
> > >>
> > >> There is one issue, though, and that is that a system call is needed to signal to the kernel that more entries in the ring buffer are processed and that they can be refilled. This goes against the kernel filling the ring buffer automatically (see below)
> > >
> > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> >
> > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
>
> Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.

That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.

> > >They are all already implemented. Just all of the above, and it was done several months ago already. No need to reinvent what is already there. Even if we decide to remove kevent_get_events() in favour of a ring-buffer-only implementation, the waiting-for-event syscall will be essentially kevent_get_events() without a pointer to the place where to put events.
> >
> > Right, but this limitation of the interface is important. It means the interface of the kernel is smaller: fewer possibilities for problems and fewer constraints if in future something should be changed (and a smaller kernel).
>
> Ok, let's look at the ring buffer implementation right now, and then we will decide whether we want to remove or stay with the kevent_get_events() syscall.

I agree that kevent_get_events() is duplicative and we shouldn't need it. Better to concentrate all our development effort on the single and most flexible means of delivery.

^ permalink raw reply [flat|nested] 200+ messages in thread
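For concreteness, the commit-style consumption in proposal b) might look like the following userlevel loop. The ring header, the index names and the kevent_wait() prototype here are assumptions pieced together from the thread, not the patch's actual definitions.

    #include <stdint.h>

    struct ukevent { char payload[40]; };      /* placeholder; real layout in the patch */

    /* assumed ring header: the kernel advances kidx as it produces events */
    struct kevent_ring {
        unsigned int kidx;
        struct ukevent event[];
    };

    /* assumed prototype: commit @num processed entries starting at @start,
     * then wait for more events or a timeout */
    int kevent_wait(int fd, unsigned int start, unsigned int num, uint64_t timeout_ns);

    void process_event(struct ukevent *ev);    /* application code, not shown */

    static void consume(int kfd, struct kevent_ring *ring, unsigned int ring_size,
                        unsigned int *uidx /* userspace consumer index */)
    {
        unsigned int kidx = ring->kidx;        /* snapshot the producer index */
        unsigned int start = *uidx, num = 0;

        while (*uidx != kidx) {                /* everything in [uidx, kidx) is occupied */
            process_event(&ring->event[*uidx]);
            *uidx = (*uidx + 1) % ring_size;
            num++;
        }
        /* tell the kernel these slots may be refilled, and block for more */
        kevent_wait(kfd, start, num, 1000000000ULL);
    }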
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:43 ` Andrew Morton
@ 2006-11-20  8:51 ` Evgeniy Polyakov
  2006-11-20  9:15 ` Andrew Morton
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 8:51 UTC (permalink / raw)
To: Andrew Morton
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > >
> > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> >
> > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
>
> That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.

Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.

> > > >They are all already implemented. Just all of the above, and it was done several months ago already. No need to reinvent what is already there. Even if we decide to remove kevent_get_events() in favour of a ring-buffer-only implementation, the waiting-for-event syscall will be essentially kevent_get_events() without a pointer to the place where to put events.
> > >
> > > Right, but this limitation of the interface is important. It means the interface of the kernel is smaller: fewer possibilities for problems and fewer constraints if in future something should be changed (and a smaller kernel).
> >
> > Ok, let's look at the ring buffer implementation right now, and then we will decide whether we want to remove or stay with the kevent_get_events() syscall.
>
> I agree that kevent_get_events() is duplicative and we shouldn't need it. Better to concentrate all our development effort on the single and most flexible means of delivery.

Let's wait for the ring buffer implementation first :)

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:51 ` Evgeniy Polyakov
@ 2006-11-20  9:15 ` Andrew Morton
  2006-11-20  9:19 ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Andrew Morton @ 2006-11-20 9:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, 20 Nov 2006 11:51:59 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:

> On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > > >
> > > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> > >
> > > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
> >
> > That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.
>
> Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.

eh? That's not a ringbuffer, and it sounds awfully complex.

I don't know if this (new?) proposal resolves the events-get-lost-due-to-thread-cancellation problem? Would need to see considerably more detail.

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  9:15 ` Andrew Morton
@ 2006-11-20  9:19 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-20 9:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Ulrich Drepper, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 01:15:16AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> On Mon, 20 Nov 2006 11:51:59 +0300 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton (akpm@osdl.org) wrote:
> > > > > >If a thread calls kevent_wait() it means it has processed the previous entries; one can call kevent_wait() with the $num parameter as zero, which means that the thread does not want any new events, so nothing will be copied.
> > > > >
> > > > > This doesn't solve the problem. You could only request new events when all previously reported events are processed. Plus: how do you report events if you don't allow get_event to pass them on?
> > > >
> > > > Userspace should itself maintain ordering and the possibility to get events in this implementation; the kernel just returns the events which were requested.
> > >
> > > That would mean that in a multithreaded application (or multiple processes sharing the same MAP_SHARED ringbuffer), all threads/processes will be slowed down to wait for the slowest one.
> >
> > Not at all - all other threads can call kevent_get_events() with their own place in the ring buffer, so while one of them is processing an entry, others can fill the next entries.
>
> eh? That's not a ringbuffer, and it sounds awfully complex.
>
> I don't know if this (new?) proposal resolves the events-get-lost-due-to-thread-cancellation problem? Would need to see considerably more detail.

It does - the event is copied into the shared buffer, but the place (or index in the ring buffer) is selected by userspace (a wrapper, glibc, anything). It is simple and (from my point of view) elegant, but it will not be used - I surrender and will implement kernelspace ring buffer management right now. I just said that it is possible to implement any kind of ring buffer in userspace with the old kevent_get_events() syscall only.

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20  8:25 ` Evgeniy Polyakov
  2006-11-20  8:43 ` Andrew Morton
@ 2006-11-20 20:29 ` Ulrich Drepper
  2006-11-20 21:46 ` Jeff Garzik
  2006-11-21  9:53 ` Evgeniy Polyakov
  1 sibling, 2 replies; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 20:29 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
> It is exactly how the previous ring buffer (in a mapped area though) was implemented.

Not any of those I saw. The one I looked at always started again at index 0 to fill the ring buffer. I'll wait for the next implementation.

>> That's something the application should make the call about. It's not always (or even mostly) the case that the ordering of the notification is important. Furthermore, this would also require the kernel to enforce an ordering. This is expensive on SMP machines. A locally generated event (i.e., the source and the thread reporting the event) can be delivered faster than an event created on another CPU.
>
> How come? If a signal was delivered before data arrived, userspace should get the signal before the data - that is the rule. Ordering is maintained not for event insertion but for marking events ready - it is atomic, so whichever event is marked ready first will be read first from the ready queue.

This is as far as the kernel is concerned. Queue them in the order they arrive.

I'm talking about the userlevel side. *If* (and it needs to be verified that this has an advantage) a CPU creates an event, e.g., a read event, then a number of threads could be notified about the event. When the kernel has to wake up a thread it'll look whether any thread is scheduled on the same CPU which generated the event. Then the thread, upon waking up, can be told about the entry in the ring buffer which is best accessed first (due to caching). This entry need not be the first available in the ring buffer, but that's a problem the userlevel code has to worry about.

> Then I propose userspace notifications - each new thread can register 'wake me up when userspace event 1 is ready' and 'event 1' will be marked as ready by glibc when it removes the thread.

You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.

>> Of course it does. Just because you don't see a need for it for your applications right now it doesn't mean it's not a valid use.
>
> Please explain why glibc AIO uses relative timeouts then :)

You are still completely focused on AIO. We are talking here about new generic event handling. It is not tied to AIO. We will add all kinds of events, e.g., hopefully futex support and many others. And even for AIO it's relevant.

As I said, relative timeouts are unable to cope with settimeofday calls or ntp adjustments. AIO is certainly usable in situations where timeouts are related to wall clock time.

> It has nothing to do with the implementation - it is logic. Something starts and it has its maximum lifetime; it is not that something starts and should be stopped on Jan 1, 2008.

It is an implementation detail. Look at the PI futex support.
It has timeouts which can be cut short (or increased) due to wall clock changes.

>> The opposite case is equally impossible to emulate: unblocking a signal just for the duration of the syscall. These are all possible and used cases.
>
> Add and remove the appropriate kevent - it is as simple as a call to one function.

No, it's not. The kevent stuff handles only the kevent handler (i.e., the replacement for calling the signal handler). It cannot set signal masks. I am talking about signal masks here. And don't suggest "I can add another kevent feature where I can register signal masks". This would be ridiculous since it's not an event source. Just add the parameter and every base is covered and, at least equally important, we have symmetry between the event handling interfaces.

>>- there is really no reason to invent yet another timer implementation. We have the POSIX timers which are feature rich and nicely implemented. All that is needed is to implement SIGEV_KEVENT as a notification mechanism. The timer is registered as part of the timer_create() syscall.
>
> The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks

New multiplexer cases are additional syscalls. This is unnecessary code, an increased kernel interface and such. We have the POSIX timer interfaces which are feature-rich and standardized *and* can be trivially extended (at least from the userlevel interface POV) to use event queues. If you don't want to do this, fine, I'll try to get it made. But drop the timer part of your patches.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-20 20:29 ` Ulrich Drepper @ 2006-11-20 21:46 ` Jeff Garzik 2006-11-20 21:52 ` Ulrich Drepper 2006-11-21 9:53 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-20 21:46 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Ulrich Drepper wrote: > Evgeniy Polyakov wrote: >> It is exactly how previous ring buffer (in mapped area though) was >> implemented. > > Not any of those I saw. The one I looked at always started again at > index 0 to fill the ring buffer. I'll wait for the next implementation. I like the two-pointer ring buffer approach, one pointer for the consumer and one for the producer. > You don't want to have a channel like this. The userlevel code doesn't > know which threads are waiting in the kernel on the event queue. And it Agreed. > You are still completely focused on AIO. We are talking here about a > new generic event handling. It is not tied to AIO. We will add all Agreed. > As I said, relative timeouts are unable to cope with settimeofday calls > or ntp adjustments. AIO is certainly usable in situations where > timeouts are related to wall clock time. I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
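The two-pointer scheme Jeff mentions is the classic producer/consumer ring. A generic sketch in C (not kevent code, and without the memory barriers a real SMP implementation would need):

    struct ukevent { char payload[40]; };    /* stand-in for the event payload */

    /* head advances on produce, tail on consume; with size a power of two,
     * the ring is empty when head == tail and full when head - tail == size
     * (unsigned wraparound keeps the difference correct) */
    struct ring {
        unsigned int head;                   /* producer index */
        unsigned int tail;                   /* consumer index */
        unsigned int size;                   /* power of two */
        struct ukevent *ev;
    };

    static int ring_put(struct ring *r, const struct ukevent *e)
    {
        if (r->head - r->tail == r->size)
            return -1;                       /* full */
        r->ev[r->head & (r->size - 1)] = *e;
        r->head++;
        return 0;
    }

    static int ring_get(struct ring *r, struct ukevent *e)
    {
        if (r->head == r->tail)
            return -1;                       /* empty */
        *e = r->ev[r->tail & (r->size - 1)];
        r->tail++;
        return 0;
    }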
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:46 ` Jeff Garzik
@ 2006-11-20 21:52 ` Ulrich Drepper
  2006-11-21  9:09 ` Ingo Oeser
  2006-11-22 11:38 ` Michael Tokarev
  0 siblings, 2 replies; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-20 21:52 UTC (permalink / raw)
To: Jeff Garzik
Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Jeff Garzik wrote:
> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.

I'm not talking about always using absolute timeouts.

I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

^ permalink raw reply [flat|nested] 200+ messages in thread
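clock_nanosleep(2) already works exactly this way and shows the pattern: one struct timespec parameter, with a flag selecting absolute or relative interpretation.

    #include <time.h>

    static void wait_until_deadline(void)
    {
        struct timespec deadline;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;                /* absolute deadline: now + 5 seconds */

        /* TIMER_ABSTIME makes the timespec an absolute CLOCK_REALTIME time,
         * so a settimeofday() or NTP step mid-wait is honored; passing
         * flags == 0 instead treats the very same timespec as relative */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, NULL);
    }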
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:52 ` Ulrich Drepper
@ 2006-11-21  9:09 ` Ingo Oeser
  0 siblings, 0 replies; 200+ messages in thread
From: Ingo Oeser @ 2006-11-21 9:09 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Hi,

Ulrich Drepper wrote:
> Jeff Garzik wrote:
> > I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

I agree here. And while you are at it: Have it say "not before" vs. "not after".

<rant>
And if you call an "absolute timeout" an "alarm" or "deadline" everyone will agree that this is useful. Timeout means "I ran OUT of TIME to do it" and this is by definition relative to a starting point. A "deadline" is an absolute point in (wall) time where something has to be ready, and an "alarm" is an absolute point in (wall) time where something is triggered (e.g. a bell rings on your "ALARM clock"). I don't know who established that nonsense nomenclature about relative and absolute timeouts.
</rant>

Regards

Ingo Oeser

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 21:52 ` Ulrich Drepper
  2006-11-21  9:09 ` Ingo Oeser
@ 2006-11-22 11:38 ` Michael Tokarev
  2006-11-22 11:47 ` Evgeniy Polyakov
  2006-11-22 12:33 ` Jeff Garzik
  1 sibling, 2 replies; 200+ messages in thread
From: Michael Tokarev @ 2006-11-22 11:38 UTC (permalink / raw)
To: Ulrich Drepper
Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

Ulrich Drepper wrote:
> Jeff Garzik wrote:
>> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.

Can't the argument be something like u64 instead of struct timespec, regardless of this discussion (relative vs absolute)?

Compare:

  void mysleep(int msec) {
      struct timeval tv;
      tv.tv_sec = msec / 1000;
      tv.tv_usec = (msec % 1000) * 1000;  /* tv_usec counts microseconds */
      select(0, 0, 0, 0, &tv);
  }

with

  void mysleep(int msec) {
      poll(0, 0, msec);  /* poll(2) takes the timeout in milliseconds directly */
  }

That is to say: struct time{spec,val,whatever} is more difficult to use than plain numbers. But yes... the existing struct timespec has the advantage of already existing. Oh well.

/mjt

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-22 11:38 ` Michael Tokarev
@ 2006-11-22 11:47 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-22 11:47 UTC (permalink / raw)
To: Michael Tokarev
Cc: Ulrich Drepper, Jeff Garzik, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

On Wed, Nov 22, 2006 at 02:38:50PM +0300, Michael Tokarev (mjt@tls.msk.ru) wrote:
> Ulrich Drepper wrote:
> > Jeff Garzik wrote:
> >> I think we have lived with relative timeouts for so long, it would be unusual to change now. select(2), poll(2), epoll_wait(2) all take relative timeouts.
> >
> > I'm not talking about always using absolute timeouts.
> >
> > I'm saying the timeout parameter should be a struct timespec* and then the flags word could have a flag meaning "this is an absolute timeout". I.e., enable both uses, even make relative timeouts the default. This is what the modern POSIX interfaces do, too, see clock_nanosleep.
>
> Can't the argument be something like u64 instead of struct timespec, regardless of this discussion (relative vs absolute)?

It is right now :)

> /mjt

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 11:38 ` Michael Tokarev 2006-11-22 11:47 ` Evgeniy Polyakov @ 2006-11-22 12:33 ` Jeff Garzik 1 sibling, 0 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-22 12:33 UTC (permalink / raw) To: Michael Tokarev Cc: Ulrich Drepper, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Michael Tokarev wrote: > Can't the argument be something like u64 instead of struct timespec, > regardless of this discussion (relative vs absolute)? Newer syscalls (ppoll, pselect) take struct timespec, which is a reasonable, modern form of the timeout argument... Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-20 20:29 ` Ulrich Drepper
  2006-11-20 21:46 ` Jeff Garzik
@ 2006-11-21  9:53 ` Evgeniy Polyakov
  2006-11-21 16:58 ` Ulrich Drepper
  1 sibling, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 9:53 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Mon, Nov 20, 2006 at 12:29:31PM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >It is exactly how the previous ring buffer (in a mapped area though) was implemented.
>
> Not any of those I saw. The one I looked at always started again at index 0 to fill the ring buffer. I'll wait for the next implementation.

That is what I'm talking about - there are at least 4 (!) different ring buffer implementations; most of them were not even looked at. But the new version is ready; I will complete the testing stage and will release 'take25' later today. For those who like 'real-world benchmarks and so on' I created a patch for the latest stable lighttpd version and tested it with kevent.

> >>That's something the application should make the call about. It's not always (or even mostly) the case that the ordering of the notification is important. Furthermore, this would also require the kernel to enforce an ordering. This is expensive on SMP machines. A locally generated event (i.e., the source and the thread reporting the event) can be delivered faster than an event created on another CPU.
> >
> >How come? If a signal was delivered before data arrived, userspace should get the signal before the data - that is the rule. Ordering is maintained not for event insertion but for marking events ready - it is atomic, so whichever event is marked ready first will be read first from the ready queue.
>
> This is as far as the kernel is concerned. Queue them in the order they arrive.
>
> I'm talking about the userlevel side. *If* (and it needs to be verified that this has an advantage) a CPU creates an event, e.g., a read event, then a number of threads could be notified about the event. When the kernel has to wake up a thread it'll look whether any thread is scheduled on the same CPU which generated the event. Then the thread, upon waking up, can be told about the entry in the ring buffer which is best accessed first (due to caching). This entry need not be the first available in the ring buffer, but that's a problem the userlevel code has to worry about.

Ok, I've understood.

> >Then I propose userspace notifications - each new thread can register 'wake me up when userspace event 1 is ready' and 'event 1' will be marked as ready by glibc when it removes the thread.
>
> You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.

The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e.
when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.

How should it know which syscall should be interrupted when the special syscall is called?

> >>Of course it does. Just because you don't see a need for it for your applications right now it doesn't mean it's not a valid use.
> >
> >Please explain why glibc AIO uses relative timeouts then :)
>
> You are still completely focused on AIO. We are talking here about new generic event handling. It is not tied to AIO. We will add all kinds of events, e.g., hopefully futex support and many others. And even for AIO it's relevant.
>
> As I said, relative timeouts are unable to cope with settimeofday calls or ntp adjustments. AIO is certainly usable in situations where timeouts are related to wall clock time.

No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!

I still do not understand how you will use absolute timeout values there. Please explain.

> >It has nothing to do with the implementation - it is logic. Something starts and it has its maximum lifetime; it is not that something starts and should be stopped on Jan 1, 2008.
>
> It is an implementation detail. Look at the PI futex support. It has timeouts which can be cut short (or increased) due to wall clock changes.

futex_wait() uses relative timeouts:

  static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)

The kernel uses relative timeouts. Only special syscalls which work with absolute time have absolute timeouts (like settimeofday).

> >>The opposite case is equally impossible to emulate: unblocking a signal just for the duration of the syscall. These are all possible and used cases.
> >
> >Add and remove the appropriate kevent - it is as simple as a call to one function.
>
> No, it's not. The kevent stuff handles only the kevent handler (i.e., the replacement for calling the signal handler). It cannot set signal masks. I am talking about signal masks here. And don't suggest "I can add another kevent feature where I can register signal masks". This would be ridiculous since it's not an event source. Just add the parameter and every base is covered and, at least equally important, we have symmetry between the event handling interfaces.

We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.

You request to get events - and you get them. You request not to get events during a syscall - you remove the events.

Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement. What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?

> >>No, that's not what I mean. There is no need for the special timer-related part of your patch. Instead the existing POSIX timer syscalls should be modified to handle SIGEV_KEVENT notification. Again, keep the interface as small as possible. Plus, the POSIX timer interface is very flexible. You don't want to duplicate all that functionality.
> >The interface is already there with kevent_ctl(KEVENT_ADD); I just created an additional entry which describes the timer enqueue/dequeue callbacks
>
> New multiplexer cases are additional syscalls. This is unnecessary code, an increased kernel interface and such. We have the POSIX timer interfaces which are feature-rich and standardized *and* can be trivially extended (at least from the userlevel interface POV) to use event queues. If you don't want to do this, fine, I'll try to get it made. But drop the timer part of your patches.

There are _no_ additional syscalls. I just introduced a new case for an event type. You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.

I did the work; one can create any interfaces (additional syscalls or anything else) on top of that. Because kevent was designed as a generic event handling mechanism, it is possible to work with all types of events using the same interface, which was created 10 months ago: kevent add, remove and so on... There is nothing special for timers there - it is a separate file which does _not_ have any interfaces accessible outside the kevent core (i.e. syscalls or exported symbols).

Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
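The callback trio Evgeniy describes has roughly this shape in C; the structure and member names below are modelled on his description and may not match the patch exactly.

    struct kevent;                            /* kernel-side event, defined in the patch */

    struct kevent_callbacks {
        /* attach the kevent to its origin object (socket, inode, timer, ...) */
        int (*enqueue)(struct kevent *k);
        /* detach it from that object again */
        int (*dequeue)(struct kevent *k);
        /* invoked when the origin has something to report; a positive
         * return value marks the kevent ready */
        int (*callback)(struct kevent *k);
    };

    /* the timer case is trivial: if the timer fired, the event is ready,
     * which is why kevent_timer_callback() can simply return 1 */
    static int kevent_timer_callback(struct kevent *k)
    {
        return 1;
    }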
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21  9:53 ` Evgeniy Polyakov
@ 2006-11-21 16:58 ` Ulrich Drepper
  2006-11-21 17:43 ` Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Ulrich Drepper @ 2006-11-21 16:58 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

Evgeniy Polyakov wrote:
>> You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.
>
> The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e. when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.
>
> How should it know which syscall should be interrupted when the special syscall is called?

It's not about interrupting any threads.

The issue is that the wakeup of a thread from the kevent_wait call constitutes an "event notification". If, as it should be, only one thread is woken, then this information mustn't get lost. If the woken thread cannot work on the events it got notified for, then it must tell the kernel about it so that, *if* there are other threads waiting in kevent_wait, one of those other threads can be woken.

What is needed is a simple "wake another thread waiting on this event queue" syscall. Yes, in theory we could open an additional pipe with each event queue and use it for waking threads, but this is influencing the ABI through the use of a file descriptor. It's much better to have an explicit way to do this.

> No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!
>
> I still do not understand how you will use absolute timeout values there. Please explain.

What is there to explain? If you are waiting for events which must coincide with real-world events you'll naturally want to formulate something like "wait for X until 10:15h". You cannot formulate this correctly with relative timeouts since the realtime clock might be adjusted.

> futex_wait() uses relative timeouts:
>
>   static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
>
> The kernel uses relative timeouts.

Look again. This time at the implementation. For FUTEX_LOCK_PI the timeout is an absolute timeout.

> We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.
>
> You request to get events - and you get them. You request not to get events during a syscall - you remove the events.

None of this matches what I'm talking about. If you want to block a signal for the duration of the kevent_wait call this is nothing you can do by registering an event.

Registering events has nothing to do with signal masks. They are not modified. It is the program's responsibility to set the mask up correctly.
Just like sigwaitinfo() etc. expect all signals which are waited on to be blocked. The signal mask handling is orthogonal to all this and must be explicit. In some cases explicit pthread_sigmask/sigprocmask calls. But this is not atomic if a signal must be masked/unmasked for the *_wait call. This is why we have variants like pselect/ppoll/epoll_pwait which explicitly and *atomically* change the signal mask for the duration of the call.

> Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement.

Don't search for epoll_pwait, it's not widely used yet. Search for pselect, which is standardized. You'll find plenty of uses of that interface. The number is certainly depressed at the moment since until recently there was no correct implementation on Linux. And the interface is mostly used in real-time contexts where signals are more commonly used.

> What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?

Why is this even a question? Have you done programming with signals? Your hatred of signals makes me think this isn't the case.

You might want to unblock a signal on a *_wait call if it can be used to interrupt the wait, but you don't want this to happen while the thread is working on a request. You might want to block a signal, for instance, around a sigwaitinfo call or, in this case, a kevent_wait call where the signal might be delivered to the queue. There are countless possibilities. Signals are very flexible.

> There are _no_ additional syscalls. I just introduced a new case for an event type.

Which is a new syscall. All demultiplexer cases are new syscalls.

Which, BTW, implies that unrecognized types should actually cause an ENOSYS return value (this affects kevent_break). We've been over this many times. If EINVAL is returned this case cannot be distinguished from invalid parameters. This is crucial for future extensions where userland (esp. glibc) needs to be able to determine whether a new feature is supported on the system.

> You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.

I don't question that. But there is no need to add the callback. It extends the kernel ABI/API. And for what? A vastly inferior timer implementation compared to the POSIX timers. And this while all that needs to be done is to extend the POSIX timer code slightly to handle SIGEV_KEVENT in addition to the other notification methods currently used. If you do it right then the code can be shared with the file AIO code which currently is circulated as well and which uses parts of the POSIX timer infrastructure.

> Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?

I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union. Something like

  struct {
      int kevent_fd;
      void *data;
  } _sigev_kevent;

Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately.
I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 16:58 ` Ulrich Drepper
@ 2006-11-21 17:43 ` Evgeniy Polyakov
  2006-11-21 18:46 ` Evgeniy Polyakov
  2006-11-22  7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
  0 siblings, 2 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 17:43 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Tue, Nov 21, 2006 at 08:58:49AM -0800, Ulrich Drepper (drepper@redhat.com) wrote:
> Evgeniy Polyakov wrote:
> >>You don't want to have a channel like this. The userlevel code doesn't know which threads are waiting in the kernel on the event queue. And it seems to be much more complicated than simply having a kevent call which tells the kernel "wake up N or 1 more threads since I cannot handle it". Basically a futex_wake()-like call.
> >
> >The kernel does not know about any threads which wait for events; it only has a queue of events. It can only wake those that were parked in kevent_get_events() or kevent_wait(), but the syscall will return only when the condition it waits on is true, i.e. when there is a new event in the ready queue and/or the ring buffer has empty slots; the kernel will wake them up in any case if those conditions are true.
> >
> >How should it know which syscall should be interrupted when the special syscall is called?
>
> It's not about interrupting any threads.
>
> The issue is that the wakeup of a thread from the kevent_wait call constitutes an "event notification". If, as it should be, only one thread is woken, then this information mustn't get lost. If the woken thread cannot work on the events it got notified for, then it must tell the kernel about it so that, *if* there are other threads waiting in kevent_wait, one of those other threads can be woken.
>
> What is needed is a simple "wake another thread waiting on this event queue" syscall. Yes, in theory we could open an additional pipe with each event queue and use it for waking threads, but this is influencing the ABI through the use of a file descriptor. It's much better to have an explicit way to do this.

Threads are parked in syscalls - which one should be interrupted? And what if there were no threads waiting in syscalls?

> >No AIO, but a syscall. Only the syscall time matters. A syscall starts, and it should at some point be stopped. When should it be stopped? It should be stopped some time after it was started!
> >
> >I still do not understand how you will use absolute timeout values there. Please explain.
>
> What is there to explain? If you are waiting for events which must coincide with real-world events you'll naturally want to formulate something like "wait for X until 10:15h". You cannot formulate this correctly with relative timeouts since the realtime clock might be adjusted.

It has nothing to do with the syscall. You register a timer to wait until 10:15, that is all. You do not ask to sleep in read() until some time, because read() has nothing in common with that time and event.

But actually this is becoming a stupid discussion, don't you think? What do you think about putting a timespec there and a small warning in dmesg about absolute timeouts? When someone reports it, I will publicly say that you were right and that it is correct to have the possibility of absolute timeouts for syscalls?
:)

> >futex_wait() uses relative timeouts:
> >
> >  static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
> >
> >The kernel uses relative timeouts.
>
> Look again. This time at the implementation. For FUTEX_LOCK_PI the timeout is an absolute timeout.

How come? It just uses a timespec.

> >We do not have such symmetry. Other event handling interfaces cannot work with events which do not have a file descriptor behind them. Kevent can, and does. Signals are just usual events.
> >
> >You request to get events - and you get them. You request not to get events during a syscall - you remove the events.
>
> None of this matches what I'm talking about. If you want to block a signal for the duration of the kevent_wait call this is nothing you can do by registering an event.
>
> Registering events has nothing to do with signal masks. They are not modified. It is the program's responsibility to set the mask up correctly. Just like sigwaitinfo() etc. expect all signals which are waited on to be blocked.
>
> The signal mask handling is orthogonal to all this and must be explicit. In some cases explicit pthread_sigmask/sigprocmask calls. But this is not atomic if a signal must be masked/unmasked for the *_wait call. This is why we have variants like pselect/ppoll/epoll_pwait which explicitly and *atomically* change the signal mask for the duration of the call.

You probably missed the kevent signal patch - the signal will not be delivered (in special cases) since it will not be copied into the signal mask. The system just will not know that it happened. Completely. Like putting it into the blocked mask.

> >Btw, please point me to the discussion about the real-life usefulness of that parameter for epoll. I read the thread where sys_pepoll() was introduced, but apart from some theoretical handwaving about possible usefulness there are no real signs of that requirement.
>
> Don't search for epoll_pwait, it's not widely used yet. Search for pselect, which is standardized. You'll find plenty of uses of that interface. The number is certainly depressed at the moment since until recently there was no correct implementation on Linux. And the interface is mostly used in real-time contexts where signals are more commonly used.

I found this:

  ... document a pselect() call intended to remove the race condition that is present when one wants to wait on either a signal or some file descriptor. (See also Stevens, Unix Network Programming, Volume 1, 2nd Ed., 1998, p. 168 and the pselect.2 man page released today.) Glibc 2.0 has a bad version (wrong number of parameters) and glibc 2.1 a better version, but the whole purpose of pselect is to avoid the race, and glibc cannot do that, one needs kernel support.

But it is completely irrelevant to kevent signals - there is no race in that case, since the signal is delivered through a file descriptor.

> >What is the underlying research or extended explanation about blocking/unblocking some signals during syscall execution?
>
> Why is this even a question? Have you done programming with signals? Your hatred of signals makes me think this isn't the case.

It is much better not to know how a thing works than to be unable to understand how new things can work.

> You might want to unblock a signal on a *_wait call if it can be used to interrupt the wait, but you don't want this to happen while the thread is working on a request.

Add a kevent signal and do not process that event.
> You might want to block a signal, for instance, around a sigwaitinfo call or, in this case, a kevent_wait call where the signal might be delivered to the queue.

Having a special type of kevent signal is the same as putting the signal into the blocked mask, but the signal event will be marked as ready - to indicate that the condition was there. There will not be any race in that case.

> There are countless possibilities. Signals are very flexible.

That is why we want to get them through a synchronous queue? :)

> >There are _no_ additional syscalls. I just introduced a new case for an event type.
>
> Which is a new syscall. All demultiplexer cases are new syscalls.

I think I am a bit blind - probably parts of the Leonids are still getting into my brain - but there is one syscall called kevent_ctl() which adds different events, including timers, signals, sockets and others.

> Which, BTW, implies that unrecognized types should actually cause an ENOSYS return value (this affects kevent_break). We've been over this many times. If EINVAL is returned this case cannot be distinguished from invalid parameters. This is crucial for future extensions where userland (esp. glibc) needs to be able to determine whether a new feature is supported on the system.

I can replace it with -ENOSYS if you like.

> >You _need_ it to be done, since any kernel kevent user must have enqueue/dequeue/callback callbacks. It is just an implementation of those callbacks.
>
> I don't question that. But there is no need to add the callback. It

No one asked or paid me to create kevent, but it is done. Probably not the way some people wanted, but that is how it always happens; it is really not that bad.

The kevent subsystem operates on structures which can be added to completely different objects in the system - inodes, files - anything. And to tell that object about new events there are special callbacks - enqueue and dequeue. The callback with the extremely unusual name 'callback' is invoked when the object the event is linked to has something to report - new data, a fired alarm or anything else; the object calls kevent's ->callback and if the return value is positive, the kevent is marked as ready. It allows having events with different sets of interests for the same type of main object - for example, a socket can have read and write callbacks. So you must have them. As you probably saw, kevent_timer_callback() just returns 1.

> extends the kernel ABI/API. And for what? A vastly inferior timer implementation compared to the POSIX timers. And this while all that needs to be done is to extend the POSIX timer code slightly to handle SIGEV_KEVENT in addition to the other notification methods currently used. If you do it right then the code can be shared with the file AIO code which currently is circulated as well and which uses parts of the POSIX timer infrastructure.

Ulrich, tell me the truth, will you kill me if I say that I have an entry in my TODO to implement a different AIO design (details for interested readers can be found in my blog), and then present it to the community? :))

> >Btw, how should the POSIX API be extended to allow queuing events? A queue is required (which is created when the user calls kevent_init() or previously opens /dev/kevent); how should it be accessed, since it is just a file descriptor in the process task_struct?
>
> I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union.
> Something like
>
>   struct {
>       int kevent_fd;
>       void *data;
>   } _sigev_kevent;
>
> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà.

Ok, I see. It is doable and simple. I will try to implement it tomorrow.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 17:43 ` Evgeniy Polyakov
@ 2006-11-21 18:46 ` Evgeniy Polyakov
  2006-11-21 20:01 ` Jeff Garzik
  ` (2 more replies)
  2006-11-22  7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
  1 sibling, 3 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 18:46 UTC (permalink / raw)
To: Ulrich Drepper
Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro

On Tue, Nov 21, 2006 at 08:43:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > I've explained this multiple times. The struct sigevent structure needs to be extended to get a new part in the union. Something like
> >
> >   struct {
> >       int kevent_fd;
> >       void *data;
> >   } _sigev_kevent;
> >
> > Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ values. In the code which handles setup of timers (the timer_create syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., call into the code to register the event source, just like you'd do with the current interface. Then add the code to post an event to the event queue where currently signals would be sent et voilà.
>
> Ok, I see. It is doable and simple. I will try to implement it tomorrow.

I've checked the code. Since it will be a union, it is impossible to use _sigev_thread, and it becomes just the SIGEV_SIGNAL case with a different delivery mechanism. Is that what you want?

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov @ 2006-11-21 20:01 ` Jeff Garzik 2006-11-22 10:41 ` Evgeniy Polyakov 2006-11-21 20:19 ` Jeff Garzik 2006-11-22 7:38 ` Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-21 20:01 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro nitpick: in ring_buffer.c (example app), I would use posix_memalign(3) rather than malloc(3) Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 20:01 ` Jeff Garzik @ 2006-11-22 10:41 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:41 UTC (permalink / raw) To: Jeff Garzik Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro On Tue, Nov 21, 2006 at 03:01:45PM -0500, Jeff Garzik (jeff@garzik.org) wrote: > nitpick: in ring_buffer.c (example app), I would use posix_memalign(3) > rather than malloc(3) Yes, it can be done. > Jeff -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
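Concretely, Jeff's suggestion: a mapped ring wants page-aligned memory, which malloc(3) does not guarantee but posix_memalign(3) does.

    #include <stdlib.h>
    #include <unistd.h>

    static void *alloc_ring(size_t sz)
    {
        void *p;

        /* posix_memalign() returns 0 on success and an errno value on
         * failure; page-size alignment is what mmap-backed access to
         * the ring requires */
        if (posix_memalign(&p, (size_t)sysconf(_SC_PAGESIZE), sz) != 0)
            return NULL;
        return p;
    }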
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov 2006-11-21 20:01 ` Jeff Garzik @ 2006-11-21 20:19 ` Jeff Garzik 2006-11-22 10:39 ` Evgeniy Polyakov 2006-11-22 7:38 ` Ulrich Drepper 2 siblings, 1 reply; 200+ messages in thread From: Jeff Garzik @ 2006-11-21 20:19 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro Another: pass a 'flags' argument to kevent_init(2). I guarantee you will need it eventually. It IMO would help with later binary compatibility, if nothing else. You wouldn't need a new syscall to introduce struct kevent_ring_v2. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism.
  2006-11-21 20:19 ` Jeff Garzik
@ 2006-11-22 10:39 ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-22 10:39 UTC (permalink / raw)
To: Jeff Garzik
Cc: Ulrich Drepper, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Alexander Viro

On Tue, Nov 21, 2006 at 03:19:05PM -0500, Jeff Garzik (jeff@garzik.org) wrote:
> Another: pass a 'flags' argument to kevent_init(2). I guarantee you will need it eventually. It IMO would help with later binary compatibility, if nothing else. You wouldn't need a new syscall to introduce struct kevent_ring_v2.

Yep, I will add a 'flags' field there.

> Jeff

--
Evgeniy Polyakov

^ permalink raw reply [flat|nested] 200+ messages in thread
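A sketch of what Jeff is asking for, kernel-side; kevent_init() and its argument list are still in flux in this thread, so the prototype below is illustrative only.

    /* reject unknown flag bits up front, so a later extension (say, a
     * kevent_ring_v2 layout) can be selected by a new flag instead of
     * a new syscall */
    #define KEVENT_INIT_VALID_FLAGS 0U       /* no flags defined yet */

    asmlinkage long sys_kevent_init(struct kevent_ring __user *ring,
                                    unsigned int num, unsigned int flags)
    {
        if (flags & ~KEVENT_INIT_VALID_FLAGS)
            return -EINVAL;
        /* ... existing initialization ... */
        return 0;
    }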
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 18:46 ` Evgeniy Polyakov 2006-11-21 20:01 ` Jeff Garzik 2006-11-21 20:19 ` Jeff Garzik @ 2006-11-22 7:38 ` Ulrich Drepper 2006-11-22 10:44 ` Evgeniy Polyakov 2 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 7:38 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > I've checked the code. > Since it will be a union, it is impossible to use _sigev_thread and it > becomes just SIGEV_SIGNAL case with different delivery mechanism. > Is it what you want? struct sigevent is defined like this: typedef struct sigevent { sigval_t sigev_value; int sigev_signo; int sigev_notify; union { int _pad[SIGEV_PAD_SIZE]; int _tid; struct { void (*_function)(sigval_t); void *_attribute; /* really pthread_attr_t */ } _sigev_thread; } _sigev_un; } sigevent_t; For the SIGEV_KEVENT case: sigev_notify is set to SIGEV_KEVENT (obviously) sigev_value can be used for the void* data passed along with the signal, just like in the case of a signal delivery Now you need a way to specify the kevent descriptor. Just add int _kevent; inside the union and if you want #define sigev_kevent_descr _sigev_un._kevent That should be all. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
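For illustration, userspace code against such a patched kernel might look roughly like this. SIGEV_KEVENT and the sigev_kevent_descr accessor are the proposed additions from this thread, not part of any released kernel or glibc headers, so the value used here is a placeholder:

	#include <signal.h>
	#include <string.h>
	#include <time.h>

	#ifndef SIGEV_KEVENT
	#define SIGEV_KEVENT 3		/* hypothetical placeholder value */
	#endif

	int create_kevent_timer(int kevent_fd, void *cookie, timer_t *out)
	{
		struct sigevent ev;
		memset(&ev, 0, sizeof(ev));

		ev.sigev_notify = SIGEV_KEVENT;		/* deliver via the kevent queue */
		ev.sigev_value.sival_ptr = cookie;	/* surfaced with the event */
		/* the proposed union member, absent from stock headers:
		 * ev.sigev_kevent_descr = kevent_fd; */

		return timer_create(CLOCK_MONOTONIC, &ev, out);
	}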
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:38 ` Ulrich Drepper @ 2006-11-22 10:44 ` Evgeniy Polyakov 2006-11-22 21:02 ` Ulrich Drepper 2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:44 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:38:25PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >I've checked the code. > >Since it will be a union, it is impossible to use _sigev_thread and it > >becomes just SIGEV_SIGNAL case with different delivery mechanism. > >Is it what you want? > > struct sigevent is defined like this: > > typedef struct sigevent { > sigval_t sigev_value; > int sigev_signo; > int sigev_notify; > union { > int _pad[SIGEV_PAD_SIZE]; > int _tid; > > struct { > void (*_function)(sigval_t); > void *_attribute; /* really pthread_attr_t */ > } _sigev_thread; > } _sigev_un; > } sigevent_t; > > > For the SIGEV_KEVENT case: > > sigev_notify is set to SIGEV_KEVENT (obviously) > > sigev_value can be used for the void* data passed along with the > signal, just like in the case of a signal delivery > > Now you need a way to specify the kevent descriptor. Just add > > int _kevent; > > inside the union and if you want > > #define sigev_kevent_descr _sigev_un._kevent > > That should be all. That is what I implemented. But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT at the same time; it will be just the same as SIGEV_SIGNAL but with a different delivery mechanism. Is that what you expect? > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 10:44 ` Evgeniy Polyakov @ 2006-11-22 21:02 ` Ulrich Drepper 2006-11-23 12:23 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 21:02 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT > at the same time; it will be just the same as SIGEV_SIGNAL but with > a different delivery mechanism. Is that what you expect? Yes, that's expected. The event is for the queue, not directed to a specific thread. If in the future we want to think about preferentially waking a specific thread we can think about it then. But I doubt that'll be beneficial. The thread-specific part in the signal handling is only used to implement the SIGEV_THREAD notification. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 21:02 ` Ulrich Drepper @ 2006-11-23 12:23 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:23 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 01:02:00PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >But in this case it will be impossible to have SIGEV_THREAD and > >SIGEV_KEVENT > >at the same time; it will be just the same as SIGEV_SIGNAL but with > >a different delivery mechanism. Is that what you expect? > > Yes, that's expected. The event is for the queue, not directed to a > specific thread. > > If in the future we want to think about preferentially waking a specific thread > we can think about it then. But I doubt that'll be beneficial. The > thread-specific part in the signal handling is only used to implement > the SIGEV_THREAD notification. Ok, so please review the patch I sent; if it is ok from a design point of view, I will run some tests here. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Kevent POSIX timers support. 2006-11-22 10:44 ` Evgeniy Polyakov 2006-11-22 21:02 ` Ulrich Drepper @ 2006-11-23 8:52 ` Evgeniy Polyakov 2006-11-23 20:26 ` Ulrich Drepper 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 8:52 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 01:44:16PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > That is what I implemented. > But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT > at the same time; it will be just the same as SIGEV_SIGNAL but with > a different delivery mechanism. Is that what you expect? Something like this morning's hack (compile-tested only). If my thoughts are correct, I will create some simple application and test if it works. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index a7dd38f..4b9deb4 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -4,6 +4,7 @@ #include <linux/spinlock.h> #include <linux/list.h> #include <linux/sched.h> +#include <linux/kevent_storage.h> union cpu_time_count { cputime_t cpu; @@ -49,6 +50,9 @@ struct k_itimer { sigval_t it_sigev_value; /* value word of sigevent struct */ struct task_struct *it_process; /* process to send signal to */ struct sigqueue *sigq; /* signal queue entry. */ +#ifdef CONFIG_KEVENT_TIMER + struct kevent_storage st; +#endif union { struct { struct hrtimer timer; diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c index e5ebcc1..148a9f9 100644 --- a/kernel/posix-timers.c +++ b/kernel/posix-timers.c @@ -48,6 +48,8 @@ #include <linux/wait.h> #include <linux/workqueue.h> #include <linux/module.h> +#include <linux/kevent.h> +#include <linux/file.h> /* * Management arrays for POSIX timers.
Timers are kept in slab memory @@ -224,6 +226,95 @@ static int posix_ktime_get_ts(clockid_t return 0; } +#ifdef CONFIG_KEVENT_TIMER +static int posix_kevent_enqueue(struct kevent *k) +{ + struct k_itimer *tmr = k->event.ptr; + return kevent_storage_enqueue(&tmr->st, k); +} +static int posix_kevent_dequeue(struct kevent *k) +{ + struct k_itimer *tmr = k->event.ptr; + kevent_storage_dequeue(&tmr->st, k); + return 0; +} +static int posix_kevent_callback(struct kevent *k) +{ + return 1; +} +static int posix_kevent_init(void) +{ + struct kevent_callbacks tc = { + .callback = &posix_kevent_callback, + .enqueue = &posix_kevent_enqueue, + .dequeue = &posix_kevent_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER); +} + +extern struct file_operations kevent_user_fops; + +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + struct ukevent uk; + struct file *file; + struct kevent_user *u; + int err; + + file = fget(fd); + if (!file) { + err = -EBADF; + goto err_out; + } + + if (file->f_op != &kevent_user_fops) { + err = -EINVAL; + goto err_out_fput; + } + + u = file->private_data; + + memset(&uk, 0, sizeof(struct ukevent)); + + uk.type = KEVENT_POSIX_TIMER; + uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */ + uk.ptr = tmr; + + tmr->it_sigev_value.sival_ptr = file; + + err = kevent_user_add_ukevent(&uk, u); + if (err) + goto err_out_fput; + + fput(file); + + return 0; + +err_out_fput: + fput(file); +err_out: + return err; +} + +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ + kevent_storage_fini(&tmr->st); +} +#else +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd) +{ + return -ENOSYS; +} +static int posix_kevent_init(void) +{ + return 0; +} +static void posix_kevent_fini_timer(struct k_itimer *tmr) +{ +} +#endif + + /* * Initialize everything, well, just everything in Posix clocks/timers ;) */ @@ -241,6 +332,11 @@ static __init int init_posix_timers(void register_posix_clock(CLOCK_REALTIME, &clock_realtime); register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic); + if (posix_kevent_init()) { + printk(KERN_ERR "Failed to initialize kevent posix timers.\n"); + BUG(); + } + posix_timers_cache = kmem_cache_create("posix_timers_cache", sizeof (struct k_itimer), 0, 0, NULL, NULL); idr_init(&posix_timers_id); @@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer timr = container_of(timer, struct k_itimer, it.real.timer); spin_lock_irqsave(&timr->it_lock, flags); + + if (timr->it_sigev_notify & SIGEV_KEVENT) { + kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); + } else { + if (timr->it.real.interval.tv64 != 0) + si_private = ++timr->it_requeue_pending; - if (timr->it.real.interval.tv64 != 0) - si_private = ++timr->it_requeue_pending; - - if (posix_timer_event(timr, si_private)) { - /* - * signal was not sent because of sig_ignor - * we will not get a call back to restart it AND - * it should be restarted. - */ - if (timr->it.real.interval.tv64 != 0) { - timr->it_overrun += - hrtimer_forward(timer, - timer->base->softirq_time, - timr->it.real.interval); - ret = HRTIMER_RESTART; - ++timr->it_requeue_pending; + if (posix_timer_event(timr, si_private)) { + /* + * signal was not sent because of sig_ignor + * we will not get a call back to restart it AND + * it should be restarted. 
+ */ + if (timr->it.real.interval.tv64 != 0) { + timr->it_overrun += + hrtimer_forward(timer, + timer->base->softirq_time, + timr->it.real.interval); + ret = HRTIMER_RESTART; + ++timr->it_requeue_pending; + } } } @@ -407,6 +507,9 @@ static struct k_itimer * alloc_posix_tim kmem_cache_free(posix_timers_cache, tmr); tmr = NULL; } +#ifdef CONFIG_KEVENT_TIMER + kevent_storage_init(tmr, &tmr->st); +#endif return tmr; } @@ -424,6 +527,7 @@ static void release_posix_timer(struct k if (unlikely(tmr->it_process) && tmr->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) put_task_struct(tmr->it_process); + posix_kevent_fini_timer(tmr); kmem_cache_free(posix_timers_cache, tmr); } @@ -496,40 +600,52 @@ sys_timer_create(const clockid_t which_c new_timer->it_sigev_signo = event.sigev_signo; new_timer->it_sigev_value = event.sigev_value; - read_lock(&tasklist_lock); - if ((process = good_sigevent(&event))) { - /* - * We may be setting up this process for another - * thread. It may be exiting. To catch this - * case the we check the PF_EXITING flag. If - * the flag is not set, the siglock will catch - * him before it is too late (in exit_itimers). - * - * The exec case is a bit more invloved but easy - * to code. If the process is in our thread - * group (and it must be or we would not allow - * it here) and is doing an exec, it will cause - * us to be killed. In this case it will wait - * for us to die which means we can finish this - * linkage with our last gasp. I.e. no code :) - */ + if (event.sigev_notify & SIGEV_KEVENT) { + error = posix_kevent_init_timer(new_timer, event._sigev_un.kevent_fd); + if (error) + goto out; + + process = current->group_leader; spin_lock_irqsave(&process->sighand->siglock, flags); - if (!(process->flags & PF_EXITING)) { - new_timer->it_process = process; - list_add(&new_timer->list, - &process->signal->posix_timers); - spin_unlock_irqrestore(&process->sighand->siglock, flags); - if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) - get_task_struct(process); - } else { - spin_unlock_irqrestore(&process->sighand->siglock, flags); - process = NULL; + new_timer->it_process = process; + list_add(&new_timer->list, &process->signal->posix_timers); + spin_unlock_irqrestore(&process->sighand->siglock, flags); + } else { + read_lock(&tasklist_lock); + if ((process = good_sigevent(&event))) { + /* + * We may be setting up this process for another + * thread. It may be exiting. To catch this + * case the we check the PF_EXITING flag. If + * the flag is not set, the siglock will catch + * him before it is too late (in exit_itimers). + * + * The exec case is a bit more invloved but easy + * to code. If the process is in our thread + * group (and it must be or we would not allow + * it here) and is doing an exec, it will cause + * us to be killed. In this case it will wait + * for us to die which means we can finish this + * linkage with our last gasp. I.e. 
no code :) + */ + spin_lock_irqsave(&process->sighand->siglock, flags); + if (!(process->flags & PF_EXITING)) { + new_timer->it_process = process; + list_add(&new_timer->list, + &process->signal->posix_timers); + spin_unlock_irqrestore(&process->sighand->siglock, flags); + if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID)) + get_task_struct(process); + } else { + spin_unlock_irqrestore(&process->sighand->siglock, flags); + process = NULL; + } + } + read_unlock(&tasklist_lock); + if (!process) { + error = -EINVAL; + goto out; } - } - read_unlock(&tasklist_lock); - if (!process) { - error = -EINVAL; - goto out; } } else { new_timer->it_sigev_notify = SIGEV_SIGNAL; -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov @ 2006-11-23 20:26 ` Ulrich Drepper 2006-11-24 9:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:26 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > +static int posix_kevent_init(void) > +{ > + struct kevent_callbacks tc = { > + .callback = &posix_kevent_callback, > + .enqueue = &posix_kevent_enqueue, > + .dequeue = &posix_kevent_dequeue}; How do we prevent somebody from trying to register a POSIX timer event source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible from sys_timer_create and nowhere else. Can you add a parameter to kevent_enqueue indicating this is a call from inside the kernel and then ignore certain enqueue callbacks? > @@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer > > timr = container_of(timer, struct k_itimer, it.real.timer); > spin_lock_irqsave(&timr->it_lock, flags); > + > + if (timr->it_sigev_notify & SIGEV_KEVENT) { > + kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); > + } else { We need to pass the data in the sigev_value member of the struct sigevent structure passed to timer_create to the caller. I don't see it being done here nor when the timer is created. Am I missing something? The sigev_value value should be stored in the user/ptr member of struct ukevent. > + if (event.sigev_notify & SIGEV_KEVENT) { Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case. Just define SIGEV_KEVENT to 3 and replace the tests like the one cited above with if (timr->it_sigev_notify == SIGEV_KEVENT) -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
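To see why the bit test misfires once SIGEV_KEVENT is a plain enumeration value: Linux defines SIGEV_SIGNAL as 0, SIGEV_NONE as 1, SIGEV_THREAD as 2 and SIGEV_THREAD_ID as 4, so a SIGEV_KEVENT of 3 shares bits with two of them. A small standalone demonstration (the SIGEV_KEVENT value is the one proposed above, not a released constant):

	#include <stdio.h>

	#define SIGEV_SIGNAL	0
	#define SIGEV_NONE	1
	#define SIGEV_THREAD	2
	#define SIGEV_THREAD_ID	4
	#define SIGEV_KEVENT	3	/* proposed in this thread */

	int main(void)
	{
		int notify = SIGEV_THREAD;	/* an ordinary thread timer */

		/* 2 & 3 == 2, so the '&' test wrongly matches plain SIGEV_THREAD: */
		printf("bit test:      %s\n",
		       (notify & SIGEV_KEVENT) ? "matches (wrong)" : "no match");

		/* equality only matches the real thing: */
		printf("equality test: %s\n",
		       (notify == SIGEV_KEVENT) ? "matches" : "no match (right)");
		return 0;
	}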
* Re: Kevent POSIX timers support. 2006-11-23 20:26 ` Ulrich Drepper @ 2006-11-24 9:50 ` Evgeniy Polyakov 2006-11-27 18:20 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 9:50 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 12:26:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+static int posix_kevent_init(void) > >+{ > >+ struct kevent_callbacks tc = { > >+ .callback = &posix_kevent_callback, > >+ .enqueue = &posix_kevent_enqueue, > >+ .dequeue = &posix_kevent_dequeue}; > > How do we prevent somebody from trying to register a POSIX timer event > source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible > from sys_timer_create and nowhere else. > > Can you add a parameter to kevent_enqueue indicating this is a call from > inside the kernel and then ignore certain enqueue callbacks? I think we need some set of flags for callbacks - where they can be called from, maybe even from which context and so on. So userspace will not be allowed to create such timers through the kevent API. Will do it for the release. > >@@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer > > > > timr = container_of(timer, struct k_itimer, it.real.timer); > > spin_lock_irqsave(&timr->it_lock, flags); > >+ > >+ if (timr->it_sigev_notify & SIGEV_KEVENT) { > >+ kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL); > >+ } else { > > We need to pass the data in the sigev_value member of the struct > sigevent structure passed to timer_create to the caller. I don't see it > being done here nor when the timer is created. Am I missing something? > The sigev_value value should be stored in the user/ptr member of struct > ukevent. sigev_value was stored in the k_itimer structure; I just do not know where to put it in the ukevent provided to userspace - it can be placed in the pointer value if you like. > >+ if (event.sigev_notify & SIGEV_KEVENT) { > > Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with > SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case. > > Just define SIGEV_KEVENT to 3 and replace the tests like the one cited > above with > > if (timr->it_sigev_notify == SIGEV_KEVENT) Ok. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-24 9:50 ` Evgeniy Polyakov @ 2006-11-27 18:20 ` Ulrich Drepper 2006-11-27 18:24 ` David Miller 2006-11-28 9:16 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: >> We need to pass the data in the sigev_value member of the struct >> sigevent structure passed to timer_create to the caller. I don't see it >> being done here nor when the timer is created. Am I missing something? >> The sigev_value value should be stored in the user/ptr member of struct >> ukevent. > > sigev_value was stored in the k_itimer structure; I just do not know where > to put it in the ukevent provided to userspace - it can be placed in > the pointer value if you like. sigev_value is a union and the largest element is a pointer. So, transporting the pointer value is sufficient and it should be passed up to the user in the ptr member of struct ukevent. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:20 ` Ulrich Drepper @ 2006-11-27 18:24 ` David Miller 2006-11-27 18:36 ` Ulrich Drepper 2006-11-28 9:16 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-27 18:24 UTC (permalink / raw) To: drepper Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Ulrich Drepper <drepper@redhat.com> Date: Mon, 27 Nov 2006 10:20:50 -0800 > Evgeniy Polyakov wrote: > >> We need to pass the data in the sigev_value member of the struct > >> sigevent structure passed to timer_create to the caller. I don't see it > >> being done here nor when the timer is created. Am I missing something? > >> The sigev_value value should be stored in the user/ptr member of struct > >> ukevent. > > > > sigev_value was stored in the k_itimer structure; I just do not know where > > to put it in the ukevent provided to userspace - it can be placed in > > the pointer value if you like. > > sigev_value is a union and the largest element is a pointer. So, > transporting the pointer value is sufficient and it should be passed up > to the user in the ptr member of struct ukevent. Now we'll have to have a compat layer for 32-bit/64-bit environments thanks to POSIX timers, which is ridiculous. This is exactly the kind of thing I was hoping we could avoid when designing these data structures. No pointers, no non-fixed-size types, only types which are identically sized and aligned between 32-bit and 64-bit environments. It's OK to have these problems for things designed a long time ago before 32-bit/64-bit compat issues existed, but for new stuff no way. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:24 ` David Miller @ 2006-11-27 18:36 ` Ulrich Drepper 2006-11-27 18:49 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:36 UTC (permalink / raw) To: David Miller Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro David Miller wrote: > Now we'll have to have a compat layer for 32-bit/64-bit environments > thanks to POSIX timers, which is ridiculous. We already have compat_sys_timer_create. It should be sufficient just to add the conversion (if anything new is needed) there. The pointer value can be passed to userland in one or two int fields, I don't really care. When reporting the event to the user code we cannot just point into the ring buffer anyway. So while copying the data we can rewrite it if necessary. I see no need to complicate the code more than it already is. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:36 ` Ulrich Drepper @ 2006-11-27 18:49 ` David Miller 2006-11-28 9:16 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-27 18:49 UTC (permalink / raw) To: drepper Cc: johnpol, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Ulrich Drepper <drepper@redhat.com> Date: Mon, 27 Nov 2006 10:36:06 -0800 > David Miller wrote: > > Now we'll have to have a compat layer for 32-bit/64-bit environments > > thanks to POSIX timers, which is ridiculous. > > We already have compat_sys_timer_create. It should be sufficient just > to add the conversion (if anything new is needed) there. The pointer > value can be passed to userland in one or two int fields, I don't really > care. When reporting the event to the user code we cannot just point > into the ring buffer anyway. So while copying the data we can rewrite > it if necessary. I see no need to complicate the code more than it > already is. Ok, as long as that thing doesn't end up in the ring buffer entry data structure, that's where the real troubles would be. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:49 ` David Miller @ 2006-11-28 9:16 ` Evgeniy Polyakov 2006-11-28 19:13 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 9:16 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Mon, Nov 27, 2006 at 10:49:55AM -0800, David Miller (davem@davemloft.net) wrote: > From: Ulrich Drepper <drepper@redhat.com> > Date: Mon, 27 Nov 2006 10:36:06 -0800 > > > David Miller wrote: > > > Now we'll have to have a compat layer for 32-bit/64-bit environments > > > thanks to POSIX timers, which is ridiculous. > > > > We already have compat_sys_timer_create. It should be sufficient just > > to add the conversion (if anything new is needed) there. The pointer > > value can be passed to userland in one or two int fields, I don't really > > care. When reporting the event to the user code we cannot just point > > into the ring buffer anyway. So while copying the data we can rewrite > > it if necessary. I see no need to complicate the code more than it > > already is. > > Ok, as long as that thing doesn't end up in the ring buffer entry > data structure, that's where the real troubles would be. Although ukevent has a pointer embedded, it is unioned with a u64, so there should be no problems until a 128-bit arch appears, which is not likely to happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent posix timers patch, which can store sigval's value too. But it is absolutely certain that ukevent does not and will not in any way have a variable size. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-28 9:16 ` Evgeniy Polyakov @ 2006-11-28 19:13 ` David Miller 2006-11-28 19:22 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-11-28 19:13 UTC (permalink / raw) To: johnpol Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 28 Nov 2006 12:16:02 +0300 > Although ukevent has a pointer embedded, it is unioned with a u64, so there > should be no problems until a 128-bit arch appears, which is not likely to > happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent > posix timers patch, which can store sigval's value too. > > But it is absolutely certain that ukevent does not and will not in any way > have a variable size. I believe that in order to be 100% safe you will need to use the special aligned_u64 type, as that takes care of a crucial difference between x86 and x86_64 API, namely that u64 needs 8-byte alignment on x86_64 but not on x86. You probably know this already :-) ^ permalink raw reply [flat|nested] 200+ messages in thread
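The difference is easy to demonstrate: the i386 ABI aligns 64-bit integers to 4 bytes inside structures while x86_64 aligns them to 8, so the same declaration yields two different layouts. Compiling the sketch below with -m32 and then -m64 shows offsetof(struct plain, b) as 4 vs. 8, while the forced variant is 8 in both cases:

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>

	/* same idea as the kernel's aligned_u64 */
	typedef uint64_t my_aligned_u64 __attribute__((aligned(8)));

	struct plain  { uint32_t a; uint64_t b; };
	struct forced { uint32_t a; my_aligned_u64 b; };

	int main(void)
	{
		printf("plain:  offsetof(b) = %zu, sizeof = %zu\n",
		       offsetof(struct plain, b), sizeof(struct plain));
		printf("forced: offsetof(b) = %zu, sizeof = %zu\n",
		       offsetof(struct forced, b), sizeof(struct forced));
		return 0;
	}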
* Re: Kevent POSIX timers support. 2006-11-28 19:13 ` David Miller @ 2006-11-28 19:22 ` Evgeniy Polyakov 2006-12-12 1:36 ` David Miller 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 19:22 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Tue, Nov 28, 2006 at 11:13:00AM -0800, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 28 Nov 2006 12:16:02 +0300 > > > Although ukevent has a pointer embedded, it is unioned with a u64, so there > > should be no problems until a 128-bit arch appears, which is not likely to > > happen soon. There is also an unused 'u32 ret_val[2]' field in the kevent > > posix timers patch, which can store sigval's value too. > > > > But it is absolutely certain that ukevent does not and will not in any way > > have a variable size. > > I believe that in order to be 100% safe you will need to use the > special aligned_u64 type, as that takes care of a crucial difference > between x86 and x86_64 API, namely that u64 needs 8-byte alignment on > x86_64 but not on x86. > > You probably know this already :-) Yep :) So I put it at the end, where the structure is already correctly aligned, so there is no need for special alignment. And, btw, last time I checked, aligned_u64 was not exported to userspace. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-28 19:22 ` Evgeniy Polyakov @ 2006-12-12 1:36 ` David Miller 2006-12-12 5:31 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: David Miller @ 2006-12-12 1:36 UTC (permalink / raw) To: johnpol Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Tue, 28 Nov 2006 22:22:36 +0300 > And, btw, last time I checked, aligned_u64 was not exported to > userspace. It is in linux/types.h and not protected by __KERNEL__ ifdefs. Perhaps you mean something else? ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-12-12 1:36 ` David Miller @ 2006-12-12 5:31 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-12-12 5:31 UTC (permalink / raw) To: David Miller Cc: drepper, akpm, netdev, zach.brown, hch, chase.venters, johann.borck, linux-kernel, jeff, aviro On Mon, Dec 11, 2006 at 05:36:44PM -0800, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Tue, 28 Nov 2006 22:22:36 +0300 > > > And, btw, last time I checked, aligned_u64 was not exported to > > userspace. > > It is in linux/types.h and not protected by __KERNEL__ ifdefs. > Perhaps you mean something else? It looks like I checked the wrong #ifdef __KERNEL__/#endif pair. It is indeed there. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: Kevent POSIX timers support. 2006-11-27 18:20 ` Ulrich Drepper 2006-11-27 18:24 ` David Miller @ 2006-11-28 9:16 ` Evgeniy Polyakov 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 9:16 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 10:20:50AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > sigev_value is a union and the largest element is a pointer. So, > transporting the pointer value is sufficient and it should be passed up > to the user in the ptr member of struct ukevent. That is where I've put it in the current version. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-21 17:43 ` Evgeniy Polyakov 2006-11-21 18:46 ` Evgeniy Polyakov @ 2006-11-22 7:33 ` Ulrich Drepper 2006-11-22 10:38 ` Evgeniy Polyakov 2006-11-22 12:09 ` Evgeniy Polyakov 1 sibling, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 7:33 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > Threads are parked in syscalls - which one should be interrupted? It doesn't matter, use the same policy you use when waking a thread in case of an event. This is not about waking a specific thread, it's about not dropping the event notification. > And what if there were no threads waiting in syscalls? This is fine, do nothing. It means that the other threads are about to read the ring buffer and will pick up the event. The case which must be avoided is that of all threads being in the kernel, one thread gets woken, and then is canceled. Without notifying the kernel about the cancellation and in the absence of further event notifications the process is deadlocked. A second case which should be avoided is that there is a thread waiting when a thread gets canceled and there are one or more additional threads around, but not in the kernel. But those other threads might not get to the ring buffer anytime soon, so handling the event is unnecessarily delayed. > It has completely nothing to do with the syscall. > You register a timer to wait until 10:15, that is all. That's a nonsense argument. In this case you would not add any timeout parameter at all. Of course nobody would want that since it's simply too slow. Stop thinking about the absolute timeout as an exceptional case, it might very well not be for some problems. Besides, I've already mentioned another case where a struct timespec* parameter is needed. There are even two different relative timeouts: using the monotonic clock or using the realtime clock. The latter is affected by gettimeofday and ntp. >>> The kernel uses relative timeouts. >> Look again. This time at the implementation. For FUTEX_LOCK_PI the >> timeout is an absolute timeout. > > How come? It just uses timespec. Correct, it's using the value passed in. >> The signal mask handling is orthogonal to all this and must be explicit. >> In some cases explicit pthread_sigmask/sigprocmask calls. But this is >> not atomic if a signal must be masked/unmasked for the *_wait call. >> This is why we have variants like pselect/ppoll/epoll_pwait which >> explicitly and *atomically* change the signal mask for the duration of >> the call. > > You probably missed the kevent signal patch - the signal will not be delivered > (in special cases) since it will not be copied into the signal mask. The system > just will not know that it happened. Completely. Like putting it into > the blocked mask. I don't really understand what you want to say here. I looked over the patch and I don't think I miss anything. You just deliver the signal as an event. No signal mask handling at all. This is exactly the problem. > But it is completely irrelevant with kevent signals - there is no race > for that case when the signal is delivered through a file descriptor. Of course there is a race. You might not want the signal delivered. This is what the signal mask is for. Or the other way around, as I've said before. > It is much better to not know how a thing works than to not be able > to understand how new things can work. Well, this explains why you don't understand signal masks at all. > Add kevent signal and do not process that event. That's not only a horrible hack, it does not work. If I want to ignore a signal for the duration of the call, while you have it occasionally blocked for the rest of the program, you would have to register the kevent for the signal, unblock the signal, make the kevent_wait call, reset the mask, remove the kevent for the signal... Otherwise it would not be delivered to be ignored. And then you have a race, the same race pselect is designed to prevent. In fact, you have two races. There are other scenarios like this. Fact is, signal mask handling is necessary and it cannot be folded into the event handling, it's orthogonal. > Having a special type of kevent signal is the same as putting the signal into > the blocked mask, but the signal event will be marked as ready - to indicate > that the condition was there. > There will not be any race in that case. Nonsense on all counts. > I think I am a bit blind, probably parts of Leonids are still getting > into my brain, but there is one syscall called kevent_ctl() which adds > different events, including timer, signal, socket and others. You are searching for callbacks and if none is found you return EINVAL. This is exactly the same as if you'd create separate syscalls. Perhaps even worse, I really don't like demultiplexers, separate syscalls are much cleaner. Avoiding these callbacks would help reduce the kernel interface, especially for this timer implementation, which is useless since it is inferior. > I can replace with -ENOSYS if you like. It's necessary since we must be able to distinguish the errors. > No one asked or paid me to create kevent, but it is done. > Probably not the way some people wanted, but that always happens; > it is really not that bad. Nobody says that the work isn't appreciated. But if you don't want it to be critiqued, don't publish it. If you don't want to make any more changes, fine, say so. I'll find somebody else to do it or will do it myself. I claim that I know a thing or two about the interfaces that runtime programs expect to use. And I know POSIX and the way the interfaces are designed and how they interact. > Ulrich, tell me the truth, will you kill me if I say that I have an entry > in TODO to implement different AIO design (details for interested readers > can be found in my blog), and then present it to the community? :)) I don't care about the kernel implementation as long as the interface is compatible with what I need for the POSIX AIO implementation. The currently proposed code is going in that direction. Any implementation which, like Ben's old one, does not allow POSIX AIO to be implemented I will of course oppose. >> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ >> values. In the code which handles setup of timers (the timer_create >> syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., >> call into the code to register the event source, just like you'd do with >> the current interface. Then add the code to post an event to the event >> queue where currently signals would be sent et voilà. > > Ok, I see. > It is doable and simple. > I will try to implement it tomorrow. Thanks, that's progress. And yes, I imagine it's not hard which is why the currently proposed timer interface is so unnecessary. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
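For readers following along, the race being argued about here is the classic one pselect() was added for, and it exists with any "unblock, wait, reblock" sequence done in separate steps; a minimal sketch using only standard POSIX calls (signal handler installation omitted):

	#include <signal.h>
	#include <sys/select.h>

	volatile sig_atomic_t got_sig;	/* set by a signal handler elsewhere */

	/* Racy: the signal can arrive after the check but before select()
	 * sleeps; select() then blocks even though got_sig is already set. */
	int wait_racy(int fd, fd_set *rfds)
	{
		if (got_sig)
			return -1;
		return select(fd + 1, rfds, NULL, NULL, NULL);
	}

	/* Race-free: keep the signal blocked except while sleeping;
	 * pselect() installs 'unblocked' and starts sleeping atomically,
	 * so a pending signal always interrupts the call with EINTR. */
	int wait_safe(int fd, fd_set *rfds, const sigset_t *unblocked)
	{
		if (got_sig)
			return -1;
		return pselect(fd + 1, rfds, NULL, NULL, NULL, unblocked);
	}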
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper @ 2006-11-22 10:38 ` Evgeniy Polyakov 2006-11-22 22:22 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 10:38 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Threads are parked in syscalls - which one should be interrupted? > > It doesn't matter, use the same policy you use when waking a thread in > case of an event. This is not about waking a specific thread, it's > about not dropping the event notification. The event notification is not dropped - the thread was awakened, the kernel's task is completed. The kernel does not know, and should not know, that the selected thread was not good enough. If you want to wake up another thread - create another event; that is why I proposed userspace notifications, which I actually do not like. > >And what if there were no threads waiting in syscalls? > > This is fine, do nothing. It means that the other threads are about to > read the ring buffer and will pick up the event. > > > The case which must be avoided is that of all threads being in the > kernel, one thread gets woken, and then is canceled. Without notifying > the kernel about the cancellation and in the absence of further event > notifications the process is deadlocked. > > A second case which should be avoided is that there is a thread waiting > when a thread gets canceled and there are one or more additional threads > around, but not in the kernel. But those other threads might not get to > the ring buffer anytime soon, so handling the event is unnecessarily > delayed. If those threads are not in the kernel, the kernel cannot wake them up. But if there is an event like 'wake me up when a thread has died', then when new threads try to sleep in the syscall, they will be immediately awakened, since that event will be ready. > >It has completely nothing to do with the syscall. > >You register a timer to wait until 10:15, that is all. > > That's a nonsense argument. In this case you would not add any timeout > parameter at all. Of course nobody would want that since it's simply > too slow. Stop thinking about the absolute timeout as an exceptional > case, it might very well not be for some problems. I repeat - the timeout is needed to tell the kernel the maximum possible timeframe the syscall can live. When you tell me why you want the syscall to be interrupted when some absolute time is on the clock instead of having a special event for that, then ok. I think I know why you want absolute time there - because glibc converts most of the timeouts to absolute time since the POSIX waiting function pthread_cond_timedwait() works only with it. > Besides, I've already mentioned another case where a struct timespec* > parameter is needed. There are even two different relative timeouts: > using the monotonic clock or using the realtime clock. The latter is > affected by gettimeofday and ntp. Kevent converts it to jiffies since it uses wait_event() and friends; jiffies do not carry information about which clock to use. > >>>The kernel uses relative timeouts. >> Look again. This time at the implementation. For FUTEX_LOCK_PI the > >>timeout is an absolute timeout. > > > >How come? It just uses timespec. > > Correct, it's using the value passed in. > > > >>The signal mask handling is orthogonal to all this and must be explicit. > >> In some cases explicit pthread_sigmask/sigprocmask calls. But this is > >>not atomic if a signal must be masked/unmasked for the *_wait call. > >>This is why we have variants like pselect/ppoll/epoll_pwait which > >>explicitly and *atomically* change the signal mask for the duration of > >>the call. > > > >You probably missed the kevent signal patch - the signal will not be delivered > >(in special cases) since it will not be copied into the signal mask. The system > >just will not know that it happened. Completely. Like putting it into > >the blocked mask. > > > I don't really understand what you want to say here. > > I looked over the patch and I don't think I miss anything. You just > deliver the signal as an event. No signal mask handling at all. This > is exactly the problem. Have you seen specific_send_sig_info(): /* Short-circuit ignored signals. */ if (sig_ignored(p, sig)) { ret = 1; goto out; } Almost the same happens when a signal is delivered using kevent (special case) - the pending mask is not updated. > >But it is completely irrelevant with kevent signals - there is no race > >for that case when the signal is delivered through a file descriptor. > > Of course there is a race. You might not want the signal delivered. > This is what the signal mask is for. Or the other way around, as I've > said before. Then ignore that event - there is no race between signal delivery and reading other descriptors, and there _is_ one when the signal is delivered not through the same queue but asynchronously with a mask update. > >It is much better to not know how a thing works than to not be able > >to understand how new things can work. > > Well, this explains why you don't understand signal masks at all. Nice :) I at least try to do something to solve this problem, instead of blindly saying the same again and again without even trying to hear and understand what others say. > >Add kevent signal and do not process that event. > > That's not only a horrible hack, it does not work. If I want to ignore > a signal for the duration of the call, while you have it occasionally > blocked for the rest of the program, you would have to register the > kevent for the signal, unblock the signal, make the kevent_wait call, reset > the mask, remove the kevent for the signal... Otherwise it would not be > delivered to be ignored. And then you have a race, the same race > pselect is designed to prevent. In fact, you have two races. > > There are other scenarios like this. Fact is, signal mask handling is > necessary and it cannot be folded into the event handling, it's orthogonal. You take too narrow a view. Look broader - pselect() has a signal mask to prevent the race between async signal delivery and file descriptor readiness. With kevent both of those events are delivered through the same queue, so there is no race, so kevent syscalls do not need that workaround for a 20-year-old design which cannot handle events other than fds. > >Having a special type of kevent signal is the same as putting the signal into > >the blocked mask, but the signal event will be marked as ready - to indicate > >that the condition was there. > >There will not be any race in that case. > > Nonsense on all counts. > > > >I think I am a bit blind, probably parts of Leonids are still getting > >into my brain, but there is one syscall called kevent_ctl() which adds > >different events, including timer, signal, socket and others. > > You are searching for callbacks and if none is found you return EINVAL. > This is exactly the same as if you'd create separate syscalls. > Perhaps even worse, I really don't like demultiplexers, separate > syscalls are much cleaner. > > Avoiding these callbacks would help reduce the kernel interface, > especially for this timer implementation, which is useless since it is inferior. You completely do not want to understand how kevents work and why they are needed; if you tried to accept that there are opinions different from yours, then probably we could make some progress. Those callbacks are needed to support different types of objects, which can produce events, with the same interface. > >I can replace with -ENOSYS if you like. > > It's necessary since we must be able to distinguish the errors. And what if the user requests a bogus event type - is it an invalid condition, or normal but not handled (thus enosys)? > >No one asked or paid me to create kevent, but it is done. > >Probably not the way some people wanted, but that always happens; > >it is really not that bad. > > Nobody says that the work isn't appreciated. But if you don't want it > to be critiqued, don't publish it. If you don't want to make any more > changes, fine, say so. I'll find somebody else to do it or will do it > myself. I greatly appreciate criticism, really. But when it comes to 'this sucks because it sucks; no matter that it is done a completely different way, it still sucks because others sucked there too' I cannot call it criticism, it becomes nonsense. > I claim that I know a thing or two about the interfaces that runtime > programs expect to use. And I know POSIX and the way the interfaces are > designed and how they interact. Well, then I claim that I do not know 'a thing or two about the interfaces that runtime programs expect to use', but instead I write those programs and I know my needs. And POSIX interfaces are the last ones I would prefer to use. We are in different positions - theoretical thoughts about world happiness, and practical usage. I do not say that only one of those approaches must exist, they both can live together, but it requires that people on both sides not just say that the other side is stupid and does not know something, but instead try to listen and take that into account. > >Ulrich, tell me the truth, will you kill me if I say that I have an entry > >in TODO to implement different AIO design (details for interested readers > >can be found in my blog), and then present it to the community? :)) > > I don't care about the kernel implementation as long as the interface is > compatible with what I need for the POSIX AIO implementation. The > currently proposed code is going in that direction. Any implementation > which, like Ben's old one, does not allow POSIX AIO to be implemented I > will of course oppose. What if it is not called POSIX AIO, but instead some kind of 'true AIO' or 'real AIO' or maybe 'alternative AIO'? :) It is quite certain that POSIX AIO interfaces are unlikely to apply there... > >>Then define SIGEV_KEVENT as a value distinct from the other SIGEV_ > >>values. In the code which handles setup of timers (the timer_create > >>syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e., > >>call into the code to register the event source, just like you'd do with > >>the current interface. Then add the code to post an event to the event > >>queue where currently signals would be sent et voilà. > > > >Ok, I see. > >It is doable and simple. > >I will try to implement it tomorrow. > > Thanks, that's progress. And yes, I imagine it's not hard which is why > the currently proposed timer interface is so unnecessary. It is the first technical rather than political problem we have caught in this endless discussion; I have already separated it into a different subthread. Let's try to think more about it there. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 10:38 ` Evgeniy Polyakov @ 2006-11-22 22:22 ` Ulrich Drepper 2006-11-23 12:18 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 22:22 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > The event notification is not dropped - [...] Since you said you added the new syscall I'll leave this alone. > I repeat - the timeout is needed to tell the kernel the maximum possible > timeframe the syscall can live. When you tell me why you want the syscall > to be interrupted when some absolute time is on the clock instead of > having a special event for that, then ok. This goes together with... > I think I know why you want absolute time there - because glibc converts > most of the timeouts to absolute time since the POSIX waiting function > pthread_cond_timedwait() works only with it. I did not make the decision to use absolute timeouts/deadlines. This is what is needed in many situations. It's the more general way to specify delays. These are real-world requirements which were taken into account when designing the interfaces. For most cases I would agree that when doing AIO you need relative timeouts. But the event handling is not about AIO alone. It's all kinds of events and some/many are wall clock related. And it is definitely necessary in some situations to be able to interrupt if the clock jumps ahead. If a program deals with devices in the real world this can be crucial. The new event handling must be generic enough to accommodate all these uses and using struct timespec* plus possibly flags does not add any measurable overhead so there is no reason to not do it right. > Kevent converts it to jiffies since it uses wait_event() and friends; > jiffies do not carry information about which clock to use. Then this points to a place in the implementation which needs changing. The interface cannot be restricted just because this is all the current implementation allows. > /* Short-circuit ignored signals. */ > if (sig_ignored(p, sig)) { > ret = 1; > goto out; > } > > Almost the same happens when a signal is delivered using kevent (special > case) - the pending mask is not updated. Yes, and how do you set the signal mask atomically with respect to registering and unregistering signals with kevent and the syscall itself? You cannot. But this is exactly what is resolved by adding the signal mask parameter. Programs which don't need the functionality simply pass a NULL pointer and the cost is once again not measurable. But don't restrict the functionality just because you don't see a use for this in your small world. Yes, we could (later again) add new syscalls. But this is plain stupid. I would love to never have the epoll_wait or select syscall and just have epoll_pwait and pselect since the functionality is a superset. As it is, we have a larger kernel ABI. Here we can stop making the same mistake again. For the userlevel side we might even have separate interfaces, one with and one without the signal mask parameter. But that's userlevel, both functions would use the same syscall. >> There are other scenarios like this. Fact is, signal mask handling is >> necessary and it cannot be folded into the event handling, it's orthogonal. > > You take too narrow a view. > Look broader - pselect() has a signal mask to prevent the race between async > signal delivery and file descriptor readiness. With kevent both of those > events are delivered through the same queue, so there is no race, so > kevent syscalls do not need that workaround for a 20-year-old design > which cannot handle events other than fds. Your failure to understand the signal model leads to wrong conclusions. There are races, several of them, and you cannot do anything without signal mask parameters. I've explained this before. >> Avoiding these callbacks would help reduce the kernel interface, >> especially for this timer implementation, which is useless since it is inferior. > > You completely do not want to understand how kevents work and why they > are needed; if you tried to accept that there are opinions different from > yours, then probably we could make some progress. I think I know very well how they work by now. > Those callbacks are needed to support different types of objects, which > can produce events, with the same interface. Yes, but it is not necessary to expose all the different types in the userlevel APIs. That's the issue. Reduce the exposure of kernel functionality to userlevel APIs. If you integrate the timer handling into the POSIX timer syscalls the callbacks in your timer patch might not need to be there. At least the enqueue callback, if I remember correctly. All enqueue operations are initiated by timer_create calls which can call the function directly. Removing the callback from the list used by add_ctl will reduce the exposed interface. >>> I can replace with -ENOSYS if you like. >> It's necessary since we must be able to distinguish the errors. > > And what if the user requests a bogus event type - is it an invalid condition, or > normal but not handled (thus enosys)? It's ENOSYS. Just like for system calls. You cannot distinguish completely invalid values from values which are correct only on later kernels. But: the first use is a bug while the latter is not a bug and is needed to write robust and well-performing apps. The former's problems therefore are unimportant. > Well, then I claim that I do not know 'a thing or two about the interfaces that > runtime programs expect to use', but instead I write those programs and I know my > needs. And POSIX interfaces are the last ones I would prefer to use. Well, there it is. You look out for yourself while I make sure that all the bases I can think of are covered. Again, if you don't want to work on the generalization, fine. That's your right. Nobody will think badly of you for doing this. But don't expect that a) I'll not try to change it and b) I'll not object to the changes being accepted as they are. > What if it is not called POSIX AIO, but instead some kind of 'true > AIO' or 'real AIO' or maybe 'alternative AIO'? :) > It is quite certain that POSIX AIO interfaces are unlikely to apply > there... Programmers don't like specialized OS-specific interfaces. AIO users who put up with libaio are rare. The same will happen with any other approach. The Samba use is symptomatic: they need portability even if this costs a minute percentage of performance compared to a highly specialized implementation. There might be some aspects of POSIX AIO which could be implemented better on Linux. But the important part in the name is the 'P'. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
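One concrete reason deadlines are specified as absolute times in interfaces like pthread_cond_timedwait(): a wait that is interrupted and restarted with a relative timeout silently stretches the total time unless the caller recomputes the remainder, while an absolute deadline survives any number of restarts. A small illustration using clock_nanosleep(), which already takes exactly the struct timespec* plus flags shape argued for above:

	#include <time.h>
	#include <errno.h>

	/* Sleep until an absolute CLOCK_MONOTONIC deadline.  Signal
	 * interruptions restart the call with the same deadline, so the
	 * total wait never grows - the property a relative timeout loses
	 * unless the caller recomputes it on every EINTR. */
	int sleep_until(const struct timespec *deadline)
	{
		int err;
		do {
			err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
					      deadline, NULL);
		} while (err == EINTR);
		return err;
	}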
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 22:22 ` Ulrich Drepper @ 2006-11-23 12:18 ` Evgeniy Polyakov 2006-11-23 22:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:18 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > >I repeate - timeout is needed to tell kernel the maximum possible > >timeframe syscall can live. When you will tell me why you want syscall > >to be interrupted when some absolute time is on the clock instead of > >having special event for that, then ok. > > This goes together with... > > > >I think I know why you want absolute time there - because glibc converts > >most of the timeouts to absolute time since posix waiting > >pthread_cond_timedwait() works only with it. > > I did not make the decision to use absolute timeouts/deadlines. This is > what is needed in many situations. It's the more general way to specify > delays. These are real-world requirements which were taken into account > when designing the interfaces. > > For most cases I would agree that when doing AIO you need relative > timeouts. But the event handling is not about AIO alone. It's all > kinds of events and some/many are wall clock related. And it is > definitely necessary in some situations to be able to interrupt if the > clock jumps ahead. If a program deals with devices in the real world > this be crucial. The new event handling must be generic enough to > accommodate all these uses and using struct timespec* plus eventually > flags does not add any measurable overhead so there is no reason to not > do it right. Timeouts are not about AIO or any other event types (there are a lot of them already as you can see), it is only about syscall itself. Please point me to _any_ syscall out there which uses absolute time (except settimeofday() and similar syscalls). > >Kevent convert it to jiffies since it uses wait_event() and friends, > >jiffies do not carry information about clocks to be used. > > Then this points to a place in the implementation which needs changing. > The interface cannot be restricted just because the implementation > currently allow this to be implemented. Btw, do you propose to change all users of wait_event()? Interface is not restricted, it is just different from what you want it to be, and you did not show why it requires changes. Btw, there are _no_ interfaces similar to 'wait event with absolute times' in kernel. > > /* Short-circuit ignored signals. */ > > if (sig_ignored(p, sig)) { > > ret = 1; > > goto out; > > } > > > >almost the same happens when signal is delivered using kevent (special > >case) - pending mask is not updated. > > Yes, and how do you set the signal mask atomically wrt to registering > and unregistering signals with kevent and the syscall itself? You > cannot. But this is exactly which is resolved by adding the signal mask > parameter. kevent signal registering is atomic with respect to other kevent syscalls: control syscalls are protected by mutex and waiting syscalls work with queue, which is protected by appropriate lock. > Programs which don't need the functionality simply pass a NULL pointer > and the cost is once again not measurable. But don't restrict the > functionality just because you don't see a use for this in your small world. 
> > Yes, we could (later again) add new syscalls. But this is plain stupid. > I would love to never have had the epoll_wait or select syscall and just > have epoll_pwait and pselect since the functionality is a superset. We > have a larger kernel ABI. Here we can stop making the same mistake again. > > For the userlevel side we might even have separate interfaces, one with > and one without the signal mask parameter. But that's userlevel; both functions > would use the same syscall. Let me formulate the signal problem here; please tell me if it is correct or not. A user registers some async signal notifications and calls poll() waiting for some file descriptors to become ready. When it is interrupted there is no knowledge of what really happened first - the signal was delivered or the file descriptor was ready. Is that correct? In case it is, let me explain why this situation cannot happen with kevent: since signals are not delivered in the old way, but instead are queued into the same queue where file descriptors are, and queueing is atomic, and the pending signal mask is not updated, the user will only read one event after another, which automatically (since delivery is atomic) means that what was read first is what happened first. So why, in the latter situation, do we need to specify a signal mask which will block some signals from _async_ delivery, when there is _no_ async delivery? > >>There are other scenarios like this. Fact is, signal mask handling is > >>necessary and it cannot be folded into the event handling, it's > >>orthogonal. > > > >You have too narrow a look. > >Look broader - pselect() has a signal mask to prevent a race between async > >signal delivery and file descriptor readiness. With kevent both these > >events are delivered through the same queue, so there is no race, so > >kevent syscalls do not need that workaround for a 20-year-old design, > >which cannot handle events other than fd events. > > Your failure to understand the signal model leads to wrong conclusions. > There are races, several of them, and you cannot do anything without > signal mask parameters. I've explained this before. Please refer to my explanation above and point me, in that example, to what we are talking about. It seems we do not understand each other. > >>Avoiding these callbacks would help reduce the kernel interface, > >>especially for this useless, since inferior, timer implementation. > > > >You completely do not want to understand how kevent works and why they > >are needed; if you would try to think that there are opinions different > >from yours, then probably we could have some progress. > > I think I know very well how they work meanwhile. If that were true, I would be very happy. Definitely. > >Those callbacks are needed to support different types of objects, which > >can produce events, with the same interface. > > Yes, but it is not necessary to expose all the different types in the > userlevel APIs. That's the issue. Reduce the exposure of kernel > functionality to userlevel APIs. > > If you integrate the timer handling into the POSIX timer syscalls the > callbacks in your timer patch might not need to be there. At least the > enqueue callback, if I remember correctly. All enqueue operations are > initiated by timer_create calls which can call the function directly. > Removing the callback from the list used by add_ctl will reduce the > exposed interface. I posted a patch to implement kevent support for posix timers; it is quite simple in the existing model.
No need to remove anything; that allows flexibility and the creation of different usage models other than what is required by a fairly small part of the users. > >>>I can replace with -ENOSYS if you like. > >>It's necessary since we must be able to distinguish the errors. > > > >And what if a user requests a bogus event type - is it an invalid condition or > >normal, but not handled (thus ENOSYS)? > > It's ENOSYS. Just like for system calls. You cannot distinguish > completely invalid values from values which are correct only on later > kernels. But: the first use is a bug while the latter is not a bug and is > needed to write robust and well performing apps. The former's problems > therefore are unimportant. I implemented it to return -ENOSYS for the case when the event type is smaller than the maximum allowed and no subsystem is registered, and -EINVAL for the case when the requested type is higher. > >Well, then I claim that I do not know 'a thing or two about interfaces > >the runtime programs expect to use', but instead I write those programs > >and I know my needs. And POSIX interfaces are the last ones I prefer to > >use. > > Well, there it is. You look out for yourself while I make sure that all > the bases I can think of are covered. > > Again, if you don't want to work on the generalization, fine. That's > your right. Nobody will think badly of you for doing this. But don't > expect that a) I'll not try to change it and b) I'll not object to the > changes being accepted as they are. It is not about generalization, but about those who do practical work and those who prefer to spread theoretical thoughts, which results in several months of useless empty discussions. > >What if it will not be called POSIX AIO, but instead some kind of 'true > >AIO' or 'real AIO' or maybe 'alternative AIO'? :) > >It is quite sure that POSIX AIO interfaces are unlikely to be applied > >there... > > Programmers don't like specialized OS-specific interfaces. AIO users > who put up with libaio are rare. The same will happen with any other > approach. The Samba use is symptomatic: they need portability even if > this costs a minute percentage of performance compared to a highly > specialized implementation. Do not speak for everyone - it is not some kind of feudalism with only one opinion allowed - respect those who do not like or do not want what you propose they use. > There might be some aspects of POSIX AIO which could be implemented > better on Linux. But the important part in the name is the 'P'. I will create a completely different model; POSIX is simply not designed for that. That model allows specifying a set of tasks to be performed on an object completely asynchronously to the user before the object is returned - for example, specify a destination socket and filename, so an async sendfile will asynchronously open the file, transfer it to the remote destination and probably even close it (or return the file descriptor). The same can be applied to AIO read/write. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
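For context, the race both sides keep referring to is the classic one that pselect() was added to close; a minimal textbook illustration (standard POSIX usage, not kevent code) follows:

#include <signal.h>
#include <sys/select.h>

/* The racy version: a signal delivered between the sigprocmask()
 * and the select() runs its handler before select() sleeps, so the
 * notification is consumed and select() blocks even though work is
 * pending. */
void racy_wait(int nfds, fd_set *rfds, sigset_t *mask)
{
	sigset_t oldmask;

	sigprocmask(SIG_UNBLOCK, mask, &oldmask);
	/* <-- a signal delivered right here is lost to the wait */
	select(nfds, rfds, NULL, NULL, NULL);
	sigprocmask(SIG_SETMASK, &oldmask, NULL);
}

/* pselect() installs the mask and starts waiting atomically, so
 * the signal can only be delivered while the call sleeps: */
void atomic_wait(int nfds, fd_set *rfds, sigset_t *mask)
{
	pselect(nfds, rfds, NULL, NULL, NULL, mask);
}

Evgeniy's counter-argument above is that a kevent-delivered signal never runs a handler at all; it is queued like any other event, so this particular window does not exist for it. Drepper's reply below is that the mask matters for signals which are not delivered through kevent.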
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 12:18 ` Evgeniy Polyakov @ 2006-11-23 22:23 ` Ulrich Drepper 2006-11-24 10:57 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Timeouts are not about AIO or any other event types (there are a lot of > them already as you can see), they are only about the syscall itself. > Please point me to _any_ syscall out there which uses absolute time > (except settimeofday() and similar syscalls). futex(FUTEX_LOCK_PI). > Btw, do you propose to change all users of wait_event()? Which users? > The interface is not restricted, it is just different from what you want it > to be, and you did not show why it requires changes. No, it is restricted because I cannot express something like an absolute timeout/deadline. If the parameter were a struct timespec* then at any time we could implement relative timeouts w/ and w/out observance of settimeofday/ntp as well as absolute timeouts. This is what makes the interface generic and unrestricted while your current version cannot be used for the latter. > kevent signal registering is atomic with respect to other kevent > syscalls: control syscalls are protected by a mutex and waiting syscalls > work with the queue, which is protected by an appropriate lock. It is about atomicity wrt the signal mask manipulation which would have to precede the kevent_wait call and the call itself (and registering a signal for kevent delivery). This is not atomic. > Let me formulate the signal problem here; please tell me if it is correct > or not. There are a myriad of different scenarios; it makes no sense to pick one. The interface must be generic to cover them all, I don't know how often I have to repeat this. > A user registers some async signal notifications and calls poll() waiting > for some file descriptors to become ready. When it is interrupted there > is no knowledge of what really happened first - the signal was delivered > or the file descriptor was ready. The order is unimportant. You change the signal mask, for instance, if the time when a thread is waiting in poll() is the only time when a signal can be handled. Or vice versa, it's the time when signals are not wanted. And these are per-thread decisions. Signal handlers and kevent registrations for signals are process-wide decisions. And furthermore: with kevent-delivered signals there is no signal mask anymore (at least you seem to not check it). Even if this were done it wouldn't change the fact that you cannot use signals the way many programs want to. Fact is that without a signal queue you cannot implement the above cases. You cannot block/unblock a signal for a specific thread. You also cannot work together with signals which cannot be delivered through kevent. This is the case for existing code in a program which happens to also use kevent and it is the case if there is more than one possible recipient. With kevent, signals can be attached to one kevent queue only but the recipients (different threads or only different parts of a program) need not use the same kevent queue. I've said from the start that you cannot possibly expect that programs are not using signal delivery in the current form.
And the complete loss of blocking signals for individual threads makes the kevent-based signal delivery incomplete (in a non-fixable form) anyway. > In case it is, let me explain why this situation cannot happen with > kevent: since signals are not delivered in the old way, but instead are > queued into the same queue where file descriptors are, and queueing > is atomic, and the pending signal mask is not updated, the user will only read > one event after another, which automatically (since delivery is atomic) > means that what was read first is what happened first. This really has nothing to do with the problem. > I posted a patch to implement kevent support for posix timers; it is > quite simple in the existing model. No need to remove anything, Surely you don't suggest keeping your original timer patch? > I implemented it to return -ENOSYS for the case when the event type is > smaller than the maximum allowed and no subsystem is registered, and -EINVAL > for the case when the requested type is higher. What is the "maximum allowed"? ENOSYS must be returned for all values which could potentially in future be used as a valid type value. If you limit the values which are treated this way you are setting a fixed upper limit for the type values which can _ever_ be used. > It is not about generalization, but about those who do practical work > and those who prefer to spread theoretical thoughts, which results in > several months of useless empty discussions. I've told you: then don't work on these parts. I'll get the changes I think are needed implemented by somebody else or I'll do it myself. If you say that only those who implement something have a say in the way this is done then this is fine with me. But you have to realize that you're not the one who will make all the final decisions. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 22:23 ` Ulrich Drepper @ 2006-11-24 10:57 ` Evgeniy Polyakov 2006-11-27 19:12 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 10:57 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 02:23:12PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper > >(drepper@redhat.com) wrote: > >Timeouts are not about AIO or any other event types (there are a lot of > >them already as you can see), they are only about the syscall itself. > >Please point me to _any_ syscall out there which uses absolute time > >(except settimeofday() and similar syscalls). > > futex(FUTEX_LOCK_PI). It just sets an hrtimer with absolute time and sleeps - it can achieve the same goals using a mechanism similar to wait_event(). > >Btw, do you propose to change all users of wait_event()? > > Which users? Any users which use wait_event() or schedule_timeout(). Futex, for example - it lives perfectly OK with relative timeouts provided to schedule_timeout() - the same (roughly speaking, of course) is done in kevent. > >The interface is not restricted, it is just different from what you want it > >to be, and you did not show why it requires changes. > > No, it is restricted because I cannot express something like an absolute > timeout/deadline. If the parameter were a struct timespec* then at > any time we could implement relative timeouts w/ and w/out > observance of settimeofday/ntp as well as absolute timeouts. This is what > makes the interface generic and unrestricted while your current version > cannot be used for the latter. I think I have said several times already that absolute timeouts are not related to the syscall execution process. But you seem not to hear me and insist. Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct timespec' as the timeout parameter. A special bit in flags will result in an additional timer setup which will fire at the absolute timeout and will wake up those who wait... > >kevent signal registering is atomic with respect to other kevent > >syscalls: control syscalls are protected by a mutex and waiting syscalls > >work with the queue, which is protected by an appropriate lock. > > It is about atomicity wrt the signal mask manipulation which would > have to precede the kevent_wait call and the call itself (and > registering a signal for kevent delivery). This is not atomic. If the signal mask is updated from userspace it should be done through kevent - adding/removing different kevent signals. The pending signal mask is not updated for special kevent signals. > >Let me formulate the signal problem here; please tell me if it is correct > >or not. > > There are a myriad of different scenarios; it makes no sense to pick > one. The interface must be generic to cover them all, I don't know how > often I have to repeat this. The whole signal mask was added by POSIX exactly for that single practical race in the event dispatching mechanism, which cannot handle other types of events like signals. > >A user registers some async signal notifications and calls poll() waiting > >for some file descriptors to become ready. When it is interrupted there > >is no knowledge of what really happened first - the signal was delivered > >or the file descriptor was ready. > > The order is unimportant.
You change the signal mask, for instance, if > the time when a thread is waiting in poll() is the only time when a > signal can be handled. Or vice versa, it's the time when signals are > not wanted. And these are per-thread decisions. > > Signal handlers and kevent registrations for signals are process-wide > decisions. And furthermore: with kevent-delivered signals there is no > signal mask anymore (at least you seem to not check it). Even if this > were done it wouldn't change the fact that you cannot use signals the > way many programs want to. There is a major contradiction here - you say that programmers will use old-style signal delivery and want me to add a signal mask to prevent that delivery, so signals would be in the blocked mask; when I say that the current kevent signal delivery does not update the pending signal mask, which is the same as putting signals into the blocked mask, you say that it is not what is required. > Fact is that without a signal queue you cannot implement the above > cases. You cannot block/unblock a signal for a specific thread. You > also cannot work together with signals which cannot be delivered through > kevent. This is the case for existing code in a program which happens > to also use kevent and it is the case if there is more than one possible > recipient. With kevent, signals can be attached to one kevent queue only > but the recipients (different threads or only different parts of a > program) need not use the same kevent queue. The signal queue is replaced with the kevent queue, and it is in sync with all other kevents. Programmers who want to use kevents will use kevents (if a miracle happens and we agree that kevent is good for inclusion), and programmers will know how kevent signal delivery works. > I've said from the start that you cannot possibly expect that programs > are not using signal delivery in the current form. And the complete > loss of blocking signals for individual threads makes the kevent-based > signal delivery incomplete (in a non-fixable form) anyway. Having a sigmask parameter is the same as creating kevent signal delivery. And, btw, programmers can change the signal mask before calling the syscall, since in the syscall there is a gap between the start and the sigprocmask() call. > >In case it is, let me explain why this situation cannot happen with > >kevent: since signals are not delivered in the old way, but instead are > >queued into the same queue where file descriptors are, and queueing > >is atomic, and the pending signal mask is not updated, the user will only read > >one event after another, which automatically (since delivery is atomic) > >means that what was read first is what happened first. > > This really has nothing to do with the problem. It is the only practical example of the need for that signal mask. And it can be perfectly handled by kevent. > >I posted a patch to implement kevent support for posix timers; it is > >quite simple in the existing model. No need to remove anything, > > Surely you don't suggest keeping your original timer patch? Of course not - kevent timers are more scalable than posix timers (the latter uses idr, which is slower than a balanced binary tree, since it looks like it uses a radix-tree-like algorithm), and the POSIX interface is much, much more inconvenient to use than simple add/wait. > >I implemented it to return -ENOSYS for the case when the event type is > >smaller than the maximum allowed and no subsystem is registered, and -EINVAL > >for the case when the requested type is higher. > > What is the "maximum allowed"?
ENOSYS must be returned for all values > which could potentially in future be used as a valid type value. If you > limit the values which are treated this way you are setting a fixed > upper limit for the type values which can _ever_ be used. The upper limit is for the current version - when a new type is added the limit is increased - just like the maximum number of syscalls. Ok, I will use -ENOSYS for all cases. > >It is not about generalization, but about those who do practical work > >and those who prefer to spread theoretical thoughts, which results in > >several months of useless empty discussions. > > I've told you: then don't work on these parts. I'll get the changes I > think are needed implemented by somebody else or I'll do it myself. If > you say that only those who implement something have a say in the way > this is done then this is fine with me. But you have to realize that > you're not the one who will make all the final decisions. Because our empty discussion seems to never end, which puts kevent into a hung state - I definitely prefer the final word about inclusion or rejection of kevent to be said by the kernel maintainers, but they keep silent since they look not only for my decision as the author, but also for the different opinions of the potential users. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
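A rough sketch of what the agreed-upon timeout change could look like (the flag name and both prototypes are invented here for illustration; the actual take25 code may spell this differently):

/* Hypothetical extended wait syscall: NULL timeout means block
 * forever, a plain timespec is a relative timeout, and an assumed
 * KEVENT_FLAGS_ABSTIME bit turns it into an absolute deadline. */
#define KEVENT_FLAGS_ABSTIME	0x1

asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int min_nr,
		unsigned int max_nr, struct timespec __user *timeout,
		unsigned int flags);

/* Userspace, relative: wait up to half a second for one event. */
struct timespec ts = { .tv_sec = 0, .tv_nsec = 500000000 };
kevent_wait(fd, 1, 32, &ts, 0);

/* Userspace, absolute: wait until a CLOCK_REALTIME deadline,
 * interrupted if the clock is set past it. */
clock_gettime(CLOCK_REALTIME, &ts);
ts.tv_sec += 5;
kevent_wait(fd, 1, 32, &ts, KEVENT_FLAGS_ABSTIME);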
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-24 10:57 ` Evgeniy Polyakov @ 2006-11-27 19:12 ` Ulrich Drepper 2006-11-28 11:00 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:12 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > It just sets an hrtimer with absolute time and sleeps - it can achieve the same > goals using a mechanism similar to wait_event(). I don't follow. Of course it is somehow possible to wait until an absolute deadline. But it's not part of the parameter list and hence not easily and _quickly_ usable. >>> Btw, do you propose to change all users of wait_event()? >> Which users? > > Any users which use wait_event() or schedule_timeout(). Futex, for > example - it lives perfectly OK with relative timeouts provided to > schedule_timeout() - the same (roughly speaking, of course) is done in kevent. No, it does not live perfectly OK with relative timeouts. The userlevel implementation is actually wrong because of this in subtle ways. Some futex interfaces take absolute timeouts and they have to be interrupted if the realtime clock is set forward. Also, the calls are complicated and slow because the userlevel wrapper has to call clock_gettime/gettimeofday before each futex syscall. If the kernel accepted absolute timeouts as well we would save a syscall and actually have a correct implementation. > I think I have said several times already that absolute timeouts are not > related to the syscall execution process. But you seem not to hear me and > insist. Because you're wrong. For your use cases it might not be but it's not true in general. And your interface is preventing it from being implemented forever. > Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct > timespec' as the timeout parameter. A special bit in flags will result in an > additional timer setup which will fire at the absolute timeout and will > wake up those who wait... Thanks a lot. >>> kevent signal registering is atomic with respect to other kevent >>> syscalls: control syscalls are protected by a mutex and waiting syscalls >>> work with the queue, which is protected by an appropriate lock. >> It is about atomicity wrt the signal mask manipulation which would >> have to precede the kevent_wait call and the call itself (and >> registering a signal for kevent delivery). This is not atomic. > > If the signal mask is updated from userspace it should be done through > kevent - adding/removing different kevent signals. Indeed, this is what I've been saying and why ppoll/pselect/epoll_pwait take the sigset_t parameter. Adding the signal mask to the queued events (e.g., the signal events) does not work. First of all it's slow, you'd have to find and combine all masks at least every time a signal event is added/removed. Then how do you combine them, OR or AND? Not all threads might want/need the same signal mask. These are just some of the usability problems. The only clean and usable solution is really to OPTIONALLY pass in the signal mask. Nobody forces anybody to use this feature. Pass a NULL pointer and nothing happens; this is how the other syscalls also work. > The whole signal mask was added by POSIX exactly for that single > practical race in the event dispatching mechanism, which cannot handle > other types of events like signals. No. How should this argument make sense?
Signals cannot be used in the current event handling and are therefore used for something completely different. And they will have to be used like this for many applications (e.g., thread cancellation, setuid/setgid implementation, etc.). The fact that the new event handling can handle signals is orthogonal (and good). But it does not supersede the old signal use, it's something new. The old uses are still valid. BTW: there is a little design decision which has to be made: if a signal is registered with kevent and this signal is sent to a specific thread instead of the process (tkill and tgkill), what should happen? I'm currently leaning toward failing the tkill/tgkill syscall if delivery of the signal requires posting to an event queue. > There is a major contradiction here - you say that programmers will use > old-style signal delivery and want me to add a signal mask to prevent that > delivery, so signals would be in the blocked mask, That's one thing you can do. You also can unblock signals. > when I say that the current kevent > signal delivery does not update the pending signal mask, which is the same as > putting signals into the blocked mask, you say that it is not what is > required. First, what is the "pending signal mask"? There is one signal mask per thread. And "pending" refers to delivery (either per-process or per-thread), which is not the signal mask (well, for non-RT signals it can be a bitmap but this still is not a mask). Second, I'm not talking about signal delivery. Yes, sigaction allows specifying how the signal mask is to be changed when a signal is delivered. But this is not what I'm talking about. I'm talking about the signal mask used for the duration of the kevent_wait syscall, regardless of whether signals are waited for or delivered. > The signal queue is replaced with the kevent queue, and it is in sync with all > other kevents. But the signal mask is something completely different and completely independent from the signal queue. There is nothing in the kevent interface to replace that functionality. Nor should this be possible with the events; only a sigset_t parameter to kevent_wait makes sense. > Having a sigmask parameter is the same as creating kevent signal delivery. No, no, no. Not at all. >> Surely you don't suggest keeping your original timer patch? > > Of course not - kevent timers are more scalable than posix timers (the > latter uses idr, which is slower than a balanced binary tree, since it > looks like it uses a radix-tree-like algorithm), and the POSIX interface is > much, much more inconvenient to use than simple add/wait. I assume you misread the question. You agree to drop the patch and then go on listing reasons why you think it's better to keep it. I don't think these arguments are in any way sufficient. The interface is already too big and this is 100% duplicate functionality. If there are performance problems with the POSIX timer implementation (and I have yet to see indications) they should be fixed instead of worked around. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
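Drepper's point about wrapper overhead can be made concrete with a simplified sketch of what userlevel code must do when the kernel takes only relative timeouts (illustrative only, not the actual glibc code):

#include <errno.h>
#include <time.h>

/* Convert an absolute CLOCK_REALTIME deadline into a relative
 * timeout.  The clock_gettime() is an extra syscall on every wait,
 * and if the realtime clock is set forward after it returns, the
 * computed relative timeout is silently too long - exactly the
 * correctness problem absolute kernel timeouts would remove. */
static int abs_to_rel(const struct timespec *abstime, struct timespec *rel)
{
	struct timespec now;

	clock_gettime(CLOCK_REALTIME, &now);
	rel->tv_sec = abstime->tv_sec - now.tv_sec;
	rel->tv_nsec = abstime->tv_nsec - now.tv_nsec;
	if (rel->tv_nsec < 0) {
		rel->tv_nsec += 1000000000;
		rel->tv_sec--;
	}
	if (rel->tv_sec < 0)
		return ETIMEDOUT;	/* deadline already passed */
	return 0;	/* caller now issues the relative-timeout syscall */
}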
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-27 19:12 ` Ulrich Drepper @ 2006-11-28 11:00 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 11:00 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 11:12:21AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >It just sets an hrtimer with absolute time and sleeps - it can achieve the same > >goals using a mechanism similar to wait_event(). > > I don't follow. Of course it is somehow possible to wait until an > absolute deadline. But it's not part of the parameter list and hence > not easily and _quickly_ usable. I just described how it is implemented in futex. I will create the same approach - an hrtimer which will wake up wait_event() with an infinite timeout. > >>>Btw, do you propose to change all users of wait_event()? > >>Which users? > > > >Any users which use wait_event() or schedule_timeout(). Futex, for > >example - it lives perfectly OK with relative timeouts provided to > >schedule_timeout() - the same (roughly speaking, of course) is done in kevent. > > No, it does not live perfectly OK with relative timeouts. The userlevel > implementation is actually wrong because of this in subtle ways. Some > futex interfaces take absolute timeouts and they have to be interrupted > if the realtime clock is set forward. > > Also, the calls are complicated and slow because the userlevel wrapper > has to call clock_gettime/gettimeofday before each futex syscall. If > the kernel accepted absolute timeouts as well we would save a > syscall and actually have a correct implementation. It is only done for the LOCK_PI case, which was specially created to have an absolute timeout, i.e. futex does not need it, but there is the option. I will extend the waiting syscalls to take a timespec and an absolute timeout; I just want to stop this (I hope you agree) stupid endless arguing about a completely unimportant thing. > >I think I have said several times already that absolute timeouts are not > >related to the syscall execution process. But you seem not to hear me and > >insist. > > Because you're wrong. For your use cases it might not be but it's not > true in general. And your interface is preventing it from being > implemented forever. Because I'm right and it will not be used :) Well, it does not matter anymore, right? > >Ok, I will change the waiting syscalls to have a 'flags' parameter and a 'struct > >timespec' as the timeout parameter. A special bit in flags will result in an > >additional timer setup which will fire at the absolute timeout and will > >wake up those who wait... > > Thanks a lot. No problem - I always like to spend a couple of months arguing about taste and 'right-from-my-point-of-view' theories - isn't it the best way to waste time? ... > >Having a sigmask parameter is the same as creating kevent signal delivery. > > No, no, no. Not at all. I've dropped a lot, but let me describe the signal mask problem in a few words: the signal mask provided to sys_pselect() and friends is a mask of signals which will be put into the blocked mask in the task structure in the kernel. When a new signal is about to be delivered, the signal number is checked against the blocked mask, and if it is there, the signal is not put into the pending mask of signals, which ends up with it not being delivered to userspace.
Kevent (with a special flag) does exactly the same - but it does not update the blocked mask; instead it adds another check for whether the signal is in the kevent set of requests, in which case the signal is delivered to userspace through the kevent queue. It is _exactly_ the same behaviour from the userspace point of view concerning the race of signal delivery versus file descriptor readiness. Exactly. Here is a code snippet:

specific_send_sig_info()
{
	...
	/* Short-circuit ignored signals. */
	if (sig_ignored(t, sig))
		goto out;
	...
	ret = send_signal(sig, info, t, &t->pending);
	if (!ret && !sigismember(&t->blocked, sig))
		signal_wake_up(t, sig == SIGKILL);
#ifdef CONFIG_KEVENT_SIGNAL
	/*
	 * Kevent allows delivering signals through the kevent queue;
	 * it is possible to set up kevent to not deliver a signal
	 * through the usual way, in which case send_signal()
	 * returns 1 and the signal is delivered only through the
	 * kevent queue. We simulate successful delivery notification
	 * through this hack:
	 */
	if (ret == 1)
		ret = 0;
#endif
out:
	return ret;
}

> >>Surely you don't suggest keeping your original timer patch? > > > >Of course not - kevent timers are more scalable than posix timers (the > >latter uses idr, which is slower than a balanced binary tree, since it > >looks like it uses a radix-tree-like algorithm), and the POSIX interface is > >much, much more inconvenient to use than simple add/wait. > > I assume you misread the question. You agree to drop the patch and then > go on listing reasons why you think it's better to keep it. I don't > think these arguments are in any way sufficient. The interface is > already too big and this is 100% duplicate functionality. If there are > performance problems with the POSIX timer implementation (and I have yet > to see indications) they should be fixed instead of worked around. I do _not_ agree to drop the kevent timer patch (not the posix timer one), since from my point of view it is a much more convenient interface, it is more scalable, and it is generic enough to be used with other kevent methods. But anyway, we can spend an awful lot of time arguing about taste, which is definitely _NOT_ what we want. So, there are two worlds - posix timers and usual timers, accessible from userspace, the first through timer_create() and friends, the second through the kevent interface. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
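The hrtimer approach Evgeniy describes could look roughly like this (a sketch under assumed names: the 'deadline' timer field and the callback are invented, and the constant spellings follow current kernels rather than the 2006 tree; need_exit and the wait queue are from the posted patch):

/* Arm an hrtimer at the absolute deadline; its callback sets the
 * existing need_exit flag and wakes the kevent wait queue, so the
 * wait below can otherwise sleep with no timeout at all. */
static enum hrtimer_restart kevent_deadline_fn(struct hrtimer *timer)
{
	struct kevent_user *u = container_of(timer, struct kevent_user, deadline);

	u->need_exit = 1;
	wake_up(&u->wait);
	return HRTIMER_NORESTART;
}

hrtimer_init(&u->deadline, CLOCK_REALTIME, HRTIMER_MODE_ABS);
u->deadline.function = kevent_deadline_fn;
hrtimer_start(&u->deadline, timespec_to_ktime(ts), HRTIMER_MODE_ABS);
wait_event_interruptible(u->wait, u->ready_num >= min_nr || u->need_exit);
hrtimer_cancel(&u->deadline);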
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper 2006-11-22 10:38 ` Evgeniy Polyakov @ 2006-11-22 12:09 ` Evgeniy Polyakov 2006-11-22 12:15 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 12:09 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Threads are parked in syscalls - which one should be interrupted? > > It doesn't matter, use the same policy you use when waking a thread in > case of an event. This is not about waking a specific thread, it's > about not dropping the event notification. > > > >And what if there were no threads waiting in syscalls? > > This is fine, do nothing. It means that the other threads are about to > read the ring buffer and will pick up the event. > > > The case which must be avoided is that of all threads being in the > kernel, one thread gets woken, and then is canceled. Without notifying > the kernel about the cancellation and in the absence of further event > notifications the process is deadlocked. > > A second case which should be avoided is that there is a thread waiting > when a thread gets canceled and there are one or more additional threads > around, but not in the kernel. But those other threads might not get to > the ring buffer anytime soon, so handling the event is unnecessarily > delayed. Ok, to solve the problem in a way which should be good for both, I decided to implement an additional syscall which will allow marking any event as ready and thus waking up the appropriate threads. If userspace requests zero events to be marked as ready, the syscall will just interrupt/wake up one of the listeners parked in the syscall. Peace? -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:09 ` Evgeniy Polyakov @ 2006-11-22 12:15 ` Evgeniy Polyakov 2006-11-22 13:46 ` Evgeniy Polyakov 2006-11-22 22:24 ` Ulrich Drepper 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 12:15 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > Ok, to solve the problem in a way which should be good for both, I > decided to implement an additional syscall which will allow marking any > event as ready and thus waking up the appropriate threads. If userspace > requests zero events to be marked as ready, the syscall will just > interrupt/wake up one of the listeners parked in the syscall. Btw, what about putting an additional multiplexer into the add/remove/modify switch? There would be a logical 'ready' add-on. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:15 ` Evgeniy Polyakov @ 2006-11-22 13:46 ` Evgeniy Polyakov 2006-11-22 22:24 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 13:46 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 03:15:16PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > > Ok, to solve the problem in the way which should be good for both I > > decided to implement additional syscall which will allow to mark any > > event as ready and thus wake up appropriate threads. If userspace will > > request zero events to be marked as ready, syscall will just > > interrupt/wakeup one of the listeners parked in syscall. > > Btw, what about putting aditional multiplexer into add/remove/modify > switch? There will be logical 'ready' addon? Something like this. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/include/linux/kevent.h b/include/linux/kevent.h index c909c62..7afb3d6 100644 --- a/include/linux/kevent.h +++ b/include/linux/kevent.h @@ -99,6 +99,8 @@ struct kevent_user struct mutex ctl_mutex; /* Wait until some events are ready. */ wait_queue_head_t wait; + /* Exit from syscall if someone wants us to do it */ + int need_exit; /* Reference counter, increased for each new kevent. */ atomic_t refcnt; @@ -132,6 +134,8 @@ void kevent_storage_fini(struct kevent_s int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); +void kevent_ready(struct kevent *k, int ret); + int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); #ifdef CONFIG_KEVENT_POLL diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h index 0680fdf..6bc0c79 100644 --- a/include/linux/ukevent.h +++ b/include/linux/ukevent.h @@ -174,5 +174,6 @@ struct kevent_ring #define KEVENT_CTL_ADD 0 #define KEVENT_CTL_REMOVE 1 #define KEVENT_CTL_MODIFY 2 +#define KEVENT_CTL_READY 3 #endif /* __UKEVENT_H */ diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c index 4d2d878..d1770a1 100644 --- a/kernel/kevent/kevent.c +++ b/kernel/kevent/kevent.c @@ -91,10 +91,10 @@ int kevent_init(struct kevent *k) spin_lock_init(&k->ulock); k->flags = 0; - if (unlikely(k->event.type >= KEVENT_MAX) + if (unlikely(k->event.type >= KEVENT_MAX)) return kevent_break(k); - if (!kevent_registered_callbacks[k->event.type].callback)) { + if (!kevent_registered_callbacks[k->event.type].callback) { kevent_break(k); return -ENOSYS; } @@ -142,16 +142,10 @@ void kevent_storage_dequeue(struct keven spin_unlock_irqrestore(&st->lock, flags); } -/* - * Call kevent ready callback and queue it into ready queue if needed. - * If kevent is marked as one-shot, then remove it from storage queue. - */ -static int __kevent_requeue(struct kevent *k, u32 event) +void kevent_ready(struct kevent *k, int ret) { - int ret, rem; unsigned long flags; - - ret = k->callbacks.callback(k); + int rem; spin_lock_irqsave(&k->ulock, flags); if (ret > 0) @@ -178,6 +172,19 @@ static int __kevent_requeue(struct keven spin_unlock_irqrestore(&k->user->ready_lock, flags); wake_up(&k->user->wait); } +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. 
+ * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret; + + ret = k->callbacks.callback(k); + + kevent_ready(k, ret); return ret; } diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c index 2cd8c99..3d1ea6b 100644 --- a/kernel/kevent/kevent_user.c +++ b/kernel/kevent/kevent_user.c @@ -47,8 +47,9 @@ static unsigned int kevent_user_poll(str poll_wait(file, &u->wait, wait); mask = 0; - if (u->ready_num) + if (u->ready_num || u->need_exit) mask |= POLLIN | POLLRDNORM; + u->need_exit = 0; return mask; } @@ -136,6 +137,7 @@ static struct kevent_user *kevent_user_a mutex_init(&u->ctl_mutex); init_waitqueue_head(&u->wait); + u->need_exit = 0; atomic_set(&u->refcnt, 1); @@ -487,6 +489,97 @@ static struct ukevent *kevent_get_user(u return ukev; } +static int kevent_mark_ready(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->st->lock); + kevent_ready(k, 1); + spin_unlock(&k->st->lock); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Mark appropriate kevents as ready. + * If number of events is zero just wake up one listener. + */ +static int kevent_user_ctl_ready(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = -EINVAL, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + if (num > u->kevent_num) + return err; + + if (!num) { + u->need_exit = 1; + wake_up(&u->wait); + return 0; + } + + mutex_lock(&u->ctl_mutex); + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_mark_ready(&ukev[i], u); + if (err) { + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_mark_ready(&uk, u); + if (err) { + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = num - rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + /* * Read from userspace all ukevents and modify appropriate kevents. 
* If provided number of ukevents is more that threshold, it is faster @@ -779,9 +872,10 @@ static int kevent_user_wait(struct file if (!(file->f_flags & O_NONBLOCK)) { wait_event_interruptible_timeout(u->wait, - u->ready_num >= min_nr, + (u->ready_num >= min_nr) || u->need_exit, clock_t_to_jiffies(nsec_to_clock_t(timeout))); } + u->need_exit = 0; while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { if (copy_to_user(buf + num*sizeof(struct ukevent), @@ -819,6 +913,9 @@ static int kevent_ctl_process(struct fil case KEVENT_CTL_MODIFY: err = kevent_user_ctl_modify(u, num, arg); break; + case KEVENT_CTL_READY: + err = kevent_user_ctl_ready(u, num, arg); + break; default: err = -EINVAL; break; @@ -994,9 +1091,10 @@ asmlinkage long sys_kevent_wait(int ctl_ if (!(file->f_flags & O_NONBLOCK)) { wait_event_interruptible_timeout(u->wait, - ((u->ready_num >= 1) && (kevent_ring_space(u))), + ((u->ready_num >= 1) && kevent_ring_space(u)) || u->need_exit, clock_t_to_jiffies(nsec_to_clock_t(timeout))); } + u->need_exit = 0; for (i=0; i<num; ++i) { k = kevent_dequeue_ready_ring(u); -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 200+ messages in thread
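To make the intended use of the new multiplexer concrete, here is a hypothetical userspace view (the kevent_ctl() wrapper signature and the id variables are assumed for illustration, not taken from a posted header):

/* Wake one listener parked in kevent_wait()/kevent_get_events()
 * without marking anything ready (the num == 0 path above): */
kevent_ctl(kevent_fd, KEVENT_CTL_READY, 0, NULL);

/* Mark two previously added events as ready; entries that could
 * not be found are copied back into the buffer, and the return
 * value is the number successfully marked (num - rnum above): */
struct ukevent uk[2];
memset(uk, 0, sizeof(uk));
uk[0].id = first_id;	/* ids saved from KEVENT_CTL_ADD (assumed) */
uk[1].id = second_id;
int marked = kevent_ctl(kevent_fd, KEVENT_CTL_READY, 2, uk);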
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 12:15 ` Evgeniy Polyakov 2006-11-22 13:46 ` Evgeniy Polyakov @ 2006-11-22 22:24 ` Ulrich Drepper 2006-11-23 12:22 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 22:24 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: >> Ok, to solve the problem in a way which should be good for both, I >> decided to implement an additional syscall which will allow marking any >> event as ready and thus waking up the appropriate threads. If userspace >> requests zero events to be marked as ready, the syscall will just >> interrupt/wake up one of the listeners parked in the syscall. I'll wait for the new code drop to comment. > Btw, what about putting an additional multiplexer into the add/remove/modify > switch? There would be a logical 'ready' add-on. Is it needed? Usually this is done with a *_wait call with a timeout of zero. That code path might have to be optimized but it should already be there. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-22 22:24 ` Ulrich Drepper @ 2006-11-23 12:22 ` Evgeniy Polyakov 2006-11-23 20:34 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 12:22 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Nov 22, 2006 at 02:24:00PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov > >(johnpol@2ka.mipt.ru) wrote: > >>Ok, to solve the problem in a way which should be good for both, I > >>decided to implement an additional syscall which will allow marking any > >>event as ready and thus waking up the appropriate threads. If userspace > >>requests zero events to be marked as ready, the syscall will just > >>interrupt/wake up one of the listeners parked in the syscall. > > I'll wait for the new code drop to comment. I posted it. > >Btw, what about putting an additional multiplexer into the add/remove/modify > >switch? There would be a logical 'ready' add-on. > > Is it needed? Usually this is done with a *_wait call with a timeout of > zero. That code path might have to be optimized but it should already > be there. It does not allow marking events as ready. And the current interfaces wake up either when the timeout is zero (in this case the thread itself does not sleep and can process events), or when there is _new_ work - since there is no _new_ work when the thread awakened to process it was killed, the kernel does not think that something is wrong. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 12:22 ` Evgeniy Polyakov @ 2006-11-23 20:34 ` Ulrich Drepper 2006-11-24 10:58 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:34 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: >>> Btw, what about putting an additional multiplexer into the add/remove/modify >>> switch? There would be a logical 'ready' add-on. >> Is it needed? Usually this is done with a *_wait call with a timeout of >> zero. That code path might have to be optimized but it should already >> be there. > > It does not allow marking events as ready. > And the current interfaces wake up either when the timeout is zero (in this case > the thread itself does not sleep and can process events), or when there is > _new_ work - since there is no _new_ work when the thread awakened to > process it was killed, the kernel does not think that something is wrong. Rather than mark an existing entry as ready, how about a call to inject a new ready event? This would be useful to implement functionality at userlevel and still use an event queue to announce the availability. Without this type of functionality we'd need to use indirect notification via signal or pipe or something like that. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-23 20:34 ` Ulrich Drepper @ 2006-11-24 10:58 ` Evgeniy Polyakov 2006-11-27 18:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 10:58 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Thu, Nov 23, 2006 at 12:34:50PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>>Btw, what about putting an additional multiplexer into the add/remove/modify > >>>switch? There would be a logical 'ready' add-on. > >>Is it needed? Usually this is done with a *_wait call with a timeout of > >>zero. That code path might have to be optimized but it should already > >>be there. > > > >It does not allow marking events as ready. > >And the current interfaces wake up either when the timeout is zero (in this case > >the thread itself does not sleep and can process events), or when there is > >_new_ work - since there is no _new_ work when the thread awakened to > >process it was killed, the kernel does not think that something is wrong. > > Rather than mark an existing entry as ready, how about a call to inject > a new ready event? > > This would be useful to implement functionality at userlevel and still > use an event queue to announce the availability. Without this type of > functionality we'd need to use indirect notification via signal or pipe > or something like that. With the provided patch it is possible to wake up 'for free' - just call kevent_ctl(ready) with a zero number of ready events, so a thread will be awakened if it was in poll(kevent_fd), kevent_wait() or kevent_get_events(). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-24 10:58 ` Evgeniy Polyakov @ 2006-11-27 18:23 ` Ulrich Drepper 2006-11-28 10:13 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 18:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > > With provided patch it is possible to wakeup 'for-free' - just call > kevent_ctl(ready) with zero number of ready events, so thread will be > awakened if it was in poll(kevent_fd), kevent_wait() or > kevent_get_events(). Yes, I realize that. But I wrote something else: >> Rather than mark an existing entry as ready, how about a call to >> inject a new ready event? >> >> This would be useful to implement functionality at userlevel and >> still use an event queue to announce the availability. Without this >> type of functionality we'd need to use indirect notification via >> signal or pipe or something like that. This is still something which is wanted. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-27 18:23 ` Ulrich Drepper @ 2006-11-28 10:13 ` Evgeniy Polyakov 2006-12-27 20:45 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 10:13 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Mon, Nov 27, 2006 at 10:23:39AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > > > >With the provided patch it is possible to wake up 'for free' - just call > >kevent_ctl(ready) with a zero number of ready events, so a thread will be > >awakened if it was in poll(kevent_fd), kevent_wait() or > >kevent_get_events(). > > Yes, I realize that. But I wrote something else: > > >> Rather than mark an existing entry as ready, how about a call to > >> inject a new ready event? > >> > >> This would be useful to implement functionality at userlevel and > >> still use an event queue to announce the availability. Without this > >> type of functionality we'd need to use indirect notification via > >> signal or pipe or something like that. > > This is still something which is wanted. Why do we want to inject a _ready_ event, when it is possible to mark an event as ready and wake up a thread parked in a syscall? > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-11-28 10:13 ` Evgeniy Polyakov @ 2006-12-27 20:45 ` Ulrich Drepper 2006-12-28 9:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-12-27 20:45 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro Evgeniy Polyakov wrote: > Why do we want to inject a _ready_ event, when it is possible to mark > an event as ready and wake up a thread parked in a syscall? Going back to this old one: How do you want to mark an event ready if you don't want to introduce yet another layer of data structures? The event notification happens through entries in the ring buffer. Userlevel code should never add anything to the ring buffer directly, this would mean huge synchronization problems. Yes, one could add additional data structures accompanying the ring buffer which can specify userlevel-generated events. But this is a) clumsy and b) a pain to use when the same ring buffer is used in multiple threads (you'd have to have another shared memory segment). It's much cleaner if the userlevel code can get the kernel to inject a userlevel-generated event. This is the equivalent of userlevel code generating a signal with kill(). -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take24 0/6] kevent: Generic event handling mechanism. 2006-12-27 20:45 ` Ulrich Drepper @ 2006-12-28 9:50 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-12-28 9:50 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik, Alexander Viro On Wed, Dec 27, 2006 at 12:45:50PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > > Why do we want to inject a _ready_ event, when it is possible to mark > > an event as ready and wake up a thread parked in a syscall? > > Going back to this old one: > > How do you want to mark an event ready if you don't want to introduce > yet another layer of data structures? The event notification happens > through entries in the ring buffer. Userlevel code should never add > anything to the ring buffer directly, this would mean huge > synchronization problems. Yes, one could add additional data structures > accompanying the ring buffer which can specify userlevel-generated > events. But this is a) clumsy and b) a pain to use when the same ring > buffer is used in multiple threads (you'd have to have another shared > memory segment). > > It's much cleaner if the userlevel code can get the kernel to inject a > userlevel-generated event. This is the equivalent of userlevel code > generating a signal with kill(). The existing possibility to mark an event as ready works the following way: an event is queued into a storage queue (socket, inode or some other queue); when the readiness condition becomes true, the event is queued into the ready queue (although it is still in the storage queue). This happens completely asynchronously to _any_ kind of userspace processing. When userspace calls the appropriate syscall, the event is copied into the ring buffer. Thus userspace readiness marking will just mark the event as ready, i.e. queue the event into the ready queue, so that later userspace will call a syscall to actually get the event. When one thread is parked in the syscall and there are _no_ events which should be marked as ready (for example only sockets are there, and it is not a good idea to wake up the whole socket processing state machine), then there is no possibility to receive such an event (although it is possible to interrupt and break the syscall). So, as for injecting ready events, it can be done - just the addition of a special flag which will force the kevent core to move an event into the ready queue immediately. In this case userspace can even prepare a needed event (like a signal event) and deliver it to a process, so it will think (from the kevent point of view only) that a real signal has arrived. I will also add a special type of events - userspace events - which will not have empty callbacks and which will be intended for use in a user-defined way (i.e. for inter-thread communication). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ > -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
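A sketch of how the flag-based injection described above might look from userspace (both the event type and the flag name are invented here - take24 has neither, and the later patchsets may spell them differently; the ukevent field names follow the posted ukevent.h layout as far as it is visible in this thread):

/* Add an event that the kernel queues as ready immediately, so a
 * waiter observes a purely userspace-generated notification - the
 * kevent analogue of raising a signal with kill(): */
struct ukevent uk;
memset(&uk, 0, sizeof(uk));
uk.type = KEVENT_UNOTIFY;                    /* assumed user-event type */
uk.req_flags = KEVENT_REQ_READY_IMMEDIATELY; /* assumed flag */
uk.user[0] = 0xdeadbeef;                     /* opaque token echoed back */
kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);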
* [take25 0/6] kevent: Generic event handling mechanism. [not found] <1154985aa0591036@2ka.mipt.ru> ` (3 preceding siblings ...) 2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov 5 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Generic event handling mechanism. Kevent is a generic subsystem for handling event notifications. It supports both level- and edge-triggered events. It is similar to poll/epoll in some cases, but it is more scalable, faster, and works with essentially any kind of event. Events are provided to the kernel through a control syscall and can be read back through a ring buffer or using the usual syscalls. A kevent update (i.e. readiness switching) happens directly from the internals of the appropriate state machine of the underlying subsystem (like network, filesystem, timer or any other). Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Documentation page: http://linux-net.osdl.org/index.php/Kevent I installed a slightly used, but still functional, remote mind reader (bought on ebay), and set it up to read Ulrich's alpha brain waves (I hope he agrees that it is a good decision), which took me the whole week. So I think the latest ring buffer implementation is what we all wanted. Details in the documentation part. Changes from 'take24' patchset: * new (old (new)) ring buffer implementation with kernel and user indexes * added initialization syscall instead of opening /dev/kevent * kevent_commit() syscall to commit ring buffer entries * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; kevent always wakes only the first thread if that flag is not set * KEVENT_REQ_ALWAYS_QUEUE flag. If set, a kevent which is ready immediately at addition time will be queued into the ready queue instead of being copied back to userspace. * lighttpd patch (Hail!
Although nothing really outstanding compared to epoll) Changes from 'take23' patchset: * kevent PIPE notifications * KEVENT_REQ_LAST_CHECK flag, which allows performing a last check at dequeueing time * fixed poll/select notifications (were broken due to tree manipulations) * made Documentation/kevent.txt look nice in an 80-col terminal * fix for copy_to_user() failure report for the first kevent (Andrew Morton) * minor function renames Changes from 'take22' patchset: * new ring buffer implementation in process' memory * wakeup-one-thread flag * edge-triggered behaviour Changes from 'take21' patchset: * minor cleanups (different return values, removed unneeded variables, whitespaces and so on) * fixed bug in kevent removal in the case when the kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet) Changes from 'take20' patchset: * new ring buffer implementation * removed artificial limit on possible number of kevents Changes from 'take19' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take18' patchset: * use __init instead of __devinit * removed 'default N' from config for user statistic * removed kevent_user_fini() since kevent can not be unloaded * use KERN_INFO for statistic output Changes from 'take17' patchset: * Use RB tree instead of hash table. At least for a web server, the frequency of addition/deletion of new kevents is comparable with the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed, which justifies using an RB tree over an AVL tree: the latter has much slower deletion (up to O(log(N)) operations compared to at most 3), although faster search (height 1.44*log(N) vs. 2*log(N)). So for kevents I use an RB tree for now; later, when my AVL tree implementation is ready, it will be possible to compare them. * Changed readiness check for socket notifications. With both above changes it is possible to achieve more than 3380 req/second compared to 2200, sometimes 2500 req/second for epoll() for a trivial web server and an httperf client on the same hardware. It is possible that the above kevent limit is due to the maximum number of kevents allowed at a time, which is 4096 events. Changes from 'take16' patchset: * misc cleanups (__read_mostly, const ...) * created special macro which is used for mmap size (number of pages) calculation * export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example) Changes from 'take15' patchset: * converted kevent_timer to high-resolution timers; this forces a timer API update at http://linux-net.osdl.org/index.php/Kevent * use struct ukevent* instead of void * in syscalls (documentation has been updated) * added warning in kevent_add_ukevent() if ring has broken index (for testing) Changes from 'take14' patchset: * added kevent_wait() This syscall waits until either timeout expires or at least one event becomes ready. It also commits that @num events from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept) Changes from 'take13' patchset: * do not take the lock around the user data check in __kevent_search() * fail early if there were no registered callbacks for given type of kevent * trailing whitespace cleanup Changes from 'take12' patchset: * remove non-chardev interface for initialization * use pointer to kevent_mring instead of unsigned longs * use aligned 64bit type in raw user data (can be used by high-res timer if needed) * simplified enqueue/dequeue callbacks and kevent initialization * use nanoseconds for timeout * put number of milliseconds into timer's return data * move some definitions into user-visible header * removed filenames from comments Changes from 'take11' patchset: * include missing headers into patchset * some trivial code cleanups (use goto instead of if/else games and so on) * some whitespace cleanups * check for ready_callback() callback before main loop, which should save us some ticks Changes from 'take10' patchset: * removed non-existent prototypes * added helper function for kevent_registered_callbacks * fixed 80-column comment issues * added a header shared between userspace and kernelspace instead of embedding the definitions in one * core restructuring to remove forward declarations * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p * use vm_insert_page() instead of remap_pfn_range() Changes from 'take9' patchset: * fixed ->nopage method Changes from 'take8' patchset: * fixed mmap release bug * use module_init() instead of late_initcall() * use better structures for timer notifications Changes from 'take7' patchset: * new mmap interface (not tested, waiting for other changes to be acked) - use nopage() method to dynamically substitute pages - allocate a new page for events only when a newly added kevent requires it - do not use ugly index dereferencing, use a structure instead - reduced amount of data in the ring (id and flags), maximum 12 pages on x86 per kevent fd Changes from 'take6' patchset: * a lot of comments!
* do not use list poisoning to detect whether an entry is in the list * return number of ready kevents even if copy*user() fails * strict check for number of kevents in syscall * use ARRAY_SIZE for array size calculation * changed superblock magic number * use SLAB_PANIC instead of direct panic() call * changed -E* return values * a lot of small cleanups and indent fixes Changes from 'take5' patchset: * removed compilation warnings about unused variables when lockdep is not turned on * do not use internal socket structures, use appropriate (exported) wrappers instead * removed default 1 second timeout * removed AIO stuff from patchset Changes from 'take4' patchset: * use miscdevice instead of chardevice * comment fixes Changes from 'take3' patchset: * removed serializing mutex from kevent_user_wait() * moved storage list processing to RCU * removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught to differentiate between the various cases * remove kevent from storage if it is marked as broken after callback * fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation Changes from 'take2' patchset: * split kevent_finish_user() into locked and unlocked variants * do not use KEVENT_STAT ifdefs, use inline functions instead * use an array of callbacks for each type instead of initializing callbacks in each kevent * changed name of ukevent guarding lock * use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks * do not use kevent_user_ctl structure, instead provide needed arguments as syscall parameters * various indent cleanups * added optimisation aimed to help when a lot of kevents are being copied from userspace * mmapped buffer (initial) implementation (no userspace yet) Changes from 'take1' patchset: - rebased against 2.6.18-git tree - removed ioctl controlling - added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr, unsigned int timeout, void __user *buf, unsigned flags) - use old syscall kevent_ctl for creation/removal, modification and initial kevent setup - use mutexes instead of semaphores - added file descriptor check and return error if provided descriptor does not match kevent file operations - various indent fixes - removed aio_sendfile() declarations. Thank you. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
* [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 " Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov ` (3 more replies) 0 siblings, 4 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..49e1cc2
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,230 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size);
+
+ring - pointer to allocated ring buffer
+ring_size - size of the ring buffer in events
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+  unsigned int ring_kidx, ring_uidx, ring_over;
+  struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+ when kevent_wait() or kevent_get_events() is called
+ring_uidx - index of the first entry userspace can start reading from
+ring_over - number of overflows of ring_uidx that happened since the start.
+ The overflow counter is used to prevent a situation where two threads
+ are going to free the same events, but one of them was scheduled
+ away for too long, so the ring indexes wrapped around, and when that
+ thread is finally awakened, it would free events other than those it
+ was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into the special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready; if it is being
+removed, no one cares about it (otherwise the user would have waited until it
+became ready and fetched it the usual way, using kevent_get_events() or
+kevent_wait()), and thus there is no need to copy it to the ring buffer.
+
+-------------------------------------------------------------------------------
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - the file descriptor referring to the kevent queue to manipulate,
+ as returned by kevent_init().
+
+cmd - the requested operation. It can be one of the following:
+ KEVENT_CTL_ADD - add event notification
+ KEVENT_CTL_REMOVE - remove event notification
+ KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
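+An abbreviated setup sketch (error handling omitted; the wrappers are thin
+syscall(2) wrappers around the new syscalls, using the i386 numbers from
+this patchset, since there is no glibc support yet):
+
+	#include <stdlib.h>
+	#include <string.h>
+	#include <unistd.h>
+	#include <sys/syscall.h>
+	#include <sys/socket.h>
+	#include <linux/ukevent.h>
+
+	#define RING_SIZE	4096
+
+	static int kevent_init(struct kevent_ring *ring, unsigned int ring_size)
+	{
+		return syscall(323, ring, ring_size);	/* __NR_kevent_init */
+	}
+
+	static int kevent_ctl(int fd, unsigned int cmd, unsigned int num,
+			      struct ukevent *arg)
+	{
+		return syscall(320, fd, cmd, num, arg);	/* __NR_kevent_ctl */
+	}
+
+	int main(void)
+	{
+		struct kevent_ring *ring;
+		struct ukevent uk;
+		int fd, sock;
+
+		/* Ring header followed by RING_SIZE event slots. */
+		ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
+
+		/* Returns the kevent control file descriptor. */
+		fd = kevent_init(ring, RING_SIZE);
+
+		/* Register interest in incoming data on a socket. */
+		sock = socket(AF_INET, SOCK_STREAM, 0);
+		memset(&uk, 0, sizeof(uk));
+		uk.id.raw[0] = sock;
+		uk.type = KEVENT_SOCKET;
+		uk.event = KEVENT_SOCKET_RECV;
+		kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);
+
+		return 0;
+	}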
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+   __u64 timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+ waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+ events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait up to 'timeout' nanoseconds for at least min_nr
+completed events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function also copies events into the ring buffer if it was initialized;
+if the ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the
+ret_flags field.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+timeout - number of nanoseconds to wait until there is free space
+ in the kevent queue
+
+Return value:
+ number of events copied into the ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready. It also copies events into the special ring buffer. If the
+ring buffer is full, it waits until there are ready events and then returns.
+If a kevent is a one-shot kevent, it is removed in this syscall.
+If a kevent is edge-triggered (KEVENT_REQ_ET flag is set in 'req_flags'), it
+is requeued in this syscall for performance reasons.
+-------------------------------------------------------------------------------
+
+ int kevent_commit(int ctl_fd, unsigned int start,
+   unsigned int num, unsigned int over);
+
+ctl_fd - file descriptor referring to the kevent queue
+start - index of the first entry in the ring buffer to commit from
+num - number of kevents to commit
+over - overflow count for the given start value
+
+Return value:
+ number of committed kevents or negative error value.
+
+This function commits, i.e. marks as empty, slots in the ring buffer, so
+they can be reused once userspace has completed processing those entries.
+
+The overflow counter is used to prevent a situation where two threads are
+going to free the same events, but one of them was scheduled away for too
+long, so the ring indexes wrapped around, and when that thread is finally
+awakened, it would free events other than those it was supposed to free.
+
+The returned number of committed events may be smaller than the requested
+number - this happens when several threads try to commit the same events.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+ Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+ Event type, e.g.
KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+ Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+ Per-event request flags,
+
+ KEVENT_REQ_ONESHOT
+  event will be removed when it is ready
+
+ KEVENT_REQ_WAKEUP_ALL
+  By default a kevent wakes up only the first thread interested in the
+  given event; if this flag is set, all interested threads are woken up.
+
+ KEVENT_REQ_ET
+  Edge-triggered behaviour. It is an optimisation which moves a ready
+  and dequeued (i.e. copied to userspace) event back into the set of
+  interest for the given storage (socket, inode and so on). It is very
+  useful for cases when the same event should be used many times (like
+  reading from a pipe). It is similar to epoll()'s EPOLLET flag.
+
+ KEVENT_REQ_LAST_CHECK
+  If set, allows performing a last check on the kevent (calling the
+  appropriate callback) when the kevent is marked as ready and has been
+  removed from the ready queue. If it is confirmed that the kevent is
+  ready (k->callbacks.callback(k) returns true), the kevent is copied
+  to userspace; otherwise it is requeued back to its storage. The
+  second (checking) call is performed with this bit cleared, so the
+  callback can detect whether it was called from kevent_storage_ready()
+  - bit is set - or kevent_dequeue_ready() - bit is cleared. If the
+  kevent is requeued, the bit is set again.
+
+ KEVENT_REQ_ALWAYS_QUEUE
+  If this flag is set, a kevent which is ready at enqueue time is
+  queued into the ready queue; without the flag such a kevent is copied
+  back to userspace immediately and is not queued into the storage.
+
+__u32 ret_flags
+ Per-event return flags
+
+ KEVENT_RET_BROKEN
+  Kevent is broken
+
+ KEVENT_RET_DONE
+  Kevent processing was finished successfully
+
+ KEVENT_RET_COPY_FAILED
+  Kevent was not copied into the ring buffer due to some error condition.
+
+__u32 ret_data
+ Event return data. The event originator fills it with anything it likes
+ (for example, timer notifications put a number of milliseconds there
+ when the timer fires).
+union { __u32 user[2]; void *ptr; }
+ User's data. It is not used by the kernel, just copied to/from user. The
+ whole structure is aligned to 8 bytes already, so the last union is
+ aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the
+passed in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
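+Continuing the setup sketch above, the intended consumption loop looks
+roughly as follows (a sketch only: the interpretation of kevent_wait()'s
+'num' argument and the overflow bookkeeping are simplified, and the
+wrappers are again raw syscall(2) calls, numbers 321/322 on i386):
+
+	static int kevent_wait(int fd, unsigned int num, __u64 timeout)
+	{
+		return syscall(321, fd, num, timeout);		/* __NR_kevent_wait */
+	}
+
+	static int kevent_commit(int fd, unsigned int start, unsigned int num,
+				 unsigned int over)
+	{
+		return syscall(322, fd, start, num, over);	/* __NR_kevent_commit */
+	}
+
+	static void event_loop(int fd, struct kevent_ring *ring)
+	{
+		unsigned int uidx = 0, over = 0;
+		int i, num;
+
+		for (;;) {
+			/* Returns the number of events copied into the ring. */
+			num = kevent_wait(fd, RING_SIZE, (__u64)-1);
+			if (num <= 0)
+				continue;
+
+			for (i = 0; i < num; ++i) {
+				struct ukevent *uk = &ring->event[(uidx + i) % RING_SIZE];
+
+				if (uk->ret_flags & KEVENT_RET_BROKEN) {
+					/* Broken request - drop or log it. */
+					continue;
+				}
+				/* Process uk->id, uk->event, uk->ret_data here. */
+			}
+
+			/* Mark the consumed slots as free for the kernel to reuse. */
+			kevent_commit(fd, uidx, num, over);
+			if (uidx + num >= RING_SIZE)
+				over++;
+			uidx = (uidx + num) % RING_SIZE;
+		}
+	}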
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+ type - KEVENT_TIMER
+ event - KEVENT_TIMER_FIRED
+ req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+ id.raw[0] - number of seconds after commit when this timer should expire
+ id.raw[1] - number of nanoseconds in addition to the seconds above
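+For instance, arming a one-shot 2.5 second timer with the fields above
+(kevent_ctl() as in the earlier sketches):
+
+	static int add_timer_event(int fd)
+	{
+		struct ukevent uk;
+
+		memset(&uk, 0, sizeof(uk));
+		uk.type = KEVENT_TIMER;
+		uk.event = KEVENT_TIMER_FIRED;
+		uk.req_flags = KEVENT_REQ_ONESHOT;	/* fire once, then auto-remove */
+		uk.id.raw[0] = 2;			/* seconds */
+		uk.id.raw[1] = 500000000;		/* plus nanoseconds */
+
+		return kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);
+	}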
* [take25 2/6] kevent: Core files. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper ` (2 subsequent siblings) 3 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Core files. This patch includes core kevent files: * userspace controlling * kernelspace interfaces * initialization * notification state machines Some bits of documentation can be found on project's homepage (and links from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index 7e639f7..a6221c2 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -318,3 +318,8 @@ ENTRY(sys_call_table) .long sys_vmsplice .long sys_move_pages .long sys_getcpu + .long sys_kevent_get_events + .long sys_kevent_ctl /* 320 */ + .long sys_kevent_wait + .long sys_kevent_commit + .long sys_kevent_init diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index b4aa875..dda2168 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -714,8 +714,13 @@ ia32_sys_call_table: .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages .quad sys_getcpu + .quad sys_kevent_get_events + .quad sys_kevent_ctl /* 320 */ + .quad sys_kevent_wait + .quad sys_kevent_commit + .quad sys_kevent_init ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index bd99870..57a6b8c 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -324,10 +324,15 @@ #define __NR_vmsplice 316 #define __NR_move_pages 317 #define __NR_getcpu 318 +#define __NR_kevent_get_events 319 +#define __NR_kevent_ctl 320 +#define __NR_kevent_wait 321 +#define __NR_kevent_commit 322 +#define __NR_kevent_init 323 #ifdef __KERNEL__ -#define NR_syscalls 319 +#define NR_syscalls 324 #include <linux/err.h> /* diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 6137146..17d750d 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,20 @@ __SYSCALL(__NR_sync_file_range, sys_sync __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages 279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) +#define __NR_kevent_ctl 281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) +#define __NR_kevent_commit 283 +__SYSCALL(__NR_kevent_commit, sys_kevent_commit) +#define __NR_kevent_init 284 +__SYSCALL(__NR_kevent_init, sys_kevent_init) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_init #include <linux/err.h> #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..c909c62 --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,230 @@ +/* + * 2006 Copyright (c) Evgeniy 
Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include <linux/types.h> +#include <linux/list.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/mutex.h> +#include <linux/wait.h> +#include <linux/net.h> +#include <linux/rcupdate.h> +#include <linux/fs.h> +#include <linux/sched.h> +#include <linux/kevent_storage.h> +#include <linux/ukevent.h> + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. */ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER 0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_head storage_entry; + /* Entry of user's ready. */ + struct list_head ready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbacks; + + /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ + void *priv; +}; + +struct kevent_user +{ + struct rb_root kevent_root; + spinlock_t kevent_lock; + /* Number of queued kevents. */ + unsigned int kevent_num; + + /* List of ready kevents. */ + struct list_head ready_list; + /* Number of ready kevents. */ + unsigned int ready_num; + /* Protects all manipulations with ready queue. */ + spinlock_t ready_lock; + + /* Protects against simultaneous kevent_user control manipulations. */ + struct mutex ctl_mutex; + /* Wait until some events are ready. */ + wait_queue_head_t wait; + + /* Reference counter, increased for each new kevent. */ + atomic_t refcnt; + + /* Mutex protecting userspace ring buffer. */ + struct mutex ring_lock; + /* Kernel index and size of the userspace ring buffer. */ + unsigned int kidx, uidx, ring_size, ring_over, full; + /* Pointer to userspace ring buffer. 
*/ + struct kevent_ring __user *pring; + +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num, ring_num; + unsigned long total; +#endif +}; + +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); +int kevent_break(struct kevent *k); + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_POLL +void kevent_poll_reinit(struct file *file); +#else +static inline void kevent_poll_reinit(struct file *file) +{ +} +#endif + +#ifdef CONFIG_KEVENT_USER_STAT +static inline void kevent_stat_init(struct kevent_user *u) +{ + u->wait_num = u->im_num = u->total = 0; +} +static inline void kevent_stat_print(struct kevent_user *u) +{ + printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n", + __func__, u, u->wait_num, u->ring_num, u->im_num, u->total); +} +static inline void kevent_stat_im(struct kevent_user *u) +{ + u->im_num++; +} +static inline void kevent_stat_ring(struct kevent_user *u) +{ + u->ring_num++; +} +static inline void kevent_stat_wait(struct kevent_user *u) +{ + u->wait_num++; +} +static inline void kevent_stat_total(struct kevent_user *u) +{ + u->total++; +} +#else +#define kevent_stat_print(u) ({ (void) u;}) +#define kevent_stat_init(u) ({ (void) u;}) +#define kevent_stat_im(u) ({ (void) u;}) +#define kevent_stat_wait(u) ({ (void) u;}) +#define kevent_stat_ring(u) ({ (void) u;}) +#define kevent_stat_total(u) ({ (void) u;}) +#endif + +#ifdef CONFIG_LOCKDEP +void kevent_socket_reinit(struct socket *sock); +void kevent_sk_reinit(struct sock *sk); +#else +static inline void kevent_socket_reinit(struct socket *sock) +{ +} +static inline void kevent_sk_reinit(struct sock *sk) +{ +} +#endif +#ifdef CONFIG_KEVENT_SOCKET +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) ({ (void)__sk; 0; }) +#endif + +#ifdef CONFIG_KEVENT_POLL +static inline void kevent_init_file(struct file *file) +{ + kevent_storage_init(file, &file->st); +} + +static inline void kevent_cleanup_file(struct file *file) +{ + kevent_storage_fini(&file->st); +} +#else +static inline void kevent_init_file(struct file *file) {} +static inline void kevent_cleanup_file(struct file *file) {} +#endif + +#ifdef CONFIG_KEVENT_PIPE +extern void kevent_pipe_notify(struct inode *inode, u32 events); +#else +static inline void kevent_pipe_notify(struct inode *inode, u32 events) {} +#endif + +#ifdef CONFIG_KEVENT_SIGNAL +extern int kevent_signal_notify(struct task_struct *tsk, int sig); +#else +static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;} +#endif + +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..a38575d --- /dev/null +++ 
b/include/linux/kevent_storage.h @@ -0,0 +1,11 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + spinlock_t lock; /* Protects users queue. */ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 2d1c3d5..1317a18 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -54,6 +54,8 @@ struct compat_stat; struct compat_timeval; struct robust_list_head; struct getcpu_cache; +struct ukevent; +struct kevent_ring; #include <linux/types.h> #include <linux/aio_abi.h> @@ -599,4 +601,10 @@ asmlinkage long sys_set_robust_list(stru size_t len); asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache); +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max, + __u64 timeout, struct ukevent __user *buf, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf); +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); +asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over); +asmlinkage long sys_kevent_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num); #endif diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h new file mode 100644 index 0000000..0680fdf --- /dev/null +++ b/include/linux/ukevent.h @@ -0,0 +1,178 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __UKEVENT_H +#define __UKEVENT_H + +#include <linux/types.h> + +/* + * Kevent request flags. + */ + +/* Process this event only once and then remove it. */ +#define KEVENT_REQ_ONESHOT 0x1 +/* Kevent wakes up only first thread interested in given event, + * or all threads if this flag is set. + */ +#define KEVENT_REQ_WAKEUP_ALL 0x2 +/* Edge Triggered behaviour. */ +#define KEVENT_REQ_ET 0x4 +/* Perform the last check on kevent (call appropriate callback) when + * kevent is marked as ready and has been removed from ready queue. + * If it will be confirmed that kevent is ready + * (k->callbacks.callback(k) returns true) then kevent will be copied + * to userspace, otherwise it will be requeued back to storage. + * Second (checking) call is performed with this bit _cleared_ so + * callback can detect when it was called from + * kevent_storage_ready() - bit is set, or + * kevent_dequeue_ready() - bit is cleared. + * If kevent will be requeued, bit will be set again. */ +#define KEVENT_REQ_LAST_CHECK 0x8 +/* + * Always queue kevent even if it is immediately ready. 
+ */
+#define KEVENT_REQ_ALWAYS_QUEUE	0x10
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED	0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_PIPE		6
+#define KEVENT_SIGNAL		7
+#define KEVENT_MAX		8
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must exactly match the number of
+ * kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO and PIPE events.
+ */
+#define KEVENT_SOCKET_RECV	0x1
+#define KEVENT_SOCKET_ACCEPT	0x2
+#define KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE	0x1
+#define KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN	0x0001
+#define KEVENT_POLL_POLLPRI	0x0002
+#define KEVENT_POLL_POLLOUT	0x0004
+#define KEVENT_POLL_POLLERR	0x0008
+#define KEVENT_POLL_POLLHUP	0x0010
+#define KEVENT_POLL_POLLNVAL	0x0020
+
+#define KEVENT_POLL_POLLRDNORM	0x0040
+#define KEVENT_POLL_POLLRDBAND	0x0080
+#define KEVENT_POLL_POLLWRNORM	0x0100
+#define KEVENT_POLL_POLLWRBAND	0x0200
+#define KEVENT_POLL_POLLMSG	0x0400
+#define KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO		0x1
+
+/*
+ * Signal events.
+ */
+#define KEVENT_SIGNAL_DELIVERY	0x1
+
+/* If set in id.raw_u64, then the given signals will not be delivered
+ * in the usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK	0x8000000000000000ULL
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used by the kernel, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct kevent_ring
+{
+	unsigned int		ring_kidx, ring_uidx, ring_over;
+	struct ukevent		event[0];
+};
+
+#define KEVENT_CTL_ADD		0
+#define KEVENT_CTL_REMOVE	1
+#define KEVENT_CTL_MODIFY	2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..4b137ee
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,60 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism. It can be used
+	  as a replacement for poll()/select(), for AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option turns kevent_user statistics collection on.
+	  Statistics include the total number of kevents, the number of
+	  kevents which were ready immediately at insertion time, and the
+	  number of kevents which were removed through readiness completion.
+	  They are printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications of socket operations through
+	  the KEVENT subsystem, like new packet reception conditions,
+	  ready-for-accept conditions and so on.
+
+config KEVENT_PIPE
+	bool "Kernel event notifications for pipes"
+	depends on KEVENT
+	help
+	  This option enables notifications of pipe read/write operations
+	  through the KEVENT subsystem.
+
+config KEVENT_SIGNAL
+	bool "Kernel event notifications for signals"
+	depends on KEVENT
+	help
+	  This option enables signal delivery through the KEVENT subsystem.
+	  Signals which are requested to be delivered through the kevent
+	  subsystem must still be registered through the usual signal() and
+	  related syscalls; this option allows alternative delivery.
+	  With the KEVENT_SIGNAL_NOMASK flag set in a kevent for a set of
+	  signals, they will not be delivered in the usual way.
+	  Kevents for the appropriate signals are not copied when a process
+	  forks; the new process must add new kevents after fork(). The
+	  mask of signals is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..f98e0c8
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..8cf756c
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,232 @@
+/*
+ * 	2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + return k->callbacks.enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + return k->callbacks.dequeue(k); +} + +/* + * Mark kevent as broken. + */ +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->ulock, flags); + return -EINVAL; +} + +static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly; + +int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos) +{ + struct kevent_callbacks *p; + + if (pos >= KEVENT_MAX) + return -EINVAL; + + p = &kevent_registered_callbacks[pos]; + + p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break; + p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break; + p->callback = (cb->callback) ? cb->callback : kevent_break; + + printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos); + return 0; +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + spin_lock_init(&k->ulock); + k->flags = 0; + + if (unlikely(k->event.type >= KEVENT_MAX || + !kevent_registered_callbacks[k->event.type].callback)) + return kevent_break(k); + + k->callbacks = kevent_registered_callbacks[k->event.type]; + if (unlikely(k->callbacks.callback == kevent_break)) + return kevent_break(k); + + return 0; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail_rcu(&k->storage_entry, &st->list); + k->flags |= KEVENT_STORAGE; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. 
+ * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->flags & KEVENT_STORAGE) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +/* + * Call kevent ready callback and queue it into ready queue if needed. + * If kevent is marked as one-shot, then remove it from storage queue. + */ +static int __kevent_requeue(struct kevent *k, u32 event) +{ + int ret, rem; + unsigned long flags; + + ret = k->callbacks.callback(k); + + spin_lock_irqsave(&k->ulock, flags); + if (ret > 0) + k->event.ret_flags |= KEVENT_RET_DONE; + else if (ret < 0) + k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE); + else + ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + spin_unlock_irqrestore(&k->ulock, flags); + + if (ret) { + if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) { + list_del_rcu(&k->storage_entry); + k->flags &= ~KEVENT_STORAGE; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (!(k->flags & KEVENT_READY)) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->flags |= KEVENT_READY; + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } + + return ret; +} + +/* + * Check if kevent is ready (by invoking it's callback) and requeue/remove + * if needed. + */ +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. + */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k; + int wake_num = 0; + + rcu_read_lock(); + if (unlikely(ready_callback)) + list_for_each_entry_rcu(k, &st->list, storage_entry) + (*ready_callback)(k); + + list_for_each_entry_rcu(k, &st->list, storage_entry) { + if (event & k->event.event) + if ((k->event.req_flags & KEVENT_REQ_WAKEUP_ALL) || wake_num == 0) + if (__kevent_requeue(k, event)) + wake_num++; + } + rcu_read_unlock(); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + INIT_LIST_HEAD(&st->list); + return 0; +} + +/* + * Mark all events as broken, that will remove them from storage, + * so storage origin (inode, sockt and so on) can be safely removed. + * No new entries are allowed to be added into the storage at this point. + * (Socket is removed from file table at this point for example). + */ +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..2cd8c99 --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,1181 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/miscdevice.h> +#include <asm/io.h> + +static kmem_cache_t *kevent_cache __read_mostly; +static kmem_cache_t *kevent_user_cache __read_mostly; + +/* + * kevents are pollable, return POLLIN and POLLRDNORM + * when there is at least one ready kevent. + */ +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static inline unsigned int kevent_ring_space(struct kevent_user *u) +{ + if (u->full) + return 0; + + return (u->uidx > u->kidx)? + (u->uidx - u->kidx): + (u->ring_size - (u->kidx - u->uidx)); +} + +static inline int kevent_ring_index_inc(unsigned int *pidx, unsigned int size) +{ + unsigned int idx = *pidx; + + if (++idx >= size) + idx = 0; + *pidx = idx; + return (idx == 0); +} + +/* + * Copies kevent into userspace ring buffer if it was initialized. + * Returns + * 0 on success or if ring buffer is not used + * -EAGAIN if there were no place for that kevent + * -EFAULT if copy_to_user() failed. + * + * Must be called under kevent_user->ring_lock locked. + */ +static int kevent_copy_ring_buffer(struct kevent *k) +{ + struct kevent_ring __user *ring; + struct kevent_user *u = k->user; + unsigned long flags; + int err; + + ring = u->pring; + if (!ring) + return 0; + + if (!kevent_ring_space(u)) + return -EAGAIN; + + if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) { + err = -EFAULT; + goto err_out_exit; + } + + kevent_ring_index_inc(&u->kidx, u->ring_size); + + if (u->kidx == u->uidx) + u->full = 1; + + if (put_user(u->kidx, &ring->ring_kidx)) { + err = -EFAULT; + goto err_out_exit; + } + + return 0; + +err_out_exit: + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + return err; +} + +static struct kevent_user *kevent_user_alloc(struct kevent_ring __user *ring, unsigned int num) +{ + struct kevent_user *u; + + u = kmem_cache_alloc(kevent_user_cache, GFP_KERNEL); + if (!u) + return NULL; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + kevent_stat_init(u); + spin_lock_init(&u->kevent_lock); + u->kevent_root = RB_ROOT; + + mutex_init(&u->ctl_mutex); + init_waitqueue_head(&u->wait); + + atomic_set(&u->refcnt, 1); + + mutex_init(&u->ring_lock); + u->kidx = u->uidx = u->ring_over = u->full = 0; + + u->pring = ring; + u->ring_size = num; + + return u; +} + +/* + * Kevent userspace control block reference counting. + * Set to 1 at creation time, when appropriate kevent file descriptor + * is closed, that reference counter is decreased. 
+ * When counter hits zero block is freed. + */ +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { + kevent_stat_print(u); + kmem_cache_free(kevent_user_cache, u); + } +} + +static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right) +{ + if (left->raw_u64 > right->raw_u64) + return -1; + + if (right->raw_u64 > left->raw_u64) + return 1; + + return 0; +} + +/* + * RCU protects storage list (kevent->storage_entry). + * Free entry in RCU callback, it is dequeued from all lists at + * this point. + */ + +static void kevent_free_rcu(struct rcu_head *rcu) +{ + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head); + kmem_cache_free(kevent_cache, kevent); +} + +/* + * Must be called under u->ready_lock. + * This function unlinks kevent from ready queue. + */ +static inline void kevent_unlink_ready(struct kevent *k) +{ + list_del(&k->ready_entry); + k->flags &= ~KEVENT_READY; + k->user->ready_num--; +} + +static void kevent_remove_ready(struct kevent *k) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->flags & KEVENT_READY) + kevent_unlink_ready(k); + spin_unlock_irqrestore(&u->ready_lock, flags); +} + +/* + * Complete kevent removing - it dequeues kevent from storage list + * if it is requested, removes kevent from ready list, drops userspace + * control block reference counter and schedules kevent freeing through RCU. + */ +static void kevent_finish_user_complete(struct kevent *k, int deq) +{ + if (deq) + kevent_dequeue(k); + + kevent_remove_ready(k); + + kevent_user_put(k->user); + call_rcu(&k->rcu_head, kevent_free_rcu); +} + +/* + * Remove from all lists and free kevent. + * Must be called under kevent_user->kevent_lock to protect + * kevent->kevent_entry removing. + */ +static void __kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + kevent_finish_user_complete(k, deq); +} + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. 
+ */ +static void kevent_finish_user(struct kevent *k, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + rb_erase(&k->kevent_node, &u->kevent_root); + k->flags &= ~KEVENT_USER; + u->kevent_num--; + spin_unlock_irqrestore(&u->kevent_lock, flags); + kevent_finish_user_complete(k, deq); +} + +static struct kevent *__kevent_dequeue_ready_one(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k = NULL; + + if (u->ready_num) { + spin_lock_irqsave(&u->ready_lock, flags); + if (u->ready_num && !list_empty(&u->ready_list)) { + k = list_entry(u->ready_list.next, struct kevent, ready_entry); + kevent_unlink_ready(k); + } + spin_unlock_irqrestore(&u->ready_lock, flags); + } + + return k; +} + +static struct kevent *kevent_dequeue_ready_one(struct kevent_user *u) +{ + struct kevent *k = NULL; + + while (u->ready_num && !k) { + k = __kevent_dequeue_ready_one(u); + + if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) { + unsigned long flags; + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + if (!k->callbacks.callback(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + k = NULL; + } + } else + break; + } + + return k; +} + +static inline void kevent_copy_ring(struct kevent *k) +{ + unsigned long flags; + + if (!k) + return; + + if (kevent_copy_ring_buffer(k)) { + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags |= KEVENT_RET_COPY_FAILED; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Dequeue one entry from user's ready queue. + */ +static struct kevent *kevent_dequeue_ready(struct kevent_user *u) +{ + struct kevent *k; + + mutex_lock(&u->ring_lock); + k = kevent_dequeue_ready_one(u); + kevent_copy_ring(k); + mutex_unlock(&u->ring_lock); + + return k; +} + +/* + * Dequeue one entry from user's ready queue if there is space in ring buffer. + */ +static struct kevent *kevent_dequeue_ready_ring(struct kevent_user *u) +{ + struct kevent *k = NULL; + + mutex_lock(&u->ring_lock); + if (kevent_ring_space(u)) { + k = kevent_dequeue_ready_one(u); + kevent_copy_ring(k); + } + mutex_unlock(&u->ring_lock); + + return k; +} + +static void kevent_complete_ready(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. + */ + kevent_finish_user(k, 1); + else if (k->event.req_flags & KEVENT_REQ_ET) { + unsigned long flags; + + /* + * Edge-triggered behaviour: mark event as clear new one. + */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. 
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err = -ENODEV;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		spin_lock(&k->ulock);
+		k->event.event = uk->event;
+		k->event.req_flags = uk->req_flags;
+		k->event.ret_flags = 0;
+		spin_unlock(&k->ulock);
+		kevent_requeue(k);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+	int err = -ENODEV;
+	struct kevent *k;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	k = __kevent_search(&uk->id, u);
+	if (k) {
+		__kevent_finish_user(k, 1);
+		err = 0;
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Detaches userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+	struct kevent_user *u = file->private_data;
+	struct kevent *k;
+	struct rb_node *n;
+
+	for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+		k = rb_entry(n, struct kevent, kevent_node);
+		kevent_finish_user(k, 1);
+	}
+
+	kevent_user_put(u);
+	file->private_data = NULL;
+
+	return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+	struct ukevent *ukev;
+
+	ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+	if (!ukev)
+		return NULL;
+
+	if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+		kfree(ukev);
+		return NULL;
+	}
+
+	return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_modify(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_modify(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err = 0, i;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	if (num > u->kevent_num) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				if (kevent_remove(&ukev[i], u))
+					ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+				ukev[i].ret_flags |= KEVENT_RET_DONE;
+			}
+			if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+				err = -EFAULT;
+			kfree(ukev);
+			goto out;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		if (kevent_remove(&uk, u))
+			uk.ret_flags |= KEVENT_RET_BROKEN;
+
+		uk.ret_flags |= KEVENT_RET_DONE;
+
+		if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+			err = -EFAULT;
+			break;
+		}
+
+		arg += sizeof(struct ukevent);
+	}
+out:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+	unsigned long flags;
+	struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+	struct kevent *k;
+	int err = 0, cmp;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	while (*p) {
+		parent = *p;
+		k = rb_entry(parent, struct kevent, kevent_node);
+
+		cmp = kevent_compare_id(&k->event.id, &new->event.id);
+		if (cmp > 0)
+			p = &parent->rb_right;
+		else if (cmp < 0)
+			p = &parent->rb_left;
+		else {
+			err = -EEXIST;
+			break;
+		}
+	}
+	if (likely(!err)) {
+		rb_link_node(&new->kevent_node, parent, p);
+		rb_insert_color(&new->kevent_node, &u->kevent_root);
+		new->flags |= KEVENT_USER;
+		u->kevent_num++;
+		kevent_user_get(u);
+	}
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+	return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+	struct kevent *k;
+	int err;
+
+	k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+	if (!k) {
+		err = -ENOMEM;
+		goto err_out_exit;
+	}
+
+	memcpy(&k->event, uk, sizeof(struct ukevent));
+	INIT_RCU_HEAD(&k->rcu_head);
+
+	k->event.ret_flags = 0;
+
+	err = kevent_init(k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+	k->user = u;
+	kevent_stat_total(u);
+	err = kevent_user_enqueue(u, k);
+	if (err) {
+		kmem_cache_free(kevent_cache, k);
+		goto err_out_exit;
+	}
+
+	err = kevent_enqueue(k);
+	if (err) {
+		memcpy(uk, &k->event, sizeof(struct ukevent));
+		kevent_finish_user(k, 0);
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	if (err < 0) {
+		uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+		uk->ret_data[1] = err;
+	} else if (err > 0)
+		uk->ret_flags |= KEVENT_RET_DONE;
+	return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+	int err, cerr = 0, rnum = 0, i;
+	void __user *orig = arg;
+	struct ukevent uk;
+
+	mutex_lock(&u->ctl_mutex);
+
+	err = -EINVAL;
+	if (num > KEVENT_MIN_BUFFS_ALLOC) {
+		struct ukevent *ukev;
+
+		ukev = kevent_get_user(num, arg);
+		if (ukev) {
+			for (i = 0; i < num; ++i) {
+				err = kevent_user_add_ukevent(&ukev[i], u);
+				if (err) {
+					kevent_stat_im(u);
+					if (i != rnum)
+						memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+					rnum++;
+				}
+			}
+			if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+				cerr = -EFAULT;
+			kfree(ukev);
+			goto out_setup;
+		}
+	}
+
+	for (i = 0; i < num; ++i) {
+		if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+			cerr = -EFAULT;
+			break;
+		}
+		arg += sizeof(struct ukevent);
+
+		err = kevent_user_add_ukevent(&uk, u);
+		if (err) {
+			kevent_stat_im(u);
+			if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+				cerr = -EFAULT;
+				break;
+			}
+			orig += sizeof(struct ukevent);
+			rnum++;
+		}
+	}
+
+out_setup:
+	if (cerr < 0) {
+		err = cerr;
+		goto out_remove;
+	}
+
+	err = rnum;
+out_remove:
+	mutex_unlock(&u->ctl_mutex);
+
+	return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+		unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+		void __user *buf)
+{
+	struct kevent *k;
+	int num = 0;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			u->ready_num >= min_nr,
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+		if (copy_to_user(buf + num*sizeof(struct ukevent),
+				&k->event, sizeof(struct ukevent))) {
+			if (num == 0)
+				num = -EFAULT;
+			break;
+		}
+		kevent_complete_ready(k);
+		++num;
+		kevent_stat_wait(u);
+	}
+
+	return num;
+}
+
+static struct file_operations kevent_user_fops = {
+	.release	= kevent_user_release,
+	.poll		= kevent_user_poll,
+	.owner		= THIS_MODULE,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+	int err;
+	struct kevent_user *u = file->private_data;
+
+	switch (cmd) {
+	case KEVENT_CTL_ADD:
+		err = kevent_user_ctl_add(u, num, arg);
+		break;
+	case KEVENT_CTL_REMOVE:
+		err = kevent_user_ctl_remove(u, num, arg);
+		break;
+	case KEVENT_CTL_MODIFY:
+		err = kevent_user_ctl_modify(u, num, arg);
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_init().
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+		__u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+	int err = -EINVAL;
+	struct file *file;
+	struct kevent_user *u;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+	fput(file);
+	return err;
+}
+
+static struct vfsmount *kevent_mnt __read_mostly;
+
+static int kevent_get_sb(struct file_system_type *fs_type, int flags,
+		const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	return get_sb_pseudo(fs_type, "kevent", NULL, 0xaabbccdd, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+	.name		= "keventfs",
+	.get_sb		= kevent_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static int keventfs_delete_dentry(struct dentry *dentry)
+{
+	return 1;
+}
+
+static struct dentry_operations keventfs_dentry_operations = {
+	.d_delete	= keventfs_delete_dentry,
+};
+
+asmlinkage long sys_kevent_init(struct kevent_ring __user *ring, unsigned int num)
+{
+	struct qstr this;
+	char name[32];
+	struct dentry *dentry;
+	struct inode *inode;
+	struct file *file;
+	int err = -ENFILE, fd;
+	struct kevent_user *u;
+
+	if ((ring && !num) || (!ring && num) || (num == 1))
+		return -EINVAL;
+
+	file = get_empty_filp();
+	if (!file)
+		goto err_out_exit;
+
+	inode = new_inode(kevent_mnt->mnt_sb);
+	if (!inode)
+		goto err_out_fput;
+
+	inode->i_fop = &kevent_user_fops;
+
+	inode->i_state = I_DIRTY;
+	inode->i_mode = S_IRUSR | S_IWUSR;
+	inode->i_uid = current->fsuid;
+	inode->i_gid = current->fsgid;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+	err = get_unused_fd();
+	if (err < 0)
+		goto err_out_iput;
+	fd = err;
+
+	err = -ENOMEM;
+	u = kevent_user_alloc(ring, num);
+	if (!u)
+		goto err_out_put_fd;
+
+	sprintf(name, "[%lu]", inode->i_ino);
+	this.name = name;
+	this.len = strlen(name);
+	this.hash = inode->i_ino;
+	dentry = d_alloc(kevent_mnt->mnt_sb->s_root, &this);
+	if (!dentry)
+		goto err_out_free;
+	dentry->d_op = &keventfs_dentry_operations;
+	d_add(dentry, inode);
+	file->f_vfsmnt = mntget(kevent_mnt);
+	file->f_dentry = dentry;
+	file->f_mapping = inode->i_mapping;
+	file->f_pos = 0;
+	file->f_flags = O_RDONLY;
+	file->f_op = &kevent_user_fops;
+	file->f_mode = FMODE_READ;
+	file->f_version = 0;
+	file->private_data = u;
+
+	fd_install(fd, file);
+
+	return fd;
+
+err_out_free:
+	kmem_cache_free(kevent_user_cache, u);
+err_out_put_fd:
+	put_unused_fd(fd);
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	put_filp(file);
+err_out_exit:
+	return err;
+}
+
+/*
+ * This syscall waits until there is free space in the ring
+ * buffer and then copies ready events into it.
+ * It returns the number of ready events actually copied into the ring buffer.
+ * After this function completes, userspace ring->ring_kidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @num - number of kevents to process.
+ * @timeout - number of nanoseconds to wait until there is
+ * 	free space in the kevent queue.
+ *
+ * When @num events are to be processed, the first @num kevents are simply
+ * removed from the ready queue and copied into the buffer.
+ * Kevents are copied into the ring buffer in the order they were placed into the ready queue.
+ * One-shot kevents are removed here, since there is no way they can be reused.
+ * Edge-triggered events will be requeued here for better performance.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+{
+	int err = -EINVAL, copied = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent *k;
+	struct kevent_ring __user *ring;
+	unsigned int i;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+
+	ring = u->pring;
+	if (!ring || num > u->ring_size)
+		goto out_fput;
+
+	if (!(file->f_flags & O_NONBLOCK)) {
+		wait_event_interruptible_timeout(u->wait,
+			((u->ready_num >= 1) && (kevent_ring_space(u))),
+			clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+	}
+
+	for (i=0; i<num; ++i) {
+		k = kevent_dequeue_ready_ring(u);
+		if (!k)
+			break;
+		kevent_complete_ready(k);
+
+		if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+			break;
+		kevent_stat_ring(u);
+		copied++;
+	}
+
+	fput(file);
+
+	return copied;
+out_fput:
+	fput(file);
+	return err;
+}
+
+/*
+ * This syscall is used to commit events in the ring buffer, i.e. mark appropriate
+ * entries as unused by userspace so subsequent kevent_wait() could overwrite them.
+ * This function returns the actual number of kevents which were committed.
+ * After this function is completed userspace ring->ring_uidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent to be committed.
+ * @num - number of kevents to commit.
+ * @over - number of overflows the given queue had.
+ *
+ * If several threads commit the same events, and one of them has committed
+ * while another was scheduled away for so long that the ring indexes have
+ * wrapped, it is possible that an incorrect commit would be performed;
+ * the @over counter lets the kernel detect such stale commits and
+ * return -EOVERFLOW instead.
+ */
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over)
+{
+	int err = -EINVAL, comm = 0, i, over_changed = 0;
+	struct file *file;
+	struct kevent_user *u;
+	struct kevent_ring __user *ring;
+
+	file = fget(ctl_fd);
+	if (!file)
+		return -EBADF;
+
+	if (file->f_op != &kevent_user_fops)
+		goto out_fput;
+	u = file->private_data;
+	ring = u->pring;
+
+	if (!ring || num > u->ring_size)
+		goto out_fput;
+
+	err = -EOVERFLOW;
+	mutex_lock(&u->ring_lock);
+	if (over != u->ring_over+1 && over != u->ring_over)
+		goto err_out_unlock;
+
+	if (start > u->uidx) {
+		if (over != u->ring_over+1) {
+			if (over == u->ring_over)
+				err = -EINVAL;
+			goto err_out_unlock;
+		} else {
+			/*
+			 * To be or not to be, that is a question:
+			 * Whether it is nobler in the mind to suffer...
+			 * Stop. Not.
+			 * To optimize 'the modulo' or not, that is a question:
+			 * Are there many CPUs, which still being in the world production
+			 * And suffer badly from that stuff in it.
+ */ + unsigned int mod = (start + num) % u->ring_size; + + if (mod >= u->uidx) + comm = mod - u->uidx; + } + } else { + if (over != u->ring_over) + goto err_out_unlock; + + if (start + num >= u->uidx) + comm = start + num - u->uidx; + } + + if (comm) + u->full = 0; + + for (i=0; i<comm; ++i) { + if (kevent_ring_index_inc(&u->uidx, u->ring_size)) { + u->ring_over++; + over_changed = 1; + } + } + + if (over_changed) { + if (put_user(u->ring_over, &ring->ring_over)) { + err = -EFAULT; + goto err_out_unlock; + } + } + + if (put_user(u->uidx, &ring->ring_uidx)) { + err = -EFAULT; + goto err_out_unlock; + } + mutex_unlock(&u->ring_lock); + + fput(file); + + return comm; + +err_out_unlock: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create caches and register + * filesystem to get control file descriptors from. + */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + kevent_user_cache = kmem_cache_create("kevent_user_cache", + sizeof(struct kevent_user), 0, SLAB_PANIC, NULL, NULL); + + err = register_filesystem(&kevent_fs_type); + if (err) + goto err_out_exit; + + kevent_mnt = kern_mount(&kevent_fs_type); + err = PTR_ERR(kevent_mnt); + if (IS_ERR(kevent_mnt)) + goto err_out_unreg; + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_unreg: + unregister_filesystem(&kevent_fs_type); +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..3b7d35f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,12 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_commit); +cond_syscall(sys_kevent_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
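[Editorial note: taken together, sys_kevent_init(), sys_kevent_ctl(), sys_kevent_wait() and sys_kevent_commit() above define a producer/consumer protocol over the userspace-visible ring. The following is a minimal, hedged userspace sketch of that loop. The __NR_kevent_* numbers are placeholders, and struct ukevent, struct kevent_ring and KEVENT_CTL_ADD are assumed to come from the patchset's shared userspace header, which is not shown in this posting.]

/*
 * Hedged sketch of the ring-buffer event loop described above.
 * Replace the placeholder syscall numbers with the ones assigned
 * by the patched kernel.
 */
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_kevent_init	(-1)	/* placeholder */
#define __NR_kevent_ctl		(-1)	/* placeholder */
#define __NR_kevent_wait	(-1)	/* placeholder */
#define __NR_kevent_commit	(-1)	/* placeholder */

#define RING_NUM	256	/* any value but 1, see sys_kevent_init() */

static int kevent_ring_loop(struct kevent_ring *ring, struct ukevent *uk)
{
	unsigned int start = 0;
	long n, i;
	int fd;

	/* Create a kevent queue with a userspace-allocated ring buffer. */
	fd = syscall(__NR_kevent_init, ring, RING_NUM);
	if (fd < 0)
		return -1;

	/* Register one event source, e.g. a socket, timer or pipe. */
	if (syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, uk) < 0)
		goto err;

	for (;;) {
		/* Sleep (up to 1 second, timeout is in nanoseconds) until
		 * there are ready events and ring space, then let the
		 * kernel copy up to RING_NUM of them into the ring. */
		n = syscall(__NR_kevent_wait, fd, RING_NUM, 1000000000ULL);
		if (n < 0)
			break;

		for (i = 0; i < n; ++i) {
			/* Process ring entry (start + i) % RING_NUM here;
			 * ring->ring_kidx tells how far the kernel got. */
		}

		/* Return the consumed slots to the kernel; @over must match
		 * the overflow counter the kernel last published, otherwise
		 * -EOVERFLOW is returned. */
		if (syscall(__NR_kevent_commit, fd, start, n, ring->ring_over) < 0)
			break;
		start = (start + n) % RING_NUM;
	}
err:
	close(fd);
	return -1;
}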
* [take25 3/6] kevent: poll/select() notifications.
  2006-11-21 16:29           ` [take25 2/6] kevent: Core files Evgeniy Polyakov
@ 2006-11-21 16:29             ` Evgeniy Polyakov
  2006-11-21 16:29               ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similar to epoll and has the same issues (callback
is invoked not from the internal state machine of the caller, but through
process wakeup, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 
 	if (file->f_op && file->f_op->release)
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage	st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ struct file {
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage	st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + } else { + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); 
+err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
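[Editorial note: as a concrete illustration of how this origin is driven from userspace - kevent_poll_enqueue() above takes the target file descriptor from id.raw[0] and matches ->event against the poll mask. The sketch below fills a ukevent accordingly; the struct ukevent layout, the type field dispatching to the registered callbacks and the KEVENT_* constants are assumed from the patchset's shared header.]

/*
 * Hedged sketch: a poll-style readiness request for an arbitrary fd.
 * id.raw[0] carries the fd handed to fget(), and ->event carries the
 * poll mask checked against f_op->poll().
 */
#include <poll.h>
#include <string.h>

static void fill_poll_ukevent(struct ukevent *uk, int fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_POLL;			/* dispatch to the callbacks above */
	uk->event = POLLIN | POLLRDNORM;	/* readiness bits of interest */
	uk->id.raw[0] = fd;			/* target descriptor */
	uk->req_flags = KEVENT_REQ_ONESHOT;	/* drop after first delivery */
}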
* [take25 4/6] kevent: Socket notifications.
  2006-11-21 16:29             ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov
@ 2006-11-21 16:29               ` Evgeniy Polyakov
  2006-11-21 16:29                 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features instead
of epoll, its performance increased noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
 		}
 		inode->i_private = 0;
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/security.h>
+#include <linux/kevent.h>
 
 #include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
 
 extern void sk_stream_rfree(struct sk_buff *skb);
 
+struct socket_alloc {
+	struct socket socket;
+	struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
 static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
 {
 	skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
 		sk->sk_backlog.tail = skb;
 	}
 	skb->next = NULL;
+	kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 }
 
 #define sk_wait_event(__sk, __timeo, __condition)		\
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
 	return si->kiocb;
 }
 
-struct socket_alloc {
-	struct socket socket;
-	struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
-	return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
-	return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
 extern void __sk_stream_mem_reclaim(struct sock *sk);
 extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
 			tp->ucopy.memory = 0;
 		} else if
(skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..9c24b5b --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,142 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + if (events & (POLLERR | POLLHUP)) + return -1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ void sock_init_data(struct socket *sock, sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/jhash.h> #include 
<linux/init.h> #include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ int tcp_v4_conn_request(struct sock *sk, reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
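[Editorial note: for an accept-driven server, kevent_socket_enqueue() above resolves id.raw[0] with sockfd_lookup(), and kevent_socket_callback() maps KEVENT_SOCKET_ACCEPT onto POLLIN|POLLRDNORM on the listening socket. A hedged sketch of the corresponding ukevent setup follows; the struct ukevent layout and the KEVENT_SOCKET* constants are assumed from the patchset's shared header.]

/*
 * Hedged sketch: arming accept notifications on a listening socket.
 * id.raw[0] is the value kevent_socket_enqueue() passes to
 * sockfd_lookup(); no ONESHOT flag, so the kevent stays armed and
 * fires for every new connection.
 */
#include <string.h>

static void fill_accept_ukevent(struct ukevent *uk, int listen_fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_SOCKET;
	uk->event = KEVENT_SOCKET_ACCEPT;	/* mapped to POLLIN|POLLRDNORM */
	uk->id.raw[0] = listen_fd;
}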
* [take25 5/6] kevent: Timer notifications. 2006-11-21 16:29 ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage 
*st = k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
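[Editorial note: since the raw[] convention above - seconds in id.raw[0], nanoseconds in id.raw[1] - is easy to get wrong, here is a hedged sketch of a periodic timer request. The struct ukevent layout and KEVENT_TIMER are assumed from the patchset's shared header.]

/*
 * Hedged sketch: a periodic 250 ms timer kevent. The timer re-arms
 * itself via hrtimer_forward() in kevent_timer_func() above, so no
 * ONESHOT flag is set; each expiration returns jiffies_to_msecs()
 * in ret_data[0].
 */
#include <string.h>

static void fill_timer_ukevent(struct ukevent *uk)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_TIMER;
	uk->id.raw[0] = 0;			/* seconds */
	uk->id.raw[1] = 250 * 1000 * 1000;	/* nanoseconds: 250 ms period */
}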
* [take25 6/6] kevent: Pipe notifications. 2006-11-21 16:29 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-21 16:29 ` Evgeniy Polyakov 2006-11-22 11:20 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-21 16:29 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..5080642 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,117 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
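[Editorial note: pipe readiness reuses the KEVENT_SOCKET_RECV/SEND event bits, as the kevent_pipe_notify() calls in the fs/pipe.c hunks above show. A hedged sketch of watching the read side of a pipe follows; the struct ukevent layout and KEVENT_PIPE are assumed from the patchset's shared header.]

/*
 * Hedged sketch: wait for data on the read end of a pipe. id.raw[0]
 * is the value kevent_pipe_enqueue() passes to fget(); the event bits
 * are the socket ones, which kevent_pipe_callback() checks against
 * pipe->nrbufs.
 */
#include <string.h>

static void fill_pipe_ukevent(struct ukevent *uk, int pipe_read_fd)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_PIPE;
	uk->event = KEVENT_SOCKET_RECV;	/* data available to read */
	uk->id.raw[0] = pipe_read_fd;
}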
* Re: [take25 6/6] kevent: Pipe notifications. 2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov @ 2006-11-22 11:20 ` Eric Dumazet 2006-11-22 11:30 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-22 11:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote: > Pipe notifications. > +int kevent_pipe_enqueue(struct kevent *k) > +{ > + struct file *pipe; > + int err = -EBADF; > + struct inode *inode; > + > + pipe = fget(k->event.id.raw[0]); > + if (!pipe) > + goto err_out_exit; > + > + inode = igrab(pipe->f_dentry->d_inode); > + if (!inode) > + goto err_out_fput; > + Well... How can you be sure 'pipe/inode' really refers to a pipe/fifo here ? Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share the same location. (check pipe_info() in fs/splice.c) So I guess you need : err = -EINVAL; if (!S_ISFIFO(inode->i_mode)) goto err_out_iput; Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 6/6] kevent: Pipe notifications. 2006-11-22 11:20 ` Eric Dumazet @ 2006-11-22 11:30 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-22 11:30 UTC (permalink / raw) To: Eric Dumazet Cc: David Miller, Ulrich Drepper, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 12:20:50PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote: > On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote: > > Pipe notifications. > > > +int kevent_pipe_enqueue(struct kevent *k) > > +{ > > + struct file *pipe; > > + int err = -EBADF; > > + struct inode *inode; > > + > > + pipe = fget(k->event.id.raw[0]); > > + if (!pipe) > > + goto err_out_exit; > > + > > + inode = igrab(pipe->f_dentry->d_inode); > > + if (!inode) > > + goto err_out_fput; > > + > > Well... > > How can you be sure 'pipe/inode' really refers to a pipe/fifo here ? > > Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share > the same location. (check pipe_info() in fs/splice.c) > > So I guess you need : > > err = -EINVAL; > if (!S_ISFIFO(inode->i_mode)) > goto err_out_iput; You are correct, I did not perform that check, since all pipe open functions do rely on the i_pipe, which can not be block device at that point, but with kevent file descriptor can be anything, so that check must be performed. I will put it into the tree, thanks Eric. > Eric -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov @ 2006-11-22 23:46 ` Ulrich Drepper 2006-11-23 11:52 ` Evgeniy Polyakov 2006-11-22 23:52 ` Ulrich Drepper 2006-11-23 22:33 ` Ulrich Drepper 3 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 23:46 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); > + > +ctl_fd - file descriptor referring to the kevent queue > +num - number of processed kevents > +timeout - this timeout specifies number of nanoseconds to wait until there is > + free space in kevent queue > + > +Return value: > + number of events copied into ring buffer or negative error value. This is not quite sufficient. What we also need is a parameter which specifies which ring buffer the code assumes is currently active. This is just like the EWOULDBLOCK error in the futex. I.e., the kernel doesn't move the thread on the wait list if the index has changed. Otherwise asynchronous ring buffer filling is impossible. Assume this thread kernel get current ring buffer idx front and tail pointer the same add new entry to ring buffer bump front pointer call kevent_wait() With the interface above this leads to a deadlock. The kernel delivered the event and is done with it. If the kevent_wait() syscall gets an additional parameter which specifies the expected front pointer the kernel wouldn't put the thread to sleep since, in this case, the front pointer changed since last checked. The kernel cannot and should not check the ring buffer is empty. Userlevel should maintain the tail pointer all by itself. And even if the tail pointer is available to the kernel, the program might want to handle the queued events differently. The above also comes to bear without asynchronous queuing if a thread waits for more than one event and it is possible to handle both events concurrently in two threads. Passing in the expected front pointer value is flexible and efficient. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
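[Editorial note: the interleaving Ulrich describes above is the classic lost-wakeup race, and the fix he proposes mirrors the futex val check. The sketch below illustrates it in userspace C; kevent_wait_idx() is a hypothetical variant of kevent_wait() taking the front index the caller last observed - it does not exist in the posted patches.]

#include <linux/types.h>

/* Hypothetical syscall wrapper - not part of the posted patches. */
extern long kevent_wait_idx(int ctl_fd, unsigned int num,
			    unsigned int old_idx, __u64 timeout);

static long wait_for_events(int ctl_fd, struct kevent_ring *ring,
			    unsigned int num, unsigned int tail, __u64 timeout)
{
	unsigned int idx = ring->ring_kidx;	/* 1: sample the front pointer */

	if (idx != tail)
		return 0;	/* events already pending, no need to sleep */

	/*
	 * 2: between the check above and the sleep below, the kernel may
	 * post an event and bump ring_kidx; a plain kevent_wait() would
	 * then sleep on an already delivered event. Passing the observed
	 * idx lets the kernel compare it against the current ring_kidx
	 * and, like futex with a changed value, return immediately
	 * instead of sleeping when they differ - so the wakeup cannot
	 * be lost.
	 */
	return kevent_wait_idx(ctl_fd, num, idx, timeout);
}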
* Re: [take25 1/6] kevent: Description. 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper @ 2006-11-23 11:52 ` Evgeniy Polyakov 2006-11-23 19:45 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 11:52 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 03:46:42PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout); > >+ > >+ctl_fd - file descriptor referring to the kevent queue > >+num - number of processed kevents > >+timeout - this timeout specifies number of nanoseconds to wait until > >there is + free space in kevent queue > >+ > >+Return value: > >+ number of events copied into ring buffer or negative error value. > > This is not quite sufficient. What we also need is a parameter which > specifies which ring buffer the code assumes is currently active. This > is just like the EWOULDBLOCK error in the futex. I.e., the kernel > doesn't move the thread on the wait list if the index has changed. > Otherwise asynchronous ring buffer filling is impossible. Assume this > > thread kernel > > get current ring buffer idx > > front and tail pointer the same > > add new entry to ring buffer > > bump front pointer > > call kevent_wait() > > > With the interface above this leads to a deadlock. The kernel delivered > the event and is done with it. Kernel does not put there a new entry, it is only done inside kevent_wait(). Entries are put into queue (in any context), where they can be obtained from only kevent_wait() or kevent_get_events(). -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 11:52 ` Evgeniy Polyakov @ 2006-11-23 19:45 ` Ulrich Drepper 2006-11-24 11:01 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 19:45 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > Kernel does not put there a new entry, it is only done inside > kevent_wait(). Entries are put into queue (in any context), where they can be obtained > from only kevent_wait() or kevent_get_events(). I know this is how it's done now. But it is not where it has to end. IMO we have to get to a solution where new events are posted to the ring buffer asynchronously, i.e., without a thread calling kevent_wait. And then you need the extra parameter and verification. Even if it's today not needed we have to future-proof the interface since it cannot be changed once in use. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 19:45 ` Ulrich Drepper @ 2006-11-24 11:01 ` Evgeniy Polyakov 2006-11-24 16:06 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:01 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 11:45:36AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >Kernel does not put there a new entry, it is only done inside > >kevent_wait(). Entries are put into queue (in any context), where they can > >be obtained > >from only kevent_wait() or kevent_get_events(). > > I know this is how it's done now. But it is not where it has to end. > IMO we have to get to a solution where new events are posted to the ring > buffer asynchronously, i.e., without a thread calling kevent_wait. And > then you need the extra parameter and verification. Even if it's today > not needed we have to future-proof the interface since it cannot be > changed once in use. There is a special flag in kevent_user to wake it if there are no ready events - kernel thread which has added new events will set it and thus subsequent kevent_wait() will return with updated indexes - userspace must check indexes after kevent_wait(). > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:01 ` Evgeniy Polyakov @ 2006-11-24 16:06 ` Ulrich Drepper 2006-11-24 16:14 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:06 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: >> I know this is how it's done now. But it is not where it has to end. >> IMO we have to get to a solution where new events are posted to the ring >> buffer asynchronously, i.e., without a thread calling kevent_wait. And >> then you need the extra parameter and verification. Even if it's not needed >> today we have to future-proof the interface since it cannot be >> changed once in use. > > There is a special flag in kevent_user to wake the waiter even if there are no ready > events - a kernel thread which has added new events will set it, and thus > a subsequent kevent_wait() will return with updated indexes - userspace > must check the indexes after kevent_wait(). You misunderstand. I don't want to return without waiting unconditionally. There is a race which has to be closed. It's exactly the same as in the futex syscall. I've shown the interaction between the kernel and the thread in the previous mail. There is inevitably a time difference between the thread checking whether the ring buffer is empty and the kernel putting the thread to sleep in the kevent_wait call. This is no problem with the current kevent_wait implementation since the ring buffer is not filled asynchronously. But if/when it will be the kernel might add something to the ring buffer _after_ the thread checks for an empty ring buffer and _before_ it enters the kernel in the kevent_wait syscall. The kevent_wait syscall will only wake the thread when a new event is posted. We do not in general want it to be woken when the ring buffer is non-empty. This would create far too many unnecessary wakeups if there is more than one thread working on the queue. With the additional parameters for kevent_wait indicating when the calling thread last checked the ring buffer the kernel can find out whether the decision to call kevent_wait was made based on outdated information or not. Outdated in the case a new event has been posted. In this case the thread is not put to sleep but instead returns. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
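The race is easiest to see spelled out. A minimal userspace sketch follows; the guarded form passes the last ring index the thread observed, matching the old_uidx semantics that take26 eventually adopted, while the exact prototype and the my_uidx variable are assumptions here:

	/* Userspace side of the check-then-sleep race (a sketch). */
	unsigned int seen = ring->ring_kidx;    /* snapshot the front pointer */

	if (seen == my_uidx) {                  /* ring looks empty...        */
		/* ...but the kernel may post an event and bump ring_kidx
		 * right here, before the syscall below is entered.
		 *
		 * Racy form: kevent_wait(ctl_fd, 0, timeout) would now
		 * sleep even though an event already sits in the ring.
		 *
		 * Guarded form: pass the snapshot; the kernel returns
		 * immediately if the index has moved, exactly like
		 * FUTEX_WAIT with an expected value. */
		kevent_wait(ctl_fd, 0, seen, &timeout);
	}
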
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:06 ` Ulrich Drepper @ 2006-11-24 16:14 ` Evgeniy Polyakov 2006-11-24 16:31 ` Evgeniy Polyakov 2006-11-27 19:20 ` Ulrich Drepper 0 siblings, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:14 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 08:06:59AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>I know this is how it's done now. But it is not where it has to end. > >>IMO we have to get to a solution where new events are posted to the ring > >>buffer asynchronously, i.e., without a thread calling kevent_wait. And > >>then you need the extra parameter and verification. Even if it's not needed > >>today we have to future-proof the interface since it cannot be > >>changed once in use. > > > >There is a special flag in kevent_user to wake the waiter even if there are no ready > >events - a kernel thread which has added new events will set it, and thus > >a subsequent kevent_wait() will return with updated indexes - userspace > >must check the indexes after kevent_wait(). > > You misunderstand. I don't want to return without waiting unconditionally. > > There is a race which has to be closed. It's exactly the same as in the > futex syscall. I've shown the interaction between the kernel and the > thread in the previous mail. There is inevitably a time difference > between the thread checking whether the ring buffer is empty and the > kernel putting the thread to sleep in the kevent_wait call. > > This is no problem with the current kevent_wait implementation since the > ring buffer is not filled asynchronously. But if/when it will be the > kernel might add something to the ring buffer _after_ the thread checks > for an empty ring buffer and _before_ it enters the kernel in the > kevent_wait syscall. > > The kevent_wait syscall will only wake the thread when a new event is > posted. We do not in general want it to be woken when the ring buffer > is non-empty. This would create far too many unnecessary wakeups if > there is more than one thread working on the queue. > > With the additional parameters for kevent_wait indicating when the calling > thread last checked the ring buffer the kernel can find out whether the > decision to call kevent_wait was made based on outdated information or > not. Outdated in the case a new event has been posted. In this case > the thread is not put to sleep but instead returns. Read my mail again. If the kernel has put data asynchronously it will set a special flag, thus kevent_wait() will not sleep and will return, so the thread will check the new entries and process them. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:14 ` Evgeniy Polyakov @ 2006-11-24 16:31 ` Evgeniy Polyakov 2006-11-27 19:20 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:31 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 07:14:06PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > If the kernel has put data asynchronously it will set a special flag, thus > kevent_wait() will not sleep and will return, so the thread will check the new > entries and process them. For clarification - only kevent_wait() updates the index; userspace will not detect that it has changed after a kernel thread has put new data there. If a kernel thread is to update the index too, you are correct: kevent_wait() should get the index as a parameter. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:14 ` Evgeniy Polyakov 2006-11-24 16:31 ` Evgeniy Polyakov @ 2006-11-27 19:20 ` Ulrich Drepper 1 sibling, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:20 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > If the kernel has put data asynchronously it will set a special flag, thus > kevent_wait() will not sleep and will return, so the thread will check the new > entries and process them. This is not sufficient. The userlevel code does not commit the events until they are processed. So assume two threads at userlevel, one event is asynchronously posted. The first thread picks it up, the second calls kevent_wait. With your scheme it will not be put to sleep and unnecessarily return to userlevel. What I propose, and what has been proven to work in many situations, is to make the information "I am aware of all events up to XX; wake me only if anything beyond that is added" part of the kevent_wait syscall. Please take a look at how futexes work, it's really the same concept. And it's really also simpler for the implementation. Having such a flag is much more complicated than adding a simple index comparison before going to sleep. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
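On the kernel side the "simple index comparison before going to sleep" is just that; a minimal sketch, with names assumed, following the semantics the take26 changelog later describes (kevent_wait() has an old_uidx parameter which, if not equal to u->uidx, results in immediate wakeup):

	/* Kernel side of a futex-style guarded wait (hypothetical names). */
	static int kevent_wait_guarded(struct kevent_user *u,
	                               unsigned int old_uidx,
	                               struct timespec *timeout)
	{
		if (old_uidx != u->uidx)
			return 0;  /* caller's view is stale: do not sleep */

		/* Assumed helper: sleep until an event is posted or the
		 * timeout expires. */
		return kevent_wait_for_events(u, timeout);
	}
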
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov 2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov 2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper @ 2006-11-22 23:52 ` Ulrich Drepper 2006-11-23 11:55 ` Evgeniy Polyakov 2006-11-23 22:33 ` Ulrich Drepper 3 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-22 23:52 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + struct kevent_ring > + { > + unsigned int ring_kidx, ring_uidx, ring_over; > + struct ukevent event[0]; > + } > + [...] > +ring_uidx - index of the first entry userspace can start reading from Do we need this value in the structure? Userlevel cannot and should not be able to modify it. So, userland has in any case to track the tail pointer itself. Why then have this value at all? After kevent_init() the tail pointer is implicitly assumed to be 0. Since the front pointer (well index) is also zero nothing is available for reading. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-22 23:52 ` Ulrich Drepper @ 2006-11-23 11:55 ` Evgeniy Polyakov 2006-11-23 20:00 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-23 11:55 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Wed, Nov 22, 2006 at 03:52:11PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ struct kevent_ring > >+ { > >+ unsigned int ring_kidx, ring_uidx, ring_over; > >+ struct ukevent event[0]; > >+ } > >+ [...] > >+ring_uidx - index of the first entry userspace can start reading from > > Do we need this value in the structure? Userlevel cannot and should not > be able to modify it. So, userland has in any case to track the tail > pointer itself. Why then have this value at all? > > After kevent_init() the tail pointer is implicitly assumed to be 0. > Since the front pointer (well index) is also zero nothing is available > for reading. uidx is the index starting from which there are unread entries. It is updated by userspace when it commits entries, so it is the 'consumer' pointer, while kidx is the index where the kernel will put new entries, i.e. the 'producer' index. We definitely need them both. Userspace can only update (implicitly by calling kevent_commit()) uidx. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
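With kidx as the producer index and uidx as the consumer index, the number of ready-but-unread entries is plain ring arithmetic. A sketch; the ring_size parameter and the wrap-around behaviour are assumptions, since the posted description does not spell them out:

	/* Entries the consumer may still read (wrap-around assumed). */
	static inline unsigned int kevent_ready(unsigned int kidx,
	                                        unsigned int uidx,
	                                        unsigned int ring_size)
	{
		return (kidx >= uidx) ? kidx - uidx
		                      : ring_size - uidx + kidx;
	}
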
* Re: [take25 1/6] kevent: Description. 2006-11-23 11:55 ` Evgeniy Polyakov @ 2006-11-23 20:00 ` Ulrich Drepper 2006-11-23 21:49 ` Hans Henrik Happe 2006-11-24 11:46 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 20:00 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > uidx is the index starting from which there are unread entries. It is > updated by userspace when it commits entries, so it is the 'consumer' > pointer, while kidx is the index where the kernel will put new entries, i.e. > the 'producer' index. We definitely need them both. > Userspace can only update (implicitly by calling kevent_commit()) uidx. Right, which is why exporting this entry is not needed. Keep the interface as small as possible. Userlevel has to maintain its own index. Just assume kevent_wait returns 10 new entries and you have multiple threads. In this case all threads take their turns and pick an entry from the ring buffer. This basically has to be done with something like this (I ignore wrap-arounds here to simplify the example): int getidx() { unsigned int idx; while ((idx = uidx) < kidx) if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) return idx; return -1; } Very much simplified but it should show that we need a writable copy of the uidx. And this value at any time must be consistent with the index the kernel assumes. The current ring_uidx value can at best be used to reinitialize the userlevel uidx value after each kevent_wait call but this is unnecessary at best (since uidx must already have this value) and racy in problem cases (what if more than one thread gets woken concurrently with uidx having the same value and one thread stores the uidx value and immediately increments it to get an index; the second store would overwrite the increment). I can assure you that any implementation I write would not use the ring_uidx value. Only trivial, single-threaded examples like your ring_buffer.c could ever take advantage of this value. It's not worth it. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
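A self-contained version of that claim loop, with the copy step the next mails discuss, might read as follows. GCC's __sync builtin stands in for whatever atomic primitive the runtime provides, uidx is the process-local consumer index the mail argues for (the kernel's committed index only moves at kevent_commit time), and wrap-around is still ignored:

	#include <limits.h>

	static unsigned int uidx;	/* shared userlevel consumer index */

	/* Claim one entry and copy it out; until kevent_commit() advances
	 * the kernel's tail, the slot cannot be overwritten, so the copy
	 * after a successful claim is safe.  Returns UINT_MAX when empty. */
	static unsigned int get_event(struct kevent_ring *ring,
	                              unsigned int kidx,
	                              struct ukevent *out)
	{
		unsigned int old;

		while ((old = uidx) < kidx)
			if (__sync_val_compare_and_swap(&uidx, old, old + 1) == old) {
				*out = ring->event[old];  /* copy before commit */
				return old;
			}
		return UINT_MAX;	/* ring is empty */
	}
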
* Re: [take25 1/6] kevent: Description. 2006-11-23 20:00 ` Ulrich Drepper @ 2006-11-23 21:49 ` Hans Henrik Happe 2006-11-23 22:34 ` Ulrich Drepper 2006-11-24 11:46 ` Evgeniy Polyakov 0 siblings, 2 replies; 200+ messages in thread From: Hans Henrik Happe @ 2006-11-23 21:49 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thursday 23 November 2006 21:00, Ulrich Drepper wrote: > Evgeniy Polyakov wrote: > > uidx is the index starting from which there are unread entries. It is > > updated by userspace when it commits entries, so it is the 'consumer' > > pointer, while kidx is the index where the kernel will put new entries, i.e. > > the 'producer' index. We definitely need them both. > > Userspace can only update (implicitly by calling kevent_commit()) uidx. > > Right, which is why exporting this entry is not needed. Keep the > interface as small as possible. > > Userlevel has to maintain its own index. Just assume kevent_wait > returns 10 new entries and you have multiple threads. In this case all > threads take their turns and pick an entry from the ring buffer. This > basically has to be done with something like this (I ignore wrap-arounds > here to simplify the example): > > int getidx() { > unsigned int idx; > while ((idx = uidx) < kidx) > if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) > return idx; > return -1; > } I don't know if this falls under the simplification, but wouldn't there be a race when reading/copying the event data? I guess this could be solved with an extra user index. -- Hans Henrik Happe ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 21:49 ` Hans Henrik Happe @ 2006-11-23 22:34 ` Ulrich Drepper 2006-11-24 11:50 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:34 UTC (permalink / raw) To: Hans Henrik Happe Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Hans Henrik Happe wrote: > I don't know if this falls under the simplification, but wouldn't there be a > race when reading/copying the event data? I guess this could be solved with > an extra user index. That's what I said, reading the value from the ring buffer structure's head would be racy. All this can only work for single threaded code. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:34 ` Ulrich Drepper @ 2006-11-24 11:50 ` Evgeniy Polyakov 2006-11-24 16:17 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:50 UTC (permalink / raw) To: Ulrich Drepper Cc: Hans Henrik Happe, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 02:34:46PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Hans Henrik Happe wrote: > >I don't know if this falls under the simplification, but wouldn't there be > >a race when reading/copying the event data? I guess this could be solved > >with an extra user index. > > That's what I said, reading the value from the ring buffer structure's > head would be racy. All this can only work for single threaded code. The value in the userspace ring is updated each time it is changed in the kernel (when userspace calls kevent_commit()); when userspace has read its old value it is guaranteed that the requested number of events _is_ there (although it is possible that there are more than that value). Ulrich, why didn't you comment on the previous interface, which had exactly _one_ index exported to userspace - it is only required to add implicit uidx and (if you prefer that way) an additional syscall, since in the previous interface both waiting and commit were handled by kevent_wait() with different parameters. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:50 ` Evgeniy Polyakov @ 2006-11-24 16:17 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:17 UTC (permalink / raw) To: Evgeniy Polyakov Cc: Hans Henrik Happe, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > Ulrich, why didn't you comment on the previous interface, which had exactly > _one_ index exported to userspace - it is only required to add implicit > uidx and (if you prefer that way) an additional syscall, since in the previous > interface both waiting and commit were handled by kevent_wait() with > different parameters. If you read my old mails you'll find that I'm pretty consistent wrt the ring buffer interface. The old code had other problems, not the missing exposure of the uidx value. There is really not much disagreement here. I just don't like making the interface unnecessarily and misleadingly large by exposing the uidx value which is not useful to the userlevel code. Just remove the element and stuff it into a kernel-internal struct for the queue and you're done. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 20:00 ` Ulrich Drepper 2006-11-23 21:49 ` Hans Henrik Happe @ 2006-11-24 11:46 ` Evgeniy Polyakov 2006-11-24 16:30 ` Ulrich Drepper 1 sibling, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 11:46 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 12:00:45PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >uidx is the index starting from which there are unread entries. It is > >updated by userspace when it commits entries, so it is the 'consumer' > >pointer, while kidx is the index where the kernel will put new entries, i.e. > >the 'producer' index. We definitely need them both. > >Userspace can only update (implicitly by calling kevent_commit()) uidx. > > Right, which is why exporting this entry is not needed. Keep the > interface as small as possible. If there are several callers of kevent_commit(), uidx can be changed farther than the first caller expects, so there should be a possibility to check that value. It is thus exported into the shared ring buffer structure. > Userlevel has to maintain its own index. Just assume kevent_wait > returns 10 new entries and you have multiple threads. In this case all > threads take their turns and pick an entry from the ring buffer. This > basically has to be done with something like this (I ignore wrap-arounds > here to simplify the example): > > int getidx() { > unsigned int idx; > while ((idx = uidx) < kidx) > if (atomic_cmpxchg(&uidx, idx, idx + 1) == idx) > return idx; > return -1; > } > > Very much simplified but it should show that we need a writable copy of > the uidx. And this value at any time must be consistent with the index > the kernel assumes. I seriously doubt it is simpler than having index provided by kernel. > The current ring_uidx value can at best be used to reinitialize the > userlevel uidx value after each kevent_wait call but this is unnecessary > at best (since uidx must already have this value) and racy in problem > cases (what if more than one thread gets woken concurrently with uidx > having the same value and one thread stores the uidx value and > immediately increments it to get an index; the second store would > overwrite the increment). > > I can assure you that any implementation I write would not use the > ring_uidx value. Only trivial, single-threaded examples like your > ring_buffer.c could ever take advantage of this value. It's not worth it. You propose to make uidx shared local variable - it is doable, but it is not required - userspace can use kernel's variable, since it is updated exactly in the places where that index is changed. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 11:46 ` Evgeniy Polyakov @ 2006-11-24 16:30 ` Ulrich Drepper 2006-11-24 16:49 ` Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 16:30 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: >> Very much simplified but it should show that we need a writable copy of >> the uidx. And this value at any time must be consistent with the index >> the kernel assumes. > > I seriously doubt it is simpler than having index provided by kernel. What has simpler to do with it? The userlevel code should not modify the ring buffer structure at all. If we'd do this then all operations, at least on the uidx field, would have to be atomic operations. This is currently not the case for the kernel side since it's protected by a lock for the event queue. Using the uidx field from userlevel would therefore just make things slower. And for what? Changing the uidx value would make the commit syscall unnecessary. This might be an argument but it sounds too dangerous. IMO the value should be protected by the kernel. And in any case, the uidx value cannot be updated until the event actually has been processed. But the threads still need to coordinate distributing the events from the ring buffer amongst themselves. This will in any case require a second variable. So, if you want to do away with the commit syscall, keep the uidx value. This also requires that the ring buffer head will always be writable (something I'd like to avoid making part of the interface but I'm flexible on this). Otherwise, the ring_uidx element can go away, it's not needed and will only make people think about wrong approaches to use it. > You propose to make uidx shared local variable - it is doable, but it > is not required - userspace can use kernel's variable, since it is > updated exactly in the places where that index is changed. As said above, we always need another variable and uidx is only a replacement for the commit call. Until the event is processed the uidx cannot be incremented since otherwise the ring buffer entry might be overwritten. And kernel people of all should be happy to limit the exposure of the implementation. So, leave the problem of keeping track of the tail pointer to the userlevel code. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:30 ` Ulrich Drepper @ 2006-11-24 16:49 ` Evgeniy Polyakov 2006-11-27 19:23 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 16:49 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 08:30:14AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >>Very much simplified but it should show that we need a writable copy of > >>the uidx. And this value at any time must be consistent with the index > >>the kernel assumes. > > > >I seriously doubt it is simpler than having index provided by kernel. > > What has simpler to do with it? The userlevel code should not modify > the ring buffer structure at all. If we'd do this then all operations, > at least on the uidx field, would have to be atomic operations. This is > currently not the case for the kernel side since it's protected by a > lock for the event queue. Using the uidx field from userlevel would > therefore just make things slower. That index is provided by kernel for userspace so that userspace could determine where indexes are - of course userspace can maintain it itself, but it can also use provided by kernel. It is not written explicitly, but only through kevent_commit(). > And for what? Changing the uidx value would make the commit syscall > unnecessary. This might be an argument but it sounds too dangerous. > IMO the value should be protected by the kernel. > > And in any case, the uidx value cannot be updated until the event > actually has been processed. But the threads still need to coordinate > distributing the events from the ring buffer amongst themselves. This > will in any case require a second variable. > > So, if you want to do away with the commit syscall, keep the uidx value. > This also requires that the ring buffer head will always be writable > (something I'd like to avoid making part of the interface but I'm > flexible on this). Otherwise, the ring_uidx element can go away, it's > not needed and will only make people think about wrong approaches to use it. No, the head will not be writable - absolutely not. I do not care actually about that index, but as you have probably noticed, there was such an interface already, and I changed it. So, this will be the last change of the interface. You think it should not be exported - fine, it will not be. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 16:49 ` Evgeniy Polyakov @ 2006-11-27 19:23 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:23 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > That index is provided by kernel for userspace so that userspace could > determine where indexes are - of course userspace can maintain it > itself, but it can also use provided by kernel. Indeed. That's what I said. But I also pointed out that the field is only useful in simple minded programs and certainly not in the wrappers the runtime (glibc) will provide. As you said yourself, there is no real need for the value being there, userland can keep track of it by itself. So, let's reduce the interface. > I do not care actually about that index, but as you have probably noticed, > there was such an interface already, and I changed it. So, this will be the > last change of the interface. You think it should not be exported - > fine, it will not be. Thanks. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov ` (2 preceding siblings ...) 2006-11-22 23:52 ` Ulrich Drepper @ 2006-11-23 22:33 ` Ulrich Drepper 2006-11-23 22:48 ` Jeff Garzik 2006-11-24 12:05 ` Evgeniy Polyakov 3 siblings, 2 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 22:33 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > + int kevent_commit(int ctl_fd, unsigned int start, > + unsigned int num, unsigned int over); I think we can simplify this interface: int kevent_commit(int ctl_fd, unsigned int new_tail, unsigned int over); The kernel sets the ring_uidx value to the 'new_tail' value if the tail pointer would be incremented (modulo wrap-around) and is not higher than the current front pointer. The test will be a bit complicated but not more so than what the current code has to do to check for mistakes. This approach has the advantage that the commit calls don't have to be synchronized. If one thread sets the tail pointer to, say, 10 and another to 12, then it does not matter whether the first thread is delayed. If it is eventually executed the result is simply a no-op, since the second thread's action supersedes it. Maybe the current form is even impossible to use with explicit locking at userlevel. What if one thread, which is about to call kevent_commit, is indefinitely delayed? Then this commit request's value is never taken into account and the tail pointer is always short of what it should be. There is one more thing to consider. Oftentimes the commit request will be immediately followed by a kevent_wait call. It would be good to merge this pair of calls. The two parameters new_tail and over could also be passed to the kevent_wait call and the commit can happen before the thread looks for new events and eventually goes to sleep. If this can be implemented then the kevent_commit syscall by itself might not be needed at all. Instead you'd call kevent_wait() and make the maximum number of events which can be returned zero. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
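The forward-only update that makes stale commits harmless is small. A sketch with assumed structure and field names; wrap-around and the 'over' generation counter are elided:

	struct kevent_queue {           /* assumed, kernel-internal       */
		unsigned int uidx;      /* tail: first uncommitted entry  */
		unsigned int kidx;      /* front: next slot kernel fills  */
	};

	/* Only move the tail forward, never past the front pointer.  A
	 * delayed thread committing a stale new_tail becomes a no-op,
	 * superseded by the later commit - so callers need no locking. */
	static void commit_tail(struct kevent_queue *q, unsigned int new_tail)
	{
		if (new_tail > q->uidx && new_tail <= q->kidx)
			q->uidx = new_tail;
	}
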
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:33 ` Ulrich Drepper @ 2006-11-23 22:48 ` Jeff Garzik 2006-11-23 23:45 ` Ulrich Drepper 2006-11-24 0:14 ` Hans Henrik Happe 1 sibling, 2 replies; 200+ messages in thread From: Jeff Garzik @ 2006-11-23 22:48 UTC (permalink / raw) To: Ulrich Drepper Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Ulrich Drepper wrote: > Evgeniy Polyakov wrote: >> + int kevent_commit(int ctl_fd, unsigned int start, + unsigned int >> num, unsigned int over); > > I think we can simplify this interface: > > int kevent_commit(int ctl_fd, unsigned int new_tail, > unsigned int over); > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > pointer would be incremented (modulo wrap-around) and is not higher than > the current front pointer. The test will be a bit complicated but not > more so than what the current code has to do to check for mistakes. > > This approach has the advantage that the commit calls don't have to be > synchronized. If one thread sets the tail pointer to, say, 10 and > another to 12, then it does not matter whether the first thread is > delayed. If it is eventually executed the result is simply a no-op, > since the second thread's action supersedes it. > > Maybe the current form is even impossible to use with explicit locking > at userlevel. What if one thread, which is about to call kevent_commit, > is indefinitely delayed? Then this commit request's value is never > taken into account and the tail pointer is always short of what it > should be. I'm really wondering whether designing for N-threads-to-1-ring is the wisest choice. Considering current designs, it seems more likely that a single thread polls for socket activity, then dispatches work. How often do you really see in userland multiple threads polling the same set of fds, then fighting to decide who will handle raised events? More likely, you will see "prefork" (start N threads, each with its own ring) or a worker pool (single thread receives events, then dispatches to multiple threads for execution) or even one-thread-per-fd (single thread receives events, then starts new thread for handling). If you have multiple threads accessing the same ring -- a poor design choice -- I would think the burden should be on the application, to provide proper synchronization. If the desire is to have the kernel distribute events directly to multiple threads, then the app should dup(2) the fd to be watched, and create a ring buffer for each separate thread. Jeff ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:48 ` Jeff Garzik @ 2006-11-23 23:45 ` Ulrich Drepper 2006-11-24 0:48 ` Eric Dumazet 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-23 23:45 UTC (permalink / raw) To: Jeff Garzik Cc: Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Jeff Garzik wrote: > Considering current designs, it seems more likely that a single thread > polls for socket activity, then dispatches work. How often do you > really see in userland multiple threads polling the same set of fds, > then fighting to decide who will handle raised events? > > More likely, you will see "prefork" (start N threads, each with its own > ring) or a worker pool (single thread receives events, then dispatches > to multiple threads for execution) or even one-thread-per-fd (single > thread receives events, then starts new thread for handling). No, absolutely not. This is exactly not what should/is/will happen. You create worker threads to handle the work for the entire program. Look at something like a web server. When creating several queues, how do you distribute all the connections to the different queues? To ensure every connection is handled as quickly as possible you stuff them all in the same queue and then have all threads use this one queue. Whenever an event is posted a thread is woken. _One_ thread. If two events are posted, two threads are woken. In this situation we have a few atomic ops at userlevel to make sure that the two threads don't pick the same event but that's all there is wrt "fighting". The alternative is the sorry state we have now. In nscd, for instance, we have one single thread waiting for incoming connections and it then has to wake up a worker thread to handle the processing. This is done because we cannot "park" all threads in the accept() call since when a new connection is announced _all_ the threads are woken. With the new event handling this wouldn't be the case, one thread only is woken and we don't have to wake worker threads. All threads can be worker threads. > If you have multiple threads accessing the same ring -- a poor design > choice To the contrary. It is the perfect means to distribute the workload to multiple threads. Besides, how would you implement asynchronous filling of the ring buffer to avoid unnecessary syscalls if you have many different queues? > -- I would think the burden should be on the application, to > provide proper synchronization. Sure, as much as possible. But there is no reason to design the commit interface in the way which requires expensive synchronization when there is another design which can do exactly the same work but does not require synchronization. The currently proposed kevent_commit and my proposed variant are functionally equivalent. > If the desire is to have the kernel distributes events directly to > multiple threads, then the app should dup(2) the fd to be watched, and > create a ring buffer for each separate thread. And how would you synchronize the file descriptor use across the threads? The event would be sent to all the event queues so that you would a) unnecessarily wake all threads and b) have all but one thread see the operation (say, read or write on a socket) fail with EWOULDBLOCK. That's just silly, we can have that today and continue to waste precious CPU cycles. 
If you say that you post exactly one event per file description (not handle) then what do you do if the programmer wants the opposite? And again, what do you do for asynchronous ring buffer filling? Which queue do you pick? Pick the wrong one and the event might sit in the ring buffer for a long time while another thread handling another queue is idle. Using a single central queue is the perfect means to distribute the load to a number of threads. Nobody is forcing you to do it, you're free to use separate queues if you want. But the model should not enforce this. Overall, I cannot see at all where your problem is. I agree that the synchronization of the access to the ring buffer must be done at userlevel. This is why the uidx exposure isn't needed. The wakeup in any case has to take threads into account. The only change I proposed to enable better multi-thread handling is the revised commit interface and this change in no way hinders single-threaded users. The interface is not hindered in any way or form by the use of threads. Oh, and when I say "threads" I should have said "threads or processes". The whole also applies to multi-process applications. They can share event queues by placing them in shared memory. And I hope that everyone agrees that programs have to go in the direction of having more than one execution context to take advantage of increased CPU power in the future. CMP is only becoming more and more important. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
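The single-shared-queue worker pool Drepper describes reduces to one loop that every thread runs. A sketch only: get_event() is the claim loop sketched earlier in the thread, while ctl_fd, uidx, timeout and handle_event() are assumptions, and tail commitment is deliberately left out since it needs the ordering bookkeeping the earlier mails discuss:

	#include <limits.h>

	static void *worker(void *arg)
	{
		struct kevent_ring *ring = arg;
		struct ukevent ev;

		for (;;) {
			unsigned int idx = get_event(ring, ring->ring_kidx, &ev);

			if (idx == UINT_MAX) {	/* ring drained            */
				/* guarded wait: one thread woken per event */
				kevent_wait(ctl_fd, 0, uidx, &timeout);
				continue;	/* re-check after wakeup   */
			}
			handle_event(&ev);	/* real work, no dispatcher */
		}
		return NULL;
	}
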
* Re: [take25 1/6] kevent: Description. 2006-11-23 23:45 ` Ulrich Drepper @ 2006-11-24 0:48 ` Eric Dumazet 2006-11-24 8:14 ` Andrew Morton 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-24 0:48 UTC (permalink / raw) To: Ulrich Drepper Cc: Jeff Garzik, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Ulrich Drepper wrote: > > You create worker threads to handle the work for the entire program. Look > at something like a web server. When creating several queues, how do > you distribute all the connections to the different queues? To ensure > every connection is handled as quickly as possible you stuff them all in > the same queue and then have all threads use this one queue. Whenever an > event is posted a thread is woken. _One_ thread. If two events are > posted, two threads are woken. In this situation we have a few atomic > ops at userlevel to make sure that the two threads don't pick the same > event but that's all there is wrt "fighting". > > The alternative is the sorry state we have now. In nscd, for instance, > we have one single thread waiting for incoming connections and it then > has to wake up a worker thread to handle the processing. This is done > because we cannot "park" all threads in the accept() call since when a > new connection is announced _all_ the threads are woken. With the new > event handling this wouldn't be the case, one thread only is woken and > we don't have to wake worker threads. All threads can be worker threads. Having one specialized thread handling the distribution of work to worker threads is better most of the time. This thread can be a worker thread by itself (to avoid context switches), but can decide to wake up 'slave threads' if it believes it has to (for example if it notices that a *lot* of requests are pending). This is because with moderate load, it's better to have only one CPU running 80% of its time, keeping its cache hot, than to 'distribute' the work on four CPUs that would be used 25% of their time, but with lots of cache line ping-pongs and poor cache reuse. If you let 'kevent'/'dumb kernel dispatcher'/'futex'/'whatever' decide to wake up one thread for each new event, you *may* have lower performance, because of higher system overhead (system means: system scheduler/internals, but also bus traffic). Only the application writer can have a clue about the average use of its worker threads, and can decide to dynamically adjust parameters if needed to handle load spikes. SMP machines are nice, but for many workloads, it's better to avoid spreading a working set on several CPUs that fight for common resources (memory). Back to 'kevent': ----------------- I think that having a syscall to commit events should not be mandatory. A syscall is needed only to wait for new events if the ring is empty. But then maybe we don't need yet another new syscall to perform a wait: We already have nice synchronisation primitives (futex for example). 
The user program should be able to update a 'uidx' in user space (using atomic ops only if multi-threaded), and could just use the futex infrastructure if the ring buffer is empty (uidx == kidx), calling FUTEX_WAIT(&kidx, current value = uidx). I think I already gave my opinion on a ring buffer, but let me just rephrase it: One part should be read/write for the application (to be able to change uidx) (or the user app just gives the kernel, at init time, the address of a futex in its vm space). One part could be read-only for the application (but could be read/write: we don't care if the user application is stupid): the kernel writes its kidx (or a copy of it) and the events. For best performance, uidx and kidx should be on different cache lines (basic isolation of producer / consumer). When the kernel wants to queue a new event in a ring buffer it can: - See if the user program has consumed some events since the last invocation (the kernel fetches uidx and compares it with its own uidx value: no syscall needed). - Check if a slot is available in the ring buffer. - Copy the event into the ring buffer, perform a memory barrier, then increment kidx. - Call futex_wake(&kidx, 1 thread). The user application is free to have one thread/process or several threads/processes waiting for new events (or even no thread at all :) ) Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
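On the user side Dumazet's scheme needs no kevent-specific wait syscall at all; a sketch with the raw futex syscall, exactly as he describes it (the kernel would pair each posted event with futex_wake on &kidx):

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Sleep only while the ring is empty: FUTEX_WAIT blocks solely if
	 * *kidx still equals the uidx we saw, so a post between the check
	 * and the syscall makes the kernel return immediately - the same
	 * race-free guarantee as the guarded kevent_wait. */
	static void wait_for_events(unsigned int *kidx, unsigned int uidx)
	{
		while (*kidx == uidx)
			syscall(SYS_futex, kidx, FUTEX_WAIT, uidx,
			        NULL, NULL, 0);
	}
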
* Re: [take25 1/6] kevent: Description. 2006-11-24 0:48 ` Eric Dumazet @ 2006-11-24 8:14 ` Andrew Morton 2006-11-24 8:33 ` Eric Dumazet 0 siblings, 1 reply; 200+ messages in thread From: Andrew Morton @ 2006-11-24 8:14 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Fri, 24 Nov 2006 01:48:32 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > > The alternative is the sorry state we have now. In nscd, for instance, > > we have one single thread waiting for incoming connections and it then > > has to wake up a worker thread to handle the processing. This is done > > because we cannot "park" all threads in the accept() call since when a > > new connection is announced _all_ the threads are woken. With the new > > event handling this wouldn't be the case, one thread only is woken and > > we don't have to wake worker threads. All threads can be worker threads. > > Having one specialized thread handling the distribution of work to worker > threads is better most of the time. It might be now. Think "commodity 128-way". Your single distribution thread will run out of steam. What Ulrich is proposing is faster. This is a new interface. Let's design it to be fast. ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 8:14 ` Andrew Morton @ 2006-11-24 8:33 ` Eric Dumazet 2006-11-24 15:26 ` Ulrich Drepper 0 siblings, 1 reply; 200+ messages in thread From: Eric Dumazet @ 2006-11-24 8:33 UTC (permalink / raw) To: Andrew Morton Cc: Ulrich Drepper, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Andrew Morton wrote: > On Fri, 24 Nov 2006 01:48:32 +0100 > Eric Dumazet <dada1@cosmosbay.com> wrote: > >>> The alternative is the sorry state we have now. In nscd, for instance, >>> we have one single thread waiting for incoming connections and it then >>> has to wake up a worker thread to handle the processing. This is done >>> because we cannot "park" all threads in the accept() call since when a >>> new connection is announced _all_ the threads are woken. With the new >>> event handling this wouldn't be the case, one thread only is woken and >>> we don't have to wake worker threads. All threads can be worker threads. >> Having one specialized thread handling the distribution of work to worker >> threads is better most of the time. > > It might be now. Think "commodity 128-way". Your single distribution thread > will run out of steam. > > What Ulrich is proposing is faster. This is a new interface. Let's design > it to be fast. Hum... I guess you didn't read my mail... I basically agree with Ulrich. I just wanted to say that a fast application cannot rely only on a "let's park N threads waiting for a single event in this queue" approach, and hope the kernel will be smart for us. Even with 128 ways, you still hit a central point of coordination (it can be a mutex in kevent code, an atomic uidx in userland, or whatever) for a 'kevent queue'. Once you have paid for the cache-line ping-pong, you won't be *fast*. I hope *you* don't think of kevent as only dispatching trivial HTTP 1.0 web requests. Being able to direct a particular request on a particular CPU is certainly something that cannot be hardcoded in 'the new kevent interface'. Eric ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 8:33 ` Eric Dumazet @ 2006-11-24 15:26 ` Ulrich Drepper 0 siblings, 0 replies; 200+ messages in thread From: Ulrich Drepper @ 2006-11-24 15:26 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Jeff Garzik, Evgeniy Polyakov, David Miller, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel Eric Dumazet wrote: > Being able to direct a particular request on a particular CPU is > certainly something that cannot be hardcoded in 'the new kevent interface'. Nobody is proposing this. Although I have proposed that if the kernel knows which CPU can best service a request it might hint as much. But in general, you're free to decentralize as much as you want. But this does not mean it should not also be possible to use a number of threads in the same loop and the same kevent queue. That's the part which needs designing, the separate queues will always be possible. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:48 ` Jeff Garzik 2006-11-23 23:45 ` Ulrich Drepper @ 2006-11-24 0:14 ` Hans Henrik Happe 1 sibling, 0 replies; 200+ messages in thread From: Hans Henrik Happe @ 2006-11-24 0:14 UTC (permalink / raw) To: Jeff Garzik Cc: Ulrich Drepper, Evgeniy Polyakov, David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel On Thursday 23 November 2006 23:48, Jeff Garzik wrote: > I'm really wondering whether designing for N-threads-to-1-ring is the wisest > choice. > > Considering current designs, it seems more likely that a single thread > polls for socket activity, then dispatches work. How often do you > really see in userland multiple threads polling the same set of fds, > then fighting to decide who will handle raised events? They should not fight, but gently divide event handling work. > More likely, you will see "prefork" (start N threads, each with its own > ring) One ring could be more busy than others, leaving all the work to one thread. > or a worker pool (single thread receives events, then dispatches > to multiple threads for execution) or even one-thread-per-fd (single > thread receives events, then starts new thread for handling). This is more like fighting :-) It adds context switches and therefore extra latency for event handling. > If you have multiple threads accessing the same ring -- a poor design > choice -- I would think the burden should be on the application, to > provide proper synchronization. Coming from the HPC world I do not agree. Context switches should be avoided. This paper is a good example from the HPC world: http://cobweb.ecn.purdue.edu/~vpai/Publications/majumder-lacsi04.pdf. The latency problems introduced by context switches in this work calls for even more functionality in event handling. I will not go into details now. There are enough problems with kevent's current feature set and I believe these extra features can be added later without breaking the API. -- Hans Henrik Happe ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-23 22:33 ` Ulrich Drepper 2006-11-23 22:48 ` Jeff Garzik @ 2006-11-24 12:05 ` Evgeniy Polyakov 2006-11-24 12:13 ` Evgeniy Polyakov 2006-11-27 19:43 ` Ulrich Drepper 1 sibling, 2 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 12:05 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >+ int kevent_commit(int ctl_fd, unsigned int start, > >+ unsigned int num, unsigned int over); > > I think we can simplify this interface: > > int kevent_commit(int ctl_fd, unsigned int new_tail, > unsigned int over); > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > pointer would be incremented (modulo wrap-around) and is not higher than > the current front pointer. The test will be a bit complicated but not > more so than what the current code has to do to check for mistakes. > > This approach has the advantage that the commit calls don't have to be > synchronized. If one thread sets the tail pointer to, say, 10 and > another to 12, then it does not matter whether the first thread is > delayed. If it is eventually executed the result is simply a no-op, > since the second thread's action supersedes it. > > Maybe the current form is even impossible to use with explicit locking > at userlevel. What if one thread, which is about to call kevent_commit, > is indefinitely delayed? Then this commit request's value is never > taken into account and the tail pointer is always short of what it > should be. I like this interface, although the current one does not allow special synchronization in userspace, since it calculates whether a new commit is in the area where a previous commit was. Will change for the next release. > There is one more thing to consider. Oftentimes the commit request will > be immediately followed by a kevent_wait call. It would be good to > merge this pair of calls. The two parameters new_tail and over could > also be passed to the kevent_wait call and the commit can happen before > the thread looks for new events and eventually goes to sleep. If this > can be implemented then the kevent_commit syscall by itself might not be > needed at all. Instead you'd call kevent_wait() and make the maximum > number of events which can be returned zero. It _IS_ how the previous interface worked. EXACTLY! There was one syscall which committed the requested number of events and waited until there were new ready events. The only thing it missed was the userspace index (it assumed that if userspace waits for something, then all previous work is done). Ulrich, I'm not going to think for other people all over the world and blindly implement ideas which in a day or two will be commented on as redundant, because the flow of mind has changed and people did not have enough time to check the previous version. I will wait for some time until you and other people have made your comments on the interfaces, and release a final version in about a week; now I will go to hack netchannels. NO INTERFACE CHANGES AFTER THAT DAY. COMPLETELY. So, feel free to think about a perfect interface anyone will be happy with. But please release your thoughts not in the form of abstract words, but more precisely, at least like in this e-mail, so I can understand what _you_ want from _your_ interface. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 12:05 ` Evgeniy Polyakov @ 2006-11-24 12:13 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-24 12:13 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Fri, Nov 24, 2006 at 03:05:31PM +0300, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > > Evgeniy Polyakov wrote: > > >+ int kevent_commit(int ctl_fd, unsigned int start, > > >+ unsigned int num, unsigned int over); > > > > I think we can simplify this interface: > > > > int kevent_commit(int ctl_fd, unsigned int new_tail, > > unsigned int over); > > > > The kernel sets the ring_uidx value to the 'new_tail' value if the tail > > pointer would be incremented (modulo wrap-around) and is not higher than > > the current front pointer. The test will be a bit complicated but not > > more so than what the current code has to do to check for mistakes. > > > > This approach has the advantage that the commit calls don't have to be > > synchronized. If one thread sets the tail pointer to, say, 10 and > > another to 12, then it does not matter whether the first thread is > > delayed. If it is eventually executed the result is simply a no-op, > > since the second thread's action supersedes it. > > > > Maybe the current form is even impossible to use with explicit locking > > at userlevel. What if one thread, which is about to call kevent_commit, > > is indefinitely delayed? Then this commit request's value is never > > taken into account and the tail pointer is always short of what it > > should be. > > I like this interface, although the current one does not allow special ...does not require... > synchronization in userspace, since it calculates whether a new commit is in > the area where a previous commit was. > Will change for the next release. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-24 12:05 ` Evgeniy Polyakov 2006-11-24 12:13 ` Evgeniy Polyakov @ 2006-11-27 19:43 ` Ulrich Drepper 2006-11-28 10:26 ` Evgeniy Polyakov 1 sibling, 1 reply; 200+ messages in thread From: Ulrich Drepper @ 2006-11-27 19:43 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Evgeniy Polyakov wrote: > It _IS_ how the previous interface worked. > > EXACTLY! No, the old interface committed everything, not only up to a given index. This is the huge difference which makes or breaks it. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ ^ permalink raw reply [flat|nested] 200+ messages in thread
* Re: [take25 1/6] kevent: Description. 2006-11-27 19:43 ` Ulrich Drepper @ 2006-11-28 10:26 ` Evgeniy Polyakov 0 siblings, 0 replies; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-28 10:26 UTC (permalink / raw) To: Ulrich Drepper Cc: David Miller, Andrew Morton, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik On Mon, Nov 27, 2006 at 11:43:46AM -0800, Ulrich Drepper (drepper@redhat.com) wrote: > Evgeniy Polyakov wrote: > >It _IS_ how the previous interface worked. > > > > EXACTLY! > > No, the old interface committed everything, not only up to a given index. > This is the huge difference which makes or breaks it. The interface was the same - the logic behind it was different; the only thing required was to add the consumer's index - that is all, no need to change a lot of declarations, userspace and so on - just use the existing interface and extend its functionality. But it does not matter anymore; later this week I will collect all proposed changes and implement (hopefully) the last release, which will close most of the questions regarding userspace interfaces (except the signal mask, which is still in flux), so we can concentrate on internals and/or new kernel users. > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, > CA ❖ -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 200+ messages in thread
* [take26 0/8] kevent: Generic event handling mechanism.
       [not found] <1154985aa0591036@2ka.mipt.ru>
                   ` (4 preceding siblings ...)
  2006-11-21 16:29 ` [take25 " Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14   ` [take26 1/8] kevent: Description Evgeniy Polyakov
  5 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications. It supports
both level- and edge-triggered events. It is similar to poll/epoll in some
cases, but it is more scalable, faster and works with essentially any kind
of event. Events are provided to the kernel through a control syscall and
can be read back through a ring buffer or using the usual syscalls. A kevent
update (i.e. readiness switching) happens directly from the internals of the
appropriate state machine of the underlying subsystem (like network,
filesystem, timer or any other). A minimal consumer loop illustrating this
flow is sketched after the changelog below.

Homepage: http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Documentation page (will update Dec 1): http://linux-net.osdl.org/index.php/Kevent

I installed a slightly used but still functional remote mind reader (bought
on eBay) and set it up to read Ulrich's alpha brain waves (I hope he agrees
that it is a good decision), which took me the whole week. So I think the
last ring buffer implementation is what we all wanted. Details are in the
documentation part. It seems the setup was correct and we finally found what
we wanted from the interface part.

Changes from 'take25' patchset:
 * use timespec as timeout parameter.
 * added high-resolution timer to handle absolute timeouts.
 * added flags to waiting and initialization syscalls.
 * kevent_commit() has new_uidx parameter.
 * kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
	results in immediate wakeup (useful for the case when entries are
	added asynchronously from the kernel (not supported for now)).
 * added interface to mark any event as ready.
 * POSIX timers event support.
 * return -ENOSYS if there is no registered event type.
 * provided file descriptor must be checked for fifo type (spotted by
	Eric Dumazet).
 * documentation update.
 * lighttpd patch updated (the latest benchmarks with the lighttpd patch
	can be found in the blog).

Changes from 'take24' patchset:
 * new (old (new)) ring buffer implementation with kernel and user indexes.
 * added initialization syscall instead of opening /dev/kevent
 * kevent_commit() syscall to commit ring buffer entries
 * changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; kevent
	always wakes only the first thread if that flag is not set
 * KEVENT_REQ_ALWAYS_QUEUE flag. If set, the kevent will be queued into
	the ready queue instead of being copied back to userspace when it
	is ready immediately at addition time.
 * lighttpd patch (Hail!
Although nothing really outstanding compared to epoll)

Changes from 'take23' patchset:
 * kevent PIPE notifications
 * KEVENT_REQ_LAST_CHECK flag, which allows performing the last check at
	dequeueing time
 * fixed poll/select notifications (were broken due to tree manipulations)
 * made Documentation/kevent.txt look nice in an 80-col terminal
 * fix for copy_to_user() failure report for the first kevent (Andrew Morton)
 * minor function renames

Changes from 'take22' patchset:
 * new ring buffer implementation in process' memory
 * wakeup-one-thread flag
 * edge-triggered behaviour

Changes from 'take21' patchset:
 * minor cleanups (different return values, removed unneeded variables,
	whitespace and so on)
 * fixed bug in kevent removal in the case when the kevent being removed
	is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
 * new ring buffer implementation
 * removed artificial limit on the possible number of kevents

Changes from 'take19' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table. At least for a web server, the
	frequency of addition/deletion of new kevents is comparable with the
	number of search accesses, i.e. most of the time events are added,
	accessed only a couple of times and then removed, which justifies RB
	tree usage over AVL tree, since the latter has much slower deletion
	time (max O(log(N)) compared to 3 ops), although faster search time
	(1.44*O(log(N)) vs. 2*O(log(N))). So for kevents I use an RB tree for
	now and later, when my AVL tree implementation is ready, it will be
	possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second
compared to 2200, sometimes 2500 req/second for epoll() for a trivial
web-server and httperf client on the same hardware.
It is possible that the above kevent limit is due to the maximum allowed
kevents in a time limit, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
	calculation
 * export kevent_socket_notify(), since it is used in network protocols
	which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers; this forces a timer
	API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has
	been updated)
 * added warning in kevent_add_ukevent() if ring has broken index
	(for testing)

Changes from 'take14' patchset:
 * added kevent_wait()
	This syscall waits until either the timeout expires or at least one
	event becomes ready. It also commits that @num events from @start
	are processed by userspace and thus can be removed or rearmed
	(depending on their flags). It can be used to commit events read by
	userspace through the mmap interface. Example userspace code
	(evtest.c) can be found on the project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not take the lock around the user data check in __kevent_search()
 * fail early if there are no registered callbacks for the given type
	of kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res timer
	if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into the user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so on)
 * some whitespace cleanups
 * check for ready_callback() callback before the main loop, which should
	save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80-column comment issues
 * added a header shared between userspace and kernelspace instead of
	embedding them in one
 * core restructuring to remove forward declarations
 * s o m e   w h i t e s p a c e   c o d y n g   s t y l e   c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be acked)
	- use nopage() method to dynamically substitute pages
	- allocate a new page for events only when a newly added kevent
	  requires it
	- do not use ugly index dereferencing, use a structure instead
	- reduced amount of data in the ring (id and flags), maximum
	  12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning to detect that an entry is in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is
	not turned on
 * do not use internal socket structures, use appropriate (exported)
	wrappers instead
 * removed default 1 second timeout
 * removed AIO stuff from patchset

Changes from 'take4' patchset:
 * use miscdevice instead of chardevice
 * comment fixes

Changes from 'take3' patchset:
 * removed serializing mutex from kevent_user_wait()
 * moved storage list processing to RCU
 * removed lockdep screaming - all storage locks are initialized in the
	same function, so it was taught to differentiate between the
	various cases
 * remove kevent from storage if it is marked as broken after callback
 * fixed a typo in the mmaped buffer implementation which would end up
	in wrong index calculation

Changes from 'take2' patchset:
 * split kevent_finish_user() into locked and unlocked variants
 * do not use KEVENT_STAT ifdefs, use inline functions instead
 * use an array of callbacks for each type instead of per-kevent callback
	initialization
 * changed name of ukevent guarding lock
 * use only one kevent lock in kevent_user for all hash buckets instead
	of per-bucket locks
 * do not use kevent_user_ctl structure, instead provide needed arguments
	as syscall parameters
 * various indent cleanups
 * added optimisation aimed to help when a lot of kevents are being
	copied from userspace
 * mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
 - rebased against 2.6.18-git tree
 - removed ioctl controlling
 - added new syscall kevent_get_events(int fd, unsigned int min_nr,
	unsigned int max_nr, unsigned int timeout, void __user *buf,
	unsigned flags)
 - use old syscall kevent_ctl for creation/removing, modification and
	initial kevent initialization
 - use mutexes instead of semaphores
 - added file descriptor check and return error if provided descriptor
	does not match kevent file operations
 - various indent fixes
 - removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 200+ messages in thread
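The consumer loop referenced in the announcement above might look as follows.
Everything in this sketch is an assumption for illustration: the syscall
numbers are the x86-64 ones assigned by this patchset (282-284), the
structures come from the patch's linux/ukevent.h, glibc has no wrappers so
raw syscall(2) is used, and since the patch declares the kevent_wait()
timeout as a struct timespec passed by value, the sketch passes tv_sec and
tv_nsec as two register arguments. It has not been tested against the patch.

	/* Minimal kevent ring-buffer consumer; a sketch only. */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdlib.h>
	#include <time.h>
	#include <linux/ukevent.h>	/* from this patchset, not mainline */

	#define __NR_kevent_wait	282	/* x86-64 numbers from this patch */
	#define __NR_kevent_commit	283
	#define __NR_kevent_init	284

	#define RING_SIZE		256

	int main(void)
	{
		struct kevent_ring *ring;
		unsigned int uidx = 0, over = 0;
		struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
		int fd;

		/* Ring buffer lives in our memory: header plus RING_SIZE events. */
		ring = calloc(1, sizeof(*ring) + RING_SIZE * sizeof(struct ukevent));
		if (!ring)
			return 1;

		fd = syscall(__NR_kevent_init, ring, RING_SIZE, 0);
		if (fd < 0)
			return 1;

		/* ... add events here with kevent_ctl(fd, KEVENT_CTL_ADD, ...) ... */

		for (;;) {
			/* Kernel fills ring->event[] and advances ring->ring_kidx;
			 * the timespec is passed by value, i.e. in two registers
			 * here (an assumption of this sketch). */
			syscall(__NR_kevent_wait, fd, 0, uidx, ts.tv_sec, ts.tv_nsec, 0);

			/* Consume everything between our index and the kernel's. */
			while (uidx != ring->ring_kidx) {
				struct ukevent *e = &ring->event[uidx];

				/* ... dispatch on e->type / e->event / e->user ... */
				(void)e;

				if (++uidx == RING_SIZE) {
					uidx = 0;
					over++;	/* mirrors ring->ring_over */
				}
			}

			/* Mark the consumed slots reusable. */
			syscall(__NR_kevent_commit, fd, uidx, over);
		}
	}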
* [take26 1/8] kevent: Description.
  2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov
@ 2006-11-30 19:14   ` Evgeniy Polyakov
  2006-11-30 19:14     ` [take26 2/8] kevent: Core files Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Description.

diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..2e03a3f
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,240 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size,
+	unsigned int flags);
+
+ring - pointer to allocated ring buffer
+ring_size - size of the ring buffer in events
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+   unsigned int ring_kidx, ring_over;
+   struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+	when kevent_wait() or kevent_get_events() is called
+ring_over - number of overflows of ring_uidx that have happened since the
+	start. The overflow counter is used to prevent the situation where
+	two threads are going to free the same events, but one of them was
+	scheduled away for so long that the ring indexes wrapped; when that
+	thread is awakened, it would otherwise free events other than the
+	ones it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on the project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready, since if it is
+removed, no one cares about it (otherwise the user would wait until it
+becomes ready and get it the usual way using kevent_get_events() or
+kevent_wait()), and thus there is no need to copy it to the ring buffer.
+
+-------------------------------------------------------------------------------
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is created by the kevent_init() syscall.
+
+cmd - is the requested operation. It can be one of the following:
+    KEVENT_CTL_ADD - add event notification
+    KEVENT_CTL_REMOVE - remove event notification
+    KEVENT_CTL_MODIFY - modify existing notification
+    KEVENT_CTL_READY - mark existing events as ready; if the number of
+	events is zero, it just wakes up a thread parked in a kevent syscall
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+	struct timespec timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+	waiting for
+max_nr - number of struct ukevent in buf
+timeout - time to wait before returning with less than min_nr events.
+	If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait up to the given timeout for at least min_nr
+completed events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function copies events into the ring buffer if it was initialized; if
+the ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in the
+ret_flags field.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+	struct timespec timeout, unsigned int flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+old_uidx - the last index the user is aware of
+timeout - time to wait until there is free space in the kevent queue
+flags - various flags, see KEVENT_FLAGS_* definitions.
+
+Return value:
+ number of events copied into the ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready. It also copies events into the special ring buffer. If the
+ring buffer is full, it waits until there are ready events and then returns.
+If a kevent is a one-shot kevent, it is removed in this syscall.
+If a kevent is edge-triggered (KEVENT_REQ_ET flag is set in 'req_flags'), it
+is requeued in this syscall for performance reasons.
+-------------------------------------------------------------------------------
+
+ int kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);
+
+ctl_fd - file descriptor referring to the kevent queue
+new_uidx - the last committed kevent
+over - overflow count for the given new_uidx value
+
+Return value:
+ number of committed kevents or negative error value.
+
+This function commits, i.e. marks as empty, slots in the ring buffer, so
+they can be reused after userspace has completed processing those entries.
+
+The overflow counter is used to prevent the situation where two threads are
+going to free the same events, but one of them was scheduled away for so long
+that the ring indexes wrapped; when that thread is awakened, it would
+otherwise free events other than the ones it was supposed to free.
+
+It is possible that the returned number of committed events is smaller than
+the requested number - this can happen when several threads try to commit
+the same events.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+    Id of this request, e.g.
socket number, file descriptor and so on
+__u32 type
+    Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+    Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+    Per-event request flags,
+
+    KEVENT_REQ_ONESHOT
+	event will be removed when it is ready
+
+    KEVENT_REQ_WAKEUP_ALL
+	Kevent wakes up only the first thread interested in the given event,
+	or all threads if this flag is set.
+
+    KEVENT_REQ_ET
+	Edge Triggered behaviour. It is an optimisation which allows a ready
+	and dequeued (i.e. copied to userspace) event to be moved back into
+	the set of interest for the given storage (socket, inode and so on).
+	It is very useful for cases when the same event should be used many
+	times (like reading from a pipe). It is similar to epoll()'s
+	EPOLLET flag.
+
+    KEVENT_REQ_LAST_CHECK
+	if set, allows performing the last check on a kevent (calling the
+	appropriate callback) when the kevent is marked as ready and has
+	been removed from the ready queue. If it is confirmed that the
+	kevent is ready (k->callbacks.callback(k) returns true) then the
+	kevent will be copied to userspace, otherwise it will be requeued
+	back to the storage. The second (checking) call is performed with
+	this bit cleared, so the callback can detect whether it was called
+	from kevent_storage_ready() - bit is set, or
+	kevent_dequeue_ready() - bit is cleared. If the kevent is requeued,
+	the bit is set again.
+
+    KEVENT_REQ_ALWAYS_QUEUE
+	If this flag is set, a kevent that is ready at enqueue time will be
+	queued into the ready queue; otherwise such a kevent is copied back
+	to userspace immediately and is not queued into the storage.
+
+__u32 ret_flags
+    Per-event return flags
+
+    KEVENT_RET_BROKEN
+	Kevent is broken
+
+    KEVENT_RET_DONE
+	Kevent processing was finished successfully
+
+    KEVENT_RET_COPY_FAILED
+	Kevent was not copied into the ring buffer due to some error
+	condition.
+
+__u32 ret_data
+    Event return data. The event originator fills it with anything it likes
+    (for example, timer notifications put there the number of milliseconds
+    elapsed when the timer fired).
+
+union { __u32 user[2]; void *ptr; }
+    User's data. It is not used by the kernel, just copied to/from user.
+    The whole structure is aligned to 8 bytes already, so the last union
+    is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Kevent waiting syscall flags.
+
+KEVENT_FLAGS_ABSTIME - the provided timespec parameter contains absolute time,
+	for example Aug 27, 2194, or time(NULL) + 10.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns, each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+a match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the
+passed-in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed-in
+ukevent's ret_flags has KEVENT_RET_BROKEN set.
KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+    type - KEVENT_TIMER
+    event - KEVENT_TIMER_FIRED
+    req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+    id.raw[0] - number of seconds after commit when this timer should expire
+    id.raw[1] - number of nanoseconds in addition to the seconds

^ permalink raw reply related	[flat|nested] 200+ messages in thread
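For concreteness, the kevent_timer usage case above maps onto a ukevent like
this. The helper name and the chosen values are illustrative only, and
<linux/ukevent.h> refers to the header added by this patchset:

	/* Fill a ukevent requesting a one-shot timer firing after
	 * 2 s + 500000000 ns; a sketch following the usage case above. */
	#include <string.h>
	#include <linux/ukevent.h>	/* from this patchset, not mainline */

	static void fill_timer_request(struct ukevent *uk)
	{
		memset(uk, 0, sizeof(*uk));
		uk->type = KEVENT_TIMER;
		uk->event = KEVENT_TIMER_FIRED;
		uk->req_flags = KEVENT_REQ_ONESHOT;	/* fire once, then remove */
		uk->id.raw[0] = 2;			/* seconds */
		uk->id.raw[1] = 500000000;		/* plus nanoseconds */
		uk->user[0] = 42;			/* opaque cookie, copied back */
	}

The request would then be queued with kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk),
and uk.ret_flags checked for KEVENT_RET_BROKEN / KEVENT_RET_DONE on return.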
* [take26 2/8] kevent: Core files.
  2006-11-30 19:14   ` [take26 1/8] kevent: Description Evgeniy Polyakov
@ 2006-11-30 19:14     ` Evgeniy Polyakov
  2006-11-30 19:14       ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Core files.

This patch includes core kevent files:
 * userspace controlling
 * kernelspace interfaces
 * initialization
 * notification state machines

Some bits of documentation can be found on project's homepage (and links
from there): http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a6221c2 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,8 @@ ENTRY(sys_call_table)
 	.long sys_vmsplice
 	.long sys_move_pages
 	.long sys_getcpu
+	.long sys_kevent_get_events
+	.long sys_kevent_ctl		/* 320 */
+	.long sys_kevent_wait
+	.long sys_kevent_commit
+	.long sys_kevent_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..dda2168 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,13 @@ ia32_sys_call_table:
 	.quad compat_sys_get_robust_list
 	.quad sys_splice
 	.quad sys_sync_file_range
-	.quad sys_tee
+	.quad sys_tee			/* 315 */
 	.quad compat_sys_vmsplice
 	.quad compat_sys_move_pages
 	.quad sys_getcpu
+	.quad sys_kevent_get_events
+	.quad sys_kevent_ctl		/* 320 */
+	.quad sys_kevent_wait
+	.quad sys_kevent_commit
+	.quad sys_kevent_init
 ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..57a6b8c 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,15 @@
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
 #define __NR_getcpu		318
+#define __NR_kevent_get_events	319
+#define __NR_kevent_ctl		320
+#define __NR_kevent_wait	321
+#define __NR_kevent_commit	322
+#define __NR_kevent_init	323
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 319
+#define NR_syscalls 324
 
 #include <linux/err.h>
 
 /*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..17d750d 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,20 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events	280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_commit	283
+__SYSCALL(__NR_kevent_commit, sys_kevent_commit)
+#define __NR_kevent_init	284
+__SYSCALL(__NR_kevent_init, sys_kevent_init)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_init
 
 #include <linux/err.h>
 
 #ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..3469435
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,238 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+	struct kevent_user	*user;
+	/* Kevent container. */
+	struct kevent_storage	*st;
+
+	struct kevent_callbacks	callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select storage has a list of wait_queue_t containers
+	 * for each ->poll() { poll_wait()' } here.
+	 */
+	void			*priv;
+};
+
+struct kevent_user
+{
+	struct rb_root		kevent_root;
+	spinlock_t		kevent_lock;
+	/* Number of queued kevents. */
+	unsigned int		kevent_num;
+
+	/* List of ready kevents. */
+	struct list_head	ready_list;
+	/* Number of ready kevents. */
+	unsigned int		ready_num;
+	/* Protects all manipulations with ready queue. */
+	spinlock_t		ready_lock;
+
+	/* Protects against simultaneous kevent_user control manipulations. */
+	struct mutex		ctl_mutex;
+	/* Wait until some events are ready. */
+	wait_queue_head_t	wait;
+	/* Exit from syscall if someone wants us to do it */
+	int			need_exit;
+
+	/* Reference counter, increased for each new kevent. */
+	atomic_t		refcnt;
+
+	/* Mutex protecting userspace ring buffer. */
+	struct mutex		ring_lock;
+	/* Kernel index and size of the userspace ring buffer. */
+	unsigned int		kidx, uidx, ring_size, ring_over, full;
+	/* Pointer to userspace ring buffer. */
+	struct kevent_ring __user	*pring;
+
+	/* Is used for absolute waiting times.
	 */
+	struct hrtimer		timer;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+	unsigned long		im_num;
+	unsigned long		wait_num, ring_num;
+	unsigned long		total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+void kevent_ready(struct kevent *k, int ret);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+	u->wait_num = u->im_num = u->total = u->ring_num = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+	printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n",
+			__func__, u, u->wait_num, u->ring_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+	u->im_num++;
+}
+static inline void kevent_stat_ring(struct kevent_user *u)
+{
+	u->ring_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+	u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+	u->total++;
+}
+#else
+#define kevent_stat_print(u)	({ (void) u;})
+#define kevent_stat_init(u)	({ (void) u;})
+#define kevent_stat_im(u)	({ (void) u;})
+#define kevent_stat_wait(u)	({ (void) u;})
+#define kevent_stat_ring(u)	({ (void) u;})
+#define kevent_stat_total(u)	({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk)	sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk)	({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_POLL
+static inline void kevent_init_file(struct file *file)
+{
+	kevent_storage_init(file, &file->st);
+}
+
+static inline void kevent_cleanup_file(struct file *file)
+{
+	kevent_storage_fini(&file->st);
+}
+#else
+static inline void kevent_init_file(struct file *file) {}
+static inline void kevent_cleanup_file(struct file *file) {}
+#endif
+
+#ifdef CONFIG_KEVENT_PIPE
+extern void kevent_pipe_notify(struct inode *inode, u32 events);
+#else
+static inline void kevent_pipe_notify(struct inode *inode, u32 events) {}
+#endif
+
+#ifdef CONFIG_KEVENT_SIGNAL
+extern int kevent_signal_notify(struct task_struct *tsk, int sig);
+#else
+static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;}
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+	void			*origin;	/* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+	struct list_head	list;		/* List of queued kevents. */
+	spinlock_t		lock;		/* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..7574ec3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,8 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct ukevent;
+struct kevent_ring;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -599,4 +601,11 @@ asmlinkage long sys_set_robust_list(stru
 				    size_t len);
 asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node,
 			   struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+		struct timespec timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
+		struct timespec timeout, unsigned int flags);
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);
+asmlinkage long sys_kevent_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num, unsigned int flags);
 
 #endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..5201bc4
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,183 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+#include <linux/types.h>
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then remove it. */
+#define KEVENT_REQ_ONESHOT	0x1
+/* Kevent wakes up only first thread interested in given event,
+ * or all threads if this flag is set.
+ */
+#define KEVENT_REQ_WAKEUP_ALL	0x2
+/* Edge Triggered behaviour. */
+#define KEVENT_REQ_ET		0x4
+/* Perform the last check on kevent (call appropriate callback) when
+ * kevent is marked as ready and has been removed from ready queue.
+ * If it will be confirmed that kevent is ready
+ * (k->callbacks.callback(k) returns true) then kevent will be copied
+ * to userspace, otherwise it will be requeued back to storage.
+ * Second (checking) call is performed with this bit _cleared_ so
+ * callback can detect when it was called from
+ * kevent_storage_ready() - bit is set, or
+ * kevent_dequeue_ready() - bit is cleared.
+ * If kevent will be requeued, bit will be set again.
 */
+#define KEVENT_REQ_LAST_CHECK	0x8
+/*
+ * Always queue kevent even if it is immediately ready.
+ */
+#define KEVENT_REQ_ALWAYS_QUEUE	0x10
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN	0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE		0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED	0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET		0
+#define KEVENT_INODE		1
+#define KEVENT_TIMER		2
+#define KEVENT_POLL		3
+#define KEVENT_NAIO		4
+#define KEVENT_AIO		5
+#define KEVENT_PIPE		6
+#define KEVENT_SIGNAL		7
+#define KEVENT_POSIX_TIMER	8
+#define KEVENT_MAX		9
+
+/*
+ * Per-type event sets.
+ * The number of per-event sets should exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define	KEVENT_TIMER_FIRED	0x1
+
+/*
+ * Socket/network asynchronous IO and PIPE events.
+ */
+#define	KEVENT_SOCKET_RECV	0x1
+#define	KEVENT_SOCKET_ACCEPT	0x2
+#define	KEVENT_SOCKET_SEND	0x4
+
+/*
+ * Inode events.
+ */
+#define	KEVENT_INODE_CREATE	0x1
+#define	KEVENT_INODE_REMOVE	0x2
+
+/*
+ * Poll events.
+ */
+#define	KEVENT_POLL_POLLIN	0x0001
+#define	KEVENT_POLL_POLLPRI	0x0002
+#define	KEVENT_POLL_POLLOUT	0x0004
+#define	KEVENT_POLL_POLLERR	0x0008
+#define	KEVENT_POLL_POLLHUP	0x0010
+#define	KEVENT_POLL_POLLNVAL	0x0020
+
+#define	KEVENT_POLL_POLLRDNORM	0x0040
+#define	KEVENT_POLL_POLLRDBAND	0x0080
+#define	KEVENT_POLL_POLLWRNORM	0x0100
+#define	KEVENT_POLL_POLLWRBAND	0x0200
+#define	KEVENT_POLL_POLLMSG	0x0400
+#define	KEVENT_POLL_POLLREMOVE	0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define	KEVENT_AIO_BIO		0x1
+
+/*
+ * Signal events.
+ */
+#define	KEVENT_SIGNAL_DELIVERY	0x1
+
+/* If set in raw64, then given signals will not be delivered
+ * in a usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK	0x8000000000000000ULL
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL		0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY	0x0
+
+struct kevent_id
+{
+	union {
+		__u32		raw[2];
+		__u64		raw_u64 __attribute__((aligned(8)));
+	};
+};
+
+struct ukevent
+{
+	/* Id of this request, e.g. socket number, file descriptor and so on... */
+	struct kevent_id	id;
+	/* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+	__u32			type;
+	/* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+	__u32			event;
+	/* Per-event request flags */
+	__u32			req_flags;
+	/* Per-event return flags */
+	__u32			ret_flags;
+	/* Event return data. Event originator fills it with anything it likes. */
+	__u32			ret_data[2];
+	/* User's data. It is not used, just copied to/from user.
+	 * The whole structure is aligned to 8 bytes already, so the last union
+	 * is aligned properly.
+	 */
+	union {
+		__u32		user[2];
+		void		*ptr;
+	};
+};
+
+struct kevent_ring
+{
+	unsigned int		ring_kidx, ring_over;
+	struct ukevent		event[0];
+};
+
+#define	KEVENT_CTL_ADD		0
+#define	KEVENT_CTL_REMOVE	1
+#define	KEVENT_CTL_MODIFY	2
+#define	KEVENT_CTL_READY	3
+
+/* Provided timespec parameter uses absolute time, i.e. 'wait until Aug 27, 2194' */
+#define	KEVENT_FLAGS_ABSTIME	1
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
 	  such as SELinux.  To use audit's filesystem watch feature, please
 	  ensure that INOTIFY is configured.
 
+source "kernel/kevent/Kconfig"
+
 config IKCONFIG
 	bool "Kernel .config support"
 	---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..4b137ee
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,60 @@
+config KEVENT
+	bool "Kernel event notification mechanism"
+	help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), AIO callback
+	  invocations, advanced timer notifications and other kernel
+	  object status changes.
+
+config KEVENT_USER_STAT
+	bool "Kevent user statistic"
+	depends on KEVENT
+	help
+	  This option will turn kevent_user statistic collection on.
+	  Statistic data includes the total number of kevents, the number of
+	  kevents which are ready immediately at insertion time and the
+	  number of kevents which were removed through readiness completion.
+	  It will be printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+	bool "Kernel event notifications for timers"
+	depends on KEVENT
+	help
+	  This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+	bool "Kernel event notifications for poll()/select()"
+	depends on KEVENT
+	help
+	  This option allows using the kevent subsystem for poll()/select()
+	  notifications.
+
+config KEVENT_SOCKET
+	bool "Kernel event notifications for sockets"
+	depends on NET && KEVENT
+	help
+	  This option enables notifications through the KEVENT subsystem of
+	  socket operations, like new packet receiving conditions,
+	  ready-for-accept conditions and so on.
+
+config KEVENT_PIPE
+	bool "Kernel event notifications for pipes"
+	depends on KEVENT
+	help
+	  This option enables notifications through the KEVENT subsystem of
+	  pipe read/write operations.
+
+config KEVENT_SIGNAL
+	bool "Kernel event notifications for signals"
+	depends on KEVENT
+	help
+	  This option enables signal delivery through the KEVENT subsystem.
+	  Signals which were requested to be delivered through the kevent
+	  subsystem must be registered through the usual signal() and other
+	  syscalls; this option allows alternative delivery.
+	  With the KEVENT_SIGNAL_NOMASK flag set in a kevent for a set of
+	  signals, they will not be delivered in the usual way.
+	  Kevents for the appropriate signals are not copied when a process
+	  forks; the new process must add new kevents after fork(). The mask
+	  of signals is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..f98e0c8
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..b0adcdc
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,247 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into the appropriate origin's queue.
+ * Returns a positive value if this event is ready immediately,
+ * a negative value in case of error and zero if the event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+	return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+	return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_BROKEN;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+	struct kevent_callbacks *p;
+
+	if (pos >= KEVENT_MAX)
+		return -EINVAL;
+
+	p = &kevent_registered_callbacks[pos];
+
+	p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+	p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+	p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+	printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+	return 0;
+}
+
+/*
+ * Must be called before the event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() would fail
+ * to add this kevent into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+	spin_lock_init(&k->ulock);
+	k->flags = 0;
+
+	if (unlikely(k->event.type >= KEVENT_MAX)) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	if (!kevent_registered_callbacks[k->event.type].callback) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	k->callbacks = kevent_registered_callbacks[k->event.type];
+	if (unlikely(k->callbacks.callback == kevent_break)) {
+		kevent_break(k);
+		return -ENOSYS;
+	}
+
+	return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when the reference counter for the given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	k->st = st;
+	spin_lock_irqsave(&st->lock, flags);
+	list_add_tail_rcu(&k->storage_entry, &st->list);
+	k->flags |= KEVENT_STORAGE;
+	spin_unlock_irqrestore(&st->lock, flags);
+	return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way and must be
+ * called before the counter is decreased, so the storage itself is still
+ * valid. It is called from the ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&st->lock, flags);
+	if (k->flags & KEVENT_STORAGE) {
+		list_del_rcu(&k->storage_entry);
+		k->flags &= ~KEVENT_STORAGE;
+	}
+	spin_unlock_irqrestore(&st->lock, flags);
+}
+
+void kevent_ready(struct kevent *k, int ret)
+{
+	unsigned long flags;
+	int rem;
+
+	spin_lock_irqsave(&k->ulock, flags);
+	if (ret > 0)
+		k->event.ret_flags |= KEVENT_RET_DONE;
+	else if (ret < 0)
+		k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+	else
+		ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+	rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+	spin_unlock_irqrestore(&k->ulock, flags);
+
+	if (ret) {
+		if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+			list_del_rcu(&k->storage_entry);
+			k->flags &= ~KEVENT_STORAGE;
+		}
+
+		spin_lock_irqsave(&k->user->ready_lock, flags);
+		if (!(k->flags & KEVENT_READY)) {
+			list_add_tail(&k->ready_entry, &k->user->ready_list);
+			k->flags |= KEVENT_READY;
+			k->user->ready_num++;
+		}
+		spin_unlock_irqrestore(&k->user->ready_lock, flags);
+		wake_up(&k->user->wait);
+	}
+}
+
+/*
+ * Call kevent's ready callback and queue it into the ready queue if needed.
+ * If the kevent is marked as one-shot, remove it from the storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+	int ret;
+
+	ret = k->callbacks.callback(k);
+
+	kevent_ready(k, ret);
+
+	return ret;
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * it if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&k->st->lock, flags);
+	__kevent_requeue(k, 0);
+	spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+		kevent_callback_t ready_callback, u32 event)
+{
+	struct kevent *k;
+	int wake_num = 0;
+
+	rcu_read_lock();
+	if (unlikely(ready_callback))
+		list_for_each_entry_rcu(k, &st->list, storage_entry)
+			(*ready_callback)(k);
+
+	list_for_each_entry_rcu(k, &st->list, storage_entry) {
+		if (event & k->event.event)
+			if ((k->event.req_flags & KEVENT_REQ_WAKEUP_ALL) || wake_num == 0)
+				if (__kevent_requeue(k, event))
+					wake_num++;
+	}
+	rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+	spin_lock_init(&st->lock);
+	st->origin = origin;
+	INIT_LIST_HEAD(&st->list);
+	return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (The socket is removed from the file table at this point, for example.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+	kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..3fc2daa
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1344 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static kmem_cache_t *kevent_cache __read_mostly;
+static kmem_cache_t *kevent_user_cache __read_mostly;
+
+static int kevent_debug_abstime;
+
+/*
+ * kevents are pollable; return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct kevent_user *u = file->private_data;
+	unsigned int mask;
+
+	poll_wait(file, &u->wait, wait);
+	mask = 0;
+
+	if (u->ready_num || u->need_exit)
+		mask |= POLLIN | POLLRDNORM;
+	u->need_exit = 0;
+
+	return mask;
+}
+
+static inline unsigned int kevent_ring_space(struct kevent_user *u)
+{
+	if (u->full)
+		return 0;
+
+	return (u->uidx > u->kidx)?
+		(u->uidx - u->kidx):
+		(u->ring_size - (u->kidx - u->uidx));
+}
+
+static inline int kevent_ring_index_inc(unsigned int *pidx, unsigned int size)
+{
+	unsigned int idx = *pidx;
+
+	if (++idx >= size)
+		idx = 0;
+	*pidx = idx;
+	return (idx == 0);
+}
+
+/*
+ * Copies a kevent into the userspace ring buffer if it was initialized.
+ * Returns
+ *  0 on success or if the ring buffer is not used
+ *  -EAGAIN if there was no room for that kevent
+ *  -EFAULT if copy_to_user() failed.
+ *
+ * Must be called with kevent_user->ring_lock held.
+ */
+static int kevent_copy_ring_buffer(struct kevent *k)
+{
+	struct kevent_ring __user *ring;
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+	int err;
+
+	ring = u->pring;
+	if (!ring)
+		return 0;
+
+	if (!kevent_ring_space(u))
+		return -EAGAIN;
+
+	if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) {
+		err = -EFAULT;
+		goto err_out_exit;
+	}
+
+	kevent_ring_index_inc(&u->kidx, u->ring_size);
+
+	if (u->kidx == u->uidx)
+		u->full = 1;
+
+	if (put_user(u->kidx, &ring->ring_kidx)) {
+		err = -EFAULT;
+		goto err_out_exit;
+	}
+
+	return 0;
+
+err_out_exit:
+	spin_lock_irqsave(&k->ulock, flags);
+	k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+	spin_unlock_irqrestore(&k->ulock, flags);
+	return err;
+}
+
+static struct kevent_user *kevent_user_alloc(struct kevent_ring __user *ring, unsigned int num)
+{
+	struct kevent_user *u;
+
+	u = kmem_cache_alloc(kevent_user_cache, GFP_KERNEL);
+	if (!u)
+		return NULL;
+
+	INIT_LIST_HEAD(&u->ready_list);
+	spin_lock_init(&u->ready_lock);
+	kevent_stat_init(u);
+	spin_lock_init(&u->kevent_lock);
+	u->kevent_root = RB_ROOT;
+
+	mutex_init(&u->ctl_mutex);
+	init_waitqueue_head(&u->wait);
+	u->need_exit = 0;
+
+	atomic_set(&u->refcnt, 1);
+
+	mutex_init(&u->ring_lock);
+	u->kidx = u->uidx = u->ring_over = u->full = 0;
+
+	u->pring = ring;
+	u->ring_size = num;
+
+	hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS);
+
+	return u;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file descriptor
+ * is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+	atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+	if (atomic_dec_and_test(&u->refcnt)) {
+		kevent_stat_print(u);
+		hrtimer_cancel(&u->timer);
+		kmem_cache_free(kevent_user_cache, u);
+	}
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+	if (left->raw_u64 > right->raw_u64)
+		return -1;
+
+	if (right->raw_u64 > left->raw_u64)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued from all
+ * lists by that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+	struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+	kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called under u->ready_lock.
+ * This function unlinks kevent from the ready queue.
+ */
+static inline void kevent_unlink_ready(struct kevent *k)
+{
+	list_del(&k->ready_entry);
+	k->flags &= ~KEVENT_READY;
+	k->user->ready_num--;
+}
+
+static void kevent_remove_ready(struct kevent *k)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->ready_lock, flags);
+	if (k->flags & KEVENT_READY)
+		kevent_unlink_ready(k);
+	spin_unlock_irqrestore(&u->ready_lock, flags);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes the kevent from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+	if (deq)
+		kevent_dequeue(k);
+
+	kevent_remove_ready(k);
+
+	kevent_user_put(k->user);
+	call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_node removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+	struct kevent_user *u = k->user;
+	unsigned long flags;
+
+	spin_lock_irqsave(&u->kevent_lock, flags);
+	rb_erase(&k->kevent_node, &u->kevent_root);
+	k->flags &= ~KEVENT_USER;
+	u->kevent_num--;
+	spin_unlock_irqrestore(&u->kevent_lock, flags);
+	kevent_finish_user_complete(k, deq);
+}
+
+static struct kevent *__kevent_dequeue_ready_one(struct kevent_user *u)
+{
+	unsigned long flags;
+	struct kevent *k = NULL;
+
+	if (u->ready_num) {
+		spin_lock_irqsave(&u->ready_lock, flags);
+		if (u->ready_num && !list_empty(&u->ready_list)) {
+			k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+			kevent_unlink_ready(k);
+		}
+		spin_unlock_irqrestore(&u->ready_lock, flags);
+	}
+
+	return k;
+}
+
+static struct kevent *kevent_dequeue_ready_one(struct kevent_user *u)
+{
+	struct kevent *k = NULL;
+
+	while (u->ready_num && !k) {
+		k = __kevent_dequeue_ready_one(u);
+
+		if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&k->ulock, flags);
+			k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
+			spin_unlock_irqrestore(&k->ulock, flags);
+
+			if (!k->callbacks.callback(k)) {
+				spin_lock_irqsave(&k->ulock, flags);
+				k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+				k->event.ret_flags = 0;
+				k->event.ret_data[0] = k->event.ret_data[1] = 0;
+				spin_unlock_irqrestore(&k->ulock, flags);
+				k = NULL;
+			}
+		} else
+			break;
+	}
+
+	return k;
+}
+
+static inline void kevent_copy_ring(struct kevent *k)
+{
+	unsigned long flags;
+
+	if (!k)
+		return;
+
+	if (kevent_copy_ring_buffer(k)) {
+		spin_lock_irqsave(&k->ulock, flags);
+		k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+		spin_unlock_irqrestore(&k->ulock, flags);
+	}
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kevent_dequeue_ready(struct kevent_user *u)
+{
+	struct kevent *k;
+
+	mutex_lock(&u->ring_lock);
+	k = kevent_dequeue_ready_one(u);
+	kevent_copy_ring(k);
+	mutex_unlock(&u->ring_lock);
+
+	return k;
+}
+
+/*
+ * Dequeue one entry from user's ready queue if there is space in the ring buffer.
+ */
+static struct kevent *kevent_dequeue_ready_ring(struct kevent_user *u)
+{
+	struct kevent *k = NULL;
+
+	mutex_lock(&u->ring_lock);
+	if (kevent_ring_space(u)) {
+		k = kevent_dequeue_ready_one(u);
+		kevent_copy_ring(k);
+	}
+	mutex_unlock(&u->ring_lock);
+
+	return k;
+}
+
+static void kevent_complete_ready(struct kevent *k)
+{
+	if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+		/*
+		 * If it is a one-shot kevent, it has been removed already from
+		 * origin's queue, so we can easily free it here.
+		 */
+		kevent_finish_user(k, 1);
+	else if (k->event.req_flags & KEVENT_REQ_ET) {
+		unsigned long flags;
+
+		/*
+		 * Edge-triggered behaviour: clear the event's return data
+		 * so it behaves as a fresh new one.
+ */ + + spin_lock_irqsave(&k->ulock, flags); + k->event.ret_flags = 0; + k->event.ret_data[0] = k->event.ret_data[1] = 0; + spin_unlock_irqrestore(&k->ulock, flags); + } +} + +/* + * Search a kevent inside kevent tree for given ukevent. + */ +static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u) +{ + struct kevent *k, *ret = NULL; + struct rb_node *n = u->kevent_root.rb_node; + int cmp; + + while (n) { + k = rb_entry(n, struct kevent, kevent_node); + cmp = kevent_compare_id(&k->event.id, id); + + if (cmp > 0) + n = n->rb_right; + else if (cmp < 0) + n = n->rb_left; + else { + ret = k; + break; + } + } + + return ret; +} + +/* + * Search and modify kevent according to provided ukevent. + */ +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->ulock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->ulock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Remove kevent which matches provided ukevent. + */ +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + __kevent_finish_user(k, 1); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Detaches userspace control block from file descriptor + * and decrease it's reference counter. + * No new kevents can be added or removed from any list at this point. + */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k; + struct rb_node *n; + + for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) { + k = rb_entry(n, struct kevent, kevent_node); + kevent_finish_user(k, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +/* + * Read requested number of ukevents in one shot. + */ +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg) +{ + struct ukevent *ukev; + + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL); + if (!ukev) + return NULL; + + if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) { + kfree(ukev); + return NULL; + } + + return ukev; +} + +static int kevent_mark_ready(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&u->kevent_lock, flags); + k = __kevent_search(&uk->id, u); + if (k) { + spin_lock(&k->st->lock); + kevent_ready(k, 1); + spin_unlock(&k->st->lock); + err = 0; + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Mark appropriate kevents as ready. + * If number of events is zero just wake up one listener. 
+ */ +static int kevent_user_ctl_ready(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = -EINVAL, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + if (num > u->kevent_num) + return err; + + if (!num) { + u->need_exit = 1; + wake_up(&u->wait); + return 0; + } + + mutex_lock(&u->ctl_mutex); + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_mark_ready(&ukev[i], u); + if (err) { + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_mark_ready(&uk, u); + if (err) { + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = num - rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and modify appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. + */ +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_modify(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Read from userspace all ukevents and remove appropriate kevents. + * If provided number of ukevents is more that threshold, it is faster + * to allocate a room for them and copy in one shot instead of copy + * one-by-one and then process them. 
+ */ +static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + if (num > u->kevent_num) { + err = -EINVAL; + goto out; + } + + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + if (kevent_remove(&ukev[i], u)) + ukev[i].ret_flags |= KEVENT_RET_BROKEN; + ukev[i].ret_flags |= KEVENT_RET_DONE; + } + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent))) + err = -EFAULT; + kfree(ukev); + goto out; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EFAULT; + break; + } + + arg += sizeof(struct ukevent); + } +out: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* + * Queue kevent into userspace control block and increase + * it's reference counter. + */ +static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new) +{ + unsigned long flags; + struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL; + struct kevent *k; + int err = 0, cmp; + + spin_lock_irqsave(&u->kevent_lock, flags); + while (*p) { + parent = *p; + k = rb_entry(parent, struct kevent, kevent_node); + + cmp = kevent_compare_id(&k->event.id, &new->event.id); + if (cmp > 0) + p = &parent->rb_right; + else if (cmp < 0) + p = &parent->rb_left; + else { + err = -EEXIST; + break; + } + } + if (likely(!err)) { + rb_link_node(&new->kevent_node, parent, p); + rb_insert_color(&new->kevent_node, &u->kevent_root); + new->flags |= KEVENT_USER; + u->kevent_num++; + kevent_user_get(u); + } + spin_unlock_irqrestore(&u->kevent_lock, flags); + + return err; +} + +/* + * Add kevent from both kernel and userspace users. + * This function allocates and queues kevent, returns negative value + * on error, positive if kevent is ready immediately and zero + * if kevent has been queued. + */ +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kmem_cache_alloc(kevent_cache, GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + INIT_RCU_HEAD(&k->rcu_head); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + k->user = u; + kevent_stat_total(u); + err = kevent_user_enqueue(u, k); + if (err) { + kmem_cache_free(kevent_cache, k); + goto err_out_exit; + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + kevent_finish_user(k, 0); + goto err_out_exit; + } + + return 0; + +err_out_exit: + if (err < 0) { + uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE; + uk->ret_data[1] = err; + } else if (err > 0) + uk->ret_flags |= KEVENT_RET_DONE; + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * Ready events will replace ones provided by used and number + * of ready events is returned. + * User must check ret_flags field of each ukevent structure + * to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg) +{ + int err, cerr = 0, rnum = 0, i; + void __user *orig = arg; + struct ukevent uk; + + mutex_lock(&u->ctl_mutex); + + err = -EINVAL; + if (num > KEVENT_MIN_BUFFS_ALLOC) { + struct ukevent *ukev; + + ukev = kevent_get_user(num, arg); + if (ukev) { + for (i = 0; i < num; ++i) { + err = kevent_user_add_ukevent(&ukev[i], u); + if (err) { + kevent_stat_im(u); + if (i != rnum) + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent)); + rnum++; + } + } + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent))) + cerr = -EFAULT; + kfree(ukev); + goto out_setup; + } + } + + for (i = 0; i < num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { + kevent_stat_im(u); + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) { + cerr = -EFAULT; + break; + } + orig += sizeof(struct ukevent); + rnum++; + } + } + +out_setup: + if (cerr < 0) { + err = cerr; + goto out_remove; + } + + err = rnum; +out_remove: + mutex_unlock(&u->ctl_mutex); + + return err; +} + +/* Used to wakeup waiting syscalls in case high-resolution timer is used. */ +static int kevent_user_wake(struct hrtimer *timer) +{ + struct kevent_user *u = container_of(timer, struct kevent_user, timer); + + u->need_exit = 1; + wake_up(&u->wait); + + return HRTIMER_NORESTART; +} + + +/* + * In nonblocking mode it returns as many events as possible, but not more than @max_nr. + * In blocking mode it waits until timeout or if at least @min_nr events are ready. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + unsigned int min_nr, unsigned int max_nr, struct timespec timeout, + void __user *buf, unsigned int flags) +{ + struct kevent *k; + int num = 0; + long tm = MAX_SCHEDULE_TIMEOUT; + + if (!(file->f_flags & O_NONBLOCK)) { + if (!timespec_valid(&timeout)) + return -EINVAL; + + if (flags & KEVENT_FLAGS_ABSTIME) { + hrtimer_cancel(&u->timer); + hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS); + u->timer.expires = ktime_set(timeout.tv_sec, timeout.tv_nsec); + u->timer.function = &kevent_user_wake; + hrtimer_start(&u->timer, u->timer.expires, HRTIMER_ABS); + if (unlikely(kevent_debug_abstime == 0)) { + printk(KERN_INFO "kevent: author was wrong, " + "someone uses absolute time in %s, " + "please report to remove this warning.\n", __func__); + kevent_debug_abstime = 1; + } + } else { + tm = timespec_to_jiffies(&timeout); + } + + wait_event_interruptible_timeout(u->wait, + ((u->ready_num >= 1) && kevent_ring_space(u)) || u->need_exit, tm); + } + u->need_exit = 0; + + while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) { + if (copy_to_user(buf + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) { + if (num == 0) + num = -EFAULT; + break; + } + kevent_complete_ready(k); + ++num; + kevent_stat_wait(u); + } + + return num; +} + +struct file_operations kevent_user_fops = { + .release = kevent_user_release, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + switch (cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, num, arg); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, num, arg); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, num, arg); + break; 
+ case KEVENT_CTL_READY: + err = kevent_user_ctl_ready(u, num, arg); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +/* + * Used to get ready kevents from queue. + * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT). + * @min_nr - minimum number of ready kevents. + * @max_nr - maximum number of ready kevents. + * @timeout - time to wait until some events are ready. + * @buf - buffer to place ready events. + * @flags - various flags (see include/linux/ukevent.h KEVENT_FLAGS_*). + */ +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, + struct timespec timeout, struct ukevent __user *buf, unsigned flags) +{ + int err = -EINVAL; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf, flags); +out_fput: + fput(file); + return err; +} + +static struct vfsmount *kevent_mnt __read_mostly; + +static int kevent_get_sb(struct file_system_type *fs_type, int flags, + const char *dev_name, void *data, struct vfsmount *mnt) +{ + return get_sb_pseudo(fs_type, "kevent", NULL, 0xaabbccdd, mnt); +} + +static struct file_system_type kevent_fs_type = { + .name = "keventfs", + .get_sb = kevent_get_sb, + .kill_sb = kill_anon_super, +}; + +static int keventfs_delete_dentry(struct dentry *dentry) +{ + return 1; +} + +static struct dentry_operations keventfs_dentry_operations = { + .d_delete = keventfs_delete_dentry, +}; + +asmlinkage long sys_kevent_init(struct kevent_ring __user *ring, unsigned int num, unsigned int flags) +{ + struct qstr this; + char name[32]; + struct dentry *dentry; + struct inode *inode; + struct file *file; + int err = -ENFILE, fd; + struct kevent_user *u; + + if ((ring && !num) || (!ring && num) || (num == 1)) + return -EINVAL; + + file = get_empty_filp(); + if (!file) + goto err_out_exit; + + inode = new_inode(kevent_mnt->mnt_sb); + if (!inode) + goto err_out_fput; + + inode->i_fop = &kevent_user_fops; + + inode->i_state = I_DIRTY; + inode->i_mode = S_IRUSR | S_IWUSR; + inode->i_uid = current->fsuid; + inode->i_gid = current->fsgid; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + + err = get_unused_fd(); + if (err < 0) + goto err_out_iput; + fd = err; + + err = -ENOMEM; + u = kevent_user_alloc(ring, num); + if (!u) + goto err_out_put_fd; + + sprintf(name, "[%lu]", inode->i_ino); + this.name = name; + this.len = strlen(name); + this.hash = inode->i_ino; + dentry = d_alloc(kevent_mnt->mnt_sb->s_root, &this); + if (!dentry) + goto err_out_free; + dentry->d_op = &keventfs_dentry_operations; + d_add(dentry, inode); + file->f_vfsmnt = mntget(kevent_mnt); + file->f_dentry = dentry; + file->f_mapping = inode->i_mapping; + file->f_pos = 0; + file->f_flags = O_RDONLY; + file->f_op = &kevent_user_fops; + file->f_mode = FMODE_READ; + file->f_version = 0; + file->private_data = u; + + fd_install(fd, file); + + return fd; + +err_out_free: + kmem_cache_free(kevent_user_cache, u); +err_out_put_fd: + put_unused_fd(fd); +err_out_iput: + iput(inode); +err_out_fput: + put_filp(file); +err_out_exit: + return err; +} + +/* + * Commits user's index (consumer index). + * Must be called under u->ring_lock mutex held. 
+ */ +static int __kevent_user_commit(struct kevent_user *u, unsigned int new_uidx, unsigned int over) +{ + int err = -EOVERFLOW, comm = 0; + struct kevent_ring __user *ring = u->pring; + + if (!ring) { + err = 0; + goto err_out_exit; + } + + if (new_uidx >= u->ring_size) { + err = -EINVAL; + goto err_out_exit; + } + + if ((over != u->ring_over - 1) && (over != u->ring_over)) + goto err_out_exit; + + if (u->uidx < u->kidx && new_uidx > u->kidx) { + err = -EINVAL; + goto err_out_exit; + } + + if (new_uidx > u->uidx) { + if (over != u->ring_over) + goto err_out_exit; + + comm = new_uidx - u->uidx; + u->uidx = new_uidx; + u->full = 0; + } else if (new_uidx < u->uidx) { + comm = u->ring_size - (u->uidx - new_uidx); + u->uidx = new_uidx; + u->full = 0; + u->ring_over++; + + if (put_user(u->ring_over, &ring->ring_over)) { + err = -EFAULT; + goto err_out_exit; + } + } + + return comm; + +err_out_exit: + return err; +} + +/* + * This syscall is used to perform waiting until there is free space in the ring + * buffer, in that case some events will be copied there. + * Function returns number of actually copied ready events in ring buffer. + * After this function is completed userspace ring->ring_kidx will be updated. + * + * @ctl_fd - kevent file descriptor. + * @num - number of kevents to process. + * @old_uidx - the last index user is aware of. + * @timeout - time to wait until there is free space in kevent queue. + * @flags - various flags (see include/linux/ukevent.h KEVENT_FLAGS_*). + * + * When we need to commit @num events, it means we should just remove first @num + * kevents from ready queue and copy them into the buffer. + * Kevents will be copied into ring buffer in order they were placed into ready queue. + * One-shot kevents will be removed here, since there is no way they can be reused. + * Edge-triggered events will be requeued here for better performance. + */ +asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx, + struct timespec timeout, unsigned int flags) +{ + int err = -EINVAL, copied = 0; + struct file *file; + struct kevent_user *u; + struct kevent *k; + struct kevent_ring __user *ring; + long tm = MAX_SCHEDULE_TIMEOUT; + unsigned int i; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + ring = u->pring; + if (!ring || num > u->ring_size) + goto out_fput; +#if 0 + /* + * Allow to immediately update ring index, but it is not supported, + * since syscall() has limited number of arguments which is actually + * a good idea - use kevent_commit() instead. 
+ */ + if ((u->uidx != new_uidx) && (new_uidx != 0xffffffff)) { + mutex_lock(&u->ring_lock); + __kevent_user_commit(u, new_uidx, over); + mutex_unlock(&u->ring_lock); + } +#endif + + if (!(file->f_flags & O_NONBLOCK)) { + if (!timespec_valid(&timeout)) + goto out_fput; + + if (flags & KEVENT_FLAGS_ABSTIME) { + hrtimer_cancel(&u->timer); + hrtimer_init(&u->timer, CLOCK_REALTIME, HRTIMER_ABS); + u->timer.expires = ktime_set(timeout.tv_sec, timeout.tv_nsec); + u->timer.function = &kevent_user_wake; + hrtimer_start(&u->timer, u->timer.expires, HRTIMER_ABS); + if (unlikely(kevent_debug_abstime == 0)) { + printk(KERN_INFO "kevent: author was wrong, " + "someone uses absolute time in %s, " + "please report to remove this warning.\n", __func__); + kevent_debug_abstime = 1; + } + } else { + tm = timespec_to_jiffies(&timeout); + } + + wait_event_interruptible_timeout(u->wait, + ((u->ready_num >= 1) && kevent_ring_space(u)) || + u->need_exit || old_uidx != u->uidx, + tm); + } + u->need_exit = 0; + + for (i=0; i<num; ++i) { + k = kevent_dequeue_ready_ring(u); + if (!k) + break; + kevent_complete_ready(k); + + if (k->event.ret_flags & KEVENT_RET_COPY_FAILED) + break; + kevent_stat_ring(u); + copied++; + } + + fput(file); + + return copied; +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to commit events in ring buffer, i.e. mark appropriate + * entries as unused by userspace so subsequent kevent_wait() could overwrite them. + * This fucntion returns actual number of kevents which were committed. + * After this function is completed userspace ring->ring_over can be updated. + * + * @ctl_fd - kevent file descriptor. + * @new_uidx - the last committed kevent. + * @over - number of overflows given queue had. + */ +asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over) +{ + int err = -EINVAL, comm = 0; + struct file *file; + struct kevent_user *u; + + file = fget(ctl_fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + u = file->private_data; + + mutex_lock(&u->ring_lock); + err = __kevent_user_commit(u, new_uidx, over); + if (err < 0) + goto err_out_unlock; + comm = err; + mutex_unlock(&u->ring_lock); + + fput(file); + + return comm; + +err_out_unlock: + mutex_unlock(&u->ring_lock); +out_fput: + fput(file); + return err; +} + +/* + * This syscall is used to perform various control operations + * on given kevent queue, which is obtained through kevent file descriptor @fd. + * @cmd - type of operation. + * @num - number of kevents to be processed. + * @arg - pointer to array of struct ukevent. + */ +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg) +{ + int err = -EINVAL; + struct file *file; + + file = fget(fd); + if (!file) + return -EBADF; + + if (file->f_op != &kevent_user_fops) + goto out_fput; + + err = kevent_ctl_process(file, cmd, num, arg); + +out_fput: + fput(file); + return err; +} + +/* + * Kevent subsystem initialization - create caches and register + * filesystem to get control file descriptors from. 
+ */ +static int __init kevent_user_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL); + + kevent_user_cache = kmem_cache_create("kevent_user_cache", + sizeof(struct kevent_user), 0, SLAB_PANIC, NULL, NULL); + + err = register_filesystem(&kevent_fs_type); + if (err) + goto err_out_exit; + + kevent_mnt = kern_mount(&kevent_fs_type); + err = PTR_ERR(kevent_mnt); + if (IS_ERR(kevent_mnt)) + goto err_out_unreg; + + printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n"); + + return 0; + +err_out_unreg: + unregister_filesystem(&kevent_fs_type); +err_out_exit: + kmem_cache_destroy(kevent_cache); + return err; +} + +module_init(kevent_user_init); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 7a3b2e7..3b7d35f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -122,6 +122,12 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_kevent_get_events); +cond_syscall(sys_kevent_ctl); +cond_syscall(sys_kevent_wait); +cond_syscall(sys_kevent_commit); +cond_syscall(sys_kevent_init); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); ^ permalink raw reply related [flat|nested] 200+ messages in thread
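The ring protocol above is easiest to see from the consumer side. Below is a minimal userspace sketch, not part of the patch: it assumes the series' include/linux/ukevent.h is visible to userspace (for struct ukevent, struct kevent_ring and the KEVENT_* constants), and it assumes thin kevent_wait()/kevent_commit() syscall wrappers of the kind the project's evtest.c example provides, since glibc knows nothing about the new syscalls. The kernel copies ready events into ring->event[] at its producer index (kidx, published as ring->ring_kidx); the consumer walks its own index (uidx) behind it and hands slots back with kevent_commit().

#include <linux/ukevent.h>	/* struct ukevent, struct kevent_ring, KEVENT_* */
#include <stdio.h>
#include <time.h>

/* Assumed thin wrappers around the new syscalls; a real program goes
 * through syscall(2) with the arch-specific __NR_kevent_* numbers. */
extern int kevent_wait(int ctl_fd, unsigned int num, unsigned int old_uidx,
		       struct timespec timeout, unsigned int flags);
extern int kevent_commit(int ctl_fd, unsigned int new_uidx, unsigned int over);

/* Consumer index into ring->event[]; trails the kernel's kidx. */
static unsigned int uidx;

int consume(int ctl_fd, struct kevent_ring *ring, unsigned int ring_size)
{
	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	int i, num;

	/* Sleep until the kernel has copied ready events into the ring;
	 * the return value is how many it placed there. */
	num = kevent_wait(ctl_fd, ring_size, uidx, ts, 0);
	if (num <= 0)
		return num;

	for (i = 0; i < num; ++i) {
		struct ukevent *uk = &ring->event[uidx];

		if (!(uk->ret_flags & KEVENT_RET_BROKEN))
			printf("type %u fired, ret_data[0]=%u\n",
			       uk->type, uk->ret_data[0]);

		if (++uidx == ring_size)
			uidx = 0;	/* wrap exactly like the kernel's kidx */
	}

	/* Hand the consumed slots back.  The overflow count must match what
	 * the kernel last published, so reuse ring->ring_over here. */
	return kevent_commit(ctl_fd, uidx, ring->ring_over);
}

Because kevent_dequeue_ready_ring() refuses to copy an event when there is no ring space, a consumer that stops committing merely stalls the queue; ready events are not lost.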
* [take26 3/8] kevent: poll/select() notifications.
  2006-11-30 19:14 ` [take26 2/8] kevent: Core files Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck,
	linux-kernel, Jeff Garzik

poll/select() notifications.

This patch includes generic poll/select() notifications. kevent_poll works
similarly to epoll and has the same issues (the callback is invoked not from
the caller's internal state machine but through a process wakeup, there are
a lot of allocations, and so on).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru>

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
 #include <linux/sysctl.h>
+#include <linux/kevent.h>
 #include <linux/percpu_counter.h>
 #include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
 	f->f_uid = tsk->fsuid;
 	f->f_gid = tsk->fsgid;
 	eventpoll_init_file(f);
+	kevent_init_file(f);
 	/* f->f_version: 0 */
 	return f;
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
 	 * in the file cleanup chain.
 	 */
 	eventpoll_release(file);
+	kevent_cleanup_file(file);
 	locks_remove_flock(file);
 	if (file->f_op && file->f_op->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/mutex.h>
+#include <linux/kevent_storage.h>
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
 	struct mutex inotify_mutex;	/* protects the watches list */
 #endif
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+	struct kevent_storage st;
+#endif
+
 	unsigned long i_state;
 	unsigned long dirtied_when;	/* jiffies of first dirtying */
@@ -739,6 +744,9 @@ struct file {
 	struct list_head f_ep_links;
 	spinlock_t f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage st;
+#endif
 	struct address_space *f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/kevent.h> +#include <linux/poll.h> +#include <linux/fs.h> + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_struct pt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_head container_entry; + wait_queue_head_t *whead; + wait_queue_t wait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_head container_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err; + unsigned int revents; + unsigned long flags; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -EBADF; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + } else { + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + err = 1; + goto out_dequeue; + } + } + + spin_lock_irqsave(&k->ulock, flags); + k->event.req_flags |= KEVENT_REQ_LAST_CHECK; + spin_unlock_irqrestore(&k->ulock, flags); + + return 0; + +out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + 
remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) { + return 1; + } else { + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + + k->event.ret_data[0] = revents & k->event.event; + + return (revents & k->event.event); + } +} + +static int __init kevent_poll_sys_init(void) +{ + struct kevent_callbacks pc = { + .callback = &kevent_poll_callback, + .enqueue = &kevent_poll_enqueue, + .dequeue = &kevent_poll_dequeue}; + + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + kevent_add_callbacks(&pc, KEVENT_POLL); + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static struct lock_class_key kevent_poll_key; + +void kevent_poll_reinit(struct file *file) +{ + lockdep_set_class(&file->st.lock, &kevent_poll_key); +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); ^ permalink raw reply related [flat|nested] 200+ messages in thread
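From userspace the poll notification is driven like this; a sketch under the same assumptions as above (the series' include/linux/ukevent.h plus a hypothetical kevent_ctl() syscall wrapper). kevent_poll_enqueue() resolves its target with fget(id.raw[0]), so the file descriptor goes into the id field and an ordinary poll mask into 'event':

#include <linux/ukevent.h>
#include <poll.h>
#include <string.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

int watch_fd_readable(int ctl_fd, int fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.id.raw[0] = fd;		/* which file to poll */
	uk.event = POLLIN | POLLRDNORM;	/* wake when readable */
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* auto-remove after it fires */

	/* Returns > 0 if some events were ready immediately; per-event
	 * status comes back in ret_flags (KEVENT_RET_DONE/BROKEN). */
	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}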
* [take26 4/8] kevent: Socket notifications. 2006-11-30 19:14 ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Socket notifications. This patch includes socket send/recv/accept notifications. Using trivial web server based on kevent and this features instead of epoll it's performance increased more than noticebly. More details about various benchmarks and server itself (evserver_kevent.c) can be found on project's homepage. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mitp.ru> diff --git a/fs/inode.c b/fs/inode.c index ada7643..2740617 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include <linux/cdev.h> #include <linux/bootmem.h> #include <linux/inotify.h> +#include <linux/kevent.h> #include <linux/mount.h> /* @@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct } inode->i_private = 0; inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/net/sock.h b/include/net/sock.h index edd4d73..d48ded8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if 
(skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 0000000..9c24b5b --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,142 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL); + + if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT))) + return 1; + if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND)) + return 1; + if (events & (POLLERR | POLLHUP)) + return -1; + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct inode *inode; + struct socket *sock; + int err = -EBADF; + + sock = sockfd_lookup(k->event.id.raw[0], &err); + if (!sock) + goto err_out_exit; + + inode = igrab(SOCK_INODE(sock)); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + sockfd_put(sock); +err_out_exit: + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct socket *sock; + + kevent_storage_dequeue(k->st, k); + + sock = SOCKET_I(inode); + iput(inode); + sockfd_put(sock); + + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket) + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); +} + +/* + * It is required for network protocols compiled as modules, like IPv6. 
+ */ +EXPORT_SYMBOL_GPL(kevent_socket_notify); + +#ifdef CONFIG_LOCKDEP +static struct lock_class_key kevent_sock_key; + +void kevent_socket_reinit(struct socket *sock) +{ + struct inode *inode = SOCK_INODE(sock); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); +} + +void kevent_sk_reinit(struct sock *sk) +{ + if (sk->sk_socket) { + struct inode *inode = SOCK_INODE(sk->sk_socket); + + lockdep_set_class(&inode->st.lock, &kevent_sock_key); + } +} +#endif +static int __init kevent_init_socket(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_socket_callback, + .enqueue = &kevent_socket_enqueue, + .dequeue = &kevent_socket_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SOCKET); +} +module_init(kevent_init_socket); diff --git a/net/core/sock.c b/net/core/sock.c index b77e155..7d5fa3e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1489,6 +1493,8 @@ void sock_init_data(struct socket *sock, sk->sk_state = TCP_CLOSE; sk->sk_socket = sock; + kevent_sk_reinit(sk); + sock_set_flag(sk, SOCK_ZAPPED); if(sock) @@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&sk->sk_lock.wq)) + if (waitqueue_active(&sk->sk_lock.wq)) { wake_up(&sk->sk_lock.wq); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&sk->sk_lock.slock); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index d1d7dec..2878c2a 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 3f884ce..e7dd989 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s __skb_unlink(skb, &tp->out_of_order_queue); __skb_queue_tail(&sk->sk_receive_queue, skb); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; if(skb->h.th->fin) tcp_fin(skb, sk, skb->h.th); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c83938b..b0dd70d 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -61,6 +61,7 @@ #include <linux/jhash.h> #include 
<linux/init.h> #include <linux/times.h> +#include <linux/kevent.h> #include <net/icmp.h> #include <net/inet_hashtables.h> @@ -870,6 +871,7 @@ int tcp_v4_conn_request(struct sock *sk, reqsk_free(req); } else { inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); + kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT); } return 0; diff --git a/net/socket.c b/net/socket.c index 1bc4167..5582b4a 100644 --- a/net/socket.c +++ b/net/socket.c @@ -85,6 +85,7 @@ #include <linux/kmod.h> #include <linux/audit.h> #include <linux/wireless.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -490,6 +491,8 @@ static struct socket *sock_alloc(void) inode->i_uid = current->fsuid; inode->i_gid = current->fsgid; + kevent_socket_reinit(sock); + get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); return sock; ^ permalink raw reply related [flat|nested] 200+ messages in thread
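To show the accept path these hooks enable, here is a sketch in the spirit of the evserver_kevent.c example referenced above, with the same header and wrapper assumptions as the earlier sketches: the listener arms a KEVENT_SOCKET_ACCEPT event once and then calls accept() only when the queue reports a pending connection.

#include <linux/ukevent.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* Assumed wrappers, as before. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, struct timespec timeout,
			     struct ukevent *buf, unsigned int flags);

int accept_loop(int ctl_fd, int listen_fd)
{
	struct timespec ts = { .tv_sec = 5, .tv_nsec = 0 };
	struct ukevent uk[16];
	int i, num;

	memset(&uk[0], 0, sizeof(uk[0]));
	uk[0].type = KEVENT_SOCKET;
	uk[0].event = KEVENT_SOCKET_ACCEPT;
	uk[0].id.raw[0] = listen_fd;	/* sockfd_lookup() key in the kernel */

	if (kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, uk) < 0)
		return -1;

	for (;;) {
		num = kevent_get_events(ctl_fd, 1, 16, ts, uk, 0);
		if (num < 0)
			return num;

		for (i = 0; i < num; ++i) {
			if (uk[i].ret_flags & KEVENT_RET_BROKEN)
				continue;
			/* Readiness means a completed connection is queued,
			 * so this accept() never blocks. */
			accept(uk[i].id.raw[0], NULL, NULL);
		}
	}
}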
* [take26 5/8] kevent: Timer notifications. 2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. This subsystem uses high-resolution timers. id.raw[0] is used as number of seconds id.raw[1] is used as number of nanoseconds Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..df93049 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,112 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/hrtimer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage 
*st = k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + ^ permalink raw reply related [flat|nested] 200+ messages in thread
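From userspace, the period is carried in the id field, seconds in raw[0] and nanoseconds in raw[1], exactly as kevent_timer_enqueue() reads them; kevent_timer_callback() stamps ret_data[0] with jiffies_to_msecs(jiffies) at expiration. A 100ms periodic tick, under the same header and wrapper assumptions as the earlier sketches:

#include <linux/ukevent.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Assumed wrappers, as before. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);
extern int kevent_get_events(int ctl_fd, unsigned int min_nr,
			     unsigned int max_nr, struct timespec timeout,
			     struct ukevent *buf, unsigned int flags);

int timer_tick(int ctl_fd)
{
	struct timespec ts = { .tv_sec = 10, .tv_nsec = 0 };
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_MASK_ALL;
	uk.id.raw[0] = 0;			/* period: seconds */
	uk.id.raw[1] = 100 * 1000 * 1000;	/* period: nanoseconds */

	if (kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk) < 0)
		return -1;

	for (;;) {
		/* Not one-shot: the hrtimer re-forwards itself, so the
		 * event fires again on every period. */
		if (kevent_get_events(ctl_fd, 1, 1, ts, &uk, 0) < 1)
			break;
		printf("tick at %u ms\n", uk.ret_data[0]);
	}

	return 0;
}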
* [take26 6/8] kevent: Pipe notifications. 2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov @ 2006-11-30 19:14 ` Evgeniy Polyakov 2006-11-30 19:14 ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov 0 siblings, 1 reply; 200+ messages in thread From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw) To: Evgeniy Polyakov Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov, netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck, linux-kernel, Jeff Garzik Pipe notifications. diff --git a/fs/pipe.c b/fs/pipe.c index f3b6f71..aeaee9c 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -16,6 +16,7 @@ #include <linux/uio.h> #include <linux/highmem.h> #include <linux/pagemap.h> +#include <linux/kevent.h> #include <asm/uaccess.h> #include <asm/ioctls.h> @@ -312,6 +313,7 @@ redo: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -321,6 +323,7 @@ redo: /* Signal writers asynchronously that there is more room. */ if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } @@ -490,6 +493,7 @@ redo2: break; } if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); do_wakeup = 0; @@ -501,6 +505,7 @@ redo2: out: mutex_unlock(&inode->i_mutex); if (do_wakeup) { + kevent_pipe_notify(inode, KEVENT_SOCKET_RECV); wake_up_interruptible(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } @@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de free_pipe_info(inode); } else { wake_up_interruptible(&pipe->wait); + kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c new file mode 100644 index 0000000..d529fa9 --- /dev/null +++ b/kernel/kevent/kevent_pipe.c @@ -0,0 +1,121 @@ +/* + * kevent_pipe.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> +#include <linux/pipe_fs_i.h> + +static int kevent_pipe_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct pipe_inode_info *pipe = inode->i_pipe; + int nrbufs = pipe->nrbufs; + + if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) { + if (!pipe->writers) + return -1; + return 1; + } + + if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) { + if (!pipe->readers) + return -1; + return 1; + } + + return 0; +} + +int kevent_pipe_enqueue(struct kevent *k) +{ + struct file *pipe; + int err = -EBADF; + struct inode *inode; + + pipe = fget(k->event.id.raw[0]); + if (!pipe) + goto err_out_exit; + + inode = igrab(pipe->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = -EINVAL; + if (!S_ISFIFO(inode->i_mode)) + goto err_out_iput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + fput(pipe); + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput(pipe); +err_out_exit: + return err; +} + +int kevent_pipe_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +void kevent_pipe_notify(struct inode *inode, u32 event) +{ + kevent_storage_ready(&inode->st, NULL, event); +} + +static int __init kevent_init_pipe(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_pipe_callback, + .enqueue = &kevent_pipe_enqueue, + .dequeue = &kevent_pipe_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_PIPE); +} +module_init(kevent_init_pipe); ^ permalink raw reply related [flat|nested] 200+ messages in thread
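Userspace view of the above, as a sketch with the usual header and wrapper assumptions. Note the event bits: the pipe notifier deliberately reuses KEVENT_SOCKET_RECV/SEND rather than introducing pipe-specific masks, so "readable" (nrbufs > 0) and "writable" (free buffer slots) are expressed with the socket constants.

#include <linux/ukevent.h>
#include <string.h>
#include <unistd.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

int watch_pipe(int ctl_fd)
{
	struct ukevent uk;
	int fds[2];

	if (pipe(fds))
		return -1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_PIPE;
	uk.event = KEVENT_SOCKET_RECV;	/* readable: pipe->nrbufs > 0 */
	uk.id.raw[0] = fds[0];		/* fget() key in kevent_pipe_enqueue() */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}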
* [take26 7/8] kevent: Signal notifications.
  2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov
@ 2006-11-30 19:14 ` Evgeniy Polyakov
  2006-11-30 19:14 ` [take26 8/8] kevent: Kevent posix timer notifications Evgeniy Polyakov
  0 siblings, 1 reply; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters, Johann Borck,
	linux-kernel, Jeff Garzik

Signal notifications.

This type of notification allows signals to be delivered through the kevent
queue. An example application, signal.c, can be found on the project homepage.

If the KEVENT_SIGNAL_NOMASK bit is set in the raw_u64 id, the signal is
delivered only through the queue; otherwise both delivery types are used -
the old one, through an update of the mask of pending signals, and the queue.
If a signal is delivered only through the kevent queue, the mask of pending
signals is not updated at all, which is equivalent to putting the signal into
the blocked mask, except that the signal is still delivered through the
kevent queue.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc4a987..ef38a3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include <linux/resource.h>
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>
 
 #include <asm/processor.h>
@@ -1013,6 +1014,10 @@ struct task_struct {
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+#ifdef CONFIG_KEVENT_SIGNAL
+	struct kevent_storage st;
+	u32 kevent_signals;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff --git a/kernel/fork.c b/kernel/fork.c
index 1c999f3..e5b5b14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -46,6 +46,7 @@
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
+#include <linux/kevent.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc
 	WARN_ON(atomic_read(&tsk->usage));
 	WARN_ON(tsk == current);
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_fini(&tsk->st);
+#endif
 	security_task_free(tsk);
 	free_uid(tsk->user);
 	put_group_info(tsk->group_info);
@@ -1121,6 +1125,10 @@ static struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_namespace;
 
+#ifdef CONFIG_KEVENT_SIGNAL
+	kevent_storage_init(p, &p->st);
+#endif
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 0000000..0edd2e4
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,92 @@
+/*
+ * kevent_signal.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/kevent.h> + +static int kevent_signal_callback(struct kevent *k) +{ + struct task_struct *tsk = k->st->origin; + int sig = k->event.id.raw[0]; + int ret = 0; + + if (sig == tsk->kevent_signals) + ret = 1; + + if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK)) + tsk->kevent_signals |= 0x80000000; + + return ret; +} + +int kevent_signal_enqueue(struct kevent *k) +{ + int err; + + err = kevent_storage_enqueue(¤t->st, k); + if (err) + goto err_out_exit; + + if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) { + kevent_requeue(k); + err = 0; + } else { + err = k->callbacks.callback(k); + if (err) + goto err_out_dequeue; + } + + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_exit: + return err; +} + +int kevent_signal_dequeue(struct kevent *k) +{ + kevent_storage_dequeue(k->st, k); + return 0; +} + +int kevent_signal_notify(struct task_struct *tsk, int sig) +{ + tsk->kevent_signals = sig; + kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY); + return (tsk->kevent_signals & 0x80000000); +} + +static int __init kevent_init_signal(void) +{ + struct kevent_callbacks sc = { + .callback = &kevent_signal_callback, + .enqueue = &kevent_signal_enqueue, + .dequeue = &kevent_signal_dequeue}; + + return kevent_add_callbacks(&sc, KEVENT_SIGNAL); +} +module_init(kevent_init_signal); diff --git a/kernel/signal.c b/kernel/signal.c index fb5da6d..d3d3594 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -23,6 +23,7 @@ #include <linux/ptrace.h> #include <linux/signal.h> #include <linux/capability.h> +#include <linux/kevent.h> #include <asm/param.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -703,6 +704,9 @@ static int send_signal(int sig, struct s { struct sigqueue * q = NULL; int ret = 0; + + if (kevent_signal_notify(t, sig)) + return 1; /* * fast-pathed signals for kernel-internal things like SIGSTOP @@ -782,6 +786,17 @@ specific_send_sig_info(int sig, struct s ret = send_signal(sig, info, t, &t->pending); if (!ret && !sigismember(&t->blocked, sig)) signal_wake_up(t, sig == SIGKILL); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif out: return ret; } @@ -971,6 +986,17 @@ __group_send_sig_info(int sig, struct si * to avoid several races. */ ret = send_signal(sig, info, p, &p->signal->shared_pending); +#ifdef CONFIG_KEVENT_SIGNAL + /* + * Kevent allows to deliver signals through kevent queue, + * it is possible to setup kevent to not deliver + * signal through the usual way, in that case send_signal() + * returns 1 and signal is delivered only through kevent queue. + * We simulate successfull delivery notification through this hack: + */ + if (ret == 1) + ret = 0; +#endif if (unlikely(ret)) return ret; ^ permalink raw reply related [flat|nested] 200+ messages in thread
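A userspace sketch of queue-only signal delivery, under the usual header and wrapper assumptions. kevent_signal_callback() compares the signal number against id.raw[0], and the NOMASK test is done on the whole raw_u64, so this sketch additionally assumes the KEVENT_SIGNAL_NOMASK flag occupies bits outside the raw[0] word.

#include <linux/ukevent.h>
#include <signal.h>
#include <string.h>

/* Assumed wrapper for sys_kevent_ctl(); not provided by glibc. */
extern int kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num,
		      struct ukevent *arg);

/* Route SIGUSR1 exclusively through the kevent queue: with
 * KEVENT_SIGNAL_NOMASK set, send_signal() short-circuits and the
 * pending-signal mask is never touched. */
int queue_sigusr1(int ctl_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_SIGNAL;
	uk.event = KEVENT_SIGNAL_DELIVERY;
	uk.id.raw[0] = SIGUSR1;		/* signal number checked by the callback */
	uk.id.raw_u64 |= KEVENT_SIGNAL_NOMASK;	/* suppress normal delivery */

	return kevent_ctl(ctl_fd, KEVENT_CTL_ADD, 1, &uk);
}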
* [take26 8/8] kevent: Kevent posix timer notifications.
  2006-11-30 19:14             ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov
@ 2006-11-30 19:14               ` Evgeniy Polyakov
  0 siblings, 0 replies; 200+ messages in thread
From: Evgeniy Polyakov @ 2006-11-30 19:14 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Ulrich Drepper, Andrew Morton, Evgeniy Polyakov,
	netdev, Zach Brown, Christoph Hellwig, Chase Venters,
	Johann Borck, linux-kernel, Jeff Garzik

Kevent posix timer notifications.

A simple extension to POSIX timers that allows the timer expiration
notification to be delivered through the kevent queue. The example
application posix_timer.c can be found in the archive on the project
homepage.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 8786e01..3768746 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -235,6 +235,7 @@ typedef struct siginfo {
 #define SIGEV_NONE	1	/* other notification: meaningless */
 #define SIGEV_THREAD	2	/* deliver via thread creation */
 #define SIGEV_THREAD_ID 4	/* deliver to thread */
+#define SIGEV_KEVENT	8	/* deliver through kevent queue */
 
 /*
  * This works because the alignment is ok on all current architectures
@@ -260,6 +261,8 @@ typedef struct sigevent {
 			void (*_function)(sigval_t);
 			void *_attribute;	/* really pthread_attr_t */
 		} _sigev_thread;
+
+		int kevent_fd;
 	} _sigev_un;
 } sigevent_t;
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/kevent_storage.h>
 
 union cpu_time_count {
 	cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
 	sigval_t it_sigev_value;	/* value word of sigevent struct */
 	struct task_struct *it_process;	/* process to send signal to */
 	struct sigqueue *sigq;		/* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+	struct kevent_storage st;
+#endif
 	union {
 		struct {
 			struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index e5ebcc1..8d0e7a3 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
 #include <linux/wait.h>
 #include <linux/workqueue.h>
 #include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>
 
 /*
  * Management arrays for POSIX timers. Timers are kept in slab memory
@@ -224,6 +226,99 @@ static int posix_ktime_get_ts(clockid_t
 	return 0;
 }
 
+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+	/*
+	 * It is not ugly: there is no pointer member in the id field union,
+	 * but its size is 64 bits, which is enough for any known pointer size.
+	 */
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+	struct k_itimer *tmr = (struct k_itimer *)(unsigned long)k->event.id.raw_u64;
+	kevent_storage_dequeue(&tmr->st, k);
+	return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+	return 1;
+}
+static int posix_kevent_init(void)
+{
+	struct kevent_callbacks tc = {
+		.callback = &posix_kevent_callback,
+		.enqueue = &posix_kevent_enqueue,
+		.dequeue = &posix_kevent_dequeue};
+
+	return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	struct ukevent uk;
+	struct file *file;
+	struct kevent_user *u;
+	int err;
+
+	file = fget(fd);
+	if (!file) {
+		err = -EBADF;
+		goto err_out;
+	}
+
+	if (file->f_op != &kevent_user_fops) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	u = file->private_data;
+
+	memset(&uk, 0, sizeof(struct ukevent));
+
+	uk.event = KEVENT_MASK_ALL;
+	uk.type = KEVENT_POSIX_TIMER;
+	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+	uk.req_flags = KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE;
+	uk.ptr = tmr->it_sigev_value.sival_ptr;
+
+	err = kevent_user_add_ukevent(&uk, u);
+	if (err)
+		goto err_out_fput;
+
+	fput(file);
+
+	return 0;
+
+err_out_fput:
+	fput(file);
+err_out:
+	return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+	kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+	return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+	return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
 /*
  * Initialize everything, well, just everything in Posix clocks/timers ;)
  */
@@ -241,6 +336,11 @@ static __init int init_posix_timers(void
 	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
 	register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);
 
+	if (posix_kevent_init()) {
+		printk(KERN_ERR "Failed to initialize kevent posix timers.\n");
+		BUG();
+	}
+
 	posix_timers_cache = kmem_cache_create("posix_timers_cache",
 					sizeof (struct k_itimer), 0, 0, NULL, NULL);
 	idr_init(&posix_timers_id);
@@ -343,23 +443,27 @@ static int posix_timer_fn(struct hrtimer
 	timr = container_of(timer, struct k_itimer, it.real.timer);
 	spin_lock_irqsave(&timr->it_lock, flags);
 
+	if (timr->it_sigev_notify == SIGEV_KEVENT) {
+		kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL);
+	} else {
+		if (timr->it.real.interval.tv64 != 0)
+			si_private = ++timr->it_requeue_pending;
-	if (timr->it.real.interval.tv64 != 0)
-		si_private = ++timr->it_requeue_pending;
-
-	if (posix_timer_event(timr, si_private)) {
-		/*
-		 * signal was not sent because of sig_ignor
-		 * we will not get a call back to restart it AND
-		 * it should be restarted.
-		 */
-		if (timr->it.real.interval.tv64 != 0) {
-			timr->it_overrun +=
-				hrtimer_forward(timer,
-					timer->base->softirq_time,
-					timr->it.real.interval);
-			ret = HRTIMER_RESTART;
-			++timr->it_requeue_pending;
+		if (posix_timer_event(timr, si_private)) {
+			/*
+			 * signal was not sent because of sig_ignor
+			 * we will not get a call back to restart it AND
+			 * it should be restarted.
+			 */
+			if (timr->it.real.interval.tv64 != 0) {
+				timr->it_overrun +=
+					hrtimer_forward(timer,
+						timer->base->softirq_time,
+						timr->it.real.interval);
+				ret = HRTIMER_RESTART;
+				++timr->it_requeue_pending;
+			}
 		}
 	}
 
@@ -407,6 +511,9 @@ static struct k_itimer * alloc_posix_tim
 		kmem_cache_free(posix_timers_cache, tmr);
 		tmr = NULL;
 	}
+#ifdef CONFIG_KEVENT_TIMER
+	kevent_storage_init(tmr, &tmr->st);
+#endif
 	return tmr;
 }
 
@@ -424,6 +531,7 @@ static void release_posix_timer(struct k
 	if (unlikely(tmr->it_process) &&
 	    tmr->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
 		put_task_struct(tmr->it_process);
+	posix_kevent_fini_timer(tmr);
 
 	kmem_cache_free(posix_timers_cache, tmr);
 }
@@ -496,40 +604,52 @@ sys_timer_create(const clockid_t which_c
 		new_timer->it_sigev_signo = event.sigev_signo;
 		new_timer->it_sigev_value = event.sigev_value;
 
-		read_lock(&tasklist_lock);
-		if ((process = good_sigevent(&event))) {
-			/*
-			 * We may be setting up this process for another
-			 * thread.  It may be exiting.  To catch this
-			 * case the we check the PF_EXITING flag.  If
-			 * the flag is not set, the siglock will catch
-			 * him before it is too late (in exit_itimers).
-			 *
-			 * The exec case is a bit more invloved but easy
-			 * to code.  If the process is in our thread
-			 * group (and it must be or we would not allow
-			 * it here) and is doing an exec, it will cause
-			 * us to be killed.  In this case it will wait
-			 * for us to die which means we can finish this
-			 * linkage with our last gasp. I.e. no code :)
-			 */
+		if (event.sigev_notify == SIGEV_KEVENT) {
+			error = posix_kevent_init_timer(new_timer, event._sigev_un.kevent_fd);
+			if (error)
+				goto out;
+
+			process = current->group_leader;
 			spin_lock_irqsave(&process->sighand->siglock, flags);
-			if (!(process->flags & PF_EXITING)) {
-				new_timer->it_process = process;
-				list_add(&new_timer->list,
-					 &process->signal->posix_timers);
-				spin_unlock_irqrestore(&process->sighand->siglock, flags);
-				if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
-					get_task_struct(process);
-			} else {
-				spin_unlock_irqrestore(&process->sighand->siglock, flags);
-				process = NULL;
+			new_timer->it_process = process;
+			list_add(&new_timer->list, &process->signal->posix_timers);
+			spin_unlock_irqrestore(&process->sighand->siglock, flags);
+		} else {
+			read_lock(&tasklist_lock);
+			if ((process = good_sigevent(&event))) {
+				/*
+				 * We may be setting up this process for another
+				 * thread.  It may be exiting.  To catch this
+				 * case the we check the PF_EXITING flag.  If
+				 * the flag is not set, the siglock will catch
+				 * him before it is too late (in exit_itimers).
+				 *
+				 * The exec case is a bit more invloved but easy
+				 * to code.  If the process is in our thread
+				 * group (and it must be or we would not allow
+				 * it here) and is doing an exec, it will cause
+				 * us to be killed.  In this case it will wait
+				 * for us to die which means we can finish this
+				 * linkage with our last gasp. I.e. no code :)
+				 */
+				spin_lock_irqsave(&process->sighand->siglock, flags);
+				if (!(process->flags & PF_EXITING)) {
+					new_timer->it_process = process;
+					list_add(&new_timer->list,
						 &process->signal->posix_timers);
+					spin_unlock_irqrestore(&process->sighand->siglock, flags);
+					if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
+						get_task_struct(process);
+				} else {
+					spin_unlock_irqrestore(&process->sighand->siglock, flags);
+					process = NULL;
+				}
+			}
+			read_unlock(&tasklist_lock);
+			if (!process) {
+				error = -EINVAL;
+				goto out;
+			}
 		}
-		}
-		read_unlock(&tasklist_lock);
-		if (!process) {
-			error = -EINVAL;
-			goto out;
-		}
 	} else {
 		new_timer->it_sigev_notify = SIGEV_SIGNAL;

^ permalink raw reply related	[flat|nested] 200+ messages in thread
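For illustration, a minimal userspace sketch of SIGEV_KEVENT usage, mirroring
the sys_timer_create() path above: the timer expiration is queued on an
existing kevent descriptor instead of raising a signal. SIGEV_KEVENT and the
kevent_fd member exist only in the patched headers, and the /dev/kevent node
name is an assumption; link with -lrt:

	#include <fcntl.h>
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	int main(void)
	{
		struct sigevent sev;
		struct itimerspec its = {
			.it_value    = { .tv_sec = 1 },
			.it_interval = { .tv_sec = 1 },
		};
		timer_t tid;
		int kfd = open("/dev/kevent", O_RDWR);	/* queue that receives expirations (assumed node) */

		if (kfd < 0) {
			perror("open");
			return 1;
		}

		memset(&sev, 0, sizeof(sev));
		sev.sigev_notify = SIGEV_KEVENT;	/* new notify type from this patch */
		sev._sigev_un.kevent_fd = kfd;		/* checked against kevent_user_fops in-kernel */
		sev.sigev_value.sival_ptr = &tid;	/* comes back in the ready ukevent's ptr field */

		if (timer_create(CLOCK_REALTIME, &sev, &tid)) {
			perror("timer_create");
			return 1;
		}
		timer_settime(tid, 0, &its, NULL);

		/* Each expiration now shows up as a KEVENT_POSIX_TIMER ukevent on
		 * kfd (registered with KEVENT_REQ_ONESHOT | KEVENT_REQ_ALWAYS_QUEUE),
		 * read via kevent_get_events()/kevent_wait(). */
		return 0;
	}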
end of thread, other threads:[~2006-12-28 9:52 UTC | newest]
Thread overview: 200+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <1154985aa0591036@2ka.mipt.ru>
2006-10-27 16:10 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-10-27 16:10 ` [take21 1/4] kevent: Core files Evgeniy Polyakov
2006-10-27 16:10 ` [take21 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-10-27 16:10 ` [take21 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-10-27 16:10 ` [take21 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-10-28 10:04 ` [take21 2/4] kevent: poll/select() notifications Eric Dumazet
2006-10-28 10:08 ` Evgeniy Polyakov
2006-10-28 10:28 ` [take21 1/4] kevent: Core files Eric Dumazet
2006-10-28 10:53 ` Evgeniy Polyakov
2006-10-28 12:36 ` Eric Dumazet
2006-10-28 13:03 ` Evgeniy Polyakov
2006-10-28 13:23 ` Eric Dumazet
2006-10-28 13:28 ` Evgeniy Polyakov
2006-10-28 13:34 ` Eric Dumazet
2006-10-28 13:47 ` Evgeniy Polyakov
2006-10-27 16:42 ` [take21 0/4] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-11-07 11:26 ` Jeff Garzik
2006-11-07 11:46 ` Jeff Garzik
2006-11-07 11:58 ` Evgeniy Polyakov
2006-11-07 11:51 ` Evgeniy Polyakov
2006-11-07 12:17 ` Jeff Garzik
2006-11-07 12:29 ` Evgeniy Polyakov
2006-11-07 12:32 ` Jeff Garzik
2006-11-07 19:34 ` Andrew Morton
2006-11-07 20:52 ` David Miller
2006-11-07 21:38 ` Andrew Morton
2006-11-01 11:36 ` [take22 " Evgeniy Polyakov
2006-11-01 11:36 ` [take22 1/4] kevent: Core files Evgeniy Polyakov
2006-11-01 11:36 ` [take22 2/4] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-01 11:36 ` [take22 3/4] kevent: Socket notifications Evgeniy Polyakov
2006-11-01 11:36 ` [take22 4/4] kevent: Timer notifications Evgeniy Polyakov
2006-11-01 13:06 ` [take22 0/4] kevent: Generic event handling mechanism Pavel Machek
2006-11-01 13:25 ` Evgeniy Polyakov
2006-11-01 16:05 ` Pavel Machek
2006-11-01 16:24 ` Evgeniy Polyakov
2006-11-01 18:13 ` Oleg Verych
2006-11-01 18:57 ` Evgeniy Polyakov
2006-11-02 2:12 ` Nate Diller
2006-11-02 6:21 ` Evgeniy Polyakov
2006-11-02 19:40 ` Nate Diller
2006-11-03 8:42 ` Evgeniy Polyakov
2006-11-03 8:57 ` Pavel Machek
2006-11-03 9:04 ` David Miller
2006-11-07 12:05 ` Jeff Garzik
2006-11-03 9:13 ` Evgeniy Polyakov
2006-11-05 11:19 ` Pavel Machek
2006-11-05 11:43 ` Evgeniy Polyakov
[not found] ` <aaf959cb0611011829k36deda6ahe61bcb9bf8e612e1@mail.gmail.com>
[not found] ` <aaf959cb0611011830j1ca3e469tc4a6af3a2a010fa@mail.gmail.com>
[not found] ` <4549A261.9010007@cosmosbay.com>
2006-11-03 2:42 ` zhou drangon
2006-11-03 9:16 ` Evgeniy Polyakov
2006-11-07 12:02 ` Jeff Garzik
2006-11-03 18:49 ` Oleg Verych
2006-11-04 10:24 ` Evgeniy Polyakov
2006-11-04 17:47 ` Evgeniy Polyakov
2006-11-01 16:07 ` James Morris
2006-11-07 16:50 ` [take23 0/5] " Evgeniy Polyakov
2006-11-07 16:50 ` [take23 1/5] kevent: Description Evgeniy Polyakov
2006-11-07 16:50 ` [take23 2/5] kevent: Core files Evgeniy Polyakov
2006-11-07 16:50 ` [take23 3/5] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-07 16:50 ` [take23 4/5] kevent: Socket notifications Evgeniy Polyakov
2006-11-07 16:50 ` [take23 5/5] kevent: Timer notifications Evgeniy Polyakov
2006-11-07 22:53 ` [take23 3/5] kevent: poll/select() notifications Davide Libenzi
2006-11-08 8:45 ` Evgeniy Polyakov
2006-11-08 17:03 ` Evgeniy Polyakov
2006-11-07 22:16 ` [take23 2/5] kevent: Core files Andrew Morton
2006-11-08 8:24 ` Evgeniy Polyakov
2006-11-07 22:16 ` [take23 1/5] kevent: Description Andrew Morton
2006-11-08 8:23 ` Evgeniy Polyakov
2006-11-07 22:17 ` [take23 0/5] kevent: Generic event handling mechanism Andrew Morton
2006-11-08 8:21 ` Evgeniy Polyakov
2006-11-08 14:51 ` Eric Dumazet
2006-11-08 22:03 ` Andrew Morton
2006-11-08 22:44 ` Davide Libenzi
2006-11-08 23:07 ` Eric Dumazet
2006-11-08 23:56 ` Davide Libenzi
2006-11-09 7:24 ` Eric Dumazet
2006-11-09 7:52 ` Eric Dumazet
2006-11-09 17:12 ` Davide Libenzi
2006-11-09 8:23 ` [take24 0/6] " Evgeniy Polyakov
2006-11-09 8:23 ` [take24 1/6] kevent: Description Evgeniy Polyakov
2006-11-09 8:23 ` [take24 2/6] kevent: Core files Evgeniy Polyakov
2006-11-09 8:23 ` [take24 3/6] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 4/6] kevent: Socket notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 5/6] kevent: Timer notifications Evgeniy Polyakov
2006-11-09 8:23 ` [take24 6/6] kevent: Pipe notifications Evgeniy Polyakov
2006-11-09 9:08 ` [take24 3/6] kevent: poll/select() notifications Eric Dumazet
2006-11-09 9:29 ` Evgeniy Polyakov
2006-11-09 18:51 ` Davide Libenzi
2006-11-09 19:10 ` Evgeniy Polyakov
2006-11-09 19:42 ` Davide Libenzi
2006-11-09 20:10 ` Davide Libenzi
2006-11-11 17:36 ` [take24 7/6] kevent: signal notifications Evgeniy Polyakov
2006-11-11 22:28 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
2006-11-13 10:54 ` Evgeniy Polyakov
2006-11-13 11:16 ` Evgeniy Polyakov
2006-11-20 0:02 ` Ulrich Drepper
2006-11-20 8:25 ` Evgeniy Polyakov
2006-11-20 8:43 ` Andrew Morton
2006-11-20 8:51 ` Evgeniy Polyakov
2006-11-20 9:15 ` Andrew Morton
2006-11-20 9:19 ` Evgeniy Polyakov
2006-11-20 20:29 ` Ulrich Drepper
2006-11-20 21:46 ` Jeff Garzik
2006-11-20 21:52 ` Ulrich Drepper
2006-11-21 9:09 ` Ingo Oeser
2006-11-22 11:38 ` Michael Tokarev
2006-11-22 11:47 ` Evgeniy Polyakov
2006-11-22 12:33 ` Jeff Garzik
2006-11-21 9:53 ` Evgeniy Polyakov
2006-11-21 16:58 ` Ulrich Drepper
2006-11-21 17:43 ` Evgeniy Polyakov
2006-11-21 18:46 ` Evgeniy Polyakov
2006-11-21 20:01 ` Jeff Garzik
2006-11-22 10:41 ` Evgeniy Polyakov
2006-11-21 20:19 ` Jeff Garzik
2006-11-22 10:39 ` Evgeniy Polyakov
2006-11-22 7:38 ` Ulrich Drepper
2006-11-22 10:44 ` Evgeniy Polyakov
2006-11-22 21:02 ` Ulrich Drepper
2006-11-23 12:23 ` Evgeniy Polyakov
2006-11-23 8:52 ` Kevent POSIX timers support Evgeniy Polyakov
2006-11-23 20:26 ` Ulrich Drepper
2006-11-24 9:50 ` Evgeniy Polyakov
2006-11-27 18:20 ` Ulrich Drepper
2006-11-27 18:24 ` David Miller
2006-11-27 18:36 ` Ulrich Drepper
2006-11-27 18:49 ` David Miller
2006-11-28 9:16 ` Evgeniy Polyakov
2006-11-28 19:13 ` David Miller
2006-11-28 19:22 ` Evgeniy Polyakov
2006-12-12 1:36 ` David Miller
2006-12-12 5:31 ` Evgeniy Polyakov
2006-11-28 9:16 ` Evgeniy Polyakov
2006-11-22 7:33 ` [take24 0/6] kevent: Generic event handling mechanism Ulrich Drepper
2006-11-22 10:38 ` Evgeniy Polyakov
2006-11-22 22:22 ` Ulrich Drepper
2006-11-23 12:18 ` Evgeniy Polyakov
2006-11-23 22:23 ` Ulrich Drepper
2006-11-24 10:57 ` Evgeniy Polyakov
2006-11-27 19:12 ` Ulrich Drepper
2006-11-28 11:00 ` Evgeniy Polyakov
2006-11-22 12:09 ` Evgeniy Polyakov
2006-11-22 12:15 ` Evgeniy Polyakov
2006-11-22 13:46 ` Evgeniy Polyakov
2006-11-22 22:24 ` Ulrich Drepper
2006-11-23 12:22 ` Evgeniy Polyakov
2006-11-23 20:34 ` Ulrich Drepper
2006-11-24 10:58 ` Evgeniy Polyakov
2006-11-27 18:23 ` Ulrich Drepper
2006-11-28 10:13 ` Evgeniy Polyakov
2006-12-27 20:45 ` Ulrich Drepper
2006-12-28 9:50 ` Evgeniy Polyakov
2006-11-21 16:29 ` [take25 " Evgeniy Polyakov
2006-11-21 16:29 ` [take25 1/6] kevent: Description Evgeniy Polyakov
2006-11-21 16:29 ` [take25 2/6] kevent: Core files Evgeniy Polyakov
2006-11-21 16:29 ` [take25 3/6] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 4/6] kevent: Socket notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 5/6] kevent: Timer notifications Evgeniy Polyakov
2006-11-21 16:29 ` [take25 6/6] kevent: Pipe notifications Evgeniy Polyakov
2006-11-22 11:20 ` Eric Dumazet
2006-11-22 11:30 ` Evgeniy Polyakov
2006-11-22 23:46 ` [take25 1/6] kevent: Description Ulrich Drepper
2006-11-23 11:52 ` Evgeniy Polyakov
2006-11-23 19:45 ` Ulrich Drepper
2006-11-24 11:01 ` Evgeniy Polyakov
2006-11-24 16:06 ` Ulrich Drepper
2006-11-24 16:14 ` Evgeniy Polyakov
2006-11-24 16:31 ` Evgeniy Polyakov
2006-11-27 19:20 ` Ulrich Drepper
2006-11-22 23:52 ` Ulrich Drepper
2006-11-23 11:55 ` Evgeniy Polyakov
2006-11-23 20:00 ` Ulrich Drepper
2006-11-23 21:49 ` Hans Henrik Happe
2006-11-23 22:34 ` Ulrich Drepper
2006-11-24 11:50 ` Evgeniy Polyakov
2006-11-24 16:17 ` Ulrich Drepper
2006-11-24 11:46 ` Evgeniy Polyakov
2006-11-24 16:30 ` Ulrich Drepper
2006-11-24 16:49 ` Evgeniy Polyakov
2006-11-27 19:23 ` Ulrich Drepper
2006-11-23 22:33 ` Ulrich Drepper
2006-11-23 22:48 ` Jeff Garzik
2006-11-23 23:45 ` Ulrich Drepper
2006-11-24 0:48 ` Eric Dumazet
2006-11-24 8:14 ` Andrew Morton
2006-11-24 8:33 ` Eric Dumazet
2006-11-24 15:26 ` Ulrich Drepper
2006-11-24 0:14 ` Hans Henrik Happe
2006-11-24 12:05 ` Evgeniy Polyakov
2006-11-24 12:13 ` Evgeniy Polyakov
2006-11-27 19:43 ` Ulrich Drepper
2006-11-28 10:26 ` Evgeniy Polyakov
2006-11-30 19:14 ` [take26 0/8] kevent: Generic event handling mechanism Evgeniy Polyakov
2006-11-30 19:14 ` [take26 1/8] kevent: Description Evgeniy Polyakov
2006-11-30 19:14 ` [take26 2/8] kevent: Core files Evgeniy Polyakov
2006-11-30 19:14 ` [take26 3/8] kevent: poll/select() notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 4/8] kevent: Socket notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 5/8] kevent: Timer notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 6/8] kevent: Pipe notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 7/8] kevent: Signal notifications Evgeniy Polyakov
2006-11-30 19:14 ` [take26 8/8] kevent: Kevent posix timer notifications Evgeniy Polyakov