* Re: async network I/O, event channels, etc
       [not found] <44C66FC9.3050402@redhat.com>
@ 2006-07-25 22:01 ` David Miller
  2006-07-25 22:55   ` Nicholas Miell
  2006-07-26  6:28   ` Evgeniy Polyakov
  0 siblings, 2 replies; 73+ messages in thread

From: David Miller @ 2006-07-25 22:01 UTC (permalink / raw)
To: drepper; +Cc: linux-kernel, netdev

From: Ulrich Drepper <drepper@redhat.com>
Date: Tue, 25 Jul 2006 12:23:53 -0700

> I was very much surprised by the reactions I got after my OLS talk.
> Lots of people declared interest and even agreed with the approach and
> asked me to go further ahead with all this.  For those who missed it,
> the paper and the slides are available on my home page:
>
>     http://people.redhat.com/drepper/
>
> As for the next steps I see a number of possible ways.  The discussions
> can be held on the usual mailing lists (i.e., lkml and netdev), but due
> to the raw nature of the current proposal I would imagine that would be
> mainly perceived as noise.

Since I gave a big thumbs up for Evgeniy's kevent work yesterday on
linux-kernel, you might want to start by comparing your work to his.
His work has the advantage that 1) we have code now and 2) he has
written many test applications and performed many benchmarks against
his code, which has flushed out most of the major implementation
issues.

I think most of the people who have encouraged your work are unaware
of Evgeniy's kevent stuff, which is extremely unfortunate; the two
works are more similar than they are different.

I do not think discussing all of this on netdev would be perceived
as noise. :)

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc
  2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
@ 2006-07-25 22:55   ` Nicholas Miell
  2006-07-26  6:28   ` Evgeniy Polyakov
  1 sibling, 0 replies; 73+ messages in thread

From: Nicholas Miell @ 2006-07-25 22:55 UTC (permalink / raw)
To: David Miller; +Cc: drepper, linux-kernel, netdev

On Tue, 2006-07-25 at 15:01 -0700, David Miller wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
>
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to go further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> >
> >     http://people.redhat.com/drepper/
> >
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev), but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
>
> Since I gave a big thumbs up for Evgeniy's kevent work yesterday on
> linux-kernel, you might want to start by comparing your work to his.
> His work has the advantage that 1) we have code now and 2) he has
> written many test applications and performed many benchmarks against
> his code, which has flushed out most of the major implementation
> issues.
>
> I think most of the people who have encouraged your work are unaware
> of Evgeniy's kevent stuff, which is extremely unfortunate; the two
> works are more similar than they are different.
>
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

While the comparisons are being made, how does this compare to
Solaris's event ports interface? It's documented at
http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrir?a=view

Also, since we're on the subject, why a whole new interface for event
queuing instead of extending the existing io_getevents(2) and friends?

-- 
Nicholas Miell <nmiell@comcast.net>

^ permalink raw reply	[flat|nested] 73+ messages in thread
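(For reference: the Solaris event ports interface Nicholas refers to centers on three calls, port_create(3C), port_associate(3C) and port_get(3C). A minimal sketch based on their documented signatures, with error handling omitted; this is not code from this thread's patches:

#include <port.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>

static void wait_for_readable(int fd)
{
	int port = port_create();	/* the event queue itself is a file descriptor */

	/* Register interest in fd becoming readable; the last argument is an
	 * opaque cookie that is handed back together with the event. */
	port_associate(port, PORT_SOURCE_FD, (uintptr_t)fd, POLLIN, NULL);

	port_event_t pe;
	port_get(port, &pe, NULL);	/* NULL timeout: block until one event arrives */

	printf("fd %ld ready, events 0x%x\n",
	       (long)pe.portev_object, (unsigned)pe.portev_events);
}

A single port multiplexes several event sources — file descriptors, timers, AIO completions, user-defined events — which is what makes it a natural point of comparison with kevent.)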
* Re: async network I/O, event channels, etc
  2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
  2006-07-25 22:55   ` Nicholas Miell
@ 2006-07-26  6:28   ` Evgeniy Polyakov
  2006-07-26  9:18     ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
  2006-07-27  6:10     ` async network I/O, event channels, etc David Miller
  1 sibling, 2 replies; 73+ messages in thread

From: Evgeniy Polyakov @ 2006-07-26 6:28 UTC (permalink / raw)
To: David Miller; +Cc: drepper, linux-kernel, netdev

On Tue, Jul 25, 2006 at 03:01:22PM -0700, David Miller (davem@davemloft.net) wrote:
> From: Ulrich Drepper <drepper@redhat.com>
> Date: Tue, 25 Jul 2006 12:23:53 -0700
>
> > I was very much surprised by the reactions I got after my OLS talk.
> > Lots of people declared interest and even agreed with the approach and
> > asked me to go further ahead with all this.  For those who missed it,
> > the paper and the slides are available on my home page:
> >
> >     http://people.redhat.com/drepper/
> >
> > As for the next steps I see a number of possible ways.  The discussions
> > can be held on the usual mailing lists (i.e., lkml and netdev), but due
> > to the raw nature of the current proposal I would imagine that would be
> > mainly perceived as noise.
>
> Since I gave a big thumbs up for Evgeniy's kevent work yesterday on
> linux-kernel, you might want to start by comparing your work to his.
> His work has the advantage that 1) we have code now and 2) he has
> written many test applications and performed many benchmarks against
> his code, which has flushed out most of the major implementation
> issues.
>
> I think most of the people who have encouraged your work are unaware
> of Evgeniy's kevent stuff, which is extremely unfortunate; the two
> works are more similar than they are different.
>
> I do not think discussing all of this on netdev would be perceived
> as noise. :)

Hello David, Ulrich.

Here is a brief description of what kevent is and how it works.

The kevent subsystem incorporates several AIO/kqueue design notes and
ideas. Kevent can be used both for edge and level notifications. It
supports socket notifications (accept, send, recv), network AIO
(aio_send(), aio_recv() and aio_sendfile()), inode notifications
(create/remove), generic poll()/select() notifications and timer
notifications.

There are several objects in the kevent system:

storage - each source of events (socket, inode, timer, aio, anything)
has a struct kevent_storage embedded in it, which is basically a list
of registered interests for this source of events.

user - the abstraction which holds all requested kevents. It is
similar to FreeBSD's kqueue.

kevent - a set of interests for a given source of events (storage).

When a kevent is queued into a storage, it will live there until it is
removed by kevent_dequeue(). When some activity is noticed in a given
storage, the storage scans its kevent_storage->list for kevents which
match the activity event. If kevents are found and they are not
already in the kevent_user->ready_list, they will be added there at
the end.

ioctl(WAIT) (or the appropriate syscall) will wait until either the
requested number of kevents are ready, the timeout elapses, or at
least one kevent is ready; its behaviour depends on the parameters.
It is possible to have one-shot kevents, which are automatically
removed once they are ready.

Any event can be added/removed/modified by ioctl or by the special
controlling syscall.

Network AIO is based on kevent and works as a usual kevent storage on
top of the inode.
When a new socket is created, it is associated with that inode, and
when some activity is detected the appropriate notifications are
generated and kevent_naio_callback() is called.

When a new kevent is registered, the network AIO ->enqueue() callback
simply registers itself like a usual socket event watcher. It also
locks the physical userspace pages in memory and stores the
appropriate pointers in a private kevent structure. I have not created
additional DMA memory allocation methods, like Ulrich described in his
article, so I handle it inside NAIO, which has some overhead (I posted
a get_user_pages() scalability graph some time ago).

The network AIO callback gets pointers to the userspace pages and
tries to copy data from the receiving skb queue into them using a
protocol-specific callback. This callback is very similar to
->recvmsg(), so they could share a lot in the future (as far as I
recall it worked only with hardware capable of checksumming; I'm a
bit lazy).

Both the network and AIO implementations work on top of hooks inside
the appropriate state machines, rather than as a repeated-call design
(current AIO) or a special thread (SGI AIO). AIO work was stopped,
since I was unable to achieve the same speed as synchronous read
(maximum speeds were 2 GB/sec vs. 2.1 GB/sec for AIO and sync I/O
respectively when reading data from the cache).

Network aio_sendfile() works lazily - it asynchronously populates
pages into the VFS cache (which can be used for various tricks with
adaptive readahead) and then uses the usual ->sendfile() callback.

I have not created an interface for userspace events (like Solaris),
since right now I do not see its usefulness, but if there are
requirements for that it is quite easy to add with kevents.

I'm preparing a resend of the kevent patch set (with the cleanups
mentioned in previous e-mails), which will be ready shortly.

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. network aio homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

3. LWN.net published a very good article about kevent.
http://lwn.net/Articles/172844/

Thank you.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
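(To make the control flow described above concrete, here is a rough sketch of what userspace usage of this interface might look like, reconstructed from the struct kevent_user_control/struct ukevent layout and the KEVENT_CTL_* commands in the patches posted below. The id.raw[0] convention for sockets and the lack of error handling are assumptions, not taken from this thread:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/kevent.h>	/* struct ukevent, struct kevent_user_control,
				 * __NR_kevent_ctl — assumes these headers are
				 * exported to userspace from the patched tree */

/* A request buffer is a kevent_user_control header followed by ukevents. */
struct kevent_req {
	struct kevent_user_control ctl;
	struct ukevent ev[1];
};

static void watch_socket(int sk)
{
	struct kevent_user_control init = { .cmd = KEVENT_CTL_INIT };
	/* KEVENT_CTL_INIT ignores the fd argument and returns a new kevent fd. */
	int kfd = syscall(__NR_kevent_ctl, -1, &init);

	/* Queue one one-shot interest: "tell me when this socket has data". */
	struct kevent_req req = {
		.ctl = { .cmd = KEVENT_CTL_ADD, .num = 1 },
		.ev[0] = {
			.id.raw[0]	= sk,	/* presumably the socket descriptor */
			.type		= KEVENT_SOCKET,
			.event		= KEVENT_SOCKET_RECV,
			.req_flags	= KEVENT_REQ_ONESHOT,
		},
	};
	syscall(__NR_kevent_ctl, kfd, &req);

	/* Wait up to a second; ready ukevents are written back behind ctl,
	 * and ctl.num is updated to the number of returned events. */
	req.ctl = (struct kevent_user_control){
		.cmd = KEVENT_CTL_WAIT, .num = 1, .timeout = 1000,
	};
	syscall(__NR_kevent_ctl, kfd, &req);
}

After the WAIT call returns, the caller walks the returned ukevent array and checks each ret_flags field for KEVENT_RET_DONE/KEVENT_RET_BROKEN, as described in the patch comments below.)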
* [0/4] kevent: generic event processing subsystem.
  2006-07-26  6:28 ` Evgeniy Polyakov
@ 2006-07-26  9:18   ` Evgeniy Polyakov
  2006-07-26  9:18     ` [1/4] kevent: core files Evgeniy Polyakov
  2006-07-27  6:10   ` async network I/O, event channels, etc David Miller
  1 sibling, 1 reply; 73+ messages in thread

From: Evgeniy Polyakov @ 2006-07-26 9:18 UTC (permalink / raw)
To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

The kevent subsystem incorporates several AIO/kqueue design notes and
ideas. Kevent can be used both for edge and level notifications. It
supports socket notifications (accept, send, recv), network AIO
(aio_send(), aio_recv() and aio_sendfile()), inode notifications
(create/remove), generic poll()/select() notifications and timer
notifications.

There are several objects in the kevent system:

storage - each source of events (socket, inode, timer, aio, anything)
has a struct kevent_storage embedded in it, which is basically a list
of registered interests for this source of events.

user - the abstraction which holds all requested kevents. It is
similar to FreeBSD's kqueue.

kevent - a set of interests for a given source of events (storage).

When a kevent is queued into a storage, it will live there until it is
removed by kevent_dequeue(). When some activity is noticed in a given
storage, the storage scans its kevent_storage->list for kevents which
match the activity event. If kevents are found and they are not
already in the kevent_user->ready_list, they will be added there at
the end.

ioctl(WAIT) (or the appropriate syscall) will wait until either the
requested number of kevents are ready, the timeout elapses, or at
least one kevent is ready; its behaviour depends on the parameters.
It is possible to have one-shot kevents, which are automatically
removed once they are ready.

Any event can be added/removed/modified by ioctl or by the special
controlling syscall.

Network AIO is based on kevent and works as a usual kevent storage on
top of the inode. When a new socket is created, it is associated with
that inode, and when some activity is detected the appropriate
notifications are generated and kevent_naio_callback() is called.

When a new kevent is registered, the network AIO ->enqueue() callback
simply registers itself like a usual socket event watcher. It also
locks the physical userspace pages in memory and stores the
appropriate pointers in a private kevent structure. I have not created
additional DMA memory allocation methods, like Ulrich described in his
article, so I handle it inside NAIO, which has some overhead (I posted
a get_user_pages() scalability graph some time ago).

The network AIO callback gets pointers to the userspace pages and
tries to copy data from the receiving skb queue into them using a
protocol-specific callback. This callback is very similar to
->recvmsg(), so they could share a lot in the future (as far as I
recall it worked only with hardware capable of checksumming; I'm a
bit lazy).

Both the network and AIO implementations work on top of hooks inside
the appropriate state machines, rather than as a repeated-call design
(current AIO) or a special thread (SGI AIO). AIO work was stopped,
since I was unable to achieve the same speed as synchronous read
(maximum speeds were 2 GB/sec vs. 2.1 GB/sec for AIO and sync I/O
respectively when reading data from the cache).

Network aio_sendfile() works lazily - it asynchronously populates
pages into the VFS cache (which can be used for various tricks with
adaptive readahead) and then uses the usual ->sendfile() callback.
I have not created an interface for userspace events (like Solaris),
since right now I do not see its usefulness, but if there are
requirements for that it is quite easy to add with kevents.

The patches currently include ifdefs, and kevent can be disabled in
the config; when things are settled, that can be removed.

1. kevent homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

2. network aio homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

3. LWN.net published a very good article about kevent.
http://lwn.net/Articles/172844/

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

^ permalink raw reply	[flat|nested] 73+ messages in thread
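(On the kernel side of this design, an event origin is anything that embeds a struct kevent_storage and announces activity through it. A schematic sketch distilled from the storage helpers in the patch below; struct my_origin and its functions are illustrative, not from the patch:

#include <linux/kevent.h>
#include <linux/kevent_storage.h>

/* Any object that wants to post events embeds a kevent_storage, which
 * holds the list of kevents queued against this origin. */
struct my_origin {
	struct kevent_storage st;
	/* ... object-specific state ... */
};

static int my_origin_setup(struct my_origin *o)
{
	return kevent_storage_init(o, &o->st);
}

/* Called from the object's state machine when something happens: every
 * queued kevent whose requested event mask matches is run through its
 * callback and, if it fired, moved to its owner's ready list, waking
 * any waiters. */
static void my_origin_data_arrived(struct my_origin *o)
{
	kevent_storage_ready(&o->st, NULL, KEVENT_SOCKET_RECV);
}

The kevent_socket_notify() calls sprinkled through the networking hunks below presumably reduce to exactly this pattern on the storage attached to the socket.)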
* [1/4] kevent: core files.
  2006-07-26  9:18 ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
@ 2006-07-26  9:18   ` Evgeniy Polyakov
  2006-07-26  9:18     ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
  ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread

From: Evgeniy Polyakov @ 2006-07-26 9:18 UTC (permalink / raw)
To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

This patch includes the core kevent files:
 - userspace controlling
 - kernelspace interfaces
 - initialization
 - notification state machines

It might also include parts from other subsystems (like network-related
syscalls), so it is possible that it will not compile without the other
patches applied.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index af56987..93e23ff 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -316,3 +316,7 @@ ENTRY(sys_call_table) .long sys_sync_file_range .long sys_tee /* 315 */ .long sys_vmsplice + .long sys_aio_recv + .long sys_aio_send + .long sys_aio_sendfile + .long sys_kevent_ctl diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5a92fed..534d516 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -696,4 +696,8 @@ #endif .quad sys_sync_file_range .quad sys_tee .quad compat_sys_vmsplice + .quad sys_aio_recv + .quad sys_aio_send + .quad sys_aio_sendfile + .quad sys_kevent_ctl ia32_syscall_end: diff --git a/include/asm-i386/socket.h b/include/asm-i386/socket.h index 802ae76..3473f5c 100644 --- a/include/asm-i386/socket.h +++ b/include/asm-i386/socket.h @@ -49,4 +49,6 @@ #define SO_ACCEPTCONN 30 #define SO_PEERSEC 31 +#define SO_ASYNC_SOCK 34 + #endif /* _ASM_SOCKET_H */ diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index de2ccc1..52f8642 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -322,10 +322,14 @@ #define __NR_splice 313 #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 +#define __NR_aio_recv 317 +#define __NR_aio_send 318 +#define __NR_aio_sendfile 319 +#define __NR_kevent_ctl 320 #ifdef __KERNEL__ -#define NR_syscalls 317 +#define NR_syscalls 321 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/socket.h b/include/asm-x86_64/socket.h index f2cdbea..1f31f86 100644 --- a/include/asm-x86_64/socket.h +++ b/include/asm-x86_64/socket.h @@ -49,4 +49,6 @@ #define SO_ACCEPTCONN 30 #define SO_PEERSEC 31 +#define SO_ASYNC_SOCK 34 + #endif /* _ASM_SOCKET_H */ diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 0aff22b..352c34b 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -617,11 +617,18 @@ #define __NR_sync_file_range 277 __SYSCALL(__NR_sync_file_range, sys_sync_file_range) #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) +#define __NR_aio_recv 279 +__SYSCALL(__NR_aio_recv, sys_aio_recv) +#define __NR_aio_send 280 +__SYSCALL(__NR_aio_send, sys_aio_send) +#define __NR_aio_sendfile 281 +__SYSCALL(__NR_aio_sendfile, sys_aio_sendfile) +#define __NR_kevent_ctl 282 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_vmsplice - +#define __NR_syscall_max __NR_kevent_ctl #ifndef __NO_STUBS /* user-visible error numbers are in the range -1 - -4095 */ diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index
0000000..e94a7bf --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,263 @@ +/* + * kevent.h + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H + +/* + * Kevent request flags. + */ + +#define KEVENT_REQ_ONESHOT 0x1 /* Process this event only once and then dequeue. */ + +/* + * Kevent return flags. + */ +#define KEVENT_RET_BROKEN 0x1 /* Kevent is broken. */ +#define KEVENT_RET_DONE 0x2 /* Kevent processing was finished successfully. */ + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * Number of per-event sets should be exactly as number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff /* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 /* Empty mask of ready events. */ + +struct kevent_id +{ + __u32 raw[2]; +}; + +struct ukevent +{ + struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */ + __u32 type; /* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 event; /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */ + __u32 req_flags; /* Per-event request flags */ + __u32 ret_flags; /* Per-event return flags */ + __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */ + union { + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */ + void *ptr; + }; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 +#define KEVENT_CTL_WAIT 3 +#define KEVENT_CTL_INIT 4 + +struct kevent_user_control +{ + unsigned int cmd; /* Control command, e.g. KEVENT_CTL_ADD, KEVENT_CTL_REMOVE... */ + unsigned int num; /* Number of ukevents this structure controls.
*/ + unsigned int timeout; /* Timeout in milliseconds waiting for "num" events to become ready. */ +}; + +#define KEVENT_USER_SYMBOL 'K' +#define KEVENT_USER_CTL _IOWR(KEVENT_USER_SYMBOL, 0, struct kevent_user_control) +#define KEVENT_USER_WAIT _IOWR(KEVENT_USER_SYMBOL, 1, struct kevent_user_control) + +#ifdef __KERNEL__ + +#include <linux/types.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/kevent_storage.h> +#include <asm/semaphore.h> + +struct inode; +struct dentry; +struct sock; + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +struct kevent +{ + struct ukevent event; + spinlock_t lock; /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + + struct list_head kevent_entry; /* Entry of user's queue. */ + struct list_head storage_entry; /* Entry of origin's queue. */ + struct list_head ready_entry; /* Entry of user's ready. */ + + struct kevent_user *user; /* User who requested this kevent. */ + struct kevent_storage *st; /* Kevent container. */ + + kevent_callback_t callback; /* Is called each time new event has been caught. */ + kevent_callback_t enqueue; /* Is called each time new event is queued. */ + kevent_callback_t dequeue; /* Is called each time event is dequeued. */ + + void *priv; /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ +}; + +#define KEVENT_HASH_MASK 0xff + +struct kevent_list +{ + struct list_head kevent_list; /* List of all kevents. */ + spinlock_t kevent_lock; /* Protects all manipulations with queue of kevents. */ +}; + +struct kevent_user +{ + struct kevent_list kqueue[KEVENT_HASH_MASK+1]; + unsigned int kevent_num; /* Number of queued kevents. */ + + struct list_head ready_list; /* List of ready kevents. */ + unsigned int ready_num; /* Number of ready kevents. */ + spinlock_t ready_lock; /* Protects all manipulations with ready queue. */ + + unsigned int max_ready_num; /* Requested number of kevents. */ + + struct semaphore ctl_mutex; /* Protects against simultaneous kevent_user control manipulations. */ + struct semaphore wait_mutex; /* Protects against simultaneous kevent_user waits. */ + wait_queue_head_t wait; /* Wait until some events are ready. */ + + atomic_t refcnt; /* Reference counter, increased for each new kevent. 
*/ +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num; + unsigned long total; +#endif +}; + +#define KEVENT_MAX_REQUESTS PAGE_SIZE/sizeof(struct kevent) + +struct kevent *kevent_alloc(gfp_t mask); +void kevent_free(struct kevent *k); +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); + +#define list_for_each_entry_reverse_safe(pos, n, head, member) \ + for (pos = list_entry((head)->prev, typeof(*pos), member), \ + n = list_entry(pos->member.prev, typeof(*pos), member); \ + prefetch(pos->member.prev), &pos->member != (head); \ + pos = n, n = list_entry(pos->member.prev, typeof(*pos), member)) + +int kevent_break(struct kevent *k); +int kevent_init(struct kevent *k); + +int kevent_init_socket(struct kevent *k); +int kevent_init_inode(struct kevent *k); +int kevent_init_timer(struct kevent *k); +int kevent_init_poll(struct kevent *k); +int kevent_init_naio(struct kevent *k); +int kevent_init_aio(struct kevent *k); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_INODE +void kevent_inode_notify(struct inode *inode, u32 event); +void kevent_inode_notify_parent(struct dentry *dentry, u32 event); +void kevent_inode_remove(struct inode *inode); +#else +static inline void kevent_inode_notify(struct inode *inode, u32 event) +{ +} +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event) +{ +} +static inline void kevent_inode_remove(struct inode *inode) +{ +} +#endif /* CONFIG_KEVENT_INODE */ +#ifdef CONFIG_KEVENT_SOCKET + +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) 0 +#endif +#endif /* __KERNEL__ */ +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..bd891f0 --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,12 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + unsigned int qlen; /* Number of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 66f8819..ea914c3 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1269,6 +1269,8 @@ extern struct sk_buff *skb_recv_datagram int noblock, int *err); extern unsigned int datagram_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); +extern int skb_copy_datagram(const struct sk_buff *from, + int offset, void *dst, int size); extern int skb_copy_datagram_iovec(const struct sk_buff *from, int offset, struct iovec *to, int size); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index bd67a44..33d436e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -587,4 +587,8 @@ asmlinkage long sys_get_robust_list(int asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); +asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags); +asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags); +asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, size_t size, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, void __user *buf); #endif diff --git a/include/net/sock.h b/include/net/sock.h index d10dfec..7a2bee3 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -47,6 +47,7 @@ #include <linux/module.h> #include <linux/netdevice.h> #include <linux/skbuff.h> /* struct sk_buff */ #include <linux/security.h> +#include <linux/kevent.h> #include <linux/filter.h> @@ -386,6 +387,8 @@ enum sock_flags { SOCK_NO_LARGESEND, /* whether to sent large segments or not */ SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */ SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */ + SOCK_ASYNC, + SOCK_ASYNC_INUSE, }; static inline void sock_copy_flags(struct sock *nsk, struct sock *osk) @@ -445,6 +448,21 @@ static inline int sk_stream_memory_free( extern void sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -472,6 +490,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -543,6 +562,12 @@ struct proto { int (*backlog_rcv) (struct sock *sk, struct sk_buff *skb); + + int (*async_recv) (struct sock *sk, + void *dst, size_t size); + int (*async_send) (struct sock *sk, + struct page **pages, unsigned int poffset, + size_t size); /* Keeping track of sk's, looking them up, and port selection methods. 
*/ void (*hash)(struct sock *sk); @@ -674,21 +699,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 5f4eb5c..820cd5a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -364,6 +364,8 @@ extern int compat_tcp_setsockopt(struc int level, int optname, char __user *optval, int optlen); extern void tcp_set_keepalive(struct sock *sk, int val); +extern int tcp_async_recv(struct sock *sk, void *dst, size_t size); +extern int tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t size); extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t len, int nonblock, @@ -857,6 +859,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/init/Kconfig b/init/Kconfig index df864a3..6135afc 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -185,6 +185,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. +source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index f6ef00f..eb057ea 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -36,6 +36,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..88b35af --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,57 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback invocations, + advanced timer notifications and other kernel object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + default N + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents which are ready + immediately at insertion time and number of kevents which were removed through + readiness completion. It will be printed each time control kevent descriptor + is closed. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, ready for accept + conditions and so on. 
+ +config KEVENT_INODE + bool "Kernel event notifications for inodes" + depends on KEVENT + help + This option enables notifications through KEVENT subsystem of + inode operations, like file creation, removal and so on. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() notifications. + +config KEVENT_NAIO + bool "Network asynchronous IO" + depends on KEVENT && KEVENT_SOCKET + help + This option enables kevent based network asynchronous IO subsystem. + +config KEVENT_AIO + bool "Asynchronous IO" + depends on KEVENT + help + This option allows to use kevent subsystem for AIO operations. + AIO read is currently supported. diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..7dcd651 --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,7 @@ +obj-y := kevent.o kevent_user.o kevent_init.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o +obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o +obj-$(CONFIG_KEVENT_AIO) += kevent_aio.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..f699a13 --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,260 @@ +/* + * kevent.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +static kmem_cache_t *kevent_cache; + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + if (!k->enqueue) { + kevent_break(k); + return -EINVAL; + } + + return k->enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + if (!k->dequeue) { + kevent_break(k); + return -EINVAL; + } + + return k->dequeue(k); +} + +/* + * Must be called before event is going to be added into some origin's queue. 
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. + */ +int kevent_init(struct kevent *k) +{ + int err; + + spin_lock_init(&k->lock); + k->kevent_entry.next = LIST_POISON1; + k->storage_entry.next = LIST_POISON1; + k->ready_entry.next = LIST_POISON1; + + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + switch (k->event.type) { + case KEVENT_NAIO: + err = kevent_init_naio(k); + break; + case KEVENT_SOCKET: + err = kevent_init_socket(k); + break; + case KEVENT_INODE: + err = kevent_init_inode(k); + break; + case KEVENT_TIMER: + err = kevent_init_timer(k); + break; + case KEVENT_POLL: + err = kevent_init_poll(k); + break; + case KEVENT_AIO: + err = kevent_init_aio(k); + break; + default: + err = -ENODEV; + } + + return err; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail(&k->storage_entry, &st->list); + st->qlen++; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->storage_entry.next != LIST_POISON1) { + list_del(&k->storage_entry); + st->qlen--; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +static void __kevent_requeue(struct kevent *k, u32 event) +{ + int err, rem = 0; + unsigned long flags; + + err = k->callback(k); + + spin_lock_irqsave(&k->lock, flags); + if (err > 0) { + k->event.ret_flags |= KEVENT_RET_DONE; + } else if (err < 0) { + k->event.ret_flags |= KEVENT_RET_BROKEN; + k->event.ret_flags |= KEVENT_RET_DONE; + } + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + if (!err) + err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + spin_unlock_irqrestore(&k->lock, flags); + + if (err) { + if (rem) { + list_del(&k->storage_entry); + k->st->qlen--; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (k->ready_entry.next == LIST_POISON1) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } +} + +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. 
+ */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k, *n; + + spin_lock(&st->lock); + list_for_each_entry_safe(k, n, &st->list, storage_entry) { + if (ready_callback) + ready_callback(k); + + if (event & k->event.event) + __kevent_requeue(k, event); + } + spin_unlock(&st->lock); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + st->qlen = 0; + INIT_LIST_HEAD(&st->list); + return 0; +} + +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} + +struct kevent *kevent_alloc(gfp_t mask) +{ + struct kevent *k; + + if (kevent_cache) + k = kmem_cache_alloc(kevent_cache, mask); + else + k = kzalloc(sizeof(struct kevent), mask); + + return k; +} + +void kevent_free(struct kevent *k) +{ + memset(k, 0xab, sizeof(struct kevent)); + + if (kevent_cache) + kmem_cache_free(kevent_cache, k); + else + kfree(k); +} + +int __init kevent_sys_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, 0, NULL, NULL); + if (!kevent_cache) + err = -ENOMEM; + + return err; +} + +late_initcall(kevent_sys_init); diff --git a/kernel/kevent/kevent_init.c b/kernel/kevent/kevent_init.c new file mode 100644 index 0000000..ec95114 --- /dev/null +++ b/kernel/kevent/kevent_init.c @@ -0,0 +1,85 @@ +/* + * kevent_init.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/errno.h> +#include <linux/kevent.h> + +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->lock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->lock, flags); + return 0; +} + +#ifndef CONFIG_KEVENT_SOCKET +int kevent_init_socket(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_INODE +int kevent_init_inode(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_TIMER +int kevent_init_timer(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_POLL +int kevent_init_poll(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_NAIO +int kevent_init_naio(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_AIO +int kevent_init_aio(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..2f71fe4 --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,728 @@ +/* + * kevent_user.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/jhash.h> +#include <asm/uaccess.h> +#include <asm/semaphore.h> + +static struct class *kevent_user_class; +static char kevent_name[] = "kevent"; +static int kevent_user_major; + +static int kevent_user_open(struct inode *, struct file *); +static int kevent_user_release(struct inode *, struct file *); +static int kevent_user_ioctl(struct inode *, struct file *, + unsigned int, unsigned long); +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *); + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .ioctl = kevent_user_ioctl, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct super_block *kevent_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + /* So original magic... 
*/ + return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef); +} + +static struct file_system_type kevent_fs_type = { + .name = kevent_name, + .get_sb = kevent_get_sb, + .kill_sb = kill_anon_super, +}; + +static struct vfsmount *kevent_mnt; + +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static struct kevent_user *kevent_user_alloc(void) +{ + struct kevent_user *u; + int i; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return NULL; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + u->ready_num = 0; +#ifdef CONFIG_KEVENT_USER_STAT + u->wait_num = u->im_num = u->total = 0; +#endif + for (i=0; i<KEVENT_HASH_MASK+1; ++i) { + INIT_LIST_HEAD(&u->kqueue[i].kevent_list); + spin_lock_init(&u->kqueue[i].kevent_lock); + } + u->kevent_num = 0; + + init_MUTEX(&u->ctl_mutex); + init_MUTEX(&u->wait_mutex); + init_waitqueue_head(&u->wait); + u->max_ready_num = 0; + + atomic_set(&u->refcnt, 1); + + return u; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u = kevent_user_alloc(); + + if (!u) + return -ENOMEM; + + file->private_data = u; + + return 0; +} + +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { +#ifdef CONFIG_KEVENT_USER_STAT + printk("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", + __func__, u, u->wait_num, u->im_num, u->total); +#endif + kfree(u); + } +} + +#if 0 +static inline unsigned int kevent_user_hash(struct ukevent *uk) +{ + unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]); + + h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff; + h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK; + + return h; +} +#else +static inline unsigned int kevent_user_hash(struct ukevent *uk) +{ + return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK; +} +#endif + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int lock, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + if (lock) { + unsigned int hash = kevent_user_hash(&k->event); + struct kevent_list *l = &u->kqueue[hash]; + + spin_lock_irqsave(&l->kevent_lock, flags); + list_del(&k->kevent_entry); + u->kevent_num--; + spin_unlock_irqrestore(&l->kevent_lock, flags); + } else { + list_del(&k->kevent_entry); + u->kevent_num--; + } + + if (deq) + kevent_dequeue(k); + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->ready_entry.next != LIST_POISON1) { + list_del(&k->ready_entry); + u->ready_num--; + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + kevent_user_put(u); + kevent_free(k); +} + +/* + * Dequeue one entry from user's ready queue. 
+ */ +static struct kevent *__kqueue_dequeue_one_ready(struct list_head *q, + unsigned int *qlen) +{ + struct kevent *k = NULL; + unsigned int len = *qlen; + + if (len && !list_empty(q)) { + k = list_entry(q->next, struct kevent, ready_entry); + list_del(&k->ready_entry); + *qlen = len - 1; + } + + return k; +} + +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k; + + spin_lock_irqsave(&u->ready_lock, flags); + k = __kqueue_dequeue_one_ready(&u->ready_list, &u->ready_num); + spin_unlock_irqrestore(&u->ready_lock, flags); + + return k; +} + +static struct kevent *__kevent_search(struct kevent_list *l, struct ukevent *uk, + struct kevent_user *u) +{ + struct kevent *k; + int found = 0; + + list_for_each_entry(k, &l->kevent_list, kevent_entry) { + spin_lock(&k->lock); + if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] && + k->event.id.raw[0] == uk->id.raw[0] && + k->event.id.raw[1] == uk->id.raw[1]) { + found = 1; + spin_unlock(&k->lock); + break; + } + spin_unlock(&k->lock); + } + + return (found)?k:NULL; +} + +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + unsigned int hash = kevent_user_hash(uk); + struct kevent_list *l = &u->kqueue[hash]; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&l->kevent_lock, flags); + k = __kevent_search(l, uk, u); + if (k) { + spin_lock(&k->lock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->lock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&l->kevent_lock, flags); + + return err; +} + +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned int hash = kevent_user_hash(uk); + struct kevent_list *l = &u->kqueue[hash]; + unsigned long flags; + + spin_lock_irqsave(&l->kevent_lock, flags); + k = __kevent_search(l, uk, u); + if (k) { + kevent_finish_user(k, 0, 1); + err = 0; + } + spin_unlock_irqrestore(&l->kevent_lock, flags); + + return err; +} + +/* + * No new entry can be added or removed from any list at this point. + * It is not permitted to call ->ioctl() and ->release() in parallel. 
+ */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k, *n; + int i; + + for (i=0; i<KEVENT_HASH_MASK+1; ++i) { + struct kevent_list *l = &u->kqueue[i]; + + list_for_each_entry_safe(k, n, &l->kevent_list, kevent_entry) + kevent_finish_user(k, 1, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +static int kevent_user_ctl_modify(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + arg += sizeof(struct ukevent); + } + + up(&u->ctl_mutex); + + return err; +} + +static int kevent_user_ctl_remove(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + arg += sizeof(struct ukevent); + } + + up(&u->ctl_mutex); + + return err; +} + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kevent_alloc(GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kevent_free(k); + goto err_out_exit; + } + k->user = u; +#ifdef CONFIG_KEVENT_USER_STAT + u->total++; +#endif + { + unsigned long flags; + unsigned int hash = kevent_user_hash(&k->event); + struct kevent_list *l = &u->kqueue[hash]; + + spin_lock_irqsave(&l->kevent_lock, flags); + list_add_tail(&k->kevent_entry, &l->kevent_list); + u->kevent_num++; + kevent_user_get(u); + spin_unlock_irqrestore(&l->kevent_lock, flags); + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + if (err < 0) + uk->ret_flags |= KEVENT_RET_BROKEN; + uk->ret_flags |= KEVENT_RET_DONE; + kevent_finish_user(k, 1, 0); + } + +err_out_exit: + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * If something goes wrong, all events will be dequeued and + * negative error will be returned. + * On success zero is returned and + * ctl->num will be a number of finished events, either completed or failed. + * Array of finished events (struct ukevent) will be placed behind + * kevent_user_control structure. User must run through that array and check + * ret_flags field of each ukevent structure to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, cerr = 0, num = 0, knum = 0, i; + void __user *orig, *ctl_addr; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + orig = arg; + ctl_addr = arg - sizeof(struct kevent_user_control); +#if 1 + err = -ENFILE; + if (u->kevent_num + ctl->num >= 1024) + goto err_out_remove; +#endif + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EINVAL; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { +#ifdef CONFIG_KEVENT_USER_STAT + u->im_num++; +#endif + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) + cerr = -EINVAL; + orig += sizeof(struct ukevent); + num++; + } else + knum++; + } + + if (cerr < 0) + goto err_out_remove; + + ctl->num = num; + if (copy_to_user(ctl_addr, ctl, sizeof(struct kevent_user_control))) + cerr = -EINVAL; + + if (cerr) + err = cerr; + if (!err) + err = num; + +err_out_remove: + up(&u->ctl_mutex); + + return err; +} + +/* + * Waits until at least ctl->ready_num events are ready or timeout and returns + * number of ready events (in case of timeout) or number of requested events. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + struct kevent *k; + int cerr = 0, num = 0; + void __user *ptr = arg + sizeof(struct kevent_user_control); + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + if (!(file->f_flags & O_NONBLOCK)) { + if (ctl->timeout) + wait_event_interruptible_timeout(u->wait, + u->ready_num >= ctl->num, msecs_to_jiffies(ctl->timeout)); + else + wait_event_interruptible_timeout(u->wait, + u->ready_num > 0, msecs_to_jiffies(1000)); + } + while (num < ctl->num && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(ptr + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + cerr = -EINVAL; + + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. 
+ */ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + kevent_finish_user(k, 1, 1); + ++num; +#ifdef CONFIG_KEVENT_USER_STAT + u->wait_num++; +#endif + } + + ctl->num = num; + if (copy_to_user(arg, ctl, sizeof(struct kevent_user_control))) + cerr = -EINVAL; + + up(&u->ctl_mutex); + + return (cerr)?cerr:num; +} + +static int kevent_ctl_init(void) +{ + struct kevent_user *u; + struct file *file; + int fd, ret; + + fd = get_unused_fd(); + if (fd < 0) + return fd; + + file = get_empty_filp(); + if (!file) { + ret = -ENFILE; + goto out_put_fd; + } + + u = kevent_user_alloc(); + if (unlikely(!u)) { + ret = -ENOMEM; + goto out_put_file; + } + + file->f_op = &kevent_user_fops; + file->f_vfsmnt = mntget(kevent_mnt); + file->f_dentry = dget(kevent_mnt->mnt_root); + file->f_mapping = file->f_dentry->d_inode->i_mapping; + file->f_mode = FMODE_READ; + file->f_flags = O_RDONLY; + file->private_data = u; + + fd_install(fd, file); + + return fd; + +out_put_file: + put_filp(file); +out_put_fd: + put_unused_fd(fd); + return ret; +} + +static int kevent_ctl_process(struct file *file, + struct kevent_user_control *ctl, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + if (!u) + return -EINVAL; + + switch (ctl->cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_WAIT: + err = kevent_user_wait(file, u, ctl, arg); + break; + case KEVENT_CTL_INIT: + err = kevent_ctl_init(); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +asmlinkage long sys_kevent_ctl(int fd, void __user *arg) +{ + int err, fput_needed; + struct kevent_user_control ctl; + struct file *file; + + if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control))) + return -EINVAL; + + if (ctl.cmd == KEVENT_CTL_INIT) + return kevent_ctl_init(); + + file = fget_light(fd, &fput_needed); + if (!file) + return -ENODEV; + + err = kevent_ctl_process(file, &ctl, arg); + + fput_light(file, fput_needed); + return err; +} + +static int kevent_user_ioctl(struct inode *inode, struct file *file, + unsigned int cmd, unsigned long arg) +{ + int err = -ENODEV; + struct kevent_user_control ctl; + struct kevent_user *u = file->private_data; + void __user *ptr = (void __user *)arg; + + if (copy_from_user(&ctl, ptr, sizeof(struct kevent_user_control))) + return -EINVAL; + + switch (cmd) { + case KEVENT_USER_CTL: + err = kevent_ctl_process(file, &ctl, ptr); + break; + case KEVENT_USER_WAIT: + err = kevent_user_wait(file, u, &ctl, ptr); + break; + default: + break; + } + + return err; +} + +static int __devinit kevent_user_init(void) +{ + struct class_device *dev; + int err = 0; + + err = register_filesystem(&kevent_fs_type); + if (err) + panic("%s: failed to register filesystem: err=%d.\n", + kevent_name, err); + + kevent_mnt = kern_mount(&kevent_fs_type); + if (IS_ERR(kevent_mnt)) + panic("%s: failed to mount filesystem: err=%ld.\n", + kevent_name, PTR_ERR(kevent_mnt)); + + kevent_user_major = register_chrdev(0, kevent_name, &kevent_user_fops); + if (kevent_user_major < 0) { + printk(KERN_ERR "Failed to register \"%s\" char device: err=%d.\n", + kevent_name, kevent_user_major); + return -ENODEV; + } + + kevent_user_class = class_create(THIS_MODULE, "kevent"); + if (IS_ERR(kevent_user_class)) { + printk(KERN_ERR
"Failed to register \"%s\" class: err=%ld.\n", + kevent_name, PTR_ERR(kevent_user_class)); + err = PTR_ERR(kevent_user_class); + goto err_out_unregister; + } + + dev = class_device_create(kevent_user_class, NULL, + MKDEV(kevent_user_major, 0), NULL, kevent_name); + if (IS_ERR(dev)) { + printk(KERN_ERR "Failed to create %d.%d class device in \"%s\" class: err=%ld.\n", + kevent_user_major, 0, kevent_name, PTR_ERR(dev)); + err = PTR_ERR(dev); + goto err_out_class_destroy; + } + + printk("KEVENT subsystem: chardev helper: major=%d.\n", kevent_user_major); + + return 0; + +err_out_class_destroy: + class_destroy(kevent_user_class); +err_out_unregister: + unregister_chrdev(kevent_user_major, kevent_name); + + return err; +} + +static void __devexit kevent_user_fini(void) +{ + class_device_destroy(kevent_user_class, MKDEV(kevent_user_major, 0)); + class_destroy(kevent_user_class); + unregister_chrdev(kevent_user_major, kevent_name); + mntput(kevent_mnt); + unregister_filesystem(&kevent_fs_type); +} + +module_init(kevent_user_init); +module_exit(kevent_user_fini); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 5433195..dcbacf5 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -121,6 +121,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_aio_recv); +cond_syscall(sys_aio_send); +cond_syscall(sys_aio_sendfile); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); diff --git a/net/core/datagram.c b/net/core/datagram.c index aecddcc..493245b 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -236,6 +236,60 @@ void skb_kill_datagram(struct sock *sk, EXPORT_SYMBOL(skb_kill_datagram); /** + * skb_copy_datagram - Copy a datagram. + * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: pointer to copy to + * @len: amount of data to copy from buffer to iovec + */ +int skb_copy_datagram(const struct sk_buff *skb, int offset, + void *to, int len) +{ + int i, fraglen, end = 0; + struct sk_buff *next = skb_shinfo(skb)->frag_list; + + if (!len) + return 0; + +next_skb: + fraglen = skb_headlen(skb); + i = -1; + + while (1) { + int start = end; + + if ((end += fraglen) > offset) { + int copy = end - offset, o = offset - start; + + if (copy > len) + copy = len; + if (i == -1) + memcpy(to, skb->data + o, copy); + else { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = frag->page; + void *p = kmap(page) + frag->page_offset + o; + memcpy(to, p, copy); + kunmap(page); + } + if (!(len -= copy)) + return 0; + offset += copy; + } + if (++i >= skb_shinfo(skb)->nr_frags) + break; + fraglen = skb_shinfo(skb)->frags[i].size; + } + if (next) { + skb = next; + BUG_ON(skb_shinfo(skb)->frag_list); + next = skb->next; + goto next_skb; + } + return -EFAULT; +} + +/** * skb_copy_datagram_iovec - Copy a datagram to an iovec. 
* @skb: buffer to copy * @offset: offset in the buffer to start copying from @@ -530,6 +584,7 @@ unsigned int datagram_poll(struct file * EXPORT_SYMBOL(datagram_poll); EXPORT_SYMBOL(skb_copy_and_csum_datagram_iovec); +EXPORT_SYMBOL(skb_copy_datagram); EXPORT_SYMBOL(skb_copy_datagram_iovec); EXPORT_SYMBOL(skb_free_datagram); EXPORT_SYMBOL(skb_recv_datagram); diff --git a/net/core/sock.c b/net/core/sock.c index 5d820c3..3345048 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -564,6 +564,16 @@ #endif spin_unlock_bh(&sk->sk_lock.slock); ret = -ENONET; break; +#ifdef CONFIG_KEVENT_SOCKET + case SO_ASYNC_SOCK: + spin_lock_bh(&sk->sk_lock.slock); + if (valbool) + sock_set_flag(sk, SOCK_ASYNC); + else + sock_reset_flag(sk, SOCK_ASYNC); + spin_unlock_bh(&sk->sk_lock.slock); + break; +#endif /* We implement the SO_SNDLOWAT etc to not be settable (1003.1g 5.3) */ @@ -1313,6 +1323,7 @@ static void sock_def_wakeup(struct sock if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible_all(sk->sk_sleep); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_error_report(struct sock *sk) @@ -1322,6 +1333,7 @@ static void sock_def_error_report(struct wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,0,POLL_ERR); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_readable(struct sock *sk, int len) @@ -1331,6 +1343,7 @@ static void sock_def_readable(struct soc wake_up_interruptible(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); } static void sock_def_write_space(struct sock *sk) @@ -1350,6 +1363,7 @@ static void sock_def_write_space(struct } read_unlock(&sk->sk_callback_lock); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } static void sock_def_destruct(struct sock *sk) @@ -1454,8 +1468,10 @@ void fastcall release_sock(struct sock * if (sk->sk_backlog.tail) __release_sock(sk); sk->sk_lock.owner = NULL; - if (waitqueue_active(&(sk->sk_lock.wq))) + if (waitqueue_active(&(sk->sk_lock.wq))) { wake_up(&(sk->sk_lock.wq)); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); + } spin_unlock_bh(&(sk->sk_lock.slock)); } EXPORT_SYMBOL(release_sock); diff --git a/net/core/stream.c b/net/core/stream.c index e948969..91e2e07 100644 --- a/net/core/stream.c +++ b/net/core/stream.c @@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock * wake_up_interruptible(sk->sk_sleep); if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN)) sock_wake_async(sock, 2, POLL_OUT); + kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV); } } diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 74998f2..403d33e 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -206,6 +206,7 @@ * lingertime == 0 (RFC 793 ABORT Call) * Hirokazu Takahashi : Use copy_from_user() instead of * csum_and_copy_from_user() if possible. + * Evgeniy Polyakov : Network asynchronous IO. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -1085,6 +1086,275 @@ int tcp_read_sock(struct sock *sk, read_ } /* + * Must be called with locked sock. + */ +int tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t len) +{ + struct tcp_sock *tp = tcp_sk(sk); + int mss_now, size_goal; + int err = -EAGAIN; + ssize_t copied; + + /* Wait for a connection to finish. 
*/ + if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) + goto out_err; + + clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags); + + mss_now = tcp_current_mss(sk, 1); + size_goal = tp->xmit_size_goal; + copied = 0; + + err = -EPIPE; + if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN) || sock_flag(sk, SOCK_DONE) || + (sk->sk_state == TCP_CLOSE) || (atomic_read(&sk->sk_refcnt) == 1)) + goto do_error; + + while (len > 0) { + struct sk_buff *skb = sk->sk_write_queue.prev; + struct page *page = pages[poffset / PAGE_SIZE]; + int copy, i, can_coalesce; + int offset = poffset % PAGE_SIZE; + int size = min_t(size_t, len, PAGE_SIZE - offset); + + if (!sk->sk_send_head || (copy = size_goal - skb->len) <= 0) { +new_segment: + if (!sk_stream_memory_free(sk)) + goto wait_for_sndbuf; + + skb = sk_stream_alloc_pskb(sk, 0, 0, + sk->sk_allocation); + if (!skb) + goto wait_for_memory; + + skb_entail(sk, tp, skb); + copy = size_goal; + } + + if (copy > size) + copy = size; + + i = skb_shinfo(skb)->nr_frags; + can_coalesce = skb_can_coalesce(skb, i, page, offset); + if (!can_coalesce && i >= MAX_SKB_FRAGS) { + tcp_mark_push(tp, skb); + goto new_segment; + } + if (!sk_stream_wmem_schedule(sk, copy)) + goto wait_for_memory; + + if (can_coalesce) { + skb_shinfo(skb)->frags[i - 1].size += copy; + } else { + get_page(page); + skb_fill_page_desc(skb, i, page, offset, copy); + } + + skb->len += copy; + skb->data_len += copy; + skb->truesize += copy; + sk->sk_wmem_queued += copy; + sk->sk_forward_alloc -= copy; + skb->ip_summed = CHECKSUM_HW; + tp->write_seq += copy; + TCP_SKB_CB(skb)->end_seq += copy; + skb_shinfo(skb)->tso_segs = 0; + + if (!copied) + TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH; + + copied += copy; + poffset += copy; + if (!(len -= copy)) + goto out; + + if (skb->len < mss_now) + continue; + + if (forced_push(tp)) { + tcp_mark_push(tp, skb); + __tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH); + } else if (skb == sk->sk_send_head) + tcp_push_one(sk, mss_now); + continue; + +wait_for_sndbuf: + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); +wait_for_memory: + if (copied) + tcp_push(sk, tp, 0, mss_now, TCP_NAGLE_PUSH); + + err = -EAGAIN; + goto do_error; + } + +out: + if (copied) + tcp_push(sk, tp, 0, mss_now, tp->nonagle); + return copied; + +do_error: + if (copied) + goto out; +out_err: + return sk_stream_error(sk, 0, err); +} + +/* + * Must be called with locked sock. + */ +int tcp_async_recv(struct sock *sk, void *dst, size_t len) +{ + struct tcp_sock *tp = tcp_sk(sk); + int copied = 0; + u32 *seq; + unsigned long used; + int err; + int target; /* Read at least this many bytes */ + + TCP_CHECK_TIMER(sk); + + err = -ENOTCONN; + if (sk->sk_state == TCP_LISTEN) + goto out; + + seq = &tp->copied_seq; + + target = sock_rcvlowat(sk, 0, len); + + do { + struct sk_buff *skb; + u32 offset; + + /* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */ + if (tp->urg_data && tp->urg_seq == *seq) { + if (copied) + break; + } + + /* Next get a buffer. */ + + skb = skb_peek(&sk->sk_receive_queue); + do { + if (!skb) + break; + + /* Now that we have two receive queues this + * shouldn't happen. 
+ */ + if (before(*seq, TCP_SKB_CB(skb)->seq)) { + printk(KERN_INFO "async_recv bug: copied %X " + "seq %X\n", *seq, TCP_SKB_CB(skb)->seq); + break; + } + offset = *seq - TCP_SKB_CB(skb)->seq; + if (skb->h.th->syn) + offset--; + if (offset < skb->len) + goto found_ok_skb; + if (skb->h.th->fin) + goto found_fin_ok; + skb = skb->next; + } while (skb != (struct sk_buff *)&sk->sk_receive_queue); + + if (copied) + break; + + if (sock_flag(sk, SOCK_DONE)) + break; + + if (sk->sk_err) { + copied = sock_error(sk); + break; + } + + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + + if (sk->sk_state == TCP_CLOSE) { + if (!sock_flag(sk, SOCK_DONE)) { + /* This occurs when user tries to read + * from never connected socket. + */ + copied = -ENOTCONN; + break; + } + break; + } + + copied = -EAGAIN; + break; + + found_ok_skb: + /* Ok so how much can we use? */ + used = skb->len - offset; + if (len < used) + used = len; + + /* Do we have urgent data here? */ + if (tp->urg_data) { + u32 urg_offset = tp->urg_seq - *seq; + if (urg_offset < used) { + if (!urg_offset) { + if (!sock_flag(sk, SOCK_URGINLINE)) { + ++*seq; + offset++; + used--; + if (!used) + goto skip_copy; + } + } else + used = urg_offset; + } + } + + err = skb_copy_datagram(skb, offset, dst, used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + + *seq += used; + copied += used; + len -= used; + dst += used; + + tcp_rcv_space_adjust(sk); + +skip_copy: + if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) { + tp->urg_data = 0; + tcp_fast_path_check(sk, tp); + } + if (used + offset < skb->len) + continue; + + if (skb->h.th->fin) + goto found_fin_ok; + sk_eat_skb(sk, skb); + continue; + + found_fin_ok: + /* Process the FIN. */ + ++*seq; + sk_eat_skb(sk, skb); + break; + } while (len > 0); + + /* Clean up data we have read: This will do ACK frames. */ + cleanup_rbuf(sk, copied); + + TCP_CHECK_TIMER(sk); + return copied; + +out: + TCP_CHECK_TIMER(sk); + return err; +} + +/* * This routine copies from a sock struct into the user buffer. 
 *
 * Technical note: in 2.3 we work on _locked_ socket, so that
@@ -2259,6 +2529,8 @@ EXPORT_SYMBOL(tcp_getsockopt);
 EXPORT_SYMBOL(tcp_ioctl);
 EXPORT_SYMBOL(tcp_poll);
 EXPORT_SYMBOL(tcp_read_sock);
+EXPORT_SYMBOL(tcp_async_recv);
+EXPORT_SYMBOL(tcp_async_send);
 EXPORT_SYMBOL(tcp_recvmsg);
 EXPORT_SYMBOL(tcp_sendmsg);
 EXPORT_SYMBOL(tcp_sendpage);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e08245b..5655b1e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3113,6 +3113,7 @@ static void tcp_ofo_queue(struct sock *s
 
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
 		tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
 		if(skb->h.th->fin)
 			tcp_fin(skb, sk, skb->h.th);
@@ -3956,7 +3957,8 @@ int tcp_rcv_established(struct sock *sk,
 			int copied_early = 0;
 
 			if (tp->copied_seq == tp->rcv_nxt &&
-			    len - tcp_header_len <= tp->ucopy.len) {
+			    len - tcp_header_len <= tp->ucopy.len &&
+			    !sock_async(sk)) {
 #ifdef CONFIG_NET_DMA
 				if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
 					copied_early = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 25ecc6e..05d7086 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -62,6 +62,7 @@ #include <linux/cache.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
 #include <linux/times.h>
+#include <linux/kevent.h>
 
 #include <net/icmp.h>
 #include <net/inet_hashtables.h>
@@ -850,6 +851,7 @@ #endif
 		reqsk_free(req);
 	} else {
 		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+		kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
 	}
 	return 0;
@@ -1089,24 +1091,30 @@ process:
 	skb->dev = NULL;
 
-	bh_lock_sock(sk);
 	ret = 0;
-	if (!sock_owned_by_user(sk)) {
+	if (sock_async(sk)) {
+		spin_lock_bh(&sk->sk_lock.slock);
+		ret = tcp_v4_do_rcv(sk, skb);
+		spin_unlock_bh(&sk->sk_lock.slock);
+	} else {
+		bh_lock_sock(sk);
+		if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
-		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = get_softnet_dma();
-		if (tp->ucopy.dma_chan)
-			ret = tcp_v4_do_rcv(sk, skb);
-		else
+			struct tcp_sock *tp = tcp_sk(sk);
+			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+				tp->ucopy.dma_chan = get_softnet_dma();
+			if (tp->ucopy.dma_chan)
+				ret = tcp_v4_do_rcv(sk, skb);
+			else
 #endif
-		{
-			if (!tcp_prequeue(sk, skb))
-				ret = tcp_v4_do_rcv(sk, skb);
-		}
-	} else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+			{
+				if (!tcp_prequeue(sk, skb))
+					ret = tcp_v4_do_rcv(sk, skb);
+			}
+		} else
+			sk_add_backlog(sk, skb);
+		bh_unlock_sock(sk);
+	}
 
 	sock_put(sk);
@@ -1830,6 +1838,8 @@ struct proto tcp_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
+	.async_recv		= tcp_async_recv,
+	.async_send		= tcp_async_send,
 	.backlog_rcv		= tcp_v4_do_rcv,
 	.hash			= tcp_v4_hash,
 	.unhash			= tcp_unhash,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index a50eb30..e27e231 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1215,22 +1215,28 @@ process:
 	skb->dev = NULL;
 
-	bh_lock_sock(sk);
 	ret = 0;
-	if (!sock_owned_by_user(sk)) {
+	if (sock_async(sk)) {
+		spin_lock_bh(&sk->sk_lock.slock);
+		ret = tcp_v6_do_rcv(sk, skb);
+		spin_unlock_bh(&sk->sk_lock.slock);
+	} else {
+		bh_lock_sock(sk);
+		if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
-		struct tcp_sock *tp = tcp_sk(sk);
-		if (tp->ucopy.dma_chan)
-			ret = tcp_v6_do_rcv(sk, skb);
-		else
-#endif
-		{
-			if (!tcp_prequeue(sk, skb))
-				ret = tcp_v6_do_rcv(sk, skb);
-		}
-	} else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+			struct tcp_sock *tp = tcp_sk(sk);
+			if (tp->ucopy.dma_chan)
+				ret = tcp_v6_do_rcv(sk, skb);
+			else
+#endif
+			{
+				if (!tcp_prequeue(sk, skb))
+					ret = tcp_v6_do_rcv(sk, skb);
+			}
+		} else
+			sk_add_backlog(sk, skb);
+		bh_unlock_sock(sk);
+	}
 
 	sock_put(sk);
 
 	return ret ? -1 : 0;
@@ -1580,6 +1586,8 @@ struct proto tcpv6_prot = {
 	.getsockopt		= tcp_getsockopt,
 	.sendmsg		= tcp_sendmsg,
 	.recvmsg		= tcp_recvmsg,
+	.async_recv		= tcp_async_recv,
+	.async_send		= tcp_async_send,
 	.backlog_rcv		= tcp_v6_do_rcv,
 	.hash			= tcp_v6_hash,
 	.unhash			= tcp_unhash,

^ permalink raw reply	related	[flat|nested] 73+ messages in thread
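For readers trying to picture the userspace side of the interface above, a minimal sketch follows. It assumes the struct kevent_user_control and struct ukevent layouts implied by kevent_user_ctl_add() and kevent_user_wait() (cmd/num/timeout in the control header; type, event, req_flags and id.raw[] in the ukevent), a wired-up __NR_kevent_ctl syscall number, a hypothetical <linux/ukevent.h> userspace header, and a KEVENT_SOCKET type constant; none of these names are guaranteed by the excerpt quoted here.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/ukevent.h>	/* hypothetical userspace header with the kevent ABI */

static long kevent_ctl(int fd, void *arg)
{
	return syscall(__NR_kevent_ctl, fd, arg);	/* __NR_kevent_ctl: assumed */
}

static int kevent_example(int sock)
{
	struct {
		struct kevent_user_control ctl;
		struct ukevent uk[1];
	} req;
	int kfd;

	/* KEVENT_CTL_INIT ignores the fd argument and returns a control fd. */
	memset(&req, 0, sizeof(req));
	req.ctl.cmd = KEVENT_CTL_INIT;
	kfd = kevent_ctl(-1, &req);
	if (kfd < 0)
		return -1;

	/* Register interest in receive readiness on @sock; KEVENT_SOCKET
	 * is an assumed name for the storage type constant. */
	memset(&req, 0, sizeof(req));
	req.ctl.cmd = KEVENT_CTL_ADD;
	req.ctl.num = 1;
	req.uk[0].type = KEVENT_SOCKET;
	req.uk[0].event = KEVENT_SOCKET_RECV;
	req.uk[0].id.raw[0] = sock;
	if (kevent_ctl(kfd, &req) < 0)
		return -1;

	/* Wait up to 1000 ms; ready ukevents are copied back after the
	 * control header, and the return value is the ready count. */
	memset(&req.ctl, 0, sizeof(req.ctl));
	req.ctl.cmd = KEVENT_CTL_WAIT;
	req.ctl.num = 1;
	req.ctl.timeout = 1000;
	return kevent_ctl(kfd, &req);
}

The same control fd and the same packed header-plus-array layout serve every command, which is why KEVENT_CTL_WAIT can return ready events in the very buffer that was used to request them.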
* [2/4] kevent: network AIO, socket notifications.
  2006-07-26 9:18 ` [1/4] kevent: core files Evgeniy Polyakov
@ 2006-07-26 9:18 ` Evgeniy Polyakov
  2006-07-26 9:18 ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26 10:31 ` [1/4] kevent: core files Andrew Morton
  2006-07-26 10:44 ` Evgeniy Polyakov
  2 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

This patchset includes socket notifications and network asynchronous IO.

Network AIO is based on kevent and works as a usual kevent storage on top
of the inode. When a new socket is created it is associated with that inode
(to save some space, since the inode already has a kevent_storage embedded),
and when some activity is detected the appropriate notifications are
generated and kevent_naio_callback() is called.

When a new kevent is registered, the network AIO ->enqueue() callback simply
marks itself like a usual socket event watcher. It also locks the physical
userspace pages in memory and stores the appropriate pointers in the private
kevent structure. I have not created additional DMA memory allocation
methods, like Ulrich described in his article, so I handle it inside NAIO,
which has some overhead (I posted a get_user_pages() scalability graph some
time ago). A new set of syscalls to allocate DMAable memory is in the TODO.

The network AIO callback gets pointers to the userspace pages and tries to
copy data from the receive skb queue into them using a protocol-specific
callback. This callback is very similar to ->recvmsg(), so they could share
a lot in the future (as far as I recall it worked only with hardware capable
of checksumming; I'm a bit lazy, it is in the TODO).

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..c230aaa
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,125 @@
+/*
+ * 	kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/file.h> +#include <linux/tcp.h> +#include <linux/kevent.h> + +#include <net/sock.h> +#include <net/request_sock.h> +#include <net/inet_connection_sock.h> + +static int kevent_socket_callback(struct kevent *k) +{ + struct inode *inode = k->st->origin; + struct sock *sk = SOCKET_I(inode)->sk; + int rmem; + + if (k->event.event & KEVENT_SOCKET_RECV) { + int ret = 0; + + if ((rmem = atomic_read(&sk->sk_rmem_alloc)) > 0 || + !skb_queue_empty(&sk->sk_receive_queue)) + ret = 1; + if (sk->sk_shutdown & RCV_SHUTDOWN) + ret = 1; + if (ret) + return ret; + } + if ((k->event.event & KEVENT_SOCKET_ACCEPT) && + (!reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) || + reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue))) { + k->event.ret_data[1] = reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue); + return 1; + } + + return 0; +} + +int kevent_socket_enqueue(struct kevent *k) +{ + struct file *file; + struct inode *inode; + int err, fput_needed; + + file = fget_light(k->event.id.raw[0], &fput_needed); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file->f_dentry || !file->f_dentry->d_inode) + goto err_out_fput; + + inode = igrab(file->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + err = kevent_storage_enqueue(&inode->st, k); + if (err) + goto err_out_iput; + + err = k->callback(k); + if (err) + goto err_out_dequeue; + + fput_light(file, fput_needed); + return err; + +err_out_dequeue: + kevent_storage_dequeue(k->st, k); +err_out_iput: + iput(inode); +err_out_fput: + fput_light(file, fput_needed); + return err; +} + +int kevent_socket_dequeue(struct kevent *k) +{ + struct inode *inode = k->st->origin; + + kevent_storage_dequeue(k->st, k); + iput(inode); + + return 0; +} + +int kevent_init_socket(struct kevent *k) +{ + k->enqueue = &kevent_socket_enqueue; + k->dequeue = &kevent_socket_dequeue; + k->callback = &kevent_socket_callback; + return 0; +} + +void kevent_socket_notify(struct sock *sk, u32 event) +{ + if (sk->sk_socket && !test_and_set_bit(SOCK_ASYNC_INUSE, &sk->sk_flags)) { + kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event); + sock_reset_flag(sk, SOCK_ASYNC_INUSE); + } +} diff --git a/kernel/kevent/kevent_naio.c b/kernel/kevent/kevent_naio.c new file mode 100644 index 0000000..1c71021 --- /dev/null +++ b/kernel/kevent/kevent_naio.c @@ -0,0 +1,239 @@ +/* + * kevent_naio.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/tcp_states.h>
+
+static int kevent_naio_enqueue(struct kevent *k);
+static int kevent_naio_dequeue(struct kevent *k);
+static int kevent_naio_callback(struct kevent *k);
+
+static int kevent_naio_setup_aio(int ctl_fd, int s, void __user *buf,
+		size_t size, u32 event)
+{
+	struct kevent_user *u;
+	struct file *file;
+	int err, fput_needed;
+	struct ukevent uk;
+
+	file = fget_light(ctl_fd, &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	u = file->private_data;
+	if (!u) {
+		err = -EINVAL;
+		goto err_out_fput;
+	}
+
+	memset(&uk, 0, sizeof(struct ukevent));
+	uk.type = KEVENT_NAIO;
+	uk.ptr = buf;
+	uk.req_flags = KEVENT_REQ_ONESHOT;
+	uk.event = event;
+	uk.id.raw[0] = s;
+	uk.id.raw[1] = size;
+
+	err = kevent_user_add_ukevent(&uk, u);
+
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf,
+		size_t size, unsigned flags)
+{
+	return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_RECV);
+}
+
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf,
+		size_t size, unsigned flags)
+{
+	return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_SEND);
+}
+
+static int kevent_naio_enqueue(struct kevent *k)
+{
+	int err, i;
+	struct page **page;
+	void *addr;
+	unsigned int size = k->event.id.raw[1];
+	int num = size/PAGE_SIZE;
+	struct file *file;
+	struct sock *sk = NULL;
+	int fput_needed;
+
+	file = fget_light(k->event.id.raw[0], &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode)
+		goto err_out_fput;
+
+	sk = SOCKET_I(file->f_dentry->d_inode)->sk;
+
+	err = -ESOCKTNOSUPPORT;
+	if (!sk || !sk->sk_prot->async_recv || !sk->sk_prot->async_send ||
+			!sock_flag(sk, SOCK_ASYNC))
+		goto err_out_fput;
+
+	addr = k->event.ptr;
+	if (((unsigned long)addr & PAGE_MASK) != (unsigned long)addr)
+		num++;
+
+	page = kmalloc(sizeof(struct page *) * num, GFP_KERNEL);
+	if (!page) {
+		err = -ENOMEM;
+		goto err_out_fput;
+	}
+
+	down_read(&current->mm->mmap_sem);
+	err = get_user_pages(current, current->mm, (unsigned long)addr,
+			num, 1, 0, page, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (err <= 0)
+		goto err_out_free;
+	num = err;
+
+	k->event.ret_data[0] = num;
+	k->event.ret_data[1] = offset_in_page(k->event.ptr);
+	k->priv = page;
+
+	sk->sk_allocation = GFP_ATOMIC;
+
+	spin_lock_bh(&sk->sk_lock.slock);
+	err = kevent_socket_enqueue(k);
+	spin_unlock_bh(&sk->sk_lock.slock);
+	if (err)
+		goto err_out_put_pages;
+
+	fput_light(file, fput_needed);
+
+	return err;
+
+err_out_put_pages:
+	for (i=0; i<num; ++i)
+		page_cache_release(page[i]);
+err_out_free:
+	kfree(page);
+err_out_fput:
+	fput_light(file, fput_needed);
+
+	return err;
+}
+
+static int kevent_naio_dequeue(struct kevent *k)
+{
+	int err, i, num;
+	struct page **page = k->priv;
+
+	num = k->event.ret_data[0];
+
+	err = kevent_socket_dequeue(k);
+
+	for (i=0; i<num; ++i)
+		page_cache_release(page[i]);
+
+	kfree(k->priv);
+	k->priv = NULL;
+
+	return err;
+}
+
+static int kevent_naio_callback(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+	struct sock *sk = 
SOCKET_I(inode)->sk; + unsigned int size = k->event.id.raw[1]; + unsigned int off = k->event.ret_data[1]; + struct page **pages = k->priv, *page; + int ready = 0, num = off/PAGE_SIZE, err = 0, send = 0; + void *ptr, *optr; + unsigned int len; + + if (!sock_flag(sk, SOCK_ASYNC)) + return -1; + + if (k->event.event & KEVENT_SOCKET_SEND) + send = 1; + else if (!(k->event.event & KEVENT_SOCKET_RECV)) + return -EINVAL; + + /* + * sk_prot->async_*() can return either number of bytes processed, + * or negative error value, or zero if socket is closed. + */ + + if (!send) { + page = pages[num]; + + optr = ptr = kmap_atomic(page, KM_IRQ0); + if (!ptr) + return -ENOMEM; + + ptr += off % PAGE_SIZE; + len = min_t(unsigned int, PAGE_SIZE - (ptr - optr), size); + + err = sk->sk_prot->async_recv(sk, ptr, len); + + kunmap_atomic(optr, KM_IRQ0); + } else { + len = size; + err = sk->sk_prot->async_send(sk, pages, off, size); + } + + if (err > 0) { + num++; + size -= err; + off += err; + } + + k->event.ret_data[1] = off; + k->event.id.raw[1] = size; + + if (err == 0 || (err < 0 && err != -EAGAIN)) + ready = -1; + + if (!size) + ready = 1; +#if 0 + printk("%s: sk=%p, k=%p, size=%4u, off=%4u, err=%3d, ready=%1d.\n", + __func__, sk, k, size, off, err, ready); +#endif + + return ready; +} + +int kevent_init_naio(struct kevent *k) +{ + k->enqueue = &kevent_naio_enqueue; + k->dequeue = &kevent_naio_dequeue; + k->callback = &kevent_naio_callback; + return 0; +} ^ permalink raw reply related [flat|nested] 73+ messages in thread
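A matching sketch of the receive path described in the introduction above: the buffer is pinned by get_user_pages() at enqueue time and filled as skbs arrive, so it must stay valid until the oneshot kevent completes. SO_ASYNC_SOCK comes from the net/core/sock.c hunk in patch 1/4; __NR_aio_recv and the kevent_ctl() wrapper are carried over from the earlier sketch and remain assumptions.

#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/socket.h>

static long aio_recv(int ctl_fd, int s, void *buf, size_t size)
{
	return syscall(__NR_aio_recv, ctl_fd, s, buf, size, 0);	/* assumed */
}

static int aio_recv_example(int kfd, int sock)
{
	/* The pages backing buf are pinned until the NAIO kevent completes,
	 * so the buffer must outlive the request. */
	static char buf[16384];
	int on = 1;

	/* tcp_async_recv()/tcp_async_send() only run on SOCK_ASYNC sockets. */
	if (setsockopt(sock, SOL_SOCKET, SO_ASYNC_SOCK, &on, sizeof(on)) < 0)
		return -1;

	/* Queue the receive; completion is reaped with KEVENT_CTL_WAIT on
	 * kfd, the copy itself happening from the socket callback path. */
	if (aio_recv(kfd, sock, buf, sizeof(buf)) < 0)
		return -1;
	return 0;
}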
* [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 9:18 ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
@ 2006-07-26 9:18 ` Evgeniy Polyakov
  2006-07-26 9:18 ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
  ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

This patch includes asynchronous propagation of a file's data into the VFS
cache and the aio_sendfile() implementation.
Network aio_sendfile() works lazily - it asynchronously populates pages
into the VFS cache (which can be used for various tricks with adaptive
readahead) and then uses the usual ->sendfile() callback.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/bio.c b/fs/bio.c
index 6a0b9ad..a3ee530 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -119,7 +119,7 @@ void bio_free(struct bio *bio, struct bi
 /*
  * default destructor for a bio allocated with bio_alloc_bioset()
  */
-static void bio_fs_destructor(struct bio *bio)
+void bio_fs_destructor(struct bio *bio)
 {
 	bio_free(bio, fs_bio_set);
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 04af9c4..295fce9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -685,6 +685,7 @@ ext2_writepages(struct address_space *ma
 }
 
 struct address_space_operations ext2_aops = {
+	.get_block = ext2_get_block,
 	.readpage = ext2_readpage,
 	.readpages = ext2_readpages,
 	.writepage = ext2_writepage,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2edd7ee..e44f5ad 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1700,6 +1700,7 @@ static int ext3_journalled_set_page_dirt
 }
 
 static struct address_space_operations ext3_ordered_aops = {
+	.get_block = ext3_get_block,
 	.readpage = ext3_readpage,
 	.readpages = ext3_readpages,
 	.writepage = ext3_ordered_writepage,
diff --git a/fs/file_table.c b/fs/file_table.c
index bcea199..8759479 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -113,6 +113,9 @@ struct file *get_empty_filp(void)
 	if (security_file_alloc(f))
 		goto fail_sec;
 
+#ifdef CONFIG_KEVENT_POLL
+	kevent_storage_init(f, &f->st);
+#endif
 	tsk = current;
 	INIT_LIST_HEAD(&f->f_u.fu_list);
 	atomic_set(&f->f_count, 1);
@@ -160,6 +163,9 @@ void fastcall __fput(struct file *file)
 	might_sleep();
 
 	fsnotify_close(file);
+#ifdef CONFIG_KEVENT_POLL
+	kevent_storage_fini(&file->st);
+#endif
 	/*
 	 * The function eventpoll_release() should be the first called
 	 * in the file cleanup chain.
diff --git a/fs/inode.c b/fs/inode.c
index 3a2446a..0493935 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@ #include <linux/pagemap.h>
 #include <linux/cdev.h>
 #include <linux/bootmem.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/mount.h>
 
 /*
@@ -166,12 +167,18 @@ #endif
 		}
 		memset(&inode->u, 0, sizeof(inode->u));
 		inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT
+		kevent_storage_init(inode, &inode->st);
+#endif
 	}
 	return inode;
 }
 
 void destroy_inode(struct inode *inode)
 {
+#if defined CONFIG_KEVENT
+	kevent_storage_fini(&inode->st);
+#endif
 	BUG_ON(inode_has_buffers(inode));
 	security_inode_free(inode);
 	if (inode->i_sb->s_op->destroy_inode)
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 9857e50..bbbb578 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -2997,6 +2997,7 @@ int reiserfs_setattr(struct dentry *dent
 }
 
 struct address_space_operations reiserfs_address_space_operations = {
+	.get_block = reiserfs_get_block,
 	.writepage = reiserfs_writepage,
 	.readpage = reiserfs_readpage,
 	.readpages = reiserfs_readpages,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ecc8c2c..248f6a1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,9 @@ #include <linux/mutex.h>
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
 #include <asm/byteorder.h>
+#ifdef CONFIG_KEVENT
+#include <linux/kevent_storage.h>
+#endif
 
 struct hd_geometry;
 struct iovec;
@@ -348,6 +351,8 @@ struct address_space;
 struct writeback_control;
 
 struct address_space_operations {
+	int (*get_block)(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create);
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
 	void (*sync_page)(struct page *);
@@ -526,6 +531,10 @@ #ifdef CONFIG_INOTIFY
 	struct mutex		inotify_mutex;	/* protects the watches list */
 #endif
 
+#ifdef CONFIG_KEVENT_INODE
+	struct kevent_storage st;
+#endif
+
 	unsigned long		i_state;
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
@@ -659,6 +668,9 @@ #ifdef CONFIG_EPOLL
 	struct list_head	f_ep_links;
 	spinlock_t		f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+	struct kevent_storage st;
+#endif
 	struct address_space	*f_mapping;
 };
 extern spinlock_t files_lock;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index cc5dec7..0acc8db 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -15,6 +15,7 @@ #ifdef __KERNEL__
 
 #include <linux/dnotify.h>
 #include <linux/inotify.h>
+#include <linux/kevent.h>
 #include <linux/audit.h>
 
 /*
@@ -79,6 +80,7 @@ static inline void fsnotify_nameremove(s
 		isdir = IN_ISDIR;
 	dnotify_parent(dentry, DN_DELETE);
 	inotify_dentry_parent_queue_event(dentry, IN_DELETE|isdir, 0, dentry->d_name.name);
+	kevent_inode_notify_parent(dentry, KEVENT_INODE_REMOVE);
 }
 
 /*
@@ -88,6 +90,7 @@ static inline void fsnotify_inoderemove(
 {
 	inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL, NULL);
 	inotify_inode_is_dead(inode);
+	kevent_inode_remove(inode);
 }
 
 /*
@@ -96,6 +99,7 @@ static inline void fsnotify_inoderemove(
 static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
 {
 	inode_dir_notify(inode, DN_CREATE);
+	kevent_inode_notify(inode, KEVENT_INODE_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
 				  dentry->d_inode);
 	audit_inode_child(dentry->d_name.name, dentry->d_inode, inode->i_ino);
@@ -107,6 +111,7 @@ static inline void fsnotify_create(struc
 static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
 {
 	inode_dir_notify(inode, DN_CREATE);
+	kevent_inode_notify(inode, KEVENT_INODE_CREATE);
 	inotify_inode_queue_event(inode, IN_CREATE | IN_ISDIR, 0,
 				  dentry->d_name.name, dentry->d_inode);
 	audit_inode_child(dentry->d_name.name, dentry->d_inode, inode->i_ino);
diff --git a/kernel/kevent/kevent_inode.c b/kernel/kevent/kevent_inode.c
new file mode 100644
index 0000000..3af0e11
--- /dev/null
+++ b/kernel/kevent/kevent_inode.c
@@ -0,0 +1,110 @@
+/*
+ * 	kevent_inode.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/fs.h>
+
+static int kevent_inode_enqueue(struct kevent *k)
+{
+	struct file *file;
+	struct inode *inode;
+	int err, fput_needed;
+
+	file = fget_light(k->event.id.raw[0], &fput_needed);
+	if (!file)
+		return -ENODEV;
+
+	err = -EINVAL;
+	if (!file->f_dentry || !file->f_dentry->d_inode)
+		goto err_out_fput;
+
+	inode = igrab(file->f_dentry->d_inode);
+	if (!inode)
+		goto err_out_fput;
+
+	err = kevent_storage_enqueue(&inode->st, k);
+	if (err)
+		goto err_out_iput;
+
+	fput_light(file, fput_needed);
+	return 0;
+
+err_out_iput:
+	iput(inode);
+err_out_fput:
+	fput_light(file, fput_needed);
+	return err;
+}
+
+static int kevent_inode_dequeue(struct kevent *k)
+{
+	struct inode *inode = k->st->origin;
+
+	kevent_storage_dequeue(k->st, k);
+	iput(inode);
+
+	return 0;
+}
+
+static int kevent_inode_callback(struct kevent *k)
+{
+	return 1;
+}
+
+int kevent_init_inode(struct kevent *k)
+{
+	k->enqueue = &kevent_inode_enqueue;
+	k->dequeue = &kevent_inode_dequeue;
+	k->callback = &kevent_inode_callback;
+	return 0;
+}
+
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+	struct dentry *parent;
+	struct inode *inode;
+
+	spin_lock(&dentry->d_lock);
+	parent = dentry->d_parent;
+	inode = parent->d_inode;
+
+	dget(parent);
+	spin_unlock(&dentry->d_lock);
+	kevent_inode_notify(inode, event);
+	dput(parent);
+}
+
+void kevent_inode_remove(struct inode *inode)
+{
+	kevent_storage_fini(&inode->st);
+}
+
+void kevent_inode_notify(struct inode *inode, u32 event)
+{
+	kevent_storage_ready(&inode->st, NULL, event);
+}
diff --git a/kernel/kevent/kevent_aio.c b/kernel/kevent/kevent_aio.c
new file mode 100644
index 0000000..d4132a3
--- /dev/null
+++ b/kernel/kevent/kevent_aio.c
@@ -0,0 +1,580 @@
+/*
+ * 	kevent_aio.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/swap.h> +#include <linux/pagemap.h> +#include <linux/bio.h> +#include <linux/buffer_head.h> +#include <linux/kevent.h> + +#include <net/sock.h> + +#define KEVENT_AIO_DEBUG + +#ifdef KEVENT_AIO_DEBUG +#define dprintk(f, a...) printk(f, ##a) +#else +#define dprintk(f, a...) do {} while (0) +#endif + +struct kevent_aio_private +{ + int pg_num; + size_t size; + loff_t offset; + loff_t processed; + atomic_t bio_page_num; + struct completion bio_complete; + struct file *file, *sock; + struct work_struct work; +}; + +static int kevent_aio_dequeue(struct kevent *k); +static int kevent_aio_enqueue(struct kevent *k); +static int kevent_aio_callback(struct kevent *k); + +extern void bio_fs_destructor(struct bio *bio); + +static void kevent_aio_bio_destructor(struct bio *bio) +{ + struct kevent *k = bio->bi_private; + struct kevent_aio_private *priv = k->priv; + + dprintk("%s: bio=%p, num=%u, k=%p, inode=%p.\n", __func__, bio, bio->bi_vcnt, k, k->st->origin); + schedule_work(&priv->work); + bio_fs_destructor(bio); +} + +static void kevent_aio_bio_put(struct kevent *k) +{ + struct kevent_aio_private *priv = k->priv; + + if (atomic_dec_and_test(&priv->bio_page_num)) + complete(&priv->bio_complete); +} + +static int kevent_mpage_end_io_read(struct bio *bio, unsigned int bytes_done, int err) +{ + const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; + struct kevent *k = bio->bi_private; + + if (bio->bi_size) + return 1; + + do { + struct page *page = bvec->bv_page; + + if (--bvec >= bio->bi_io_vec) + prefetchw(&bvec->bv_page->flags); + + if (uptodate) { + SetPageUptodate(page); + } else { + ClearPageUptodate(page); + SetPageError(page); + } + + unlock_page(page); + kevent_aio_bio_put(k); + } while (bvec >= bio->bi_io_vec); + + bio_put(bio); + return 0; +} + +static inline struct bio *kevent_mpage_bio_submit(int rw, struct bio *bio) +{ + if (bio) { + bio->bi_end_io = kevent_mpage_end_io_read; + dprintk("%s: bio=%p, num=%u.\n", __func__, bio, bio->bi_vcnt); + submit_bio(READ, bio); + } + return NULL; +} + +static struct bio *kevent_mpage_readpage(struct kevent *k, struct bio *bio, + struct page *page, unsigned nr_pages, get_block_t get_block, + loff_t *offset, sector_t *last_block_in_bio) +{ + struct inode *inode = k->st->origin; + const unsigned blkbits = inode->i_blkbits; + const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; + const unsigned blocksize = 1 << blkbits; + sector_t block_in_file; + sector_t last_block; + struct block_device *bdev = NULL; + unsigned first_hole = blocks_per_page; + unsigned page_block; + sector_t 
blocks[MAX_BUF_PER_PAGE]; + struct buffer_head bh; + int fully_mapped = 1, length; + + block_in_file = (*offset + blocksize - 1) >> blkbits; + last_block = (i_size_read(inode) + blocksize - 1) >> blkbits; + + bh.b_page = page; + for (page_block = 0; page_block < blocks_per_page; page_block++, block_in_file++) { + bh.b_state = 0; + if (block_in_file < last_block) { + if (get_block(inode, block_in_file, &bh, 0)) + goto confused; + } + + if (!buffer_mapped(&bh)) { + fully_mapped = 0; + if (first_hole == blocks_per_page) + first_hole = page_block; + continue; + } + + /* some filesystems will copy data into the page during + * the get_block call, in which case we don't want to + * read it again. map_buffer_to_page copies the data + * we just collected from get_block into the page's buffers + * so readpage doesn't have to repeat the get_block call + */ + if (buffer_uptodate(&bh)) { + BUG(); + //map_buffer_to_page(page, &bh, page_block); + goto confused; + } + + if (first_hole != blocks_per_page) + goto confused; /* hole -> non-hole */ + + /* Contiguous blocks? */ + if (page_block && blocks[page_block-1] != bh.b_blocknr-1) + goto confused; + blocks[page_block] = bh.b_blocknr; + bdev = bh.b_bdev; + } + + if (!bdev) + goto confused; + + if (first_hole != blocks_per_page) { + char *kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + (first_hole << blkbits), 0, + PAGE_CACHE_SIZE - (first_hole << blkbits)); + flush_dcache_page(page); + kunmap_atomic(kaddr, KM_USER0); + if (first_hole == 0) { + SetPageUptodate(page); + goto out; + } + } else if (fully_mapped) { + SetPageMappedToDisk(page); + } + + /* + * This page will go to BIO. Do we need to send this BIO off first? + */ + if (bio && (*last_block_in_bio != blocks[0] - 1)) + bio = kevent_mpage_bio_submit(READ, bio); + +alloc_new: + if (bio == NULL) { + nr_pages = min_t(unsigned, nr_pages, bio_get_nr_vecs(bdev)); + bio = bio_alloc(GFP_KERNEL, nr_pages); + if (bio == NULL) + goto confused; + + bio->bi_destructor = kevent_aio_bio_destructor; + bio->bi_bdev = bdev; + bio->bi_sector = blocks[0] << (blkbits - 9); + bio->bi_private = k; + } + + length = first_hole << blkbits; + if (bio_add_page(bio, page, length, 0) < length) { + bio = kevent_mpage_bio_submit(READ, bio); + dprintk("%s: Failed to add a page: nr_pages=%d, length=%d, page=%p.\n", + __func__, nr_pages, length, page); + goto alloc_new; + } + + dprintk("%s: bio=%p, b=%d, m=%d, u=%d, nr_pages=%d, offset=%Lu, " + "size=%Lu. page_block=%u, page=%p.\n", + __func__, bio, buffer_boundary(&bh), buffer_mapped(&bh), + buffer_uptodate(&bh), nr_pages, *offset, i_size_read(inode), + page_block, page); + + *offset = *offset + length; + + if (buffer_boundary(&bh) || (first_hole != blocks_per_page)) + bio = kevent_mpage_bio_submit(READ, bio); + else + *last_block_in_bio = blocks[blocks_per_page - 1]; + +out: + return bio; + +confused: + dprintk("%s: confused. 
bio=%p, nr_pages=%d.\n", __func__, bio, nr_pages);
+	if (bio)
+		bio = kevent_mpage_bio_submit(READ, bio);
+	kevent_aio_bio_put(k);
+	SetPageUptodate(page);
+
+	if (nr_pages == 1) {
+		struct kevent_aio_private *priv = k->priv;
+
+		wait_for_completion(&priv->bio_complete);
+		kevent_storage_ready(k->st, NULL, KEVENT_AIO_BIO);
+		init_completion(&priv->bio_complete);
+		complete(&priv->bio_complete);
+	}
+	goto out;
+}
+
+static int kevent_aio_alloc_cached_page(struct kevent *k, struct page **cached_page)
+{
+	struct kevent_aio_private *priv = k->priv;
+	struct address_space *mapping = priv->file->f_mapping;
+	struct page *page;
+	int err = 0;
+	pgoff_t index = priv->offset >> PAGE_CACHE_SHIFT;
+
+	page = page_cache_alloc_cold(mapping);
+	if (!page) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	if (err) {
+		if (err == -EEXIST)
+			err = 0;
+		page_cache_release(page);
+		goto out;
+	}
+
+	dprintk("%s: page=%p, offset=%Lu, processed=%Lu, index=%lu, size=%zu.\n",
+			__func__, page, priv->offset, priv->processed, index, priv->size);
+
+	*cached_page = page;
+
+out:
+	return err;
+}
+
+static int kevent_mpage_readpages(struct kevent *k, int first,
+		int (* get_block)(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create))
+{
+	struct bio *bio = NULL;
+	struct kevent_aio_private *priv = k->priv;
+	sector_t last_block_in_bio = 0;
+	int i, err = 0;
+
+	atomic_set(&priv->bio_page_num, priv->pg_num);
+
+	for (i=first; i<priv->pg_num; ++i) {
+		struct page *page = NULL;
+
+		err = kevent_aio_alloc_cached_page(k, &page);
+		if (err)
+			break;
+
+		/*
+		 * If there is no error and the page is NULL, someone else
+		 * already added a page into the VFS cache; we do not process
+		 * it here, since whoever added the page is responsible for
+		 * reading its data from disk.
+		 */
+		if (!page)
+			continue;
+
+		bio = kevent_mpage_readpage(k, bio, page, priv->pg_num - i,
+				get_block, &priv->offset, &last_block_in_bio);
+	}
+
+	if (bio)
+		bio = kevent_mpage_bio_submit(READ, bio);
+
+	return err;
+}
+
+static ssize_t kevent_aio_vfs_read_actor(struct kevent *k, struct page *kpage, size_t len)
+{
+	struct kevent_aio_private *priv = k->priv;
+	ssize_t ret;
+
+	ret = priv->sock->f_op->sendpage(priv->sock, kpage, 0, len, &priv->sock->f_pos, 1);
+
+	dprintk("%s: k=%p, page=%p, len=%zu, ret=%zd.\n",
+			__func__, k, kpage, len, ret);
+
+	return ret;
+}
+
+static int kevent_aio_vfs_read(struct kevent *k,
+		ssize_t (*actor)(struct kevent *, struct page *, size_t))
+{
+	struct kevent_aio_private *priv = k->priv;
+	struct address_space *mapping;
+	size_t isize;
+	ssize_t actor_size;
+	int i;
+
+	mapping = priv->file->f_mapping;
+	isize = i_size_read(priv->file->f_dentry->d_inode);
+
+	dprintk("%s: start: size_left=%zd, offset=%Lu, processed=%Lu, isize=%zu, pg_num=%d.\n",
+			__func__, priv->size, priv->offset, priv->processed, isize, priv->pg_num);
+
+	for (i=0; i<priv->pg_num && priv->size; ++i) {
+		struct page *page;
+		size_t nr = PAGE_CACHE_SIZE;
+
+		cond_resched();
+		page = find_get_page(mapping, priv->processed >> PAGE_CACHE_SHIFT);
+		if (unlikely(page == NULL))
+			break;
+		if (!PageUptodate(page)) {
+			dprintk("%s: %2d: page=%p, processed=%Lu, size=%zu not uptodate.\n",
+					__func__, i, page, priv->processed, priv->size);
+			page_cache_release(page);
+			break;
+		}
+
+		if (mapping_writably_mapped(mapping))
+			flush_dcache_page(page);
+
+		mark_page_accessed(page);
+
+		if (nr + priv->processed > isize)
+			nr = isize - priv->processed;
+		if (nr > priv->size)
+			nr = priv->size;
+
+		actor_size = actor(k, page, nr);
+		if (actor_size < 0) {
+			page_cache_release(page);
+			break;
+		}
+
+		page_cache_release(page);
+
+		priv->processed += actor_size;
+		priv->size -= actor_size;
+	}
+
+	if (!priv->size)
+		i = priv->pg_num;
+
+	if (i != priv->pg_num)
+		priv->offset = priv->processed;
+
+	dprintk("%s: end: next=%d, num=%d, left=%zu, offset=%Lu, processed=%Lu, ret=%d.\n",
+			__func__, i, priv->pg_num,
+			priv->size, priv->offset, priv->processed, i);
+
+	return i;
+}
+
+static int kevent_aio_callback(struct kevent *k)
+{
+	return 1;
+}
+
+static void kevent_aio_work(void *data)
+{
+	struct kevent *k = data;
+	struct kevent_aio_private *priv = k->priv;
+	struct inode *inode = k->st->origin;
+	struct address_space *mapping = priv->file->f_mapping;
+	int err, ready = 0, num;
+
+	dprintk("%s: k=%p, priv=%p, inode=%p.\n", __func__, k, priv, inode);
+
+	init_completion(&priv->bio_complete);
+
+	num = ready = kevent_aio_vfs_read(k, &kevent_aio_vfs_read_actor);
+	if (ready > 0 && ready != priv->pg_num)
+		ready = 0;
+
+	dprintk("%s: k=%p, ready=%d, size=%zd.\n", __func__, k, ready, priv->size);
+
+	if (!ready) {
+		err = kevent_mpage_readpages(k, num, mapping->a_ops->get_block);
+		if (err) {
+			dprintk("%s: kevent_mpage_readpages failed: err=%d, k=%p, size=%zd.\n",
+					__func__, err, k, priv->size);
+			kevent_break(k);
+			kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+		}
+	} else {
+		dprintk("%s: next k=%p, size=%zd.\n", __func__, k, priv->size);
+
+		if (priv->size)
+			schedule_work(&priv->work);
+		else {
+			kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+		}
+
+		complete(&priv->bio_complete);
+	}
+}
+
+static int kevent_aio_enqueue(struct kevent *k)
+{
+	int err;
+	struct file *file, *sock;
+	struct inode *inode;
+	struct kevent_aio_private *priv;
+	struct address_space *mapping;
+	int fd = k->event.id.raw[0];
+	int num = 
k->event.id.raw[1]; + int s = k->event.ret_data[0]; + size_t size; + + err = -ENODEV; + file = fget(fd); + if (!file) + goto err_out_exit; + + sock = fget(s); + if (!sock) + goto err_out_fput_file; + + mapping = file->f_mapping; + + err = -EINVAL; + if (!file->f_dentry || !file->f_dentry->d_inode || !mapping->a_ops->get_block) + goto err_out_fput; + if (!sock->f_dentry || !sock->f_dentry->d_inode) + goto err_out_fput; + + inode = igrab(file->f_dentry->d_inode); + if (!inode) + goto err_out_fput; + + size = i_size_read(inode); + + num = (size > num << PAGE_SHIFT) ? num : (size >> PAGE_SHIFT); + + err = -ENOMEM; + priv = kzalloc(sizeof(struct kevent_aio_private), GFP_KERNEL); + if (!priv) + goto err_out_iput; + + priv->pg_num = num; + priv->size = size; + priv->offset = 0; + priv->file = file; + priv->sock = sock; + INIT_WORK(&priv->work, kevent_aio_work, k); + k->priv = priv; + + dprintk("%s: read: k=%p, priv=%p, inode=%p, num=%u, size=%zu, off=%Lu.\n", + __func__, k, priv, inode, priv->pg_num, priv->size, priv->offset); + + init_completion(&priv->bio_complete); + kevent_storage_enqueue(&inode->st, k); + schedule_work(&priv->work); + + return 0; + +err_out_iput: + iput(inode); +err_out_fput: + fput(sock); +err_out_fput_file: + fput(file); +err_out_exit: + + return err; +} + +static int kevent_aio_dequeue(struct kevent *k) +{ + struct kevent_aio_private *priv = k->priv; + struct inode *inode = k->st->origin; + struct file *file = priv->file; + struct file *sock = priv->sock; + + kevent_storage_dequeue(k->st, k); + flush_scheduled_work(); + wait_for_completion(&priv->bio_complete); + + kfree(k->priv); + k->priv = NULL; + iput(inode); + fput(file); + fput(sock); + + return 0; +} + +asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, + size_t size, unsigned flags) +{ + struct ukevent ukread, uksend; + struct kevent_user *u; + struct file *file; + int err, fput_needed; + int num = (flags & 7)?(flags & 7):8; + + memset(&ukread, 0, sizeof(struct ukevent)); + memset(&uksend, 0, sizeof(struct ukevent)); + + ukread.type = KEVENT_AIO; + ukread.event = KEVENT_AIO_BIO; + + ukread.id.raw[0] = fd; + ukread.id.raw[1] = num; + ukread.ret_data[0] = s; + + dprintk("%s: fd=%d, s=%d, num=%d.\n", __func__, fd, s, num); + + file = fget_light(ctl_fd, &fput_needed); + if (!file) + return -ENODEV; + + u = file->private_data; + if (!u) { + err = -EINVAL; + goto err_out_fput; + } + + err = kevent_user_add_ukevent(&ukread, u); + if (err < 0) + goto err_out_fput; + +err_out_fput: + fput_light(file, fput_needed); + return err; +} + +int kevent_init_aio(struct kevent *k) +{ + k->enqueue = &kevent_aio_enqueue; + k->dequeue = &kevent_aio_dequeue; + k->callback = &kevent_aio_callback; + return 0; +} ^ permalink raw reply related [flat|nested] 73+ messages in thread
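A sketch of driving aio_sendfile() as implemented above. Note that in this version the size argument is never copied into the ukevent; kevent_aio_enqueue() recomputes the length from i_size_read(), so effectively the whole file is streamed from offset zero. __NR_aio_sendfile is an assumed syscall number, and the wait side reuses the control-fd conventions from the earlier sketches.

#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

static long aio_sendfile(int ctl_fd, int fd, int s, size_t size, unsigned flags)
{
	return syscall(__NR_aio_sendfile, ctl_fd, fd, s, size, flags);	/* assumed */
}

static int aio_sendfile_example(int kfd, int file_fd, int sock)
{
	/* The low three bits of flags select the per-pass page batch
	 * (num = flags & 7, defaulting to 8), trading readahead depth
	 * against the number of pinned cache pages per pass. */
	if (aio_sendfile(kfd, file_fd, sock, 0, 0) < 0)
		return -1;

	/* Completion is observed like any other kevent: wait on kfd with
	 * KEVENT_CTL_WAIT until the KEVENT_AIO event becomes ready. */
	return 0;
}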
* [4/4] kevent: poll/select() notifications. Timer notifications.
  2006-07-26 9:18 ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
@ 2006-07-26 9:18 ` Evgeniy Polyakov
  2006-07-26 10:00 ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
  2006-07-26 10:04 ` Christoph Hellwig
  2 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 9:18 UTC (permalink / raw)
  To: lkml; +Cc: David Miller, Ulrich Drepper, Evgeniy Polyakov, netdev

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup).

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use, and they
are limited.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..4950e7c
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,223 @@
+/*
+ * 	kevent_poll.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+	struct poll_table_struct pt;
+	struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+	struct list_head container_entry;
+	wait_queue_head_t *whead;
+	wait_queue_t wait;
+	struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+	struct list_head container_list;
+	spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+		unsigned mode, int sync, void *key)
+{
+	struct kevent_poll_wait_container *cont =
+		container_of(wait, struct kevent_poll_wait_container, wait);
+	struct kevent *k = cont->k;
+	struct file *file = k->st->origin;
+	unsigned long flags;
+	u32 revents, event;
+
+	revents = file->f_op->poll(file, NULL);
+	spin_lock_irqsave(&k->lock, flags);
+	event = k->event.event;
+	spin_unlock_irqrestore(&k->lock, flags);
+
+	kevent_storage_ready(k->st, NULL, revents);
+
+	return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+		struct poll_table_struct *poll_table)
+{
+	struct kevent *k =
+		container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+	struct kevent_poll_private *priv = k->priv;
+	struct kevent_poll_wait_container *cont;
+	unsigned long flags;
+
+	cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+	if (!cont) {
+		kevent_break(k);
+		return;
+	}
+
+	
cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; + struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc); + + err = kevent_storage_enqueue(&file->st, k); + if (err) + goto err_out_free; + + revents = file->f_op->poll(file, &ctl.pt); + if (revents & k->event.event) { + ready = 1; + kevent_poll_dequeue(k); + } + + return ready; + +err_out_free: + kmem_cache_free(kevent_poll_priv_cache, priv); +err_out_fput: + fput(file); + return err; +} + +static int kevent_poll_dequeue(struct kevent *k) +{ + struct file *file = k->st->origin; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *w, *n; + unsigned long flags; + + kevent_storage_dequeue(k->st, k); + + spin_lock_irqsave(&priv->container_lock, flags); + list_for_each_entry_safe(w, n, &priv->container_list, container_entry) { + list_del(&w->container_entry); + remove_wait_queue(w->whead, &w->wait); + kmem_cache_free(kevent_poll_container_cache, w); + } + spin_unlock_irqrestore(&priv->container_lock, flags); + + kmem_cache_free(kevent_poll_priv_cache, priv); + k->priv = NULL; + + fput(file); + + return 0; +} + +static int kevent_poll_callback(struct kevent *k) +{ + struct file *file = k->st->origin; + unsigned int revents = file->f_op->poll(file, NULL); + return (revents & k->event.event); +} + +int kevent_init_poll(struct kevent *k) +{ + if (!kevent_poll_container_cache || !kevent_poll_priv_cache) + return -ENOMEM; + + k->enqueue = &kevent_poll_enqueue; + k->dequeue = &kevent_poll_dequeue; + k->callback = &kevent_poll_callback; + return 0; +} + + +static int __init kevent_poll_sys_init(void) +{ + kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache", + sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL); + if (!kevent_poll_container_cache) { + printk(KERN_ERR "Failed to create kevent poll container cache.\n"); + return -ENOMEM; + } + + kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache", + sizeof(struct kevent_poll_private), 0, 0, NULL, NULL); + if (!kevent_poll_priv_cache) { + printk(KERN_ERR "Failed to create kevent poll private data cache.\n"); + kmem_cache_destroy(kevent_poll_container_cache); + kevent_poll_container_cache = NULL; + return -ENOMEM; + } + + printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n"); + return 0; +} + +static void __exit kevent_poll_sys_fini(void) +{ + kmem_cache_destroy(kevent_poll_priv_cache); + kmem_cache_destroy(kevent_poll_container_cache); +} + +module_init(kevent_poll_sys_init); +module_exit(kevent_poll_sys_fini); diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 0000000..53d3bdf --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 
+1,112 @@ +/* + * kevent_timer.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/timer.h> +#include <linux/jiffies.h> +#include <linux/kevent.h> + +static void kevent_timer_func(unsigned long data) +{ + struct kevent *k = (struct kevent *)data; + struct timer_list *t = k->st->origin; + + kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL); + mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0])); +} + +static int kevent_timer_enqueue(struct kevent *k) +{ + struct timer_list *t; + struct kevent_storage *st; + int err; + + t = kmalloc(sizeof(struct timer_list) + sizeof(struct kevent_storage), + GFP_KERNEL); + if (!t) + return -ENOMEM; + + init_timer(t); + t->function = kevent_timer_func; + t->expires = jiffies + msecs_to_jiffies(k->event.id.raw[0]); + t->data = (unsigned long)k; + + st = (struct kevent_storage *)(t+1); + err = kevent_storage_init(t, st); + if (err) + goto err_out_free; + + err = kevent_storage_enqueue(st, k); + if (err) + goto err_out_st_fini; + + add_timer(t); + + return 0; + +err_out_st_fini: + kevent_storage_fini(st); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + struct timer_list *t = st->origin; + + if (!t) + return -ENODEV; + + del_timer_sync(t); + + kevent_storage_dequeue(st, k); + + kfree(t); + + return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + struct kevent_storage *st = k->st; + struct timer_list *t = st->origin; + + if (!t) + return -ENODEV; + + k->event.ret_data[0] = (__u32)jiffies; + return 1; +} + +int kevent_init_timer(struct kevent *k) +{ + k->enqueue = &kevent_timer_enqueue; + k->dequeue = &kevent_timer_dequeue; + k->callback = &kevent_timer_callback; + return 0; +} ^ permalink raw reply related [flat|nested] 73+ messages in thread
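For orientation, a consumer of the timer notification above would look roughly like this from userspace. This is a sketch only: the kevent_ctl() wrapper, the KEVENT_CTL_ADD command and the KEVENT_TIMER type constant are assumptions (they belong to other parts of the kevent patchset, not to this mail); the one detail taken from the patch itself is that the period is carried in event.id.raw[0] in milliseconds, and that kevent_timer_func() re-arms the timer, so the event is periodic.

	/* Hypothetical userspace sketch, not part of the patch above.
	 * kevent_ctl(), KEVENT_CTL_ADD and KEVENT_TIMER are assumed names;
	 * id.raw[0] holding the period in msecs comes from
	 * kevent_timer_enqueue() in the patch. */
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;		/* assumed type constant */
	uk.id.raw[0] = 100;		/* re-armed every 100 msec by
					 * kevent_timer_func() */

	if (kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk) < 0)
		perror("kevent_ctl");	/* assumed syscall wrapper */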
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26  9:18 ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26  9:18 ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
@ 2006-07-26 10:00 ` Christoph Hellwig
  2006-07-26 10:08 ` Evgeniy Polyakov
  2006-07-26 10:04 ` Christoph Hellwig
  2 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:00 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> 
> This patch includes asynchronous propagation of file's data into VFS
> cache and aio_sendfile() implementation.
> Network aio_sendfile() works lazily - it asynchronously populates pages
> into the VFS cache (which can be used for various tricks with adaptive
> readahead) and then uses the usual ->sendfile() callback.
> 
> Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index 6a0b9ad..a3ee530 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -119,7 +119,7 @@ void bio_free(struct bio *bio, struct bi
>  /*
>   * default destructor for a bio allocated with bio_alloc_bioset()
>   */
> -static void bio_fs_destructor(struct bio *bio)
> +void bio_fs_destructor(struct bio *bio)
>  {
>  	bio_free(bio, fs_bio_set);
>  }
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 04af9c4..295fce9 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -685,6 +685,7 @@ ext2_writepages(struct address_space *ma
>  }
>  
>  struct address_space_operations ext2_aops = {
> +	.get_block	= ext2_get_block,

No way in hell. For whatever you do please provide an interface at
the readpage/writepage/sendfile/etc abstraction layer. get_block is
nothing that can be exposed to the common code.

^ permalink raw reply	[flat|nested] 73+ messages in thread
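For context, the abstraction boundary Christoph is pointing at looks like this in 2.6-era ext2: get_block stays private to the filesystem and is only handed to a generic helper behind the ->readpage() operation, so nothing outside the filesystem ever sees it (sketch of the mainline arrangement, shown here for reference):

	/* fs/ext2/inode.c, mainline: the generic helper receives the
	 * get_block callback; address_space_operations never exposes it */
	static int ext2_readpage(struct file *file, struct page *page)
	{
		return mpage_readpage(page, ext2_get_block);
	}

	struct address_space_operations ext2_aops = {
		.readpage	= ext2_readpage,
		/* ... */
	};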
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:00 ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
@ 2006-07-26 10:08 ` Evgeniy Polyakov
  2006-07-26 10:13 ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:08 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:00:13AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > struct address_space_operations ext2_aops = {
> > +	.get_block	= ext2_get_block,
> 
> No way in hell. For whatever you do please provide an interface at
> the readpage/writepage/sendfile/etc abstraction layer. get_block is
> nothing that can be exposed to the common code.

Compare this with the sync read methods - all they do is exactly the
same operations with low-level blocks, combined into a nicely exported
function, so there is _no_ readpage layer - it calls only one function
which works with blocks.

I would create the same, i.e. async_readpage(), which would call
kevent's functions and process low-level blocks, just like the sync
code does, but that requires kevent to be a deep part of the FS tree.

So I prefer to have
kevent/some_function_which_works_with_blocks_and_kevents()
instead of
fs/some_function_which_works_with_block_and_kevents()
kevent/call_that_function_like_all_readpage_callbacks_do().

So it is not a technical problem, but a political one.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:08 ` Evgeniy Polyakov
@ 2006-07-26 10:13 ` Christoph Hellwig
  2006-07-26 10:25 ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 02:08:49PM +0400, Evgeniy Polyakov wrote:
> On Wed, Jul 26, 2006 at 11:00:13AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> > > struct address_space_operations ext2_aops = {
> > > +	.get_block	= ext2_get_block,
> > 
> > No way in hell. For whatever you do please provide an interface at
> > the readpage/writepage/sendfile/etc abstraction layer. get_block is
> > nothing that can be exposed to the common code.
> 
> Compare this with the sync read methods - all they do is exactly the
> same operations with low-level blocks, combined into a nicely exported
> function, so there is _no_ readpage layer - it calls only one function
> which works with blocks.

No. The abstraction layer there is ->readpage(s). _A_ common implementation
works with a get_block callback from the filesystem, but there are various
others. We've been there before, up to mid-2.3.x we had a get_block inode
operation and we got rid of it because it is the wrong abstraction.

> So it is not a technical problem, but a political one.

It's a technical problem, and it's called getting your abstractions right.
And on top of that a political one, and that's called getting your
abstractions coherent. If you managed to argue all of us into accepting
that get_block is the right abstraction (and as I mentioned above that's
technically not true) you'd still have the burden to update everything to
use the same abstraction.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:13 ` Christoph Hellwig
@ 2006-07-26 10:25 ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:25 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:13:56AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, Jul 26, 2006 at 02:08:49PM +0400, Evgeniy Polyakov wrote:
> > Compare this with the sync read methods - all they do is exactly the
> > same operations with low-level blocks, combined into a nicely exported
> > function, so there is _no_ readpage layer - it calls only one function
> > which works with blocks.
> 
> No. The abstraction layer there is ->readpage(s). _A_ common implementation
> works with a get_block callback from the filesystem, but there are various
> others. We've been there before, up to mid-2.3.x we had a get_block inode
> operation and we got rid of it because it is the wrong abstraction.

Well, kevent can work not on its own, but with the common implementation
which works with get_block(). No problem here.

> > So it is not a technical problem, but a political one.
> 
> It's a technical problem, and it's called getting your abstractions right.
> And on top of that a political one, and that's called getting your
> abstractions coherent. If you managed to argue all of us into accepting
> that get_block is the right abstraction (and as I mentioned above that's
> technically not true) you'd still have the burden to update everything to
> use the same abstraction.

Christoph, I completely understand your point of view. There is
absolutely no technical problem in creating a common async
implementation, placing it where the existing sync one lives and
calling it from the readpage() level. It just requires allowing the BIO
callbacks to be changed instead of the default ones, and (probably)
even the sync readpage can be used.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26  9:18 ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
  2006-07-26  9:18 ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
  2006-07-26 10:00 ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
@ 2006-07-26 10:04 ` Christoph Hellwig
  2006-07-26 10:12 ` David Miller
  2006-07-26 10:19 ` Evgeniy Polyakov
  2 siblings, 2 replies; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:04 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> 
> This patch includes asynchronous propagation of file's data into VFS
> cache and aio_sendfile() implementation.
> Network aio_sendfile() works lazily - it asynchronously populates pages
> into the VFS cache (which can be used for various tricks with adaptive
> readahead) and then uses the usual ->sendfile() callback.

And please don't base this on sendfile. Please make the splice
infrastructure asynchronous without duplicating all the code, but rather
make the existing code async and have the existing synchronous calls wait
on it to finish, similar to how we handle async/sync direct I/O. And to
be honest, I don't think adding all this code is acceptable if it can't
replace the existing aio code while keeping the interface. So while your
interface looks pretty sane the implementation still needs a lot of
work :)

^ permalink raw reply	[flat|nested] 73+ messages in thread
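The direct I/O analogy reduces to the following shape: one asynchronous engine, with the synchronous entry point implemented as submit-then-wait. A minimal sketch with a hypothetical async_splice_submit(); only the completion pattern is the point here, not the function names:

	/* Sketch only: async_splice_submit() is hypothetical.  The sync
	 * call just submits and waits, the same way synchronous direct
	 * I/O waits on its async machinery. */
	static ssize_t sync_splice(struct file *in, struct file *out, size_t len)
	{
		struct completion done;
		int err;

		init_completion(&done);
		err = async_splice_submit(in, out, len, &done);
		if (err)
			return err;
		wait_for_completion(&done);
		return len;
	}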
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:04 ` Christoph Hellwig
@ 2006-07-26 10:12 ` David Miller
  2006-07-26 10:15 ` Christoph Hellwig
  2006-07-26 14:14 ` Avi Kivity
  1 sibling, 2 replies; 73+ messages in thread
From: David Miller @ 2006-07-26 10:12 UTC (permalink / raw)
  To: hch; +Cc: johnpol, linux-kernel, drepper, netdev

From: Christoph Hellwig <hch@infradead.org>
Date: Wed, 26 Jul 2006 11:04:31 +0100

> And to be honest, I don't think adding all this code is acceptable
> if it can't replace the existing aio code while keeping the
> interface. So while your interface looks pretty sane the
> implementation still needs a lot of work :)

Networking and disk AIO have significantly different needs.

Therefore, I really don't see it as reasonable to expect
a merge of these two things. It doesn't make any sense.

I do agree that this stuff needs to be cleaned up, all the get_block
etc. hacks have to be pulled out and abstracted properly. That part
of the kevent changes is indeed still crap :)

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-26 10:12 ` David Miller @ 2006-07-26 10:15 ` Christoph Hellwig 2006-07-26 20:21 ` Phillip Susi 2006-07-26 14:14 ` Avi Kivity 1 sibling, 1 reply; 73+ messages in thread From: Christoph Hellwig @ 2006-07-26 10:15 UTC (permalink / raw) To: David Miller; +Cc: hch, johnpol, linux-kernel, drepper, netdev On Wed, Jul 26, 2006 at 03:12:47AM -0700, David Miller wrote: > From: Christoph Hellwig <hch@infradead.org> > Date: Wed, 26 Jul 2006 11:04:31 +0100 > > > And to be honest, I don't think adding all this code is acceptable > > if it can't replace the existing aio code while keeping the > > interface. So while you interface looks pretty sane the > > implementation needs a lot of work still :) > > Networking and disk AIO have significantly different needs. > > Therefore, I really don't see it as reasonable to expect > a merge of these two things. It doesn't make any sense. I'm not sure about that. The current aio interface isn't exactly nice for disk I/O either. I'm more than happy to have a discussion about that aspect. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-26 10:15 ` Christoph Hellwig @ 2006-07-26 20:21 ` Phillip Susi 0 siblings, 0 replies; 73+ messages in thread From: Phillip Susi @ 2006-07-26 20:21 UTC (permalink / raw) To: Christoph Hellwig, David Miller, johnpol, linux-kernel, drepper, netdev Christoph Hellwig wrote: >> Networking and disk AIO have significantly different needs. >> >> Therefore, I really don't see it as reasonable to expect >> a merge of these two things. It doesn't make any sense. > > I'm not sure about that. The current aio interface isn't exactly nice > for disk I/O either. I'm more than happy to have a discussion about > that aspect. > I agree that it makes perfect sense for a merger because aio and networking have very similar needs. In both cases, the caller hands the kernel a buffer and wants the kernel to either fill it or consume it, and to be able to do so asynchronously. You also want to maximize performance in both cases by taking advantage of zero copy IO. I wonder though, why do you say the current aio interface isn't nice for disk IO? It seems to work rather nicely to me, and is much better than the posix aio interface. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-26 10:12 ` David Miller 2006-07-26 10:15 ` Christoph Hellwig @ 2006-07-26 14:14 ` Avi Kivity 1 sibling, 0 replies; 73+ messages in thread From: Avi Kivity @ 2006-07-26 14:14 UTC (permalink / raw) To: David Miller; +Cc: hch, johnpol, linux-kernel, drepper, netdev David Miller wrote: > > From: Christoph Hellwig <hch@infradead.org> > Date: Wed, 26 Jul 2006 11:04:31 +0100 > > > And to be honest, I don't think adding all this code is acceptable > > if it can't replace the existing aio code while keeping the > > interface. So while you interface looks pretty sane the > > implementation needs a lot of work still :) > > Networking and disk AIO have significantly different needs. > Surely, there needs to be a unified polling interface to support single threaded designs. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:04 ` Christoph Hellwig
  2006-07-26 10:12 ` David Miller
@ 2006-07-26 10:19 ` Evgeniy Polyakov
  2006-07-26 10:30 ` Christoph Hellwig
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-07-26 10:19 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 11:04:31AM +0100, Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov wrote:
> > 
> > This patch includes asynchronous propagation of file's data into VFS
> > cache and aio_sendfile() implementation.
> > Network aio_sendfile() works lazily - it asynchronously populates pages
> > into the VFS cache (which can be used for various tricks with adaptive
> > readahead) and then uses the usual ->sendfile() callback.
> 
> And please don't base this on sendfile. Please make the splice
> infrastructure asynchronous without duplicating all the code, but rather
> make the existing code async and have the existing synchronous calls wait
> on it to finish, similar to how we handle async/sync direct I/O. And to
> be honest, I don't think adding all this code is acceptable if it can't
> replace the existing aio code while keeping the interface. So while your
> interface looks pretty sane the implementation still needs a lot of
> work :)

Kevent was created quite a bit before splice and friends, so I used what
was there :)

I stopped working on AIO, since neither the existing implementation nor
mine was able to outperform sync speeds (one of the major problems in my
implementation is get_user_pages() overhead, which can be completely
eliminated with physical memory allocation done in advance in userspace,
as Ulrich described).

My personal opinion on the existing AIO is that it is not the right
design. Benjamin LaHaise agrees with me (if I understood him right), but
he failed to move AIO away from the repeated-call model (2.4 had a
state-machine-based one, and out-of-tree 2.6 patches have that design
too).

In theory the existing AIO (with the whole POSIX userspace API) can be
replaced with kevent (it will even take less space), but I would present
it as a TODO item, since kevent itself has nothing to do with AIO.
Kevent is a generic event processing mechanism; AIO, network AIO and all
the others are just kernel users of its functionality.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:19 ` Evgeniy Polyakov
@ 2006-07-26 10:30 ` Christoph Hellwig
  2006-07-26 14:28 ` Ulrich Drepper
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2006-07-26 10:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christoph Hellwig, lkml, David Miller, Ulrich Drepper, netdev

On Wed, Jul 26, 2006 at 02:19:21PM +0400, Evgeniy Polyakov wrote:
> I stopped working on AIO, since neither the existing implementation nor
> mine was able to outperform sync speeds (one of the major problems in my
> implementation is get_user_pages() overhead, which can be completely
> eliminated with physical memory allocation done in advance in userspace,
> as Ulrich described).
> My personal opinion on the existing AIO is that it is not the right design.
> Benjamin LaHaise agrees with me (if I understood him right),

I completely agree with that as well.

> but he failed to move AIO away from the repeated-call model (2.4 had a
> state-machine-based one, and out-of-tree 2.6 patches have that design too).
> In theory the existing AIO (with the whole POSIX userspace API) can be
> replaced with kevent (it will even take less space), but I would present
> it as a TODO item, since kevent itself has nothing to do with AIO.

And replacing the existing aio code is exactly what I want you to do.
We can't keep adding more and more code without getting rid of the old
mess forever.

And yes, the asynchronous pagecache population bit in your patchkit has
a lot to do with aio. It's a variant of aio done right (or at least
less bad). I suspect the right way to go ahead is to drop that bit for
now (it's by far the worst code in the patchkit anyway) and then redo it
later so that it doesn't get the abstractions wrong or duplicate lots of
code, but also replaces the aio code. I don't expect you to do that
alone, you'll probably need quite a bit of help from us FS and VM
people.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 10:30 ` Christoph Hellwig
@ 2006-07-26 14:28 ` Ulrich Drepper
  2006-07-26 16:22 ` Badari Pulavarty
  0 siblings, 1 reply; 73+ messages in thread
From: Ulrich Drepper @ 2006-07-26 14:28 UTC (permalink / raw)
  To: Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller,
	Ulrich Drepper, netdev

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

Christoph Hellwig wrote:
>> My personal opinion on the existing AIO is that it is not the right design.
>> Benjamin LaHaise agrees with me (if I understood him right),
> 
> I completely agree with that as well.

I agree, too, but the current code is not the last of the line. Suparna
has a set of patches which make the current kernel aio code work much
better and especially make it really usable to implement POSIX AIO.

In Ottawa we were talking about submitting it and Suparna will. We just
thought about a little longer timeframe. I guess it could be
accelerated since she mostly has the patches done. But I don't know her
schedule.

Important here is, don't base any decision on the current aio
implementation.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 14:28 ` Ulrich Drepper
@ 2006-07-26 16:22 ` Badari Pulavarty
  2006-07-27  6:49 ` Sébastien Dugué
  0 siblings, 1 reply; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-26 16:22 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev,
	Suparna Bhattacharya

Ulrich Drepper wrote:
> Christoph Hellwig wrote:
>>> My personal opinion on the existing AIO is that it is not the right design.
>>> Benjamin LaHaise agrees with me (if I understood him right),
>>>
>> I completely agree with that as well.
>>
> I agree, too, but the current code is not the last of the line. Suparna
> has a set of patches which make the current kernel aio code work much
> better and especially make it really usable to implement POSIX AIO.
>
> In Ottawa we were talking about submitting it and Suparna will. We just
> thought about a little longer timeframe. I guess it could be
> accelerated since she mostly has the patches done. But I don't know her
> schedule.
>
> Important here is, don't base any decision on the current aio
> implementation.

Ulrich,

Suparna mentioned your interest in making POSIX glibc aio work with
kernel-aio at OLS. We thought taking a fresh look at the (kernel-side)
work BULL did would be a nice starting point. I re-based those patches
to 2.6.18-rc2 and sent them to Zach Brown for review before sending them
out to the list.

These patches do NOT make AIO any cleaner. All they do is add
functionality to make supporting POSIX AIO easier. They are:

[ PATCH 1/3 ] Adding signal notification for event completion

[ PATCH 2/3 ] lio (listio) completion semantics

[ PATCH 3/3 ] cancel_fd support

Suparna explained these in the following article:

http://lwn.net/Articles/148755/

If you think this is a reasonable direction/approach for the kernel and
you would take care of the glibc side of things - I can spend time on
these patches, getting them into reasonable shape, and push for
inclusion.

Please let us know.

Thanks,
Badari

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-26 16:22 ` Badari Pulavarty
@ 2006-07-27  6:49 ` Sébastien Dugué
  2006-07-27 15:28 ` Badari Pulavarty
  0 siblings, 1 reply; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-27  6:49 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml,
	David Miller, netdev, Suparna Bhattacharya

On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote:
> Ulrich Drepper wrote:
> > Christoph Hellwig wrote:
> >>> My personal opinion on the existing AIO is that it is not the right design.
> >>> Benjamin LaHaise agrees with me (if I understood him right),
> >>>
> >> I completely agree with that as well.
> >>
> > I agree, too, but the current code is not the last of the line. Suparna
> > has a set of patches which make the current kernel aio code work much
> > better and especially make it really usable to implement POSIX AIO.
> >
> > In Ottawa we were talking about submitting it and Suparna will. We just
> > thought about a little longer timeframe. I guess it could be
> > accelerated since she mostly has the patches done. But I don't know her
> > schedule.
> >
> > Important here is, don't base any decision on the current aio
> > implementation.
> 
> Ulrich,
> 
> Suparna mentioned your interest in making POSIX glibc aio work with
> kernel-aio at OLS. We thought taking a fresh look at the (kernel-side)
> work BULL did would be a nice starting point. I re-based those patches
> to 2.6.18-rc2 and sent them to Zach Brown for review before sending them
> out to the list.
> 
> These patches do NOT make AIO any cleaner. All they do is add
> functionality to make supporting POSIX AIO easier. They are:
> 
> [ PATCH 1/3 ] Adding signal notification for event completion
> 
> [ PATCH 2/3 ] lio (listio) completion semantics
> 
> [ PATCH 3/3 ] cancel_fd support

Badari,

Thanks for refreshing those patches, they have been sitting here
for quite some time now, collecting dust.

I also think Suparna's patchset for doing buffered AIO would be
a real plus here.

> 
> Suparna explained these in the following article:
> 
> http://lwn.net/Articles/148755/
> 
> If you think this is a reasonable direction/approach for the kernel and
> you would take care of the glibc side of things - I can spend time on
> these patches, getting them into reasonable shape, and push for
> inclusion.

Ulrich, if you want to see how those patches are put to use in
libposix-aio, have a look at http://sourceforge.net/projects/paiol.

It could be a starting point for glibc.

Thanks,

Sébastien.

-- 
-----------------------------------------------------
Sébastien Dugué                 BULL/FREC:B1-247
phone: (+33) 476 29 77 70       Bullcom: 229-7770

mailto:sebastien.dugue@bull.net

Linux POSIX AIO: http://www.bullopensource.org/posix
                 http://sourceforge.net/projects/paiol
-----------------------------------------------------

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27  6:49 ` Sébastien Dugué
@ 2006-07-27 15:28 ` Badari Pulavarty
  2006-07-27 18:14 ` Zach Brown
  2006-07-28  7:26 ` Sébastien Dugué
  1 sibling, 2 replies; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-27 15:28 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml,
	David Miller, netdev, Suparna Bhattacharya

Sébastien Dugué wrote:
> On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote:
>> Ulrich Drepper wrote:
>>> Christoph Hellwig wrote:
>>>>> My personal opinion on the existing AIO is that it is not the right design.
>>>>> Benjamin LaHaise agrees with me (if I understood him right),
>>>>>
>>>> I completely agree with that as well.
>>>>
>>> I agree, too, but the current code is not the last of the line. Suparna
>>> has a set of patches which make the current kernel aio code work much
>>> better and especially make it really usable to implement POSIX AIO.
>>>
>>> In Ottawa we were talking about submitting it and Suparna will. We just
>>> thought about a little longer timeframe. I guess it could be
>>> accelerated since she mostly has the patches done. But I don't know her
>>> schedule.
>>>
>>> Important here is, don't base any decision on the current aio
>>> implementation.
>>
>> Ulrich,
>>
>> Suparna mentioned your interest in making POSIX glibc aio work with
>> kernel-aio at OLS. We thought taking a fresh look at the (kernel-side)
>> work BULL did would be a nice starting point. I re-based those patches
>> to 2.6.18-rc2 and sent them to Zach Brown for review before sending them
>> out to the list.
>>
>> These patches do NOT make AIO any cleaner. All they do is add
>> functionality to make supporting POSIX AIO easier. They are:
>>
>> [ PATCH 1/3 ] Adding signal notification for event completion
>>
>> [ PATCH 2/3 ] lio (listio) completion semantics
>>
>> [ PATCH 3/3 ] cancel_fd support
>
> Badari,
>
> Thanks for refreshing those patches, they have been sitting here
> for quite some time now, collecting dust.
>
> I also think Suparna's patchset for doing buffered AIO would be
> a real plus here.
>
>> Suparna explained these in the following article:
>>
>> http://lwn.net/Articles/148755/
>>
>> If you think this is a reasonable direction/approach for the kernel and
>> you would take care of the glibc side of things - I can spend time on
>> these patches, getting them into reasonable shape, and push for
>> inclusion.
>
> Ulrich, if you want to see how those patches are put to use in
> libposix-aio, have a look at http://sourceforge.net/projects/paiol.
>
> It could be a starting point for glibc.
>
> Thanks,
>
> Sébastien.

Sébastien,

Suparna mentioned that Ulrich wants us to concentrate on kernel-side
support, so that he can look at the glibc side of things (along with
other work he is already doing). So, if we can get an agreement on
what kind of kernel support is needed - we can focus our efforts on
the kernel side first and leave glibc enablement to the capable hands
of Uli :)

Thanks,
Badari

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 15:28 ` Badari Pulavarty @ 2006-07-27 18:14 ` Zach Brown 2006-07-27 18:29 ` Badari Pulavarty 2006-07-28 7:26 ` Sébastien Dugué 1 sibling, 1 reply; 73+ messages in thread From: Zach Brown @ 2006-07-27 18:14 UTC (permalink / raw) To: Badari Pulavarty Cc: Sébastien Dugué, Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya > Suparna mentioned at Ulrich wants us to concentrate on kernel-side > support, so that he can look at glibc side of things (along with > other work he is already doing). So, if we can get an agreement on > what kind of kernel support is needed - we can focus our efforts on > kernel side first and leave glibc enablement to capable hands of Uli > :) Yeah, and the existing patches still need some cleanup. Badari, did you still want me to look into that? We need someone to claim ultimate responsibility for getting these patches suitable for merging :). I'm happy to do that if Suparna isn't already on it. - z ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:14 ` Zach Brown
@ 2006-07-27 18:29 ` Badari Pulavarty
  2006-07-27 18:44 ` Ulrich Drepper
  0 siblings, 1 reply; 73+ messages in thread
From: Badari Pulavarty @ 2006-07-27 18:29 UTC (permalink / raw)
  To: Zach Brown
  Cc: Sébastien Dugué, Ulrich Drepper, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya

On Thu, 2006-07-27 at 11:14 -0700, Zach Brown wrote:
> > Suparna mentioned that Ulrich wants us to concentrate on kernel-side
> > support, so that he can look at the glibc side of things (along with
> > other work he is already doing). So, if we can get an agreement on
> > what kind of kernel support is needed - we can focus our efforts on
> > the kernel side first and leave glibc enablement to the capable hands
> > of Uli :)
> 
> Yeah, and the existing patches still need some cleanup. Badari, did you
> still want me to look into that?
> 
> We need someone to claim ultimate responsibility for getting these
> patches suitable for merging :). I'm happy to do that if Suparna isn't
> already on it.

Zach,

Thanks for volunteering!! Sébastien and I should be able to help you.

Before we spend too much time cleaning up and merging into mainline -
I would like an agreement that what we add is good enough for glibc
POSIX AIO.

I hate to waste everyone's time and add complexity to the kernel - if
the glibc side is not going to happen :(

Thanks,
Badari

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 18:29 ` Badari Pulavarty
@ 2006-07-27 18:44 ` Ulrich Drepper
  2006-07-27 21:02 ` Badari Pulavarty
  ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Ulrich Drepper @ 2006-07-27 18:44 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Zach Brown, Sébastien Dugué, Christoph Hellwig,
	Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

Badari Pulavarty wrote:
> Before we spend too much time cleaning up and merging into mainline -
> I would like an agreement that what we add is good enough for glibc
> POSIX AIO.

I haven't seen a description of the interface so far. Would be good if
it existed. But I briefly mentioned one quirk in the interface about
which Suparna wasn't sure whether it's implemented/implementable in the
current interface.

If a lio_listio call is made the individual requests are handled just as
if they'd been issued separately. I.e., the notification specified in the
individual aiocb is performed when the specific request is done. Then,
once all requests are done, another notification is made, this time
controlled by the sigevent parameter of lio_listio.

Another feature which I always wanted: the current lio_listio call
returns in blocking mode only if all requests are done. In non-blocking
mode it returns immediately and the program needs to poll the aiocbs.
What is needed is something in the middle. For instance, if multiple
read requests are issued the program might be able to start working as
soon as one request is satisfied. I.e., a call similar to lio_listio
would be nice which also takes another parameter specifying how many of
the NENT aiocbs have to finish before the call returns.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread
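For reference, the POSIX call under discussion has the following shape, and the "something in the middle" described above amounts to one extra count parameter. The second prototype is purely hypothetical, sketched from the paragraph above:

	#include <aio.h>

	/* Existing POSIX interface: with LIO_WAIT it returns only once
	 * all nent requests are done; with LIO_NOWAIT it returns
	 * immediately and sig controls the completion notification. */
	int lio_listio(int mode, struct aiocb *const list[], int nent,
		       struct sigevent *sig);

	/* Hypothetical extension: return (or notify) once at least
	 * min_nr of the nent requests have finished, min_nr <= nent. */
	int lio_listio_min(int mode, struct aiocb *const list[], int nent,
			   int min_nr, struct sigevent *sig);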
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 18:44 ` Ulrich Drepper @ 2006-07-27 21:02 ` Badari Pulavarty 2006-07-28 7:31 ` Sébastien Dugué 2006-07-28 12:58 ` Sébastien Dugué 2006-07-28 7:29 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué 2006-07-31 10:11 ` Suparna Bhattacharya 2 siblings, 2 replies; 73+ messages in thread From: Badari Pulavarty @ 2006-07-27 21:02 UTC (permalink / raw) To: Ulrich Drepper Cc: Zach Brown, Sébastien Dugué, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote: > Badari Pulavarty wrote: > > Before we spend too much time cleaning up and merging into mainline - > > I would like an agreement that what we add is good enough for glibc > > POSIX AIO. > > I haven't seen a description of the interface so far. Would be good if > it existed. But I briefly mentioned one quirk in the interface about > which Suparna wasn't sure whether it's implemented/implementable in the > current interface. Sebastien, could you provide a description of interfaces you are adding ? Since you did all the work, it would be appropriate for you to do it :) > If a lio_listio call is made the individual requests are handle just as > if they'd be issue separately. I.e., the notification specified in the > individual aiocb is performed when the specific request is done. Then, > once all requests are done, another notification is made, this time > controlled by the sigevent parameter if lio_listio. > > > Another feature which I always wanted: the current lio_listio call > returns in blocking mode only if all requests are done. In non-blocking > mode it returns immediately and the program needs to poll the aiocbs. > What is needed is something in the middle. For instance, if multiple > read requests are issued the program might be able to start working as > soon as one request is satisfied. I.e., a call similar to lio_listio > would be nice which also takes another parameter specifying how many of > the NENT aiocbs have to finish before the call returns. Looks reasonable. Thanks, Badari ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 21:02 ` Badari Pulavarty @ 2006-07-28 7:31 ` Sébastien Dugué 2006-07-28 12:58 ` Sébastien Dugué 1 sibling, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-07-28 7:31 UTC (permalink / raw) To: Badari Pulavarty Cc: Ulrich Drepper, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya On Thu, 2006-07-27 at 14:02 -0700, Badari Pulavarty wrote: > On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote: > > Badari Pulavarty wrote: > > > Before we spend too much time cleaning up and merging into mainline - > > > I would like an agreement that what we add is good enough for glibc > > > POSIX AIO. > > > > I haven't seen a description of the interface so far. Would be good if > > it existed. But I briefly mentioned one quirk in the interface about > > which Suparna wasn't sure whether it's implemented/implementable in the > > current interface. > > Sebastien, could you provide a description of interfaces you are > adding ? Since you did all the work, it would be appropriate for > you to do it :) > I will clean up what description I have and send it soon. Sébastien. -- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-27 21:02 ` Badari Pulavarty
  2006-07-28  7:31 ` Sébastien Dugué
@ 2006-07-28 12:58 ` Sébastien Dugué
  2006-08-11 19:45 ` Ulrich Drepper
  1 sibling, 1 reply; 73+ messages in thread
From: Sébastien Dugué @ 2006-07-28 12:58 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Ulrich Drepper, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 1257 bytes --]

On Thu, 2006-07-27 at 14:02 -0700, Badari Pulavarty wrote:
> On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote:
> > Badari Pulavarty wrote:
> > > Before we spend too much time cleaning up and merging into mainline -
> > > I would like an agreement that what we add is good enough for glibc
> > > POSIX AIO.
> > 
> > I haven't seen a description of the interface so far. Would be good if
> > it existed. But I briefly mentioned one quirk in the interface about
> > which Suparna wasn't sure whether it's implemented/implementable in the
> > current interface.
> 
> Sébastien, could you provide a description of interfaces you are
> adding? Since you did all the work, it would be appropriate for
> you to do it :)

Here are the descriptions for the AIO completion notification and
listio patches. Hope I did not leave out too much.

Sébastien.

-- 
-----------------------------------------------------
Sébastien Dugué                 BULL/FREC:B1-247
phone: (+33) 476 29 77 70       Bullcom: 229-7770

mailto:sebastien.dugue@bull.net

Linux POSIX AIO: http://www.bullopensource.org/posix
                 http://sourceforge.net/projects/paiol
-----------------------------------------------------

[-- Attachment #2: aioevent.txt --]
[-- Type: text/plain, Size: 2741 bytes --]

aio completion notification

Summary:
-------

The current 2.6 kernel does not support notification of user space via
an RT signal upon an asynchronous IO completion. The POSIX specification
states that when an AIO request completes, a signal can be delivered to
the application as notification.

The aioevent patch adds a struct sigevent *aio_sigeventp to the iocb.
The relevant fields (pid, signal number and value) are stored in the
kiocb for use when the request completes.

That sigevent structure is filled by the application as part of the AIO
request preparation. Upon request completion, the kernel notifies the
application using those sigevent parameters. If SIGEV_NONE has been
specified, then the old behaviour is retained and the application must
rely on polling the completion queue using io_getevents().

Details:
-------

A struct sigevent *aio_sigeventp is added to struct iocb in
include/linux/aio_abi.h

An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
include/linux/aio.h:

  - IO_NOTIFY_SIGNAL means that the signal is to be sent to the
    requesting thread

  - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
    specific thread.

The following fields are added to struct kiocb in include/linux/aio.h:

  - pid_t ki_pid: target of the signal

  - __u16 ki_signo: signal number

  - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
    IO_NOTIFY_THREAD_ID

  - uid_t ki_uid, ki_euid: filled with the submitter credentials

  - sigval_t ki_sigev_value: value stuffed in siginfo

These fields are only valid if ki_signo != 0.

In io_submit_one(), if the application provided a sigevent then
iocb_setup_sigevent() is called, which does the following:

  - save current->uid and current->euid in the kiocb fields ki_uid and
    ki_euid for use in the completion path to check permissions

  - check access to the user sigevent

  - extract the needed fields from the sigevent (pid, signo, and value).
    If the signal number passed from userspace is 0 then no notification
    is to occur and ki_signo is set to 0

  - check whether the submitting thread wants to be notified directly
    (sigevent->sigev_notify_thread_id is 0) or wants the signal to be
    sent to another thread. In the latter case a check is made to assert
    that the target thread is in the same thread group

  - fill in the kiocb fields (ki_pid, ki_signo, ki_notify and
    ki_sigev_value) for that request.

Upon request completion, in aio_complete(), if ki_signo is not 0, then
__aio_send_signal() is called, which sends the signal as follows:

  - fill in the siginfo struct to be sent to the application

  - check whether we have permission to signal the given thread

  - send the signal

[-- Attachment #3: lioevent.txt --]
[-- Type: text/plain, Size: 2489 bytes --]

listio support

Summary:
-------

The lio patch adds POSIX listio completion notification support. It
builds on support provided by the aio event patch and adds an
IOCB_CMD_GROUP command to sys_io_submit(). The purpose of IOCB_CMD_GROUP
is to group together all the requests that follow it in the list, up to
the end of the list.

As part of listio submission, the user process prepends to the list of
requests a special empty aiocb with an aio_lio_opcode of IOCB_CMD_GROUP,
filling in only the aio_sigevent fields.

Details:
-------

An IOCB_CMD_GROUP is added to the IOCB_CMD enum in
include/linux/aio_abi.h

A struct lio_event is added in include/linux/aio.h

A struct lio_event *ki_lio is added to struct kiocb in
include/linux/aio.h

In sys_io_submit(), upon detecting such an IOCB_CMD_GROUP marker iocb,
an lio_event is created in lio_create() which contains the necessary
information for signaling a thread (signal number, pid, notify type and
value) along with a count of requests attached to this event.

The following depicts the lio_event structure:

	struct lio_event {
		atomic_t	lio_users;
		int		lio_wait;
		__s32		lio_pid;
		__u16		lio_signo;
		__u16		lio_notify;
		__u64		lio_value;
		uid_t		lio_uid, lio_euid;
	};

lio_users holds a count of the number of requests attached to this lio.
It is incremented with each request submitted and decremented at each
request completion. Thread notification occurs when this count reaches 0.

Each subsequently submitted request is attached to this lio_event by
setting the request's kiocb->ki_lio to that lio_event (in
io_submit_one()) and incrementing the lio_users count.

In aio_complete(), if the request is attached to an lio (ki_lio != 0),
then lio_check() is called to decrement the lio_users count and
eventually signal the user process when all the requests in the group
have completed.

The IOCB_CMD_GROUP command semantics are as follows:

  - if the associated aiocb sigevent is NULL then we want to group
    requests for the purpose of blocking on the group completion
    (LIO_WAIT sync behaviour).

  - if the associated sigevent is valid (not NULL) then we want to group
    requests for the purpose of being notified upon that group of
    requests' completion (LIO_NOWAIT async behaviour).

^ permalink raw reply	[flat|nested] 73+ messages in thread
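Putting the aioevent description into code, a submitter would fill in the sigevent roughly as follows. This is a sketch: the aio_sigeventp field name comes from the description above, fd/buf/count are assumed to be set up elsewhere, and the rest is standard sigevent usage:

	struct sigevent sev;
	struct iocb cb;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;	/* or SIGEV_THREAD_ID */
	sev.sigev_signo = SIGRTMIN + 1;		/* RT signal to deliver */
	sev.sigev_value.sival_ptr = &cb;	/* value stuffed into siginfo */

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes = fd;			/* assumed open fd */
	cb.aio_buf = (unsigned long)buf;	/* assumed buffer */
	cb.aio_nbytes = count;
	cb.aio_offset = 0;
	cb.aio_sigeventp = &sev;		/* field added by the patch */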
* Re: [3/4] kevent: AIO, aio_sendfile() implementation.
  2006-07-28 12:58 ` Sébastien Dugué
@ 2006-08-11 19:45 ` Ulrich Drepper
  2006-08-12 18:29 ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
  0 siblings, 1 reply; 73+ messages in thread
From: Ulrich Drepper @ 2006-08-11 19:45 UTC (permalink / raw)
  To: Sébastien Dugué
  Cc: Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov,
	lkml, David Miller, netdev, Suparna Bhattacharya

[-- Attachment #1: Type: text/plain, Size: 4620 bytes --]

Sébastien Dugué wrote:
> aio completion notification

I looked over this now but I don't think I understand everything. Or I
don't see how it all is integrated. And no, I'm not looking at the
proposed glibc code since that would mean being tainted.

> Details:
> -------
>
> A struct sigevent *aio_sigeventp is added to struct iocb in
> include/linux/aio_abi.h
>
> An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
> include/linux/aio.h:
>
>   - IO_NOTIFY_SIGNAL means that the signal is to be sent to the
>     requesting thread
>
>   - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
>     specific thread.

This has been proved to be sufficient in the timer code which basically
has the same problem. But why do you need separate constants? We have
the various SIGEV_* constants, among them SIGEV_THREAD_ID. Just use
these constants for the values of ki_notify.

> The following fields are added to struct kiocb in include/linux/aio.h:
>
>   - pid_t ki_pid: target of the signal
>
>   - __u16 ki_signo: signal number
>
>   - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
>     IO_NOTIFY_THREAD_ID
>
>   - uid_t ki_uid, ki_euid: filled with the submitter credentials

These two fields aren't needed for the POSIX interfaces. Where does the
requirement come from? I don't say they should be removed, they might
be useful, but if the costs are non-negligible then they could go away.

>   - check whether the submitting thread wants to be notified directly
>     (sigevent->sigev_notify_thread_id is 0) or wants the signal to be
>     sent to another thread. In the latter case a check is made to assert
>     that the target thread is in the same thread group

Is this really how it's implemented? This is not how it should be.
Either a signal is sent to a specific thread in the same process (this
is what SIGEV_THREAD_ID is for) or the signal is sent to the calling
process. Sending a signal to the process means that from the kernel's
POV any thread which doesn't have the signal blocked can receive it.
The final decision is made by the kernel. There is no mechanism to send
the signal to another process.

So, for the purpose of the POSIX AIO code the ki_pid value is only
needed when the SIGEV_THREAD_ID bit is set.

It could be an extension and I don't mind it being introduced. But
again, it's not necessary and if it adds costs then it could be left
out. It is something which could easily be introduced later if the need
arises.

> listio support

I really don't understand the kernel interface for this feature.

> Details:
> -------
>
> An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
>
> A struct lio_event is added in include/linux/aio.h
>
> A struct lio_event *ki_lio is added to struct kiocb in include/linux/aio.h

So you have a pointer in the structure for the individual requests. I
assume you use the atomic counter to trigger the final delivery. I
further assume that if lio_wait is set the calling thread is suspended
until all requests are handled and that the final notification in this
case means that thread gets woken.

This is all fine.

But how do you pass the requests to the kernel? If you have a new
lio_listio-like syscall it'll be easy. But I haven't seen anything like
this mentioned.

The alternative is to pass the requests one-by-one in which case I don't
see how you create the reference to the lio_listio control block. This
approach seems to be slower.

If all requests are passed at once, do you have the equivalent of
LIO_NOP entries?

How can we support the extension where we wait for a number of requests
which need not be all of them. I.e., I submit N requests and want to be
notified when at least M (M <= N) have completed. I am not yet clear
about the actual semantics we should implement (e.g., do we send another
notification after the first one?) but it's something which IMO should
be taken into account in the design.

Finally, and this is very important, does your code send out the
individual requests' notifications and then in the end the lio_listio
completion? I think Suparna wrote this is the case but I want to make
sure.

Overall, this looks much better than the old code. If the answers to my
questions show that the behavior is compatible with the POSIX AIO code
I'm certainly very much in favor of adding the kernel code.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-11 19:45 ` Ulrich Drepper @ 2006-08-12 18:29 ` Suparna Bhattacharya 2006-08-12 19:10 ` Ulrich Drepper 2006-09-04 14:28 ` Sébastien Dugué 0 siblings, 2 replies; 73+ messages in thread From: Suparna Bhattacharya @ 2006-08-12 18:29 UTC (permalink / raw) To: Ulrich Drepper Cc: =?iso-8859-1?Q?S=E9bastien_Dugu=E9_=3Csebastien=2Edugue=40bull=2Enet?=.=?iso-8859-1?Q?=3E?=, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio BTW, if anyone would like to be dropped off this growing cc list, please let us know. On Fri, Aug 11, 2006 at 12:45:55PM -0700, Ulrich Drepper wrote: > Sébastien Dugué wrote: > > aio completion notification > > I looked over this now but I don't think I understand everything. Or I > don't see how it all is integrated. And no, I'm not looking at the > proposed glibc code since would mean being tainted. Oh, I didn't realise that. I'll make an attempt to clarify parts that I understand based on what I have gleaned from my reading of the code and intent, but hopefully Sebastien, Ben, Zach et al will be able to pitch in for a more accurate and complete picture. > > > > Details: > > ------- > > > > A struct sigevent *aio_sigeventp is added to struct iocb in > > include/linux/aio_abi.h > > > > An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in > > include/linux/aio.h: > > > > - IO_NOTIFY_SIGNAL means that the signal is to be sent to the > > requesting thread > > > > - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a > > specifi thread. > > This has been proved to be sufficient in the timer code which basically > has the same problem. But why do you need separate constants? We have > the various SIGEV_* constants, among them SIGEV_THREAD_ID. Just use > these constants for the values of ki_notify. > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not part of the ABI, but only internal to the kernel implementation. I think Zach had suggested inferring THREAD_ID notification if the pid specified is not zero. But, I don't see why ->sigev_notify couldn't used directly (just like the POSIX timers code does) thus doing away with the new constants altogether. Sebestian/Laurent, do you recall? > > > The following fields are added to struct kiocb in include/linux/aio.h: > > > > - pid_t ki_pid: target of the signal > > > > - __u16 ki_signo: signal number > > > > - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or > > IO_NOTIFY_THREAD_ID > > > > - uid_t ki_uid, ki_euid: filled with the submitter credentials > > These two fields aren't needed for the POSIX interfaces. Where does the > requirement come from? I don't say they should be removed, they might > be useful, but if the costs are non-negligible then they could go away. I'm guessing they are being used for validation of permissions at the time of sending the signal, but maybe saving the task pointer in the iocb instead of the pid would suffice ? > > > > - check whether the submitting thread wants to be notified directly > > (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent > > to another thread. > > In the latter case a check is made to assert that the target thread > > is in the same thread group > > Is this really how it's implemented? This is not how it should be. > Either a signal is sent to a specific thread in the same process (this > is what SIGEV_THREAD_ID is for) or the signal is sent to a calling > process. 
Sending a signal to the process means that from the kernel's > POV any thread which doesn't have the signal blocked can receive it. > The final decision is made by the kernel. There is no mechanism to send > the signal to another process. The code seems to be set up to call specific_send_sig_info() in the case of *_THREAD_ID , and __group_send_sig_info() otherwise. So I think the intended behaviour is as you describe it should be (__group_send_sig_info does the equivalent of sending a signal to the process and so any thread which doesn't have signals blocked can receive it, while specific_send_sig_info sends it to a particular thread). But, I should really leave it to Sebestian to confirm that. > > So, for the purpose of the POSIX AIO code the ki_pid value is only > needed when the SIGEV_THREAD_ID bit is set. > > It could be an extension and I don't mind it being introduced. But > again, it's not necessary and if it adds costs then it could be left > out. It is something which could easily be introduced later if the need > arises. > > > > listio support > > > > I really don't understand the kernel interface for this feature. I'm sorry this is confusing. This probably means that we need to separate the external interface description more clearly and completely from the internals. > > > > Details: > > ------- > > > > An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h > > > > A struct lio_event is added in include/linux/aio.h > > > > A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h > > So you have a pointer in the structure for the individual requests. I > assume you use the atomic counter to trigger the final delivery. I > further assume that if lio_wait is set the calling thread is suspended > until all requests are handled and that the final notification in this > case means that thread gets woken. > > This is all fine. > > But how do you pass the requests to the kernel? If you have a new > lio_listio-like syscall it'll be easy. But I haven't seen anything like > this mentioned. > > The alternative is to pass the requests one-by-one in which case I don't > see how you create the reference to the lio_listio control block. This > approach seems to be slower. The way it works (and better ideas are welcome) is that, since the io_submit() syscall already accepts an array of iocbs[], no new syscall was introduced. To implement lio_listio, one has to set up such an array, with the first iocb in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which specifies the sigev notification to be associated with group completion (a NULL value of the sigev notification pointer would imply equivalent of LIO_WAIT). The following iocbs in the array should correspond to the set of listio aiocbs. Whenever it encounters an IOCB_CMD_GROUP iocb opcode, the kernel would interpret all subsequent iocbs[] submitted in the same io_submit() call to be associated with the same lio control block. Does that clarify ? Would an example help ? > > If all requests are passed at once, do you have the equivalent of > LIO_NOP entries? > Good question - we do have an IOCB_CMD_NOOP defined, and I seem to even recall a patch that implemented it, but am wondering if it ever got merged. Ben/Zach ? > > How can we support the extension where we wait for a number of requests > which need not be all of them. I.e., I submit N requests and want to be > notified when at least M (M <= N) notified. 
I am not yet clear about > the actual semantics we should implement (e.g., do we send another > notification after the first one?) but it's something which IMO should > be taken into account in the design. > My thought here was that it should be possible to include M as a parameter to the IOCB_CMD_GROUP opcode iocb, and thus have it incorporated in the lio control block ... then whatever semantics are agreed upon can be implemented. > > Finally, and this is very important, does your code send out the > individual requests' notifications and then in the end the lio_listio > completion? I think Suparna wrote this is the case but I want to make sure. Sébastien, could you confirm ? > > Overall, this looks much better than the old code. If the answers to my > questions show that the behavior is compatible with the POSIX AIO code > I'm certainly very much in favor of adding the kernel code. Thanks a lot for looking through this ! Let us know what you think about the listio interface ... hopefully the other issues are mostly simple to resolve. Regards Suparna > > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ > -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 73+ messages in thread
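For illustration, a minimal sketch of the grouped-submission scheme Suparna describes above, using the existing io_submit() ABI plus the IOCB_CMD_GROUP opcode and aio_sigeventp field from the proposed patches; the exact field types and names beyond those cited in the thread are assumptions, not the final interface:

	#include <linux/aio_abi.h>
	#include <signal.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* Sketch: lio_listio-style grouped submission over io_submit().
	 * ctx comes from an earlier io_setup(); reqs[] are ordinary
	 * IOCB_CMD_PREAD/PWRITE iocbs set up as usual. */
	static long submit_group(aio_context_t ctx, struct iocb *reqs[], int n,
				 struct sigevent *sev)
	{
		struct iocb group;
		struct iocb *iocbs[n + 1];
		int i;

		memset(&group, 0, sizeof(group));
		group.aio_lio_opcode = IOCB_CMD_GROUP;	/* opens the group */
		/* a NULL sev would mean the LIO_WAIT equivalent */
		group.aio_sigeventp = (__u64)(unsigned long)sev;

		iocbs[0] = &group;
		for (i = 0; i < n; i++)		/* the listio aiocbs proper */
			iocbs[i + 1] = reqs[i];

		/* every iocb after the IOCB_CMD_GROUP entry in this call is
		 * tied by the kernel to the same lio control block */
		return syscall(__NR_io_submit, ctx, (long)(n + 1), iocbs);
	}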
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 18:29 ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya @ 2006-08-12 19:10 ` Ulrich Drepper 2006-08-12 19:28 ` Jakub Jelinek ` (2 more replies) 2006-09-04 14:28 ` Sébastien Dugué 1 sibling, 3 replies; 73+ messages in thread From: Ulrich Drepper @ 2006-08-12 19:10 UTC (permalink / raw) To: suparna Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio [-- Attachment #1: Type: text/plain, Size: 2751 bytes --] Suparna Bhattacharya wrote: > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > part of the ABI, but only internal to the kernel implementation. I think > Zach had suggested inferring THREAD_ID notification if the pid specified > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > (just like the POSIX timers code does) thus doing away with the > new constants altogether. Sébastien/Laurent, do you recall? I suggest modeling the implementation after the timer code which does exactly what we need. > I'm guessing they are being used for validation of permissions at the time > of sending the signal, but maybe saving the task pointer in the iocb instead > of the pid would suffice ? Why should any verification be necessary? The requests are generated in the same process which will receive the notification. Even if the POSIX process (aka, kernel process group) changes the IDs the notifications should still be sent. The key is that notifications cannot be sent to another POSIX process. Adding this as a feature just makes things so much more complicated. > So I think the > intended behaviour is as you describe it should be Then the documentation needs to be adjusted. > The way it works (and better ideas are welcome) is that, since the io_submit() > syscall already accepts an array of iocbs[], no new syscall was introduced. > To implement lio_listio, one has to set up such an array, with the first iocb > in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which > specifies the sigev notification to be associated with group completion > (a NULL value of the sigev notification pointer would imply the equivalent of > LIO_WAIT). OK, this seems OK. We have to construct the iocb arrays dynamically anyway. > My thought here was that it should be possible to include M as a parameter > to the IOCB_CMD_GROUP opcode iocb, and thus have it incorporated in the lio control > block ... then whatever semantics are agreed upon can be implemented. If you have room for the parameter this is fine. For the beginning we can enforce the number to be the same as the total number of requests. > Let us know what you think about the listio interface ... hopefully the > other issues are mostly simple to resolve. It should be fine and I would support adding all this assuming the normal file support (as opposed to direct I/O only) is added, too. But I have one last question: sockets, pipes and the like are already supported, right? If this is not the case we have a problem with the currently proposed lio_listio interface. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 19:10 ` Ulrich Drepper @ 2006-08-12 19:28 ` Jakub Jelinek 2006-09-04 14:37 ` Sébastien Dugué 2006-08-14 7:02 ` Suparna Bhattacharya 2006-09-04 14:36 ` Sébastien Dugué 2 siblings, 1 reply; 73+ messages in thread From: Jakub Jelinek @ 2006-08-12 19:28 UTC (permalink / raw) To: Ulrich Drepper Cc: suparna, sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote: > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > > part of the ABI, but only internal to the kernel implementation. I think > > Zach had suggested inferring THREAD_ID notification if the pid specified > > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > > (just like the POSIX timers code does) thus doing away with the > > new constants altogether. Sébastien/Laurent, do you recall? > > I suggest modeling the implementation after the timer code which does > exactly what we need. Yeah, and if at all possible we want to use just one helper thread for SIGEV_THREAD notification of timers/aio/etc., so it really should behave the same as timer thread notification. Jakub ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 19:28 ` Jakub Jelinek @ 2006-09-04 14:37 ` Sébastien Dugué 0 siblings, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-09-04 14:37 UTC (permalink / raw) To: Jakub Jelinek Cc: Ulrich Drepper, suparna, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio On Sat, 2006-08-12 at 15:28 -0400, Jakub Jelinek wrote: > On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote: > > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > > > part of the ABI, but only internal to the kernel implementation. I think > > > Zach had suggested inferring THREAD_ID notification if the pid specified > > > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > > > (just like the POSIX timers code does) thus doing away with the > > > new constants altogether. Sébastien/Laurent, do you recall? > > > > I suggest modeling the implementation after the timer code which does > > exactly what we need. > > Yeah, and if at all possible we want to use just one helper thread for > SIGEV_THREAD notification of timers/aio/etc., so it really should behave the > same as timer thread notification. > That's exactly what is done in libposix-aio. Sébastien. -- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
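To make the timer-code model being suggested here concrete: with POSIX timers the caller already selects between process-wide and thread-directed delivery purely through sigev_notify, which is exactly what is being proposed for the iocb path. A minimal sketch using the existing timer API (note that SIGEV_THREAD_ID is Linux-specific and the _sigev_un._tid spelling of the thread-id field is a glibc-internal detail that may differ between header versions):

	#include <signal.h>
	#include <time.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* Deliver the timer's signal to one specific thread, the same way
	 * the proposed AIO path would resolve SIGEV_THREAD_ID. */
	static int setup_thread_timer(timer_t *timerid)
	{
		struct sigevent sev;

		memset(&sev, 0, sizeof(sev));
		sev.sigev_notify = SIGEV_THREAD_ID;	/* no IO_NOTIFY_* needed */
		sev.sigev_signo = SIGRTMIN;
		sev._sigev_un._tid = syscall(SYS_gettid);	/* target kernel tid */

		return timer_create(CLOCK_MONOTONIC, &sev, timerid);
	}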
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 19:10 ` Ulrich Drepper 2006-08-12 19:28 ` Jakub Jelinek @ 2006-08-14 7:02 ` Suparna Bhattacharya 2006-08-14 16:38 ` Ulrich Drepper 2006-09-04 14:36 ` Sébastien Dugué 2 siblings, 1 reply; 73+ messages in thread From: Suparna Bhattacharya @ 2006-08-14 7:02 UTC (permalink / raw) To: Ulrich Drepper Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, mingo On Sat, Aug 12, 2006 at 12:10:35PM -0700, Ulrich Drepper wrote: > Suparna Bhattacharya wrote: > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > > part of the ABI, but only internal to the kernel implementation. I think > > Zach had suggested inferring THREAD_ID notification if the pid specified > > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > > (just like the POSIX timers code does) thus doing away with the > > new constants altogether. Sébastien/Laurent, do you recall? > > I suggest modeling the implementation after the timer code which does > exactly what we need. Agreed. > > > > I'm guessing they are being used for validation of permissions at the time > > of sending the signal, but maybe saving the task pointer in the iocb instead > > of the pid would suffice ? > > Why should any verification be necessary? The requests are generated in > the same process which will receive the notification. Even if the POSIX > process (aka, kernel process group) changes the IDs the notifications > should still be sent. The key is that notifications cannot be sent to another > POSIX process. Is there a (remote) possibility that the thread could have died and its pid got reused by a new thread in another process ? Or is there a mechanism that prevents such a possibility from arising (not just in NPTL library, but at the kernel level) ? I think the timer code saves a reference to the task pointer instead of the pid, which is what I was suggesting above (instead of the euid checks), as a way to avoid the above situation. > > Adding this as a feature just makes things so much more complicated. > > > > So I think the > > intended behaviour is as you describe it should be > > Then the documentation needs to be adjusted. *Nod* > > > > The way it works (and better ideas are welcome) is that, since the io_submit() > > syscall already accepts an array of iocbs[], no new syscall was introduced. > > To implement lio_listio, one has to set up such an array, with the first iocb > > in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which > > specifies the sigev notification to be associated with group completion > > (a NULL value of the sigev notification pointer would imply the equivalent of > > LIO_WAIT). > > OK, this seems OK. We have to construct the iocb arrays dynamically anyway. > > > > My thought here was that it should be possible to include M as a parameter > > to the IOCB_CMD_GROUP opcode iocb, and thus have it incorporated in the lio control > > block ... then whatever semantics are agreed upon can be implemented. > > If you have room for the parameter this is fine. For the beginning we > can enforce the number to be the same as the total number of requests. > Sounds good. > > > Let us know what you think about the listio interface ... hopefully the > > other issues are mostly simple to resolve. 
> > It should be fine and I would support adding all this assuming the > normal file support (as opposed to direct I/O only) is added, too. OK. I updated my patchset against 2.6.18-rc3 just after OLS. > > > But I have one last question: sockets, pipes and the like are already > supported, right? If this is not the case we have a problem with the > currently proposed lio_listio interface. AIO for pipes should not be a problem - Chris Mason had a patch, so we can just bring it up to the current levels, possibly with some additional improvements. I'm not sure what would be the right thing to do for the sockets case. While we could put together a patch for basic aio_read/write (based on the same model used for files), given the whole ongoing kevent effort, it's not yet clear to me what would make the most sense ... Ben had a patch to do a fallback to kernel threads for AIO operations that are not yet supported natively. I had some concerns about the approach, but I guess he had intended it as an interim path for cases like this. Suggestions would be much appreciated. DaveM, Ingo, Andrew ? Regards Suparna > > -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ > -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 73+ messages in thread
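A sketch of the task-pointer alternative Suparna mentions above, mirroring what the POSIX timer code does to sidestep pid reuse; ki_tsk is a hypothetical field name and the siginfo setup and locking around completion are elided:

	/* At io_submit() time, in the submitting task's context: pin the
	 * task so a recycled pid can never misdirect the completion signal. */
	iocb->ki_tsk = current;
	get_task_struct(current);

	/* At completion time: signal the pinned task, then drop the pin. */
	send_sig_info(iocb->ki_signo, &info, iocb->ki_tsk);
	put_task_struct(iocb->ki_tsk);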
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-14 7:02 ` Suparna Bhattacharya @ 2006-08-14 16:38 ` Ulrich Drepper 2006-08-15 2:06 ` Nicholas Miell 0 siblings, 1 reply; 73+ messages in thread From: Ulrich Drepper @ 2006-08-14 16:38 UTC (permalink / raw) To: suparna Cc: sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, mingo [-- Attachment #1: Type: text/plain, Size: 2036 bytes --] Suparna Bhattacharya wrote: > Is there a (remote) possibility that the thread could have died and its > pid got reused by a new thread in another process ? Or is there a mechanism > that prevents such a possibility from arising (not just in NPTL library, > but at the kernel level) ? The UID/GID won't help you with dying processes. What if the same user creates a process with the same PID? That process will not expect the notification and mustn't receive it. If you cannot detect whether the issuing process died you have problems which cannot be solved with a uid/gid pair. > AIO for pipes should not be a problem - Chris Mason had a patch, so we can > just bring it up to the current levels, possibly with some additional > improvements. Good. > I'm not sure what would be the right thing to do for the sockets case. While > we could put together a patch for basic aio_read/write (based on the same > model used for files), given the whole ongoing kevent effort, it's not yet > clear to me what would make the most sense ... > > Ben had a patch to do a fallback to kernel threads for AIO operations that > are not yet supported natively. I had some concerns about the approach, but > I guess he had intended it as an interim path for cases like this. A fallback solution would be sufficient. Nobody _should_ use POSIX AIO for networking but people do and just giving them something that works is good enough. It cannot really be worse than the userlevel emulation we have now. The alternative, separately and sequentially handling network sockets at userlevel, is horrible. We'd have to go over every file descriptor and check whether it's a socket and then take it out of the request list for the kernel. Then they need to be handled separately before or after the kernel AIO code. This would unduly punish all the 99.9% of the programs which don't use POSIX AIO for network I/O. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 251 bytes --] ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-14 16:38 ` Ulrich Drepper @ 2006-08-15 2:06 ` Nicholas Miell 0 siblings, 0 replies; 73+ messages in thread From: Nicholas Miell @ 2006-08-15 2:06 UTC (permalink / raw) To: Ulrich Drepper Cc: suparna, sebastien.dugue, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, mingo On Mon, 2006-08-14 at 09:38 -0700, Ulrich Drepper wrote: > Suparna Bhattacharya wrote: > > Is there a (remote) possibility that the thread could have died and its > > pid got reused by a new thread in another process ? Or is there a mechanism > > that prevents such a possibility from arising (not just in NPTL library, > > but at the kernel level) ? > > The UID/GID won't help you with dying processes. What if the same user > creates a process with the same PID? That process will not expect the > notification and mustn't receive it. If you cannot detect whether the > issuing process died you have problems which cannot be solved with a > uid/gid pair. > > Eric W. Biederman sent a series of patches that introduced a struct task_ref specifically to solve this sort of problem on January 28 of this year, but I don't think it went anywhere. -- Nicholas Miell <nmiell@comcast.net> ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 19:10 ` Ulrich Drepper 2006-08-12 19:28 ` Jakub Jelinek 2006-08-14 7:02 ` Suparna Bhattacharya @ 2006-09-04 14:36 ` Sébastien Dugué 2 siblings, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-09-04 14:36 UTC (permalink / raw) To: Ulrich Drepper Cc: suparna, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio On Sat, 2006-08-12 at 12:10 -0700, Ulrich Drepper wrote: > Suparna Bhattacharya wrote: > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > > part of the ABI, but only internal to the kernel implementation. I think > > Zach had suggested inferring THREAD_ID notification if the pid specified > > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > > (just like the POSIX timers code does) thus doing away with the > > new constants altogether. Sébastien/Laurent, do you recall? > > I suggest modeling the implementation after the timer code which does > exactly what we need. > Will do. > > > I'm guessing they are being used for validation of permissions at the time > > of sending the signal, but maybe saving the task pointer in the iocb instead > > of the pid would suffice ? > > Why should any verification be necessary? The requests are generated in > the same process which will receive the notification. Even if the POSIX > process (aka, kernel process group) changes the IDs the notifications > should still be sent. The key is that notifications cannot be sent to another > POSIX process. > > Adding this as a feature just makes things so much more complicated. > Agreed. Sébastien. -- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) 2006-08-12 18:29 ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya 2006-08-12 19:10 ` Ulrich Drepper @ 2006-09-04 14:28 ` Sébastien Dugué 1 sibling, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-09-04 14:28 UTC (permalink / raw) To: suparna Cc: Ulrich Drepper, Sébastien Dugué <sebastien.dugue@bull.net>, Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, linux-aio, Benjamin LaHaise Hi, just came back from vacation, sorry for the delay. On Sat, 2006-08-12 at 23:59 +0530, Suparna Bhattacharya wrote: > BTW, if anyone would like to be dropped off this growing cc list, please > let us know. > > On Fri, Aug 11, 2006 at 12:45:55PM -0700, Ulrich Drepper wrote: > > Sébastien Dugué wrote: > > > aio completion notification > > > > I looked over this now but I don't think I understand everything. Or I > > don't see how it all is integrated. And no, I'm not looking at the > > proposed glibc code since that would mean being tainted. > > Oh, I didn't realise that. > I'll make an attempt to clarify parts that I understand based on what I > have gleaned from my reading of the code and intent, but hopefully Sébastien, > Ben, Zach et al will be able to pitch in for a more accurate and complete > picture. > > > > > > > > Details: > > > ------- > > > > > > A struct sigevent *aio_sigeventp is added to struct iocb in > > > include/linux/aio_abi.h > > > > > > An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in > > > include/linux/aio.h: > > > > > > - IO_NOTIFY_SIGNAL means that the signal is to be sent to the > > > requesting thread > > > > > > - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a > > > specific thread. > > > > This has been proved to be sufficient in the timer code which basically > > has the same problem. But why do you need separate constants? We have > > the various SIGEV_* constants, among them SIGEV_THREAD_ID. Just use > > these constants for the values of ki_notify. > > > > I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not > part of the ABI, but only internal to the kernel implementation. I think > Zach had suggested inferring THREAD_ID notification if the pid specified > is not zero. But, I don't see why ->sigev_notify couldn't be used directly > (just like the POSIX timers code does) thus doing away with the > new constants altogether. Sébastien/Laurent, do you recall? As I see it, those IO_NOTIFY_* constants are unneeded and we could use ->sigev_notify directly. I will change this so that we use the same mechanism as the POSIX timers code. > > > > > > The following fields are added to struct kiocb in include/linux/aio.h: > > > > > > - pid_t ki_pid: target of the signal > > > > > > - __u16 ki_signo: signal number > > > > > > - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or > > > IO_NOTIFY_THREAD_ID > > > > > > - uid_t ki_uid, ki_euid: filled with the submitter credentials > > > > These two fields aren't needed for the POSIX interfaces. Where does the > > requirement come from? I don't say they should be removed, they might > > be useful, but if the costs are non-negligible then they could go away. > > I'm guessing they are being used for validation of permissions at the time > of sending the signal, but maybe saving the task pointer in the iocb instead > of the pid would suffice ? 
IIRC, Ben added these for that exact reason. Is this really needed? Ben? > > > > > > - check whether the submitting thread wants to be notified directly > > > (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent > > > to another thread. > > > In the latter case a check is made to assert that the target thread > > > is in the same thread group > > > > Is this really how it's implemented? This is not how it should be. > > Either a signal is sent to a specific thread in the same process (this > > is what SIGEV_THREAD_ID is for) or the signal is sent to the calling > > process. Sending a signal to the process means that from the kernel's > > POV any thread which doesn't have the signal blocked can receive it. > > The final decision is made by the kernel. There is no mechanism to send > > the signal to another process. > > The code seems to be set up to call specific_send_sig_info() in the case > of *_THREAD_ID, and __group_send_sig_info() otherwise. So I think the > intended behaviour is as you describe it should be (__group_send_sig_info > does the equivalent of sending a signal to the process and so any thread > which doesn't have signals blocked can receive it, while specific_send_sig_info > sends it to a particular thread). > > But, I should really leave it to Sébastien to confirm that. That's right, but I think that part needs to be reworked to follow the same logic as the POSIX timers. > > > listio support > > > > > > > I really don't understand the kernel interface for this feature. > > I'm sorry this is confusing. This probably means that we need to > separate the external interface description more clearly and completely > from the internals. > > > > > > > > Details: > > > ------- > > > > > > An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h > > > > > > A struct lio_event is added in include/linux/aio.h > > > > > > A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h > > > > So you have a pointer in the structure for the individual requests. I > > assume you use the atomic counter to trigger the final delivery. I > > further assume that if lio_wait is set the calling thread is suspended > > until all requests are handled and that the final notification in this > > case means that thread gets woken. > > > > This is all fine. > > > > But how do you pass the requests to the kernel? If you have a new > > lio_listio-like syscall it'll be easy. But I haven't seen anything like > > this mentioned. > > > > The alternative is to pass the requests one-by-one in which case I don't > > see how you create the reference to the lio_listio control block. This > > approach seems to be slower. > > The way it works (and better ideas are welcome) is that, since the io_submit() > syscall already accepts an array of iocbs[], no new syscall was introduced. > To implement lio_listio, one has to set up such an array, with the first iocb > in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which > specifies the sigev notification to be associated with group completion > (a NULL value of the sigev notification pointer would imply the equivalent of > LIO_WAIT). The following iocbs in the array should correspond to the set of > listio aiocbs. Whenever it encounters an IOCB_CMD_GROUP iocb opcode, the > kernel would interpret all subsequent iocbs[] submitted in the same > io_submit() call to be associated with the same lio control block. > > Does that clarify ? > > Would an example help ? 
> > > > > If all requests are passed at once, do you have the equivalent of > > LIO_NOP entries? So far, LIO_NOP entries are pruned by the support library (libposix-aio) and never sent to the kernel. > > > > Good question - we do have an IOCB_CMD_NOOP defined, and I seem to even > recall a patch that implemented it, but I am wondering if it ever got merged. > Ben/Zach ? > > > > > How can we support the extension where we wait for a number of requests > > which need not be all of them. I.e., I submit N requests and want to be > > notified when at least M (M <= N) have completed. I am not yet clear about > > the actual semantics we should implement (e.g., do we send another > > notification after the first one?) but it's something which IMO should > > be taken into account in the design. > > > > My thought here was that it should be possible to include M as a parameter > to the IOCB_CMD_GROUP opcode iocb, and thus have it incorporated in the lio control > block ... then whatever semantics are agreed upon can be implemented. > > > > > Finally, and this is very important, does your code send out the > > individual requests' notifications and then in the end the lio_listio > > completion? I think Suparna wrote this is the case but I want to make sure. > > Sébastien, could you confirm ? If (and only if) the user did set up a sigevent for one or more individual requests then those requests' completions will trigger notifications and in the end the list completion notification is sent. Otherwise, only the list completion notification is sent. -- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
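Restated as code, the completion-time dispatch Suparna and Sébastien describe would presumably look something like the following kernel-side sketch; the two helpers are the ones named in the thread, while the surrounding completion path, the siginfo setup, and the tsk lookup are assumed:

	/* On iocb completion, mirroring the POSIX timer behaviour: */
	if (iocb->ki_notify == SIGEV_THREAD_ID)
		/* exactly one thread, chosen via sigev_notify_thread_id */
		specific_send_sig_info(iocb->ki_signo, &info, tsk);
	else
		/* the whole thread group: any thread not blocking the
		 * signal may receive it */
		__group_send_sig_info(iocb->ki_signo, &info, tsk);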
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 18:44 ` Ulrich Drepper 2006-07-27 21:02 ` Badari Pulavarty @ 2006-07-28 7:29 ` Sébastien Dugué 2006-07-31 10:11 ` Suparna Bhattacharya 2 siblings, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-07-28 7:29 UTC (permalink / raw) To: Ulrich Drepper Cc: Badari Pulavarty, Zach Brown, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya On Thu, 2006-07-27 at 11:44 -0700, Ulrich Drepper wrote: > Badari Pulavarty wrote: > > Before we spend too much time cleaning up and merging into mainline - > > I would like an agreement that what we add is good enough for glibc > > POSIX AIO. > > I haven't seen a description of the interface so far. Would be good if > it existed. But I briefly mentioned one quirk in the interface about > which Suparna wasn't sure whether it's implemented/implementable in the > current interface. > > If a lio_listio call is made the individual requests are handled just as > if they'd been issued separately. I.e., the notification specified in the > individual aiocb is performed when the specific request is done. Then, > once all requests are done, another notification is made, this time > controlled by the sigevent parameter of lio_listio. > > > Another feature which I always wanted: the current lio_listio call > returns in blocking mode only if all requests are done. In non-blocking > mode it returns immediately and the program needs to poll the aiocbs. > What is needed is something in the middle. For instance, if multiple > read requests are issued the program might be able to start working as > soon as one request is satisfied. I.e., a call similar to lio_listio > would be nice which also takes another parameter specifying how many of > the NENT aiocbs have to finish before the call returns. You're right here, that definitely would be a plus. -- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 18:44 ` Ulrich Drepper 2006-07-27 21:02 ` Badari Pulavarty 2006-07-28 7:29 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué @ 2006-07-31 10:11 ` Suparna Bhattacharya 2 siblings, 0 replies; 73+ messages in thread From: Suparna Bhattacharya @ 2006-07-31 10:11 UTC (permalink / raw) To: Ulrich Drepper Cc: Badari Pulavarty, Zach Brown, Sébastien Dugué <sebastien.dugue@bull.net>, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev On Thu, Jul 27, 2006 at 11:44:23AM -0700, Ulrich Drepper wrote: > Badari Pulavarty wrote: > > Before we spend too much time cleaning up and merging into mainline - > > I would like an agreement that what we add is good enough for glibc > > POSIX AIO. > > I haven't seen a description of the interface so far. Would be good if Did Sébastien's mail with the description help ? > it existed. But I briefly mentioned one quirk in the interface about > which Suparna wasn't sure whether it's implemented/implementable in the > current interface. > > If a lio_listio call is made the individual requests are handled just as > if they'd been issued separately. I.e., the notification specified in the > individual aiocb is performed when the specific request is done. Then, > once all requests are done, another notification is made, this time > controlled by the sigevent parameter of lio_listio. Looking at the code in the lio kernel patch, this should already be covered: if (iocb->ki_signo) __aio_send_signal(iocb); + if (iocb->ki_lio) + lio_check(iocb->ki_lio); That is, it first checks the notification in the individual iocb, and then the one for the LIO. > > > Another feature which I always wanted: the current lio_listio call > returns in blocking mode only if all requests are done. In non-blocking > mode it returns immediately and the program needs to poll the aiocbs. > What is needed is something in the middle. For instance, if multiple > read requests are issued the program might be able to start working as > soon as one request is satisfied. I.e., a call similar to lio_listio > would be nice which also takes another parameter specifying how many of > the NENT aiocbs have to finish before the call returns. I imagine the kernel could enable this by incorporating this additional parameter for IOCB_CMD_GROUP in the ABI (in the default case this should be the same as the total number of iocbs submitted to lio_listio). Now should the at-least-NENT check apply only to LIO_WAIT or also to the LIO_NOWAIT notification case ? BTW, the native io_getevents does support a min_nr wakeup already, except that it applies to any iocb on the io_context, and not just a given lio_listio call. Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 73+ messages in thread
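For comparison, the min_nr behaviour referred to above is already visible at the libaio level; a minimal sketch of waiting for at least two of up to eight outstanding completions on a context (scoped to the whole io_context, not to one lio_listio group):

	#include <libaio.h>
	#include <stdio.h>

	/* Block until at least 2 events are available on ctx (no timeout,
	 * since the last argument is NULL); collect up to 8 of them. */
	static int wait_for_some(io_context_t ctx)
	{
		struct io_event events[8];
		int got = io_getevents(ctx, 2, 8, events, NULL);

		if (got < 0)	/* libaio returns -errno on failure */
			fprintf(stderr, "io_getevents: %d\n", got);
		return got;
	}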
* Re: [3/4] kevent: AIO, aio_sendfile() implementation. 2006-07-27 15:28 ` Badari Pulavarty 2006-07-27 18:14 ` Zach Brown @ 2006-07-28 7:26 ` Sébastien Dugué 1 sibling, 0 replies; 73+ messages in thread From: Sébastien Dugué @ 2006-07-28 7:26 UTC (permalink / raw) To: Badari Pulavarty Cc: Ulrich Drepper, Christoph Hellwig, Evgeniy Polyakov, lkml, David Miller, netdev, Suparna Bhattacharya On Thu, 2006-07-27 at 08:28 -0700, Badari Pulavarty wrote: > Sébastien Dugué wrote: > > On Wed, 2006-07-26 at 09:22 -0700, Badari Pulavarty wrote: > > > >> Ulrich Drepper wrote: > >> > >>> Christoph Hellwig wrote: > >>> > >>> > >>>>> My personal opinion on existing AIO is that it is not the right design. > >>>>> Benjamin LaHaise agrees with me (if I understood him right), > >>>>> > >>>>> > >>>> I completely agree with that as well. > >>>> > >>>> > >>> I agree, too, but the current code is not the last of the line. Suparna > >>> has a set of patches which make the current kernel aio code work much > >>> better and especially make it really usable to implement POSIX AIO. > >>> > >>> In Ottawa we were talking about submitting it and Suparna will. We just > >>> thought about a little longer timeframe. I guess it could be > >>> accelerated since she mostly has the patch done. But I don't know her > >>> schedule. > >>> > >>> Important here is: don't base any decision on the current aio > >>> implementation. > >>> > >>> > >> Ulrich, > >> > >> Suparna mentioned your interest in making POSIX glibc aio work with > >> kernel-aio at OLS. > >> We thought taking a re-look at the (kernel side) work BULL did would be > >> a nice starting > >> point. I re-based those patches to 2.6.18-rc2 and sent them to Zach Brown > >> for review before > >> sending them out to the list. > >> > >> These patches do NOT make AIO any cleaner. All they do is add > >> functionality to make supporting > >> POSIX AIO easier. These are > >> > >> [ PATCH 1/3 ] Adding signal notification for event completion > >> > >> [ PATCH 2/3 ] lio (listio) completion semantics > >> > >> [ PATCH 3/3 ] cancel_fd support > >> > > > > Badari, > > > > Thanks for refreshing those patches, they have been sitting here > > for quite some time now collecting dust. > > > > I also think Suparna's patchset for doing buffered AIO would be > > a real plus here. > > > > > >> Suparna explained these in the following article: > >> > >> http://lwn.net/Articles/148755/ > >> > >> If you think this is a reasonable direction/approach for the kernel and > >> you would take care > >> of the glibc side of things - I can spend time on these patches, getting > >> them into reasonable shape > >> and pushing for inclusion. > >> > > > > Ulrich, if you want to see how those patches are put to > > use in libposix-aio, have a look at http://sourceforge.net/projects/paiol. > > > > It could be a starting point for glibc. > > > > Thanks, > > > > Sébastien. > > > > > Sébastien, > > Suparna mentioned that Ulrich wants us to concentrate on kernel-side > support, so that he > can look at the glibc side of things (along with other work he is already > doing). So, if we > can get an agreement on what kind of kernel support is needed - we can > focus our > efforts on the kernel side first and leave glibc enablement to the capable hands > of Uli :) > That's fine with me. Sébastien. 
-- ----------------------------------------------------- Sébastien Dugué BULL/FREC:B1-247 phone: (+33) 476 29 77 70 Bullcom: 229-7770 mailto:sebastien.dugue@bull.net Linux POSIX AIO: http://www.bullopensource.org/posix http://sourceforge.net/projects/paiol ----------------------------------------------------- ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-07-26 9:18 ` [1/4] kevent: core files Evgeniy Polyakov 2006-07-26 9:18 ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov @ 2006-07-26 10:31 ` Andrew Morton 2006-07-26 10:37 ` Evgeniy Polyakov 2006-07-26 10:44 ` Evgeniy Polyakov 2 siblings, 1 reply; 73+ messages in thread From: Andrew Morton @ 2006-07-26 10:31 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: linux-kernel, davem, drepper, netdev On Wed, 26 Jul 2006 13:18:15 +0400 Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote: > +static int kevent_ctl_process(struct file *file, > + struct kevent_user_control *ctl, void __user *arg) > +{ > + int err; > + struct kevent_user *u = file->private_data; > + > + if (!u) > + return -EINVAL; > + > + switch (ctl->cmd) { > + case KEVENT_CTL_ADD: > + err = kevent_user_ctl_add(u, ctl, > + arg+sizeof(struct kevent_user_control)); > + break; > + case KEVENT_CTL_REMOVE: > + err = kevent_user_ctl_remove(u, ctl, > + arg+sizeof(struct kevent_user_control)); > + break; > + case KEVENT_CTL_MODIFY: > + err = kevent_user_ctl_modify(u, ctl, > + arg+sizeof(struct kevent_user_control)); > + break; > + case KEVENT_CTL_WAIT: > + err = kevent_user_wait(file, u, ctl, arg); > + break; > + case KEVENT_CTL_INIT: > + err = kevent_ctl_init(); > + break; > + default: > + err = -EINVAL; > + break; > + } > + > + return err; > +} Please indent the body of the switch one tabstop to the left. > +asmlinkage long sys_kevent_ctl(int fd, void __user *arg) > +{ > + int err, fput_needed; > + struct kevent_user_control ctl; > + struct file *file; > + > + if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control))) > + return -EINVAL; > + > + if (ctl.cmd == KEVENT_CTL_INIT) > + return kevent_ctl_init(); > + > + file = fget_light(fd, &fput_needed); > + if (!file) > + return -ENODEV; > + > + err = kevent_ctl_process(file, &ctl, arg); > + > + fput_light(file, fput_needed); > + return err; > +} If the user passes this an fd which was obtained via means other than kevent_ctl_init(), the kernel will explode. Do if (file->f_op != &kevent_user_fops) return -EINVAL; ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-07-26 10:31 ` [1/4] kevent: core files Andrew Morton @ 2006-07-26 10:37 ` Evgeniy Polyakov 0 siblings, 0 replies; 73+ messages in thread From: Evgeniy Polyakov @ 2006-07-26 10:37 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, davem, drepper, netdev On Wed, Jul 26, 2006 at 03:31:05AM -0700, Andrew Morton (akpm@osdl.org) wrote: > Please indent the body of the switch one tabstop to the left. .. > If the user passes this an fd which was obtained via means other than > kevent_ctl_init(), the kernel will explode. Do > > if (file->f_fop != &kevent_user_fops) > return -EINVAL; Thanks, I will implement both. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-07-26 9:18 ` [1/4] kevent: core files Evgeniy Polyakov 2006-07-26 9:18 ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov 2006-07-26 10:31 ` [1/4] kevent: core files Andrew Morton @ 2006-07-26 10:44 ` Evgeniy Polyakov 2 siblings, 0 replies; 73+ messages in thread From: Evgeniy Polyakov @ 2006-07-26 10:44 UTC (permalink / raw) To: lkml; +Cc: David Miller, Ulrich Drepper, netdev On Wed, Jul 26, 2006 at 01:18:15PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote: > +struct kevent *kevent_alloc(gfp_t mask) > +{ > + struct kevent *k; > + > + if (kevent_cache) > + k = kmem_cache_alloc(kevent_cache, mask); > + else > + k = kzalloc(sizeof(struct kevent), mask); > + > + return k; > +} > + Sorry for that. It is fixed already to always use the cache, but I forgot to commit that change before I created the patchset. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
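Presumably the committed fix just drops the kzalloc() fallback and keeps the cache path, reducing the helper to something like:

	/* Sketch of the fixed allocator described above: kevent_cache is
	 * always valid by the time allocations happen. */
	struct kevent *kevent_alloc(gfp_t mask)
	{
		return kmem_cache_alloc(kevent_cache, mask);
	}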
* Re: async network I/O, event channels, etc 2006-07-26 6:28 ` Evgeniy Polyakov 2006-07-26 9:18 ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov @ 2006-07-27 6:10 ` David Miller 2006-07-27 7:49 ` Evgeniy Polyakov 1 sibling, 1 reply; 73+ messages in thread From: David Miller @ 2006-07-27 6:10 UTC (permalink / raw) To: johnpol; +Cc: drepper, linux-kernel, netdev From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Wed, 26 Jul 2006 10:28:17 +0400 > I have not created additional DMA memory allocation methods, like > Ulrich described in his article, so I handle it inside NAIO which > has some overhead (I posted get_user_pages() scalability graph some > time ago). I've been thinking about this aspect, and I think it's very interesting. Let's be clear what the ramifications of this are first. Using the terminology of Network Algorithmics, this is an instance of Principle 2, "Shift computation in time". Instead of using get_user_pages() at AIO setup, we map the thing to userspace later, when the user wants it. Pinning pages is a pain because both user and kernel refer to the buffer at the same time. We get more flexibility when the user has to map the thing explicitly. I want us to think about how a user might want to use this. What I anticipate is that users will want to organize a pool of AIO buffers for themselves using this DMA interface. So the events they are truly interested in are of a finer granularity than you might expect. They want to know when pieces of a buffer are available for reuse. And here is the core dilemma. If you make the event granularity too coarse, a larger AIO buffer pool is necessary. If you make the event granularity too fine, event processing begins to dominate, and costs too much. This is true even for something as lightweight as kevent. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 6:10 ` async network I/O, event channels, etc David Miller @ 2006-07-27 7:49 ` Evgeniy Polyakov 2006-07-27 8:02 ` David Miller 0 siblings, 1 reply; 73+ messages in thread From: Evgeniy Polyakov @ 2006-07-27 7:49 UTC (permalink / raw) To: David Miller; +Cc: drepper, linux-kernel, netdev On Wed, Jul 26, 2006 at 11:10:55PM -0700, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Wed, 26 Jul 2006 10:28:17 +0400 > > > I have not created additional DMA memory allocation methods, like > > Ulrich described in his article, so I handle it inside NAIO which > > has some overhead (I posted get_user_pages() scalability graph some > > time ago). > > I've been thinking about this aspect, and I think it's very > interesting. Let's be clear what the ramifications of this > are first. > > Using the terminology of Network Algorithmics, this is an > instance of Principle 2, "Shift computation in time". > > Instead of using get_user_pages() at AIO setup, we map the > thing to userspace later, when the user wants it. Pinning pages is a > pain because both user and kernel refer to the buffer at the same > time. We get more flexibility when the user has to map the thing > explicitly. I.e., map skb's data to userspace? Not a good idea, especially with its tricky lifetime and the inability of userspace to inform the kernel when it has finished and the skb can be freed (without an additional syscall). I did it with the af_tlb zero-copy sniffer (but I substituted mapped pages with physical skb->data pages), and it was not very good. > I want us to think about how a user might want to use this. What > I anticipate is that users will want to organize a pool of AIO > buffers for themselves using this DMA interface. So the events > they are truly interested in are of a finer granularity than you > might expect. They want to know when pieces of a buffer are > available for reuse. Ah, I see. Well, I think preallocating some buffers and using them in AIO setup is a plus, since in that case the user does not need to care about when it is possible to reuse the same buffer - when the appropriate kevent is completed, that means that the provided buffer is no longer in use by the kernel, and the user can reuse it. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 7:49 ` Evgeniy Polyakov @ 2006-07-27 8:02 ` David Miller 2006-07-27 8:09 ` Jens Axboe 2006-07-27 8:58 ` Evgeniy Polyakov 0 siblings, 2 replies; 73+ messages in thread From: David Miller @ 2006-07-27 8:02 UTC (permalink / raw) To: johnpol; +Cc: drepper, linux-kernel, netdev From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Thu, 27 Jul 2006 11:49:02 +0400 > I.e., map skb's data to userspace? Not a good idea, especially with its > tricky lifetime and the inability of userspace to inform the kernel when it > has finished and the skb can be freed (without an additional syscall). Hmmm... If it is page based, I do not see the problem. Events and calls to AIO I/O routines make transfer of buffer ownership. The fact that while the kernel (and thus the networking stack) "owns" the buffer for an AIO call, the user can have a valid mapping to it is an unimportant detail. If the user scrambles a piece of data that is in flight to or from the network card, it is his problem. If we are using a primitive network card that does not support scatter-gather I/O and thus not page based SKBs, we will make copies. But this is transparent to the user. The idea is that DMA mappings have page granularity. At least on transmit it should work well. Receive side is more difficult and initial implementation will need to copy. > I did it with the af_tlb zero-copy sniffer (but I substituted mapped pages > with physical skb->data pages), and it was not very good. Trying to be too clever with skb->data has always been catastrophic. :) > Well, I think preallocating some buffers and using them in AIO setup is a > plus, since in that case the user does not need to care about when it is possible to > reuse the same buffer - when the appropriate kevent is completed, that means > that the provided buffer is no longer in use by the kernel, and the user can reuse > it. We now enter the most interesting topic of AIO buffer pool management and where it belongs. :-) We are assuming up to this point that the user manages this stuff with explicit DMA calls for allocation, then passes the key-based references to those buffers as arguments to the AIO I/O calls. But I want to suggest another possibility. What if the kernel managed the AIO buffer pool for a task? It could grow this dynamically based upon need. The only implementation roadblock is how large we allow this to grow, but I think normal VM mechanisms can take care of it. On transmit this is not straightforward, but for receive it has really nice possibilities. :) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:02 ` David Miller @ 2006-07-27 8:09 ` Jens Axboe 2006-07-27 8:11 ` Jens Axboe 0 siblings, 1 reply; 73+ messages in thread From: Jens Axboe @ 2006-07-27 8:09 UTC (permalink / raw) To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev On Thu, Jul 27 2006, David Miller wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Thu, 27 Jul 2006 11:49:02 +0400 > > > I.e., map skb's data to userspace? Not a good idea, especially with its > > tricky lifetime and the inability of userspace to inform the kernel when it > > has finished and the skb can be freed (without an additional syscall). > > Hmmm... > > If it is page based, I do not see the problem. Events and calls to > AIO I/O routines make transfer of buffer ownership. The fact that > while the kernel (and thus the networking stack) "owns" the buffer for an AIO > call, the user can have a valid mapping to it is an unimportant detail. Ownership may be clear, but "when can I reuse" is tricky. The same issue comes up for vmsplice -> splice to socket. -- Jens Axboe ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:09 ` Jens Axboe @ 2006-07-27 8:11 ` Jens Axboe 2006-07-27 8:20 ` David Miller 0 siblings, 1 reply; 73+ messages in thread From: Jens Axboe @ 2006-07-27 8:11 UTC (permalink / raw) To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev On Thu, Jul 27 2006, Jens Axboe wrote: > On Thu, Jul 27 2006, David Miller wrote: > > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > > Date: Thu, 27 Jul 2006 11:49:02 +0400 > > > > > I.e., map skb's data to userspace? Not a good idea, especially with its > > > tricky lifetime and the inability of userspace to inform the kernel when it > > > has finished and the skb can be freed (without an additional syscall). > > > > Hmmm... > > > > If it is page based, I do not see the problem. Events and calls to > > AIO I/O routines make transfer of buffer ownership. The fact that > > while the kernel (and thus the networking stack) "owns" the buffer for an AIO > > call, the user can have a valid mapping to it is an unimportant detail. > > Ownership may be clear, but "when can I reuse" is tricky. The same issue > comes up for vmsplice -> splice to socket. Ownership transition from user -> kernel, that is; what I'm trying to say is that returning ownership to the user again is the tricky part. -- Jens Axboe ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:11 ` Jens Axboe @ 2006-07-27 8:20 ` David Miller 2006-07-27 8:29 ` Jens Axboe 0 siblings, 1 reply; 73+ messages in thread From: David Miller @ 2006-07-27 8:20 UTC (permalink / raw) To: axboe; +Cc: johnpol, drepper, linux-kernel, netdev From: Jens Axboe <axboe@suse.de> Date: Thu, 27 Jul 2006 10:11:15 +0200 > Ownership transition from user -> kernel, that is; what I'm trying to say > is that returning ownership to the user again is the tricky part. Yes, it is important that for TCP, for example, we don't give the user the event until the data is acknowledged and the skb's referencing that data are fully freed. This is further complicated by the fact that packetization boundaries are going to be different from AIO buffer boundaries. I think this is what you are alluding to. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:20 ` David Miller @ 2006-07-27 8:29 ` Jens Axboe 2006-07-27 8:37 ` David Miller 0 siblings, 1 reply; 73+ messages in thread From: Jens Axboe @ 2006-07-27 8:29 UTC (permalink / raw) To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev On Thu, Jul 27 2006, David Miller wrote: > From: Jens Axboe <axboe@suse.de> > Date: Thu, 27 Jul 2006 10:11:15 +0200 > > > Ownership transition from user -> kernel, that is; what I'm trying to say > > is that returning ownership to the user again is the tricky part. > > Yes, it is important that for TCP, for example, we don't give > the user the event until the data is acknowledged and the skb's > referencing that data are fully freed. > > This is further complicated by the fact that packetization boundaries > are going to be different from AIO buffer boundaries. > > I think this is what you are alluding to. Precisely. And this is the bit that is currently still broken for splice-to-socket, since it gives that ack right after ->sendpage() has been called. But that's a known deficiency right now; I think Alexey is currently looking at that (as well as receive side support). -- Jens Axboe ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:29 ` Jens Axboe @ 2006-07-27 8:37 ` David Miller 2006-07-27 8:39 ` Jens Axboe 0 siblings, 1 reply; 73+ messages in thread From: David Miller @ 2006-07-27 8:37 UTC (permalink / raw) To: axboe; +Cc: johnpol, drepper, linux-kernel, netdev From: Jens Axboe <axboe@suse.de> Date: Thu, 27 Jul 2006 10:29:24 +0200 > Precisely. And this is the bit that is currently still broken for > splice-to-socket, since it gives that ack right after ->sendpage() has > been called. But that's a known deficiency right now; I think Alexey is > currently looking at that (as well as receive side support). That's right, I was discussing this with him just a few days ago. It's good to hear that he's looking at those patches you were working on several months ago. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:37 ` David Miller @ 2006-07-27 8:39 ` Jens Axboe 0 siblings, 0 replies; 73+ messages in thread From: Jens Axboe @ 2006-07-27 8:39 UTC (permalink / raw) To: David Miller; +Cc: johnpol, drepper, linux-kernel, netdev On Thu, Jul 27 2006, David Miller wrote: > From: Jens Axboe <axboe@suse.de> > Date: Thu, 27 Jul 2006 10:29:24 +0200 > > > Precisely. And this is the bit that is currently still broken for > > splice-to-socket, since it gives that ack right after ->sendpage() has > > been called. But that's a known deficiency right now; I think Alexey is > > currently looking at that (as well as receive side support). > > That's right, I was discussing this with him just a few days ago. > > It's good to hear that he's looking at those patches you were working > on several months ago. It is. I never ventured much into the networking part, just noted that as a current limitation with the ->sendpage() based approach. Basically we need to pass more info in, which also gets rid of the limitation of passing a single page at a time. -- Jens Axboe ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: async network I/O, event channels, etc 2006-07-27 8:02 ` David Miller 2006-07-27 8:09 ` Jens Axboe @ 2006-07-27 8:58 ` Evgeniy Polyakov 2006-07-27 9:31 ` David Miller 1 sibling, 1 reply; 73+ messages in thread From: Evgeniy Polyakov @ 2006-07-27 8:58 UTC (permalink / raw) To: David Miller; +Cc: drepper, linux-kernel, netdev On Thu, Jul 27, 2006 at 01:02:55AM -0700, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Thu, 27 Jul 2006 11:49:02 +0400 > > > I.e., map skb's data to userspace? Not a good idea, especially with its > > tricky lifetime and the inability of userspace to inform the kernel when it > > has finished and the skb can be freed (without an additional syscall). > > Hmmm... > > If it is page based, I do not see the problem. Events and calls to > AIO I/O routines make transfer of buffer ownership. The fact that > while the kernel (and thus the networking stack) "owns" the buffer for an AIO > call, the user can have a valid mapping to it is an unimportant detail. > > If the user scrambles a piece of data that is in flight to or from > the network card, it is his problem. > > If we are using a primitive network card that does not support > scatter-gather I/O and thus not page based SKBs, we will make > copies. But this is transparent to the user. > > The idea is that DMA mappings have page granularity. > > At least on transmit it should work well. Receive side is more > difficult and initial implementation will need to copy. And what if several skb->data are placed on the same page? Or do we want to allocate at least one page for one skb? Even if it is a 40-byte ack? > > I did it with the af_tlb zero-copy sniffer (but I substituted mapped pages > > with physical skb->data pages), and it was not very good. > > Trying to be too clever with skb->data has always been catastrophic. :) Yep :) > > Well, I think preallocating some buffers and using them in AIO setup is a > > plus, since in that case the user does not need to care about when it is possible to > > reuse the same buffer - when the appropriate kevent is completed, that means > > that the provided buffer is no longer in use by the kernel, and the user can reuse > > it. > > We now enter the most interesting topic of AIO buffer pool management > and where it belongs. :-) We are assuming up to this point that the > user manages this stuff with explicit DMA calls for allocation, then > passes the key-based references to those buffers as arguments to the > AIO I/O calls. > > But I want to suggest another possibility. What if the kernel managed > the AIO buffer pool for a task? It could grow this dynamically based > upon need. The only implementation roadblock is how large we > allow this to grow, but I think normal VM mechanisms can take care > of it. > > On transmit this is not straightforward, but for receive it has really > nice possibilities. :) Btw, regarding DMA allocations - there are some problems here too. Some pieces of the world cannot DMA beyond 16MB, and some can do it above 4GB. If only 16MB are used, that is just 8K packets with 1500 MTU, and actually userspace does not know which NIC receives its data, so it is impossible to allocate in advance some pool which will be used for DMA transfer; we just need to allocate physical pages and use them with memcpy() from skb->data. Those physical pages can be managed within the kernel and userspace can map them. But there is another possibility - replace slab allocation for network devices with allocation from a premapped pool. 
That naturally allows managing that pool for AIO needs and having zero-copy send and receive support. That is what I talked about in the netchannel thread when the question of allocation/freeing cost in atomic context arose. I am working on that solution, which can be used both for netchannels (and full userspace processing) and for the usual networking code. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
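A minimal sketch of the premapped-pool idea described above: pages are allocated once, can be mmap()ed by userspace, and network buffers are carved out of them instead of coming from the slab. All names here are illustrative assumptions, not the actual netchannel code; the locking is kept simple so the allocator stays usable from atomic context.

#include <linux/bitops.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct premapped_pool {
	struct page **pages;		/* pages backing the pool */
	unsigned int nr_pages;
	unsigned long *bitmap;		/* one bit per fixed-size buffer slot */
	unsigned int buf_size;		/* e.g. 2048 for a 1500-byte MTU */
	spinlock_t lock;
};

static void *pool_alloc(struct premapped_pool *p)
{
	unsigned int per_page = PAGE_SIZE / p->buf_size;
	unsigned int nr_slots = p->nr_pages * per_page;
	unsigned long flags, slot;
	void *buf = NULL;

	spin_lock_irqsave(&p->lock, flags);
	slot = find_first_zero_bit(p->bitmap, nr_slots);
	if (slot < nr_slots) {
		__set_bit(slot, p->bitmap);
		buf = page_address(p->pages[slot / per_page]) +
			(slot % per_page) * p->buf_size;
	}
	spin_unlock_irqrestore(&p->lock, flags);
	return buf;
}

Freeing is the mirror operation (clear the bit under the lock); because the backing pages never change, userspace can map the whole pool once and reuse buffers as the corresponding kevents complete.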
* Re: async network I/O, event channels, etc 2006-07-27 8:58 ` Evgeniy Polyakov @ 2006-07-27 9:31 ` David Miller 2006-07-27 9:37 ` Evgeniy Polyakov 0 siblings, 1 reply; 73+ messages in thread From: David Miller @ 2006-07-27 9:31 UTC (permalink / raw) To: johnpol; +Cc: drepper, linux-kernel, netdev From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Date: Thu, 27 Jul 2006 12:58:13 +0400 > Btw, regarding DMA allocations - there are some problems here too. > Some hardware cannot DMA beyond 16MB, while other hardware can DMA > beyond 4GB. I think people take this "DMA" in Ulrich's interface names too literally. It is logically something different, although it could be used directly for this purpose. View it rather as memory you hold under some key-based ID, but need to explicitly map in order to access directly. > Those physical pages can be managed within the kernel, and userspace can map > them. But there is another possibility - replace slab allocation for > network devices with allocation from a premapped pool. > That naturally allows managing that pool for AIO needs and having > zero-copy send and receive support. That is what I talked about in the > netchannel thread when the question of allocation/freeing cost in atomic > context arose. I am working on that solution, which can be used both for > netchannels (and full userspace processing) and for the usual networking code. Interesting idea, and yes, I have been watching you stress test your AVL tree code :)) ^ permalink raw reply [flat|nested] 73+ messages in thread
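As a loose userspace analogy for that key-based model (System V shared memory, not Ulrich's actual proposed API): a region is held under an ID and must be explicitly mapped before it can be touched directly.

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	/* The region exists under a key-based ID... */
	int id = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
	void *buf;

	if (id < 0)
		return 1;
	buf = shmat(id, NULL, 0);	/* ...and is mapped explicitly. */
	if (buf == (void *)-1)
		return 1;
	/* An AIO-style call would take "id"; local code uses "buf". */
	shmdt(buf);			/* unmap; the ID itself stays valid */
	return 0;
}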
* Re: async network I/O, event channels, etc 2006-07-27 9:31 ` David Miller @ 2006-07-27 9:37 ` Evgeniy Polyakov 0 siblings, 0 replies; 73+ messages in thread From: Evgeniy Polyakov @ 2006-07-27 9:37 UTC (permalink / raw) To: David Miller; +Cc: drepper, linux-kernel, netdev On Thu, Jul 27, 2006 at 02:31:56AM -0700, David Miller (davem@davemloft.net) wrote: > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> > Date: Thu, 27 Jul 2006 12:58:13 +0400 > > > Btw, regarding DMA allocations - there are some problems here too. > > Some hardware cannot DMA beyond 16MB, while other hardware can DMA > > beyond 4GB. > > I think people take this "DMA" in Ulrich's interface names too > literally. It is logically something different, although it could be > used directly for this purpose. > > View it rather as memory you hold under some key-based ID, but need to > explicitly map in order to access directly. I meant here that it is possible for Ulrich's DMA regions to be used as real DMA regions, and I showed that this is not a good idea. > > Those physical pages can be managed within the kernel, and userspace can map > > them. But there is another possibility - replace slab allocation for > > network devices with allocation from a premapped pool. > > That naturally allows managing that pool for AIO needs and having > > zero-copy send and receive support. That is what I talked about in the > > netchannel thread when the question of allocation/freeing cost in atomic > > context arose. I am working on that solution, which can be used both for > > netchannels (and full userspace processing) and for the usual networking code. > > Interesting idea, and yes, I have been watching you stress test your > AVL tree code :)) The tests are completed - it actually required 12 A4 pages filled with small circles and numbers to prove it correct; the overnight run was just for confirmation :) -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
* [1/1] Kevent subsystem. @ 2006-06-22 17:14 Evgeniy Polyakov 2006-06-23 7:09 ` [1/4] kevent: core files Evgeniy Polyakov 0 siblings, 1 reply; 73+ messages in thread From: Evgeniy Polyakov @ 2006-06-22 17:14 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 1157 bytes --] Hello. The kevent subsystem incorporates several AIO/kqueue design notes and ideas. Kevent can be used both for edge and level notifications. It supports socket notifications, network AIO (aio_send(), aio_recv() and aio_sendfile()), inode notifications (create/remove), generic poll()/select() notifications and timer notifications. It was tested against FreeBSD kqueue and Linux epoll and showed a noticeable performance win. Network asynchronous IO operations were tested against the Linux synchronous socket code and also showed a noticeable performance win. A patch against the linux-2.6.17-git tree is attached (gzipped). I would like to hear some comments about the overall design, the implementation and its potential usefulness for the generic kernel. Design notes, patches, a userspace application and performance tests can be found at the project's homepages. 1. Kevent subsystem. http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent 2. Network AIO. http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio 3. LWN article about kevent. http://lwn.net/Articles/172844/ Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Thank you. -- Evgeniy Polyakov [-- Attachment #2: kevent-2.6.17-git.diff.gz --] [-- Type: application/x-gunzip, Size: 24054 bytes --] ^ permalink raw reply [flat|nested] 73+ messages in thread
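To make the announced interfaces concrete, a hedged userspace sketch built on the structures and syscall numbers that appear in the core-files patch later in this thread (a struct kevent_user_control header followed by an array of struct ukevent; the i386 syscall numbers 317-320). The helper names and the omitted error handling are illustrative assumptions, not code from the posted tarball.

/* Illustrative only: relies on struct ukevent, struct kevent_user_control,
 * KEVENT_CTL_* and KEVENT_SOCKET_* exactly as defined by the patch below. */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/kevent.h>

#ifndef __NR_aio_recv
#define __NR_aio_recv   317	/* i386 numbers assigned by the patch */
#define __NR_kevent_ctl 320
#endif

/* One control header followed by one event, matching the kernel's
 * arg + sizeof(struct kevent_user_control) pointer arithmetic. */
static int watch_socket_recv(int kev_fd, int sock)
{
	unsigned char buf[sizeof(struct kevent_user_control) +
			  sizeof(struct ukevent)];
	struct kevent_user_control *ctl = (void *)buf;
	struct ukevent *ev = (void *)(buf + sizeof(*ctl));

	memset(buf, 0, sizeof(buf));
	ctl->cmd = KEVENT_CTL_ADD;
	ctl->num = 1;
	ev->id.raw[0] = sock;			/* object id: the socket fd */
	ev->type = KEVENT_SOCKET;
	ev->event = KEVENT_SOCKET_RECV;
	ev->req_flags = KEVENT_REQ_ONESHOT;

	return syscall(__NR_kevent_ctl, kev_fd, buf);
}

/* Queue an asynchronous receive; completion is reported through the
 * kevent queue behind kev_fd rather than by blocking here. */
static long queue_recv(int kev_fd, int sock, void *data, size_t size)
{
	return syscall(__NR_aio_recv, kev_fd, sock, data, size, 0);
}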
* [1/4] kevent: core files. 2006-06-22 17:14 [1/1] Kevent subsystem Evgeniy Polyakov @ 2006-06-23 7:09 ` Evgeniy Polyakov 2006-06-23 18:44 ` Benjamin LaHaise 0 siblings, 1 reply; 73+ messages in thread From: Evgeniy Polyakov @ 2006-06-23 7:09 UTC (permalink / raw) To: David Miller; +Cc: netdev This patch includes core kevent files: - userspace controlling - kernelspace interfaces - initialisation - notification state machines It might also include parts from other subsystems (like network-related syscalls), so it is possible that it will not compile without the other patches applied. Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index af56987..93e23ff 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -316,3 +316,7 @@ ENTRY(sys_call_table) .long sys_sync_file_range .long sys_tee /* 315 */ .long sys_vmsplice + .long sys_aio_recv + .long sys_aio_send + .long sys_aio_sendfile + .long sys_kevent_ctl diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5a92fed..534d516 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -696,4 +696,8 @@ #endif .quad sys_sync_file_range .quad sys_tee .quad compat_sys_vmsplice + .quad sys_aio_recv + .quad sys_aio_send + .quad sys_aio_sendfile + .quad sys_kevent_ctl ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index de2ccc1..52f8642 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -322,10 +322,14 @@ #define __NR_splice 313 #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 +#define __NR_aio_recv 317 +#define __NR_aio_send 318 +#define __NR_aio_sendfile 319 +#define __NR_kevent_ctl 320 #ifdef __KERNEL__ -#define NR_syscalls 317 +#define NR_syscalls 321 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/socket.h b/include/asm-x86_64/socket.h index f2cdbea..1f31f86 100644 --- a/include/asm-x86_64/socket.h +++ b/include/asm-x86_64/socket.h @@ -49,4 +49,6 @@ #define SO_ACCEPTCONN 30 #define SO_PEERSEC 31 +#define SO_ASYNC_SOCK 34 + #endif /* _ASM_SOCKET_H */ diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 0aff22b..352c34b 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -617,11 +617,18 @@ #define __NR_sync_file_range 277 __SYSCALL(__NR_sync_file_range, sys_sync_file_range) #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) +#define __NR_aio_recv 279 +__SYSCALL(__NR_aio_recv, sys_aio_recv) +#define __NR_aio_send 280 +__SYSCALL(__NR_aio_send, sys_aio_send) +#define __NR_aio_sendfile 281 +__SYSCALL(__NR_aio_sendfile, sys_aio_sendfile) +#define __NR_kevent_ctl 282 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_vmsplice - +#define __NR_syscall_max __NR_kevent_ctl #ifndef __NO_STUBS /* user-visible error numbers are in the range -1 - -4095 */ diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 0000000..e94a7bf --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,263 @@ +/* + * kevent.h + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H + +/* + * Kevent request flags. + */ + +#define KEVENT_REQ_ONESHOT 0x1 /* Process this event only once and then dequeue. */ + +/* + * Kevent return flags. + */ +#define KEVENT_RET_BROKEN 0x1 /* Kevent is broken. */ +#define KEVENT_RET_DONE 0x2 /* Kevent processing was finished successfully. */ + +/* + * Kevent type set. + */ +#define KEVENT_SOCKET 0 +#define KEVENT_INODE 1 +#define KEVENT_TIMER 2 +#define KEVENT_POLL 3 +#define KEVENT_NAIO 4 +#define KEVENT_AIO 5 +#define KEVENT_MAX 6 + +/* + * Per-type event sets. + * The number of per-type event sets should exactly match the number of kevent types. + */ + +/* + * Timer events. + */ +#define KEVENT_TIMER_FIRED 0x1 + +/* + * Socket/network asynchronous IO events. + */ +#define KEVENT_SOCKET_RECV 0x1 +#define KEVENT_SOCKET_ACCEPT 0x2 +#define KEVENT_SOCKET_SEND 0x4 + +/* + * Inode events. + */ +#define KEVENT_INODE_CREATE 0x1 +#define KEVENT_INODE_REMOVE 0x2 + +/* + * Poll events. + */ +#define KEVENT_POLL_POLLIN 0x0001 +#define KEVENT_POLL_POLLPRI 0x0002 +#define KEVENT_POLL_POLLOUT 0x0004 +#define KEVENT_POLL_POLLERR 0x0008 +#define KEVENT_POLL_POLLHUP 0x0010 +#define KEVENT_POLL_POLLNVAL 0x0020 + +#define KEVENT_POLL_POLLRDNORM 0x0040 +#define KEVENT_POLL_POLLRDBAND 0x0080 +#define KEVENT_POLL_POLLWRNORM 0x0100 +#define KEVENT_POLL_POLLWRBAND 0x0200 +#define KEVENT_POLL_POLLMSG 0x0400 +#define KEVENT_POLL_POLLREMOVE 0x1000 + +/* + * Asynchronous IO events. + */ +#define KEVENT_AIO_BIO 0x1 + +#define KEVENT_MASK_ALL 0xffffffff /* Mask of all possible event values. */ +#define KEVENT_MASK_EMPTY 0x0 /* Empty mask of ready events. */ + +struct kevent_id +{ + __u32 raw[2]; +}; + +struct ukevent +{ + struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */ + __u32 type; /* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */ + __u32 event; /* Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED... */ + __u32 req_flags; /* Per-event request flags */ + __u32 ret_flags; /* Per-event return flags */ + __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */ + union { + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */ + void *ptr; + }; +}; + +#define KEVENT_CTL_ADD 0 +#define KEVENT_CTL_REMOVE 1 +#define KEVENT_CTL_MODIFY 2 +#define KEVENT_CTL_WAIT 3 +#define KEVENT_CTL_INIT 4 + +struct kevent_user_control +{ + unsigned int cmd; /* Control command, e.g. KEVENT_CTL_ADD, KEVENT_CTL_REMOVE... */ + unsigned int num; /* Number of ukevents this structure controls. */ + unsigned int timeout; /* Timeout in milliseconds waiting for "num" events to become ready. 
*/ +}; + +#define KEVENT_USER_SYMBOL 'K' +#define KEVENT_USER_CTL _IOWR(KEVENT_USER_SYMBOL, 0, struct kevent_user_control) +#define KEVENT_USER_WAIT _IOWR(KEVENT_USER_SYMBOL, 1, struct kevent_user_control) + +#ifdef __KERNEL__ + +#include <linux/types.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/kevent_storage.h> +#include <asm/semaphore.h> + +struct inode; +struct dentry; +struct sock; + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +struct kevent +{ + struct ukevent event; + spinlock_t lock; /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + + struct list_head kevent_entry; /* Entry of user's queue. */ + struct list_head storage_entry; /* Entry of origin's queue. */ + struct list_head ready_entry; /* Entry of user's ready. */ + + struct kevent_user *user; /* User who requested this kevent. */ + struct kevent_storage *st; /* Kevent container. */ + + kevent_callback_t callback; /* Is called each time new event has been caught. */ + kevent_callback_t enqueue; /* Is called each time new event is queued. */ + kevent_callback_t dequeue; /* Is called each time event is dequeued. */ + + void *priv; /* Private data for different storages. + * poll()/select storage has a list of wait_queue_t containers + * for each ->poll() { poll_wait()' } here. + */ +}; + +#define KEVENT_HASH_MASK 0xff + +struct kevent_list +{ + struct list_head kevent_list; /* List of all kevents. */ + spinlock_t kevent_lock; /* Protects all manipulations with queue of kevents. */ +}; + +struct kevent_user +{ + struct kevent_list kqueue[KEVENT_HASH_MASK+1]; + unsigned int kevent_num; /* Number of queued kevents. */ + + struct list_head ready_list; /* List of ready kevents. */ + unsigned int ready_num; /* Number of ready kevents. */ + spinlock_t ready_lock; /* Protects all manipulations with ready queue. */ + + unsigned int max_ready_num; /* Requested number of kevents. */ + + struct semaphore ctl_mutex; /* Protects against simultaneous kevent_user control manipulations. */ + struct semaphore wait_mutex; /* Protects against simultaneous kevent_user waits. */ + wait_queue_head_t wait; /* Wait until some events are ready. */ + + atomic_t refcnt; /* Reference counter, increased for each new kevent. 
*/ +#ifdef CONFIG_KEVENT_USER_STAT + unsigned long im_num; + unsigned long wait_num; + unsigned long total; +#endif +}; + +#define KEVENT_MAX_REQUESTS PAGE_SIZE/sizeof(struct kevent) + +struct kevent *kevent_alloc(gfp_t mask); +void kevent_free(struct kevent *k); +int kevent_enqueue(struct kevent *k); +int kevent_dequeue(struct kevent *k); +int kevent_init(struct kevent *k); +void kevent_requeue(struct kevent *k); + +#define list_for_each_entry_reverse_safe(pos, n, head, member) \ + for (pos = list_entry((head)->prev, typeof(*pos), member), \ + n = list_entry(pos->member.prev, typeof(*pos), member); \ + prefetch(pos->member.prev), &pos->member != (head); \ + pos = n, n = list_entry(pos->member.prev, typeof(*pos), member)) + +int kevent_break(struct kevent *k); +int kevent_init(struct kevent *k); + +int kevent_init_socket(struct kevent *k); +int kevent_init_inode(struct kevent *k); +int kevent_init_timer(struct kevent *k); +int kevent_init_poll(struct kevent *k); +int kevent_init_naio(struct kevent *k); +int kevent_init_aio(struct kevent *k); + +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event); +int kevent_storage_init(void *origin, struct kevent_storage *st); +void kevent_storage_fini(struct kevent_storage *st); +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k); +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k); + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u); + +#ifdef CONFIG_KEVENT_INODE +void kevent_inode_notify(struct inode *inode, u32 event); +void kevent_inode_notify_parent(struct dentry *dentry, u32 event); +void kevent_inode_remove(struct inode *inode); +#else +static inline void kevent_inode_notify(struct inode *inode, u32 event) +{ +} +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event) +{ +} +static inline void kevent_inode_remove(struct inode *inode) +{ +} +#endif /* CONFIG_KEVENT_INODE */ +#ifdef CONFIG_KEVENT_SOCKET + +void kevent_socket_notify(struct sock *sock, u32 event); +int kevent_socket_dequeue(struct kevent *k); +int kevent_socket_enqueue(struct kevent *k); +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC) +#else +static inline void kevent_socket_notify(struct sock *sock, u32 event) +{ +} +#define sock_async(__sk) 0 +#endif +#endif /* __KERNEL__ */ +#endif /* __KEVENT_H */ diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h new file mode 100644 index 0000000..bd891f0 --- /dev/null +++ b/include/linux/kevent_storage.h @@ -0,0 +1,12 @@ +#ifndef __KEVENT_STORAGE_H +#define __KEVENT_STORAGE_H + +struct kevent_storage +{ + void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */ + struct list_head list; /* List of queued kevents. */ + unsigned int qlen; /* Number of queued kevents. */ + spinlock_t lock; /* Protects users queue. 
*/ +}; + +#endif /* __KEVENT_STORAGE_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index bd67a44..33d436e 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -587,4 +587,8 @@ asmlinkage long sys_get_robust_list(int asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); +asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags); +asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags); +asmlinkage long sys_aio_sendfile(int ctl_fd, int fd, int s, size_t size, unsigned flags); +asmlinkage long sys_kevent_ctl(int ctl_fd, void __user *buf); #endif diff --git a/init/Kconfig b/init/Kconfig index df864a3..6135afc 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -185,6 +185,8 @@ config AUDITSYSCALL such as SELinux. To use audit's filesystem watch feature, please ensure that INOTIFY is configured. +source "kernel/kevent/Kconfig" + config IKCONFIG bool "Kernel .config support" ---help--- diff --git a/kernel/Makefile b/kernel/Makefile index f6ef00f..eb057ea 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -36,6 +36,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_KEVENT) += kevent/ obj-$(CONFIG_RELAY) += relay.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig new file mode 100644 index 0000000..88b35af --- /dev/null +++ b/kernel/kevent/Kconfig @@ -0,0 +1,57 @@ +config KEVENT + bool "Kernel event notification mechanism" + help + This option enables event queue mechanism. + It can be used as replacement for poll()/select(), AIO callback invocations, + advanced timer notifications and other kernel object status changes. + +config KEVENT_USER_STAT + bool "Kevent user statistic" + depends on KEVENT + default N + help + This option will turn kevent_user statistic collection on. + Statistic data includes total number of kevent, number of kevents which are ready + immediately at insertion time and number of kevents which were removed through + readiness completion. It will be printed each time control kevent descriptor + is closed. + +config KEVENT_SOCKET + bool "Kernel event notifications for sockets" + depends on NET && KEVENT + help + This option enables notifications through KEVENT subsystem of + sockets operations, like new packet receiving conditions, ready for accept + conditions and so on. + +config KEVENT_INODE + bool "Kernel event notifications for inodes" + depends on KEVENT + help + This option enables notifications through KEVENT subsystem of + inode operations, like file creation, removal and so on. + +config KEVENT_TIMER + bool "Kernel event notifications for timers" + depends on KEVENT + help + This option allows to use timers through KEVENT subsystem. + +config KEVENT_POLL + bool "Kernel event notifications for poll()/select()" + depends on KEVENT + help + This option allows to use kevent subsystem for poll()/select() notifications. + +config KEVENT_NAIO + bool "Network asynchronous IO" + depends on KEVENT && KEVENT_SOCKET + help + This option enables kevent based network asynchronous IO subsystem. + +config KEVENT_AIO + bool "Asynchronous IO" + depends on KEVENT + help + This option allows to use kevent subsystem for AIO operations. + AIO read is currently supported. 
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile new file mode 100644 index 0000000..7dcd651 --- /dev/null +++ b/kernel/kevent/Makefile @@ -0,0 +1,7 @@ +obj-y := kevent.o kevent_user.o kevent_init.o +obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o +obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o +obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o +obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o +obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o +obj-$(CONFIG_KEVENT_AIO) += kevent_aio.o diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c new file mode 100644 index 0000000..f699a13 --- /dev/null +++ b/kernel/kevent/kevent.c @@ -0,0 +1,260 @@ +/* + * kevent.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/mempool.h> +#include <linux/sched.h> +#include <linux/wait.h> +#include <linux/kevent.h> + +static kmem_cache_t *kevent_cache; + +/* + * Attempts to add an event into appropriate origin's queue. + * Returns positive value if this event is ready immediately, + * negative value in case of error and zero if event has been queued. + * ->enqueue() callback must increase origin's reference counter. + */ +int kevent_enqueue(struct kevent *k) +{ + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + if (!k->enqueue) { + kevent_break(k); + return -EINVAL; + } + + return k->enqueue(k); +} + +/* + * Remove event from the appropriate queue. + * ->dequeue() callback must decrease origin's reference counter. + */ +int kevent_dequeue(struct kevent *k) +{ + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + if (!k->dequeue) { + kevent_break(k); + return -EINVAL; + } + + return k->dequeue(k); +} + +/* + * Must be called before event is going to be added into some origin's queue. + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks. + * If failed, kevent should not be used or kevent_enqueue() will fail to add + * this kevent into origin's queue with setting + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags. 
+ */ +int kevent_init(struct kevent *k) +{ + int err; + + spin_lock_init(&k->lock); + k->kevent_entry.next = LIST_POISON1; + k->storage_entry.next = LIST_POISON1; + k->ready_entry.next = LIST_POISON1; + + if (k->event.type >= KEVENT_MAX) + return -E2BIG; + + switch (k->event.type) { + case KEVENT_NAIO: + err = kevent_init_naio(k); + break; + case KEVENT_SOCKET: + err = kevent_init_socket(k); + break; + case KEVENT_INODE: + err = kevent_init_inode(k); + break; + case KEVENT_TIMER: + err = kevent_init_timer(k); + break; + case KEVENT_POLL: + err = kevent_init_poll(k); + break; + case KEVENT_AIO: + err = kevent_init_aio(k); + break; + default: + err = -ENODEV; + } + + return err; +} + +/* + * Called from ->enqueue() callback when reference counter for given + * origin (socket, inode...) has been increased. + */ +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + k->st = st; + spin_lock_irqsave(&st->lock, flags); + list_add_tail(&k->storage_entry, &st->list); + st->qlen++; + spin_unlock_irqrestore(&st->lock, flags); + return 0; +} + +/* + * Dequeue kevent from origin's queue. + * It does not decrease origin's reference counter in any way + * and must be called before it, so storage itself must be valid. + * It is called from ->dequeue() callback. + */ +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&st->lock, flags); + if (k->storage_entry.next != LIST_POISON1) { + list_del(&k->storage_entry); + st->qlen--; + } + spin_unlock_irqrestore(&st->lock, flags); +} + +static void __kevent_requeue(struct kevent *k, u32 event) +{ + int err, rem = 0; + unsigned long flags; + + err = k->callback(k); + + spin_lock_irqsave(&k->lock, flags); + if (err > 0) { + k->event.ret_flags |= KEVENT_RET_DONE; + } else if (err < 0) { + k->event.ret_flags |= KEVENT_RET_BROKEN; + k->event.ret_flags |= KEVENT_RET_DONE; + } + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT); + if (!err) + err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE)); + spin_unlock_irqrestore(&k->lock, flags); + + if (err) { + if (rem) { + list_del(&k->storage_entry); + k->st->qlen--; + } + + spin_lock_irqsave(&k->user->ready_lock, flags); + if (k->ready_entry.next == LIST_POISON1) { + list_add_tail(&k->ready_entry, &k->user->ready_list); + k->user->ready_num++; + } + spin_unlock_irqrestore(&k->user->ready_lock, flags); + wake_up(&k->user->wait); + } +} + +void kevent_requeue(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->st->lock, flags); + __kevent_requeue(k, 0); + spin_unlock_irqrestore(&k->st->lock, flags); +} + +/* + * Called each time some activity in origin (socket, inode...) is noticed. 
+ */ +void kevent_storage_ready(struct kevent_storage *st, + kevent_callback_t ready_callback, u32 event) +{ + struct kevent *k, *n; + + spin_lock(&st->lock); + list_for_each_entry_safe(k, n, &st->list, storage_entry) { + if (ready_callback) + ready_callback(k); + + if (event & k->event.event) + __kevent_requeue(k, event); + } + spin_unlock(&st->lock); +} + +int kevent_storage_init(void *origin, struct kevent_storage *st) +{ + spin_lock_init(&st->lock); + st->origin = origin; + st->qlen = 0; + INIT_LIST_HEAD(&st->list); + return 0; +} + +void kevent_storage_fini(struct kevent_storage *st) +{ + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL); +} + +struct kevent *kevent_alloc(gfp_t mask) +{ + struct kevent *k; + + if (kevent_cache) + k = kmem_cache_alloc(kevent_cache, mask); + else + k = kzalloc(sizeof(struct kevent), mask); + + return k; +} + +void kevent_free(struct kevent *k) +{ + memset(k, 0xab, sizeof(struct kevent)); + + if (kevent_cache) + kmem_cache_free(kevent_cache, k); + else + kfree(k); +} + +int __init kevent_sys_init(void) +{ + int err = 0; + + kevent_cache = kmem_cache_create("kevent_cache", + sizeof(struct kevent), 0, 0, NULL, NULL); + if (!kevent_cache) + err = -ENOMEM; + + return err; +} + +late_initcall(kevent_sys_init); diff --git a/kernel/kevent/kevent_init.c b/kernel/kevent/kevent_init.c new file mode 100644 index 0000000..ec95114 --- /dev/null +++ b/kernel/kevent/kevent_init.c @@ -0,0 +1,85 @@ +/* + * kevent_init.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/errno.h> +#include <linux/kevent.h> + +int kevent_break(struct kevent *k) +{ + unsigned long flags; + + spin_lock_irqsave(&k->lock, flags); + k->event.ret_flags |= KEVENT_RET_BROKEN; + spin_unlock_irqrestore(&k->lock, flags); + return 0; +} + +#ifndef CONFIG_KEVENT_SOCKET +int kevent_init_socket(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_INODE +int kevent_init_inode(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_TIMER +int kevent_init_timer(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_POLL +int kevent_init_poll(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_NAIO +int kevent_init_naio(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif + +#ifndef CONFIG_KEVENT_AIO +int kevent_init_aio(struct kevent *k) +{ + kevent_break(k); + return -ENODEV; +} +#endif diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c new file mode 100644 index 0000000..566b62b --- /dev/null +++ b/kernel/kevent/kevent_user.c @@ -0,0 +1,728 @@ +/* + * kevent_user.c + * + * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/types.h> +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/spinlock.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mount.h> +#include <linux/device.h> +#include <linux/poll.h> +#include <linux/kevent.h> +#include <linux/jhash.h> +#include <asm/uaccess.h> +#include <asm/semaphore.h> + +static struct class *kevent_user_class; +static char kevent_name[] = "kevent"; +static int kevent_user_major; + +static int kevent_user_open(struct inode *, struct file *); +static int kevent_user_release(struct inode *, struct file *); +static int kevent_user_ioctl(struct inode *, struct file *, + unsigned int, unsigned long); +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *); + +static struct file_operations kevent_user_fops = { + .open = kevent_user_open, + .release = kevent_user_release, + .ioctl = kevent_user_ioctl, + .poll = kevent_user_poll, + .owner = THIS_MODULE, +}; + +static struct super_block *kevent_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + /* So original magic... 
*/ + return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef); +} + +static struct file_system_type kevent_fs_type = { + .name = kevent_name, + .get_sb = kevent_get_sb, + .kill_sb = kill_anon_super, +}; + +static struct vfsmount *kevent_mnt; + +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait) +{ + struct kevent_user *u = file->private_data; + unsigned int mask; + + poll_wait(file, &u->wait, wait); + mask = 0; + + if (u->ready_num) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static struct kevent_user *kevent_user_alloc(void) +{ + struct kevent_user *u; + int i; + + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL); + if (!u) + return NULL; + + INIT_LIST_HEAD(&u->ready_list); + spin_lock_init(&u->ready_lock); + u->ready_num = 0; +#ifdef CONFIG_KEVENT_USER_STAT + u->wait_num = u->im_num = u->total = 0; +#endif + for (i=0; i<KEVENT_HASH_MASK+1; ++i) { + INIT_LIST_HEAD(&u->kqueue[i].kevent_list); + spin_lock_init(&u->kqueue[i].kevent_lock); + } + u->kevent_num = 0; + + init_MUTEX(&u->ctl_mutex); + init_MUTEX(&u->wait_mutex); + init_waitqueue_head(&u->wait); + u->max_ready_num = 0; + + atomic_set(&u->refcnt, 1); + + return u; +} + +static int kevent_user_open(struct inode *inode, struct file *file) +{ + struct kevent_user *u = kevent_user_alloc(); + + if (!u) + return -ENOMEM; + + file->private_data = u; + + return 0; +} + +static inline void kevent_user_get(struct kevent_user *u) +{ + atomic_inc(&u->refcnt); +} + +static inline void kevent_user_put(struct kevent_user *u) +{ + if (atomic_dec_and_test(&u->refcnt)) { +#ifdef CONFIG_KEVENT_USER_STAT + printk("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n", + __func__, u, u->wait_num, u->im_num, u->total); +#endif + kfree(u); + } +} + +#if 0 +static inline unsigned int kevent_user_hash(struct ukevent *uk) +{ + unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]); + + h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff; + h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK; + + return h; +} +#else +static inline unsigned int kevent_user_hash(struct ukevent *uk) +{ + return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK; +} +#endif + +/* + * Remove kevent from user's list of all events, + * dequeue it from storage and decrease user's reference counter, + * since this kevent does not exist anymore. That is why it is freed here. + */ +static void kevent_finish_user(struct kevent *k, int lock, int deq) +{ + struct kevent_user *u = k->user; + unsigned long flags; + + if (lock) { + unsigned int hash = kevent_user_hash(&k->event); + struct kevent_list *l = &u->kqueue[hash]; + + spin_lock_irqsave(&l->kevent_lock, flags); + list_del(&k->kevent_entry); + u->kevent_num--; + spin_unlock_irqrestore(&l->kevent_lock, flags); + } else { + list_del(&k->kevent_entry); + u->kevent_num--; + } + + if (deq) + kevent_dequeue(k); + + spin_lock_irqsave(&u->ready_lock, flags); + if (k->ready_entry.next != LIST_POISON1) { + list_del(&k->ready_entry); + u->ready_num--; + } + spin_unlock_irqrestore(&u->ready_lock, flags); + + kevent_user_put(u); + kevent_free(k); +} + +/* + * Dequeue one entry from user's ready queue. 
+ */ +static struct kevent *__kqueue_dequeue_one_ready(struct list_head *q, + unsigned int *qlen) +{ + struct kevent *k = NULL; + unsigned int len = *qlen; + + if (len && !list_empty(q)) { + k = list_entry(q->next, struct kevent, ready_entry); + list_del(&k->ready_entry); + *qlen = len - 1; + } + + return k; +} + +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u) +{ + unsigned long flags; + struct kevent *k; + + spin_lock_irqsave(&u->ready_lock, flags); + k = __kqueue_dequeue_one_ready(&u->ready_list, &u->ready_num); + spin_unlock_irqrestore(&u->ready_lock, flags); + + return k; +} + +static struct kevent *__kevent_search(struct kevent_list *l, struct ukevent *uk, + struct kevent_user *u) +{ + struct kevent *k; + int found = 0; + + list_for_each_entry(k, &l->kevent_list, kevent_entry) { + spin_lock(&k->lock); + if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] && + k->event.id.raw[0] == uk->id.raw[0] && + k->event.id.raw[1] == uk->id.raw[1]) { + found = 1; + spin_unlock(&k->lock); + break; + } + spin_unlock(&k->lock); + } + + return (found)?k:NULL; +} + +static int kevent_modify(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + unsigned int hash = kevent_user_hash(uk); + struct kevent_list *l = &u->kqueue[hash]; + int err = -ENODEV; + unsigned long flags; + + spin_lock_irqsave(&l->kevent_lock, flags); + k = __kevent_search(l, uk, u); + if (k) { + spin_lock(&k->lock); + k->event.event = uk->event; + k->event.req_flags = uk->req_flags; + k->event.ret_flags = 0; + spin_unlock(&k->lock); + kevent_requeue(k); + err = 0; + } + spin_unlock_irqrestore(&l->kevent_lock, flags); + + return err; +} + +static int kevent_remove(struct ukevent *uk, struct kevent_user *u) +{ + int err = -ENODEV; + struct kevent *k; + unsigned int hash = kevent_user_hash(uk); + struct kevent_list *l = &u->kqueue[hash]; + unsigned long flags; + + spin_lock_irqsave(&l->kevent_lock, flags); + k = __kevent_search(l, uk, u); + if (k) { + kevent_finish_user(k, 0, 1); + err = 0; + } + spin_unlock_irqrestore(&l->kevent_lock, flags); + + return err; +} + +/* + * No new entry can be added or removed from any list at this point. + * It is not permitted to call ->ioctl() and ->release() in parallel. 
+ */ +static int kevent_user_release(struct inode *inode, struct file *file) +{ + struct kevent_user *u = file->private_data; + struct kevent *k, *n; + int i; + + for (i=0; i<KEVENT_HASH_MASK+1; ++i) { + struct kevent_list *l = &u->kqueue[i]; + + list_for_each_entry_safe(k, n, &l->kevent_list, kevent_entry) + kevent_finish_user(k, 1, 1); + } + + kevent_user_put(u); + file->private_data = NULL; + + return 0; +} + +static int kevent_user_ctl_modify(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + if (kevent_modify(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + arg += sizeof(struct ukevent); + } + + up(&u->ctl_mutex); + + return err; +} + +static int kevent_user_ctl_remove(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, i; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + if (kevent_remove(&uk, u)) + uk.ret_flags |= KEVENT_RET_BROKEN; + + uk.ret_flags |= KEVENT_RET_DONE; + + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) { + err = -EINVAL; + break; + } + + arg += sizeof(struct ukevent); + } + + up(&u->ctl_mutex); + + return err; +} + +int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u) +{ + struct kevent *k; + int err; + + k = kevent_alloc(GFP_KERNEL); + if (!k) { + err = -ENOMEM; + goto err_out_exit; + } + + memcpy(&k->event, uk, sizeof(struct ukevent)); + + k->event.ret_flags = 0; + + err = kevent_init(k); + if (err) { + kevent_free(k); + goto err_out_exit; + } + k->user = u; +#ifdef CONFIG_KEVENT_USER_STAT + u->total++; +#endif + { + unsigned long flags; + unsigned int hash = kevent_user_hash(&k->event); + struct kevent_list *l = &u->kqueue[hash]; + + spin_lock_irqsave(&l->kevent_lock, flags); + list_add_tail(&k->kevent_entry, &l->kevent_list); + u->kevent_num++; + kevent_user_get(u); + spin_unlock_irqrestore(&l->kevent_lock, flags); + } + + err = kevent_enqueue(k); + if (err) { + memcpy(uk, &k->event, sizeof(struct ukevent)); + if (err < 0) + uk->ret_flags |= KEVENT_RET_BROKEN; + uk->ret_flags |= KEVENT_RET_DONE; + kevent_finish_user(k, 1, 0); + } + +err_out_exit: + return err; +} + +/* + * Copy all ukevents from userspace, allocate kevent for each one + * and add them into appropriate kevent_storages, + * e.g. sockets, inodes and so on... + * If something goes wrong, all events will be dequeued and + * negative error will be returned. + * On success zero is returned and + * ctl->num will be a number of finished events, either completed or failed. + * Array of finished events (struct ukevent) will be placed behind + * kevent_user_control structure. User must run through that array and check + * ret_flags field of each ukevent structure to determine if it is fired or failed event. 
+ */ +static int kevent_user_ctl_add(struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + int err = 0, cerr = 0, num = 0, knum = 0, i; + void __user *orig, *ctl_addr; + struct ukevent uk; + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + orig = arg; + ctl_addr = arg - sizeof(struct kevent_user_control); +#if 1 + err = -ENFILE; + if (u->kevent_num + ctl->num >= 1024) + goto err_out_remove; +#endif + for (i=0; i<ctl->num; ++i) { + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) { + cerr = -EINVAL; + break; + } + arg += sizeof(struct ukevent); + + err = kevent_user_add_ukevent(&uk, u); + if (err) { +#ifdef CONFIG_KEVENT_USER_STAT + u->im_num++; +#endif + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) + cerr = -EINVAL; + orig += sizeof(struct ukevent); + num++; + } else + knum++; + } + + if (cerr < 0) + goto err_out_remove; + + ctl->num = num; + if (copy_to_user(ctl_addr, ctl, sizeof(struct kevent_user_control))) + cerr = -EINVAL; + + if (cerr) + err = cerr; + if (!err) + err = num; + +err_out_remove: + up(&u->ctl_mutex); + + return err; +} + +/* + * Waits until at least ctl->ready_num events are ready or timeout and returns + * number of ready events (in case of timeout) or number of requested events. + */ +static int kevent_user_wait(struct file *file, struct kevent_user *u, + struct kevent_user_control *ctl, void __user *arg) +{ + struct kevent *k; + int cerr = 0, num = 0; + void __user *ptr = arg + sizeof(struct kevent_user_control); + + if (down_interruptible(&u->ctl_mutex)) + return -ERESTARTSYS; + + if (!(file->f_flags & O_NONBLOCK)) { + if (ctl->timeout) + wait_event_interruptible_timeout(u->wait, + u->ready_num >= ctl->num, msecs_to_jiffies(ctl->timeout)); + else + wait_event_interruptible_timeout(u->wait, + u->ready_num > 0, msecs_to_jiffies(1000)); + } + while (num < ctl->num && ((k = kqueue_dequeue_ready(u)) != NULL)) { + if (copy_to_user(ptr + num*sizeof(struct ukevent), + &k->event, sizeof(struct ukevent))) + cerr = -EINVAL; + + /* + * If it is one-shot kevent, it has been removed already from + * origin's queue, so we can easily free it here. 
+ */ + if (k->event.req_flags & KEVENT_REQ_ONESHOT) + kevent_finish_user(k, 1, 1); + ++num; +#ifdef CONFIG_KEVENT_USER_STAT + u->wait_num++; +#endif + } + + ctl->num = num; + if (copy_to_user(arg, ctl, sizeof(struct kevent_user_control))) + cerr = -EINVAL; + + up(&u->ctl_mutex); + + return (cerr)?cerr:num; +} + +static int kevent_ctl_init(void) +{ + struct kevent_user *u; + struct file *file; + int fd, ret; + + fd = get_unused_fd(); + if (fd < 0) + return fd; + + file = get_empty_filp(); + if (!file) { + ret = -ENFILE; + goto out_put_fd; + } + + u = kevent_user_alloc(); + if (unlikely(!u)) { + ret = -ENOMEM; + goto out_put_file; + } + + file->f_op = &kevent_user_fops; + file->f_vfsmnt = mntget(kevent_mnt); + file->f_dentry = dget(kevent_mnt->mnt_root); + file->f_mapping = file->f_dentry->d_inode->i_mapping; + file->f_mode = FMODE_READ; + file->f_flags = O_RDONLY; + file->private_data = u; + + fd_install(fd, file); + + return fd; + +out_put_file: + put_filp(file); +out_put_fd: + put_unused_fd(fd); + return ret; +} + +static int kevent_ctl_process(struct file *file, + struct kevent_user_control *ctl, void __user *arg) +{ + int err; + struct kevent_user *u = file->private_data; + + if (!u) + return -EINVAL; + + switch (ctl->cmd) { + case KEVENT_CTL_ADD: + err = kevent_user_ctl_add(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_REMOVE: + err = kevent_user_ctl_remove(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_MODIFY: + err = kevent_user_ctl_modify(u, ctl, + arg+sizeof(struct kevent_user_control)); + break; + case KEVENT_CTL_WAIT: + err = kevent_user_wait(file, u, ctl, arg); + break; + case KEVENT_CTL_INIT: + err = kevent_ctl_init(); + break; + default: + err = -EINVAL; + break; + } + + return err; +} + +asmlinkage long sys_kevent_ctl(int fd, void __user *arg) +{ + int err, fput_needed; + struct kevent_user_control ctl; + struct file *file; + + if (copy_from_user(&ctl, arg, sizeof(struct kevent_user_control))) + return -EINVAL; + + if (ctl.cmd == KEVENT_CTL_INIT) + return kevent_ctl_init(); + + file = fget_light(fd, &fput_needed); + if (!file) + return -ENODEV; + + err = kevent_ctl_process(file, &ctl, arg); + + fput_light(file, fput_needed); + return err; +} + +static int kevent_user_ioctl(struct inode *inode, struct file *file, + unsigned int cmd, unsigned long arg) +{ + int err = -ENODEV; + struct kevent_user_control ctl; + struct kevent_user *u = file->private_data; + void __user *ptr = (void __user *)arg; + + if (copy_from_user(&ctl, ptr, sizeof(struct kevent_user_control))) + return -EINVAL; + + switch (cmd) { + case KEVENT_USER_CTL: + err = kevent_ctl_process(file, &ctl, ptr); + break; + case KEVENT_USER_WAIT: + err = kevent_user_wait(file, u, &ctl, ptr); + break; + default: + break; + } + + return err; +} + +static int __devinit kevent_user_init(void) +{ + struct class_device *dev; + int err = 0; + + err = register_filesystem(&kevent_fs_type); + if (err) + panic("%s: failed to register filesystem: err=%d.\n", + kevent_name, err); + + kevent_mnt = kern_mount(&kevent_fs_type); + if (IS_ERR(kevent_mnt)) + panic("%s: failed to mount filesystem: err=%ld.\n", + kevent_name, PTR_ERR(kevent_mnt)); + + kevent_user_major = register_chrdev(0, kevent_name, &kevent_user_fops); + if (kevent_user_major < 0) { + printk(KERN_ERR "Failed to register \"%s\" char device: err=%d.\n", + kevent_name, kevent_user_major); + return -ENODEV; + } + + kevent_user_class = class_create(THIS_MODULE, "kevent"); + if (IS_ERR(kevent_user_class)) { + printk(KERN_ERR 
"Failed to register \"%s\" class: err=%ld.\n", + kevent_name, PTR_ERR(kevent_user_class)); + err = PTR_ERR(kevent_user_class); + goto err_out_unregister; + } + + dev = class_device_create(kevent_user_class, NULL, + MKDEV(kevent_user_major, 0), NULL, kevent_name); + if (IS_ERR(dev)) { + printk(KERN_ERR "Failed to create %d.%d class device in \"%s\" class: err=%ld.\n", + kevent_user_major, 0, kevent_name, PTR_ERR(dev)); + err = PTR_ERR(dev); + goto err_out_class_destroy; + } + + printk("KEVENT subsystem: chardev helper: major=%d.\n", kevent_user_major); + + return 0; + +err_out_class_destroy: + class_destroy(kevent_user_class); +err_out_unregister: + unregister_chrdev(kevent_user_major, kevent_name); + + return err; +} + +static void __devexit kevent_user_fini(void) +{ + class_device_destroy(kevent_user_class, MKDEV(kevent_user_major, 0)); + class_destroy(kevent_user_class); + unregister_chrdev(kevent_user_major, kevent_name); + mntput(kevent_mnt); + unregister_filesystem(&kevent_fs_type); +} + +module_init(kevent_user_init); +module_exit(kevent_user_fini); diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 5433195..dcbacf5 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -121,6 +121,11 @@ cond_syscall(ppc_rtas); cond_syscall(sys_spu_run); cond_syscall(sys_spu_create); +cond_syscall(sys_aio_recv); +cond_syscall(sys_aio_send); +cond_syscall(sys_aio_sendfile); +cond_syscall(sys_kevent_ctl); + /* mmu depending weak syscall entries */ cond_syscall(sys_mprotect); cond_syscall(sys_msync); -- Evgeniy Polyakov ^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-06-23 7:09 ` [1/4] kevent: core files Evgeniy Polyakov @ 2006-06-23 18:44 ` Benjamin LaHaise 2006-06-23 19:24 ` Evgeniy Polyakov 0 siblings, 1 reply; 73+ messages in thread From: Benjamin LaHaise @ 2006-06-23 18:44 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: David Miller, netdev On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote: > This patch includes core kevent files: > - userspace controlling > - kernelspace interfaces > - initialisation > - notification state machines We don't need yet another event mechanism in the kernel, so I don't see why the new syscalls should be added when they don't interoperate with existing solutions. If your results are enough to sway akpm that it is worth taking the patches, then it would make sense to merge the code with the already in-tree APIs. -ben ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-06-23 18:44 ` Benjamin LaHaise @ 2006-06-23 19:24 ` Evgeniy Polyakov 2006-06-23 19:55 ` Benjamin LaHaise 2006-06-23 20:19 ` David Miller 0 siblings, 2 replies; 73+ messages in thread From: Evgeniy Polyakov @ 2006-06-23 19:24 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David Miller, netdev On Fri, Jun 23, 2006 at 02:44:57PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote: > On Fri, Jun 23, 2006 at 11:09:34AM +0400, Evgeniy Polyakov wrote: > > This patch includes core kevent files: > > - userspace controlling > > - kernelspace interfaces > > - initialisation > > - notification state machines > > We don't need yet another event mechanism in the kernel, so I don't see > why the new syscalls should be added when they don't interoperate with > existing solutions. If your results are enough to sway akpm that it is > worth taking the patches, then it would make sense to merge the code with > the already in-tree APIs. What API are you talking about? There is only epoll(), which is 40% slower than kevent, and AIO, which works not as a state machine but as a repeated call for the same work. There is also inotify, which allocates a new message each time an event occurs, which is not a good solution for every situation. Linux just does not have a unified event processing mechanism, which was pointed out many times on the AIO mailing list and when epoll() was first introduced. I would even say that Linux does not have such a mechanism at all, since every potential user implements its own, which cannot be used with the others. Kevent fixes that. Although the implementation itself may be suboptimal for some cases, or even unacceptable, it is really needed functionality. Every existing notification can be built on top of kevent. One can see how easy it was to implement generic poll/select notifications (what epoll() does) or socket notifications (which are similar to epoll(), but are called from inside the socket state machine, thus improving processing performance). > -ben -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-06-23 19:24 ` Evgeniy Polyakov @ 2006-06-23 19:55 ` Benjamin LaHaise 2006-06-23 20:17 ` Evgeniy Polyakov 0 siblings, 1 reply; 73+ messages in thread From: Benjamin LaHaise @ 2006-06-23 19:55 UTC (permalink / raw) To: Evgeniy Polyakov; +Cc: David Miller, netdev On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote: > What API are you talking about? > There is only epoll(), which is 40% slower than kevent, and AIO, which > works not as a state machine but as a repeated call for the same work. > There is also inotify, which allocates a new message each time an event > occurs, which is not a good solution for every situation. AIO can be implemented as a state machine. Nothing in the API stops you from doing that, and in fact there was code, implemented as a state machine, in use on 2.4 kernels. > Linux just does not have a unified event processing mechanism, which was > pointed out many times on the AIO mailing list and when epoll() was first > introduced. I would even say that Linux does not have such a mechanism at > all, since every potential user implements its own, which cannot be > used with the others. The epoll event API doesn't have space in the event fields for result codes as needed for AIO. The AIO API does -- how is it lacking in this regard? > Kevent fixes that. Although the implementation itself may be suboptimal for > some cases, or even unacceptable, it is really needed > functionality. At the expense of adding another API? How is this a good thing? Why not spit out events in the existing format? > Every existing notification can be built on top of kevent. One can see > how easy it was to implement generic poll/select notifications (what > epoll() does) or socket notifications (which are similar to epoll(), but > are called from inside the socket state machine, thus improving processing > performance). So far your code is adding a lot without unifying anything. -ben -- "Time is of no importance, Mr. President, only life is important." Don't Email: <dont@kvack.org>. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files. 2006-06-23 19:55 ` Benjamin LaHaise @ 2006-06-23 20:17 ` Evgeniy Polyakov 2006-06-23 20:44 ` Benjamin LaHaise 0 siblings, 1 reply; 73+ messages in thread From: Evgeniy Polyakov @ 2006-06-23 20:17 UTC (permalink / raw) To: Benjamin LaHaise; +Cc: David Miller, netdev On Fri, Jun 23, 2006 at 03:55:13PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote: > On Fri, Jun 23, 2006 at 11:24:29PM +0400, Evgeniy Polyakov wrote: > > What API are you talking about? > > There is only epoll(), which is 40% slower than kevent, and AIO, which > > works not as a state machine but as a repeated call for the same work. > > There is also inotify, which allocates a new message each time an event > > occurs, which is not a good solution for every situation. > > AIO can be implemented as a state machine. Nothing in the API stops > you from doing that, and in fact there was code, implemented as > a state machine, in use on 2.4 kernels. But now it is implemented as a repeated call for the same work, which does not look like it can be used for any other type of work. And repeated work introduces latencies. As far as I recall, it is you who wanted to remove the thread-based approach from the AIO subsystem. > > Linux just does not have a unified event processing mechanism, which was > > pointed out many times on the AIO mailing list and when epoll() was first > > introduced. I would even say that Linux does not have such a mechanism at > > all, since every potential user implements its own, which cannot be > > used with the others. > > The epoll event API doesn't have space in the event fields for result codes > as needed for AIO. The AIO API does -- how is it lacking in this regard? The AIO completion approach was designed to be used with the process-context VFS update. The read/write approach cannot cover other types of notifications, like inode updates or timers. > > Kevent fixes that. Although the implementation itself may be suboptimal for > > some cases, or even unacceptable, it is really needed > > functionality. > > At the expense of adding another API? How is this a good thing? Why > not spit out events in the existing format? The format of the structure transferred between the objects does not matter at all. We can create a wrapper around kevent structures, or kevent can transform data from AIO objects. The main design goal of kevent is to provide easily connected hooks into any state machine, which can be used by kernelspace to notify about any kind of event without any knowledge of its background nature. Kevent can be used, for example, as notification blocks for address changes, or it can replace netlink completely (it can even emulate event multicasting). Kevent is a queue of events, which can be transferred from any object to any destination. > > Every existing notification can be built on top of kevent. One can see > > how easy it was to implement generic poll/select notifications (what > > epoll() does) or socket notifications (which are similar to epoll(), but > > are called from inside the socket state machine, thus improving processing > > performance). > > So far your code is adding a lot without unifying anything. Not at all! Kevent is a mechanism which allows one to implement AIO, network AIO, poll and select, timer control, and adaptive readahead (as an example of the AIO VFS update). All the code I present shows how to use kevent; it is not part of kevent itself. One can look at the Makefile in the kevent dir to see what makes up the core of the subsystem, which can be used as a transport for events. 
AIO, NAIO, poll/select, socket and timer notifications are just users. One can add one's own usage as easily as calling the kevent_storage initialization function and the event generation function. All other pieces are hidden in the implementation. > -ben > -- > "Time is of no importance, Mr. President, only life is important." > Don't Email: <dont@kvack.org>. -- Evgeniy Polyakov ^ permalink raw reply [flat|nested] 73+ messages in thread
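As a concrete reading of that claim, a sketch of what a new kevent user would look like against the API in the core-files patch earlier in this thread; the my_origin structure is invented for illustration:

#include <linux/kevent.h>
#include <linux/kevent_storage.h>

/* Illustrative origin: embed a kevent_storage in the object. */
struct my_origin {
	struct kevent_storage st;
	/* ... object state ... */
};

static int my_origin_init(struct my_origin *o)
{
	/* Initialization: one call, as described above. */
	return kevent_storage_init(o, &o->st);
}

static void my_origin_event(struct my_origin *o, u32 event)
{
	/* Event generation: wakes every queued kevent whose requested
	 * mask intersects "event"; NULL means no per-kevent callback. */
	kevent_storage_ready(&o->st, NULL, event);
}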
* Re: [1/4] kevent: core files.
  2006-06-23 20:17 ` Evgeniy Polyakov
@ 2006-06-23 20:44 ` Benjamin LaHaise
  2006-06-23 21:08 ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 20:44 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Sat, Jun 24, 2006 at 12:17:17AM +0400, Evgeniy Polyakov wrote:
> But now it is implemented as a repeated call for the same work, which
> does not look like it can be used for any other type of work.

Given an iocb, you do not have to return -EIOCBRETRY; instead, return
-EIOCBQUEUED and then, from whatever context, do an aio_complete() with
the result for that iocb.

> And repeated work introduces latencies.
> As far as I recall, it was you who wanted to remove the thread-based
> approach from the AIO subsystem.

I have essentially given up on trying to get the filesystem AIO patches
in, given that the concerns against them amount to "too much complexity"
with no real path to inclusion being offered. If David is open to
changes in the networking area, I'd love to see it built on top of your
code.

> The AIO completion approach was designed to be used with the
> process-context VFS update. The read/write approach cannot cover other
> types of notifications, like inode updates or timers.

The completion event is 100% generic and does not need to come from
process context. Calling aio_complete() from irq context is entirely
valid.

> The format of the structure transferred between the objects does not
> matter at all. We can create a wrapper around kevent structures, or
> kevent can transform data from AIO objects.
> The main design goal of kevent is to provide easily connected hooks into
> any state machine, which can be used by the kernel to notify about
> any kind of event without any knowledge of its background nature.
> Kevent can be used, for example, as notification blocks for address
> changes, or it can replace netlink completely (it can even emulate
> event multicasting).
>
> Kevent is a queue of events which can be transferred from any object to
> any destination.

And io_getevents() reads a queue of events, so I'm not sure why you need
a new syscall.

> Not at all!
> Kevent is a mechanism which allows one to implement AIO, network AIO,
> poll and select, timer control, and adaptive readahead (as an example of
> the AIO VFS update). All the code I present shows how to use kevent; it
> is not part of kevent itself. One can check the Makefile in the kevent
> dir to see what the core of the subsystem is, which can be used as a
> transport for events.
>
> AIO, NAIO, poll/select, socket and timer notifications are just users.
> One can add one's own usage simply by calling the kevent_storage
> initialization function and the event generation function. All other
> pieces are hidden in the implementation.

I'll look at adapting your code to use the existing syscalls. Maybe code
will be better at expressing my concerns.

		-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread
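The pattern Ben describes, as a short sketch against the 2.6-era
in-kernel AIO hooks (the aio_read file operation, -EIOCBQUEUED and
aio_complete() are the real interfaces of that time; the my_* driver
pieces are hypothetical):

static ssize_t my_aio_read(struct kiocb *iocb, char __user *buf,
			   size_t count, loff_t pos)
{
	my_start_io(iocb, buf, count, pos);	/* driver-specific kickoff */

	/* not -EIOCBRETRY: tell the AIO core the iocb is in flight */
	return -EIOCBQUEUED;
}

/* later, from irq or any other completion context: */
static void my_io_done(struct kiocb *iocb, long nr_bytes)
{
	/* post the result; userspace will see it via io_getevents() */
	aio_complete(iocb, nr_bytes, 0);
}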
* Re: [1/4] kevent: core files.
  2006-06-23 20:44 ` Benjamin LaHaise
@ 2006-06-23 21:08 ` Evgeniy Polyakov
  2006-06-23 21:31 ` Benjamin LaHaise
  0 siblings, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 21:08 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > The AIO completion approach was designed to be used with the
> > process-context VFS update. The read/write approach cannot cover other
> > types of notifications, like inode updates or timers.
>
> The completion event is 100% generic and does not need to come from
> process context. Calling aio_complete() from irq context is entirely
> valid.

put_ioctx() can sleep.
And the whole approach is different: AIO just wakes up the requesting
thread, so the user must provide a lot to be able to work with AIO.
It perfectly fits the VFS design, but it is not acceptable for generic
event notifications.

> > The format of the structure transferred between the objects does not
> > matter at all. We can create a wrapper around kevent structures, or
> > kevent can transform data from AIO objects.
> > The main design goal of kevent is to provide easily connected hooks into
> > any state machine, which can be used by the kernel to notify about
> > any kind of event without any knowledge of its background nature.
> > Kevent can be used, for example, as notification blocks for address
> > changes, or it can replace netlink completely (it can even emulate
> > event multicasting).
> >
> > Kevent is a queue of events which can be transferred from any object to
> > any destination.
>
> And io_getevents() reads a queue of events, so I'm not sure why you need
> a new syscall.

It is not about the syscall; the overall design should be analyzed.
It is possible to use the existing syscalls: the kevent design does not
care how its data structures are delivered to the internal "processor".

> > Not at all!
> > Kevent is a mechanism which allows one to implement AIO, network AIO,
> > poll and select, timer control, and adaptive readahead (as an example of
> > the AIO VFS update). All the code I present shows how to use kevent; it
> > is not part of kevent itself. One can check the Makefile in the kevent
> > dir to see what the core of the subsystem is, which can be used as a
> > transport for events.
> >
> > AIO, NAIO, poll/select, socket and timer notifications are just users.
> > One can add one's own usage simply by calling the kevent_storage
> > initialization function and the event generation function. All other
> > pieces are hidden in the implementation.
>
> I'll look at adapting your code to use the existing syscalls. Maybe code
> will be better at expressing my concerns.

That would be great.

> -ben

--
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
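For reference, the "existing syscalls" delivery Ben has in mind is, on
the userspace side, just a loop over io_getevents(). A minimal sketch
using libaio (error handling omitted):

#include <libaio.h>
#include <stdio.h>

int event_loop(void)
{
	io_context_t ctx = 0;
	struct io_event events[64];
	int i, n;

	if (io_setup(64, &ctx) < 0)
		return -1;

	for (;;) {
		/* block until at least one completion is queued */
		n = io_getevents(ctx, 1, 64, events, NULL);
		for (i = 0; i < n; i++)
			printf("cookie %p res %lld\n",
			       events[i].data, (long long)events[i].res);
	}
	/* not reached */
}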
* Re: [1/4] kevent: core files.
  2006-06-23 21:08 ` Evgeniy Polyakov
@ 2006-06-23 21:31 ` Benjamin LaHaise
  2006-06-23 21:43 ` Evgeniy Polyakov
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 21:31 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, netdev

On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote:
> On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > > The AIO completion approach was designed to be used with the
> > > process-context VFS update. The read/write approach cannot cover
> > > other types of notifications, like inode updates or timers.
> >
> > The completion event is 100% generic and does not need to come from
> > process context. Calling aio_complete() from irq context is entirely
> > valid.
>
> put_ioctx() can sleep.

Err, no, that should definitely not be the case. If it can, someone has
completely broken aio.

> It is not about the syscall; the overall design should be analyzed.
> It is possible to use the existing syscalls: the kevent design does not
> care how its data structures are delivered to the internal "processor".

Okay, that's good to hear. =-)

		-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files.
  2006-06-23 21:31 ` Benjamin LaHaise
@ 2006-06-23 21:43 ` Evgeniy Polyakov
  0 siblings, 0 replies; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 21:43 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 05:31:44PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> On Sat, Jun 24, 2006 at 01:08:27AM +0400, Evgeniy Polyakov wrote:
> > On Fri, Jun 23, 2006 at 04:44:42PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> > > > The AIO completion approach was designed to be used with the
> > > > process-context VFS update. The read/write approach cannot cover
> > > > other types of notifications, like inode updates or timers.
> > >
> > > The completion event is 100% generic and does not need to come from
> > > process context. Calling aio_complete() from irq context is entirely
> > > valid.
> >
> > put_ioctx() can sleep.
>
> Err, no, that should definitely not be the case. If it can, someone has
> completely broken aio.

When the reference counter hits zero, it flushes the aio workqueue,
which can sleep:
put_ioctx() -> __put_ioctx() -> cancel_delayed_work()/flush_workqueue().

It has been there at least since the 2.6.15 days (that is the oldest
tree I can access over my extremely slow GPRS link).

> -ben

--
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
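The chain Evgeniy refers to, abridged (modeled on fs/aio.c of that era;
simplified, not the verbatim source):

static void __put_ioctx(struct kioctx *ctx)
{
	/* both of these can sleep, which is why the final put
	 * must not happen from irq context */
	cancel_delayed_work(&ctx->wq);
	flush_workqueue(aio_wq);
	/* ... unmap the ring, drop the mm, free ctx ... */
}

void put_ioctx(struct kioctx *ctx)
{
	if (atomic_dec_and_test(&ctx->users))
		__put_ioctx(ctx);
}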
* Re: [1/4] kevent: core files.
  2006-06-23 19:24 ` Evgeniy Polyakov
  2006-06-23 19:55 ` Benjamin LaHaise
@ 2006-06-23 20:19 ` David Miller
  2006-06-23 20:31 ` Benjamin LaHaise
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2006-06-23 20:19 UTC (permalink / raw)
  To: johnpol; +Cc: bcrl, netdev

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Fri, 23 Jun 2006 23:24:29 +0400

> Linux just does not have a unified event processing mechanism, which was
> pointed out many times on the AIO mailing list and when epoll() was first
> introduced. I would even say that Linux does not have such a mechanism at
> all, since every potential user implements its own, which cannot be
> used with the others.
>
> Kevent fixes that. Although the implementation itself may be suboptimal
> for some cases, or even unacceptable, it is really needed functionality.

I completely agree with Evgeniy here.

There is nothing in the kernel today that provides integrated event
handling. Nothing. So when someone says to use the "existing" stuff,
they need to have their head examined.

The existing AIO stuff stinks as a set of interfaces. It was designed
by a standards committee, not by people truly interested in a
well-performing event processing design. It is especially poorly suited
for networking, and any networking developer understands this.

It is pretty much a foregone conclusion that we will need new
APIs to get good networking performance. Every existing interface
has one limitation or another.

So we should be happy people like Evgeniy try to work on this stuff,
instead of discouraging them.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files.
  2006-06-23 20:19 ` David Miller
@ 2006-06-23 20:31 ` Benjamin LaHaise
  2006-06-23 20:54 ` Evgeniy Polyakov
  2006-06-23 20:54 ` David Miller
  0 siblings, 2 replies; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 20:31 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, netdev

On Fri, Jun 23, 2006 at 01:19:40PM -0700, David Miller wrote:
> I completely agree with Evgeniy here.
>
> There is nothing in the kernel today that provides integrated event
> handling. Nothing. So when someone says to use the "existing" stuff,
> they need to have their head examined.

The existing AIO events are *events*, with the syscalls providing the
reading of events.

> The existing AIO stuff stinks as a set of interfaces. It was designed
> by a standards committee, not by people truly interested in a
> well-performing event processing design. It is especially poorly suited
> for networking, and any networking developer understands this.

I disagree. Stuffing an event into a queue when a read or write is
complete/ready is a good way of handling things, even more so with
hardware that will perform the memory copies to/from user buffers.

> It is pretty much a foregone conclusion that we will need new
> APIs to get good networking performance. Every existing interface
> has one limitation or another.

Eh? Nobody has posted any numbers comparing the approaches yet, so this
is pure handwaving, unless you have real concrete results?

> So we should be happy people like Evgeniy try to work on this stuff,
> instead of discouraging them.

I would like to encourage him, but at the same time I don't want to see
us creating APIs that essentially duplicate existing work and needlessly
break compatibility. I completely agree that the in-kernel APIs are not
as encompassing as they should be, and within the kernel Evgeniy's work
may well be the way to go. What I do not agree with is that we need new
syscalls at this point. I'm perfectly willing to accept proof that
change is needed if we do a proper comparison between any new syscall
API and the use of the existing syscall API, but the pain of introducing
a new API is sufficiently large that I think it is worth looking at the
numbers.

		-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files.
  2006-06-23 20:31 ` Benjamin LaHaise
@ 2006-06-23 20:54 ` Evgeniy Polyakov
  2006-06-24  9:14 ` Robert Iakobashvili
  1 sibling, 1 reply; 73+ messages in thread
From: Evgeniy Polyakov @ 2006-06-23 20:54 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: David Miller, netdev

On Fri, Jun 23, 2006 at 04:31:14PM -0400, Benjamin LaHaise (bcrl@kvack.org) wrote:
> may well be the way to go. What I do not agree with is that we need new
> syscalls at this point. I'm perfectly willing to accept proof that
> change is needed if we do a proper comparison between any new syscall
> API and the use of the existing syscall API, but the pain of introducing
> a new API is sufficiently large that I think it is worth looking at the
> numbers.

A new syscall is just an interface. Originally kevent used (and still
can use) a char device and its ioctl method. It is perfectly possible to
create wrappers for the POSIX aio_* calls, although I do not see why that
is needed. There is no need to concentrate on the end-user interface at
this point - it can be changed at any time, since the design allows it.
We should think about the overall design and, if it is OK, move forward
with the implementation.

Btw, the new API adds only one syscall for userspace kevent processing
(and three - send/recv/sendfile - for network AIO).

As for numbers: kevent compared to epoll resulted in the following
(trivial web server):

 kevent: more than 2600 requests per second
 epoll:  about 1600-1800 requests per second

Number of errors for 3k bursts of connections, with 30k connections
total in 10 seconds:

 kevent: about 2k errors
 epoll:  up to 15k errors

More detailed results can be found on the project's homepage at:
tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

> -ben

--
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 73+ messages in thread
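For context, the epoll side of such a benchmark is the classic loop
below (trivial server skeleton; the accept()/HTTP handling and error
checks are omitted):

#include <sys/epoll.h>

void epoll_loop(int listen_fd)
{
	struct epoll_event ev, events[1024];
	int epfd = epoll_create(1024);
	int i, n;

	ev.events = EPOLLIN;
	ev.data.fd = listen_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

	for (;;) {
		n = epoll_wait(epfd, events, 1024, -1);
		for (i = 0; i < n; i++) {
			if (events[i].data.fd == listen_fd) {
				/* accept() and EPOLL_CTL_ADD the client */
			} else {
				/* read the request and serve it */
			}
		}
	}
}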
* Re: [1/4] kevent: core files.
  2006-06-23 20:54 ` Evgeniy Polyakov
@ 2006-06-24  9:14 ` Robert Iakobashvili
  0 siblings, 0 replies; 73+ messages in thread
From: Robert Iakobashvili @ 2006-06-24  9:14 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Benjamin LaHaise, David Miller, netdev

Hi,

> As for numbers: kevent compared to epoll resulted in the following
> (trivial web server):
> kevent: more than 2600 requests per second
> epoll:  about 1600-1800 requests per second
> Number of errors for 3k bursts of connections, with 30k connections
> total in 10 seconds:
> kevent: about 2k errors
> epoll:  up to 15k errors

If it beats the great epoll, there is a real business case for kevent.
All previous attempts - in the kernel as well as by glibc and other
userland emulations - to provide a real AIO infrastructure and API for
server applications with performance benefits were not very successful.
Heavily loaded networking servers normally do not use AIO on Linux due
to its low performance.

On the other hand, Windows has a very strong I/O completion ports API,
which is widely used for the most heavily loaded applications.

Kevent may take Linux server productivity forward in general, as well as
encourage moving AIO applications from Windows to Linux.

--
Sincerely,
------------------------------------------------------------------
Robert Iakobashvili, coroberti at gmail dot com
Navigare necesse est, vivere non est necesse.
------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files.
  2006-06-23 20:31 ` Benjamin LaHaise
  2006-06-23 20:54 ` Evgeniy Polyakov
@ 2006-06-23 20:54 ` David Miller
  2006-06-23 21:53 ` Benjamin LaHaise
  1 sibling, 1 reply; 73+ messages in thread
From: David Miller @ 2006-06-23 20:54 UTC (permalink / raw)
  To: bcrl; +Cc: johnpol, netdev

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Fri, 23 Jun 2006 16:31:14 -0400

> Eh? Nobody has posted any numbers comparing the approaches yet, so this
> is pure handwaving, unless you have real concrete results?

Evgeniy posts numbers and performance graphs on his kevent work all
the time.

Van Jacobson did in his LCA2006 net channel slides too; perhaps you
missed that.

^ permalink raw reply	[flat|nested] 73+ messages in thread
* Re: [1/4] kevent: core files.
  2006-06-23 20:54 ` David Miller
@ 2006-06-23 21:53 ` Benjamin LaHaise
  2006-06-23 22:12 ` David Miller
  0 siblings, 1 reply; 73+ messages in thread
From: Benjamin LaHaise @ 2006-06-23 21:53 UTC (permalink / raw)
  To: David Miller; +Cc: johnpol, netdev

On Fri, Jun 23, 2006 at 01:54:23PM -0700, David Miller wrote:
> From: Benjamin LaHaise <bcrl@kvack.org>
> Date: Fri, 23 Jun 2006 16:31:14 -0400
>
> > Eh? Nobody has posted any numbers comparing the approaches yet, so this
> > is pure handwaving, unless you have real concrete results?
>
> Evgeniy posts numbers and performance graphs on his kevent work all
> the time.

But you're arguing that the performance of something that hasn't been
tested is worse simply by nature of it not having been tested. That's a
fallacy of omission, iiuc.

> Van Jacobson did in his LCA2006 net channel slides too; perhaps you
> missed that.

I have yet to be convinced that the layering violation known as net
channels is the right way to go, mostly because it breaks horribly in a
few cases -- think what happens during periods of CPU overcommit, in
which case doing too much in interrupt context will kill a system (which
is why softirqs are needed). The effect of doing all processing in user
context creates issues with delayed acks (due to context switching to
other tasks in the system), which will cause excess retransmits. The
hard problems associated with packet filtering and security are also
still unresolved, which is okay for a paper, but a concern in real life.

There are also a number of performance flaws in the current stack that
show up under profiling, some of which I posted fixes for, some of which
have yet to be fixed. The pushf/popf pipeline stall was one of the
bigger instances of CPU wastage that Van Jacobson noticed (it shows up
as bottom halves using lots of CPU). Iirc, Ingo's real time patches may
avoid that by way of reworking the irq disable/enable mechanism, which
would mean the results need retesting. Using the cr8 register to
enable/disable interrupts on x86-64 might also improve things, as that
would eliminate the flags dependency of cli/sti... In short, there's a
lot of work that still has to be done.

		-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 73+ messages in thread
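To make the pushf/popf point concrete: a sketch of the classic interrupt
flags save/restore pair on x86-64, plus the cr8-based masking Ben
mentions. The first two functions are roughly what
local_irq_save()/local_irq_restore() compiled down to; the cr8 variant
is shown only as the suggested alternative, not as merged kernel code:

static inline unsigned long irq_save(void)
{
	unsigned long flags;

	asm volatile("pushfq ; popq %0 ; cli" : "=r" (flags) : : "memory");
	return flags;
}

static inline void irq_restore(unsigned long flags)
{
	/* popf rewrites all of RFLAGS and serializes the pipeline --
	 * the stall that shows up as bottom halves burning CPU */
	asm volatile("pushq %0 ; popfq" : : "r" (flags) : "memory", "cc");
}

static inline void irq_mask_via_cr8(void)
{
	/* raise the local APIC task priority instead of touching
	 * RFLAGS, avoiding the flags dependency of cli/sti */
	asm volatile("movq %0, %%cr8" : : "r" (15UL) : "memory");
}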
* Re: [1/4] kevent: core files.
  2006-06-23 21:53 ` Benjamin LaHaise
@ 2006-06-23 22:12 ` David Miller
  0 siblings, 0 replies; 73+ messages in thread
From: David Miller @ 2006-06-23 22:12 UTC (permalink / raw)
  To: bcrl; +Cc: johnpol, netdev

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Fri, 23 Jun 2006 17:53:14 -0400

> The effect of doing all processing in user context creates issues
> with delayed acks (due to context switching to other tasks in the
> system),

The Linux TCP stack does this today. Full TCP input protocol
processing is done in the user process context.

What you are not understanding is that process scheduling helps TCP;
it does not hinder it. If the system is loaded, we want the senders to
pace themselves to the rate at which the kernel can schedule the
abundance of receiver work it has. And this happens naturally when the
TCP protocol input processing operates in process context.

Your fear of CPU overcommit in interrupt handlers is also heavily
flawed. Net channels do a socket demux and an enqueue, plus ring a
doorbell if necessary, nothing more.

^ permalink raw reply	[flat|nested] 73+ messages in thread
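A hypothetical sketch of the interrupt-time work David describes: the
driver demuxes to a per-socket channel, enqueues, and rings a doorbell
if needed, deferring all protocol work to process context. Every name
and type here is illustrative, not taken from any posted net channel
code:

struct net_channel {
	struct sk_buff		*ring[256];
	unsigned int		head;
	wait_queue_head_t	wait;		/* the "doorbell" */
};

/* called from the driver rx path after a cheap 4-tuple demux
 * has picked the destination channel */
static void netchan_enqueue(struct net_channel *ch, struct sk_buff *skb)
{
	ch->ring[ch->head++ & 255] = skb;

	/* ring the doorbell only if the consumer is asleep; TCP input
	 * processing happens later, in the receiving process context */
	if (waitqueue_active(&ch->wait))
		wake_up(&ch->wait);
}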
Thread overview: 73+ messages
[not found] <44C66FC9.3050402@redhat.com>
2006-07-25 22:01 ` async network I/O, event channels, etc David Miller
2006-07-25 22:55 ` Nicholas Miell
2006-07-26 6:28 ` Evgeniy Polyakov
2006-07-26 9:18 ` [0/4] kevent: generic event processing subsystem Evgeniy Polyakov
2006-07-26 9:18 ` [1/4] kevent: core files Evgeniy Polyakov
2006-07-26 9:18 ` [2/4] kevent: network AIO, socket notifications Evgeniy Polyakov
2006-07-26 9:18 ` [3/4] kevent: AIO, aio_sendfile() implementation Evgeniy Polyakov
2006-07-26 9:18 ` [4/4] kevent: poll/select() notifications. Timer notifications Evgeniy Polyakov
2006-07-26 10:00 ` [3/4] kevent: AIO, aio_sendfile() implementation Christoph Hellwig
2006-07-26 10:08 ` Evgeniy Polyakov
2006-07-26 10:13 ` Christoph Hellwig
2006-07-26 10:25 ` Evgeniy Polyakov
2006-07-26 10:04 ` Christoph Hellwig
2006-07-26 10:12 ` David Miller
2006-07-26 10:15 ` Christoph Hellwig
2006-07-26 20:21 ` Phillip Susi
2006-07-26 14:14 ` Avi Kivity
2006-07-26 10:19 ` Evgeniy Polyakov
2006-07-26 10:30 ` Christoph Hellwig
2006-07-26 14:28 ` Ulrich Drepper
2006-07-26 16:22 ` Badari Pulavarty
2006-07-27 6:49 ` Sébastien Dugué
2006-07-27 15:28 ` Badari Pulavarty
2006-07-27 18:14 ` Zach Brown
2006-07-27 18:29 ` Badari Pulavarty
2006-07-27 18:44 ` Ulrich Drepper
2006-07-27 21:02 ` Badari Pulavarty
2006-07-28 7:31 ` Sébastien Dugué
2006-07-28 12:58 ` Sébastien Dugué
2006-08-11 19:45 ` Ulrich Drepper
2006-08-12 18:29 ` Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile) Suparna Bhattacharya
2006-08-12 19:10 ` Ulrich Drepper
2006-08-12 19:28 ` Jakub Jelinek
2006-09-04 14:37 ` Sébastien Dugué
2006-08-14 7:02 ` Suparna Bhattacharya
2006-08-14 16:38 ` Ulrich Drepper
2006-08-15 2:06 ` Nicholas Miell
2006-09-04 14:36 ` Sébastien Dugué
2006-09-04 14:28 ` Sébastien Dugué
2006-07-28 7:29 ` [3/4] kevent: AIO, aio_sendfile() implementation Sébastien Dugué
2006-07-31 10:11 ` Suparna Bhattacharya
2006-07-28 7:26 ` Sébastien Dugué
2006-07-26 10:31 ` [1/4] kevent: core files Andrew Morton
2006-07-26 10:37 ` Evgeniy Polyakov
2006-07-26 10:44 ` Evgeniy Polyakov
2006-07-27 6:10 ` async network I/O, event channels, etc David Miller
2006-07-27 7:49 ` Evgeniy Polyakov
2006-07-27 8:02 ` David Miller
2006-07-27 8:09 ` Jens Axboe
2006-07-27 8:11 ` Jens Axboe
2006-07-27 8:20 ` David Miller
2006-07-27 8:29 ` Jens Axboe
2006-07-27 8:37 ` David Miller
2006-07-27 8:39 ` Jens Axboe
2006-07-27 8:58 ` Evgeniy Polyakov
2006-07-27 9:31 ` David Miller
2006-07-27 9:37 ` Evgeniy Polyakov
2006-06-22 17:14 [1/1] Kevent subsystem Evgeniy Polyakov
2006-06-23 7:09 ` [1/4] kevent: core files Evgeniy Polyakov
2006-06-23 18:44 ` Benjamin LaHaise
2006-06-23 19:24 ` Evgeniy Polyakov
2006-06-23 19:55 ` Benjamin LaHaise
2006-06-23 20:17 ` Evgeniy Polyakov
2006-06-23 20:44 ` Benjamin LaHaise
2006-06-23 21:08 ` Evgeniy Polyakov
2006-06-23 21:31 ` Benjamin LaHaise
2006-06-23 21:43 ` Evgeniy Polyakov
2006-06-23 20:19 ` David Miller
2006-06-23 20:31 ` Benjamin LaHaise
2006-06-23 20:54 ` Evgeniy Polyakov
2006-06-24 9:14 ` Robert Iakobashvili
2006-06-23 20:54 ` David Miller
2006-06-23 21:53 ` Benjamin LaHaise
2006-06-23 22:12 ` David Miller