* [PATCH 1/2] wait: kill is_sync_wait() @ 2008-08-26 13:57 Tejun Heo 2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo 2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar 0 siblings, 2 replies; 7+ messages in thread From: Tejun Heo @ 2008-08-26 13:57 UTC (permalink / raw) To: Linus Torvalds, Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer is_sync_wait() is used to distinguish between sync and async waits. Basically sync waits are the ones initialized with init_waitqueue_entry() and async ones with init_waitqueue_func_entry(). The sync/async distinction is used only in prepare_to_wait[_exclusive]() and its only function is to skip setting the current task state if the wait is async. This has a few problems. * No one uses it. None of func_entry users use prepare_to_wait() functions, so the code path never gets executed. * The distinction is bogus. Maybe back when func_entry is used only by aio but it's now also used by epoll and in future possibly by 9p and poll/select. * Taking @state as argument and ignoring it silenly depending on how @wait is initialized is just a bad error-prone API. * It prevents func_entry waits from using wait->private for no good reason. This patch kills is_sync_wait() and the associated code paths from prepare_to_wait[_exclusive](). As there was no user of these code paths, this patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> --- include/linux/wait.h | 9 --------- kernel/wait.c | 14 ++------------ 2 files changed, 2 insertions(+), 21 deletions(-) Index: work/kernel/wait.c =================================================================== --- work.orig/kernel/wait.c +++ work/kernel/wait.c @@ -72,12 +72,7 @@ prepare_to_wait(wait_queue_head_t *q, wa spin_lock_irqsave(&q->lock, flags); if (list_empty(&wait->task_list)) __add_wait_queue(q, wait); - /* - * don't alter the task state if this is just going to - * queue an async wait queue callback - */ - if (is_sync_wait(wait)) - set_current_state(state); + set_current_state(state); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(prepare_to_wait); @@ -91,12 +86,7 @@ prepare_to_wait_exclusive(wait_queue_hea spin_lock_irqsave(&q->lock, flags); if (list_empty(&wait->task_list)) __add_wait_queue_tail(q, wait); - /* - * don't alter the task state if this is just going to - * queue an async wait queue callback - */ - if (is_sync_wait(wait)) - set_current_state(state); + set_current_state(state); spin_unlock_irqrestore(&q->lock, flags); } EXPORT_SYMBOL(prepare_to_wait_exclusive); Index: work/include/linux/wait.h =================================================================== --- work.orig/include/linux/wait.h +++ work/include/linux/wait.h @@ -108,15 +108,6 @@ static inline int waitqueue_active(wait_ return !list_empty(&q->task_list); } -/* - * Used to distinguish between sync and async io wait context: - * sync i/o typically specifies a NULL wait queue entry or a wait - * queue entry bound to a task (current task) to wake up. - * aio specifies a wait queue entry with an async notification - * callback routine, not associated with any task. - */ -#define is_sync_wait(wait) (!(wait) || ((wait)->private)) - extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait); extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait); extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait); ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] poll: allow f_op->poll to sleep 2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo @ 2008-08-26 13:59 ` Tejun Heo 2008-08-26 16:58 ` Linus Torvalds 2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar 1 sibling, 1 reply; 7+ messages in thread From: Tejun Heo @ 2008-08-26 13:59 UTC (permalink / raw) To: Linus Torvalds, Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer f_op->poll is the only vfs operation which is not allowed to sleep. It's because poll and select implementation used task state to synchronize against wake ups, which doesn't have to be the case anymore as wait/wake interface can now use custom wake up functions. The non-sleep restriction can be a bit tricky because ->poll is not called from an atomic context and the result of accidentally sleeping in ->poll only shows up as temporary busy looping when the timing is right or rather wrong. This patch converts poll/select to use custom wake up function and use separate triggered variable to synchronize against wake up events. The only added overhead is an extra function call during wake up and negligible. This patch removes the one non-sleep exception from vfs locking rules and is especially beneficial to userland filesystem implementations like FUSE or 9p as it's very difficult for those to implement non-sleeping poll method. All the existing poll users except for net/9p/trans_fd.c are converted to use the new poll wait. The 9p trans_fd is difficult to convert as it currently stands but the following patchset solves the problem nicely. http://thread.gmane.org/gmane.linux.kernel/726098 With or without the above patchset, this patch doesn't break anything as all current ->poll implementations don't sleep anyway but the claim that any ->poll implementation can sleep doesn't hold without the above patchset. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> --- Documentation/filesystems/Locking | 2 - drivers/media/video/v4l1-compat.c | 3 -- fs/select.c | 50 ++++++++++++++++++++++++++++---------- include/linux/poll.h | 9 ++++++ 4 files changed, 49 insertions(+), 15 deletions(-) Index: work/fs/select.c =================================================================== --- work.orig/fs/select.c +++ work/fs/select.c @@ -54,6 +54,7 @@ static void __pollwait(struct file *filp void poll_initwait(struct poll_wqueues *pwq) { init_poll_funcptr(&pwq->pt, __pollwait); + pwq->polling_task = current; pwq->error = 0; pwq->table = NULL; pwq->inline_index = 0; @@ -90,9 +91,8 @@ void poll_freewait(struct poll_wqueues * EXPORT_SYMBOL(poll_freewait); -static struct poll_table_entry *poll_get_entry(poll_table *_p) +static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) { - struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt); struct poll_table_page *table = p->table; if (p->inline_index < N_INLINE_POLL_ENTRIES) @@ -104,7 +104,6 @@ static struct poll_table_entry *poll_get new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL); if (!new_table) { p->error = -ENOMEM; - __set_current_state(TASK_RUNNING); return NULL; } new_table->entry = new_table->entries; @@ -116,20 +115,50 @@ static struct poll_table_entry *poll_get return table->entry++; } +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct poll_wqueues *pwq = wait->private; + DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); + + set_mb(pwq->triggered, 1); + + /* perform the default wake up operation */ + return default_wake_function(&dummy_wait, mode, sync, key); +} + /* Add a new entry */ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p) { - struct poll_table_entry *entry = poll_get_entry(p); + struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt); + struct poll_table_entry *entry = poll_get_entry(pwq); if (!entry) return; get_file(filp); entry->filp = filp; entry->wait_address = wait_address; - init_waitqueue_entry(&entry->wait, current); + init_waitqueue_func_entry(&entry->wait, pollwake); + entry->wait.private = pwq; add_wait_queue(wait_address, &entry->wait); } +long poll_schedule_timeout(struct poll_wqueues *pwq, int state, long timeout) +{ + DEFINE_WAIT(wait); + + set_current_state(TASK_INTERRUPTIBLE); + if (!pwq->triggered) + timeout = schedule_timeout(timeout); + __set_current_state(TASK_RUNNING); + + /* clear triggered for the next iteration */ + pwq->triggered = 0; + + return timeout; +} + +EXPORT_SYMBOL(poll_schedule_timeout); + #define FDS_IN(fds, n) (fds->in + n) #define FDS_OUT(fds, n) (fds->out + n) #define FDS_EX(fds, n) (fds->ex + n) @@ -205,8 +234,6 @@ int do_select(int n, fd_set_bits *fds, s unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp; long __timeout; - set_current_state(TASK_INTERRUPTIBLE); - inp = fds->in; outp = fds->out; exp = fds->ex; rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex; @@ -277,11 +304,11 @@ int do_select(int n, fd_set_bits *fds, s __timeout = *timeout; *timeout = 0; } - __timeout = schedule_timeout(__timeout); + __timeout = poll_schedule_timeout(&table, TASK_INTERRUPTIBLE, + __timeout); if (*timeout >= 0) *timeout += __timeout; } - __set_current_state(TASK_RUNNING); poll_freewait(&table); @@ -587,7 +614,6 @@ static int do_poll(unsigned int nfds, s struct poll_list *walk; long __timeout; - set_current_state(TASK_INTERRUPTIBLE); for (walk = list; walk != NULL; walk = walk->next) { struct pollfd * pfd, * pfd_end; @@ -635,11 +661,11 @@ static int do_poll(unsigned int nfds, s *timeout = 0; } - __timeout = schedule_timeout(__timeout); + __timeout = poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, + __timeout); if (*timeout >= 0) *timeout += __timeout; } - __set_current_state(TASK_RUNNING); return count; } Index: work/include/linux/poll.h =================================================================== --- work.orig/include/linux/poll.h +++ work/include/linux/poll.h @@ -57,6 +57,8 @@ struct poll_table_entry { struct poll_wqueues { poll_table pt; struct poll_table_page * table; + struct task_struct * polling_task; + int triggered; int error; int inline_index; struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES]; @@ -64,6 +66,13 @@ struct poll_wqueues { extern void poll_initwait(struct poll_wqueues *pwq); extern void poll_freewait(struct poll_wqueues *pwq); +extern long poll_schedule_timeout(struct poll_wqueues *pwq, int state, + long timeout); + +static inline long poll_schedule(struct poll_wqueues *pwq, int state) +{ + return poll_schedule_timeout(pwq, state, MAX_SCHEDULE_TIMEOUT); +} /* * Scaleable version of the fd_set. Index: work/drivers/media/video/v4l1-compat.c =================================================================== --- work.orig/drivers/media/video/v4l1-compat.c +++ work/drivers/media/video/v4l1-compat.c @@ -209,7 +209,6 @@ static int poll_one(struct file *file, s table = &pwq->pt; for (;;) { int mask; - set_current_state(TASK_INTERRUPTIBLE); mask = file->f_op->poll(file, table); if (mask & POLLIN) break; @@ -218,7 +217,7 @@ static int poll_one(struct file *file, s retval = -ERESTARTSYS; break; } - schedule(); + poll_schedule(pwq, TASK_INTERRUPTIBLE); } set_current_state(TASK_RUNNING); poll_freewait(pwq); Index: work/Documentation/filesystems/Locking =================================================================== --- work.orig/Documentation/filesystems/Locking +++ work/Documentation/filesystems/Locking @@ -396,7 +396,7 @@ prototypes: }; locking rules: - All except ->poll() may block. + All may block. BKL llseek: no (see below) read: no ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep 2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo @ 2008-08-26 16:58 ` Linus Torvalds 2008-08-26 17:48 ` Tejun Heo ` (2 more replies) 0 siblings, 3 replies; 7+ messages in thread From: Linus Torvalds @ 2008-08-26 16:58 UTC (permalink / raw) To: Tejun Heo Cc: Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer On Tue, 26 Aug 2008, Tejun Heo wrote: > > This patch converts poll/select to use custom wake up function and use > separate triggered variable to synchronize against wake up events. > The only added overhead is an extra function call during wake up and > negligible. I don't really see the point. poll() isn't allowed to sleep for many reasons. Some are technical. But the most obvious one is that a sleeping "poll()" is totally against the whole point of polling in the first place! So is there some big conceptual reason to change how poll() has always worked? If you worry about debuggability, then we could just add a preempt_enable(); .. preempt_disable(); around the poll calls (purely for catching errors) to get a big warning if somebody tries to call a sleepable function. Linus ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep 2008-08-26 16:58 ` Linus Torvalds @ 2008-08-26 17:48 ` Tejun Heo 2008-08-26 19:51 ` Christoph Hellwig 2008-08-26 20:22 ` Ingo Molnar 2 siblings, 0 replies; 7+ messages in thread From: Tejun Heo @ 2008-08-26 17:48 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer, Takashi Iwai (cc'ing Takashi Iwai) Hello, Linus. Linus Torvalds wrote: >> This patch converts poll/select to use custom wake up function and use >> separate triggered variable to synchronize against wake up events. >> The only added overhead is an extra function call during wake up and >> negligible. > > I don't really see the point. > > poll() isn't allowed to sleep for many reasons. Some are technical. But > the most obvious one is that a sleeping "poll()" is totally against the > whole point of polling in the first place! > > So is there some big conceptual reason to change how poll() has always > worked? No big conceptual reason at all. I was trying to implement poll for FUSE so that character device can be implemented in userspace such that /dev/dsp can be routed to pulseaudio (or whatever) and for userland filesystem, non-sleep poll just couldn't be made to work and there didn't seem to be no big reason for poll not being able to sleep, so... The OSS sound routing is probably a bit too late but the exclusion between OSS and the modern sound system is still quite annoying and for simple sound apps OSS interface is just much simpler as it's just normal file IOs and fully contained in kernel (e.g. arch-um host sound support). > If you worry about debuggability, then we could just add a > > preempt_enable(); > .. > preempt_disable(); > > around the poll calls (purely for catching errors) to get a big warning if > somebody tries to call a sleepable function. Yeap, that will be a nice debug option to have if this patch gets NACKed. Thanks. -- tejun ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep 2008-08-26 16:58 ` Linus Torvalds 2008-08-26 17:48 ` Tejun Heo @ 2008-08-26 19:51 ` Christoph Hellwig 2008-08-26 20:22 ` Ingo Molnar 2 siblings, 0 replies; 7+ messages in thread From: Christoph Hellwig @ 2008-08-26 19:51 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer, cbe-oss-dev On Tue, Aug 26, 2008 at 09:58:14AM -0700, Linus Torvalds wrote: > I don't really see the point. > > poll() isn't allowed to sleep for many reasons. Some are technical. But > the most obvious one is that a sleeping "poll()" is totally against the > whole point of polling in the first place! > > So is there some big conceptual reason to change how poll() has always > worked? Just as a little sidenote most files on spufs have a ->poll that sleeps, and currently we don't have any debugging to catch this, I just noticed this by accident. The reason that it sleeps is because it needs to grab a sleping lock, and as that lock protects against a rather complicate type of hardware context switch it's not possible to replace it with a spinlock. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep 2008-08-26 16:58 ` Linus Torvalds 2008-08-26 17:48 ` Tejun Heo 2008-08-26 19:51 ` Christoph Hellwig @ 2008-08-26 20:22 ` Ingo Molnar 2 siblings, 0 replies; 7+ messages in thread From: Ingo Molnar @ 2008-08-26 20:22 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 26 Aug 2008, Tejun Heo wrote: > > > > This patch converts poll/select to use custom wake up function and use > > separate triggered variable to synchronize against wake up events. > > The only added overhead is an extra function call during wake up and > > negligible. > > I don't really see the point. > > poll() isn't allowed to sleep for many reasons. Some are technical. > But the most obvious one is that a sleeping "poll()" is totally > against the whole point of polling in the first place! > > So is there some big conceptual reason to change how poll() has always > worked? i think the point is to replace potentially fragile atomic code with less restricted 'potentially sleeping' code. For example a GFP_KERNEL allocation will not sleep in 99% of the cases, and if it sleeps we are under such memory pressure that we dont really care all that much whether we happen to sleep in poll() or in the next userspace pagefault. The whole ->poll() handler model would be less restricted. I dont think that's a bad idea - all our atomic contexts suffer from programmability restrictions, and people are trying to get away from them. Plus it seems to enable FUSE some more as well - and FUSE has been spearheading a lot of filesystem development lately. FUSE is IMO the useful form of microkernels - a fast prototyping platform with a very friendly user-space programming and debugging interface. (and the fact that FUSE based filesystems can be packaged up and distributed a lot easier than kernel changes is an argument in favor of FUSE as well) Ingo ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] wait: kill is_sync_wait() 2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo 2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo @ 2008-08-26 14:09 ` Ingo Molnar 1 sibling, 0 replies; 7+ messages in thread From: Ingo Molnar @ 2008-08-26 14:09 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Al Viro, hch, Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich, v9fs-developer * Tejun Heo <tj@kernel.org> wrote: > is_sync_wait() is used to distinguish between sync and async waits. > Basically sync waits are the ones initialized with > init_waitqueue_entry() and async ones with init_waitqueue_func_entry(). > The sync/async distinction is used only in prepare_to_wait[_exclusive]() > and its only function is to skip setting the current task state if the > wait is async. This has a few problems. > > * No one uses it. None of func_entry users use prepare_to_wait() > functions, so the code path never gets executed. > > * The distinction is bogus. Maybe back when func_entry is used only > by aio but it's now also used by epoll and in future possibly by 9p > and poll/select. > > * Taking @state as argument and ignoring it silenly depending on how > @wait is initialized is just a bad error-prone API. > > * It prevents func_entry waits from using wait->private for no good > reason. > > This patch kills is_sync_wait() and the associated code paths from > prepare_to_wait[_exclusive](). As there was no user of these code > paths, this patch doesn't cause any behavior difference. > > Signed-off-by: Tejun Heo <tj@kernel.org> good spotting. Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2008-08-26 20:23 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo 2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo 2008-08-26 16:58 ` Linus Torvalds 2008-08-26 17:48 ` Tejun Heo 2008-08-26 19:51 ` Christoph Hellwig 2008-08-26 20:22 ` Ingo Molnar 2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox