* [PATCH 1/2] wait: kill is_sync_wait()
@ 2008-08-26 13:57 Tejun Heo
2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo
2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar
0 siblings, 2 replies; 7+ messages in thread
From: Tejun Heo @ 2008-08-26 13:57 UTC (permalink / raw)
To: Linus Torvalds, Ingo Molnar, Al Viro, hch,
Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich,
v9fs-developer
is_sync_wait() is used to distinguish between sync and async waits.
Basically sync waits are the ones initialized with
init_waitqueue_entry() and async ones with init_waitqueue_func_entry().
The sync/async distinction is used only in prepare_to_wait[_exclusive]()
and its only function is to skip setting the current task state if the
wait is async. This has a few problems.
* No one uses it. None of func_entry users use prepare_to_wait()
functions, so the code path never gets executed.
* The distinction is bogus. Maybe back when func_entry is used only
by aio but it's now also used by epoll and in future possibly by 9p
and poll/select.
* Taking @state as argument and ignoring it silenly depending on how
@wait is initialized is just a bad error-prone API.
* It prevents func_entry waits from using wait->private for no good
reason.
This patch kills is_sync_wait() and the associated code paths from
prepare_to_wait[_exclusive](). As there was no user of these code
paths, this patch doesn't cause any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/wait.h | 9 ---------
kernel/wait.c | 14 ++------------
2 files changed, 2 insertions(+), 21 deletions(-)
Index: work/kernel/wait.c
===================================================================
--- work.orig/kernel/wait.c
+++ work/kernel/wait.c
@@ -72,12 +72,7 @@ prepare_to_wait(wait_queue_head_t *q, wa
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
__add_wait_queue(q, wait);
- /*
- * don't alter the task state if this is just going to
- * queue an async wait queue callback
- */
- if (is_sync_wait(wait))
- set_current_state(state);
+ set_current_state(state);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait);
@@ -91,12 +86,7 @@ prepare_to_wait_exclusive(wait_queue_hea
spin_lock_irqsave(&q->lock, flags);
if (list_empty(&wait->task_list))
__add_wait_queue_tail(q, wait);
- /*
- * don't alter the task state if this is just going to
- * queue an async wait queue callback
- */
- if (is_sync_wait(wait))
- set_current_state(state);
+ set_current_state(state);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
Index: work/include/linux/wait.h
===================================================================
--- work.orig/include/linux/wait.h
+++ work/include/linux/wait.h
@@ -108,15 +108,6 @@ static inline int waitqueue_active(wait_
return !list_empty(&q->task_list);
}
-/*
- * Used to distinguish between sync and async io wait context:
- * sync i/o typically specifies a NULL wait queue entry or a wait
- * queue entry bound to a task (current task) to wake up.
- * aio specifies a wait queue entry with an async notification
- * callback routine, not associated with any task.
- */
-#define is_sync_wait(wait) (!(wait) || ((wait)->private))
-
extern void add_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
extern void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *wait);
extern void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait);
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 2/2] poll: allow f_op->poll to sleep
2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo
@ 2008-08-26 13:59 ` Tejun Heo
2008-08-26 16:58 ` Linus Torvalds
2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar
1 sibling, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2008-08-26 13:59 UTC (permalink / raw)
To: Linus Torvalds, Ingo Molnar, Al Viro, hch,
Linux Kernel Mailing List, Eric Van Hensbergen, Ron Minnich,
v9fs-developer
f_op->poll is the only vfs operation which is not allowed to sleep.
It's because poll and select implementation used task state to
synchronize against wake ups, which doesn't have to be the case
anymore as wait/wake interface can now use custom wake up functions.
The non-sleep restriction can be a bit tricky because ->poll is not
called from an atomic context and the result of accidentally sleeping
in ->poll only shows up as temporary busy looping when the timing is
right or rather wrong.
This patch converts poll/select to use custom wake up function and use
separate triggered variable to synchronize against wake up events.
The only added overhead is an extra function call during wake up and
negligible.
This patch removes the one non-sleep exception from vfs locking rules
and is especially beneficial to userland filesystem implementations
like FUSE or 9p as it's very difficult for those to implement
non-sleeping poll method.
All the existing poll users except for net/9p/trans_fd.c are converted
to use the new poll wait. The 9p trans_fd is difficult to convert as
it currently stands but the following patchset solves the problem
nicely.
http://thread.gmane.org/gmane.linux.kernel/726098
With or without the above patchset, this patch doesn't break anything
as all current ->poll implementations don't sleep anyway but the claim
that any ->poll implementation can sleep doesn't hold without the
above patchset.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
---
Documentation/filesystems/Locking | 2 -
drivers/media/video/v4l1-compat.c | 3 --
fs/select.c | 50 ++++++++++++++++++++++++++++----------
include/linux/poll.h | 9 ++++++
4 files changed, 49 insertions(+), 15 deletions(-)
Index: work/fs/select.c
===================================================================
--- work.orig/fs/select.c
+++ work/fs/select.c
@@ -54,6 +54,7 @@ static void __pollwait(struct file *filp
void poll_initwait(struct poll_wqueues *pwq)
{
init_poll_funcptr(&pwq->pt, __pollwait);
+ pwq->polling_task = current;
pwq->error = 0;
pwq->table = NULL;
pwq->inline_index = 0;
@@ -90,9 +91,8 @@ void poll_freewait(struct poll_wqueues *
EXPORT_SYMBOL(poll_freewait);
-static struct poll_table_entry *poll_get_entry(poll_table *_p)
+static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
{
- struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
struct poll_table_page *table = p->table;
if (p->inline_index < N_INLINE_POLL_ENTRIES)
@@ -104,7 +104,6 @@ static struct poll_table_entry *poll_get
new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
if (!new_table) {
p->error = -ENOMEM;
- __set_current_state(TASK_RUNNING);
return NULL;
}
new_table->entry = new_table->entries;
@@ -116,20 +115,50 @@ static struct poll_table_entry *poll_get
return table->entry++;
}
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+ struct poll_wqueues *pwq = wait->private;
+ DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
+
+ set_mb(pwq->triggered, 1);
+
+ /* perform the default wake up operation */
+ return default_wake_function(&dummy_wait, mode, sync, key);
+}
+
/* Add a new entry */
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
poll_table *p)
{
- struct poll_table_entry *entry = poll_get_entry(p);
+ struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
+ struct poll_table_entry *entry = poll_get_entry(pwq);
if (!entry)
return;
get_file(filp);
entry->filp = filp;
entry->wait_address = wait_address;
- init_waitqueue_entry(&entry->wait, current);
+ init_waitqueue_func_entry(&entry->wait, pollwake);
+ entry->wait.private = pwq;
add_wait_queue(wait_address, &entry->wait);
}
+long poll_schedule_timeout(struct poll_wqueues *pwq, int state, long timeout)
+{
+ DEFINE_WAIT(wait);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!pwq->triggered)
+ timeout = schedule_timeout(timeout);
+ __set_current_state(TASK_RUNNING);
+
+ /* clear triggered for the next iteration */
+ pwq->triggered = 0;
+
+ return timeout;
+}
+
+EXPORT_SYMBOL(poll_schedule_timeout);
+
#define FDS_IN(fds, n) (fds->in + n)
#define FDS_OUT(fds, n) (fds->out + n)
#define FDS_EX(fds, n) (fds->ex + n)
@@ -205,8 +234,6 @@ int do_select(int n, fd_set_bits *fds, s
unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
long __timeout;
- set_current_state(TASK_INTERRUPTIBLE);
-
inp = fds->in; outp = fds->out; exp = fds->ex;
rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
@@ -277,11 +304,11 @@ int do_select(int n, fd_set_bits *fds, s
__timeout = *timeout;
*timeout = 0;
}
- __timeout = schedule_timeout(__timeout);
+ __timeout = poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
+ __timeout);
if (*timeout >= 0)
*timeout += __timeout;
}
- __set_current_state(TASK_RUNNING);
poll_freewait(&table);
@@ -587,7 +614,6 @@ static int do_poll(unsigned int nfds, s
struct poll_list *walk;
long __timeout;
- set_current_state(TASK_INTERRUPTIBLE);
for (walk = list; walk != NULL; walk = walk->next) {
struct pollfd * pfd, * pfd_end;
@@ -635,11 +661,11 @@ static int do_poll(unsigned int nfds, s
*timeout = 0;
}
- __timeout = schedule_timeout(__timeout);
+ __timeout = poll_schedule_timeout(wait, TASK_INTERRUPTIBLE,
+ __timeout);
if (*timeout >= 0)
*timeout += __timeout;
}
- __set_current_state(TASK_RUNNING);
return count;
}
Index: work/include/linux/poll.h
===================================================================
--- work.orig/include/linux/poll.h
+++ work/include/linux/poll.h
@@ -57,6 +57,8 @@ struct poll_table_entry {
struct poll_wqueues {
poll_table pt;
struct poll_table_page * table;
+ struct task_struct * polling_task;
+ int triggered;
int error;
int inline_index;
struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
@@ -64,6 +66,13 @@ struct poll_wqueues {
extern void poll_initwait(struct poll_wqueues *pwq);
extern void poll_freewait(struct poll_wqueues *pwq);
+extern long poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+ long timeout);
+
+static inline long poll_schedule(struct poll_wqueues *pwq, int state)
+{
+ return poll_schedule_timeout(pwq, state, MAX_SCHEDULE_TIMEOUT);
+}
/*
* Scaleable version of the fd_set.
Index: work/drivers/media/video/v4l1-compat.c
===================================================================
--- work.orig/drivers/media/video/v4l1-compat.c
+++ work/drivers/media/video/v4l1-compat.c
@@ -209,7 +209,6 @@ static int poll_one(struct file *file, s
table = &pwq->pt;
for (;;) {
int mask;
- set_current_state(TASK_INTERRUPTIBLE);
mask = file->f_op->poll(file, table);
if (mask & POLLIN)
break;
@@ -218,7 +217,7 @@ static int poll_one(struct file *file, s
retval = -ERESTARTSYS;
break;
}
- schedule();
+ poll_schedule(pwq, TASK_INTERRUPTIBLE);
}
set_current_state(TASK_RUNNING);
poll_freewait(pwq);
Index: work/Documentation/filesystems/Locking
===================================================================
--- work.orig/Documentation/filesystems/Locking
+++ work/Documentation/filesystems/Locking
@@ -396,7 +396,7 @@ prototypes:
};
locking rules:
- All except ->poll() may block.
+ All may block.
BKL
llseek: no (see below)
read: no
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 1/2] wait: kill is_sync_wait()
2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo
2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo
@ 2008-08-26 14:09 ` Ingo Molnar
1 sibling, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2008-08-26 14:09 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Al Viro, hch, Linux Kernel Mailing List,
Eric Van Hensbergen, Ron Minnich, v9fs-developer
* Tejun Heo <tj@kernel.org> wrote:
> is_sync_wait() is used to distinguish between sync and async waits.
> Basically sync waits are the ones initialized with
> init_waitqueue_entry() and async ones with init_waitqueue_func_entry().
> The sync/async distinction is used only in prepare_to_wait[_exclusive]()
> and its only function is to skip setting the current task state if the
> wait is async. This has a few problems.
>
> * No one uses it. None of func_entry users use prepare_to_wait()
> functions, so the code path never gets executed.
>
> * The distinction is bogus. Maybe back when func_entry is used only
> by aio but it's now also used by epoll and in future possibly by 9p
> and poll/select.
>
> * Taking @state as argument and ignoring it silenly depending on how
> @wait is initialized is just a bad error-prone API.
>
> * It prevents func_entry waits from using wait->private for no good
> reason.
>
> This patch kills is_sync_wait() and the associated code paths from
> prepare_to_wait[_exclusive](). As there was no user of these code
> paths, this patch doesn't cause any behavior difference.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
good spotting.
Acked-by: Ingo Molnar <mingo@elte.hu>
Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep
2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo
@ 2008-08-26 16:58 ` Linus Torvalds
2008-08-26 17:48 ` Tejun Heo
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Linus Torvalds @ 2008-08-26 16:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List,
Eric Van Hensbergen, Ron Minnich, v9fs-developer
On Tue, 26 Aug 2008, Tejun Heo wrote:
>
> This patch converts poll/select to use custom wake up function and use
> separate triggered variable to synchronize against wake up events.
> The only added overhead is an extra function call during wake up and
> negligible.
I don't really see the point.
poll() isn't allowed to sleep for many reasons. Some are technical. But
the most obvious one is that a sleeping "poll()" is totally against the
whole point of polling in the first place!
So is there some big conceptual reason to change how poll() has always
worked?
If you worry about debuggability, then we could just add a
preempt_enable();
..
preempt_disable();
around the poll calls (purely for catching errors) to get a big warning if
somebody tries to call a sleepable function.
Linus
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep
2008-08-26 16:58 ` Linus Torvalds
@ 2008-08-26 17:48 ` Tejun Heo
2008-08-26 19:51 ` Christoph Hellwig
2008-08-26 20:22 ` Ingo Molnar
2 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2008-08-26 17:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List,
Eric Van Hensbergen, Ron Minnich, v9fs-developer, Takashi Iwai
(cc'ing Takashi Iwai)
Hello, Linus.
Linus Torvalds wrote:
>> This patch converts poll/select to use custom wake up function and use
>> separate triggered variable to synchronize against wake up events.
>> The only added overhead is an extra function call during wake up and
>> negligible.
>
> I don't really see the point.
>
> poll() isn't allowed to sleep for many reasons. Some are technical. But
> the most obvious one is that a sleeping "poll()" is totally against the
> whole point of polling in the first place!
>
> So is there some big conceptual reason to change how poll() has always
> worked?
No big conceptual reason at all. I was trying to implement poll for
FUSE so that character device can be implemented in userspace such that
/dev/dsp can be routed to pulseaudio (or whatever) and for userland
filesystem, non-sleep poll just couldn't be made to work and there
didn't seem to be no big reason for poll not being able to sleep, so...
The OSS sound routing is probably a bit too late but the exclusion
between OSS and the modern sound system is still quite annoying and for
simple sound apps OSS interface is just much simpler as it's just normal
file IOs and fully contained in kernel (e.g. arch-um host sound support).
> If you worry about debuggability, then we could just add a
>
> preempt_enable();
> ..
> preempt_disable();
>
> around the poll calls (purely for catching errors) to get a big warning if
> somebody tries to call a sleepable function.
Yeap, that will be a nice debug option to have if this patch gets NACKed.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep
2008-08-26 16:58 ` Linus Torvalds
2008-08-26 17:48 ` Tejun Heo
@ 2008-08-26 19:51 ` Christoph Hellwig
2008-08-26 20:22 ` Ingo Molnar
2 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2008-08-26 19:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Ingo Molnar, Al Viro, hch, Linux Kernel Mailing List,
Eric Van Hensbergen, Ron Minnich, v9fs-developer, cbe-oss-dev
On Tue, Aug 26, 2008 at 09:58:14AM -0700, Linus Torvalds wrote:
> I don't really see the point.
>
> poll() isn't allowed to sleep for many reasons. Some are technical. But
> the most obvious one is that a sleeping "poll()" is totally against the
> whole point of polling in the first place!
>
> So is there some big conceptual reason to change how poll() has always
> worked?
Just as a little sidenote most files on spufs have a ->poll that sleeps,
and currently we don't have any debugging to catch this, I just noticed
this by accident. The reason that it sleeps is because it needs to grab
a sleping lock, and as that lock protects against a rather complicate
type of hardware context switch it's not possible to replace it with
a spinlock.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH 2/2] poll: allow f_op->poll to sleep
2008-08-26 16:58 ` Linus Torvalds
2008-08-26 17:48 ` Tejun Heo
2008-08-26 19:51 ` Christoph Hellwig
@ 2008-08-26 20:22 ` Ingo Molnar
2 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2008-08-26 20:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Al Viro, hch, Linux Kernel Mailing List,
Eric Van Hensbergen, Ron Minnich, v9fs-developer
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 26 Aug 2008, Tejun Heo wrote:
> >
> > This patch converts poll/select to use custom wake up function and use
> > separate triggered variable to synchronize against wake up events.
> > The only added overhead is an extra function call during wake up and
> > negligible.
>
> I don't really see the point.
>
> poll() isn't allowed to sleep for many reasons. Some are technical.
> But the most obvious one is that a sleeping "poll()" is totally
> against the whole point of polling in the first place!
>
> So is there some big conceptual reason to change how poll() has always
> worked?
i think the point is to replace potentially fragile atomic code with
less restricted 'potentially sleeping' code.
For example a GFP_KERNEL allocation will not sleep in 99% of the cases,
and if it sleeps we are under such memory pressure that we dont really
care all that much whether we happen to sleep in poll() or in the next
userspace pagefault.
The whole ->poll() handler model would be less restricted. I dont think
that's a bad idea - all our atomic contexts suffer from programmability
restrictions, and people are trying to get away from them.
Plus it seems to enable FUSE some more as well - and FUSE has been
spearheading a lot of filesystem development lately. FUSE is IMO the
useful form of microkernels - a fast prototyping platform with a very
friendly user-space programming and debugging interface. (and the fact
that FUSE based filesystems can be packaged up and distributed a lot
easier than kernel changes is an argument in favor of FUSE as well)
Ingo
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2008-08-26 20:23 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-26 13:57 [PATCH 1/2] wait: kill is_sync_wait() Tejun Heo
2008-08-26 13:59 ` [PATCH 2/2] poll: allow f_op->poll to sleep Tejun Heo
2008-08-26 16:58 ` Linus Torvalds
2008-08-26 17:48 ` Tejun Heo
2008-08-26 19:51 ` Christoph Hellwig
2008-08-26 20:22 ` Ingo Molnar
2008-08-26 14:09 ` [PATCH 1/2] wait: kill is_sync_wait() Ingo Molnar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox