public inbox for linux-kernel@vger.kernel.org
* Re: + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree
@ 2008-11-25 17:30 Oleg Nesterov
  2008-11-25 21:08 ` Davide Libenzi
  2008-11-26  4:33 ` Tejun Heo
  0 siblings, 2 replies; 14+ messages in thread
From: Oleg Nesterov @ 2008-11-25 17:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric Van Hensbergen, Ron Minnich, Ingo Molnar, Christoph Hellwig,
	Miklos Szeredi, Davide Libenzi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	linux-kernel

Hi!

A minor question about memory barriers, just trying to understand the patch...

> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	struct poll_wqueues *pwq = wait->private;
> +	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
> +
> +	set_mb(pwq->triggered, 1);
> +
> +	/*
> +	 * Perform the default wake up operation using a dummy
> +	 * waitqueue.
> +	 *
> +	 * TODO: This is hacky but there currently is no interface to
> +	 * pass in @sync.  @sync is scheduled to be removed and once
> +	 * that happens, wake_up_process() can be used directly.
> +	 */
> +	return default_wake_function(&dummy_wait, mode, sync, key);
> +}
> +
> +int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> +			  ktime_t *expires, unsigned long slack)
> +{
> +	int rc = -EINTR;
> +
> +	set_current_state(state);
> +	if (!pwq->triggered)
> +		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
> +	__set_current_state(TASK_RUNNING);

So, why do we need this mb() in pollwake() ?

try_to_wake_up() has full barrier semantics; note the wmb() before
task_rq_lock(). Since spin_lock() itself is a STORE, the setting of
pwq->triggered can't be re-ordered past the reading of p->state.

Or any other reason ?

> +	/* clear triggered for the next iteration */
> +	pwq->triggered = 0;

And don't we (in theory) actually need the mb() here instead?

Let's suppose do_poll() starts the next iteration, so we are doing

	pwq->triggered = 0;

	->poll(file)
		if (!check_file(file))
			return 0;

		return POLLXXX;

We don't have any barriers in between (unless fget_light bumps
->f_count), so this can be reordered as

	->poll(file)
		if (!check_file(file))
			return 0;

		pwq->triggered = 0;

And, if pollwake() happens in between we can miss the event, no?
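This interleaving can be replayed deterministically in a userspace sketch (plain C; check_file(), pollwake() and the two flags are hypothetical stand-ins for the kernel paths, not kernel code):

```c
#include <assert.h>

static int triggered;   /* pwq->triggered analogue */
static int dev_events;  /* dev->events analogue */

static int check_file(void) { return dev_events; }  /* body of ->poll() */
static void pollwake(void)  { triggered = 1; }      /* device-side wakeup */

/* Replay the reordered iteration: the ->poll() check runs before the
 * (sunk) store clearing ->triggered.  Returns the mask the poller saw. */
static int replay(void)
{
	int mask = check_file();  /* reordered check: no events yet */

	dev_events = 1;           /* device: event becomes pending...   */
	pollwake();               /* ...and the wakeup fires in between */

	triggered = 0;            /* late store erases the wakeup flag  */
	return mask;              /* 0: the poller will sleep anyway    */
}
```

After replay() the poller has seen no events and no trigger, yet an event is pending: exactly the lost wakeup described above.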

Oleg.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree
  2008-11-25 17:30 + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Oleg Nesterov
@ 2008-11-25 21:08 ` Davide Libenzi
  2008-11-26  4:33 ` Tejun Heo
  1 sibling, 0 replies; 14+ messages in thread
From: Davide Libenzi @ 2008-11-25 21:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Tejun Heo, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On Tue, 25 Nov 2008, Oleg Nesterov wrote:

> > +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> > +{
> > +	struct poll_wqueues *pwq = wait->private;
> > +	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
> > +
> > +	set_mb(pwq->triggered, 1);
> > +
> > +	/*
> > +	 * Perform the default wake up operation using a dummy
> > +	 * waitqueue.
> > +	 *
> > +	 * TODO: This is hacky but there currently is no interface to
> > +	 * pass in @sync.  @sync is scheduled to be removed and once
> > +	 * that happens, wake_up_process() can be used directly.
> > +	 */
> > +	return default_wake_function(&dummy_wait, mode, sync, key);
> > +}
> > +
> > +int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> > +			  ktime_t *expires, unsigned long slack)
> > +{
> > +	int rc = -EINTR;
> > +
> > +	set_current_state(state);
> > +	if (!pwq->triggered)
> > +		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
> > +	__set_current_state(TASK_RUNNING);
> 
> So, why do we need this mb() in pollwake() ?
> 
> try_to_wake_up() has full barrier semantics; note the wmb() before
> task_rq_lock(). Since spin_lock() itself is a STORE, the setting of
> pwq->triggered can't be re-ordered past the reading of p->state.

Agreed.



> > +	/* clear triggered for the next iteration */
> > +	pwq->triggered = 0;
> 
> And don't we (in theory) actually need the mb() here instead?
> 
> Let's suppose do_poll() starts the next iteration, so we are doing
> 
> 	pwq->triggered = 0;
> 
> 	->poll(file)
> 		if (!check_file(file))
> 			return 0;
> 
> 		return POLLXXX;
> 
> We don't have any barriers in between (unless fget_light bumps
> ->f_count), so this can be reordered as
> 
> 	->poll(file)
> 		if (!check_file(file))
> 			return 0;
> 
> 		pwq->triggered = 0;
> 
> And, if pollwake() happens in between we can miss the event, no?

About ->triggered, on the sys_poll() side we have:

do_pollfd()
	file->f_op->poll()
		poll_wait(file, QUEUE, WAIT);
			add_wait_queue(QUEUE, WAIT);
				spin_lock(QUEUE->lock);
				...
				spin_unlock(QUEUE->lock);
		return dev->events;
...
set_current_state(TASK_INTERRUPTIBLE);
if (!got_events && !triggered)
	schedule();

Device side:

dev->events = XXXX;
__wake_up(QUEUE);
	spin_lock(QUEUE->lock);
	pollwake(WAIT);
		triggered = 1;
		default_wake_function();
			__try_to_wake_up();
	spin_unlock(QUEUE->lock);

AFAICS, because of the lock being taken, and because of the 
set_current_state()/schedule() logic, there's no need for any ->triggered 
synchronization.
Either we're on the device QUEUE, or we see the new dev->events, so no 
wakeup or events can be lost.
The only thing that can happen is that an earlier device already reported 
events ready, so we follow up the loop with WAIT==NULL and 
poll_wait() not dropping into the QUEUE (hence not grabbing the QUEUE 
lock). But at that point it doesn't matter, because got_events != 0.
IMO there's no need for any mb() on ->triggered. Check?



- Davide




* Re: + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree
  2008-11-25 17:30 + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Oleg Nesterov
  2008-11-25 21:08 ` Davide Libenzi
@ 2008-11-26  4:33 ` Tejun Heo
  2008-11-26  4:40   ` [PATCH] poll: allow f_op->poll to sleep, take#5 Tejun Heo
  2008-11-26  4:49   ` + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Tejun Heo
  1 sibling, 2 replies; 14+ messages in thread
From: Tejun Heo @ 2008-11-26  4:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric Van Hensbergen, Ron Minnich, Ingo Molnar, Christoph Hellwig,
	Miklos Szeredi, Davide Libenzi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	linux-kernel

Hello,

Oleg Nesterov wrote:
> So, why do we need this mb() in pollwake() ?
> 
> try_to_wake_up() has full barrier semantics; note the wmb() before
> task_rq_lock(). Since spin_lock() itself is a STORE, the setting of
> pwq->triggered can't be re-ordered past the reading of p->state.
> 
> Or any other reason ?

try_to_wake_up() is a full barrier.  Is that guaranteed and
intentional, or is it just something which happened to be so?
Also, as the function is doing some dirty hackery to get to
try_to_wake_up(), I just wanted to make it explicit.  I suppose it's
time to add more comments there then.

>> +	/* clear triggered for the next iteration */
>> +	pwq->triggered = 0;
> 
> And don't we (in theory) actually need the mb() here instead?
> 
> Let's suppose do_poll() starts the next iteration, so we are doing
> 
> 	pwq->triggered = 0;
> 
> 	->poll(file)
> 		if (!check_file(file))
> 			return 0;
> 
> 		return POLLXXX;
> 
> We don't have any barriers in between (unless fget_light bumps
> ->f_count), so this can be reordered as
> 
> 	->poll(file)
> 		if (!check_file(file))
> 			return 0;
> 
> 		pwq->triggered = 0;
> 
> And, if pollwake() happens in between we can miss the event, no?

Hmmmm... yes, from the second run onwards, ->poll doesn't grab the
waitqueue lock, so it doesn't necessarily have the required barriers.
Heh... set_mb() should be here, not in pollwake().  Thanks for spotting
it.

-- 
tejun


* [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26  4:33 ` Tejun Heo
@ 2008-11-26  4:40   ` Tejun Heo
  2008-11-26  6:27     ` Davide Libenzi
  2008-11-26  4:49   ` + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Tejun Heo
  1 sibling, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2008-11-26  4:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric Van Hensbergen, Ron Minnich, Ingo Molnar, Christoph Hellwig,
	Miklos Szeredi, Davide Libenzi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	linux-kernel

Subject: poll: allow f_op->poll to sleep
From: Tejun Heo <htejun@gmail.com>

f_op->poll is the only vfs operation which is not allowed to sleep.  This
is because the poll and select implementation used the task state to
synchronize against wake ups, which doesn't have to be the case anymore,
as the wait/wake interface can now use custom wake up functions.  The
non-sleep restriction can be a bit tricky because ->poll is not called
from an atomic context, and the result of accidentally sleeping in ->poll
only shows up as temporary busy looping when the timing is right, or
rather wrong.

This patch converts poll/select to use a custom wake up function and a
separate triggered variable to synchronize against wake up events.  The
only added overhead is an extra function call during wake up, which is
negligible.

This patch removes the one non-sleep exception from vfs locking rules and
is beneficial to userland filesystem implementations like FUSE, 9p or
peculiar fs like spufs as it's very difficult for those to implement
non-sleeping poll method.

While at it, make the following cosmetic changes to make poll.h and
select.c checkpatch friendly.

* s/type * symbol/type *symbol/		   : three places in poll.h
* remove blank line before EXPORT_SYMBOL() : two places in select.c

Oleg: spotted unnecessary use of set_mb() in pollwake() and missing
      one in poll_schedule_timeout().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Brad Boyer <flar@allandria.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/Locking |    2 -
 drivers/media/video/v4l1-compat.c |    4 --
 fs/select.c                       |   66 +++++++++++++++++++++++++++++---------
 include/linux/poll.h              |   15 ++++++--
 4 files changed, 66 insertions(+), 21 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 23d2f44..250f3c4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -398,7 +398,7 @@ prototypes:
 };
 
 locking rules:
-	All except ->poll() may block.
+	All may block.
 			BKL
 llseek:			no	(see below)
 read:			no
diff --git a/drivers/media/video/v4l1-compat.c b/drivers/media/video/v4l1-compat.c
index f13c0a9..d0253da 100644
--- a/drivers/media/video/v4l1-compat.c
+++ b/drivers/media/video/v4l1-compat.c
@@ -203,7 +203,6 @@ static int poll_one(struct file *file, struct poll_wqueues *pwq)
 	table = &pwq->pt;
 	for (;;) {
 		int mask;
-		set_current_state(TASK_INTERRUPTIBLE);
 		mask = file->f_op->poll(file, table);
 		if (mask & POLLIN)
 			break;
@@ -212,9 +211,8 @@ static int poll_one(struct file *file, struct poll_wqueues *pwq)
 			retval = -ERESTARTSYS;
 			break;
 		}
-		schedule();
+		poll_schedule(pwq, TASK_INTERRUPTIBLE);
 	}
-	set_current_state(TASK_RUNNING);
 	poll_freewait(pwq);
 	return retval;
 }
diff --git a/fs/select.c b/fs/select.c
index 87df51e..cd6a5b7 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -109,11 +109,11 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 void poll_initwait(struct poll_wqueues *pwq)
 {
 	init_poll_funcptr(&pwq->pt, __pollwait);
+	pwq->polling_task = current;
 	pwq->error = 0;
 	pwq->table = NULL;
 	pwq->inline_index = 0;
 }
-
 EXPORT_SYMBOL(poll_initwait);
 
 static void free_poll_entry(struct poll_table_entry *entry)
@@ -142,12 +142,10 @@ void poll_freewait(struct poll_wqueues *pwq)
 		free_page((unsigned long) old);
 	}
 }
-
 EXPORT_SYMBOL(poll_freewait);
 
-static struct poll_table_entry *poll_get_entry(poll_table *_p)
+static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 {
-	struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
 	struct poll_table_page *table = p->table;
 
 	if (p->inline_index < N_INLINE_POLL_ENTRIES)
@@ -159,7 +157,6 @@ static struct poll_table_entry *poll_get_entry(poll_table *_p)
 		new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
 		if (!new_table) {
 			p->error = -ENOMEM;
-			__set_current_state(TASK_RUNNING);
 			return NULL;
 		}
 		new_table->entry = new_table->entries;
@@ -171,20 +168,65 @@ static struct poll_table_entry *poll_get_entry(poll_table *_p)
 	return table->entry++;
 }
 
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct poll_wqueues *pwq = wait->private;
+	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
+
+	/*
+	 * Wake up functions have full barrier semantics, no need for
+	 * barrier here.
+	 */
+	pwq->triggered = 1;
+
+	/*
+	 * Perform the default wake up operation using a dummy
+	 * waitqueue.
+	 *
+	 * TODO: This is hacky but there currently is no interface to
+	 * pass in @sync.  @sync is scheduled to be removed and once
+	 * that happens, wake_up_process() can be used directly.
+	 */
+	return default_wake_function(&dummy_wait, mode, sync, key);
+}
+
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
 {
-	struct poll_table_entry *entry = poll_get_entry(p);
+	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
+	struct poll_table_entry *entry = poll_get_entry(pwq);
 	if (!entry)
 		return;
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
-	init_waitqueue_entry(&entry->wait, current);
+	init_waitqueue_func_entry(&entry->wait, pollwake);
+	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
 }
 
+int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+			  ktime_t *expires, unsigned long slack)
+{
+	int rc = -EINTR;
+
+	set_current_state(state);
+	if (!pwq->triggered)
+		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
+	__set_current_state(TASK_RUNNING);
+
+	/*
+	 * Prepare for the next iteration.  ->poll() might not have
+	 * enough barrier semantics from the second round as waits are
+	 * registered only during the first one.  Use set_mb().
+	 */
+	set_mb(pwq->triggered, 0);
+
+	return rc;
+}
+EXPORT_SYMBOL(poll_schedule_timeout);
+
 /**
  * poll_select_set_timeout - helper function to setup the timeout value
  * @to:		pointer to timespec variable for the final timeout
@@ -340,8 +382,6 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 	for (;;) {
 		unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-
 		inp = fds->in; outp = fds->out; exp = fds->ex;
 		rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
 
@@ -411,10 +451,10 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			to = &expire;
 		}
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
+					   to, slack))
 			timed_out = 1;
 	}
-	__set_current_state(TASK_RUNNING);
 
 	poll_freewait(&table);
 
@@ -666,7 +706,6 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 	for (;;) {
 		struct poll_list *walk;
 
-		set_current_state(TASK_INTERRUPTIBLE);
 		for (walk = list; walk != NULL; walk = walk->next) {
 			struct pollfd * pfd, * pfd_end;
 
@@ -709,10 +748,9 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 			to = &expire;
 		}
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))
 			timed_out = 1;
 	}
-	__set_current_state(TASK_RUNNING);
 	return count;
 }
 
diff --git a/include/linux/poll.h b/include/linux/poll.h
index badd98a..8c24ef8 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -46,9 +46,9 @@ static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 }
 
 struct poll_table_entry {
-	struct file * filp;
+	struct file *filp;
 	wait_queue_t wait;
-	wait_queue_head_t * wait_address;
+	wait_queue_head_t *wait_address;
 };
 
 /*
@@ -56,7 +56,9 @@ struct poll_table_entry {
  */
 struct poll_wqueues {
 	poll_table pt;
-	struct poll_table_page * table;
+	struct poll_table_page *table;
+	struct task_struct *polling_task;
+	int triggered;
 	int error;
 	int inline_index;
 	struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
@@ -64,6 +66,13 @@ struct poll_wqueues {
 
 extern void poll_initwait(struct poll_wqueues *pwq);
 extern void poll_freewait(struct poll_wqueues *pwq);
+extern int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+				 ktime_t *expires, unsigned long slack);
+
+static inline int poll_schedule(struct poll_wqueues *pwq, int state)
+{
+	return poll_schedule_timeout(pwq, state, NULL, 0);
+}
 
 /*
  * Scaleable version of the fd_set.


* Re: + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree
  2008-11-26  4:33 ` Tejun Heo
  2008-11-26  4:40   ` [PATCH] poll: allow f_op->poll to sleep, take#5 Tejun Heo
@ 2008-11-26  4:49   ` Tejun Heo
  1 sibling, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2008-11-26  4:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric Van Hensbergen, Ron Minnich, Ingo Molnar, Christoph Hellwig,
	Miklos Szeredi, Davide Libenzi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	linux-kernel

Tejun Heo wrote:
>> And don't we (in theory) actually need the mb() here instead?
>>
>> Let's suppose do_poll() starts the next iteration, so we are doing
>>
>> 	pwq->triggered = 0;
>>
>> 	->poll(file)
>> 		if (!check_file(file))
>> 			return 0;
>>
>> 		return POLLXXX;
>>
>> We don't have any barriers in between (unless fget_light bumps
>> ->f_count), so this can be reordered as
>>
>> 	->poll(file)
>> 		if (!check_file(file))
>> 			return 0;
>>
>> 		pwq->triggered = 0;
>>
>> And, if pollwake() happens in between we can miss the event, no?
> 
> Hmmmm... yes, from the second run onwards, ->poll doesn't grab the
> waitqueue lock, so it doesn't necessarily have the required barriers.
> Heh... set_mb() should be here, not in pollwake().  Thanks for spotting
> it.

Oh, I remembered why I didn't use set_mb() there.  The logic was that
once the wait is over, whether by the event triggering or by timeout,
poll/select finishes with either a valid event or the timeout.  But that
isn't true, as the wake up could be spurious due to implementation
details or event masking, so yes, we do need the barrier there.
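The spurious-wakeup point is the same one behind the classic condition-variable pattern; a minimal userspace analogue (pthreads; the mapping to poll_schedule_timeout() and ->triggered is illustrative, not the kernel implementation) always re-checks the predicate in a loop:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool event_pending;

static void post_event(void)
{
	pthread_mutex_lock(&lock);
	event_pending = true;          /* like dev->events + ->triggered */
	pthread_mutex_unlock(&lock);
	pthread_cond_signal(&cond);
}

static void wait_for_event(void)
{
	pthread_mutex_lock(&lock);
	while (!event_pending)         /* re-check: wakeups may be spurious */
		pthread_cond_wait(&cond, &lock);
	event_pending = false;         /* consume, like clearing ->triggered */
	pthread_mutex_unlock(&lock);
}
```

Because the waiter re-checks after every wakeup, a spurious return from the wait is harmless; do_poll() relies on the same loop structure, which is why a missed ->triggered update (rather than a spurious one) is the dangerous case.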

Thanks.

-- 
tejun


* Re: [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26  4:40   ` [PATCH] poll: allow f_op->poll to sleep, take#5 Tejun Heo
@ 2008-11-26  6:27     ` Davide Libenzi
  2008-11-26  6:39       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Davide Libenzi @ 2008-11-26  6:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On Wed, 26 Nov 2008, Tejun Heo wrote:

> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	struct poll_wqueues *pwq = wait->private;
> +	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
> +
> +	/*
> +	 * Wake up functions have full barrier semantics, no need for
> +	 * barrier here.
> +	 */
> +	pwq->triggered = 1;
> +
> +	/*
> +	 * Perform the default wake up operation using a dummy
> +	 * waitqueue.
> +	 *
> +	 * TODO: This is hacky but there currently is no interface to
> +	 * pass in @sync.  @sync is scheduled to be removed and once
> +	 * that happens, wake_up_process() can be used directly.
> +	 */
> +	return default_wake_function(&dummy_wait, mode, sync, key);
> +}
> +int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
> +			  ktime_t *expires, unsigned long slack)
> +{
> +	int rc = -EINTR;
> +
> +	set_current_state(state);
> +	if (!pwq->triggered)
> +		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
> +	__set_current_state(TASK_RUNNING);
> +
> +	/*
> +	 * Prepare for the next iteration.  ->poll() might not have
> +	 * enough barrier semantics from the second round as waits are
> +	 * registered only during the first one.  Use set_mb().
> +	 */
> +	set_mb(pwq->triggered, 0);
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL(poll_schedule_timeout);

Look, pollwake() does:

w1) WR triggered (1)
w2) WMB
w3) WR task->state (RUNNING)

While poll_schedule_timeout() does:

s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)


The only risk is that w3 precedes s1, so that we go to sleep even though a 
wakeup has been issued. But if w3 is visible, w1 is visible too, which 
means that 'triggered' is visible in s3 (there's an MB in s2), so we skip 
the schedule_hrtimeout_range(). So IMO you need no barriers on 'triggered'.
If you feel you need barriers, do you mind explaining a sequence of events 
that makes a barrier-free version break?



- Davide




* Re: [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26  6:27     ` Davide Libenzi
@ 2008-11-26  6:39       ` Tejun Heo
  2008-11-26 19:36         ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2008-11-26  6:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

Hello,

Davide Libenzi wrote:
> Look, pollwake() does:
> 
> w1) WR triggered (1)
> w2) WMB
> w3) WR task->state (RUNNING)
> 
> While poll_schedule_timeout() does:
> 
> s1) WR task->state (TASK_INTERRUPTIBLE)
> s2) MB
> s3) RD triggered
> s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) after waking up, WR triggered to zero

> The only risk is that w3 precedes s1, so that we go to sleep even though a 
> wakeup has been issued. But if w3 is visible, w1 is visible too, which 
> means that 'triggered' is visible in s3 (there's an MB in s2), so we skip 
> the schedule_hrtimeout_range(). So IMO you need no barriers on 'triggered'.
> If you feel you need barriers, do you mind explaining a sequence of events 
> that makes a barrier-free version break?

s5 from the previous iteration could happen after w1 during the next
iteration, and the test in s4 of the next iteration will miss the
event, so the event could get lost on any iteration which is not the
first one, no?

Thanks.

-- 
tejun


* Re: [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26  6:39       ` Tejun Heo
@ 2008-11-26 19:36         ` Davide Libenzi
  2008-11-27  9:18           ` Tejun Heo
  2008-11-27 20:13           ` [PATCH] poll: allow f_op->poll to sleep, take#5 Oleg Nesterov
  0 siblings, 2 replies; 14+ messages in thread
From: Davide Libenzi @ 2008-11-26 19:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On Wed, 26 Nov 2008, Tejun Heo wrote:

> Hello,
> 
> Davide Libenzi wrote:
> > Look, pollwake() does:
> > 
> > w1) WR triggered (1)
> > w2) WMB
> > w3) WR task->state (RUNNING)
> > 
> > While poll_schedule_timeout() does:
> > 
> > s1) WR task->state (TASK_INTERRUPTIBLE)
> > s2) MB
> > s3) RD triggered
> > s4) IF0 => RD task->state (if !RUNNING -> sleep)
> s5) after waking up, WR triggered to zero
> 
> > The only risk is that w3 precedes s1, so that we go to sleep even though a 
> > wakeup has been issued. But if w3 is visible, w1 is visible too, which 
> > means that 'triggered' is visible in s3 (there's an MB in s2), so we skip 
> > the schedule_hrtimeout_range(). So IMO you need no barriers on 'triggered'.
> > If you feel you need barriers, do you mind explaining a sequence of events 
> > that makes a barrier-free version break?
> 
> s5 from the previous iteration could happen after w1 during the next
> iteration, and the test in s4 of the next iteration will miss the
> event, so the event could get lost on any iteration which is not the
> first one, no?

Hmmm, I just noticed that the set_current_state(TASK_INTERRUPTIBLE) at the 
beginning of the ->poll() loop has been dropped (and it makes sense, since 
now ->poll() can sleep). So the iterations after the first become the 
interesting ones.
Device side, via wakeup():

w1) WR dev->events
w2) WR triggered (1)
w3) WMB
w4) WR task->state (RUNNING)

On the poller side:
 
s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) RD dev->events

Now, it is very likely that after w1 there is some full mb, since the 
events (AKA internal manipulation of the device/file structure) happen 
inside a spinlocked region. So, if the write at s5 is actually able to 
overwrite the one at w2, the dev->events set at w1 are likely going to be 
visible at the immediately following ->poll() loop.
To be sure though, independently of the device/file event setting 
behavior, IMO we need ...
Device side:

w1) WR dev->events
w2) MB
w3) WR triggered (1)
w4) WMB
w5) WR task->state (RUNNING)

Poller side:
 
s1) WR task->state (TASK_INTERRUPTIBLE)
s2) MB
s3) RD triggered
s4) IF0 => RD task->state (if !RUNNING -> sleep)
s5) WR triggered (0)
s6) MB
s7) RD dev->events

That is, an MB before w3 (triggered=1) and a set_mb(triggered,0) at 
s5+s6. The spinlock on the queue taken before entering pollwake() is not 
enough to guarantee the required ordering, since a LOCK is no guarantee 
that operations before it are visible after the LOCK.
Without the MB at w2, the ordering [w3, s5, s7, w1] could happen, which 
would make us miss the event *and* sleep.
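The derived ordering can be sketched with C11 atomics (a userspace analogue, under the assumption that seq_cst operations model the kernel's set_mb()/smp_mb(); the actual task wake up at w4+w5 is elided):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int triggered;    /* pwq->triggered analogue */
static int dev_events;          /* plain store, published by the barriers */

static void waker(void)         /* device side */
{
	dev_events = 1;                  /* w1: record the event          */
	atomic_store(&triggered, 1);     /* w2+w3: full barrier + store   */
	/* w4+w5: wake the polling task -- elided in this sketch          */
}

static int poller_recheck(void) /* poller side, next iteration */
{
	/* s5+s6: set_mb(pwq->triggered, 0) -- clear with a full barrier  */
	int was_set = atomic_exchange(&triggered, 0);

	int events = dev_events;         /* s7: re-run the ->poll() check */
	if (was_set)
		assert(events == 1);     /* flag seen => event visible too */
	return events;
}
```

With both barriers in place, whenever the poller's clear observes the flag, the re-run of the check is guaranteed to observe the event as well, which is exactly the property the [w3, s5, s7, w1] ordering would violate.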



- Davide




* Re: [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26 19:36         ` Davide Libenzi
@ 2008-11-27  9:18           ` Tejun Heo
  2008-11-27  9:37             ` [PATCH] poll: allow f_op->poll to sleep, take#6 Tejun Heo
  2008-11-27 20:13           ` [PATCH] poll: allow f_op->poll to sleep, take#5 Oleg Nesterov
  1 sibling, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2008-11-27  9:18 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

Hello,

Davide Libenzi wrote:
> Hmmm, I just noticed that the set_current_state(TASK_INTERRUPTIBLE) at the 
> beginning of the ->poll() loop has been dropped (and it makes sense since 
> now ->poll() can sleep).

Yeah, that's exactly what the ->triggered condition replaces.

> w1) WR dev->events
> w2) MB
> w3) WR triggered (1)
> w4) WMB
> w5) WR task->state (RUNNING)
> 
> Poller side:
>  
> s1) WR task->state (TASK_INTERRUPTIBLE)
> s2) MB
> s3) RD triggered
> s4) IF0 => RD task->state (if !RUNNING -> sleep)
> s5) WR triggered (0)
> s6) MB
> s7) RD dev->events
> 
> That is, an MB before w3 (triggered=1) and a set_mb(triggered,0) at 
> s5+s6. The spinlock on the queue taken before entering pollwake() is not 
> enough to guarantee the required ordering, since a LOCK is no guarantee 
> that operations before it are visible after the LOCK.
> Without the MB at w2, it could happen [w3, s5, s7, w1] that will make us 
> miss the event *and* sleep.

Yeah, it seems we'll need something equivalent to the smp_wmb() in
try_to_wake_up().  So, the original set_mb() should have stayed
there, with just the latter one added.  Will prep yet another take of
the patch.  Thanks for the detailed analysis.

-- 
tejun


* [PATCH] poll: allow f_op->poll to sleep, take#6
  2008-11-27  9:18           ` Tejun Heo
@ 2008-11-27  9:37             ` Tejun Heo
  2008-11-28  4:35               ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2008-11-27  9:37 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

f_op->poll is the only vfs operation which is not allowed to sleep.  This
is because the poll and select implementation used the task state to
synchronize against wake ups, which doesn't have to be the case anymore,
as the wait/wake interface can now use custom wake up functions.  The
non-sleep restriction can be a bit tricky because ->poll is not called
from an atomic context, and the result of accidentally sleeping in ->poll
only shows up as temporary busy looping when the timing is right, or
rather wrong.

This patch converts poll/select to use a custom wake up function and a
separate triggered variable to synchronize against wake up events.  The
only added overhead is an extra function call during wake up, which is
negligible.

This patch removes the one non-sleep exception from vfs locking rules and
is beneficial to userland filesystem implementations like FUSE, 9p or
peculiar fs like spufs as it's very difficult for those to implement
non-sleeping poll method.

While at it, make the following cosmetic changes to make poll.h and
select.c checkpatch friendly.

* s/type * symbol/type *symbol/		   : three places in poll.h
* remove blank line before EXPORT_SYMBOL() : two places in select.c

Oleg: spotted missing barrier in poll_schedule_timeout()
Davide: spotted missing write barrier in pollwake()

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Brad Boyer <flar@allandria.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roland McGrath <roland@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Oleg Nesterov <oleg@redhat.com>
---
 Documentation/filesystems/Locking |    2 -
 drivers/media/video/v4l1-compat.c |    4 --
 fs/select.c                       |   76 +++++++++++++++++++++++++++++++-------
 include/linux/poll.h              |   15 ++++++-
 4 files changed, 76 insertions(+), 21 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 23d2f44..250f3c4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -398,7 +398,7 @@ prototypes:
 };
 
 locking rules:
-	All except ->poll() may block.
+	All may block.
 			BKL
 llseek:			no	(see below)
 read:			no
diff --git a/drivers/media/video/v4l1-compat.c b/drivers/media/video/v4l1-compat.c
index f13c0a9..d0253da 100644
--- a/drivers/media/video/v4l1-compat.c
+++ b/drivers/media/video/v4l1-compat.c
@@ -203,7 +203,6 @@ static int poll_one(struct file *file, struct poll_wqueues *pwq)
 	table = &pwq->pt;
 	for (;;) {
 		int mask;
-		set_current_state(TASK_INTERRUPTIBLE);
 		mask = file->f_op->poll(file, table);
 		if (mask & POLLIN)
 			break;
@@ -212,9 +211,8 @@ static int poll_one(struct file *file, struct poll_wqueues *pwq)
 			retval = -ERESTARTSYS;
 			break;
 		}
-		schedule();
+		poll_schedule(pwq, TASK_INTERRUPTIBLE);
 	}
-	set_current_state(TASK_RUNNING);
 	poll_freewait(pwq);
 	return retval;
 }
diff --git a/fs/select.c b/fs/select.c
index 87df51e..08b91be 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -109,11 +109,11 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 void poll_initwait(struct poll_wqueues *pwq)
 {
 	init_poll_funcptr(&pwq->pt, __pollwait);
+	pwq->polling_task = current;
 	pwq->error = 0;
 	pwq->table = NULL;
 	pwq->inline_index = 0;
 }
-
 EXPORT_SYMBOL(poll_initwait);
 
 static void free_poll_entry(struct poll_table_entry *entry)
@@ -142,12 +142,10 @@ void poll_freewait(struct poll_wqueues *pwq)
 		free_page((unsigned long) old);
 	}
 }
-
 EXPORT_SYMBOL(poll_freewait);
 
-static struct poll_table_entry *poll_get_entry(poll_table *_p)
+static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 {
-	struct poll_wqueues *p = container_of(_p, struct poll_wqueues, pt);
 	struct poll_table_page *table = p->table;
 
 	if (p->inline_index < N_INLINE_POLL_ENTRIES)
@@ -159,7 +157,6 @@ static struct poll_table_entry *poll_get_entry(poll_table *_p)
 		new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
 		if (!new_table) {
 			p->error = -ENOMEM;
-			__set_current_state(TASK_RUNNING);
 			return NULL;
 		}
 		new_table->entry = new_table->entries;
@@ -171,20 +168,75 @@ static struct poll_table_entry *poll_get_entry(poll_table *_p)
 	return table->entry++;
 }
 
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct poll_wqueues *pwq = wait->private;
+	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
+
+	/*
+	 * Although this function is called under waitqueue lock, LOCK
+	 * doesn't imply write barrier and the users expect write
+	 * barrier semantics on wakeup functions.  The following
+	 * smp_wmb() is equivalent to smp_wmb() in try_to_wake_up()
+	 * and is paired with set_mb() in poll_schedule_timeout.
+	 */
+	smp_wmb();
+	pwq->triggered = 1;
+
+	/*
+	 * Perform the default wake up operation using a dummy
+	 * waitqueue.
+	 *
+	 * TODO: This is hacky but there currently is no interface to
+	 * pass in @sync.  @sync is scheduled to be removed and once
+	 * that happens, wake_up_process() can be used directly.
+	 */
+	return default_wake_function(&dummy_wait, mode, sync, key);
+}
+
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
 {
-	struct poll_table_entry *entry = poll_get_entry(p);
+	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
+	struct poll_table_entry *entry = poll_get_entry(pwq);
 	if (!entry)
 		return;
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
-	init_waitqueue_entry(&entry->wait, current);
+	init_waitqueue_func_entry(&entry->wait, pollwake);
+	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
 }
 
+int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+			  ktime_t *expires, unsigned long slack)
+{
+	int rc = -EINTR;
+
+	set_current_state(state);
+	if (!pwq->triggered)
+		rc = schedule_hrtimeout_range(expires, slack, HRTIMER_MODE_ABS);
+	__set_current_state(TASK_RUNNING);
+
+	/*
+	 * Prepare for the next iteration.
+	 *
+	 * The following set_mb() serves two purposes.  First, it's
+	 * the counterpart rmb of the wmb in pollwake() such that data
+	 * written before wake up is always visible after wake up.
+	 * Second, the full barrier guarantees that triggered clearing
+	 * doesn't pass event check of the next iteration.  Note that
+	 * this problem doesn't exist for the first iteration as
+	 * add_wait_queue() has full barrier semantics.
+	 */
+	set_mb(pwq->triggered, 0);
+
+	return rc;
+}
+EXPORT_SYMBOL(poll_schedule_timeout);
+
 /**
  * poll_select_set_timeout - helper function to setup the timeout value
  * @to:		pointer to timespec variable for the final timeout
@@ -340,8 +392,6 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 	for (;;) {
 		unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-
 		inp = fds->in; outp = fds->out; exp = fds->ex;
 		rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
 
@@ -411,10 +461,10 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			to = &expire;
 		}
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+		if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
+					   to, slack))
 			timed_out = 1;
 	}
-	__set_current_state(TASK_RUNNING);
 
 	poll_freewait(&table);
 
@@ -666,7 +716,6 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 	for (;;) {
 		struct poll_list *walk;
 
-		set_current_state(TASK_INTERRUPTIBLE);
 		for (walk = list; walk != NULL; walk = walk->next) {
 			struct pollfd * pfd, * pfd_end;
 
@@ -709,10 +758,9 @@ static int do_poll(unsigned int nfds,  struct poll_list *list,
 			to = &expire;
 		}
 
-		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))
 			timed_out = 1;
 	}
-	__set_current_state(TASK_RUNNING);
 	return count;
 }
 
diff --git a/include/linux/poll.h b/include/linux/poll.h
index badd98a..8c24ef8 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -46,9 +46,9 @@ static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 }
 
 struct poll_table_entry {
-	struct file * filp;
+	struct file *filp;
 	wait_queue_t wait;
-	wait_queue_head_t * wait_address;
+	wait_queue_head_t *wait_address;
 };
 
 /*
@@ -56,7 +56,9 @@ struct poll_table_entry {
  */
 struct poll_wqueues {
 	poll_table pt;
-	struct poll_table_page * table;
+	struct poll_table_page *table;
+	struct task_struct *polling_task;
+	int triggered;
 	int error;
 	int inline_index;
 	struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
@@ -64,6 +66,13 @@ struct poll_wqueues {
 
 extern void poll_initwait(struct poll_wqueues *pwq);
 extern void poll_freewait(struct poll_wqueues *pwq);
+extern int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+				 ktime_t *expires, unsigned long slack);
+
+static inline int poll_schedule(struct poll_wqueues *pwq, int state)
+{
+	return poll_schedule_timeout(pwq, state, NULL, 0);
+}
 
 /*
  * Scaleable version of the fd_set.

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] poll: allow f_op->poll to sleep, take#5
  2008-11-26 19:36         ` Davide Libenzi
  2008-11-27  9:18           ` Tejun Heo
@ 2008-11-27 20:13           ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2008-11-27 20:13 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Tejun Heo, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On 11/26, Davide Libenzi wrote:
>
> w1) WR dev->events
> w2) WR triggered (1)
> w3) WMB
> w4) WR task->state (RUNNING)
> ...
> That is, an MB before w3 (triggered=1) and a set_mb(triggered,0) at
> s5+s6. The spinlock on the queue taken before entering pollwake() is not
> enough to guarantee the required ordering, since a LOCK is no guarantee
> that operations before it are visible after the LOCK.
> Without the MB at w2, it could happen [w3, s5, s7, w1] that will make us
> miss the event *and* sleep.

I think you are very right. Actually, this is just like

W:
	fill_data(&DATA);
	wmb();
	DATA_is_ready = 1;	// triggered
	wake_up(wq);

S:
	set_current_state(state);
	if (DATA_is_ready)
		ret = poll(&DATA);
	else
		schedule();

Without the wmb() above, poll(&DATA) can obviously return the wrong value.

Oleg.



* Re: [PATCH] poll: allow f_op->poll to sleep, take#6
  2008-11-27  9:37             ` [PATCH] poll: allow f_op->poll to sleep, take#6 Tejun Heo
@ 2008-11-28  4:35               ` Davide Libenzi
  2008-11-28  4:44                 ` Tejun Heo
  2008-11-28 16:08                 ` Oleg Nesterov
  0 siblings, 2 replies; 14+ messages in thread
From: Davide Libenzi @ 2008-11-28  4:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On Thu, 27 Nov 2008, Tejun Heo wrote:

> f_op->poll is the only vfs operation which is not allowed to sleep.  It's
> because poll and select implementation used task state to synchronize
> against wake ups, which doesn't have to be the case anymore as wait/wake
> interface can now use custom wake up functions.  The non-sleep restriction
> can be a bit tricky because ->poll is not called from an atomic context
> and the result of accidentally sleeping in ->poll only shows up as
> temporary busy looping when the timing is right or rather wrong.
> 
> This patch converts poll/select to use custom wake up function and use
> separate triggered variable to synchronize against wake up events.  The
> only added overhead is an extra function call during wake up, and it is
> negligible.
> 
> This patch removes the one non-sleep exception from vfs locking rules and
> is beneficial to userland filesystem implementations like FUSE, 9p or
> peculiar fs like spufs as it's very difficult for those to implement
> non-sleeping poll method.

Looks OK to me, although it'd be better if some more folks eye it, in 
order to avoid painful mistakes.
Did you test it at all on a live system?


- Davide




* Re: [PATCH] poll: allow f_op->poll to sleep, take#6
  2008-11-28  4:35               ` Davide Libenzi
@ 2008-11-28  4:44                 ` Tejun Heo
  2008-11-28 16:08                 ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2008-11-28  4:44 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Oleg Nesterov, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

Davide Libenzi wrote:
> Looks OK to me, although it'd be better if some more folks eye it, in 
> order to avoid painful mistakes.
> Did you test it at all on a live system?

Depends on what you call a live system.  I tested the functionality
and it worked okay, but the failure scenarios are highly unlikely to
trigger, especially on x86 CPUs with their strong memory ordering
rules.  Now that it has smp_wmb() on the wake up side and a full mb
on the sleeping side, it's as conservative as it gets, so I think
we're safe.

Thanks.

-- 
tejun


* Re: [PATCH] poll: allow f_op->poll to sleep, take#6
  2008-11-28  4:35               ` Davide Libenzi
  2008-11-28  4:44                 ` Tejun Heo
@ 2008-11-28 16:08                 ` Oleg Nesterov
  1 sibling, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2008-11-28 16:08 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Tejun Heo, Eric Van Hensbergen, Ron Minnich, Ingo Molnar,
	Christoph Hellwig, Miklos Szeredi, Brad Boyer, Al Viro,
	Roland McGrath, Mauro Carvalho Chehab, Andrew Morton,
	Linux Kernel Mailing List

On 11/27, Davide Libenzi wrote:
>
> On Thu, 27 Nov 2008, Tejun Heo wrote:
>
> > f_op->poll is the only vfs operation which is not allowed to sleep.  It's
> > because poll and select implementation used task state to synchronize
> > against wake ups, which doesn't have to be the case anymore as wait/wake
> > interface can now use custom wake up functions.  The non-sleep restriction
> > can be a bit tricky because ->poll is not called from an atomic context
> > and the result of accidentally sleeping in ->poll only shows up as
> > temporary busy looping when the timing is right or rather wrong.
> >
> > This patch converts poll/select to use custom wake up function and use
> > separate triggered variable to synchronize against wake up events.  The
> > only added overhead is an extra function call during wake up, and it is
> > negligible.
> >
> > This patch removes the one non-sleep exception from vfs locking rules and
> > is beneficial to userland filesystem implementations like FUSE, 9p or
> > peculiar fs like spufs as it's very difficult for those to implement
> > non-sleeping poll method.
>
> Looks OK to me, although it'd be better if some more folks eye it, in
> order to avoid painful mistakes.

Just in case, I think the patch is correct too.

(v4l1-compat.c:poll_one() looks buggy, it doesn't check if poll_get_entry()
 fails, but this has nothing to do with this patch).

Oleg.



end of thread, other threads:[~2008-11-28 16:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2008-11-25 17:30 + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Oleg Nesterov
2008-11-25 21:08 ` Davide Libenzi
2008-11-26  4:33 ` Tejun Heo
2008-11-26  4:40   ` [PATCH] poll: allow f_op->poll to sleep, take#5 Tejun Heo
2008-11-26  6:27     ` Davide Libenzi
2008-11-26  6:39       ` Tejun Heo
2008-11-26 19:36         ` Davide Libenzi
2008-11-27  9:18           ` Tejun Heo
2008-11-27  9:37             ` [PATCH] poll: allow f_op->poll to sleep, take#6 Tejun Heo
2008-11-28  4:35               ` Davide Libenzi
2008-11-28  4:44                 ` Tejun Heo
2008-11-28 16:08                 ` Oleg Nesterov
2008-11-27 20:13           ` [PATCH] poll: allow f_op->poll to sleep, take#5 Oleg Nesterov
2008-11-26  4:49   ` + poll-allow-f_op-poll-to-sleep-take-4.patch added to -mm tree Tejun Heo
