public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] poll(): add poll_wait_set_exclusive()
@ 2010-10-06 17:56 Mathieu Desnoyers
  2010-10-06 18:08 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2010-10-06 17:56 UTC (permalink / raw)
  To: Steven Rostedt, LKML
  Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig,
	Mathieu Desnoyers, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Paul E. McKenney

Executive summary:

Addition of the new internal API:

poll_wait_set_exclusive() : set poll wait queue to exclusive

Sets up a poll wait queue to use exclusive wakeups, so that each wakeup wakes
only one waiter. Used to work around the "thundering herd" problem.

* Problem description:

In the ring buffer poll() implementation, a typical multithreaded user-space
buffer reader polls all per-cpu buffer descriptors for data.  The number of
reader threads can be user-defined; the motivation for permitting this is that
in typical workloads a single CPU produces most of the tracing data while all
other CPUs are idle, available to consume data. It therefore makes sense not to
tie those threads to specific buffers. However, as the number of threads grows,
we face a "thundering herd" problem where many threads are woken up and put
back to sleep, leaving only a single thread doing useful work.

* Solution:

Introduce a poll_wait_set_exclusive() primitive in the poll API, so that the
code implementing the poll file operation can specify that only a single waiter
must be woken up.

To Andi's question:

> How does that work?

I make the ring buffer poll file operation call a new primitive:

poll_wait_set_exclusive(wait);

Which makes sure that when we have multiple threads waiting on the same file
descriptor (which represents a ring buffer), only one of the threads is woken
up.
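
In f_op->poll terms, the intended call site looks roughly like this (a sketch
against the RFC's API; ring_buffer_file, rb->read_wait and data_available() are
made-up names for illustration, not code from the patch):

```c
static unsigned int ring_buffer_poll(struct file *filp, poll_table *wait)
{
	struct ring_buffer_file *rb = filp->private_data;	/* hypothetical */

	/* Switch this poll_table to exclusive registration before any
	 * poll_wait() call adds us to a wait queue head. */
	poll_wait_set_exclusive(wait);
	poll_wait(filp, &rb->read_wait, wait);

	if (data_available(rb))		/* hypothetical condition check */
		return POLLIN | POLLRDNORM;
	return 0;
}
```

The call must come before poll_wait(), since it swaps the qproc function that
poll_wait() uses to add entries to the wait queue.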

> Wouldn't that break poll semantics?

The way I currently do it, yes, but we might be able to do better by tweaking
the poll wakeup chain.

Basically, what I need is that a poll wakeup triggers an exclusive synchronous
wakeup, and then re-checks the wakeup condition. AFAIU, the usual poll semantics
seem to be that all poll()/epoll() callers should be notified of state changes
on all examined file descriptors. But whether we should do the wakeup first,
wait for the woken-up thread to run (and possibly consume the data), and only
after that check whether we must continue going through the wakeup chain is left
as a grey zone
(ref.  http://www.opengroup.org/onlinepubs/009695399/functions/poll.html).


> If not it sounds like a general improvement.
>
> I assume epoll already does it?

Nope, according to epoll(7):


"      Q2  Can two epoll instances wait for the same file descriptor?  If  so,
           are events reported to both epoll file descriptors?

       A2  Yes,  and  events would be reported to both.  However, careful pro[...]"

So for now, I still propose the less globally intrusive approach, with
poll_wait_set_exclusive(). Maybe if we figure out that changing the behavior of
the poll wakeup chains is appropriate, we can proceed differently.

This patch is based on top of:

  git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-ringbuffer.git
  branch: tip-pull-queue

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: William Lee Irwin III <wli@holomorphy.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andi Kleen <andi@firstfloor.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <peterz@infradead.org>
---
 fs/select.c          |   41 ++++++++++++++++++++++++++++++++++++++---
 include/linux/poll.h |    2 ++
 2 files changed, 40 insertions(+), 3 deletions(-)

Index: linux.trees.git/fs/select.c
===================================================================
--- linux.trees.git.orig/fs/select.c	2010-07-09 15:59:00.000000000 -0400
+++ linux.trees.git/fs/select.c	2010-07-09 16:03:24.000000000 -0400
@@ -112,6 +112,9 @@ struct poll_table_page {
  */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 		       poll_table *p);
+static void __pollwait_exclusive(struct file *filp,
+				 wait_queue_head_t *wait_address,
+				 poll_table *p);
 
 void poll_initwait(struct poll_wqueues *pwq)
 {
@@ -152,6 +155,20 @@ void poll_freewait(struct poll_wqueues *
 }
 EXPORT_SYMBOL(poll_freewait);
 
+/**
+ * poll_wait_set_exclusive - set poll wait queue to exclusive
+ *
+ * Sets up a poll wait queue to use exclusive wakeups. This is useful to
+ * wake up only one waiter at each wakeup. Used to work-around "thundering herd"
+ * problem.
+ */
+void poll_wait_set_exclusive(poll_table *p)
+{
+	if (p)
+		init_poll_funcptr(p, __pollwait_exclusive);
+}
+EXPORT_SYMBOL(poll_wait_set_exclusive);
+
 static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 {
 	struct poll_table_page *table = p->table;
@@ -213,8 +230,10 @@ static int pollwake(wait_queue_t *wait,
 }
 
 /* Add a new entry */
-static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
-				poll_table *p)
+static void __pollwait_common(struct file *filp,
+			      wait_queue_head_t *wait_address,
+			      poll_table *p,
+			      int exclusive)
 {
 	struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
 	struct poll_table_entry *entry = poll_get_entry(pwq);
@@ -226,7 +245,23 @@ static void __pollwait(struct file *filp
 	entry->key = p->key;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
-	add_wait_queue(wait_address, &entry->wait);
+	if (!exclusive)
+		add_wait_queue(wait_address, &entry->wait);
+	else
+		add_wait_queue_exclusive(wait_address, &entry->wait);
+}
+
+static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
+				poll_table *p)
+{
+	__pollwait_common(filp, wait_address, p, 0);
+}
+
+static void __pollwait_exclusive(struct file *filp,
+				 wait_queue_head_t *wait_address,
+				 poll_table *p)
+{
+	__pollwait_common(filp, wait_address, p, 1);
 }
 
 int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
Index: linux.trees.git/include/linux/poll.h
===================================================================
--- linux.trees.git.orig/include/linux/poll.h	2010-07-09 15:59:00.000000000 -0400
+++ linux.trees.git/include/linux/poll.h	2010-07-09 16:03:24.000000000 -0400
@@ -79,6 +79,8 @@ static inline int poll_schedule(struct p
 	return poll_schedule_timeout(pwq, state, NULL, 0);
 }
 
+extern void poll_wait_set_exclusive(poll_table *p);
+
 /*
  * Scaleable version of the fd_set.
  */

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 17:56 [RFC PATCH] poll(): add poll_wait_set_exclusive() Mathieu Desnoyers
@ 2010-10-06 18:08 ` Linus Torvalds
  2010-10-06 19:04   ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2010-10-06 18:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

On Wed, Oct 6, 2010 at 10:56 AM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
> Executive summary:
>
> Addition of the new internal API:

Executive summary: no.

You need to explain first how you could _ever_ use this without
breaking select/poll semantics totally.

IOW, you need to explain the user space interface first. Before you do
that, this patch is total and utter crap that is expressly designed to be
used only in a manner that is a pure bug.

                       Linus


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 18:08 ` Linus Torvalds
@ 2010-10-06 19:04   ` Mathieu Desnoyers
  2010-10-06 19:41     ` Linus Torvalds
  2010-10-06 20:31     ` Steven Rostedt
  0 siblings, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2010-10-06 19:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Oct 6, 2010 at 10:56 AM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> > Executive summary:
> >
> > Addition of the new internal API:
> 
> Executive summary: no.
> 
> You need to explain first how you could _ever_ use this without
> breaking select/poll semantics totally.
> 
> IOW, you need to explain the user space interface first. Before you do
> that, this patch is total and utter crap that is expressly designed to
> used only in a manner that is a pure bug.

You are right. My approach breaks the select/poll semantics. This is why I'm
asking for input on whether we want to solve the more general problem. For the
moment, the poll_wait_set_exclusive() solution was only meant to be used for
debugfs kernel pseudo-files, which fall outside the scope of POSIX.

Maybe what I am trying to do is too far from the poll() semantics and does not
apply in the general case, but I clearly see the need, at least in the use-case
detailed below, to wake up only one thread at a time, whether we call this
"poll" or something else. One way to make it available more generally might be
to add a new open() flag and require that all open() calls on a given file use
the flag to get the "wake up only one thread" behavior.

For reference, here is the use-case: the user-space daemon typically runs one
thread per CPU, each with a handle on many file descriptors. Each thread waits
for data to become available using poll(). To follow the poll semantics, when
data becomes available on a file descriptor, the kernel wakes up all threads at
once, but in my case only one of them will successfully consume the data (all
the other threads' splice or read calls fail with -ENODATA). With many threads,
these useless wakeups add unwanted overhead and a scalability limitation.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 19:04   ` Mathieu Desnoyers
@ 2010-10-06 19:41     ` Linus Torvalds
  2010-10-07 16:53       ` Mathieu Desnoyers
  2010-10-06 20:31     ` Steven Rostedt
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2010-10-06 19:41 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney, Davide Libenzi

On Wed, Oct 6, 2010 at 12:04 PM, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> You are right. My approach breaks the select/poll semantics. This is why I'm
> asking for input if we want to solve the more general problem. For the moment,
> the poll_wait_set_exclusive() solution was only meant to be used for debugfs
> kernel pseudo-files, which fall out of the POSIX scope.

The thing is, I don't think "that file isn't mentioned by POSIX" is a
valid excuse for having different semantics for any particular file.

So no, I don't think it's acceptable to say that certain files just
act differently wrt "poll".

> Maybe what I am trying to do is too far from the poll() semantic and does not
> apply in the general case, but I clearly see the need, at least in the use-case
> detailed below, to wake up only one thread at a time, whether we call this
> "poll" or something else. One way to make it available more generally might be
> to add a new open() flag and require that all open() of a given file should use
> the flag to provide the "wakeup only one thread" behavior.

I think that would be a better interface, but still sucky and badly
designed. Why? Because the obvious case where you might want the
whole "only wake up a single thread" behavior is actually sockets, so
having an open-time flag is just insane.

Making it a file status flag (and then using fcntl F_[GS]ETFL on it)
might work. At the same time, I have the suspicion that it would be
much nicer to embed it into the "requested events" field to poll, and
simply add a "exclusive read" poll event (POLLINEX or whatever).
Because it's really the "poll()" function itself that says "I promise
that if you return a readable file descriptor, I will read everything
from it" - it's really an attribute local to the "poll()", not to the
file descriptor.

Regardless, I really think that for anything like this to make sense,
it needs way more than a single use case. So you'd really want to get
some web server developers excited or something like that. Over the
years we have learnt that one-off use cases are worthless.

Also, doesn't eventpoll already support exclusive polling? I dunno.
Davide might be interested in the discussion regardless.

                     Linus


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 19:04   ` Mathieu Desnoyers
  2010-10-06 19:41     ` Linus Torvalds
@ 2010-10-06 20:31     ` Steven Rostedt
  2010-10-07 17:07       ` Mathieu Desnoyers
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2010-10-06 20:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

On Wed, 2010-10-06 at 15:04 -0400, Mathieu Desnoyers wrote:

> For reference, here is the use-case: The user-space daemon runs typically one
> thread per cpu, each with a handle on many file descriptors. Each thread waits
> for data to be available using poll(). In order to follow the poll semantic,
> when data becomes available on a file descriptor, the kernel wakes up all
> threads at once, but in my case only one of them will successfully consume the
> data (all other thread's splice or read will fail with -ENODATA). With many
> threads, these useless wakeups add an unwanted overhead and scalability
> limitation.

Mathieu, I'm curious as to why you have multiple threads reading the same
fd. Since the threads are per cpu, does the fd handle all CPUs? Or do
you have an fd per event per CPU, in which case the threads should just
poll off of their own fds.

-- Steve




* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 19:41     ` Linus Torvalds
@ 2010-10-07 16:53       ` Mathieu Desnoyers
  2010-10-31 23:02         ` Davide Libenzi
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2010-10-07 16:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney, Davide Libenzi

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Oct 6, 2010 at 12:04 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
[...]
> So no, I don't think it's acceptable to say that certain files just
> act differently wrt "poll".

Agreed.

> 
> > Maybe what I am trying to do is too far from the poll() semantic and does not
> > apply in the general case, but I clearly see the need, at least in the use-case
> > detailed below, to wake up only one thread at a time, whether we call this
> > "poll" or something else. One way to make it available more generally might be
> > to add a new open() flag and require that all open() of a given file should use
> > the flag to provide the "wakeup only one thread" behavior.
> 
> I think that would be a better interface, but still sucky and badly
> designed. Why? Because the obvious case where you might want to have
> the whole "only wake up a single thread" is actually for sockets, so
> having an open-time flag is just insane.

Good point.

> Making it be an file status flag (and then use fcntl F_[GS]ETFL on it)
> might work. At the same time, I have the suspicion that it would be
> much nicer to embed it into the "requested events" field to poll, and
> simply add a "exclusive read" poll event (POLLINEX or whatever).
> Because it's really the "poll()" function itself that says "I promise
> that if you return a readable file descriptor, I will read everything
> from it" - it's really an attribute local to the "poll()", not to the
> file descriptor.

This is interesting. I'm just concerned that if we have many poll() calls
waiting on the same file descriptor, some with POLLINEX and others without, the
poll() calls expecting the standard POSIX behavior might be hurt. Having a
single poll() call with POLLINEX would affect the behavior of the wakeup list
for the whole file descriptor, with side-effects on the non-POLLINEX poll()
calls.

Also, the whole question seems to depend on the notion of edge-triggered vs
level-triggered wakeups, which is better defined for epoll(). The poll
specification by the Open Group states that "The poll() function shall identify
those file descriptors on which an application can read or write data, or on
which certain events have occurred.", which is a blurry enough definition to
allow either edge- or level-triggering.

I think the POLLINEX scheme you propose here would only work with
level-triggering semantics, and the semantics of mixing POLLINEX and
non-POLLINEX poll() calls seem blurry. I would personally be inclined to go for
the fcntl F_[GS]ETFL solution that applies to the whole file descriptor, so
that all users of a file descriptor agree on its poll semantics.

> Regardless, I really think that for anything like this to make sense,
> it needs way more than a single use case. So you'd really want to get
> some web server developers excited or something like that. Over the
> years we have learnt that one-off use cases are worthless.

Indeed.

> Also, doesn't eventpoll already support exclusive polling? I dunno.
> Davide might be interested in the discussion regardless.

Looking at epoll(7), the behavior of EPOLLONESHOT when there are multiple epoll
instances monitoring a file descriptor seems unclear: does it stop event
propagation after delivery to the first epoll instance (this is the behavior I
am looking for), or does it stop the event delivery after having woken up all
epoll instances monitoring the file descriptor ? Davide might have the answer to
this one.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-06 20:31     ` Steven Rostedt
@ 2010-10-07 17:07       ` Mathieu Desnoyers
  2010-10-07 17:51         ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2010-10-07 17:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Wed, 2010-10-06 at 15:04 -0400, Mathieu Desnoyers wrote:
> 
> > For reference, here is the use-case: The user-space daemon runs typically one
> > thread per cpu, each with a handle on many file descriptors. Each thread waits
> > for data to be available using poll(). In order to follow the poll semantic,
> > when data becomes available on a file descriptor, the kernel wakes up all
> > threads at once, but in my case only one of them will successfully consume the
> > data (all other thread's splice or read will fail with -ENODATA). With many
> > threads, these useless wakeups add an unwanted overhead and scalability
> > limitation.
> 
> Mathieu, I'm curious to why you have multiple threads reading the same
> fd. Since the threads are per cpu, does the fd handle all CPUs?

The fd is local to a single ring buffer (which is per-cpu, transporting a group
of events). The threads consuming the file descriptors are approximately per
cpu, modulo cpu hotplug events, user preferences, etc. I would prefer not to
make that a strong 1-1 mapping (with affinity and all), because a typical
tracing scenario is that a single CPU is heavily used by the OS (thus producing
trace data), while other CPUs are idle, available to pull the data from the
buffers. Therefore, I strongly prefer not to pin reader threads to their
"local" buffers in the general case. That being said, it could be kept as an
option, since it might make sense in some other use-cases, especially with tiny
buffers, where it makes sense to keep locality of reference in the L2 cache.

> Or do you have an fd per event per CPU, in which case the threads should just
> poll off of their own fds.

I have one fd per per-cpu buffer, but there can be many per-cpu buffers, each
transporting a group of events. Therefore, I don't want to associate one thread
per event group, because this would be a resource waste.  Typically, only a few
per-cpu buffers will be very active, and others will be very quiet.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-07 17:07       ` Mathieu Desnoyers
@ 2010-10-07 17:51         ` Steven Rostedt
  2010-10-07 18:07           ` Mathieu Desnoyers
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2010-10-07 17:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

On Thu, 2010-10-07 at 13:07 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Wed, 2010-10-06 at 15:04 -0400, Mathieu Desnoyers wrote:
> > 
> > > For reference, here is the use-case: The user-space daemon runs typically one
> > > thread per cpu, each with a handle on many file descriptors. Each thread waits
> > > for data to be available using poll(). In order to follow the poll semantic,
> > > when data becomes available on a file descriptor, the kernel wakes up all
> > > threads at once, but in my case only one of them will successfully consume the
> > > data (all other thread's splice or read will fail with -ENODATA). With many
> > > threads, these useless wakeups add an unwanted overhead and scalability
> > > limitation.
> > 
> > Mathieu, I'm curious to why you have multiple threads reading the same
> > fd. Since the threads are per cpu, does the fd handle all CPUs?
> 
> The fd is local to a single ring buffer (which is per-cpu, transporting a group
> of events). The threads consuming the file descriptors are approximately per
> cpu, modulo cpu hotplug events, user preferences, etc. I would prefer not to
> make that a strong 1-1 mapping (with affinity and all), because a typical
> tracing scenario is that a single CPU is heavily used by the OS (thus producing
> trace data), while other CPUs are idle, available to pull the data from the
> buffers. Therefore, I strongly prefer not to affine reader threads to their
> "local" buffers in the general case. That being said, it could be kept as an
> option, since it might make sense in some other use-cases, especially with tiny
> buffers, where it makes sense to keep locality of reference in the L2 cache.

I never mentioned affinity. trace-cmd assigns a process per
CPU, but those processes can be on any CPU the scheduler chooses. I
could probably do it with a single process reading all the CPU fds too.
I might add that as an option.

> 
> > Or do you have an fd per event per CPU, in which case the threads should just
> > poll off of their own fds.
> 
> I have one fd per per-cpu buffer, but there can be many per-cpu buffers, each
> transporting a group of events. Therefore, I don't want to associate one thread
> per event group, because this would be a resource waste.  Typically, only a few
> per-cpu buffers will be very active, and others will be very quiet.

Let's not talk about threads; what about fds? I'm wondering why you have
many threads on the same fd.

-- Steve





* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-07 17:51         ` Steven Rostedt
@ 2010-10-07 18:07           ` Mathieu Desnoyers
  2010-10-07 20:44             ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2010-10-07 18:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2010-10-07 at 13:07 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@goodmis.org) wrote:
> > > On Wed, 2010-10-06 at 15:04 -0400, Mathieu Desnoyers wrote:
> > > 
> > > > For reference, here is the use-case: The user-space daemon runs typically one
> > > > thread per cpu, each with a handle on many file descriptors. Each thread waits
> > > > for data to be available using poll(). In order to follow the poll semantic,
> > > > when data becomes available on a file descriptor, the kernel wakes up all
> > > > threads at once, but in my case only one of them will successfully consume the
> > > > data (all other thread's splice or read will fail with -ENODATA). With many
> > > > threads, these useless wakeups add an unwanted overhead and scalability
> > > > limitation.
> > > 
> > > Mathieu, I'm curious to why you have multiple threads reading the same
> > > fd. Since the threads are per cpu, does the fd handle all CPUs?
> > 
> > The fd is local to a single ring buffer (which is per-cpu, transporting a group
> > of events). The threads consuming the file descriptors are approximately per
> > cpu, modulo cpu hotplug events, user preferences, etc. I would prefer not to
> > make that a strong 1-1 mapping (with affinity and all), because a typical
> > tracing scenario is that a single CPU is heavily used by the OS (thus producing
> > trace data), while other CPUs are idle, available to pull the data from the
> > buffers. Therefore, I strongly prefer not to affine reader threads to their
> > "local" buffers in the general case. That being said, it could be kept as an
> > option, since it might make sense in some other use-cases, especially with tiny
> > buffers, where it makes sense to keep locality of reference in the L2 cache.
> 
> I never mention affinity. As with trace-cmd, it assigns a process per
> CPU, but those processes can be on any CPU that the scheduler chooses. I
> could probably do it with a single process reading all the CPU fds too.
> I might add that as an option.

Your scheme works fine because you have only one stream (and thus one fd) per
cpu. How would you map that onto many streams per cpu?

Also, you might want to consider using threads rather than processes, to avoid
the unnecessary address-space (mm) switches.

> 
> > 
> > > Or do you have an fd per event per CPU, in which case the threads should just
> > > poll off of their own fds.
> > 
> > I have one fd per per-cpu buffer, but there can be many per-cpu buffers, each
> > transporting a group of events. Therefore, I don't want to associate one thread
> > per event group, because this would be a resource waste.  Typically, only a few
> > per-cpu buffers will be very active, and others will be very quiet.
> 
> Lets not talk about threads, what about fds? I'm wondering why you have
> many threads on the same fd?

That's because I have fewer threads than file descriptors. So I can choose to
either:

1) somehow assign each thread to many fds statically, or
2) make each thread wait for data on all fds.

Option (2) adapts much better to workloads where lots of data comes from many
file descriptors on a single CPU: all threads can collaboratively work to
extract the data.

Thanks,

Mathieu

> 
> -- Steve
> 
> 
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-07 18:07           ` Mathieu Desnoyers
@ 2010-10-07 20:44             ` Steven Rostedt
  0 siblings, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2010-10-07 20:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, LKML, Andrew Morton, Peter Zijlstra, Ingo Molnar,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Paul E. McKenney

On Thu, 2010-10-07 at 14:07 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt (rostedt@goodmis.org) wrote:

> > I never mention affinity. As with trace-cmd, it assigns a process per
> > CPU, but those processes can be on any CPU that the scheduler chooses. I
> > could probably do it with a single process reading all the CPU fds too.
> > I might add that as an option.
> 
> Your scheme works fine because you have only one stream (and thus one fd) per
> cpu. How would you map that with many streams per cpu ?
> 
> Also, you might want to consider using threads rather than processes, to save
> the unnecessary VM swaps.

I thought about threads and could go back to them. I'd have to benchmark
to see the performance hit.

> 
> > 
> > > 
> > > > Or do you have an fd per event per CPU, in which case the threads should just
> > > > poll off of their own fds.
> > > 
> > > I have one fd per per-cpu buffer, but there can be many per-cpu buffers, each
> > > transporting a group of events. Therefore, I don't want to associate one thread
> > > per event group, because this would be a resource waste.  Typically, only a few
> > > per-cpu buffers will be very active, and others will be very quiet.
> > 
> > Lets not talk about threads, what about fds? I'm wondering why you have
> > many threads on the same fd?
> 
> That's because I have fewer threads than file descriptors. So I can choose to
> either:
> 
> 1) somehow assign each thread to many fds statically or
> 2) make each thread wait for data on all fds
> 
> Option (2) adapts much better to workloads where a lots of data would come from
> many file descriptors from a single CPU: all threads can collaboratively work to
> extract the data.

OK, this is what I wanted to know. Now, as Linus suggested, go and ask
other application developers (web server developers?) whether this is
something that could be useful for them. If you get positive feedback,
have them Cc LKML and we can go from there.

Or, if you see that this type of programming is being done in Apache,
and if you can demonstrate that adding this feature helps Apache (or
some other popular program), that would also be of benefit.

-- Steve




* Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()
  2010-10-07 16:53       ` Mathieu Desnoyers
@ 2010-10-31 23:02         ` Davide Libenzi
  0 siblings, 0 replies; 11+ messages in thread
From: Davide Libenzi @ 2010-10-31 23:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Steven Rostedt, LKML, Andrew Morton,
	Peter Zijlstra, Ingo Molnar, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Paul E. McKenney

On Thu, 7 Oct 2010, Mathieu Desnoyers wrote:

> > Also, doesn't eventpoll already support exclusive polling? I dunno.
> > Davide might be interested in the discussion regardless.
> 
> Looking at epoll(7), the behavior of EPOLLONESHOT when there are multiple epoll
> instances monitoring a file descriptor seems unclear: does it stop event
> propagation after delivery to the first epoll instance (this is the behavior I
> am looking for), or does it stop the event delivery after having woken up all
> epoll instances monitoring the file descriptor ? Davide might have the answer to
> this one.

Sorry for the late response, but I am very slowly working my way through a
long backlog.
In your case above, both epoll fds will get the event reported, even with
EPOLLONESHOT (because the one-shot applies to the epoll-fd/monitored-fd
pair/key).
As for the exclusive wakeup feature, I am not totally against it, though 
it is borderline as far as use cases vs. added complexity.
The POLLEX/EPOLLEX would be nicer, if it didn't lead to a racy interface
(say, one thread uses [E]POLLEX and another does not).
IMO this is more fcntl(2) flag territory, as the application would set the 
behavior globally (for the file), though again, I am not sure the use 
cases justify the introduction and handling of the new flag.



- Davide


