* [PATCH] rfc: threaded epoll_wait thundering herd
       [not found] <20070504225730.490334000@haxent.com.br>
@ 2007-05-04 23:37 ` Davi Arnaut
  2007-05-05  4:15   ` Eric Dumazet
  2007-05-05 19:00   ` Davide Libenzi
  0 siblings, 2 replies; 21+ messages in thread
From: Davi Arnaut @ 2007-05-04 23:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Davide Libenzi, Linus Torvalds, Linux Kernel Mailing List

Hi,

If multiple threads are parked on epoll_wait (on a single epoll fd) and
events become available, epoll performs a wake up of all threads of the
poll wait list, causing a thundering herd of processes trying to grab
the eventpoll lock.

This patch addresses this by using exclusive waiters (wake one). Once
the exclusive thread finishes transferring its events, a new thread
is woken if there are more events available.

Makes sense?

Signed-off-by: Davi E. M. Arnaut <davi@haxent.com.br>

---
 fs/eventpoll.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6/fs/eventpoll.c
===================================================================
--- linux-2.6.orig/fs/eventpoll.c
+++ linux-2.6/fs/eventpoll.c
@@ -1491,6 +1491,12 @@ static void ep_reinject_items(struct eve
 		}
 	}

+	/*
+	 * If there are events available, wake up the next waiter, if any.
+	 */
+	if (!ricnt)
+		ricnt = !list_empty(&ep->rdllist);
+
 	if (ricnt) {
 		/*
 		 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
@@ -1570,6 +1576,7 @@ retry:
 	 * ep_poll_callback() when events will become available.
 	 */
 	init_waitqueue_entry(&wait, current);
+	wait.flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue(&ep->wq, &wait);

 	for (;;) {
--

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
@ 2007-05-05  4:15   ` Eric Dumazet
  2007-05-05  4:44     ` Linus Torvalds
  2007-05-05 19:00   ` Davide Libenzi
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2007-05-05 4:15 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Andrew Morton, Davide Libenzi, Linus Torvalds, Linux Kernel Mailing List

Davi Arnaut a écrit :
> Hi,
>
> If multiple threads are parked on epoll_wait (on a single epoll fd) and
> events become available, epoll performs a wake up of all threads of the
> poll wait list, causing a thundering herd of processes trying to grab
> the eventpoll lock.
>
> This patch addresses this by using exclusive waiters (wake one). Once
> the exclusive thread finishes transferring its events, a new thread
> is woken if there are more events available.
>
> Makes sense?

Yes it makes sense.

But... what happens if the thread that was chosen exits from the loop in
ep_poll() with res = -EINTR (because of signal_pending(current))?

Methinks in this case some ready events can wait forever (up to the next
ep_poll_callback(), or until another thread enters ep_poll()).

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05  4:15   ` Eric Dumazet
@ 2007-05-05  4:44     ` Linus Torvalds
  2007-05-05  5:47       ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2007-05-05 4:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davi Arnaut, Andrew Morton, Davide Libenzi, Linux Kernel Mailing List

On Sat, 5 May 2007, Eric Dumazet wrote:
>
> But... what happens if the thread that was chosen exits from the loop in
> ep_poll() with res = -EINTR (because of signal_pending(current))

Not a problem.

What happens is that an exclusive wake-up stops on the first entry in the
wait-queue that it actually *wakes*up*, but if some task has just marked
itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it
will just be marked TASK_RUNNING, and that in itself isn't enough to cause
the "exclusive" test to trigger.

The code in sched.c is subtle, but worth understanding if you care about
these things. You should look at:

 - try_to_wake_up() - this is the default wakeup function (and the one
   that should work correctly - I'm not going to guarantee that any of the
   other specialty-wakeup-functions do so)

   The return value is the important thing. Returning non-zero is
   "success", and implies that we actually activated it.

   See the "goto out_running" case for the case where the process was
   still actually on the run-queues, and we just ended up setting
   "p->state = TASK_RUNNING" - we still return 0, and the "exclusive"
   logic will not trigger.

 - __wake_up_common: this is the thing that _calls_ the above, and which
   cares about the return value above. It does

	if (curr->func(curr, mode, sync, key) &&
	    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)

   ie it only decrements (and triggers) the nr_exclusive thing when the
   wakeup-function returned non-zero (and when the waitqueue entry was
   marked exclusive, of course).

So what does all this subtlety *mean*?

Walk through it. It means that it is safe to do the

	if (signal_pending())
		return -EINTR;

kind of thing, because *when* you do this, you obviously are always on the
run-queue (otherwise the process wouldn't be running, and couldn't be
doing the test). So if there is somebody else waking you up right then and
there, they'll never count your wakeup as an exclusive one, and they will
wake up at least one other real exclusive waiter.

(IOW, you get a very very small probability of a very very small
"thundering herd" - obviously it won't be "thundering" any more, it will
be more of a "whispering herdlet").

The Linux kernel sleep/wakeup thing is really quite nifty and smart. And
very few people realize just *how* nifty and elegant (and efficient) it
is. Hopefully a few more people appreciate its beauty and subtlety now ;)

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05  4:44     ` Linus Torvalds
@ 2007-05-05  5:47       ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2007-05-05 5:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davi Arnaut, Andrew Morton, Davide Libenzi, Linux Kernel Mailing List

Linus Torvalds a écrit :
>
> On Sat, 5 May 2007, Eric Dumazet wrote:
>> But... what happens if the thread that was chosen exits from the loop in
>> ep_poll() with res = -EINTR (because of signal_pending(current))
>
> Not a problem.
>
> What happens is that an exclusive wake-up stops on the first entry in the
> wait-queue that it actually *wakes*up*, but if some task has just marked
> itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it
> will just be marked TASK_RUNNING and that in itself isn't enough to cause
> the "exclusive" test to trigger.
>
> The code in sched.c is subtle, but worth understanding if you care about
> these things. You should look at:
>
>  - try_to_wake_up() - this is the default wakeup function (and the one
>    that should work correctly - I'm not going to guarantee that any of the
>    other specialty-wakeup-functions do so)
>
>    The return value is the important thing. Returning non-zero is
>    "success", and implies that we actually activated it.
>
>    See the "goto out_running" case for the case where the process was
>    still actually on the run-queues, and we just ended up setting
>    "p->state = TASK_RUNNING" - we still return 0, and the "exclusive"
>    logic will not trigger.
>
>  - __wake_up_common: this is the thing that _calls_ the above, and which
>    cares about the return value above. It does
>
> 	if (curr->func(curr, mode, sync, key) &&
> 	    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
>
>    ie it only decrements (and triggers) the nr_exclusive thing when the
>    wakeup-function returned non-zero (and when the waitqueue entry was
>    marked exclusive, of course).
>
> So what does all this subtlety *mean*?
>
> Walk through it. It means that it is safe to do the
>
> 	if (signal_pending())
> 		return -EINTR;
>
> kind of thing, because *when* you do this, you obviously are always on the
> run-queue (otherwise the process wouldn't be running, and couldn't be
> doing the test). So if there is somebody else waking you up right then and
> there, they'll never count your wakeup as an exclusive one, and they will
> wake up at least one other real exclusive waiter.
>
> (IOW, you get a very very small probability of a very very small
> "thundering herd" - obviously it won't be "thundering" any more, it will
> be more of a "whispering herdlet").
>
> The Linux kernel sleep/wakeup thing is really quite nifty and smart. And
> very few people realize just *how* nifty and elegant (and efficient) it
> is. Hopefully a few more people appreciate its beauty and subtlety now ;)
>

Thank you Linus for these detailed explanations.

I think I was frightened not by the wakeup logic, but by the possibility
in SMP that a signal could be delivered to the thread just after it has
been selected.

Looking again at ep_poll(), I see:

	set_current_state(TASK_INTERRUPTIBLE);  [*]
	if (!list_empty(&ep->rdllist) || !jtimeout)
		break;
	if (signal_pending(current)) {
		res = -EINTR;
		break;
	}

So the test against signal_pending() is not done if an event is present
in the ready list: it should be delivered even if a signal is pending.

I missed this bit earlier...

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
  2007-05-05  4:15   ` Eric Dumazet
@ 2007-05-05 19:00   ` Davide Libenzi
  2007-05-05 21:42     ` Davi Arnaut
  2007-05-07 15:46     ` Chase Venters
  1 sibling, 2 replies; 21+ messages in thread
From: Davide Libenzi @ 2007-05-05 19:00 UTC (permalink / raw)
  To: Davi Arnaut; +Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Fri, 4 May 2007, Davi Arnaut wrote:

> Hi,
>
> If multiple threads are parked on epoll_wait (on a single epoll fd) and
> events become available, epoll performs a wake up of all threads of the
> poll wait list, causing a thundering herd of processes trying to grab
> the eventpoll lock.
>
> This patch addresses this by using exclusive waiters (wake one). Once
> the exclusive thread finishes transferring its events, a new thread
> is woken if there are more events available.
>
> Makes sense?

Theoretically, it makes sense. I said theoretically because all the
epoll_wait MT use cases I've heard of use a single thread that does the
epoll_wait, and then dispatches to worker threads. So the thundering herd
is not in the picture. OTOH, it does not hurt either.
But, that code is completely changed with the new single-pass epoll
delivery code that is in -mm. So, I'd either wait for that code to go in,
or I (or you, if you like) can make a patch against -mm.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 19:00   ` Davide Libenzi
@ 2007-05-05 21:42     ` Davi Arnaut
  2007-05-07 21:00       ` Ulrich Drepper
  1 sibling, 1 reply; 21+ messages in thread
From: Davi Arnaut @ 2007-05-05 21:42 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Davide Libenzi wrote:
> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring its events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theoretically, it makes sense. I said theoretically because all the
> epoll_wait MT use cases I've heard of use a single thread that does the
> epoll_wait, and then dispatches to worker threads. So the thundering
> herd is not in the picture. OTOH, it does not hurt either.

A google search turns up a few users. It also addresses some complaints
from Drepper.

> But, that code is completely changed with the new single-pass epoll
> delivery code that is in -mm. So, I'd either wait for that code to go
> in, or I (or you, if you like) can make a patch against -mm.

Feel free to update the patch for -mm and to include my Signed-off-by.

Thanks.

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 21:42     ` Davi Arnaut
@ 2007-05-07 21:00       ` Ulrich Drepper
  2007-05-07 21:34         ` Davi Arnaut
  2007-05-07 22:47         ` Davide Libenzi
  0 siblings, 2 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-07 21:00 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote:
> A google search turns up a few users. It also addresses some complaints
> from Drepper.

There is a huge problem with this approach and we're back at the
inadequate interface.

select/poll/epoll are thread cancellation points. I.e., the thread
can be canceled before returning to the user. If this cancellation
happens between the kernel deciding to give this thread the event (and
no other thread) and the thread testing for cancellation in the libc
wrapper around the syscall, then the event is lost and the process(es)
might hang.

With kevent we in the end fixed the problem by requiring that, as part
of the cancellation handling, the thread tries to wake up another thread
waiting for the event queue. This is easily possible since the event
data is in the shared memory segment and it's just purely the thread
wakeup that is needed.

To make something like this work for poll you'd have to push the
revents fields of the result back to the kernel, which might then cause
another thread to be woken up. I find this too ugly to consider. You
guys will not believe this but I really thought all these things
through before writing the OLS paper. poll cannot be salvaged.

There is another thing about this selective wakeup: do I assume it
correctly that if more than one file descriptor is reported ready, more
than one thread is woken? I think nothing else can be justified. Will
in this case both threads get the same set of descriptors reported, or
will they see disjunct sets?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 21:00 ` Ulrich Drepper @ 2007-05-07 21:34 ` Davi Arnaut 2007-05-07 22:19 ` Ulrich Drepper 2007-05-07 22:47 ` Davide Libenzi 1 sibling, 1 reply; 21+ messages in thread From: Davi Arnaut @ 2007-05-07 21:34 UTC (permalink / raw) To: Ulrich Drepper Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List Ulrich Drepper wrote: > On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote: >> A google search turns up a few users. It also addresses some complaints >> from Drepper. > > There is a huge problem with this approach and we're back at the > inadequate interface. > > select/poll/epoll are thread cancellation points. I.e., the thread > can be canceled before returning to the user. If this cancellation > happens between the kernel deciding to give this thread the event (and > no other thread) and the thread testing for cancellation in the libc > wrapper around the syscall, then the event is lost and the process(es) > might hang. > > With kevent we in the end fixed the problem by requiring that part of > the cancellation handling the thread tries to wake up another thread > waiting for the event queue. This is easily possible since the event > data is in the shared memory segment and it's just purely the thread > wakeup that is needed. > > To make something like this work for poll you'd have to push back the > revents fields of the result back to the kernel which might then cause > another thread to be woken up. I find this too ugly to consider. You > guys will not believe this but I really thought all these things > through before writing the OLS paper. poll cannot be salvaged. See Linus's message on this same thread. > There is another thing about this selective wakeup: do I assume it > correctly that if more than one file descriptor is reported ready more > than one thread is woken? I think nothing else can be justified. Correct. 
> Will in this case both threads get the same set of descriptors
> reported or will they see disjunct sets?

Disjunct. In reality, only if the event is edge-triggered.

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 21:34         ` Davi Arnaut
@ 2007-05-07 22:19           ` Ulrich Drepper
  2007-05-07 22:35             ` Davide Libenzi
  2007-05-07 23:15             ` Davi Arnaut
  0 siblings, 2 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-07 22:19 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
> See Linus's message on this same thread.

No. I'm talking about the userlevel side, not the kernel side.

If a thread is canceled *after* it returns from the syscall but before
it reports the event to the caller (i.e., while still in the syscall
wrapper; thread cancellation rules require a check there) the event is
lost. Linus was only talking about the kernel side and how in the exit
path such events are not lost.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 22:19 ` Ulrich Drepper @ 2007-05-07 22:35 ` Davide Libenzi 2007-05-08 2:49 ` Ulrich Drepper 2007-05-07 23:15 ` Davi Arnaut 1 sibling, 1 reply; 21+ messages in thread From: Davide Libenzi @ 2007-05-07 22:35 UTC (permalink / raw) To: Ulrich Drepper Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On Mon, 7 May 2007, Ulrich Drepper wrote: > On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote: > > See Linus's message on this same thread. > > No. I'm talking about the userlevel side, not kernel side. > > If a thread is canceled *after* it returns from the syscall but before > it reports the event to the call (i.e., while still in the syscall > wrapper, thread cancellation rules require a check there) the event is > lost. read(2) is a cancellation point too. So if the fine userspace code issues a random pthread_cancel() to a thread handling that, data is lost together with the session that thread was handling. Hmm, I wonder how the world could have functioned so far. Bottom line is, if you really want to throw random cancels to your worker threads, you better wrap them into pthread_cleanup_push(). Because otherwise, no matter where your cancel hits, you end up with a broken system. - Davide ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 22:35             ` Davide Libenzi
@ 2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-08 2:49 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> read(2) is a cancellation point too. So if the fine userspace code issues
> a random pthread_cancel() to a thread handling that, data is lost together
> with the session that thread was handling.

This is absolutely not comparable. When read/write is canceled no
data is lost. Some other thread might have to pick up the slack but
that's it.

With your poll extensions, one thread gets notified and then the kernel
forgets about the event information. Now this information is only
available in the thread itself. How can a cancellation handler
communicate this to all the other threads waiting in poll()? Every
event which is reported and forgotten must be communicated. To make
things more fun, the various waiters can be in different processes.

We went over this for the kevent discussions. I really am not willing to
do it all again, especially since there is no hope of achieving a
satisfying result with poll. So, don't count my silence as agreement;
it isn't.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
@ 2007-05-08  3:56                 ` Kyle Moffett
  2007-05-08  4:35                 ` Linus Torvalds
  2007-05-08  6:30                 ` Davide Libenzi
  2 siblings, 0 replies; 21+ messages in thread
From: Kyle Moffett @ 2007-05-08 3:56 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On May 07, 2007, at 22:49:04, Ulrich Drepper wrote:
> On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
>> read(2) is a cancellation point too. So if the fine userspace code
>> issues a random pthread_cancel() to a thread handling that, data
>> is lost together with the session that thread was handling.
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack
> but that's it.

Uhh, how can you claim "no data is lost" in this scenario: a bunch of
threads shares a common fd and fpos, reading fixed-size records:

thread1:  read(myfd, per_thread_buf, 512);
thread2:  pthread_cancel(thread1);
thread3:  read(myfd, per_thread_buf, 512);

This is *exactly* analogous to the epoll() scenario: a bunch of threads
are attempting to get data (either events or records) from a given
object. If thread1 gets cancelled after the completion of its read()
syscall then the file position will have been advanced but the data in
the per-thread buffer will be discarded. The third thread won't
(reliably) get the same data as thread1. It might get it randomly on
occasion if file position updates race, but you can't rely on that
behavior.

Here's how you make both of those models safe against thread
cancellation (assuming, of course, that process_one_buffer has no
cancellation points until after it saves the buffer state somewhere):

char per_thread_buf[512];
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
while (read(fd, per_thread_buf, 512) == 512) {
	pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
	process_one_buffer(per_thread_buf);
	pthread_testcancel();
	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
}
pthread_exit(NULL);

Look how trivial that is!  Isn't it fun?  Of course, once you look at it
this way you could do the same verbose crap that pthread_cancel/
pthread_setcancelstate/pthread_testcancel are doing with a simple int
protected by a pthread_mutex.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
@ 2007-05-08  4:35                 ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2007-05-08 4:35 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Davi Arnaut, Andrew Morton, Linux Kernel Mailing List

On Mon, 7 May 2007, Ulrich Drepper wrote:
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack but
> that's it.

That's bullsh*t, Uli, and you know it.

Whatever the thread read() into its buffer is effectively gone. You
don't know how much of the buffer was updated, so other threads cannot
use the data.

In fact, the exact *reverse* of what you claim is true. With "poll()"
or "select()", other threads *can* actually look at the result buffer,
since it has a known size and format that is independent of the return
value, unlike read().

But the real issue is that if you start canceling threads without any
other synchronization, you are going to lose data anyway. Claiming
anything else is just silly. The whole scenario you talk about is
nonsensical, never mind that read() actually handles it *less* well
rather than better, as you claim.

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
  2007-05-08  4:35                 ` Linus Torvalds
@ 2007-05-08  6:30                 ` Davide Libenzi
  2 siblings, 0 replies; 21+ messages in thread
From: Davide Libenzi @ 2007-05-08 6:30 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Ulrich Drepper wrote:

> On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > read(2) is a cancellation point too. So if the fine userspace code issues
> > a random pthread_cancel() to a thread handling that, data is lost together
> > with the session that thread was handling.
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack but
> that's it.

Ohh, is it different? Please, it is not even worth showing you how it is
exactly the same, because you are perfectly aware of it.

> We went over this for kevent discussions. I really am not willing to
> do it all again especially since there is no hope to achieve a
> satisfying result with poll. So, don't count my silence as agreement,
> it isn't.

You're climbing mirrors AFAICS, more than bringing any valid points.
And as far as I'm concerned, this thread is becoming very repetitive
and boring.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 22:19           ` Ulrich Drepper
@ 2007-05-07 23:15             ` Davi Arnaut
  2007-05-08  2:32               ` Ulrich Drepper
  1 sibling, 1 reply; 21+ messages in thread
From: Davi Arnaut @ 2007-05-07 23:15 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
>> See Linus's message on this same thread.
>
> No. I'm talking about the userlevel side, not kernel side.

So you probably knew the answer before asking the question.

> If a thread is canceled *after* it returns from the syscall but before
> it reports the event to the caller (i.e., while still in the syscall
> wrapper, thread cancellation rules require a check there) the event is
> lost.

Exactly. The same happens with sigwaitinfo() and various others.

> Linus was only talking about the kernel side and how in the exit path
> such events are not lost.

Anyway, we could extend epoll to be mmapable...

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 23:15 ` Davi Arnaut @ 2007-05-08 2:32 ` Ulrich Drepper 2007-05-08 3:24 ` Davi Arnaut 0 siblings, 1 reply; 21+ messages in thread From: Ulrich Drepper @ 2007-05-08 2:32 UTC (permalink / raw) To: Davi Arnaut Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote: > Anyway, we could extend epoll to be mmapable... Welcome to kevent, well, except with a lot more ballast and awkward interfaces. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:32               ` Ulrich Drepper
@ 2007-05-08  3:24                 ` Davi Arnaut
  0 siblings, 0 replies; 21+ messages in thread
From: Davi Arnaut @ 2007-05-08 3:24 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
>> Anyway, we could extend epoll to be mmapable...
>
> Welcome to kevent, well, except with a lot more ballast and awkward interfaces.

So an mmapable epoll is equivalent to kevent... great! Well, except
without a whole new giant bloated monolithic interface.

And this thread was about something else... :-)

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 21:00 ` Ulrich Drepper 2007-05-07 21:34 ` Davi Arnaut @ 2007-05-07 22:47 ` Davide Libenzi 1 sibling, 0 replies; 21+ messages in thread From: Davide Libenzi @ 2007-05-07 22:47 UTC (permalink / raw) To: Ulrich Drepper Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On Mon, 7 May 2007, Ulrich Drepper wrote: > On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote: > > A google search turns up a few users. It also addresses some complaints > > from Drepper. > > There is a huge problem with this approach and we're back at the > inadequate interface. > > select/poll/epoll are thread cancellation points. I.e., the thread > can be canceled before returning to the user. If this cancellation > happens between the kernel deciding to give this thread the event (and > no other thread) and the thread testing for cancellation in the libc > wrapper around the syscall, then the event is lost and the process(es) > might hang. > > With kevent we in the end fixed the problem by requiring that part of > the cancellation handling the thread tries to wake up another thread > waiting for the event queue. This is easily possible since the event > data is in the shared memory segment and it's just purely the thread > wakeup that is needed. So, by the same logic, every API that 1) returns something to userspace by canceling its internal kernel state 2) is not based on shared kernel/userspace memory, will break under your assumptions. Scary, because there's a pretty long list. - Davide ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 19:00   ` Davide Libenzi
  2007-05-05 21:42     ` Davi Arnaut
@ 2007-05-07 15:46     ` Chase Venters
  2007-05-07 17:18       ` Davide Libenzi
  1 sibling, 1 reply; 21+ messages in thread
From: Chase Venters @ 2007-05-07 15:46 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Sat, 5 May 2007, Davide Libenzi wrote:

> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring its events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theoretically, it makes sense. I said theoretically because all the
> epoll_wait MT use cases I've heard of use a single thread that does the
> epoll_wait, and then dispatches to worker threads. So the thundering
> herd is not in the picture. OTOH, it does not hurt either.
> But, that code is completely changed with the new single-pass epoll
> delivery code that is in -mm. So, I'd either wait for that code to go
> in, or I (or you, if you like) can make a patch against -mm.
>

*raises hand*

I'm working on event handling code for multiple projects right now, and
my method of calling epoll_wait() is to do so from several threads. I've
glanced at the epoll code but obviously haven't noticed the wake-all
behavior... good to know. I suppose I'm going to have to hack around this
problem by wrapping epoll_wait() calls in a mutex. That sucks - it means
other threads won't be able to 'get ahead' by preparing their wait before
it is their turn to dequeue events.

In any case, I think having multiple threads blocking on epoll_wait() is
a much saner idea than one thread which then passes out events, so I must
voice my support for fixing this case. Why this is the exception instead
of the norm is a little baffling, but I've seen so many perverse things
in multi-threaded code...

> - Davide

Thanks,
Chase

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 15:46     ` Chase Venters
@ 2007-05-07 17:18       ` Davide Libenzi
  2007-05-07 18:17         ` Chase Venters
  0 siblings, 1 reply; 21+ messages in thread
From: Davide Libenzi @ 2007-05-07 17:18 UTC (permalink / raw)
  To: Chase Venters
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Chase Venters wrote:

> I'm working on event handling code for multiple projects right now, and
> my method of calling epoll_wait() is to do so from several threads. I've
> glanced at the epoll code but obviously haven't noticed the wake-all
> behavior... good to know. I suppose I'm going to have to hack around
> this problem by wrapping epoll_wait() calls in a mutex. That sucks - it
> means other threads won't be able to 'get ahead' by preparing their wait
> before it is their turn to dequeue events.
>
> In any case, I think having multiple threads blocking on epoll_wait() is
> a much saner idea than one thread which then passes out events, so I
> must voice my support for fixing this case. Why this is the exception
> instead of the norm is a little baffling, but I've seen so many perverse
> things in multi-threaded code...

The problem that you can have with multiple threads calling epoll_wait()
on an SMP system is that if you sweep 100 events in one thread, and this
thread goes alone in processing those, you may have other CPUs idle while
that thread is handling them. Either you call epoll_wait() from multiple
threads while keeping the event buffer passed to epoll_wait() fairly
limited, or you use a single epoll_wait() fetcher with a queue(s) from
which worker threads pull.
Davi's patch will be re-factored against 22-rc1 and submitted in any case
though.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 17:18 ` Davide Libenzi @ 2007-05-07 18:17 ` Chase Venters 0 siblings, 0 replies; 21+ messages in thread From: Chase Venters @ 2007-05-07 18:17 UTC (permalink / raw) To: Davide Libenzi Cc: Chase Venters, Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Davide Libenzi wrote:

> On Mon, 7 May 2007, Chase Venters wrote:
>
>> I'm working on event handling code for multiple projects right now, and my
>> method of calling epoll_wait() is to do so from several threads. I've glanced
>> at the epoll code but obviously haven't noticed the wake-all behavior... good
>> to know. I suppose I'm going to have to hack around this problem by wrapping
>> epoll_wait() calls in a mutex. That sucks - it means other threads won't be
>> able to 'get ahead' by preparing their wait before it is their turn to dequeue
>> events.
>>
>> In any case, I think having multiple threads blocking on epoll_wait() is a
>> much saner idea than one thread which then passes out events, so I must voice
>> my support for fixing this case. Why this is the exception instead of the norm
>> is a little baffling, but I've seen so many perverse things in multi-threaded
>> code...
>
> The problem that you can have with multiple threads calling epoll_wait()
> on an SMP system is that if you sweep 100 events in one thread, and this
> thread goes alone in processing those, you may have other CPUs idle while
> that thread is handling them. Either you call epoll_wait() from multiple
> threads while keeping the event buffer passed to epoll_wait() fairly
> limited, or you use a single epoll_wait() fetcher with a queue (or queues)
> from which worker threads pull.

Working with smaller quanta is indeed the right thing to do. In any case,
let's consider why you're getting 100 events from one epoll_wait():

1. You have a single thread doing the dequeue, and it is taking a long
   time (perhaps due to the time it is taking to requeue the work in other
   threads).

2. Your load is so high that you are taking lots and lots of events, in
   which case the other epoll_wait() threads are going to be woken up very
   soon with work anyway. In this scenario you will be "scheduling" work
   at "odd" times based on its arrival, but that's just another argument
   to use smaller quanta.

I'm referring specifically to edge-triggered behavior, btw. I find
edge-triggered development far easier and saner in a multi-threaded
environment, and doing level-triggered and multi-threaded at the same time
certainly seems like the wrong thing to do.

In any case, I see little point in a thread whose job is simply to move
something from queue A (epoll ready list) to queue B (thread work list).
My latest code basically uses epoll_wait() as a load-balancing mechanism
to pass out work. The quanta are fairly small. There may be situations
where you get a burst of traffic that one thread handles while others are
momentarily idle, but handling that traffic is a very quick operation (and
everything is non-blocking). You really only need the other threads to
participate when the load starts to get to the point where the
epoll_wait() calls will be constantly returning anyway.

> Davi's patch will be re-factored against 22-rc1 and submitted in any case
> though.

Great. I'm just glad I saw this mail -- I probably would have burned quite
some time in the coming weeks trying to figure out why my epoll code
wasn't running quite smoothly.

> - Davide

Thanks,
Chase

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2007-05-08 6:30 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070504225730.490334000@haxent.com.br>
2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
2007-05-05 4:15 ` Eric Dumazet
2007-05-05 4:44 ` Linus Torvalds
2007-05-05 5:47 ` Eric Dumazet
2007-05-05 19:00 ` Davide Libenzi
2007-05-05 21:42 ` Davi Arnaut
2007-05-07 21:00 ` Ulrich Drepper
2007-05-07 21:34 ` Davi Arnaut
2007-05-07 22:19 ` Ulrich Drepper
2007-05-07 22:35 ` Davide Libenzi
2007-05-08 2:49 ` Ulrich Drepper
2007-05-08 3:56 ` Kyle Moffett
2007-05-08 4:35 ` Linus Torvalds
2007-05-08 6:30 ` Davide Libenzi
2007-05-07 23:15 ` Davi Arnaut
2007-05-08 2:32 ` Ulrich Drepper
2007-05-08 3:24 ` Davi Arnaut
2007-05-07 22:47 ` Davide Libenzi
2007-05-07 15:46 ` Chase Venters
2007-05-07 17:18 ` Davide Libenzi
2007-05-07 18:17 ` Chase Venters
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox