* [PATCH] rfc: threaded epoll_wait thundering herd
       [not found] <20070504225730.490334000@haxent.com.br>
@ 2007-05-04 23:37 ` Davi Arnaut
  2007-05-05  4:15   ` Eric Dumazet
  2007-05-05 19:00   ` Davide Libenzi
  0 siblings, 2 replies; 21+ messages in thread
From: Davi Arnaut @ 2007-05-04 23:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Davide Libenzi, Linus Torvalds, Linux Kernel Mailing List

Hi,

If multiple threads are parked on epoll_wait (on a single epoll fd) and
events become available, epoll performs a wake up of all threads of the
poll wait list, causing a thundering herd of processes trying to grab
the eventpoll lock.

This patch addresses this by using exclusive waiters (wake one). Once
the exclusive thread finishes transferring its events, a new thread
is woken if there are more events available.

Makes sense?

Signed-off-by: Davi E. M. Arnaut <davi@haxent.com.br>

---
 fs/eventpoll.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-2.6/fs/eventpoll.c
===================================================================
--- linux-2.6.orig/fs/eventpoll.c
+++ linux-2.6/fs/eventpoll.c
@@ -1491,6 +1491,12 @@ static void ep_reinject_items(struct eve
 		}
 	}

+	/*
+	 * If there are events available, wake up the next waiter, if any.
+	 */
+	if (!ricnt)
+		ricnt = !list_empty(&ep->rdllist);
+
 	if (ricnt) {
 		/*
 		 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
@@ -1570,6 +1576,7 @@ retry:
 	 * ep_poll_callback() when events will become available.
 	 */
 	init_waitqueue_entry(&wait, current);
+	wait.flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue(&ep->wq, &wait);

 	for (;;) {
--

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
@ 2007-05-05  4:15   ` Eric Dumazet
  2007-05-05  4:44     ` Linus Torvalds
  2007-05-05 19:00   ` Davide Libenzi
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2007-05-05 4:15 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Andrew Morton, Davide Libenzi, Linus Torvalds, Linux Kernel Mailing List

Davi Arnaut a écrit :
> Hi,
>
> If multiple threads are parked on epoll_wait (on a single epoll fd) and
> events become available, epoll performs a wake up of all threads of the
> poll wait list, causing a thundering herd of processes trying to grab
> the eventpoll lock.
>
> This patch addresses this by using exclusive waiters (wake one). Once
> the exclusive thread finishes transferring its events, a new thread
> is woken if there are more events available.
>
> Makes sense?

Yes it makes sense.

But... what happens if the thread that was chosen exits from the loop in
ep_poll() with res = -EINTR (because of signal_pending(current))?

Methinks in this case some ready events can wait forever (up to the next
ep_poll_callback(), or until another thread enters ep_poll()).

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05  4:15   ` Eric Dumazet
@ 2007-05-05  4:44     ` Linus Torvalds
  2007-05-05  5:47       ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2007-05-05 4:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davi Arnaut, Andrew Morton, Davide Libenzi, Linux Kernel Mailing List

On Sat, 5 May 2007, Eric Dumazet wrote:
>
> But... what happens if the thread that was chosen exits from the loop in
> ep_poll() with res = -EINTR (because of signal_pending(current))

Not a problem.

What happens is that an exclusive wake-up stops on the first entry in the
wait-queue that it actually *wakes*up*, but if some task has just marked
itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it
will just be marked TASK_RUNNING, and that in itself isn't enough to cause
the "exclusive" test to trigger.

The code in sched.c is subtle, but worth understanding if you care about
these things. You should look at:

 - try_to_wake_up() - this is the default wakeup function (and the one
   that should work correctly - I'm not going to guarantee that any of the
   other specialty-wakeup-functions do so)

   The return value is the important thing. Returning non-zero is
   "success", and implies that we actually activated it.

   See the "goto out_running" case for the case where the process was
   still actually on the run-queues, and we just ended up setting
   "p->state = TASK_RUNNING" - we still return 0, and the "exclusive"
   logic will not trigger.

 - __wake_up_common: this is the thing that _calls_ the above, and which
   cares about the return value above. It does

	if (curr->func(curr, mode, sync, key) &&
	    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)

   ie it only decrements (and triggers) the nr_exclusive thing when the
   wakeup-function returned non-zero (and when the waitqueue entry was
   marked exclusive, of course).

So what does all this subtlety *mean*?

Walk through it. It means that it is safe to do the

	if (signal_pending())
		return -EINTR;

kind of thing, because *when* you do this, you obviously are always on the
run-queue (otherwise the process wouldn't be running, and couldn't be
doing the test). So if there is somebody else waking you up right then and
there, they'll never count your wakeup as an exclusive one, and they will
wake up at least one other real exclusive waiter.

(IOW, you get a very very small probability of a very very small
"thundering herd" - obviously it won't be "thundering" any more, it will
be more of a "whispering herdlet").

The Linux kernel sleep/wakeup thing is really quite nifty and smart. And
very few people realize just *how* nifty and elegant (and efficient) it
is. Hopefully a few more people appreciate its beauty and subtlety now ;)

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05  4:44     ` Linus Torvalds
@ 2007-05-05  5:47       ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2007-05-05 5:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Davi Arnaut, Andrew Morton, Davide Libenzi, Linux Kernel Mailing List

Linus Torvalds a écrit :
>
> On Sat, 5 May 2007, Eric Dumazet wrote:
>> But... what happens if the thread that was chosen exits from the loop in
>> ep_poll() with res = -EINTR (because of signal_pending(current))
>
> Not a problem.
>
> What happens is that an exclusive wake-up stops on the first entry in the
> wait-queue that it actually *wakes*up*, but if some task has just marked
> itself as being TASK_UNINTERRUPTIBLE, but is still on the run-queue, it
> will just be marked TASK_RUNNING and that in itself isn't enough to cause
> the "exclusive" test to trigger.
>
> The code in sched.c is subtle, but worth understanding if you care about
> these things. You should look at:
>
>  - try_to_wake_up() - this is the default wakeup function (and the one
>    that should work correctly - I'm not going to guarantee that any of the
>    other specialty-wakeup-functions do so)
>
>    The return value is the important thing. Returning non-zero is
>    "success", and implies that we actually activated it.
>
>    See the "goto out_running" case for the case where the process was
>    still actually on the run-queues, and we just ended up setting
>    "p->state = TASK_RUNNING" - we still return 0, and the "exclusive"
>    logic will not trigger.
>
>  - __wake_up_common: this is the thing that _calls_ the above, and which
>    cares about the return value above. It does
>
> 	if (curr->func(curr, mode, sync, key) &&
> 	    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
>
>    ie it only decrements (and triggers) the nr_exclusive thing when the
>    wakeup-function returned non-zero (and when the waitqueue entry was
>    marked exclusive, of course).
>
> So what does all this subtlety *mean*?
>
> Walk through it. It means that it is safe to do the
>
> 	if (signal_pending())
> 		return -EINTR;
>
> kind of thing, because *when* you do this, you obviously are always on the
> run-queue (otherwise the process wouldn't be running, and couldn't be
> doing the test). So if there is somebody else waking you up right then and
> there, they'll never count your wakeup as an exclusive one, and they will
> wake up at least one other real exclusive waiter.
>
> (IOW, you get a very very small probability of a very very small
> "thundering herd" - obviously it won't be "thundering" any more, it will
> be more of a "whispering herdlet").
>
> The Linux kernel sleep/wakeup thing is really quite nifty and smart. And
> very few people realize just *how* nifty and elegant (and efficient) it
> is. Hopefully a few more people appreciate its beauty and subtlety now ;)
>

Thank you Linus for these detailed explanations.

I think I was frightened not by the wakeup logic, but by the possibility
in SMP that a signal could be delivered to the thread just after it has
been selected.

Looking again at ep_poll(), I see:

	set_current_state(TASK_INTERRUPTIBLE);  [*]
	if (!list_empty(&ep->rdllist) || !jtimeout)
		break;
	if (signal_pending(current)) {
		res = -EINTR;
		break;
	}

So the test against signal_pending() is not done if an event is present
in the ready list: it should be delivered even if a signal is pending.

I missed this bit earlier...

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
  2007-05-05  4:15   ` Eric Dumazet
@ 2007-05-05 19:00   ` Davide Libenzi
  2007-05-05 21:42     ` Davi Arnaut
  2007-05-07 15:46     ` Chase Venters
  1 sibling, 2 replies; 21+ messages in thread
From: Davide Libenzi @ 2007-05-05 19:00 UTC (permalink / raw)
  To: Davi Arnaut; +Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Fri, 4 May 2007, Davi Arnaut wrote:

> Hi,
>
> If multiple threads are parked on epoll_wait (on a single epoll fd) and
> events become available, epoll performs a wake up of all threads of the
> poll wait list, causing a thundering herd of processes trying to grab
> the eventpoll lock.
>
> This patch addresses this by using exclusive waiters (wake one). Once
> the exclusive thread finishes transferring its events, a new thread
> is woken if there are more events available.
>
> Makes sense?

Theoretically, it makes sense. I said theoretically because all the
epoll_wait MT use cases I've heard of use a single thread that does the
epoll_wait, and then dispatches to worker threads. So the thundering herd
is not in the picture. OTOH, it does not hurt either.
But, that code is completely changed with the new single-pass epoll
delivery code that is in -mm. So, I'd either wait for that code to go in,
or I (or you, if you like) can make a patch against -mm.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 19:00   ` Davide Libenzi
@ 2007-05-05 21:42     ` Davi Arnaut
  2007-05-07 21:00       ` Ulrich Drepper
  1 sibling, 1 reply; 21+ messages in thread
From: Davi Arnaut @ 2007-05-05 21:42 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Davide Libenzi wrote:
> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring its events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theoretically, it makes sense. I said theoretically because all the
> epoll_wait MT use cases I've heard of use a single thread that does the
> epoll_wait, and then dispatches to worker threads. So the thundering
> herd is not in the picture. OTOH, it does not hurt either.

A google search turns up a few users. It also addresses some complaints
from Drepper.

> But, that code is completely changed with the new single-pass epoll
> delivery code that is in -mm. So, I'd either wait for that code to go
> in, or I (or you, if you like) can make a patch against -mm.

Feel free to update the patch for -mm and to include my Signed-off-by.

Thanks.

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 21:42     ` Davi Arnaut
@ 2007-05-07 21:00       ` Ulrich Drepper
  2007-05-07 21:34         ` Davi Arnaut
  2007-05-07 22:47         ` Davide Libenzi
  0 siblings, 2 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-07 21:00 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote:
> A google search turns up a few users. It also addresses some complaints
> from Drepper.

There is a huge problem with this approach and we're back at the
inadequate interface.

select/poll/epoll are thread cancellation points. I.e., the thread
can be canceled before returning to the user. If this cancellation
happens between the kernel deciding to give this thread the event (and
no other thread) and the thread testing for cancellation in the libc
wrapper around the syscall, then the event is lost and the process(es)
might hang.

With kevent we in the end fixed the problem by requiring that, as part
of the cancellation handling, the thread tries to wake up another thread
waiting for the event queue. This is easily possible since the event
data is in the shared memory segment and it's just purely the thread
wakeup that is needed.

To make something like this work for poll you'd have to push the
revents fields of the result back to the kernel, which might then cause
another thread to be woken up. I find this too ugly to consider. You
guys will not believe this but I really thought all these things
through before writing the OLS paper. poll cannot be salvaged.

There is another thing about this selective wakeup: do I assume it
correctly that if more than one file descriptor is reported ready, more
than one thread is woken? I think nothing else can be justified. Will
in this case both threads get the same set of descriptors reported, or
will they see disjunct sets?

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 21:00 ` Ulrich Drepper @ 2007-05-07 21:34 ` Davi Arnaut 2007-05-07 22:19 ` Ulrich Drepper 2007-05-07 22:47 ` Davide Libenzi 1 sibling, 1 reply; 21+ messages in thread From: Davi Arnaut @ 2007-05-07 21:34 UTC (permalink / raw) To: Ulrich Drepper Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List Ulrich Drepper wrote: > On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote: >> A google search turns up a few users. It also addresses some complaints >> from Drepper. > > There is a huge problem with this approach and we're back at the > inadequate interface. > > select/poll/epoll are thread cancellation points. I.e., the thread > can be canceled before returning to the user. If this cancellation > happens between the kernel deciding to give this thread the event (and > no other thread) and the thread testing for cancellation in the libc > wrapper around the syscall, then the event is lost and the process(es) > might hang. > > With kevent we in the end fixed the problem by requiring that part of > the cancellation handling the thread tries to wake up another thread > waiting for the event queue. This is easily possible since the event > data is in the shared memory segment and it's just purely the thread > wakeup that is needed. > > To make something like this work for poll you'd have to push back the > revents fields of the result back to the kernel which might then cause > another thread to be woken up. I find this too ugly to consider. You > guys will not believe this but I really thought all these things > through before writing the OLS paper. poll cannot be salvaged. See Linus's message on this same thread. > There is another thing about this selective wakeup: do I assume it > correctly that if more than one file descriptor is reported ready more > than one thread is woken? I think nothing else can be justified. Correct. 
> Will in this case both threads get the same set of descriptors
> reported or will they see disjunct sets?

Disjunct. In reality, only if the event is edge-triggered.

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 21:34         ` Davi Arnaut
@ 2007-05-07 22:19           ` Ulrich Drepper
  2007-05-07 22:35             ` Davide Libenzi
  2007-05-07 23:15             ` Davi Arnaut
  0 siblings, 2 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-07 22:19 UTC (permalink / raw)
  To: Davi Arnaut
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
> See Linus's message on this same thread.

No. I'm talking about the userlevel side, not the kernel side.

If a thread is canceled *after* it returns from the syscall but before
it reports the event to the caller (i.e., while still in the syscall
wrapper; thread cancellation rules require a check there) the event is
lost. Linus was only talking about the kernel side and how in the exit
path such events are not lost.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 22:19 ` Ulrich Drepper @ 2007-05-07 22:35 ` Davide Libenzi 2007-05-08 2:49 ` Ulrich Drepper 2007-05-07 23:15 ` Davi Arnaut 1 sibling, 1 reply; 21+ messages in thread From: Davide Libenzi @ 2007-05-07 22:35 UTC (permalink / raw) To: Ulrich Drepper Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On Mon, 7 May 2007, Ulrich Drepper wrote: > On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote: > > See Linus's message on this same thread. > > No. I'm talking about the userlevel side, not kernel side. > > If a thread is canceled *after* it returns from the syscall but before > it reports the event to the call (i.e., while still in the syscall > wrapper, thread cancellation rules require a check there) the event is > lost. read(2) is a cancellation point too. So if the fine userspace code issues a random pthread_cancel() to a thread handling that, data is lost together with the session that thread was handling. Hmm, I wonder how the world could have functioned so far. Bottom line is, if you really want to throw random cancels to your worker threads, you better wrap them into pthread_cleanup_push(). Because otherwise, no matter where your cancel hits, you end up with a broken system. - Davide ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 22:35             ` Davide Libenzi
@ 2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Ulrich Drepper @ 2007-05-08 2:49 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> read(2) is a cancellation point too. So if the fine userspace code issues
> a random pthread_cancel() to a thread handling that, data is lost together
> with the session that thread was handling.

This is absolutely not comparable. When read/write is canceled no
data is lost. Some other thread might have to pick up the slack but
that's it.

With your poll extensions, one thread gets notified and then the kernel
forgets about the event information. Now this information is only
available in the thread itself. How can a cancellation handler
communicate this to all the other threads waiting in poll()? Every
event which is reported and forgotten must be communicated. To make
things more fun, the various waiters can be in different processes.

We went over this for the kevent discussions. I really am not willing to
do it all again, especially since there is no hope of achieving a
satisfying result with poll. So, don't count my silence as agreement;
it isn't.

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
@ 2007-05-08  3:56                 ` Kyle Moffett
  2007-05-08  4:35                 ` Linus Torvalds
  2007-05-08  6:30                 ` Davide Libenzi
  2 siblings, 0 replies; 21+ messages in thread
From: Kyle Moffett @ 2007-05-08 3:56 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On May 07, 2007, at 22:49:04, Ulrich Drepper wrote:
> On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
>> read(2) is a cancellation point too. So if the fine userspace code
>> issues a random pthread_cancel() to a thread handling that, data
>> is lost together with the session that thread was handling.
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack
> but that's it.

Uhh, how can you claim "no data is lost" in this scenario: a bunch of
threads shares a common fd and fpos, reading fixed-size records:

thread1:  read(myfd, per_thread_buf, 512);
thread2:  pthread_cancel(thread1);
thread3:  read(myfd, per_thread_buf, 512);

This is *exactly* analogous to the epoll() scenario: a bunch of threads
are attempting to get data (either events or records) from a given
object. If thread1 gets cancelled after the completion of its read()
syscall then the file position will have been advanced but the data in
the per-thread buffer will be discarded. The third thread won't
(reliably) get the same data as thread1. It might get it randomly on
occasion if file position updates race, but you can't rely on that
behavior.

Here's how you make both of those models safe against thread
cancellation (assuming, of course, that process_one_buffer has no
cancellation points until after it saves the buffer state somewhere):

char per_thread_buf[512];
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
while (read(fd, per_thread_buf, 512) == 512) {
	pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
	process_one_buffer(per_thread_buf);
	pthread_testcancel();
	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
}
pthread_exit(NULL);

Look how trivial that is!  Isn't it fun?  Of course, once you look at it
this way you could do the same verbose crap that pthread_cancel/
pthread_setcancelstate/pthread_testcancel are doing with a simple int
protected by a pthread_mutex.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
@ 2007-05-08  4:35                 ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2007-05-08 4:35 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Davi Arnaut, Andrew Morton, Linux Kernel Mailing List

On Mon, 7 May 2007, Ulrich Drepper wrote:
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack but
> that's it.

That's bullsh*t, Uli, and you know it.

Whatever the thread read() into its buffer is effectively gone. You
don't know how much of the buffer was updated, so other threads cannot
use the data.

In fact, the exact *reverse* of what you claim is true. With "poll()"
or "select()", other threads *can* actually look at the result buffer,
since it has a known size and format that is independent of the return
value, unlike read().

But the real issue is that if you start canceling threads without any
other synchronization, you are going to lose data anyway. Claiming
anything else is just silly. The whole scenario you talk about is
nonsensical, never mind that read() actually handles it *less* well
rather than better, as you claim.

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:49               ` Ulrich Drepper
  2007-05-08  3:56                 ` Kyle Moffett
  2007-05-08  4:35                 ` Linus Torvalds
@ 2007-05-08  6:30                 ` Davide Libenzi
  2 siblings, 0 replies; 21+ messages in thread
From: Davide Libenzi @ 2007-05-08 6:30 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Ulrich Drepper wrote:

> On 5/7/07, Davide Libenzi <davidel@xmailserver.org> wrote:
> > read(2) is a cancellation point too. So if the fine userspace code issues
> > a random pthread_cancel() to a thread handling that, data is lost together
> > with the session that thread was handling.
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack but
> that's it.

Ohh, is it different? Please, it is not even worth showing you how it is
exactly the same, because you are perfectly aware of it.

> We went over this for kevent discussions. I really am not willing to
> do it all again especially since there is no hope to achieve a
> satisfying result with poll. So, don't count my silence as agreement,
> it isn't.

You're climbing mirrors AFAICS, more than bringing any valid points.
And as far as I'm concerned, this thread is becoming very repetitive
and boring.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 22:19           ` Ulrich Drepper
@ 2007-05-07 23:15             ` Davi Arnaut
  2007-05-08  2:32               ` Ulrich Drepper
  1 sibling, 1 reply; 21+ messages in thread
From: Davi Arnaut @ 2007-05-07 23:15 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
>> See Linus's message on this same thread.
>
> No. I'm talking about the userlevel side, not kernel side.

So you probably knew the answer before asking the question.

> If a thread is canceled *after* it returns from the syscall but before
> it reports the event to the caller (i.e., while still in the syscall
> wrapper, thread cancellation rules require a check there) the event is
> lost.

Exactly. The same happens with sigwaitinfo() and various others.

> Linus was only talking about the kernel side and how in the exit path
> such events are not lost.

Anyway, we could extend epoll to be mmapable...

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 23:15 ` Davi Arnaut @ 2007-05-08 2:32 ` Ulrich Drepper 2007-05-08 3:24 ` Davi Arnaut 0 siblings, 1 reply; 21+ messages in thread From: Ulrich Drepper @ 2007-05-08 2:32 UTC (permalink / raw) To: Davi Arnaut Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote: > Anyway, we could extend epoll to be mmapable... Welcome to kevent, well, except with a lot more ballast and awkward interfaces. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-08  2:32               ` Ulrich Drepper
@ 2007-05-08  3:24                 ` Davi Arnaut
  0 siblings, 0 replies; 21+ messages in thread
From: Davi Arnaut @ 2007-05-08 3:24 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Davide Libenzi, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <davi@haxent.com.br> wrote:
>> Anyway, we could extend epoll to be mmapable...
>
> Welcome to kevent, well, except with a lot more ballast and awkward interfaces.

So an mmapable epoll is equivalent to kevent... great! Well, except
without a whole new giant bloated monolithic interface.

And this thread was about something else... :-)

--
Davi Arnaut

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 21:00 ` Ulrich Drepper 2007-05-07 21:34 ` Davi Arnaut @ 2007-05-07 22:47 ` Davide Libenzi 1 sibling, 0 replies; 21+ messages in thread From: Davide Libenzi @ 2007-05-07 22:47 UTC (permalink / raw) To: Ulrich Drepper Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List On Mon, 7 May 2007, Ulrich Drepper wrote: > On 5/5/07, Davi Arnaut <davi@haxent.com.br> wrote: > > A google search turns up a few users. It also addresses some complaints > > from Drepper. > > There is a huge problem with this approach and we're back at the > inadequate interface. > > select/poll/epoll are thread cancellation points. I.e., the thread > can be canceled before returning to the user. If this cancellation > happens between the kernel deciding to give this thread the event (and > no other thread) and the thread testing for cancellation in the libc > wrapper around the syscall, then the event is lost and the process(es) > might hang. > > With kevent we in the end fixed the problem by requiring that part of > the cancellation handling the thread tries to wake up another thread > waiting for the event queue. This is easily possible since the event > data is in the shared memory segment and it's just purely the thread > wakeup that is needed. So, by the same logic, every API that 1) returns something to userspace by canceling its internal kernel state 2) is not based on shared kernel/userspace memory, will break under your assumptions. Scary, because there's a pretty long list. - Davide ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-05 19:00   ` Davide Libenzi
  2007-05-05 21:42     ` Davi Arnaut
@ 2007-05-07 15:46     ` Chase Venters
  2007-05-07 17:18       ` Davide Libenzi
  1 sibling, 1 reply; 21+ messages in thread
From: Chase Venters @ 2007-05-07 15:46 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Sat, 5 May 2007, Davide Libenzi wrote:

> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring its events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theoretically, it makes sense. I said theoretically because all the
> epoll_wait MT use cases I've heard of use a single thread that does the
> epoll_wait, and then dispatches to worker threads. So the thundering
> herd is not in the picture. OTOH, it does not hurt either.
> But, that code is completely changed with the new single-pass epoll
> delivery code that is in -mm. So, I'd either wait for that code to go
> in, or I (or you, if you like) can make a patch against -mm.
>

*raises hand*

I'm working on event handling code for multiple projects right now, and
my method of calling epoll_wait() is to do so from several threads. I've
glanced at the epoll code but obviously haven't noticed the wake-all
behavior... good to know. I suppose I'm going to have to hack around this
problem by wrapping epoll_wait() calls in a mutex. That sucks - it means
other threads won't be able to 'get ahead' by preparing their wait before
it is their turn to dequeue events.

In any case, I think having multiple threads blocking on epoll_wait() is
a much saner idea than one thread which then passes out events, so I must
voice my support for fixing this case. Why this is the exception instead
of the norm is a little baffling, but I've seen so many perverse things
in multi-threaded code...

> - Davide

Thanks,
Chase

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd
  2007-05-07 15:46     ` Chase Venters
@ 2007-05-07 17:18       ` Davide Libenzi
  2007-05-07 18:17         ` Chase Venters
  0 siblings, 1 reply; 21+ messages in thread
From: Davide Libenzi @ 2007-05-07 17:18 UTC (permalink / raw)
  To: Chase Venters
  Cc: Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Chase Venters wrote:

> I'm working on event handling code for multiple projects right now, and
> my method of calling epoll_wait() is to do so from several threads. I've
> glanced at the epoll code but obviously haven't noticed the wake-all
> behavior... good to know. I suppose I'm going to have to hack around
> this problem by wrapping epoll_wait() calls in a mutex. That sucks - it
> means other threads won't be able to 'get ahead' by preparing their wait
> before it is their turn to dequeue events.
>
> In any case, I think having multiple threads blocking on epoll_wait() is
> a much saner idea than one thread which then passes out events, so I
> must voice my support for fixing this case. Why this is the exception
> instead of the norm is a little baffling, but I've seen so many perverse
> things in multi-threaded code...

The problem that you can have with multiple threads calling epoll_wait()
on an SMP system is that if you sweep 100 events in one thread, and this
thread goes alone in processing those, you may have other CPUs idle while
that thread is handling them. Either you call epoll_wait() from multiple
threads while keeping the event buffer passed to epoll_wait() fairly
limited, or you use a single epoll_wait() fetcher with a queue(s) from
which worker threads pull.
Davi's patch will be re-factored against 22-rc1 and submitted in any case
though.


- Davide

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] rfc: threaded epoll_wait thundering herd 2007-05-07 17:18 ` Davide Libenzi @ 2007-05-07 18:17 ` Chase Venters 0 siblings, 0 replies; 21+ messages in thread From: Chase Venters @ 2007-05-07 18:17 UTC (permalink / raw) To: Davide Libenzi Cc: Chase Venters, Davi Arnaut, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List

On Mon, 7 May 2007, Davide Libenzi wrote:

> On Mon, 7 May 2007, Chase Venters wrote:
>
>> I'm working on event handling code for multiple projects right now, and my
>> method of calling epoll_wait() is to do so from several threads. I've glanced
>> at the epoll code but obviously haven't noticed the wake-all behavior... good
>> to know. I suppose I'm going to have to hack around this problem by wrapping
>> epoll_wait() calls in a mutex. That sucks - it means other threads won't be
>> able to 'get ahead' by preparing their wait before it is their turn to dequeue
>> events.
>>
>> In any case, I think having multiple threads blocking on epoll_wait() is a
>> much saner idea than one thread which then passes out events, so I must voice
>> my support for fixing this case. Why this is the exception instead of the norm
>> is a little baffling, but I've seen so many perverse things in multi-threaded
>> code...
>
> The problem that you can have with multiple threads calling epoll_wait()
> on an SMP system is that if you sweep 100 events in one thread, and this
> thread goes alone in processing those, you may have other CPUs idle while
> that thread is handling them. Either you call epoll_wait() from multiple
> threads while keeping the event buffer passed to epoll_wait() fairly
> limited, or you use a single epoll_wait() fetcher with a queue (or queues)
> from which worker threads pull.

Working with smaller quanta is indeed the right thing to do. In any case,
let's consider why you're getting 100 events from one epoll_wait():

1. You have a single thread doing the dequeue, and it is taking a long
   time (perhaps due to the time it is taking to requeue the work in other
   threads).

2. Your load is so high that you are taking lots and lots of events, in
   which case the other epoll_wait() threads are going to be woken up very
   soon with work anyway. In this scenario you will be "scheduling" work
   at "odd" times based on its arrival, but that's just another argument
   to use smaller quanta.

I'm referring specifically to edge-triggered behavior, btw. I find
edge-triggered development far easier and saner in a multi-threaded
environment, and doing level-triggered and multi-threaded at the same time
certainly seems like the wrong thing to do.

In any case, I see little point in a thread whose job is simply to move
something from queue A (epoll ready list) to queue B (thread work list).
My latest code basically uses epoll_wait() as a load-balancing mechanism
to pass out work. The quanta are fairly small. There may be situations
where you get a burst of traffic that one thread handles while others are
momentarily idle, but handling that traffic is a very quick operation (and
everything is non-blocking). You really only need the other threads to
participate when the load starts to get to the point where the
epoll_wait() calls will be constantly returning anyway.

> Davi's patch will be re-factored against 22-rc1 and submitted in any case
> though.

Great. I'm just glad I saw this mail -- I probably would have burned quite
some time in the coming weeks trying to figure out why my epoll code
wasn't running quite smoothly.

> - Davide

Thanks,
Chase

^ permalink raw reply	[flat|nested] 21+ messages in thread
end of thread, other threads:[~2007-05-08 6:30 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20070504225730.490334000@haxent.com.br>
2007-05-04 23:37 ` [PATCH] rfc: threaded epoll_wait thundering herd Davi Arnaut
2007-05-05 4:15 ` Eric Dumazet
2007-05-05 4:44 ` Linus Torvalds
2007-05-05 5:47 ` Eric Dumazet
2007-05-05 19:00 ` Davide Libenzi
2007-05-05 21:42 ` Davi Arnaut
2007-05-07 21:00 ` Ulrich Drepper
2007-05-07 21:34 ` Davi Arnaut
2007-05-07 22:19 ` Ulrich Drepper
2007-05-07 22:35 ` Davide Libenzi
2007-05-08 2:49 ` Ulrich Drepper
2007-05-08 3:56 ` Kyle Moffett
2007-05-08 4:35 ` Linus Torvalds
2007-05-08 6:30 ` Davide Libenzi
2007-05-07 23:15 ` Davi Arnaut
2007-05-08 2:32 ` Ulrich Drepper
2007-05-08 3:24 ` Davi Arnaut
2007-05-07 22:47 ` Davide Libenzi
2007-05-07 15:46 ` Chase Venters
2007-05-07 17:18 ` Davide Libenzi
2007-05-07 18:17 ` Chase Venters
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox