* [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
@ 2009-09-23 17:48 Mathieu Desnoyers
2009-09-23 18:04 ` Chris Friesen
2009-10-01 14:40 ` Paul E. McKenney
0 siblings, 2 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-09-23 17:48 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: linux-kernel
Hi,
When implementing the call_rcu() "worker thread" in userspace, I ran
into the problem that it had to be woken up periodically to check if
there are any callbacks to execute. However, I can easily imagine that
such periodic polling does not fit well with the idea of "green
computing".
Therefore, I've looked at ways to have the call_rcu() callers wake up
this worker thread when callbacks are enqueued. However, I don't want to
take any lock, and the fast path (when no wake-up is required) should not
cause any cache-line exchange.
Here are the primitives I've created. I'd like to have feedback on my
futex use, just to make sure I did not make any incorrect assumptions.
This could also eventually be used in the QSBR Userspace RCU quiescent
state handling, and in the mb/signal userspace RCU flavors when exiting
RCU read-side critical sections, to ensure synchronize_rcu() does not
busy-wait for too long.
/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(atomic_read(&defer_thread_futex) == -1))
		atomic_set(&defer_thread_futex, 0);
	futex(&defer_thread_futex, FUTEX_WAKE,
	      0, NULL, NULL, 0);
}
/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	atomic_dec(&defer_thread_futex);
	if (atomic_read(&defer_thread_futex) == -1)
		futex(&defer_thread_futex, FUTEX_WAIT, -1,
		      NULL, NULL, 0);
}
Thanks,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Chris Friesen @ 2009-09-23 18:04 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Paul E. McKenney, linux-kernel

On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:

> Here are the primitives I've created. I'd like to have feedback on my
> futex use, just to make sure I did not do any incorrect assumptions.

> /*
>  * Wake-up any waiting defer thread. Called from many concurrent threads.
>  */
> static void wake_up_defer(void)
> {
> 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> 		atomic_set(&defer_thread_futex, 0);
> 	futex(&defer_thread_futex, FUTEX_WAKE,
> 	      0, NULL, NULL, 0);
> }

Is it a problem if multiple threads all hit the defer_thread_futex==-1
case simultaneously?  If so, maybe this should use an atomic
test-and-set operation so that only one thread actually calls futex().

> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	if (atomic_read(&defer_thread_futex) == -1)
> 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }

Is it a problem if the value of defer_thread_futex changes to zero after
the dec but before the test?

Chris
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 19:03 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Chris Friesen (cfriesen@nortel.com) wrote:
> On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:
>
> > Here are the primitives I've created. I'd like to have feedback on my
> > futex use, just to make sure I did not do any incorrect assumptions.
>
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > 		atomic_set(&defer_thread_futex, 0);
> > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > 	      0, NULL, NULL, 0);
> > }
>
> Is it a problem if multiple threads all hit the defer_thread_futex==-1
> case simultaneously?

It should not be, since what we'll have is, e.g.:

thread 1 calls wakeup
thread 2 calls wakeup
thread 3 calls wait

(thread 3 is waiting on the futex, defer_thread_futex = -1)
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0

Basically, what happens in this scenario is that the first futex()
call will wake up any waiter, and the second will be a no-op.
Let's complicate this, if we have thread 3 running wait_defer()
concurrently:

- thread 3 decrements defer_thread_futex
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0
- thread 3 calls futex() to wait, expect -1
  Returns immediately because defer_thread_futex == 0

Other scenario, where thread 3 decrements defer_thread_futex a bit
later:

- thread 1 sees defer_thread_futex==0
- thread 2 sees defer_thread_futex==0
- thread 3 decrements defer_thread_futex
- thread 3 tests defer_thread_futex==-1
- thread 3 calls futex() to wait, expect -1

In this scenario, we have to notice that if threads 1/2 enqueued tasks
to do before checking defer_thread_futex, these tasks would not be seen
by the waiter thread.

So correct memory ordering of:

- wake_up_defer:
  * queue callbacks to perform (1)
  * wake up (2)

- wait_defer:
  * for (;;)
    * wait for futex (3)
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them (4)

actually matters. I'll have to be really careful about that (unless we
just accept that tasks to perform could be queued for a while; however,
I'd like to give an upper bound to the delay between batch callback
executions).

Ensuring that 1 is written before 2, and that 4 is done before 3, seems
a bit racy. (I have to go out for lunch now, so I'll have to review the
ordering afterward.)

> If so, maybe this should use an atomic
> test-and-set operation so that only one thread actually calls futex().

It does not matter if many threads wake up the waiter, so I don't think
the test-and-set is required. The benefit of using a simple test here is
that we don't have to bring the cache-line in exclusive mode to the
local CPU to perform the test. It can stay shared.

> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	if (atomic_read(&defer_thread_futex) == -1)
> > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 		      NULL, NULL, 0);
> > }
>
> Is it a problem if the value of defer_thread_futex changes to zero after
> the dec but before the test?

No. That's not a problem, because this means there is a concurrent "wake
up". Seeing a value of "0" here will skip over the futex wait and go on.
The concurrent futex wakeup call will simply be a no-op in that case.

Thanks for the comments,

Mathieu

>
> Chris

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 22:32 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Chris Friesen (cfriesen@nortel.com) wrote:
> > On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:
> >
> > > Here are the primitives I've created. I'd like to have feedback on my
> > > futex use, just to make sure I did not do any incorrect assumptions.
> >
> > > /*
> > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > >  */
> > > static void wake_up_defer(void)
> > > {
> > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > 	      0, NULL, NULL, 0);
> > > }
> >
> > Is it a problem if multiple threads all hit the defer_thread_futex==-1
> > case simultaneously?
>
> It should not be, since what we'll have is, e.g.:
>
> thread 1 calls wakeup
> thread 2 calls wakeup
> thread 3 calls wait
>
> (thread 3 is waiting on the futex, defer_thread_futex = -1)
> - thread 1 sees defer_thread_futex==-1
> - thread 2 sees defer_thread_futex==-1
> - thread 1 sets defer_thread_futex = 0
> - thread 2 sets defer_thread_futex = 0
> - thread 1 calls futex() to wake up the waiter, expect 0
> - thread 2 calls futex() to wake up the waiter, expect 0
>
> Basically, what happens in this scenario is that the first futex()
> call will wake up any waiter, and the second will be a no-op.
> Let's complicate this, if we have thread 3 running wait_defer()
> concurrently:
>
> - thread 3 decrements defer_thread_futex
> - thread 1 sees defer_thread_futex==-1
> - thread 2 sees defer_thread_futex==-1
> - thread 1 sets defer_thread_futex = 0
> - thread 2 sets defer_thread_futex = 0
> - thread 1 calls futex() to wake up the waiter, expect 0
> - thread 2 calls futex() to wake up the waiter, expect 0
> - thread 3 calls futex() to wait, expect -1
>   Returns immediately because defer_thread_futex == 0
>
> Other scenario, where thread decrements defer_thread_futex a bit later:
>
> - thread 1 sees defer_thread_futex==0
> - thread 2 sees defer_thread_futex==0
> - thread 3 decrements defer_thread_futex
> - thread 3 tests defer_thread_futex==-1
> - thread 3 calls futex() to wait, expect -1
>
> In this scenario, we have to notice that if threads 1/2 enqueued tasks
> to do before checking defer_thread_futex, these tasks would not be seen
> by the waiter thread.
>
> So correct memory ordering of:
>
> - wake_up_defer:
>   * queue callbacks to perform (1)
>   * wake up (2)
>
> - wait_defer:
>   * for (;;)
>     * wait for futex (3)
>     * sleep 100ms (wait for more callbacks to be enqueued)
>     * dequeue callbacks, execute them (4)
>
> actually matters. I'll have to be really careful about that (unless we
> just accept that tasks to perform could be queued for a while, however,
> I'd like to give an upper bound to the delay between batch callback
> execution).
>
> Ensuring that 1 is written before 2, and that 4 is done before 3 seems a
> bit racy. (I have to got out for lunch now, so I'll have to review the
> ordering afterward)

I knew I needed to think about it a bit more. Here is the proposed
algorithm, which hopefully fixes the race identified in the 3rd scenario
above. The idea is to perform the "check for empty queue" between the
defer_thread_futex decrement and the test in wait_defer(). It skips the
futex call and proceeds if the list is non-empty.
As I am drilling into the problem, it looks very much like an attempt to
implement efficient wait queues in userspace based on sys_futex().

/*
 * Wake-up any waiting defer thread. Called from many concurrent
 * threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(atomic_read(&defer_thread_futex) == -1)) {
		atomic_set(&defer_thread_futex, 0);
		futex(&defer_thread_futex, FUTEX_WAKE, 0,
		      NULL, NULL, 0);
	}
}

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	atomic_dec(&defer_thread_futex);
	smp_mb();	/* Write futex before read queue */
	if (rcu_defer_num_callbacks()) {
		smp_mb();	/* Read queue before write futex */
		/* Callbacks are queued, don't wait. */
		atomic_set(&defer_thread_futex, 0);
	} else {
		smp_rmb();	/* Read queue before read futex */
		if (atomic_read(&defer_thread_futex) == -1)
			futex(&defer_thread_futex, FUTEX_WAIT, -1,
			      NULL, NULL, 0);
	}
}

- call_rcu():
  * queue callbacks to perform
  * smp_mb()
  * wake_up_defer()

- defer thread:
  * for (;;)
    * wait_defer()
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them

The goal here is that if call_rcu() enqueues a callback (even if it
races with defer thread going to sleep), there should not be a
potentially infinite delay before it gets executed. Therefore, being
blocked in sys_futex while there is a callback to execute, without any
hope to be woken up unless another callback is queued, would not meet
that design requirement. I think that checking the number of queued
callbacks within wait_defer() as I propose here should address this
situation.

Comments ?

Mathieu

> > If so, maybe this should use an atomic
> > test-and-set operation so that only one thread actually calls futex().
>
> It's not a matter if many threads wake up the waiter, so I don't think
> the test-and-set is required. The benefit of using a simple test here is
> that we don't have to bring the cache-line in exclusive mode to the
> local CPU to perform the test. It can stay shared.
> >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > >  */
> > > static void wait_defer(void)
> > > {
> > > 	atomic_dec(&defer_thread_futex);
> > > 	if (atomic_read(&defer_thread_futex) == -1)
> > > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > 		      NULL, NULL, 0);
> > > }
> >
> > Is it a problem if the value of defer_thread_futex changes to zero after
> > the dec but before the test?
>
> No. That's not a problem, because this means there is a concurrent "wake
> up". Seeing a value of "0" here will skip over the futex wait and go on.
> The concurrent futex wakeup call will simply be a no-op in that case.
>
> Thanks for the comments,
>
> Mathieu
>
> > Chris
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Chris Friesen @ 2009-09-23 23:12 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Paul E. McKenney, linux-kernel

On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:

> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	smp_mb();	/* Write futex before read queue */
> 	if (rcu_defer_num_callbacks()) {
> 		smp_mb();	/* Read queue before write futex */
> 		/* Callbacks are queued, don't wait. */
> 		atomic_set(&defer_thread_futex, 0);
> 	} else {
> 		smp_rmb();	/* Read queue before read futex */
> 		if (atomic_read(&defer_thread_futex) == -1)
> 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 			      NULL, NULL, 0);
> 	}
> }

> The goal here is that if call_rcu() enqueues a callback (even if it
> races with defer thread going to sleep), there should not be a
> potentially infinite delay before it gets executed.

It doesn't seem like the test for the number of callbacks should be
necessary. I don't see anything like that in the glibc code, nor do I
remember anything like that in the futex sample code.

I'm still not totally convinced that you can avoid race conditions
without using atomic test-and-set or compare-and-exchange. I haven't
sat down and worked it out completely though.

Chris
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 23:28 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Chris Friesen (cfriesen@nortel.com) wrote:
> On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:
>
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	smp_mb();	/* Write futex before read queue */
> > 	if (rcu_defer_num_callbacks()) {
> > 		smp_mb();	/* Read queue before write futex */
> > 		/* Callbacks are queued, don't wait. */
> > 		atomic_set(&defer_thread_futex, 0);
> > 	} else {
> > 		smp_rmb();	/* Read queue before read futex */
> > 		if (atomic_read(&defer_thread_futex) == -1)
> > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 			      NULL, NULL, 0);
> > 	}
> > }
>
> > The goal here is that if call_rcu() enqueues a callback (even if it
> > races with defer thread going to sleep), there should not be a
> > potentially infinite delay before it gets executed.
>
> It doesn't seem like the test for the number of callbacks should be
> necessary. I don't see anything like that in the glibc code, nor do I
> remember anything like that in the futex sample code.

The mutex code (and usual futex users) use futex to implement mutual
exclusion. My goal is to send a wakeup signal to a thread waiting for
work to perform when adding such work. But without any mutual exclusion.

So it is understandable that glibc code or futex sample code does not
cover that, given this use is, well, creative. ;)

> I'm still not totally convinced that you can avoid race conditions
> without using atomic test-and-set or compare-and-exchange. I haven't
> sat down and worked it out completely though.

Yes..
this is heavily dependent on the states and values which can be
reached. I should probably take time to create a Promela model and run
that through the Spin model checker to be sure.

Thanks for the comments,

Mathieu

> Chris

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-26 7:05 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Chris Friesen (cfriesen@nortel.com) wrote:
> > On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:
> >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > >  */
> > > static void wait_defer(void)
> > > {
> > > 	atomic_dec(&defer_thread_futex);
> > > 	smp_mb();	/* Write futex before read queue */
> > > 	if (rcu_defer_num_callbacks()) {
> > > 		smp_mb();	/* Read queue before write futex */
> > > 		/* Callbacks are queued, don't wait. */
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	} else {
> > > 		smp_rmb();	/* Read queue before read futex */
> > > 		if (atomic_read(&defer_thread_futex) == -1)
> > > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > 			      NULL, NULL, 0);
> > > 	}
> > > }
> >
> > > The goal here is that if call_rcu() enqueues a callback (even if it
> > > races with defer thread going to sleep), there should not be a
> > > potentially infinite delay before it gets executed.
> >
> > It doesn't seem like the test for the number of callbacks should be
> > necessary. I don't see anything like that in the glibc code, nor do I
> > remember anything like that in the futex sample code.
>
> The mutex code (and usual futex users) use futex to implement mutual
> exclusion. My goal is to send a wakeup signal to a thread waiting for
> work to perform when adding such work. But without any mutual exclusion.
>
> So it is understandable that glibc code or futex sample code does not
> cover that, given this use is, well, creative. ;)
>
> > I'm still not totally convinced that you can avoid race conditions
> > without using atomic test-and-set or compare-and-exchange.
> > I haven't
> > sat down and worked it out completely though.
>
> Yes.. this is heavily dependent on the states and values which can be
> reached. I should probably take time to create a promela model and run
> that though the spin model checker to be sure.

Just created a Promela model for this. It assumes sequential memory
ordering (so it's a fairly simplified model). Given I added memory
barriers between each operation, it should represent reality well,
though.

My algorithm seems to behave as expected: when a callback is added to
the queue, it's not possible to have the waiter thread blocked until the
end of days.

Available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=blob;f=formal-model/futex-wakeup/futex.spin

Thanks,

Mathieu

> Thanks for the comments,
>
> Mathieu
>
> > Chris
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 7:11 UTC (permalink / raw)
To: linux-kernel

Mathieu Desnoyers wrote:
>
> The mutex code (and usual futex users) use futex to implement mutual
> exclusion. My goal is to send a wakeup signal to a thread waiting for
> work to perform when adding such work. But without any mutual exclusion.

Why is that different? You could use a normal FUTEX like
pthread_mutex_...() and have the thread block right at the start (by
blocking the *UTEX before creating the thread). Then you can wake the
thread by unblocking the *UTEX.

-Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 10:58 UTC (permalink / raw)
To: linux-kernel

Michael Schnell wrote:
> Mathieu Desnoyers wrote:
> > The mutex code (and usual futex users) use futex to implement mutual
> > exclusion. My goal is to send a wakeup signal to a thread waiting for
> > work to perform when adding such work. But without any mutual exclusion.
>
> Why is that different ?
>
> if you use a normal FUTEX like pthread_mutex_...() and have the thread
> block right at the start (by blocking the *UTEX before creating the
> thread). Now you can wake the thread by unblocking the *UTEX.
>
> -Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 11:01 UTC (permalink / raw)
To: linux-kernel

Michael Schnell wrote:
> Why is that different ?

It might be that (unlike a sema), a FUTEX can only be unlocked by the
process that locked it. But then, why should a FUTEX offer advantages
over a sema if, in this application, the FUTEX is supposed to block with
a system call anyway?

-Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Paul E. McKenney @ 2009-10-01 14:40 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: linux-kernel

On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> Hi,
>
> When implementing the call_rcu() "worker thread" in userspace, I ran
> into the problem that it had to be woken up periodically to check if
> there are any callbacks to execute. However, I easily imagine that this
> does not fit well with the "green computing" definition.
>
> Therefore, I've looked at ways to have the call_rcu() callers waking up
> this worker thread when callbacks are enqueued. However, I don't want to
> take any lock and the fast path (when no wake up is required) should not
> cause any cache-line exchange.
>
> Here are the primitives I've created. I'd like to have feedback on my
> futex use, just to make sure I did not do any incorrect assumptions.
>
> This could also be eventually used in the QSBR Userspace RCU quiescent
> state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> ensure synchronize_rcu() does not busy-wait for too long.
>
> /*
>  * Wake-up any waiting defer thread. Called from many concurrent threads.
>  */
> static void wake_up_defer(void)
> {
> 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> 		atomic_set(&defer_thread_futex, 0);
> 	futex(&defer_thread_futex, FUTEX_WAKE,
> 	      0, NULL, NULL, 0);
> }
>
> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	if (atomic_read(&defer_thread_futex) == -1)
> 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }

The standard approach would be to use pthread_cond_wait() and
pthread_cond_broadcast(). Unfortunately, this would require holding a
pthread_mutex_lock across both operations, which would not necessarily
be so good for wake-up-side scalability.

That said, without this sort of heavy-locking approach, wakeup races
are quite difficult to avoid.

							Thanx, Paul
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-10-04 14:37 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > Hi,
> >
> > When implementing the call_rcu() "worker thread" in userspace, I ran
> > into the problem that it had to be woken up periodically to check if
> > there are any callbacks to execute. However, I easily imagine that this
> > does not fit well with the "green computing" definition.
> >
> > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > this worker thread when callbacks are enqueued. However, I don't want to
> > take any lock and the fast path (when no wake up is required) should not
> > cause any cache-line exchange.
> >
> > Here are the primitives I've created. I'd like to have feedback on my
> > futex use, just to make sure I did not do any incorrect assumptions.
> >
> > This could also be eventually used in the QSBR Userspace RCU quiescent
> > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > ensure synchronize_rcu() does not busy-wait for too long.
> >
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > 		atomic_set(&defer_thread_futex, 0);
> > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > 	      0, NULL, NULL, 0);
> > }
> >
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	if (atomic_read(&defer_thread_futex) == -1)
> > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 		      NULL, NULL, 0);
> > }
>
> The standard approach would be to use pthread_cond_wait() and
> pthread_cond_broadcast(). Unfortunately, this would require holding a
> pthread_mutex_lock across both operations, which would not necessarily
> be so good for wake-up-side scalability.

The pthread_cond_broadcast() mutex is really a bugger when it comes to
executing it at each rcu_read_unlock(). We could as well use a mutex to
protect the whole read-side.. :-(

> That said, without this sort of heavy-locking approach, wakeup races
> are quite difficult to avoid.

I did a formal model of my futex-based wait/wakeup. The main idea is
that the waiter:

- Sets itself to "waiting"
- Checks the "real condition" for which it will wait (e.g. queues empty
  when used for rcu callbacks, no more ongoing old reader thread C.S.
  when used in synchronize_rcu())
- Calls sys_futex if the variable has not changed.

And the waker:

- Sets the "real condition" waking up the waiter (enqueuing, or
  rcu_read_unlock())
- Checks if the waiter must be woken up; if so, wakes it up by setting
  the state to "running" and calling sys_futex.

But as you say, wakeup races are difficult (but not impossible!) to
avoid. This is why I resorted to a formal model of the wait/wakeup
scheme to ensure that we cannot end up in a situation where a waker
races with the waiter and does not wake it up when it should. This is
nothing fancy (it does not model memory and instruction reordering
automatically), but I figure that memory barriers are required between
almost every step of this algorithm, so by adding smp_mb() I end up
ensuring sequential behavior. I added test cases in the model to ensure
that incorrect memory reordering _would_ cause errors by doing the
reordering by hand in error-injection runs.
The model is available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model

(this is in the formal-model branch of the urcu tree, futex-wakeup
subdir)

This is modeling this snippet of code:

static int defer_thread_futex;

/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
		uatomic_set(&defer_thread_futex, 0);
		futex(&defer_thread_futex, FUTEX_WAKE, 1,
		      NULL, NULL, 0);
	}
}

static void enqueue(void *callback)	/* not the actual types */
{
	add_to_queue(callback);
	smp_mb();
	wake_up_defer();
}

/*
 * rcu_defer_num_callbacks() returns the total number of callbacks
 * enqueued.
 */

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	uatomic_dec(&defer_thread_futex);
	smp_mb();	/* Write futex before read queue */
	if (rcu_defer_num_callbacks()) {
		smp_mb();	/* Read queue before write futex */
		/* Callbacks are queued, don't wait. */
		uatomic_set(&defer_thread_futex, 0);
	} else {
		smp_rmb();	/* Read queue before read futex */
		if (uatomic_read(&defer_thread_futex) == -1)
			futex(&defer_thread_futex, FUTEX_WAIT, -1,
			      NULL, NULL, 0);
	}
}

Comments are welcome,

Thanks,

Mathieu

> 							Thanx, Paul

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Paul E. McKenney @ 2009-10-04 20:36 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: linux-kernel

On Sun, Oct 04, 2009 at 10:37:45AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > >
> > > When implementing the call_rcu() "worker thread" in userspace, I ran
> > > into the problem that it had to be woken up periodically to check if
> > > there are any callbacks to execute. However, I easily imagine that this
> > > does not fit well with the "green computing" definition.
> > >
> > > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > > this worker thread when callbacks are enqueued. However, I don't want to
> > > take any lock and the fast path (when no wake up is required) should not
> > > cause any cache-line exchange.
> > >
> > > Here are the primitives I've created. I'd like to have feedback on my
> > > futex use, just to make sure I did not do any incorrect assumptions.
> > >
> > > This could also be eventually used in the QSBR Userspace RCU quiescent
> > > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > > ensure synchronize_rcu() does not busy-wait for too long.
> > >
> > > /*
> > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > >  */
> > > static void wake_up_defer(void)
> > > {
> > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > 	      0, NULL, NULL, 0);
> > > }
> > >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > > */ > > > static void wait_defer(void) > > > { > > > atomic_dec(&defer_thread_futex); > > > if (atomic_read(&defer_thread_futex) == -1) > > > futex(&defer_thread_futex, FUTEX_WAIT, -1, > > > NULL, NULL, 0); > > > } > > > > The standard approach would be to use pthread_cond_wait() and > > pthread_cond_broadcast(). Unfortunately, this would require holding a > > pthread_mutex_lock across both operations, which would not necessarily > > be so good for wake-up-side scalability. > > The pthread_cond_broadcast() mutex is really a bugger when it comes to > execute it at each rcu_read_unlock(). We could as well use a mutex to > protect the whole read-side.. :-( > > > That said, without this sort of heavy-locking approach, wakeup races > > are quite difficult to avoid. > > I did a formal model of my futex-based wait/wakeup. The main idea is > that the waiter: > > - Set itself to "waiting" > - Checks the "real condition" for which it will wait (e.g. queues empty > when used for rcu callbacks, no more ongoing old reader thread C.S. > when used in synchronize_rcu()) > - Calls sys_futex if the variable have not changed. > > And the waker: > - sets the "real condition" waking up the waiter (enqueuing, or > rcu_read_unlock()) > - check if the waiter must be woken up, if so, wake it up by setting the > state to "running" and calling sys_futex. > > But as you say, wakeup races are difficult (but not impossible!) to > avoid. This is why I resorted to a formal model of the wait/wakeup > scheme to ensure that we cannot end up in a situation where a waker > races with the waiter and does not wake it up when it should. This is > nothing fancy (does not model memory and instruction reordering > automatically), but I figure that memory barriers are required between > almost every steps of this algorithm, so by adding smp_mb() I end up > ensure sequential behavior. 
I added test cases in the model to ensure > that incorrect memory reordering _would_ cause errors by doing the > reordering by hand in error-injection runs. My question is whether pthread_cond_wait() and pthread_cond_broadcast() can substitute for the raw call to futex. Unless I am missing something (which I quite possibly am), the kernel will serialize on the futex anyway, so serialization in user-mode code does not add much additional pain. > The model is available at: > http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model > > (this is in the formal-model branch of the urcu tree, futex-wakeup > subdir) > > This is modeling this snippet of code : > > static int defer_thread_futex; > > /* > * Wake-up any waiting defer thread. Called from many concurrent threads. > */ > static void wake_up_defer(void) > { > if (unlikely(uatomic_read(&defer_thread_futex) == -1)) { > uatomic_set(&defer_thread_futex, 0); > futex(&defer_thread_futex, FUTEX_WAKE, 1, > NULL, NULL, 0); > } > } > > static void enqueue(void *callback) /* not the actual types */ > { > add_to_queue(callback); > smp_mb(); > wake_up_defer(); > } > > /* > * rcu_defer_num_callbacks() returns the total number of callbacks > * enqueued. > */ > > /* > * Defer thread waiting. Single thread. > */ > static void wait_defer(void) > { > uatomic_dec(&defer_thread_futex); > smp_mb(); /* Write futex before read queue */ > if (rcu_defer_num_callbacks()) { > smp_mb(); /* Read queue before write futex */ > /* Callbacks are queued, don't wait. */ > uatomic_set(&defer_thread_futex, 0); > } else { > smp_rmb(); /* Read queue before read futex */ > if (uatomic_read(&defer_thread_futex) == -1) > futex(&defer_thread_futex, FUTEX_WAIT, -1, > NULL, NULL, 0); > } > } > > > Comments are welcome, I will take a look after further recovery from jetlag. Not yet competent to review this kind of stuff. Give me a few days. 
;-) Thanx, Paul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
  2009-10-04 20:36           ` Paul E. McKenney
@ 2009-10-04 21:12             ` Mathieu Desnoyers
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-04 21:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Oct 04, 2009 at 10:37:45AM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > When implementing the call_rcu() "worker thread" in userspace, I ran
> > > > into the problem that it had to be woken up periodically to check if
> > > > there are any callbacks to execute. However, I easily imagine that this
> > > > does not fit well with the "green computing" definition.
> > > >
> > > > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > > > this worker thread when callbacks are enqueued. However, I don't want to
> > > > take any lock and the fast path (when no wake up is required) should not
> > > > cause any cache-line exchange.
> > > >
> > > > Here are the primitives I've created. I'd like to have feedback on my
> > > > futex use, just to make sure I did not do any incorrect assumptions.
> > > >
> > > > This could also be eventually used in the QSBR Userspace RCU quiescent
> > > > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > > > ensure synchronize_rcu() does not busy-wait for too long.
> > > >
> > > > /*
> > > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > > >  */
> > > > static void wake_up_defer(void)
> > > > {
> > > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > > 		atomic_set(&defer_thread_futex, 0);
> > > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > > 	      0, NULL, NULL, 0);
> > > > }
> > > >
> > > > /*
> > > >  * Defer thread waiting. Single thread.
> > > > 	 */
> > > > static void wait_defer(void)
> > > > {
> > > > 	atomic_dec(&defer_thread_futex);
> > > > 	if (atomic_read(&defer_thread_futex) == -1)
> > > > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > > 		      NULL, NULL, 0);
> > > > }
> > >
> > > The standard approach would be to use pthread_cond_wait() and
> > > pthread_cond_broadcast().  Unfortunately, this would require holding a
> > > pthread_mutex_lock across both operations, which would not necessarily
> > > be so good for wake-up-side scalability.
> >
> > The pthread_cond_broadcast() mutex is really a bugger when it comes to
> > execute it at each rcu_read_unlock(). We could as well use a mutex to
> > protect the whole read-side.. :-(
> >
> > > That said, without this sort of heavy-locking approach, wakeup races
> > > are quite difficult to avoid.
> >
> > I did a formal model of my futex-based wait/wakeup. The main idea is
> > that the waiter:
> >
> > - Set itself to "waiting"
> > - Checks the "real condition" for which it will wait (e.g. queues empty
> >   when used for rcu callbacks, no more ongoing old reader thread C.S.
> >   when used in synchronize_rcu())
> > - Calls sys_futex if the variable have not changed.
> >
> > And the waker:
> > - sets the "real condition" waking up the waiter (enqueuing, or
> >   rcu_read_unlock())
> > - check if the waiter must be woken up, if so, wake it up by setting the
> >   state to "running" and calling sys_futex.
> >
> > But as you say, wakeup races are difficult (but not impossible!) to
> > avoid. This is why I resorted to a formal model of the wait/wakeup
> > scheme to ensure that we cannot end up in a situation where a waker
> > races with the waiter and does not wake it up when it should. This is
> > nothing fancy (does not model memory and instruction reordering
> > automatically), but I figure that memory barriers are required between
> > almost every steps of this algorithm, so by adding smp_mb() I end up
> > ensure sequential behavior. I added test cases in the model to ensure
> > that incorrect memory reordering _would_ cause errors by doing the
> > reordering by hand in error-injection runs.
>
> My question is whether pthread_cond_wait() and pthread_cond_broadcast()
> can substitute for the raw call to futex.  Unless I am missing something
> (which I quite possibly am), the kernel will serialize on the futex
> anyway, so serialization in user-mode code does not add much additional
> pain.

The kernel sys_futex implementation only takes per-bucket spinlocks. So
this is far from the cost of a global mutex in pthread_cond. Moreover,
my scheme does not require to take any mutex in the fast path (when
there is no waiter to wake up), which makes performances appropriate for
use in rcu read-side. It's a simple memory barrier, variable read, test
and branch in this case.

> > The model is available at:
> > http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model
> >
> > (this is in the formal-model branch of the urcu tree, futex-wakeup
> > subdir)
> >
> > This is modeling this snippet of code :
> >
> > static int defer_thread_futex;
> >
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
> > 		uatomic_set(&defer_thread_futex, 0);
> > 		futex(&defer_thread_futex, FUTEX_WAKE, 1,
> > 		      NULL, NULL, 0);
> > 	}
> > }
> >
> > static void enqueue(void *callback)	/* not the actual types */
> > {
> > 	add_to_queue(callback);
> > 	smp_mb();
> > 	wake_up_defer();
> > }
> >
> > /*
> >  * rcu_defer_num_callbacks() returns the total number of callbacks
> >  * enqueued.
> >  */
> >
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	uatomic_dec(&defer_thread_futex);
> > 	smp_mb();	/* Write futex before read queue */
> > 	if (rcu_defer_num_callbacks()) {
> > 		smp_mb();	/* Read queue before write futex */
> > 		/* Callbacks are queued, don't wait. */
> > 		uatomic_set(&defer_thread_futex, 0);
> > 	} else {
> > 		smp_rmb();	/* Read queue before read futex */
> > 		if (uatomic_read(&defer_thread_futex) == -1)
> > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 			      NULL, NULL, 0);
> > 	}
> > }
> >
> >
> > Comments are welcome,
>
> I will take a look after further recovery from jetlag.  Not yet competent
> to review this kind of stuff.  Give me a few days.  ;-)

No problem, thanks for looking at this,

Mathieu

>
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <4AC99D55.8000102@lumino.de>]
[parent not found: <20091005125533.GA1857@Krystal>]
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
       [not found]         ` <20091005125533.GA1857@Krystal>
@ 2009-10-05 13:22           ` Mathieu Desnoyers
  2009-10-05 22:21             ` Mathieu Desnoyers
  2009-10-07  7:22             ` Michael Schnell
  1 sibling, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-05 13:22 UTC (permalink / raw)
  To: Michael Schnell; +Cc: Paul E. McKenney, Darren Hart, linux-kernel

(added some CC's, given the length of the answer. Thanks for asking) ;)
(Sorry for duplicate, messed up LKML email in original post)

* Michael Schnell (mschnell@lumino.de) wrote:
> I still don't understand what the advantage of FUTEX is, if the thread
> to be waked is always blocked, and thus the fast path is not in use.
>
> -Michael

Hrm, your assumption about the common case does not seem to fit my
scenarios. Typically, the wakeup will find the waiter not blocked, and
thus skip the call to sys_futex. Here is why.

I use this scheme in two different implementations:

1) in call_rcu(), to wake up the worker thread after adding work to the
   queue.

This worker thread, when woken up, sleeps for a few milliseconds before
starting to dequeue work. Therefore, if the system is relatively busy,
call_rcu() will usually see the worker thread while it's sleeping (and
therefore _not_ waiting on the futex). Also, if work is enqueued while
the worker thread is executing past RCU callbacks, the worker thread
will detect it and won't wait on the futex.

Therefore, this is, by design, a very unlikely event to have call_rcu()
calling sys_futex.

2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past
   reader's grace periods.

Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if
this does not work, after a few attempts (like the pthread mutexes), it
adapts and uses sys_futex. The waker only need to call sys_futex if
there is a synchronize_rcu() currently running which had to call
sys_futex after a few active attempts failed.

As you see, in both cases, the common case, "fast path", is to find the
futex unlocked and not having to take any lock.

Now, about the slow path. I think it's worth discussing too. Indeed,
sys_futex takes a per-address spinlock, which happens to serialize all
sys_futex operations on the wakeup side. Therefore, for wakeup designs
relying on calling sys_futex for wakeup very frequently, this is really
bad.

There might be ways to mitigate this problem though: changing the
sys_futex implementation to use lock-free lists might help.

One advantage of calling sys_futex without holding a userspace mutex is
that the contention duration on the per-address spinlock is much shorter
than the contention on the mutex, because of the system call execution
overhead.

We could probably turn the sys_futex-dependent locking scheme into
something more portable and manage to keep the same fast-path behavior
if we replace the way I use sys_futex by a pthread cond var, e.g. :

Instead of having:

static inline void wake_up_gp(void)
{
	if (unlikely(uatomic_read(&gp_futex) == -1)) {
		uatomic_set(&gp_futex, 0);
		futex(&gp_futex, FUTEX_WAKE, 1,
		      NULL, NULL, 0);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_futex);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_futex, 0);
		return;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	if (uatomic_read(&gp_futex) == -1)
		futex(&gp_futex, FUTEX_WAIT, -1,
		      NULL, NULL, 0);
}

We could have:

static inline void wake_up_gp(void)
{
	/* just released old reader gp */
	smp_mb();
	if (unlikely(uatomic_read(&gp_wait) == -1)) {
		uatomic_set(&gp_wait, 0);
		pthread_cond_broadcast(&rcucond);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_wait);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_wait, 0);
		goto unlock;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	pthread_mutex_lock(&rcumutex);
	if (uatomic_read(&gp_wait) == -1) {
		pthread_cond_wait(&rcucond, &rcumutex);
		pthread_mutex_unlock(&rcumutex);
	}
}

Is that what you had in mind ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
  2009-10-05 13:22           ` Mathieu Desnoyers
@ 2009-10-05 22:21             ` Mathieu Desnoyers
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-05 22:21 UTC (permalink / raw)
  To: Michael Schnell; +Cc: Paul E. McKenney, Darren Hart, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> (added some CC's, given the length of the answer. Thanks for asking) ;)
> (Sorry for duplicate, messed up LKML email in original post)
>
> * Michael Schnell (mschnell@lumino.de) wrote:
> > I still don't understand what the advantage of FUTEX is, if the thread
> > to be waked is always blocked, and thus the fast path is not in use.
> >
> > -Michael
>
> Hrm, your assumption about the common case does not seem to fit my
> scenarios. Typically, the wakeup will find the waiter not blocked, and
> thus skip the call to sys_futex. Here is why.
>
> I use this scheme in two different implementations:
>
> 1) in call_rcu(), to wake up the worker thread after adding work to the
>    queue.
>
> This worker thread, when woken up, sleeps for a few milliseconds before
> starting to dequeue work. Therefore, if the system is relatively busy,
> call_rcu() will usually see the worker thread while it's sleeping (and
> therefore _not_ waiting on the futex). Also, if work is enqueued while
> the worker thread is executing past RCU callbacks, the worker thread
> will detect it and won't wait on the futex.
>
> Therefore, this is, by design, a very unlikely event to have call_rcu()
> calling sys_futex.
>
> 2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past
>    reader's grace periods.
>
> Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if
> this does not work, after a few attempts (like the pthread mutexes), it
> adapts and uses sys_futex. The waker only need to call sys_futex if
> there is a synchronize_rcu() currently running which had to call
> sys_futex after a few active attempts failed.
>
> As you see, in both cases, the common case, "fast path", is to find the
> futex unlocked and not having to take any lock.
>
> Now, about the slow path. I think it's worth discussing too. Indeed,
> sys_futex takes a per-address spinlock, which happens to serialize all
> sys_futex operations on the wakeup side. Therefore, for wakeup designs
> relying on calling sys_futex for wakeup very frequently, this is really
> bad.
>
> There might be ways to mitigate this problem though: changing the
> sys_futex implementation to use lock-free lists might help.
>
> One advantage of calling sys_futex without holding a userspace mutex is
> that the contention duration on the per-address spinlock is much shorter
> than the contention on the mutex, because of the system call execution
> overhead.
>
> We could probably turn the sys_futex-dependent locking scheme into
> something more portable and manage to keep the same fast-path behavior
> if we replace the way I use sys_futex by a pthread cond var, e.g. :
>
> Instead of having:
>
> static inline void wake_up_gp(void)
> {
> 	if (unlikely(uatomic_read(&gp_futex) == -1)) {
> 		uatomic_set(&gp_futex, 0);
> 		futex(&gp_futex, FUTEX_WAKE, 1,
> 		      NULL, NULL, 0);
> 	}
> }
>
> static void wait_gp(void)
> {
> 	uatomic_dec(&gp_futex);
> 	smp_mb();
> 	if (!num_old_reader_gp()) {
> 		smp_mb();
> 		atomic_set(&gp_futex, 0);
> 		return;
> 	}
> 	/* Read reader_gp before read futex */
> 	smp_rmb();
> 	if (uatomic_read(&gp_futex) == -1)
> 		futex(&gp_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }
>
> We could have:
>
> static inline void wake_up_gp(void)
> {
> 	/* just released old reader gp */
> 	smp_mb();
> 	if (unlikely(uatomic_read(&gp_wait) == -1)) {
> 		uatomic_set(&gp_wait, 0);
> 		pthread_cond_broadcast(&rcucond);
> 	}
> }
>
> static void wait_gp(void)
> {
> 	uatomic_dec(&gp_wait);
> 	smp_mb();
> 	if (!num_old_reader_gp()) {
> 		smp_mb();
> 		atomic_set(&gp_wait, 0);
> 		goto unlock;
> 	}
> 	/* Read reader_gp before read futex */
> 	smp_rmb();
> 	pthread_mutex_lock(&rcumutex);
> 	if (uatomic_read(&gp_wait) == -1) {
> 		pthread_cond_wait(&rcucond, &rcumutex);
> 		pthread_mutex_unlock(&rcumutex);
> 	}
> }
>
> Is that what you had in mind ?
>

Hrm, thinking about it, the pthread_cond_wait/pthread_cond_broadcast
scheme I proposed above is racy. If a wait_gp executes concurrently with
wake_up_gp, we can end up doing:

gp_wait = 0

* wait_gp:	uatomic_dec(&gp_wait);
* wait_gp:	smp_mb()
* wait_gp:	if (!num_old_reader_gp()) {	(false)
* wait_gp:	pthread_mutex_lock(&rcumutex);
* wait_gp:	if (uatomic_read(&gp_wait) == -1) {	(true)
* wake_up_gp:	unlikely(uatomic_read(&gp_wait) == -1) {	(true)
* wake_up_gp:	uatomic_set(&gp_wait, 0);
* wake_up_gp:	pthread_cond_broadcast(&rcucond);
* wait_gp:	pthread_cond_wait(&rcucond, &rcumutex);

might wait forever (or until the next wakeup).

We would need an extra mutex in the wakeup, e.g.:

static inline void wake_up_gp(void)
{
	/* just released old reader gp */
	smp_mb();
	if (unlikely(uatomic_read(&gp_wait) == -1)) {
		pthread_mutex_lock(&rcumutex);
		uatomic_set(&gp_wait, 0);
		pthread_cond_broadcast(&rcucond);
		pthread_mutex_unlock(&rcumutex);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_wait);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_wait, 0);
		goto unlock;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	pthread_mutex_lock(&rcumutex);
	if (uatomic_read(&gp_wait) == -1) {
		pthread_cond_wait(&rcucond, &rcumutex);
		pthread_mutex_unlock(&rcumutex);
	}
}

This should work better, but I'll need to do a formal model to convince
myself.

Mathieu

> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
       [not found]         ` <20091005125533.GA1857@Krystal>
  2009-10-05 13:22           ` Mathieu Desnoyers
@ 2009-10-07  7:22           ` Michael Schnell
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Schnell @ 2009-10-07  7:22 UTC (permalink / raw)
  Cc: linux-kernel

Mathieu Desnoyers wrote:
> Hrm, your assumption about the common case does not seem to fit my
> scenarios.

Yep, Obviously I was misled.

So I now understand that you want to schedule the thread as a result to
some event and the Thread might already be running at that time, so that
it does not need to enter an OS-based wait state.

But to me this looks as if a _counting_ semaphore is needed here (at
least in a more general case), instead of a binary semaphore (such as
FUTEX does manage). Only with a counting semaphore no events are missed
(unless the user software design cares for this with other means, which
might be difficult or impossible).

Of course the fast path of a user space counting semaphore is easily
doable with atomic_inc() and atomic_dec(). But I only know about two
working variants of the user space code for the (binary) FUTEX. There
seem to be some more implementations that do not work decently. I have
no idea if it's possible to create a counting semaphore in user space
that uses the "futex" syscall (or whatever) for the case the threads
needs to wait.

-Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2009-10-07 7:22 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-23 17:48 [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy Mathieu Desnoyers
2009-09-23 18:04 ` Chris Friesen
2009-09-23 19:03 ` Mathieu Desnoyers
2009-09-23 22:32 ` Mathieu Desnoyers
2009-09-23 23:12 ` Chris Friesen
2009-09-23 23:28 ` Mathieu Desnoyers
2009-09-26 7:05 ` Mathieu Desnoyers
2009-09-28 7:11 ` Michael Schnell
2009-09-28 10:58 ` Michael Schnell
2009-09-28 11:01 ` Michael Schnell
2009-10-01 14:40 ` Paul E. McKenney
2009-10-04 14:37 ` Mathieu Desnoyers
2009-10-04 20:36 ` Paul E. McKenney
2009-10-04 21:12 ` Mathieu Desnoyers
[not found] ` <4AC99D55.8000102@lumino.de>
[not found] ` <20091005125533.GA1857@Krystal>
2009-10-05 13:22 ` Mathieu Desnoyers
2009-10-05 22:21 ` Mathieu Desnoyers
2009-10-07 7:22 ` Michael Schnell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox