* [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
@ 2009-09-23 17:48 Mathieu Desnoyers
2009-09-23 18:04 ` Chris Friesen
2009-10-01 14:40 ` Paul E. McKenney
0 siblings, 2 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-09-23 17:48 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: linux-kernel
Hi,
When implementing the call_rcu() "worker thread" in userspace, I ran
into the problem that it had to be woken up periodically to check if
there are any callbacks to execute. However, I can easily imagine that
such periodic polling does not fit well with the idea of "green
computing".
Therefore, I've looked at ways to have the call_rcu() callers wake up
this worker thread when callbacks are enqueued. However, I don't want to
take any lock, and the fast path (when no wake-up is required) should not
cause any cache-line exchange.
Here are the primitives I've created. I'd like to have feedback on my
futex use, just to make sure I did not make any incorrect assumptions.
This could also eventually be used in the QSBR Userspace RCU quiescent
state handling, and in the mb/signal userspace RCU flavors when exiting
RCU read-side critical sections, to ensure synchronize_rcu() does not
busy-wait for too long.
/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(atomic_read(&defer_thread_futex) == -1))
		atomic_set(&defer_thread_futex, 0);
	futex(&defer_thread_futex, FUTEX_WAKE,
	      0, NULL, NULL, 0);
}
/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	atomic_dec(&defer_thread_futex);
	if (atomic_read(&defer_thread_futex) == -1)
		futex(&defer_thread_futex, FUTEX_WAIT, -1,
		      NULL, NULL, 0);
}
Thanks,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Chris Friesen @ 2009-09-23 18:04 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Paul E. McKenney, linux-kernel

On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:

> Here are the primitives I've created. I'd like to have feedback on my
> futex use, just to make sure I did not do any incorrect assumptions.

> /*
>  * Wake-up any waiting defer thread. Called from many concurrent threads.
>  */
> static void wake_up_defer(void)
> {
> 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> 		atomic_set(&defer_thread_futex, 0);
> 	futex(&defer_thread_futex, FUTEX_WAKE,
> 	      0, NULL, NULL, 0);
> }

Is it a problem if multiple threads all hit the defer_thread_futex==-1
case simultaneously?  If so, maybe this should use an atomic
test-and-set operation so that only one thread actually calls futex().

> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	if (atomic_read(&defer_thread_futex) == -1)
> 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }

Is it a problem if the value of defer_thread_futex changes to zero after
the dec but before the test?

Chris
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 19:03 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Chris Friesen (cfriesen@nortel.com) wrote:
> On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:
>
> > Here are the primitives I've created. I'd like to have feedback on my
> > futex use, just to make sure I did not do any incorrect assumptions.
>
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > 		atomic_set(&defer_thread_futex, 0);
> > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > 	      0, NULL, NULL, 0);
> > }
>
> Is it a problem if multiple threads all hit the defer_thread_futex==-1
> case simultaneously?

It should not be, since what we'll have is, e.g.:

thread 1 calls wakeup
thread 2 calls wakeup
thread 3 calls wait

(thread 3 is waiting on the futex, defer_thread_futex = -1)
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0

Basically, what happens in this scenario is that the first futex()
call will wake up any waiter, and the second will be a no-op.
Let's complicate this, if we have thread 3 running wait_defer()
concurrently:

- thread 3 decrements defer_thread_futex
- thread 1 sees defer_thread_futex==-1
- thread 2 sees defer_thread_futex==-1
- thread 1 sets defer_thread_futex = 0
- thread 2 sets defer_thread_futex = 0
- thread 1 calls futex() to wake up the waiter, expect 0
- thread 2 calls futex() to wake up the waiter, expect 0
- thread 3 calls futex() to wait, expect -1
  Returns immediately because defer_thread_futex == 0

Other scenario, where thread 3 decrements defer_thread_futex a bit
later:

- thread 1 sees defer_thread_futex==0
- thread 2 sees defer_thread_futex==0
- thread 3 decrements defer_thread_futex
- thread 3 tests defer_thread_futex==-1
- thread 3 calls futex() to wait, expect -1

In this scenario, we have to notice that if threads 1/2 enqueued tasks
to do before checking defer_thread_futex, these tasks would not be seen
by the waiter thread.

So correct memory ordering of:

- wake_up_defer:
  * queue callbacks to perform (1)
  * wake up (2)

- wait_defer:
  * for (;;)
    * wait for futex (3)
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them (4)

actually matters. I'll have to be really careful about that (unless we
just accept that tasks to perform could be queued for a while; however,
I'd like to give an upper bound to the delay between batch callback
executions).

Ensuring that 1 is written before 2, and that 4 is done before 3, seems
a bit racy. (I have to go out for lunch now, so I'll have to review the
ordering afterward.)

> If so, maybe this should use an atomic
> test-and-set operation so that only one thread actually calls futex().

It does not matter if many threads wake up the waiter, so I don't think
the test-and-set is required. The benefit of using a simple test here is
that we don't have to bring the cache-line in exclusive mode to the
local CPU to perform the test. It can stay shared.

> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	if (atomic_read(&defer_thread_futex) == -1)
> > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 		      NULL, NULL, 0);
> > }
>
> Is it a problem if the value of defer_thread_futex changes to zero after
> the dec but before the test?

No. That's not a problem, because this means there is a concurrent "wake
up". Seeing a value of "0" here will skip over the futex wait and go on.
The concurrent futex wakeup call will simply be a no-op in that case.

Thanks for the comments,

Mathieu

>
> Chris

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 22:32 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Chris Friesen (cfriesen@nortel.com) wrote:
> > On 09/23/2009 11:48 AM, Mathieu Desnoyers wrote:
> >
> > > Here are the primitives I've created. I'd like to have feedback on my
> > > futex use, just to make sure I did not do any incorrect assumptions.
> >
> > > /*
> > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > >  */
> > > static void wake_up_defer(void)
> > > {
> > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > 	      0, NULL, NULL, 0);
> > > }
> >
> > Is it a problem if multiple threads all hit the defer_thread_futex==-1
> > case simultaneously?
>
> It should not be, since what we'll have is, e.g.:
>
> thread 1 calls wakeup
> thread 2 calls wakeup
> thread 3 calls wait
>
> (thread 3 is waiting on the futex, defer_thread_futex = -1)
> - thread 1 sees defer_thread_futex==-1
> - thread 2 sees defer_thread_futex==-1
> - thread 1 sets defer_thread_futex = 0
> - thread 2 sets defer_thread_futex = 0
> - thread 1 calls futex() to wake up the waiter, expect 0
> - thread 2 calls futex() to wake up the waiter, expect 0
>
> Basically, what happens in this scenario is that the first futex()
> call will wake up any waiter, and the second will be a no-op.
> Let's complicate this, if we have thread 3 running wait_defer()
> concurrently:
>
> - thread 3 decrements defer_thread_futex
> - thread 1 sees defer_thread_futex==-1
> - thread 2 sees defer_thread_futex==-1
> - thread 1 sets defer_thread_futex = 0
> - thread 2 sets defer_thread_futex = 0
> - thread 1 calls futex() to wake up the waiter, expect 0
> - thread 2 calls futex() to wake up the waiter, expect 0
> - thread 3 calls futex() to wait, expect -1
>   Returns immediately because defer_thread_futex == 0
>
> Other scenario, where thread decrements defer_thread_futex a bit later:
>
> - thread 1 sees defer_thread_futex==0
> - thread 2 sees defer_thread_futex==0
> - thread 3 decrements defer_thread_futex
> - thread 3 tests defer_thread_futex==-1
> - thread 3 calls futex() to wait, expect -1
>
> In this scenario, we have to notice that if threads 1/2 enqueued tasks
> to do before checking defer_thread_futex, these tasks would not be seen
> by the waiter thread.
>
> So correct memory ordering of:
>
> - wake_up_defer:
>   * queue callbacks to perform (1)
>   * wake up (2)
>
> - wait_defer:
>   * for (;;)
>     * wait for futex (3)
>     * sleep 100ms (wait for more callbacks to be enqueued)
>     * dequeue callbacks, execute them (4)
>
> actually matters. I'll have to be really careful about that (unless we
> just accept that tasks to perform could be queued for a while, however,
> I'd like to give an upper bound to the delay between batch callback
> execution).
>
> Ensuring that 1 is written before 2, and that 4 is done before 3 seems a
> bit racy. (I have to got out for lunch now, so I'll have to review the
> ordering afterward)

I knew I needed to think about it a bit more. Here is the proposed
algorithm, which hopefully fixes the race identified in the 3rd scenario
above. The idea is to perform the "check for empty queue" between the
defer_thread_futex decrement and the test in wait_defer(). It skips the
futex call and proceeds if the list is non-empty.
As I am drilling into the problem, it looks very much like an attempt to
implement efficient wait queues in userspace based on sys_futex().

/*
 * Wake-up any waiting defer thread. Called from many concurrent
 * threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(atomic_read(&defer_thread_futex) == -1)) {
		atomic_set(&defer_thread_futex, 0);
		futex(&defer_thread_futex, FUTEX_WAKE, 0,
		      NULL, NULL, 0);
	}
}

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	atomic_dec(&defer_thread_futex);
	smp_mb();	/* Write futex before read queue */
	if (rcu_defer_num_callbacks()) {
		smp_mb();	/* Read queue before write futex */
		/* Callbacks are queued, don't wait. */
		atomic_set(&defer_thread_futex, 0);
	} else {
		smp_rmb();	/* Read queue before read futex */
		if (atomic_read(&defer_thread_futex) == -1)
			futex(&defer_thread_futex, FUTEX_WAIT, -1,
			      NULL, NULL, 0);
	}
}

- call_rcu():
  * queue callbacks to perform
  * smp_mb()
  * wake_up_defer()

- defer thread:
  * for (;;)
    * wait_defer()
    * sleep 100ms (wait for more callbacks to be enqueued)
    * dequeue callbacks, execute them

The goal here is that if call_rcu() enqueues a callback (even if it
races with defer thread going to sleep), there should not be a
potentially infinite delay before it gets executed. Therefore, being
blocked in sys_futex while there is a callback to execute, without any
hope to be woken up unless another callback is queued, would not meet
that design requirement. I think that checking the number of queued
callbacks within wait_defer() as I propose here should address this
situation.

Comments ?

Mathieu

> > If so, maybe this should use an atomic
> > test-and-set operation so that only one thread actually calls futex().
>
> It's not a matter if many threads wake up the waiter, so I don't think
> the test-and-set is required. The benefit of using a simple test here is
> that we don't have to bring the cache-line in exclusive mode to the
> local CPU to perform the test. It can stay shared.
> >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > >  */
> > > static void wait_defer(void)
> > > {
> > > 	atomic_dec(&defer_thread_futex);
> > > 	if (atomic_read(&defer_thread_futex) == -1)
> > > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > 		      NULL, NULL, 0);
> > > }
> >
> > Is it a problem if the value of defer_thread_futex changes to zero after
> > the dec but before the test?
>
> No. That's not a problem, because this means there is a concurrent "wake
> up". Seeing a value of "0" here will skip over the futex wait and go on.
> The concurrent futex wakeup call will simply be a no-op in that case.
>
> Thanks for the comments,
>
> Mathieu
>
> > Chris
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Chris Friesen @ 2009-09-23 23:12 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Paul E. McKenney, linux-kernel

On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:

> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	smp_mb();	/* Write futex before read queue */
> 	if (rcu_defer_num_callbacks()) {
> 		smp_mb();	/* Read queue before write futex */
> 		/* Callbacks are queued, don't wait. */
> 		atomic_set(&defer_thread_futex, 0);
> 	} else {
> 		smp_rmb();	/* Read queue before read futex */
> 		if (atomic_read(&defer_thread_futex) == -1)
> 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 			      NULL, NULL, 0);
> 	}
> }

> The goal here is that if call_rcu() enqueues a callback (even if it
> races with defer thread going to sleep), there should not be a
> potentially infinite delay before it gets executed.

It doesn't seem like the test for the number of callbacks should be
necessary. I don't see anything like that in the glibc code, nor do I
remember anything like that in the futex sample code.

I'm still not totally convinced that you can avoid race conditions
without using atomic test-and-set or compare-and-exchange. I haven't
sat down and worked it out completely though.

Chris
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-23 23:28 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Chris Friesen (cfriesen@nortel.com) wrote:
> On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:
>
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	smp_mb();	/* Write futex before read queue */
> > 	if (rcu_defer_num_callbacks()) {
> > 		smp_mb();	/* Read queue before write futex */
> > 		/* Callbacks are queued, don't wait. */
> > 		atomic_set(&defer_thread_futex, 0);
> > 	} else {
> > 		smp_rmb();	/* Read queue before read futex */
> > 		if (atomic_read(&defer_thread_futex) == -1)
> > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 			      NULL, NULL, 0);
> > 	}
> > }
>
> > The goal here is that if call_rcu() enqueues a callback (even if it
> > races with defer thread going to sleep), there should not be a
> > potentially infinite delay before it gets executed.
>
> It doesn't seem like the test for the number of callbacks should be
> necessary. I don't see anything like that in the glibc code, nor do I
> remember anything like that in the futex sample code.

The mutex code (and usual futex users) use futex to implement mutual
exclusion. My goal is to send a wakeup signal to a thread waiting for
work to perform when adding such work. But without any mutual exclusion.

So it is understandable that glibc code or futex sample code does not
cover that, given this use is, well, creative. ;)

> I'm still not totally convinced that you can avoid race conditions
> without using atomic test-and-set or compare-and-exchange. I haven't
> sat down and worked it out completely though.

Yes..
this is heavily dependent on the states and values which can be
reached. I should probably take time to create a Promela model and run
that through the Spin model checker to be sure.

Thanks for the comments,

Mathieu

> Chris

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-09-26 7:05 UTC (permalink / raw)
To: Chris Friesen; +Cc: Paul E. McKenney, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Chris Friesen (cfriesen@nortel.com) wrote:
> > On 09/23/2009 04:32 PM, Mathieu Desnoyers wrote:
> >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > >  */
> > > static void wait_defer(void)
> > > {
> > > 	atomic_dec(&defer_thread_futex);
> > > 	smp_mb();	/* Write futex before read queue */
> > > 	if (rcu_defer_num_callbacks()) {
> > > 		smp_mb();	/* Read queue before write futex */
> > > 		/* Callbacks are queued, don't wait. */
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	} else {
> > > 		smp_rmb();	/* Read queue before read futex */
> > > 		if (atomic_read(&defer_thread_futex) == -1)
> > > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > 			      NULL, NULL, 0);
> > > 	}
> > > }
> >
> > > The goal here is that if call_rcu() enqueues a callback (even if it
> > > races with defer thread going to sleep), there should not be a
> > > potentially infinite delay before it gets executed.
> >
> > It doesn't seem like the test for the number of callbacks should be
> > necessary. I don't see anything like that in the glibc code, nor do I
> > remember anything like that in the futex sample code.
>
> The mutex code (and usual futex users) use futex to implement mutual
> exclusion. My goal is to send a wakeup signal to a thread waiting for
> work to perform when adding such work. But without any mutual exclusion.
>
> So it is understandable that glibc code or futex sample code does not
> cover that, given this use is, well, creative. ;)
>
> > I'm still not totally convinced that you can avoid race conditions
> > without using atomic test-and-set or compare-and-exchange.
> > I haven't
> > sat down and worked it out completely though.
>
> Yes.. this is heavily dependent on the states and values which can be
> reached. I should probably take time to create a promela model and run
> that though the spin model checker to be sure.

Just created a Promela model for this. It assumes sequential memory
ordering (so it's a fairly simplified model). Given I added memory
barriers between each operation, it should represent reality well,
though.

My algorithm seems to behave as expected: when a callback is added to
the queue, it's not possible to have the waiter thread blocked until the
end of days.

Available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=blob;f=formal-model/futex-wakeup/futex.spin

Thanks,

Mathieu

> Thanks for the comments,
>
> Mathieu
>
> > Chris
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 7:11 UTC (permalink / raw)
To: linux-kernel

Mathieu Desnoyers wrote:
>
> The mutex code (and usual futex users) use futex to implement mutual
> exclusion. My goal is to send a wakeup signal to a thread waiting for
> work to perform when adding such work. But without any mutual exclusion.

Why is that different? You could use a normal FUTEX like
pthread_mutex_...() and have the thread block right at the start (by
blocking the *UTEX before creating the thread). Then you can wake the
thread by unblocking the *UTEX.

-Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 10:58 UTC (permalink / raw)
To: linux-kernel

Michael Schnell wrote:
> Mathieu Desnoyers wrote:
> > The mutex code (and usual futex users) use futex to implement mutual
> > exclusion. My goal is to send a wakeup signal to a thread waiting for
> > work to perform when adding such work. But without any mutual exclusion.
>
> Why is that different ?
>
> if you use a normal FUTEX like pthread_mutex_...() and have the thread
> block right at the start (by blocking the *UTEX before creating the
> thread). Now you can wake the thread by unblocking the *UTEX.
>
> -Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Michael Schnell @ 2009-09-28 11:01 UTC (permalink / raw)
To: linux-kernel

Michael Schnell wrote:
> Why is that different ?

It might be that (unlike a sema), a FUTEX can only be unlocked by the
process that locked it. But then, why should a FUTEX offer advantages
over a sema if, in this application, the FUTEX is supposed to block with
a system call anyway?

-Michael
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Paul E. McKenney @ 2009-10-01 14:40 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: linux-kernel

On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> Hi,
>
> When implementing the call_rcu() "worker thread" in userspace, I ran
> into the problem that it had to be woken up periodically to check if
> there are any callbacks to execute. However, I easily imagine that this
> does not fit well with the "green computing" definition.
>
> Therefore, I've looked at ways to have the call_rcu() callers waking up
> this worker thread when callbacks are enqueued. However, I don't want to
> take any lock and the fast path (when no wake up is required) should not
> cause any cache-line exchange.
>
> Here are the primitives I've created. I'd like to have feedback on my
> futex use, just to make sure I did not do any incorrect assumptions.
>
> This could also be eventually used in the QSBR Userspace RCU quiescent
> state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> ensure synchronize_rcu() does not busy-wait for too long.
>
> /*
>  * Wake-up any waiting defer thread. Called from many concurrent threads.
>  */
> static void wake_up_defer(void)
> {
> 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> 		atomic_set(&defer_thread_futex, 0);
> 	futex(&defer_thread_futex, FUTEX_WAKE,
> 	      0, NULL, NULL, 0);
> }
>
> /*
>  * Defer thread waiting. Single thread.
>  */
> static void wait_defer(void)
> {
> 	atomic_dec(&defer_thread_futex);
> 	if (atomic_read(&defer_thread_futex) == -1)
> 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }

The standard approach would be to use pthread_cond_wait() and
pthread_cond_broadcast(). Unfortunately, this would require holding a
pthread_mutex_lock across both operations, which would not necessarily
be so good for wake-up-side scalability.

That said, without this sort of heavy-locking approach, wakeup races
are quite difficult to avoid.

							Thanx, Paul
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Mathieu Desnoyers @ 2009-10-04 14:37 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > Hi,
> >
> > When implementing the call_rcu() "worker thread" in userspace, I ran
> > into the problem that it had to be woken up periodically to check if
> > there are any callbacks to execute. However, I easily imagine that this
> > does not fit well with the "green computing" definition.
> >
> > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > this worker thread when callbacks are enqueued. However, I don't want to
> > take any lock and the fast path (when no wake up is required) should not
> > cause any cache-line exchange.
> >
> > Here are the primitives I've created. I'd like to have feedback on my
> > futex use, just to make sure I did not do any incorrect assumptions.
> >
> > This could also be eventually used in the QSBR Userspace RCU quiescent
> > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > ensure synchronize_rcu() does not busy-wait for too long.
> >
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > 		atomic_set(&defer_thread_futex, 0);
> > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > 	      0, NULL, NULL, 0);
> > }
> >
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	atomic_dec(&defer_thread_futex);
> > 	if (atomic_read(&defer_thread_futex) == -1)
> > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 		      NULL, NULL, 0);
> > }
>
> The standard approach would be to use pthread_cond_wait() and
> pthread_cond_broadcast(). Unfortunately, this would require holding a
> pthread_mutex_lock across both operations, which would not necessarily
> be so good for wake-up-side scalability.

The pthread_cond_broadcast() mutex is really a bugger when it comes to
executing it at each rcu_read_unlock(). We could as well use a mutex to
protect the whole read-side.. :-(

> That said, without this sort of heavy-locking approach, wakeup races
> are quite difficult to avoid.

I did a formal model of my futex-based wait/wakeup. The main idea is
that the waiter:

- Sets itself to "waiting"
- Checks the "real condition" for which it will wait (e.g. queues empty
  when used for rcu callbacks, no more ongoing old reader thread C.S.
  when used in synchronize_rcu())
- Calls sys_futex if the variable has not changed.

And the waker:

- Sets the "real condition" waking up the waiter (enqueuing, or
  rcu_read_unlock())
- Checks if the waiter must be woken up; if so, wakes it up by setting
  the state to "running" and calling sys_futex.

But as you say, wakeup races are difficult (but not impossible!) to
avoid. This is why I resorted to a formal model of the wait/wakeup
scheme to ensure that we cannot end up in a situation where a waker
races with the waiter and does not wake it up when it should. This is
nothing fancy (it does not model memory and instruction reordering
automatically), but I figure that memory barriers are required between
almost every step of this algorithm, so by adding smp_mb() I end up
ensuring sequential behavior. I added test cases in the model to ensure
that incorrect memory reordering _would_ cause errors by doing the
reordering by hand in error-injection runs.
The model is available at:
http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model

(this is in the formal-model branch of the urcu tree, futex-wakeup
subdir)

This is modeling this snippet of code:

static int defer_thread_futex;

/*
 * Wake-up any waiting defer thread. Called from many concurrent threads.
 */
static void wake_up_defer(void)
{
	if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
		uatomic_set(&defer_thread_futex, 0);
		futex(&defer_thread_futex, FUTEX_WAKE, 1,
		      NULL, NULL, 0);
	}
}

static void enqueue(void *callback)	/* not the actual types */
{
	add_to_queue(callback);
	smp_mb();
	wake_up_defer();
}

/*
 * rcu_defer_num_callbacks() returns the total number of callbacks
 * enqueued.
 */

/*
 * Defer thread waiting. Single thread.
 */
static void wait_defer(void)
{
	uatomic_dec(&defer_thread_futex);
	smp_mb();	/* Write futex before read queue */
	if (rcu_defer_num_callbacks()) {
		smp_mb();	/* Read queue before write futex */
		/* Callbacks are queued, don't wait. */
		uatomic_set(&defer_thread_futex, 0);
	} else {
		smp_rmb();	/* Read queue before read futex */
		if (uatomic_read(&defer_thread_futex) == -1)
			futex(&defer_thread_futex, FUTEX_WAIT, -1,
			      NULL, NULL, 0);
	}
}

Comments are welcome,

Thanks,

Mathieu

> 							Thanx, Paul

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
From: Paul E. McKenney @ 2009-10-04 20:36 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: linux-kernel

On Sun, Oct 04, 2009 at 10:37:45AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > > Hi,
> > >
> > > When implementing the call_rcu() "worker thread" in userspace, I ran
> > > into the problem that it had to be woken up periodically to check if
> > > there are any callbacks to execute. However, I easily imagine that this
> > > does not fit well with the "green computing" definition.
> > >
> > > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > > this worker thread when callbacks are enqueued. However, I don't want to
> > > take any lock and the fast path (when no wake up is required) should not
> > > cause any cache-line exchange.
> > >
> > > Here are the primitives I've created. I'd like to have feedback on my
> > > futex use, just to make sure I did not do any incorrect assumptions.
> > >
> > > This could also be eventually used in the QSBR Userspace RCU quiescent
> > > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > > ensure synchronize_rcu() does not busy-wait for too long.
> > >
> > > /*
> > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > >  */
> > > static void wake_up_defer(void)
> > > {
> > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > 		atomic_set(&defer_thread_futex, 0);
> > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > 	      0, NULL, NULL, 0);
> > > }
> > >
> > > /*
> > >  * Defer thread waiting. Single thread.
> > > */ > > > static void wait_defer(void) > > > { > > > atomic_dec(&defer_thread_futex); > > > if (atomic_read(&defer_thread_futex) == -1) > > > futex(&defer_thread_futex, FUTEX_WAIT, -1, > > > NULL, NULL, 0); > > > } > > > > The standard approach would be to use pthread_cond_wait() and > > pthread_cond_broadcast(). Unfortunately, this would require holding a > > pthread_mutex_lock across both operations, which would not necessarily > > be so good for wake-up-side scalability. > > The pthread_cond_broadcast() mutex is really a bugger when it comes to > execute it at each rcu_read_unlock(). We could as well use a mutex to > protect the whole read-side.. :-( > > > That said, without this sort of heavy-locking approach, wakeup races > > are quite difficult to avoid. > > I did a formal model of my futex-based wait/wakeup. The main idea is > that the waiter: > > - Set itself to "waiting" > - Checks the "real condition" for which it will wait (e.g. queues empty > when used for rcu callbacks, no more ongoing old reader thread C.S. > when used in synchronize_rcu()) > - Calls sys_futex if the variable have not changed. > > And the waker: > - sets the "real condition" waking up the waiter (enqueuing, or > rcu_read_unlock()) > - check if the waiter must be woken up, if so, wake it up by setting the > state to "running" and calling sys_futex. > > But as you say, wakeup races are difficult (but not impossible!) to > avoid. This is why I resorted to a formal model of the wait/wakeup > scheme to ensure that we cannot end up in a situation where a waker > races with the waiter and does not wake it up when it should. This is > nothing fancy (does not model memory and instruction reordering > automatically), but I figure that memory barriers are required between > almost every steps of this algorithm, so by adding smp_mb() I end up > ensure sequential behavior. 
I added test cases in the model to ensure > that incorrect memory reordering _would_ cause errors by doing the > reordering by hand in error-injection runs. My question is whether pthread_cond_wait() and pthread_cond_broadcast() can substitute for the raw call to futex. Unless I am missing something (which I quite possibly am), the kernel will serialize on the futex anyway, so serialization in user-mode code does not add much additional pain. > The model is available at: > http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model > > (this is in the formal-model branch of the urcu tree, futex-wakeup > subdir) > > This is modeling this snippet of code : > > static int defer_thread_futex; > > /* > * Wake-up any waiting defer thread. Called from many concurrent threads. > */ > static void wake_up_defer(void) > { > if (unlikely(uatomic_read(&defer_thread_futex) == -1)) { > uatomic_set(&defer_thread_futex, 0); > futex(&defer_thread_futex, FUTEX_WAKE, 1, > NULL, NULL, 0); > } > } > > static void enqueue(void *callback) /* not the actual types */ > { > add_to_queue(callback); > smp_mb(); > wake_up_defer(); > } > > /* > * rcu_defer_num_callbacks() returns the total number of callbacks > * enqueued. > */ > > /* > * Defer thread waiting. Single thread. > */ > static void wait_defer(void) > { > uatomic_dec(&defer_thread_futex); > smp_mb(); /* Write futex before read queue */ > if (rcu_defer_num_callbacks()) { > smp_mb(); /* Read queue before write futex */ > /* Callbacks are queued, don't wait. */ > uatomic_set(&defer_thread_futex, 0); > } else { > smp_rmb(); /* Read queue before read futex */ > if (uatomic_read(&defer_thread_futex) == -1) > futex(&defer_thread_futex, FUTEX_WAIT, -1, > NULL, NULL, 0); > } > } > > > Comments are welcome, I will take a look after further recovery from jetlag. Not yet competent to review this kind of stuff. Give me a few days. 
;-) Thanx, Paul ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
  2009-10-04 20:36           ` Paul E. McKenney
@ 2009-10-04 21:12             ` Mathieu Desnoyers
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-04 21:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Oct 04, 2009 at 10:37:45AM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Sep 23, 2009 at 01:48:20PM -0400, Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > When implementing the call_rcu() "worker thread" in userspace, I ran
> > > > into the problem that it had to be woken up periodically to check if
> > > > there are any callbacks to execute. However, I easily imagine that this
> > > > does not fit well with the "green computing" definition.
> > > >
> > > > Therefore, I've looked at ways to have the call_rcu() callers waking up
> > > > this worker thread when callbacks are enqueued. However, I don't want to
> > > > take any lock and the fast path (when no wake up is required) should not
> > > > cause any cache-line exchange.
> > > >
> > > > Here are the primitives I've created. I'd like to have feedback on my
> > > > futex use, just to make sure I did not do any incorrect assumptions.
> > > >
> > > > This could also be eventually used in the QSBR Userspace RCU quiescent
> > > > state and in mb/signal userspace RCU when exiting RCU read-side C.S. to
> > > > ensure synchronize_rcu() does not busy-wait for too long.
> > > >
> > > > /*
> > > >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> > > >  */
> > > > static void wake_up_defer(void)
> > > > {
> > > > 	if (unlikely(atomic_read(&defer_thread_futex) == -1))
> > > > 		atomic_set(&defer_thread_futex, 0);
> > > > 	futex(&defer_thread_futex, FUTEX_WAKE,
> > > > 	      0, NULL, NULL, 0);
> > > > }
> > > >
> > > > /*
> > > >  * Defer thread waiting. Single thread.
> > > > 	 */
> > > > static void wait_defer(void)
> > > > {
> > > > 	atomic_dec(&defer_thread_futex);
> > > > 	if (atomic_read(&defer_thread_futex) == -1)
> > > > 		futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > > > 		      NULL, NULL, 0);
> > > > }
> > >
> > > The standard approach would be to use pthread_cond_wait() and
> > > pthread_cond_broadcast().  Unfortunately, this would require holding a
> > > pthread_mutex_lock across both operations, which would not necessarily
> > > be so good for wake-up-side scalability.
> >
> > The pthread_cond_broadcast() mutex is really a bugger when it comes to
> > execute it at each rcu_read_unlock(). We could as well use a mutex to
> > protect the whole read-side.. :-(
> >
> > > That said, without this sort of heavy-locking approach, wakeup races
> > > are quite difficult to avoid.
> >
> > I did a formal model of my futex-based wait/wakeup. The main idea is
> > that the waiter:
> >
> > - Set itself to "waiting"
> > - Checks the "real condition" for which it will wait (e.g. queues empty
> >   when used for rcu callbacks, no more ongoing old reader thread C.S.
> >   when used in synchronize_rcu())
> > - Calls sys_futex if the variable have not changed.
> >
> > And the waker:
> > - sets the "real condition" waking up the waiter (enqueuing, or
> >   rcu_read_unlock())
> > - check if the waiter must be woken up, if so, wake it up by setting the
> >   state to "running" and calling sys_futex.
> >
> > But as you say, wakeup races are difficult (but not impossible!) to
> > avoid. This is why I resorted to a formal model of the wait/wakeup
> > scheme to ensure that we cannot end up in a situation where a waker
> > races with the waiter and does not wake it up when it should. This is
> > nothing fancy (does not model memory and instruction reordering
> > automatically), but I figure that memory barriers are required between
> > almost every steps of this algorithm, so by adding smp_mb() I end up
> > ensure sequential behavior. I added test cases in the model to ensure
> > that incorrect memory reordering _would_ cause errors by doing the
> > reordering by hand in error-injection runs.
>
> My question is whether pthread_cond_wait() and pthread_cond_broadcast()
> can substitute for the raw call to futex.  Unless I am missing something
> (which I quite possibly am), the kernel will serialize on the futex
> anyway, so serialization in user-mode code does not add much additional
> pain.

The kernel sys_futex implementation only takes per-bucket spinlocks. So
this is far from the cost of a global mutex in pthread_cond. Moreover,
my scheme does not require to take any mutex in the fast path (when
there is no waiter to wake up), which makes performances appropriate for
use in rcu read-side. It's a simple memory barrier, variable read, test
and branch in this case.

> > The model is available at:
> > http://www.lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=tree;f=futex-wakeup;h=4ddeaeb2784165cb0465d4ca9f7d27acb562eae3;hb=refs/heads/formal-model
> >
> > (this is in the formal-model branch of the urcu tree, futex-wakeup
> > subdir)
> >
> > This is modeling this snippet of code :
> >
> > static int defer_thread_futex;
> >
> > /*
> >  * Wake-up any waiting defer thread. Called from many concurrent threads.
> >  */
> > static void wake_up_defer(void)
> > {
> > 	if (unlikely(uatomic_read(&defer_thread_futex) == -1)) {
> > 		uatomic_set(&defer_thread_futex, 0);
> > 		futex(&defer_thread_futex, FUTEX_WAKE, 1,
> > 		      NULL, NULL, 0);
> > 	}
> > }
> >
> > static void enqueue(void *callback)	/* not the actual types */
> > {
> > 	add_to_queue(callback);
> > 	smp_mb();
> > 	wake_up_defer();
> > }
> >
> > /*
> >  * rcu_defer_num_callbacks() returns the total number of callbacks
> >  * enqueued.
> >  */
> >
> > /*
> >  * Defer thread waiting. Single thread.
> >  */
> > static void wait_defer(void)
> > {
> > 	uatomic_dec(&defer_thread_futex);
> > 	smp_mb();	/* Write futex before read queue */
> > 	if (rcu_defer_num_callbacks()) {
> > 		smp_mb();	/* Read queue before write futex */
> > 		/* Callbacks are queued, don't wait. */
> > 		uatomic_set(&defer_thread_futex, 0);
> > 	} else {
> > 		smp_rmb();	/* Read queue before read futex */
> > 		if (uatomic_read(&defer_thread_futex) == -1)
> > 			futex(&defer_thread_futex, FUTEX_WAIT, -1,
> > 			      NULL, NULL, 0);
> > 	}
> > }
> >
> >
> > Comments are welcome,
>
> I will take a look after further recovery from jetlag.  Not yet competent
> to review this kind of stuff.  Give me a few days.  ;-)

No problem, thanks for looking at this,

Mathieu

>
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
[parent not found: <4AC99D55.8000102@lumino.de>]
[parent not found: <20091005125533.GA1857@Krystal>]
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
       [not found]         ` <20091005125533.GA1857@Krystal>
@ 2009-10-05 13:22           ` Mathieu Desnoyers
  2009-10-05 22:21             ` Mathieu Desnoyers
  2009-10-07  7:22             ` Michael Schnell
  1 sibling, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-05 13:22 UTC (permalink / raw)
  To: Michael Schnell; +Cc: Paul E. McKenney, Darren Hart, linux-kernel

(added some CC's, given the length of the answer. Thanks for asking) ;)
(Sorry for duplicate, messed up LKML email in original post)

* Michael Schnell (mschnell@lumino.de) wrote:
> I still don't understand what the advantage of FUTEX is, if the thread
> to be waked is always blocked, and thus the fast path is not in use.
>
> -Michael

Hrm, your assumption about the common case does not seem to fit my
scenarios. Typically, the wakeup will find the waiter not blocked, and
thus skip the call to sys_futex. Here is why.

I use this scheme in two different implementations:

1) in call_rcu(), to wake up the worker thread after adding work to the
   queue.

This worker thread, when woken up, sleeps for a few milliseconds before
starting to dequeue work. Therefore, if the system is relatively busy,
call_rcu() will usually see the worker thread while it's sleeping (and
therefore _not_ waiting on the futex). Also, if work is enqueued while
the worker thread is executing past RCU callbacks, the worker thread
will detect it and won't wait on the futex.

Therefore, this is, by design, a very unlikely event to have call_rcu()
calling sys_futex.

2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past
   reader's grace periods.

Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if
this does not work, after a few attempts (like the pthread mutexes), it
adapts and uses sys_futex. The waker only need to call sys_futex if
there is a synchronize_rcu() currently running which had to call
sys_futex after a few active attempts failed.

As you see, in both cases, the common case, "fast path", is to find the
futex unlocked and not having to take any lock.

Now, about the slow path. I think it's worth discussing too. Indeed,
sys_futex takes a per-address spinlock, which happens to serialize all
sys_futex operations on the wakeup side. Therefore, for wakeup designs
relying on calling sys_futex for wakeup very frequently, this is really
bad.

There might be ways to mitigate this problem though: changing the
sys_futex implementation to use lock-free lists might help.

One advantage of calling sys_futex without holding a userspace mutex is
that the contention duration on the per-address spinlock is much shorter
than the contention on the mutex, because of the system call execution
overhead.

We could probably turn the sys_futex-dependent locking scheme into
something more portable and manage to keep the same fast-path behavior
if we replace the way I use sys_futex by a pthread cond var, e.g. :

Instead of having:

static inline void wake_up_gp(void)
{
	if (unlikely(uatomic_read(&gp_futex) == -1)) {
		uatomic_set(&gp_futex, 0);
		futex(&gp_futex, FUTEX_WAKE, 1,
		      NULL, NULL, 0);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_futex);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_futex, 0);
		return;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	if (uatomic_read(&gp_futex) == -1)
		futex(&gp_futex, FUTEX_WAIT, -1,
		      NULL, NULL, 0);
}

We could have:

static inline void wake_up_gp(void)
{
	/* just released old reader gp */
	smp_mb();
	if (unlikely(uatomic_read(&gp_wait) == -1)) {
		uatomic_set(&gp_wait, 0);
		pthread_cond_broadcast(&rcucond);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_wait);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_wait, 0);
		goto unlock;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	pthread_mutex_lock(&rcumutex);
	if (uatomic_read(&gp_wait) == -1) {
		pthread_cond_wait(&rcucond, &rcumutex);
		pthread_mutex_unlock(&rcumutex);
	}
}

Is that what you had in mind ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
  2009-10-05 13:22           ` Mathieu Desnoyers
@ 2009-10-05 22:21             ` Mathieu Desnoyers
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2009-10-05 22:21 UTC (permalink / raw)
  To: Michael Schnell; +Cc: Paul E. McKenney, Darren Hart, linux-kernel

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> (added some CC's, given the length of the answer. Thanks for asking) ;)
> (Sorry for duplicate, messed up LKML email in original post)
>
> * Michael Schnell (mschnell@lumino.de) wrote:
> > I still don't understand what the advantage of FUTEX is, if the thread
> > to be waked is always blocked, and thus the fast path is not in use.
> >
> > -Michael
>
> Hrm, your assumption about the common case does not seem to fit my
> scenarios. Typically, the wakeup will find the waiter not blocked, and
> thus skip the call to sys_futex. Here is why.
>
> I use this scheme in two different implementations:
>
> 1) in call_rcu(), to wake up the worker thread after adding work to the
>    queue.
>
> This worker thread, when woken up, sleeps for a few milliseconds before
> starting to dequeue work. Therefore, if the system is relatively busy,
> call_rcu() will usually see the worker thread while it's sleeping (and
> therefore _not_ waiting on the futex). Also, if work is enqueued while
> the worker thread is executing past RCU callbacks, the worker thread
> will detect it and won't wait on the futex.
>
> Therefore, this is, by design, a very unlikely event to have call_rcu()
> calling sys_futex.
>
> 2) in rcu_read_unlock(), to wake up synchronize_rcu() waiting on past
>    reader's grace periods.
>
> Here, synchronize_rcu() busy-waits for the reader's G.P. to end, and if
> this does not work, after a few attempts (like the pthread mutexes), it
> adapts and uses sys_futex. The waker only need to call sys_futex if
> there is a synchronize_rcu() currently running which had to call
> sys_futex after a few active attempts failed.
>
> As you see, in both cases, the common case, "fast path", is to find the
> futex unlocked and not having to take any lock.
>
> Now, about the slow path. I think it's worth discussing too. Indeed,
> sys_futex takes a per-address spinlock, which happens to serialize all
> sys_futex operations on the wakeup side. Therefore, for wakeup designs
> relying on calling sys_futex for wakeup very frequently, this is really
> bad.
>
> There might be ways to mitigate this problem though: changing the
> sys_futex implementation to use lock-free lists might help.
>
> One advantage of calling sys_futex without holding a userspace mutex is
> that the contention duration on the per-address spinlock is much shorter
> than the contention on the mutex, because of the system call execution
> overhead.
>
> We could probably turn the sys_futex-dependent locking scheme into
> something more portable and manage to keep the same fast-path behavior
> if we replace the way I use sys_futex by a pthread cond var, e.g. :
>
> Instead of having:
>
> static inline void wake_up_gp(void)
> {
> 	if (unlikely(uatomic_read(&gp_futex) == -1)) {
> 		uatomic_set(&gp_futex, 0);
> 		futex(&gp_futex, FUTEX_WAKE, 1,
> 		      NULL, NULL, 0);
> 	}
> }
>
> static void wait_gp(void)
> {
> 	uatomic_dec(&gp_futex);
> 	smp_mb();
> 	if (!num_old_reader_gp()) {
> 		smp_mb();
> 		atomic_set(&gp_futex, 0);
> 		return;
> 	}
> 	/* Read reader_gp before read futex */
> 	smp_rmb();
> 	if (uatomic_read(&gp_futex) == -1)
> 		futex(&gp_futex, FUTEX_WAIT, -1,
> 		      NULL, NULL, 0);
> }
>
> We could have:
>
> static inline void wake_up_gp(void)
> {
> 	/* just released old reader gp */
> 	smp_mb();
> 	if (unlikely(uatomic_read(&gp_wait) == -1)) {
> 		uatomic_set(&gp_wait, 0);
> 		pthread_cond_broadcast(&rcucond);
> 	}
> }
>
> static void wait_gp(void)
> {
> 	uatomic_dec(&gp_wait);
> 	smp_mb();
> 	if (!num_old_reader_gp()) {
> 		smp_mb();
> 		atomic_set(&gp_wait, 0);
> 		goto unlock;
> 	}
> 	/* Read reader_gp before read futex */
> 	smp_rmb();
> 	pthread_mutex_lock(&rcumutex);
> 	if (uatomic_read(&gp_wait) == -1) {
> 		pthread_cond_wait(&rcucond, &rcumutex);
> 		pthread_mutex_unlock(&rcumutex);
> 	}
> }
>
> Is that what you had in mind ?
>

Hrm, thinking about it, the pthread_cond_wait/pthread_cond_broadcast
scheme I proposed above is racy. If a wait_gp executes concurrently with
wake_up_gp, we can end up doing:

gp_wait = 0

* wait_gp:	uatomic_dec(&gp_wait);
* wait_gp:	smp_mb()
* wait_gp:	if (!num_old_reader_gp()) {	(false)
* wait_gp:	pthread_mutex_lock(&rcumutex);
* wait_gp:	if (uatomic_read(&gp_wait) == -1) {	(true)
* wake_up_gp:	unlikely(uatomic_read(&gp_wait) == -1) {	(true)
* wake_up_gp:	uatomic_set(&gp_wait, 0);
* wake_up_gp:	pthread_cond_broadcast(&rcucond);
* wait_gp:	pthread_cond_wait(&rcucond, &rcumutex);

might wait forever (or until the next wakeup).

We would need an extra mutex in the wakeup, e.g.:

static inline void wake_up_gp(void)
{
	/* just released old reader gp */
	smp_mb();
	if (unlikely(uatomic_read(&gp_wait) == -1)) {
		pthread_mutex_lock(&rcumutex);
		uatomic_set(&gp_wait, 0);
		pthread_cond_broadcast(&rcucond);
		pthread_mutex_unlock(&rcumutex);
	}
}

static void wait_gp(void)
{
	uatomic_dec(&gp_wait);
	smp_mb();
	if (!num_old_reader_gp()) {
		smp_mb();
		atomic_set(&gp_wait, 0);
		goto unlock;
	}
	/* Read reader_gp before read futex */
	smp_rmb();
	pthread_mutex_lock(&rcumutex);
	if (uatomic_read(&gp_wait) == -1) {
		pthread_cond_wait(&rcucond, &rcumutex);
		pthread_mutex_unlock(&rcumutex);
	}
}

This should work better, but I'll need to do a formal model to convince
myself.

Mathieu

> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy
       [not found]         ` <20091005125533.GA1857@Krystal>
  2009-10-05 13:22           ` Mathieu Desnoyers
@ 2009-10-07  7:22           ` Michael Schnell
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Schnell @ 2009-10-07  7:22 UTC (permalink / raw)
  Cc: linux-kernel

Mathieu Desnoyers wrote:
> Hrm, your assumption about the common case does not seem to fit my
> scenarios.

Yep, Obviously I was misled.

So I now understand that you want to schedule the thread as a result to
some event and the Thread might already be running at that time, so that
it does not need to enter an OS-based wait state.

But to me this looks as if a _counting_ semaphore is needed here (at
least in a more general case), instead of a binary semaphore (such as
FUTEX does manage). Only with a counting semaphore no events are missed
(unless the user software design cares for this with other means, which
might be difficult or impossible).

Of course the fast path of a user space counting semaphore is easily
doable with atomic_inc() and atomic_dec(). But I only know about two
working variants of the user space code for the (binary) FUTEX. There
seem to be some more implementations that do not work decently. I have
no idea if it's possible to create a counting semaphore in user space
that uses the "futex" syscall (or whatever) for the case the threads
needs to wait.

-Michael

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads:[~2009-10-07 7:22 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-23 17:48 [RFC] Userspace RCU: (ab)using futexes to save cpu cycles and energy Mathieu Desnoyers
2009-09-23 18:04 ` Chris Friesen
2009-09-23 19:03 ` Mathieu Desnoyers
2009-09-23 22:32 ` Mathieu Desnoyers
2009-09-23 23:12 ` Chris Friesen
2009-09-23 23:28 ` Mathieu Desnoyers
2009-09-26 7:05 ` Mathieu Desnoyers
2009-09-28 7:11 ` Michael Schnell
2009-09-28 10:58 ` Michael Schnell
2009-09-28 11:01 ` Michael Schnell
2009-10-01 14:40 ` Paul E. McKenney
2009-10-04 14:37 ` Mathieu Desnoyers
2009-10-04 20:36 ` Paul E. McKenney
2009-10-04 21:12 ` Mathieu Desnoyers
[not found] ` <4AC99D55.8000102@lumino.de>
[not found] ` <20091005125533.GA1857@Krystal>
2009-10-05 13:22 ` Mathieu Desnoyers
2009-10-05 22:21 ` Mathieu Desnoyers
2009-10-07 7:22 ` Michael Schnell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox