public inbox for linux-kernel@vger.kernel.org
* [PATCH V6 0/4][RFC] futex: FUTEX_LOCK with optional adaptive spinning
@ 2010-05-06  6:24 Darren Hart
  2010-05-06  6:24 ` [PATCH 1/4] futex: replace fshared and clockrt with combined flags Darren Hart
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Darren Hart @ 2010-05-06  6:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Darren Hart, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, Peter W. Morreale, Rik van Riel, Steven Rostedt,
	Gregory Haskins, Sven-Thorsten Dietrich, Chris Mason, John Cooper,
	Chris Wright, Ulrich Drepper, Alan Cox, Avi Kivity


RFC - NOT FOR INCLUSION

The following patch series implements a new experimental kernel-side futex mutex
via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin
allows multiple spinners until the lock is released or the owner is
descheduled. The patch currently allows the user to specify whether they want
spinning, but the final version will provide only the adaptive variant, as
blocking locks are already implemented very efficiently with the existing
operations.
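For reference, the existing FUTEX_WAIT/WAKE-based blocking lock that the series
competes with -- the classic 0,1,2 value policy from Drepper's "Futexes Are
Tricky" -- can be sketched in userspace C as follows (function names are mine,
not code from the patch or the testsuite):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Futex value: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
void futex_mutex_lock(atomic_int *f)
{
	int c = 0;
	if (atomic_compare_exchange_strong(f, &c, 1))
		return;				/* uncontended fast path, no syscall */
	if (c != 2)
		c = atomic_exchange(f, 2);	/* mark the lock contended */
	while (c != 0) {
		sys_futex(f, FUTEX_WAIT, 2);	/* sleep until an unlock wakes us */
		c = atomic_exchange(f, 2);	/* retake or re-mark contended */
	}
}

void futex_mutex_unlock(atomic_int *f)
{
	if (atomic_exchange(f, 0) == 2)		/* waiters may be queued */
		sys_futex(f, FUTEX_WAKE, 1);
}
```

The point of FUTEX_LOCK is that this wait/wake state machine moves into the
kernel, where the owner's run state is visible and an adaptive spin becomes
possible.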

This version greatly outperforms the last, and actually outperforms adaptive
pthread mutexes for long lock hold times.  The primary difference from the
previous implementation was userspace optimization, although many kernel-side
improvements were made as well.

I'm using the futex_lock branch of my futextest test suite to gather results.
The testcases (futex_lock.c, futex_wait_2.c, pthread_mutex_2.c, and
futex_wait_tid.c) are under performance/ and can be checked out here:

git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git

At Avi's suggestion, I prepared plots of multiple thread counts; they are
available at the URLs below. These plots are all from runs with a period of
1,000 or 10,000 instructions, with one plot per each of several duty-cycles.
For V6 I have added two new comparison tests. First, "futex_wait_tid" uses
FUTEX_WAIT/WAKE to implement a mutex just as "futex_wait" does, but it uses
the TID|FUTEX_WAITERS futex value policy to illustrate the overhead over a
simple 0,1,2 policy. Second, "pthread_mutex_pi" uses a PTHREAD_PRIO_INHERIT
mutex, which also uses the TID|FUTEX_WAITERS policy and has a higher-overhead
set of futex op codes (FUTEX_LOCK_PI and FUTEX_UNLOCK_PI).

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v6/v6c-p1000-logs/plots/
http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v6/v6c-p10000-logs/plots/

As illustrated in the above plots:
o FUTEX_LOCK_ADAPTIVE performs significantly better than FUTEX_LOCK for the
  shorter period over all duty-cycles, and is comparable for the longer period
  on all but the 32% duty-cycle, where it again outperformed the non-adaptive
  version.
o The PI pthread mutex underperformed every other implementation by one to two
  orders of magnitude. This is surely due to the significant overhead of the
  PI futex operations.
o The futex_wait_tid test illustrates the additional overhead imposed by the
  TID|FUTEX_WAITERS policy which requires the use of | and & operators as well
  as more conditionals and even cmpxchg loops. This overhead becomes very
  apparent in higher thread counts. The new FUTEX_LOCK operations outperform
  the futex_wait_tid in most scenarios.
o The only consistent win for FUTEX_LOCK_ADAPTIVE is at the 32% duty cycle,
  where it outperforms every other implementation.
o The adaptive futex_wait test served as a reference to verify my general
  approach was not grossly inferior to the hand-written asm employed by
  pthreads. Notice that futex_wait*-adaptive is mostly on par with
  pthread_mutex*-adaptive.
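To make the overhead concrete, the TID|FUTEX_WAITERS value policy that
futex_wait_tid measures can be sketched with the constants from linux/futex.h
(the helper names are illustrative, not code from the patch or the testsuite);
the extra masking and cmpxchg retries below are exactly what the simple 0,1,2
policy avoids:

```c
#include <linux/futex.h>	/* FUTEX_WAITERS, FUTEX_TID_MASK */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Uncontended acquire: 0 -> TID. Returns true on success. */
bool tid_trylock(atomic_uint *f, uint32_t tid)
{
	unsigned int old = 0;
	return atomic_compare_exchange_strong(f, &old, tid);
}

/* Before blocking, a contender must set FUTEX_WAITERS so the owner
 * knows to issue a wake on unlock; racing updates force a retry loop. */
unsigned int tid_mark_waiters(atomic_uint *f)
{
	unsigned int old = atomic_load(f);
	while (old && !(old & FUTEX_WAITERS) &&
	       !atomic_compare_exchange_weak(f, &old, old | FUTEX_WAITERS))
		;				/* retry on races */
	return atomic_load(f);
}

uint32_t tid_owner(unsigned int val)
{
	return val & FUTEX_TID_MASK;		/* strip the flag bits */
}
```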

Given the necessity of the TID|FUTEX_WAITERS policy with a kernel-side adaptive
spin implementation, I feel this implementation is pretty well optimized.

Next steps:

o Improve workload definition. The current workload is a cpu-hog. It runs a
  fixed set of instructions in a loop, with some percentage of those being
  contained within the lock/unlock block. Adding sleeps to reduce CPU overhead
  just added so much scheduling overhead that the numbers dropped absurdly for
  both normal and adaptive. I'd appreciate any assistance in preparing a
  test-case for a real-world usage scenario. I will work with some of IBM's
  product teams to do so, and would welcome any input from others with an
  interest in kernel assisted userspace spinlocks.

o Attempt the kernel assisted userspace spinning version proposed by Avi, Alan,
  and Ulrich for comparison. This involves providing information about the
  running state of each thread to userspace. Where exactly this memory should
  be allocated is still unclear. For a proof of concept I will likely create a
  simple array indexed by TID and a vdso-style API to access this information.
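A minimal sketch of the cpu-hog workload described in the first item above
(illustrative names, simplified from the futextest cases; a plain C11 spinlock
stands in for the futex-based locks being compared):

```c
#include <stdatomic.h>

struct workload {
	unsigned int period;	/* work units per iteration */
	unsigned int duty;	/* percent of the period held under the lock */
};

static unsigned int inside_work(const struct workload *w)
{
	return w->period * w->duty / 100;
}

/* Returns total work units executed; the volatile counter keeps the
 * simulated instructions from being optimized away. */
static unsigned long run_workload(atomic_flag *lock, const struct workload *w,
				  unsigned long iterations)
{
	volatile unsigned long work = 0;
	unsigned int in = inside_work(w);
	unsigned int out = w->period - in;

	for (unsigned long i = 0; i < iterations; i++) {
		while (atomic_flag_test_and_set(lock))
			;			/* acquire */
		for (unsigned int j = 0; j < in; j++)
			work++;			/* critical section */
		atomic_flag_clear(lock);	/* release */
		for (unsigned int j = 0; j < out; j++)
			work++;			/* parallel section */
	}
	return work;
}
```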
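The running-state export in the last item is still hypothetical; one possible
shape, purely as an assumption for discussion, is a per-TID flag table that the
scheduler would keep current and userspace would poll before spinning:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/types.h>

#define MAX_TIDS 32768			/* illustrative bound, not a real limit */

/* In the real proposal this table would live in kernel-shared memory,
 * updated by the scheduler; here a plain userspace array stands in. */
static atomic_bool thread_on_cpu[MAX_TIDS];

static bool thread_running(pid_t tid)
{
	return tid > 0 && tid < MAX_TIDS &&
	       atomic_load(&thread_on_cpu[tid]);
}
```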

Finally, the V6 diffstat:
 include/linux/futex.h |    6 +
 include/linux/sched.h |    1 +
 kernel/futex.c        |  572 ++++++++++++++++++++++++++++++++++++-------------
 kernel/sched.c        |   66 ++++++
 4 files changed, 494 insertions(+), 151 deletions(-)

Thanks,

Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

* [PATCH V5 0/4][RFC] futex: FUTEX_LOCK with optional adaptive spinning
@ 2010-04-09  5:15 dvhltc
  2010-04-09  5:15 ` [PATCH 4/4] futex: Add " dvhltc
  0 siblings, 1 reply; 14+ messages in thread
From: dvhltc @ 2010-04-09  5:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Darren Hart, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Eric Dumazet, Peter W. Morreale, Rik van Riel, Steven Rostedt,
	Gregory Haskins, Sven-Thorsten Dietrich, Chris Mason, John Cooper,
	Chris Wright, Ulrich Drepper, Alan Cox, Avi Kivity

NOT FOR INCLUSION

The following patch series implements a new experimental kernel-side futex mutex
via new FUTEX_LOCK and FUTEX_LOCK_ADAPTIVE futex op codes. The adaptive spin
allows multiple spinners until the lock is released or the owner is
descheduled. The patch currently allows the user to specify whether they want
spinning, but the final version will provide only the adaptive variant, as
blocking locks are already implemented very efficiently with the existing
operations.

This version greatly outperforms the last, and actually shows some benefit from
adaptive locks! The primary difference between the implementations is the
removal of any attempt to limit the number of spinners, as we have no means to
track spinners. This version also allows spinners to continue spinning if the
owner of a lock changes, and aborts only if the current owner deschedules.
Aborting the spin on an owner change proved ineffectual: this locking mechanism
does a lot of lock stealing, so many of the spin loops would last only a single
cycle before the thread moved on to block. Something to note is that the
adaptive runs can make significantly more syscalls than the non-adaptive
version, and still show significant improvement.
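The spin/abort policy described above can be sketched as follows (illustrative
userspace C, not the kernel-side patch code; task_is_running() and the on_cpu
table stand in for the scheduler's on-CPU check):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum lock_result { ACQUIRED, MUST_BLOCK };

/* Stand-in for the scheduler's knowledge of whether a task is on-CPU;
 * kernel-side this would come from the owner's task_struct. */
static atomic_bool on_cpu[64];

static bool task_is_running(uint32_t tid)
{
	return tid < 64 && atomic_load(&on_cpu[tid]);
}

/* Futex value: 0 = unlocked, otherwise the owner's TID. Keep spinning
 * (and stealing) while the current owner stays on a CPU -- even across
 * owner changes -- and give up to block once it is descheduled. */
enum lock_result adaptive_trylock(atomic_uint *futex_val, uint32_t my_tid)
{
	for (;;) {
		unsigned int owner = 0;
		if (atomic_compare_exchange_strong(futex_val, &owner, my_tid))
			return ACQUIRED;	/* acquired (or stole) the lock */
		if (!task_is_running(owner))
			return MUST_BLOCK;	/* owner descheduled: stop spinning */
	}
}
```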

Normal Lock
===========
# ./futex_lock -i100000000 -n8 -p1000 -d10 
futex_lock: Measure FUTEX_LOCK operations per second
	Arguments: iterations=100000000 threads=8 adaptive=no
	           period=1000 duty-cycle=10%
Result: 606 Kiter/s
lock calls:      100000000
lock syscalls:   35729444 (35.73%)
unlock calls:    100000000
unlock syscalls: 41740294 (41.74%)

Adaptive Lock
=============
# ./futex_lock -i100000000 -n8 -p1000 -d10 -a
futex_lock: Measure FUTEX_LOCK operations per second
	Arguments: iterations=100000000 threads=8 adaptive=yes
	           period=1000 duty-cycle=10%
Result: 749 Kiter/s
lock calls:      100000000
lock syscalls:   72257679 (72.26%)
unlock calls:    100000000
unlock syscalls: 72265869 (72.27%)

I'm using the futex_lock branch of my futextest test suite to gather results.
The test is performance/futex_lock.c and can be checked out here:

git://git.kernel.org/pub/scm/linux/kernel/git/dvhart/futextest.git

At Avi's suggestion, I prepared plots of multiple thread counts; they are
available at the URL below. These plots are all from runs with a period of 1000
instructions, with one plot per each of several duty-cycles.

http://www.kernel.org/pub/linux/kernel/people/dvhart/adaptive_futex/v5/1000-period/plots.html

Now that an advantage can be shown using FUTEX_LOCK_ADAPTIVE over FUTEX_LOCK,
the next steps as I see them are:

o Try and show improvement of FUTEX_LOCK_ADAPTIVE over FUTEX_WAIT based
  implementations (pthread_mutex specifically).

o Improve workload definition. The current workload is a cpu-hog. It runs a
  fixed set of instructions in a loop, with some percentage of those being
  contained within the lock/unlock block. Adding sleeps to reduce CPU overhead
  just added so much scheduling overhead that the numbers dropped absurdly for
  both normal and adaptive. I'd appreciate any assistance in preparing a
  test-case for a real-world usage scenario. I will work with some of IBM's
  product teams to do so, and would welcome any input from others with an
  interest in kernel assisted userspace spinlocks.

o Attempt the kernel assisted userspace spinning version proposed by Avi, Alan,
  and Ulrich for comparison.

Thanks,

Darren Hart
IBM Linux Technology Center
Real-Time Linux Team




end of thread, other threads:[~2010-05-07 19:12 UTC | newest]

Thread overview: 14+ messages
2010-05-06  6:24 [PATCH V6 0/4][RFC] futex: FUTEX_LOCK with optional adaptive spinning Darren Hart
2010-05-06  6:24 ` [PATCH 1/4] futex: replace fshared and clockrt with combined flags Darren Hart
2010-05-06  6:24 ` [PATCH 2/4] futex: add futex_q static initializer Darren Hart
2010-05-06  6:24 ` [PATCH 3/4] futex: refactor futex_lock_pi_atomic Darren Hart
2010-05-06  6:24 ` [PATCH 4/4] futex: Add FUTEX_LOCK with optional adaptive spinning Darren Hart
2010-05-07 16:20   ` Thomas Gleixner
2010-05-07 16:24     ` Peter Zijlstra
2010-05-07 16:30       ` Thomas Gleixner
2010-05-07 16:35         ` Peter Zijlstra
2010-05-07 16:43           ` Thomas Gleixner
2010-05-07 19:05             ` Darren Hart
2010-05-07 16:52     ` Darren Hart
2010-05-07 19:11     ` Darren Hart
  -- strict thread matches above, loose matches on Subject: below --
2010-04-09  5:15 [PATCH V5 0/4][RFC] futex: " dvhltc
2010-04-09  5:15 ` [PATCH 4/4] futex: Add " dvhltc
