[patch 0/9] mutex subsystem, -V4

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch 0/9] mutex subsystem, -V4
@ 2005-12-22 11:41 Ingo Molnar
  2005-12-22 11:53 ` Christoph Hellwig
  2005-12-22 11:54 ` Andrew Morton
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 11:41 UTC (permalink / raw)
  To: lkml
  Cc: Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre,
	Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells,
	Alan Cox, Benjamin LaHaise, Steven Rostedt, Christoph Hellwig,
	Andi Kleen, Russell King

this is the -V4 of the mutex subsystem patch-queue. It consists of the 
following patches:

  add-atomic-xchg.patch
  add-atomic-call-func-i386.patch
  add-atomic-call-func-x86_64.patch
  add-atomic-call-wrappers-rest.patch
  mutex-core.patch
  mutex-switch-arm-to-xchg.patch
  mutex-debug.patch
  mutex-debug-more.patch
  xfs-mutex-namespace-collision-fix.patch

the patches are against Linus' latest GIT tree, and they should work 
fine on every Linux architecture.

Changes since -V3:

- imlemented an atomic_xchg() based mutex implementation. It integrated
  pretty nicely into the generic code, and most of the code is still
  shared.

- added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines 
  this then the generic mutex code will switch to the atomic_xchg() 
  implementation.

  This should be conceptually equivalent to the variant Nicolas Pitre 
  posted - Nicolas, could you check out this one? It's much easier to 
  provide this in the generic implementation, and the code ends up 
  looking cleaner.

- eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for 
  architectures to override the generic code anymore, with the 
  introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.

- ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.

- ARM buildfix: move the new atomic primitives to the correct place.
  (Nicolas Pitre)

- optimization: unlock the mutex outside of the spinlock (suggested by 
  Nicolas Pitre)

- removed leftover arch_semaphore reference from the XFS patch. (noticed
  by Arjan van de Ven)

- better document the fact that mutex_trylock() follows spin_trylock() 
  semantics, not sem_trylock() semantics.

- cleanup: got rid of the MUTEX_LOCKLESS_FASTPATH define. (in -V3 this
  became equivalent to DEBUG_MUTEXES, via the elimination of the
  __HAVE_ARCH_CMPXCHG dependency on the fastpath.)

- further simplified and unified mutex_trylock().

- DocBook entries, and more documentation of internals.

- dropped the spinlock-debug fix, Linus merged it.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar
@ 2005-12-22 11:53 ` Christoph Hellwig
  2005-12-22 12:45   ` Ingo Molnar
  2005-12-22 15:34   ` Nicolas Pitre
  2005-12-22 11:54 ` Andrew Morton
  1 sibling, 2 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 11:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Christoph Hellwig, Andi Kleen, Russell King

On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote:
> this is the -V4 of the mutex subsystem patch-queue. It consists of the 
> following patches:
> 
>   add-atomic-xchg.patch
>   add-atomic-call-func-i386.patch
>   add-atomic-call-func-x86_64.patch
>   add-atomic-call-wrappers-rest.patch
>   mutex-core.patch
>   mutex-switch-arm-to-xchg.patch
>   mutex-debug.patch
>   mutex-debug-more.patch
>   xfs-mutex-namespace-collision-fix.patch
> 
> the patches are against Linus' latest GIT tree, and they should work 
> fine on every Linux architecture.
> 
> Changes since -V3:
> 
> - imlemented an atomic_xchg() based mutex implementation. It integrated
>   pretty nicely into the generic code, and most of the code is still
>   shared.
> 
> - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines 
>   this then the generic mutex code will switch to the atomic_xchg() 
>   implementation.
> 
>   This should be conceptually equivalent to the variant Nicolas Pitre 
>   posted - Nicolas, could you check out this one? It's much easier to 
>   provide this in the generic implementation, and the code ends up 
>   looking cleaner.
> 
> - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for 
>   architectures to override the generic code anymore, with the 
>   introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
> 
> - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.

I must admit I really really hat __ARCH_ stuff if we can avoid it.
An <asm/mutex.h> that usually includes two asm-generic variants is probably
a much better choice.

Actually just havign asm/mutex.h implement the faspath per-arch and get
rid of all the oddball atomic.h additions would be even better.  While
this means we need per-arch code it also means the code is a lot easier
understandable, and we don't add odd public APIs.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:53 ` Christoph Hellwig
@ 2005-12-22 12:45   ` Ingo Molnar
  2005-12-22 15:34   ` Nicolas Pitre
  1 sibling, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 12:45 UTC (permalink / raw)
  To: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen, Russell King


* Christoph Hellwig <hch@infradead.org> wrote:

> > Changes since -V3:
> > 
> > - imlemented an atomic_xchg() based mutex implementation. It integrated
> >   pretty nicely into the generic code, and most of the code is still
> >   shared.
> > 
> > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines 
> >   this then the generic mutex code will switch to the atomic_xchg() 
> >   implementation.
> > 
> >   This should be conceptually equivalent to the variant Nicolas Pitre 
> >   posted - Nicolas, could you check out this one? It's much easier to 
> >   provide this in the generic implementation, and the code ends up 
> >   looking cleaner.
> > 
> > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for 
> >   architectures to override the generic code anymore, with the 
> >   introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
> > 
> > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.
> 
> I must admit I really really hat __ARCH_ stuff if we can avoid it. An 
> <asm/mutex.h> that usually includes two asm-generic variants is 
> probably a much better choice.

agreed. In my tree i've changed it to CONFIG_MUTEX_XCHG_ALGORITHM, which 
is selected by ARM in its Kconfig.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:53 ` Christoph Hellwig
  2005-12-22 12:45   ` Ingo Molnar
@ 2005-12-22 15:34   ` Nicolas Pitre
  2005-12-22 15:40     ` Ingo Molnar
                       ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 15:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen, Russell King

On Thu, 22 Dec 2005, Christoph Hellwig wrote:

> On Thu, Dec 22, 2005 at 12:41:47PM +0100, Ingo Molnar wrote:
> > this is the -V4 of the mutex subsystem patch-queue. It consists of the 
> > following patches:
> > 
> >   add-atomic-xchg.patch
> >   add-atomic-call-func-i386.patch
> >   add-atomic-call-func-x86_64.patch
> >   add-atomic-call-wrappers-rest.patch
> >   mutex-core.patch
> >   mutex-switch-arm-to-xchg.patch
> >   mutex-debug.patch
> >   mutex-debug-more.patch
> >   xfs-mutex-namespace-collision-fix.patch
> > 
> > the patches are against Linus' latest GIT tree, and they should work 
> > fine on every Linux architecture.
> > 
> > Changes since -V3:
> > 
> > - imlemented an atomic_xchg() based mutex implementation. It integrated
> >   pretty nicely into the generic code, and most of the code is still
> >   shared.
> > 
> > - added __ARCH_WANT_XCHG_BASED_ATOMICS: if an architecture defines 
> >   this then the generic mutex code will switch to the atomic_xchg() 
> >   implementation.
> > 
> >   This should be conceptually equivalent to the variant Nicolas Pitre 
> >   posted - Nicolas, could you check out this one? It's much easier to 
> >   provide this in the generic implementation, and the code ends up 
> >   looking cleaner.
> > 
> > - eliminated ARCH_IMPLEMENTS_MUTEX_FASTPATH: there's no need for 
> >   architectures to override the generic code anymore, with the 
> >   introduction of __ARCH_WANT_XCHG_BASED_ATOMICS.
> > 
> > - ARM: enable __ARCH_WANT_XCHG_BASED_ATOMICS.
> 
> I must admit I really really hat __ARCH_ stuff if we can avoid it.
> An <asm/mutex.h> that usually includes two asm-generic variants is probably
> a much better choice.
> 
> Actually just havign asm/mutex.h implement the faspath per-arch and get
> rid of all the oddball atomic.h additions would be even better.  While
> this means we need per-arch code it also means the code is a lot easier
> understandable, and we don't add odd public APIs.

I'm with Christoph here.  Please preserve my 
arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way 
because the most important thing they bring is flexibility where it is 
needed i.e. in architecture specific implementations.  And done that way 
the architecture specific part is well abstracted with the minimum 
semantics allowing flexibility in the implementation.

I insist on that because, even if ARM currently relies on the atomic 
swap behavior, on ARMv6 at least this can be improved even further, but 
a special implementation which is neither a fully qualified atomic 
decrement nor an atomic swap is needed.  That's why I insist that you 
should keep my arch_mutex_fast_lock and friends (rename them if you 
wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.


Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:34   ` Nicolas Pitre
@ 2005-12-22 15:40     ` Ingo Molnar
  2005-12-22 16:32       ` Nicolas Pitre
  2005-12-22 17:20     ` Christoph Hellwig
  2005-12-22 17:33     ` Steven Rostedt
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 15:40 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> [...] on ARMv6 at least this can be improved even further, but a 
> special implementation which is neither a fully qualified atomic 
> decrement nor an atomic swap is needed. [...]

i'm curious, how would this ARMv6 solution look like, and what would be 
the advantages over the atomic swap based variant?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:40     ` Ingo Molnar
@ 2005-12-22 16:32       ` Nicolas Pitre
  2005-12-22 16:44         ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 16:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen, Russell King

On Thu, 22 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > [...] on ARMv6 at least this can be improved even further, but a 
> > special implementation which is neither a fully qualified atomic 
> > decrement nor an atomic swap is needed. [...]
> 
> i'm curious, how would this ARMv6 solution look like, and what would be 
> the advantages over the atomic swap based variant?

On ARMv6 (which can be SMP) the atomic swap instruction is much more 
costly than on former ARM versions.  It however has ll/sc instructions 
which allows it to implement a true atomic decrement, and the lock fast 
path would look like:

__mutex_lock:
1:	ldrex	r1, [r0]
	sub	r1, r1, #1
	strex	r2, r1, [r0]
	cmp	r2, #0
	bne	1b
	cmp	r1, #0
	moveq	pc, lr
	b	__mutex_lock_failed

With my ARMv6 implementation of arch_mutex_fast_lock() then it would 
become:

__mutex_lock:
	ldrex	r1, [r0]
	sub	r1, r1, #1
	strex	r2, r1, [r0]
	orrs	r0, r2, r1
	moveq	pc, lr
	b	__mutex_lock_failed

This code sequence is not possible with any of the standard atomic 
primitives. And the above would work even for the 
arch_mutex_fast_lock_retval() used for mutex_lock_interruptible.

Giving complete freedom to the architecture in implementing those could 
benefit architectures where disabling preemption while performing the 
lock instead of attempting any true atomic operation would be cheaper 
(Linus argued about that previously).  And if we are UP with preemption 
disabled then the lock could possibly be simplified even further if 
someone dare to.

But those implementation issues don't belong in the common core code at 
all IMHO.

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 16:32       ` Nicolas Pitre
@ 2005-12-22 16:44         ` Ingo Molnar
  2005-12-22 16:58           ` Russell King
  2005-12-22 16:58           ` Nicolas Pitre
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 16:44 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> > i'm curious, how would this ARMv6 solution look like, and what would be 
> > the advantages over the atomic swap based variant?
> 
> On ARMv6 (which can be SMP) the atomic swap instruction is much more 
> costly than on former ARM versions.  It however has ll/sc instructions 
> which allows it to implement a true atomic decrement, and the lock 
> fast path would look like: [...]

but couldnt you implement atomic_dec_return() with the ll/sc 
instructions? Something like:

repeat:
       ldrex   r1, [r0]
       sub     r1, r1, #1
       strex   r2, r1, [r0]
       orrs    r0, r2, r1
       jneq    repeat

(shot-in-the-dark guess at ARMv6 assembly)

hm?

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 16:44         ` Ingo Molnar
@ 2005-12-22 16:58           ` Russell King
  2005-12-22 21:04             ` Ingo Molnar
  2005-12-22 16:58           ` Nicolas Pitre
  1 sibling, 1 reply; 62+ messages in thread
From: Russell King @ 2005-12-22 16:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen

On Thu, Dec 22, 2005 at 05:44:15PM +0100, Ingo Molnar wrote:
> * Nicolas Pitre <nico@cam.org> wrote:
> > > i'm curious, how would this ARMv6 solution look like, and what would be 
> > > the advantages over the atomic swap based variant?
> > 
> > On ARMv6 (which can be SMP) the atomic swap instruction is much more 
> > costly than on former ARM versions.  It however has ll/sc instructions 
> > which allows it to implement a true atomic decrement, and the lock 
> > fast path would look like: [...]
> 
> but couldnt you implement atomic_dec_return() with the ll/sc 
> instructions? Something like:
> 
> repeat:
>        ldrex   r1, [r0]
>        sub     r1, r1, #1
>        strex   r2, r1, [r0]
>        orrs    r0, r2, r1
>        jneq    repeat
> 
> (shot-in-the-dark guess at ARMv6 assembly)

atomic_dec_return() would be:

1:	ldrex	r1, [r0]
	sub	r1, r1, #1
	strex	r2, r1, [r0]
	teq	r2, #0
	bne	1b
	@ result in r1

But that's not really the main point Nico's making.  Yes, on ARMv6
there is little difference.  However, ARMv6 is _not_ mainstream yet.
The previous generation which do not have this is currently mainstream.

When it comes down to it, unlike x86 land where new CPU features are
taken up very quickly, the take up of new features on ARM CPUs is
a lot slower - it's a matter of years not a matter of months.
Therefore, we can expect ARMv5 architecture CPUs to be mainstream
at least for the next year or two, and these are the ones which
we should optimise for.

Nico's point still stands though - and I'd like to ask a more direct
question.  There is an efficient implementation for ARMv5 CPUs which
it appears we're being denied the ability to use.

Given that this has been argued for using clear technical arguments
over the last two days, I have yet to see any explaination why you're
choosing to ignore it.  Could you please explain why?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 16:58           ` Russell King
@ 2005-12-22 21:04             ` Ingo Molnar
  2005-12-22 21:26               ` Russell King
  2005-12-22 21:27               ` Nicolas Pitre
  0 siblings, 2 replies; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 21:04 UTC (permalink / raw)
  To: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen

* Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> > but couldnt you implement atomic_dec_return() with the ll/sc 
> > instructions? Something like:
> > 
> > repeat:
> >        ldrex   r1, [r0]
> >        sub     r1, r1, #1
> >        strex   r2, r1, [r0]
> >        orrs    r0, r2, r1
> >        jneq    repeat
> > 
> > (shot-in-the-dark guess at ARMv6 assembly)
> 
> atomic_dec_return() would be:
> 
> 1:	ldrex	r1, [r0]
> 	sub	r1, r1, #1
> 	strex	r2, r1, [r0]
> 	teq	r2, #0
> 	bne	1b
> 	@ result in r1
> 
> But that's not really the main point Nico's making.  Yes, on ARMv6 
> there is little difference.  However, ARMv6 is _not_ mainstream yet.  
> The previous generation which do not have this is currently 
> mainstream.

i think there is some miscommunication here. I am actually on your side, 
and i spent all my day enabling the best mutex variant on both the v5 
and v6 version of your CPU. What we were doing was to discuss the 
details of getting there. So dont shoot the messenger, ok?

This is Nico's point i replied to:

> > [...] on ARMv6 at least this can be improved even further, but a
> > special implementation which is neither a fully qualified atomic
> > decrement nor an atomic swap is needed. [...]

and this was my reply:

> i'm curious, how would this ARMv6 solution look like, and what would 
> be the advantages over the atomic swap based variant?

to which Nico replied with an ll/sc implementation, which very much 
looked like atomic_dec_return(). That's all i said. Nobody is trying to
shut out anything from anywhere, and certainly not me. I am _enabling_
your stuff. I'm just trying to find the most maintainable and still most 
flexible solution.

> Nico's point still stands though - and I'd like to ask a more direct 
> question.  There is an efficient implementation for ARMv5 CPUs which 
> it appears we're being denied the ability to use.

do you realize that i've enabled this efficient implementation via 
CONFIG_MUTEX_XCHG_ALGORITHM? Do you realize that you _dont_ have to use 
extremely heavy IRQ-disabling ops on ARM to enable mutexes? Do you 
realize that what we are down to now are 1-2 instructions of 
differences?

The discussion was not about ARMv5 at all. The discussion was about the 
claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it 
isnt magical, it's a sane CPU that can implement a sane 
atomic_dec_return(), or even better, a sane atomic_*_call_if_*() 
primitive. Or whatever primitive we end up having - i dont mind if it's 
called __mutex_whatever, as long as it has a _well defined_ meaning.

what i _DONT_ want is some over-opaque per-arch thing that will again 
escallate into the same situation as semaphores: 23 different 
implementations nobody is able to change at once, nobody is able to add 
features or debugging to, and by today there's probably is no single 
person on this planet who knows all 23 of them to begin with.

if you sense me trying to avoid something then that's it: ambiguities in 
the arch-level implementation, because they end up stiffling the generic 
code.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 21:04             ` Ingo Molnar
@ 2005-12-22 21:26               ` Russell King
  2005-12-22 21:27               ` Nicolas Pitre
  1 sibling, 0 replies; 62+ messages in thread
From: Russell King @ 2005-12-22 21:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicolas Pitre, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen

On Thu, Dec 22, 2005 at 10:04:46PM +0100, Ingo Molnar wrote:
> i think there is some miscommunication here.

Quite possibly.  TBH there's soo much waffle generated on the mutex
stuff over the last few days that I've probably lost track of where
everyone is on it.  Sorry.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 21:04             ` Ingo Molnar
  2005-12-22 21:26               ` Russell King
@ 2005-12-22 21:27               ` Nicolas Pitre
  2005-12-22 21:37                 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
                                   ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 21:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen

On Thu, 22 Dec 2005, Ingo Molnar wrote:

> The discussion was not about ARMv5 at all. The discussion was about the 
> claim that 'ARMv6 is somehow magical that it needs its own stuff'. No it 
> isnt magical, it's a sane CPU that can implement a sane 
> atomic_dec_return(), or even better, a sane atomic_*_call_if_*() 
> primitive. Or whatever primitive we end up having - i dont mind if it's 
> called __mutex_whatever, as long as it has a _well defined_ meaning.

I'd like to point out that, while atomic_dec_call_if_* is really nice on 
i386, it is probably only good for i386 since no other architecture 
will be able to provide a better implementation than what can be done 
with atomic_dec_return() anyway.  Yet that IMHO overloaded 
atomic_dec_call_if_* stuff appears in core code.

But like for i386, those other architectures might be able to do some 
other tricks to achieve the same needed semantics.  My ARMv6 is one of 
them (and no it doesn't strictly follows the semantics of 
atomic_dec_call_if_*).  Are you willing to add more #if 
defined(CONFIG_MUTEX_FOO_ALGO) in the core code as time goes by?  I hope 
not.

> what i _DONT_ want is some over-opaque per-arch thing that will again 
> escallate into the same situation as semaphores: 23 different 
> implementations nobody is able to change at once, nobody is able to add 
> features or debugging to, and by today there's probably is no single 
> person on this planet who knows all 23 of them to begin with.

I agree completely with you.  But what I don't want is core code 
needlessly restricting architecture specific implementation for short 
and well defined fast paths.  Maybe that's where we must agree upon: a 
well defined fast path helper for mutexes?

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 1/2] mutex subsystem: basic per arch fast path primitives
  2005-12-22 21:27               ` Nicolas Pitre
@ 2005-12-22 21:37                 ` Nicolas Pitre
  2005-12-22 21:53                   ` Christoph Hellwig
  2005-12-22 21:40                 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre
  2005-12-22 21:54                 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar
  2 siblings, 1 reply; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 21:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen

On Thu, 22 Dec 2005, Nicolas Pitre wrote:

> On Thu, 22 Dec 2005, Ingo Molnar wrote:
> 
> > what i _DONT_ want is some over-opaque per-arch thing that will again 
> > escallate into the same situation as semaphores: 23 different 
> > implementations nobody is able to change at once, nobody is able to add 
> > features or debugging to, and by today there's probably is no single 
> > person on this planet who knows all 23 of them to begin with.
> 
> I agree completely with you.  But what I don't want is core code 
> needlessly restricting architecture specific implementation for short 
> and well defined fast paths.  Maybe that's where we must agree upon: a 
> well defined fast path helper for mutexes?

OK, let's look at actual code please.  Do you have anything against this 
and the following patches?

This patch adds fast path mutex primitives for all architectures.  It 
replaces your atomic_*_call_if_* patches that didn't seem to please 
everybody, mutex usage notwithstanding.

Comments?

Index: linux-2.6/include/asm-alpha/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-alpha/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-arm26/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm26/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-cris/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-cris/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-frv/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-frv/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-generic/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/mutex.h
@@ -0,0 +1,77 @@
+/*
+ * asm-generic/mutex.h
+ *
+ * Generic wrappers for architecture specific low-level mutex
+ * fast path locking and unlocking.  Each architecture is welcome
+ * to provide optimized versions for those.
+ */
+
+#ifndef _ASM_GENERIC_MUTEX_H
+#define _ASM_GENERIC_MUTEX_H
+
+#include <asm/atomic.h>
+
+/**
+ * __mutex_fastpath_lock - lock mutex and call function if already locked
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was already locked
+ *
+ * Atomically locks @v and calls a function if @v was already locked.
+ * When @v == 1 it is unlocked, <= 0 means locked.
+ */
+#define __mutex_fastpath_lock(v, contention_fn)				\
+do {									\
+	if (atomic_xchg(v, 0) != 1)					\
+		contention_fn(v);					\
+} while (0)
+
+/**
+ * __mutex_fastpath_lock_retval - lock mutex and call function if already locked
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was already locked
+ *
+ * Atomically locks @v and calls a function if @v was already locked.
+ * When @v == 1 it is unlocked, <= 0 means locked.
+ *
+ * It also returns 0 if the lock was successful (@v wasn't locked), otherwise
+ * it returns what @contention_fn returned.
+ */
+#define __mutex_fastpath_lock_retval(v, contention_fn)				\
+({									\
+	int __retval = 0;						\
+	if (atomic_xchg(v, 0) != 1)					\
+		__retval = contention_fn(v);				\
+	__retval;							\
+})
+
+/**
+ * __mutex_fastpath_unlock - unlock and call function if contended
+ * @v: pointer of type atomic_t
+ * @contention_fn: function to call if v was contended
+ *
+ * Atomically unlocks @v and calls a function if @v was contended.
+ * When @v == 1 it is unlocked, 0 it is locked, any negative value means
+ * locked with contention.
+ *
+ * If @v was contended, its new value is implementation defined (it can be
+ * either locked or unlocked).  The contention function (slow path) should
+ * use __mutex_slowpath_needs_unlock() to determine if the mutex needs
+ * manual unlocking.
+ */
+#define __mutex_fastpath_unlock(v, contention_fn)			\
+do {									\
+	if (atomic_xchg(v, 1) != 0)					\
+		contention_fn(v);					\
+} while (0)
+
+/**
+ * __mutex_slowpath_needs_unlock - tell the state of a contended unlock
+ *
+ * This is meant to be called from any contention function passed to
+ * __mutex_fastpath_unlock(). It tells the contention (slow path) function
+ * whether or not it has to unlock the mutex manually after unlocking a
+ * contended lock was attempted.
+ */
+#define __mutex_slowpath_needs_unlock()	(0) /* xchg always unlocks */
+
+#endif
Index: linux-2.6/include/asm-h8300/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-h8300/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-i386/mutex.h
@@ -0,0 +1,70 @@
+/*
+ * asm-i386/mutex.h
+ *
+ * Optimized i386 low-level mutex fast path locking and unlocking.
+ */
+
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+
+#include <asm/atomic.h>
+
+/*
+ * Please look into asm-generic/mutex.h for a description of semantics
+ * expected for those.
+ */
+
+#define __mutex_fastpath_lock(v, contention_fn)				\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;		\
+	unsigned int dummy;						\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "decl (%%eax)\n"  					\
+		"js 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention_fn"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:"=a"(dummy)						\
+		:"a" (v)						\
+		:"memory", "ecx", "edx");				\
+} while (0)
+
+#define __mutex_fastpath_lock_retval(v, contention_fn)			\
+({									\
+	int __retval = 0;						\
+	if (unlikely(atomic_dec_return(v) < 0))				\
+		__retval = contention_fn(v);				\
+	__retval;							\
+})
+
+#define __mutex_fastpath_unlock(v, contention_fn)			\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;		\
+	unsigned int dummy;						\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "incl (%%eax)\n"  					\
+		"jle 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention_fn"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:"=a" (dummy)						\
+		:"a" (v)						\
+		:"memory", "ecx", "edx");				\
+} while (0)
+
+/* In the contended case, the unlock doesn't actually unlock */
+#define __mutex_slowpath_needs_unlock()	(1)
+
+#endif
Index: linux-2.6/include/asm-ia64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ia64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m32r/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m32r/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68k/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68k/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-m68knommu/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-m68knommu/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-mips/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-mips/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-parisc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-parisc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-powerpc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-powerpc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-ppc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-ppc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-s390/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-s390/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sh64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sh64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-sparc64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-sparc64/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-um/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-um/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-v850/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-v850/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -0,0 +1,72 @@
+/*
+ * asm-x86_64/mutex.h
+ *
+ * Optimized x86_64 low-level mutex fast path locking and unlocking.
+ */
+
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+
+#include <asm/atomic.h>
+
+/*
+ * Please look into asm-generic/mutex.h for a description of semantics
+ * expected for those.
+ */
+
+#define __mutex_fastpath_lock(v, contention_fn)				\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;		\
+	unsigned long dummy;						\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "decl (%%rdi)\n"  					\
+		"js 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention_fn"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:"=D" (dummy)						\
+		:"D" (v)						\
+		:"rax", "rsi", "rdx", "rcx",				\
+		 "r8", "r9", "r10", "r11", "memory");			\
+} while (0)
+
+#define __mutex_fastpath_lock_retval(v, contention_fn)			\
+({									\
+	int __retval = 0;						\
+	if (unlikely(atomic_dec_return(v) < 0))				\
+		__retval = contention_fn(v);				\
+	__retval;							\
+})
+
+#define __mutex_fastpath_unlock(v, contention_fn)			\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention_fn;		\
+	unsigned long dummy;						\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "incl (%%rdi)\n"  					\
+		"jle 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention_fn"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:"=D" (dummy)						\
+		:"D" (v)						\
+		:"rax", "rsi", "rdx", "rcx",				\
+		 "r8", "r9", "r10", "r11", "memory");			\
+} while (0)
+
+/* In the contended case, the unlock doesn't actually unlock */
+#define __mutex_slowpath_needs_unlock()	(1)
+
+#endif
Index: linux-2.6/include/asm-xtensa/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-xtensa/mutex.h
@@ -0,0 +1 @@
+#include <asm-generic/mutex.h>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 1/2] mutex subsystem: basic per arch fast path primitives
  2005-12-22 21:37                 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
@ 2005-12-22 21:53                   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 21:53 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen

> OK, let's look at actual code please.  Do you have anything against this 
> and the following patches?
> 
> This patch adds fast path mutex primitives for all architectures.  It 
> replaces your atomic_*_call_if_* patches that didn't seem to please 
> everybody, mutex usage notwithstanding.

Thanks Nico, this is exactly what I wanted to see.

One question about the naming of the arch helpers, do we really need the
fastpath in there?  Or just __mutex_* ?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines
  2005-12-22 21:27               ` Nicolas Pitre
  2005-12-22 21:37                 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
@ 2005-12-22 21:40                 ` Nicolas Pitre
  2005-12-22 21:54                 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 21:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen


This switch the core over to the architecture defined fast path locking 
primitives.  It also adds __mutex_lock_interruptible_noinline and 
friends.

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -266,13 +266,8 @@ __mutex_lock_interruptible_nonatomic(str
  */
 int fastcall mutex_trylock(struct mutex *lock)
 {
-#ifdef __HAVE_ARCH_CMPXCHG
 	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
 		return 1;
-#else
-	if (atomic_dec_return(&lock->count) == 0)
-		return 1;
-#endif
 	return 0;
 }
 
@@ -316,13 +311,14 @@ static inline void __mutex_unlock_nonato
 	 * Waiters take care of themselves and stay in flight until
 	 * necessary.
 	 *
-	 * (in the xchg based implementation the fastpath has set the
+	 * (depending on the implementation the fastpath has set the
 	 *  count to 1 already, so we must not set it here, because we
 	 *  dont own the lock anymore. In the debug case we must set
 	 *  the lock inside the spinlock.)
 	 */
-#if !defined(CONFIG_MUTEX_XCHG_ALGORITHM) && !defined(CONFIG_DEBUG_MUTEXES)
-	atomic_set(&lock->count, 1);
+#if !defined(CONFIG_DEBUG_MUTEXES)
+	if (__mutex_slowpath_needs_unlock())
+		atomic_set(&lock->count, 1);
 #endif
 	spin_lock_mutex(&lock->wait_lock);
 #ifdef CONFIG_DEBUG_MUTEXES
@@ -349,21 +345,12 @@ static inline void __mutex_unlock_nonato
 static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));
 
 /*
- * Some architectures do not have fast dec_and_test atomic primitives,
- * for them we are providing an atomic_xchg() based mutex implementation,
- * if they enable CONFIG_MUTEX_XCHG_ALGORITHM.
- *
  * The locking fastpath is the 1->0 transition from 'unlocked' into
  * 'locked' state:
  */
 static inline void __mutex_lock_atomic(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 0) != 1))
-		__mutex_lock_noinline(&lock->count);
-#else
-	atomic_dec_call_if_negative(&lock->count, __mutex_lock_noinline);
-#endif
+	__mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
 }
 
 /*
@@ -383,16 +370,23 @@ static inline void __mutex_lock(struct m
 	__mutex_lock_atomic(lock);
 }
 
+static __sched int FASTCALL(__mutex_lock_interruptible_noinline(atomic_t *lock_count));
+
+static int __mutex_lock_interruptible_atomic(struct mutex *lock)
+{
+	__mutex_fastpath_lock_retval(&lock->count, __mutex_lock_interruptible_noinline);
+}
+
+static int fastcall __sched __mutex_lock_interruptible_noinline(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	return __mutex_lock_interruptible_nonatomic(lock);
+}
+
 static inline int __mutex_lock_interruptible(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 0) != 1))
-		return __mutex_lock_interruptible_nonatomic(lock);
-#else
-	if (unlikely(atomic_dec_return(&lock->count) < 0))
-		return __mutex_lock_interruptible_nonatomic(lock);
-#endif
-	return 0;
+	__mutex_lock_interruptible_atomic(lock);
 }
 
 static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
@@ -403,12 +397,7 @@ static void __sched FASTCALL(__mutex_unl
  */
 static inline void __mutex_unlock_atomic(struct mutex *lock)
 {
-#ifdef CONFIG_MUTEX_XCHG_ALGORITHM
-	if (unlikely(atomic_xchg(&lock->count, 1) != 0))
-		__mutex_unlock_noinline(&lock->count);
-#else
-	atomic_inc_call_if_nonpositive(&lock->count, __mutex_unlock_noinline);
-#endif
+	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
 }
 
 static void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count)
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -12,6 +12,7 @@
  */
 #include <linux/list.h>
 #include <linux/spinlock_types.h>
+#include <asm/mutex.h>
 
 #include <asm/atomic.h>
 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 21:27               ` Nicolas Pitre
  2005-12-22 21:37                 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
  2005-12-22 21:40                 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre
@ 2005-12-22 21:54                 ` Ingo Molnar
  2 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 21:54 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen

* Nicolas Pitre <nico@cam.org> wrote:

> I'd like to point out that, while atomic_dec_call_if_* is really nice 
> on i386, it is probably only good for i386 since no other architecture 
> will be able to provide a better implementation than what can be done 
> with atomic_dec_return() anyway.  Yet that IMHO overloaded 
> atomic_dec_call_if_* stuff appears in core code.

i'd have no problem with going to atomic_dec_return() on i386 too.  
atomic_dec_call_if_*() is just working around a gcc limitation: there's 
no way to pass a condition from inline assembly into C code, it has to 
go over a register. But the difference is small, just 1 extra 'sete' 
instruction. So x86 would be just fine with atomic_dec_return() too.

anyway, it seems like everyone would like to hack a few instructions 
from the fastpath, so we might as well go with your approach. As long as 
the state is well-defined (and it has to be well-defined because it all 
hits the common slowpath), and the function names are descriptive of 
what happens, it's fine with me.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 16:44         ` Ingo Molnar
  2005-12-22 16:58           ` Russell King
@ 2005-12-22 16:58           ` Nicolas Pitre
  1 sibling, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 16:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov,
	David Howells, Alan Cox, Benjamin LaHaise, Steven Rostedt,
	Andi Kleen, Russell King

On Thu, 22 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > > i'm curious, how would this ARMv6 solution look like, and what would be 
> > > the advantages over the atomic swap based variant?
> > 
> > On ARMv6 (which can be SMP) the atomic swap instruction is much more 
> > costly than on former ARM versions.  It however has ll/sc instructions 
> > which allows it to implement a true atomic decrement, and the lock 
> > fast path would look like: [...]
> 
> but couldnt you implement atomic_dec_return() with the ll/sc 
> instructions? Something like:

NO.  My first example was atomic_dec_return based.  The second is 
lighter and fulfill the semantics of arch_mutex_fast_lock() but is not a 
common atomic primitive.


Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:34   ` Nicolas Pitre
  2005-12-22 15:40     ` Ingo Molnar
@ 2005-12-22 17:20     ` Christoph Hellwig
  2005-12-22 17:33     ` Steven Rostedt
  2 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 17:20 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Steven Rostedt, Andi Kleen, Russell King

On Thu, Dec 22, 2005 at 10:34:18AM -0500, Nicolas Pitre wrote:
> I'm with Christoph here.  Please preserve my 
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way 
> because the most important thing they bring is flexibility where it is 
> needed i.e. in architecture specific implementations.  And done that way 
> the architecture specific part is well abstracted with the minimum 
> semantics allowing flexibility in the implementation.
> 
> I insist on that because, even if ARM currently relies on the atomic 
> swap behavior, on ARMv6 at least this can be improved even further, but 
> a special implementation which is neither a fully qualified atomic 
> decrement nor an atomic swap is needed.  That's why I insist that you 
> should keep my arch_mutex_fast_lock and friends (rename them if you 
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.

I think one of us should so a new version based on that scheme and without
all the new atomic helpers, then we can compare it against the current
version.  I'll try to once I'll get some time.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:34   ` Nicolas Pitre
  2005-12-22 15:40     ` Ingo Molnar
  2005-12-22 17:20     ` Christoph Hellwig
@ 2005-12-22 17:33     ` Steven Rostedt
  2005-12-22 18:24       ` Nicolas Pitre
  2 siblings, 1 reply; 62+ messages in thread
From: Steven Rostedt @ 2005-12-22 17:33 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Andi Kleen, Russell King

On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:

> > Actually just havign asm/mutex.h implement the faspath per-arch and get
> > rid of all the oddball atomic.h additions would be even better.  While
> > this means we need per-arch code it also means the code is a lot easier
> > understandable, and we don't add odd public APIs.
> 
> I'm with Christoph here.  Please preserve my 
> arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way 
> because the most important thing they bring is flexibility where it is 
> needed i.e. in architecture specific implementations.  And done that way 
> the architecture specific part is well abstracted with the minimum 
> semantics allowing flexibility in the implementation.
> 
> I insist on that because, even if ARM currently relies on the atomic 
> swap behavior, on ARMv6 at least this can be improved even further, but 
> a special implementation which is neither a fully qualified atomic 
> decrement nor an atomic swap is needed.  That's why I insist that you 
> should keep my arch_mutex_fast_lock and friends (rename them if you 
> wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.
> 

Not sure how well this is accepted, but would it be acceptable to have
the mutex_lock and friends covered with the (weak) attribute?

ie.

void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)

Then let the archs override them if they wish?

You would just need to make an extra slow path mutex visible to the archs.

-- Steve



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 17:33     ` Steven Rostedt
@ 2005-12-22 18:24       ` Nicolas Pitre
  0 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 18:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Christoph Hellwig, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Zwane Mwaikambo,
	Oleg Nesterov, David Howells, Alan Cox, Benjamin LaHaise,
	Andi Kleen, Russell King

On Thu, 22 Dec 2005, Steven Rostedt wrote:

> On Thu, 2005-12-22 at 10:34 -0500, Nicolas Pitre wrote:
> 
> > > Actually just havign asm/mutex.h implement the faspath per-arch and get
> > > rid of all the oddball atomic.h additions would be even better.  While
> > > this means we need per-arch code it also means the code is a lot easier
> > > understandable, and we don't add odd public APIs.
> > 
> > I'm with Christoph here.  Please preserve my 
> > arch_mutex_fast_lock/arch_mutex_fast_unlock helpers.  I did it that way 
> > because the most important thing they bring is flexibility where it is 
> > needed i.e. in architecture specific implementations.  And done that way 
> > the architecture specific part is well abstracted with the minimum 
> > semantics allowing flexibility in the implementation.
> > 
> > I insist on that because, even if ARM currently relies on the atomic 
> > swap behavior, on ARMv6 at least this can be improved even further, but 
> > a special implementation which is neither a fully qualified atomic 
> > decrement nor an atomic swap is needed.  That's why I insist that you 
> > should keep my arch_mutex_fast_lock and friends (rename them if you 
> > wish) and remove __ARCH_WANT_XCHG_BASED_ATOMICS entirely.
> > 
> 
> Not sure how well this is accepted, but would it be acceptable to have
> the mutex_lock and friends covered with the (weak) attribute?
> 
> ie.
> 
> void fastcall __sched __attribute__((weak)) mutex_lock(struct mutex *lock)
> 
> Then let the archs override them if they wish?

Actually, my next request would be for architectures to actually inline 
the fast path if they see benefit.  On ARM, given the tight fast path 
allowed by mutexes, it is indeed beneficial to inline them (the last 
patch of my latest serie does just that).  So weak symbols are not 
really useful in that case.


Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar
  2005-12-22 11:53 ` Christoph Hellwig
@ 2005-12-22 11:54 ` Andrew Morton
  2005-12-22 12:20   ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 11:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	alan, bcrl, rostedt, hch, ak, rmk+lkml

Ingo Molnar <mingo@elte.hu> wrote:
>
> this is the -V4 of the mutex subsystem patch-queue.
>

I've only been following this with half an eye, with the apparently
erroneous expectation that future versions of the patchset would come with
some explanation of why on earth we'd want to merge all this stuff into the
kernel.

We've seen some benchmarks which indicate that there are performance gains
to be had if there's a lot of lock contention, but isn't it the case that
this is because of the semi-extra wake_up() in down()?  Has anyone tried to
redo existing semaphores so they no longer have that disadvantage?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 11:54 ` Andrew Morton
@ 2005-12-22 12:20   ` Ingo Molnar
  2005-12-22 13:07     ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-22 12:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	alan, bcrl, rostedt, hch, ak, rmk+lkml


* Andrew Morton <akpm@osdl.org> wrote:

> I've only been following this with half an eye, with the apparently 
> erroneous expectation that future versions of the patchset would come 
> with some explanation of why on earth we'd want to merge all this 
> stuff into the kernel.

in my initial announcement i listed 10 good reasons to do so, and they 
are still true:

http://people.redhat.com/mingo/generic-mutex-subsystem/mutex-announce.txt
[...]

But firstly, i'd like to answer the most important question:

  "Why the hell do we need a new mutex subsystem, and what's wrong 
   with semaphores??"

This question is clearly nagging most of the doubters, so i'm listing my 
answers first, before fully explaining the patchset. (For more 
implementational details, see the subseqent sections.)

here are the top 10 reasons of why i think the generic mutex code should 
be considered for upstream integration:

 - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes, 
   'struct mutex' is 16 bytes. A smaller structure size means less RAM 
   footprint, and better CPU-cache utilization.

 - tighter code. On x86 i get the following .text sizes when 
   switching all mutex-alike semaphores in the kernel to the mutex 
   subsystem:

        text    data     bss     dec     hex filename
     3280380  868188  396860 4545428  455b94 vmlinux-semaphore
     3255329  865296  396732 4517357  44eded vmlinux-mutex

   that's 25051 bytes of code saved, or a 0.76% win - off the hottest 
   codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%) 
   Smaller code means better icache footprint, which is one of the 
   major optimization goals in the Linux kernel currently.

 - the mutex subsystem is faster and has superior scalability for 
   contented workloads. On an 8-way x86 system, running a mutex-based 
   kernel and testing creat+unlink+close (of separate, per-task files)
   in /tmp with 16 parallel tasks, the average number of ops/sec is:

    Semaphores:                        Mutexes:

    $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
    8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
    checking VFS performance.          checking VFS performance.
    avg loops/sec:      34713          avg loops/sec:      84153
    CPU utilization:    63%            CPU utilization:    22%

   i.e. in this workload, the mutex based kernel was 2.4 times faster 
   than the semaphore based kernel, _and_ it also had 2.8 times less CPU 
   utilization. (In terms of 'ops per CPU cycle', the semaphore kernel 
   performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 
   performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 
   more efficient.)

   the scalability difference is visible even on a 2-way P4 HT box:

    Semaphores:                        Mutexes:

    $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
    4 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
    checking VFS performance.          checking VFS performance.
    avg loops/sec:      127659         avg loops/sec:      181082
    CPU utilization:    100%           CPU utilization:    34%

   (the straight performance advantage of mutexes is 41%, the per-cycle 
    efficiency of mutexes is 4.1 times better.)

 - there are no fastpath tradeoffs, the mutex fastpath is just as tight 
   as the semaphore fastpath. On x86, the locking fastpath is 2 
   instructions:

    c0377ccb <mutex_lock>:
    c0377ccb:       f0 ff 08                lock decl (%eax)
    c0377cce:       78 0e                   js     c0377cde <.text.lock.mutex>
    c0377cd0:       c3                      ret

   the unlocking fastpath is equally tight:

    c0377cd1 <mutex_unlock>:
    c0377cd1:       f0 ff 00                lock incl (%eax)
    c0377cd4:       7e 0f                   jle    c0377ce5 <.text.lock.mutex+0x7>
    c0377cd6:       c3                      ret

 - the per-call-site inlining cost of the slowpath is cheaper and 
   smaller than that of semaphores, by one instruction, because the 
   mutex trampoline code does not do a "lea %0,%%eax" that the semaphore 
   code does before calling __down_failed. The mutex subsystem uses out 
   of line code currently so this makes only a small difference in .text
   size, but in case we want to inline mutexes, they will be cheaper 
   than semaphores.

 - No wholesale or dangerous migration path. The migration to mutexes is 
   fundamentally opt-in, safe and easy: multiple type-based and .config 
   based migration helpers are provided to make the migration to mutexes 
   easy. Migration is as finegrained as it gets, so robustness of the 
   kernel or out-of-tree code should not be jeopardized at any stage.  
   The migration helpers can be eliminated once migration is completed, 
   once the kernel has been separated into 'mutex users' and 'semaphore 
   users'. Out-of-tree code automatically defaults to semaphore 
   semantics, mutexes are not forced upon anyone, at any stage of the 
   migration.

 - 'struct mutex' semantics are well-defined and are enforced if
   CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have 
   virtually no debugging code or instrumentation. The mutex subsystem 
   checks and enforces the following rules:

   * - only one task can hold the mutex at a time
   * - only the owner can unlock the mutex
   * - multiple unlocks are not permitted
   * - recursive locking is not permitted
   * - a mutex object must be initialized via the API
   * - a mutex object must not be initialized via memset or copying
   * - task may not exit with mutex held
   * - memory areas where held locks reside must not be freed

   furthermore, there are also convenience features in the debugging 
   code:

   * - uses symbolic names of mutexes, whenever they are printed in debug output
   * - point-of-acquire tracking, symbolic lookup of function names
   * - list of all locks held in the system, printout of them
   * - owner tracking
   * - detects self-recursing locks and prints out all relevant info
   * - detects multi-task circular deadlocks and prints out all affected
   *   locks and tasks (and only those tasks)

   we have extensive experience with the mutex debugging code in the -rt
   kernel, and it eases the debugging of mutex related bugs 
   considerably. A handful of upstream bugs were found as well this
   way, and were contributed back to the vanilla kernel. We do believe 
   that improved debugging code is an important tool in improving the 
   fast-paced upstream kernel's quality.

   a side-effect of the strict semantics is that mutexes are much easier 
   to analyze on a static level. E.g. Sparse could check the correctness 
   of mutex users, further improving the kernel's quality. Also, the
   highest-level security and reliability validation techniques (and 
   certification processes) involve static code analysis.

 - kernel/mutex.c is generic, and has minimal per-arch needs. No new 
   primitives have to be implemented to support spinlock-based generic 
   mutexes. Only 2 new atomic primitives have to be implemented for an 
   architecture to support optimized, lockless generic mutexes. In 
   contrast, to implement semaphores on a new architecture, hundreds of 
   lines of nontrivial (often assembly) code has to be written and 
   debugged.

 - kernel/mutex.c is highly hackable. New locking features can be 
   implemented in C, and they carry over to every architecture.  
   Extensive self-consistency debugging checks of the mutex
   implementation are done if CONFIG_DEBUG_MUTEXES is turned on. I do 
   think that hackability is the most important property of 
   kernel code.

 - the generic mutex subsystem is also one more step towards enabling 
   the fully preemptable -rt kernel. Ok, this shouldnt bother the 
   upstream kernel too much at the moment, but it's a personal driving
   force for me nevertheless ;-)

   (NOTE: i consciously did not list 'Priority Inheritance' amongst the 
    reasons, because priority inheritance for blocking kernel locks 
    would be a misguided reason at best, and flat out wrong at worst.)

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 12:20   ` Ingo Molnar
@ 2005-12-22 13:07     ` Andrew Morton
  2005-12-22 13:23       ` Arjan van de Ven
                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 13:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	alan, bcrl, rostedt, hch, ak, rmk+lkml

Ingo Molnar <mingo@elte.hu> wrote:
>
> ... 
>
> here are the top 10 reasons of why i think the generic mutex code should 
> be considered for upstream integration:

Appropriate for the changelog, please.

>  - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes, 
>    'struct mutex' is 16 bytes. A smaller structure size means less RAM 
>    footprint, and better CPU-cache utilization.

Because of the .sleepers thing.  Perhaps a revamped semaphore wouldn't need
thsat field?

>  - tighter code. On x86 i get the following .text sizes when 
>    switching all mutex-alike semaphores in the kernel to the mutex 
>    subsystem:
> 
>         text    data     bss     dec     hex filename
>      3280380  868188  396860 4545428  455b94 vmlinux-semaphore
>      3255329  865296  396732 4517357  44eded vmlinux-mutex
> 
>    that's 25051 bytes of code saved, or a 0.76% win - off the hottest 
>    codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%) 
>    Smaller code means better icache footprint, which is one of the 
>    major optimization goals in the Linux kernel currently.

Why is the mutex-using kernel any smaller?  IOW: from where do these
savings come?

>  - the mutex subsystem is faster and has superior scalability for 
>    contented workloads. On an 8-way x86 system, running a mutex-based 
>    kernel and testing creat+unlink+close (of separate, per-task files)
>    in /tmp with 16 parallel tasks, the average number of ops/sec is:
> 
>     Semaphores:                        Mutexes:
> 
>     $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
>     8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
>     checking VFS performance.          checking VFS performance.
>     avg loops/sec:      34713          avg loops/sec:      84153
>     CPU utilization:    63%            CPU utilization:    22%
> 
>    i.e. in this workload, the mutex based kernel was 2.4 times faster 
>    than the semaphore based kernel, _and_ it also had 2.8 times less CPU 
>    utilization. (In terms of 'ops per CPU cycle', the semaphore kernel 
>    performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 
>    performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 
>    more efficient.)
> 
>    the scalability difference is visible even on a 2-way P4 HT box:
> 
>     Semaphores:                        Mutexes:
> 
>     $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
>     4 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
>     checking VFS performance.          checking VFS performance.
>     avg loops/sec:      127659         avg loops/sec:      181082
>     CPU utilization:    100%           CPU utilization:    34%
> 
>    (the straight performance advantage of mutexes is 41%, the per-cycle 
>     efficiency of mutexes is 4.1 times better.)

Why is the mutex-using kernel more scalable?

Can semaphores be tuned to offer the same scalability improvements?

> ...
>
>  - the per-call-site inlining cost of the slowpath is cheaper and 
>    smaller than that of semaphores, by one instruction, because the 
>    mutex trampoline code does not do a "lea %0,%%eax" that the semaphore 
>    code does before calling __down_failed. The mutex subsystem uses out 
>    of line code currently so this makes only a small difference in .text
>    size, but in case we want to inline mutexes, they will be cheaper 
>    than semaphores.

Cannot the semaphore code be improved in the same manner?

>  - No wholesale or dangerous migration path. The migration to mutexes is 
>    fundamentally opt-in, safe and easy: multiple type-based and .config 
>    based migration helpers are provided to make the migration to mutexes 
>    easy. Migration is as finegrained as it gets, so robustness of the 
>    kernel or out-of-tree code should not be jeopardized at any stage.  
>    The migration helpers can be eliminated once migration is completed, 
>    once the kernel has been separated into 'mutex users' and 'semaphore 
>    users'. Out-of-tree code automatically defaults to semaphore 
>    semantics, mutexes are not forced upon anyone, at any stage of the 
>    migration.

IOW: a complete PITA, sorry.

>  - 'struct mutex' semantics are well-defined and are enforced if
>    CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have 
>    virtually no debugging code or instrumentation. The mutex subsystem 
>    checks and enforces the following rules:
> 
>    * - only one task can hold the mutex at a time
>    * - only the owner can unlock the mutex
>    * - multiple unlocks are not permitted
>    * - recursive locking is not permitted
>    * - a mutex object must be initialized via the API
>    * - a mutex object must not be initialized via memset or copying
>    * - task may not exit with mutex held
>    * - memory areas where held locks reside must not be freed

Pretty much all of that could be added to semaphores-when-used-as-mutexes. 
Without introducing a whole new locking mechanism.

>    furthermore, there are also convenience features in the debugging 
>    code:
> 
>    * - uses symbolic names of mutexes, whenever they are printed in debug output
>    * - point-of-acquire tracking, symbolic lookup of function names
>    * - list of all locks held in the system, printout of them
>    * - owner tracking
>    * - detects self-recursing locks and prints out all relevant info
>    * - detects multi-task circular deadlocks and prints out all affected
>    *   locks and tasks (and only those tasks)

Ditto, I expect.

>  - the generic mutex subsystem is also one more step towards enabling 
>    the fully preemptable -rt kernel. Ok, this shouldnt bother the 
>    upstream kernel too much at the moment, but it's a personal driving
>    force for me nevertheless ;-)

Actually that's the only compelling reason I've yet seen, sorry.

What is it about these mutexes which is useful to RT and why cannot that
feature be provided by semaphores?


Ingo, there appear to be quite a few straw-man arguments here.  You're
comparing a subsystem (semaphores) which obviously could do with a lot of
fixing and enhancing with something new which has had a lot of recent
feature work out into it.

I'd prefer to see mutexes compared with semaphores after you've put as much
work into improving semaphores as you have into developing mutexes.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:07     ` Andrew Morton
@ 2005-12-22 13:23       ` Arjan van de Ven
  2005-12-22 13:44         ` Andrew Morton
  2005-12-22 15:19       ` Nicolas Pitre
  2005-12-22 21:43       ` Paul Mackerras
  2 siblings, 1 reply; 62+ messages in thread
From: Arjan van de Ven @ 2005-12-22 13:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, nico, jes, zwane,
	oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml

> >    * - only one task can hold the mutex at a time
> >    * - only the owner can unlock the mutex
> >    * - multiple unlocks are not permitted
> >    * - recursive locking is not permitted
> >    * - a mutex object must be initialized via the API
> >    * - a mutex object must not be initialized via memset or copying
> >    * - task may not exit with mutex held
> >    * - memory areas where held locks reside must not be freed
> 
> Pretty much all of that could be added to semaphores-when-used-as-mutexes. 
> Without introducing a whole new locking mechanism.

this is basically in direct conflict with what Linus has been asking
for.. "leave semaphore semantics alone".

I agree with Linus that if it has very different rules, it should have a
different name. 

> Ingo, there appear to be quite a few straw-man arguments here.  You're
> comparing a subsystem (semaphores) which obviously could do with a lot of
> fixing and enhancing with something new which has had a lot of recent
> feature work out into it.
> 
> I'd prefer to see mutexes compared with semaphores after you've put as much
> work into improving semaphores as you have into developing mutexes.

semaphores have had a lot of work for the last... 10 years. To me that
is a sign that maybe they're close to their limit already.

You keep saying 10 times "so please enhance semaphores to do this".
Semaphores have far more complex rules for the slowpath (mutex semantics
are simple because the number of wakers is always at most one, and if
you hold the lock, you KNOW nobody else can do another release under
you). Adding dual rules or other complexity to it doesn't sound too
compelling to me actually; they're highly tricky things already (just
look at the i386 ones.. that extra wakeup was there to plug a hole (at
least that is my empirical conclusion based on "we remove it it hangs"
behavior).

Having 2 sets of behaviors for the same primitive also sounds not good
to me to be honest, that's bound to explode stuff left and right all the
time.

I realize you're not looking forward to a gradual conversion; yet Linus
says he doesn't want a wholesale change either but wants it gradual.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:23       ` Arjan van de Ven
@ 2005-12-22 13:44         ` Andrew Morton
  2005-12-22 14:11           ` Alan Cox
                             ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 13:44 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg,
	dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml

Arjan van de Ven <arjan@infradead.org> wrote:
>
> 
> > >    * - only one task can hold the mutex at a time
> > >    * - only the owner can unlock the mutex
> > >    * - multiple unlocks are not permitted
> > >    * - recursive locking is not permitted
> > >    * - a mutex object must be initialized via the API
> > >    * - a mutex object must not be initialized via memset or copying
> > >    * - task may not exit with mutex held
> > >    * - memory areas where held locks reside must not be freed
> > 
> > Pretty much all of that could be added to semaphores-when-used-as-mutexes. 
> > Without introducing a whole new locking mechanism.
> 
> this is basically in direct conflict with what Linus has been asking
> for.. "leave semaphore semantics alone".

No, it's mostly just adding new debug stuff.  And we'd need to add
additional anotations to those few semaphores which are used as non-mutexes
to support that new debug stuff.

That's if the new debugging stuff is actually useful.  I'm not sure it is,
really - I can't think of any kernel bugs which these features would have
exposed.  Although I think Ingo found some, but I don't recall what they
were.

> 
> > Ingo, there appear to be quite a few straw-man arguments here.  You're
> > comparing a subsystem (semaphores) which obviously could do with a lot of
> > fixing and enhancing with something new which has had a lot of recent
> > feature work out into it.
> > 
> > I'd prefer to see mutexes compared with semaphores after you've put as much
> > work into improving semaphores as you have into developing mutexes.
> 
> semaphores have had a lot of work for the last... 10 years. To me that
> is a sign that maybe they're close to their limit already.

No they haven't - they're basically unchanged from four years ago (at
least).

> You keep saying 10 times "so please enhance semaphores to do this".

Well I think diligence requires that we be able to demonstrate why it's not
possible.

It's plain dumb for us to justify a fancy-pants new system based on
features which we could have added to the old one, no?

> Semaphores have far more complex rules for the slowpath (mutex semantics
> are simple because the number of wakers is always at most one, and if
> you hold the lock, you KNOW nobody else can do another release under
> you). Adding dual rules or other complexity to it doesn't sound too
> compelling to me actually; they're highly tricky things already (just
> look at the i386 ones.. that extra wakeup was there to plug a hole (at
> least that is my empirical conclusion based on "we remove it it hangs"
> behavior).
> 
> Having 2 sets of behaviors for the same primitive also sounds not good
> to me to be honest, that's bound to explode stuff left and right all the
> time.

There's no need for two sets of behaviour.  What I'm saying is that we
could add similar debugging features to semaphores (if useful).  Yes, we
would have to tell the kernel at the source level to defeat that debugging
if a particular lock isn't being used as a mutex.  That's rather less
intrusive than adding a whole new type of lock.  But I'd question the value
even of doing that, given the general historical non-bugginess of existing
semaphore users.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:44         ` Andrew Morton
@ 2005-12-22 14:11           ` Alan Cox
  2005-12-22 23:30             ` Andrew Morton
  2005-12-22 15:46           ` Thomas Gleixner
  2005-12-22 17:17           ` Christoph Hellwig
  2 siblings, 1 reply; 62+ messages in thread
From: Alan Cox @ 2005-12-22 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml

On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:
> > semaphores have had a lot of work for the last... 10 years. To me that
> > is a sign that maybe they're close to their limit already.
> 
> No they haven't - they're basically unchanged from four years ago (at
> least).

Point still holds

> It's plain dumb for us to justify a fancy-pants new system based on
> features which we could have added to the old one, no?

The old one does one job well. Mutex wakeups are very different in
behaviour because of the single waker rule and the fact you can do stuff
like affine wakeups.

You could make it one function but it would inevitably end up full of

	if(mutex) {
	}

so you'd make it slower, destabilize both and not get some of the
winnings like code size.

Oh and of course you no doubt break the semaphores while doing it, while
at least this seperate way as Linus suggested you don't break the
existing functionality.

> There's no need for two sets of behaviour. 

The fundamental reason for mutexes in terms of performance and
scalability requires the different wake up behaviours. The performance
numbers are pretty compelling. I was against the original proposal but
now the gradual changeover is proposed I'm for.

Alan

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 14:11           ` Alan Cox
@ 2005-12-22 23:30             ` Andrew Morton
  2005-12-22 23:34               ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2005-12-22 23:30 UTC (permalink / raw)
  To: Alan Cox
  Cc: arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes, zwane,
	oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml

Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> On Iau, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:
> > > semaphores have had a lot of work for the last... 10 years. To me that
> > > is a sign that maybe they're close to their limit already.
> > 
> > No they haven't - they're basically unchanged from four years ago (at
> > least).
> 
> Point still holds

No it does not.

Ingo's work has shown us two things:

a) semaphores use more kernel text than they should and

b) semaphores are less efficient than they could be.

Fine.  Let's update the semaphore implementation to fix those things. 
Nobody has addressed this code in several years.  If we conclusively cannot
fix these things then that's the time to start looking at implementing new
locking mechanisms.

> > It's plain dumb for us to justify a fancy-pants new system based on
> > features which we could have added to the old one, no?
> 
> The old one does one job well. Mutex wakeups are very different in
> behaviour because of the single waker rule and the fact you can do stuff
> like affine wakeups.
> 
> You could make it one function but it would inevitably end up full of
> 
> 	if(mutex) {
> 	}

We'd only need such constructs if we were trying to add all the mutex
runtime debugging features to semaphores.  I don't think that's likely to
be very useful so fine, let's not do that.

> Oh and of course you no doubt break the semaphores while doing it, while
> at least this seperate way as Linus suggested you don't break the
> existing functionality.

Linus would not be averse to a patch which optimises semaphores.

> > There's no need for two sets of behaviour. 
> 
> The fundamental reason for mutexes in terms of performance and
> scalability requires the different wake up behaviours. The performance
> numbers are pretty compelling.

This where I have a bad feeling that I'm missing some vital clue.

An efficient semaphore implementation will wake a single process per up(). 
Just like mutexes.  So why should mutexes have any performance advantage?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 23:30             ` Andrew Morton
@ 2005-12-22 23:34               ` Christoph Hellwig
       [not found]                 ` <32801.10.10.10.28.1135295357.squirrel@linux1>
                                   ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 23:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alan Cox, arjan, mingo, linux-kernel, torvalds, arjanv, nico, jes,
	zwane, oleg, dhowells, bcrl, rostedt, hch, ak, rmk+lkml

On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
> No it does not.
> 
> Ingo's work has shown us two things:
> 
> a) semaphores use more kernel text than they should and
> 
> b) semaphores are less efficient than they could be.
> 
> Fine.  Let's update the semaphore implementation to fix those things. 
> Nobody has addressed this code in several years.  If we conclusively cannot
> fix these things then that's the time to start looking at implementing new
> locking mechanisms.

c) semaphores are total overkill for 99% percent of the users.  Remember
this thing about optimizing for the common case?

Pretty much everywhere we do want mutex semantic.  So let's have a proper
primitive exactly for that, and we can keep the current semaphore
implementation (with a much simpler implementation) for that handfull of
users in the kernel that really want a counting semaphore.

I really don't get why you hate mutex primitives so much.

^ permalink raw reply	[flat|nested] 62+ messages in thread

[parent not found: <32801.10.10.10.28.1135295357.squirrel@linux1>]

* Re: [patch 0/9] mutex subsystem, -V4
       [not found]                 ` <32801.10.10.10.28.1135295357.squirrel@linux1>
@ 2005-12-22 23:49                   ` Sean
  2005-12-22 23:53                     ` Randy.Dunlap
  0 siblings, 1 reply; 62+ messages in thread
From: Sean @ 2005-12-22 23:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds,
	arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, hch, ak,
	rmk+lkml

On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said:
> On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
>> No it does not.
>>
>> Ingo's work has shown us two things:
>>
>> a) semaphores use more kernel text than they should and
>>
>> b) semaphores are less efficient than they could be.
>>
>> Fine.  Let's update the semaphore implementation to fix those things.
>> Nobody has addressed this code in several years.  If we conclusively
>> cannot
>> fix these things then that's the time to start looking at implementing
>> new
>> locking mechanisms.
>
> c) semaphores are total overkill for 99% percent of the users.  Remember
> this thing about optimizing for the common case?
>
> Pretty much everywhere we do want mutex semantic.  So let's have a proper
> primitive exactly for that, and we can keep the current semaphore
> implementation (with a much simpler implementation) for that handfull of
> users in the kernel that really want a counting semaphore.
>
> I really don't get why you hate mutex primitives so much.
>

Yes it's hard to figure.  It seems to be deeper than just hating mutex
primitives, he hates the timer core updates that come from Ingo too; this
may be a general dislike for all things -rt.

Sean


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 23:49                   ` Sean
@ 2005-12-22 23:53                     ` Randy.Dunlap
       [not found]                       ` <50572.10.10.10.28.1135296023.squirrel@linux1>
  2005-12-23  0:00                       ` Steven Rostedt
  0 siblings, 2 replies; 62+ messages in thread
From: Randy.Dunlap @ 2005-12-22 23:53 UTC (permalink / raw)
  To: Sean
  Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo,
	linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	bcrl, rostedt, ak, rmk+lkml

On Thu, 22 Dec 2005, Sean wrote:

> On Thu, December 22, 2005 6:34 pm, Christoph Hellwig said:
> > On Thu, Dec 22, 2005 at 03:30:14PM -0800, Andrew Morton wrote:
> >> No it does not.
> >>
> >> Ingo's work has shown us two things:
> >>
> >> a) semaphores use more kernel text than they should and
> >>
> >> b) semaphores are less efficient than they could be.
> >>
> >> Fine.  Let's update the semaphore implementation to fix those things.
> >> Nobody has addressed this code in several years.  If we conclusively
> >> cannot
> >> fix these things then that's the time to start looking at implementing
> >> new
> >> locking mechanisms.
> >
> > c) semaphores are total overkill for 99% percent of the users.  Remember
> > this thing about optimizing for the common case?
> >
> > Pretty much everywhere we do want mutex semantic.  So let's have a proper
> > primitive exactly for that, and we can keep the current semaphore
> > implementation (with a much simpler implementation) for that handfull of
> > users in the kernel that really want a counting semaphore.
> >
> > I really don't get why you hate mutex primitives so much.
> >
>
> Yes it's hard to figure.  It seems to be deeper than just hating mutex
> primitives, he hates the timer core updates that come from Ingo too; this
> may be a general dislike for all things -rt.

Andrew can surely answer that, but it could be something as
simple as wanting to build a more stable kernel (one without
so much churn), so that things have time to mature and
improve without breaking so many other things...

This (current) is a hectic development cycle style.
-- 
~Randy

^ permalink raw reply	[flat|nested] 62+ messages in thread

[parent not found: <50572.10.10.10.28.1135296023.squirrel@linux1>]

* Re: [patch 0/9] mutex subsystem, -V4
       [not found]                       ` <50572.10.10.10.28.1135296023.squirrel@linux1>
@ 2005-12-23  0:00                         ` Sean
  0 siblings, 0 replies; 62+ messages in thread
From: Sean @ 2005-12-23  0:00 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo,
	linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	bcrl, rostedt, ak, rmk+lkml

On Thu, December 22, 2005 6:53 pm, Randy.Dunlap said:

> Andrew can surely answer that, but it could be something as
> simple as wanting to build a more stable kernel (one without
> so much churn), so that things have time to mature and
> improve without breaking so many other things...
>
> This (current) is a hectic development cycle style.

Sure, its probably that simple.  Just seems like the techincal arguments
clearly support adding a mutex.

Sean


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 23:53                     ` Randy.Dunlap
       [not found]                       ` <50572.10.10.10.28.1135296023.squirrel@linux1>
@ 2005-12-23  0:00                       ` Steven Rostedt
  1 sibling, 0 replies; 62+ messages in thread
From: Steven Rostedt @ 2005-12-23  0:00 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Sean, Christoph Hellwig, Andrew Morton, Alan Cox, arjan, mingo,
	linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	bcrl, ak, rmk+lkml


On Thu, 22 Dec 2005, Randy.Dunlap wrote:
>
> Andrew can surely answer that, but it could be something as
> simple as wanting to build a more stable kernel (one without
> so much churn), so that things have time to mature and
> improve without breaking so many other things...

Not to mention that when this gets into the kernel, Andrew's the one that
will have to deal with a ton of "switch semaphores to mutexes" patches
that will slowly update the entire kernel.  These patches will probably
continue to come in for a few years. ;)

IIRC, that was one of his rational reasons why he didn't want the timer
name change.

-- Steve


^ permalink raw reply	[flat|nested] 62+ messages in thread

[parent not found: <20051222221311.2f6056ec.akpm@osdl.org>]

* Re: [patch 0/9] mutex subsystem, -V4
       [not found]                 ` <20051222221311.2f6056ec.akpm@osdl.org>
@ 2005-12-23 14:24                   ` Nicolas Pitre
  2005-12-23 14:51                     ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-23 14:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, alan, arjan, mingo, lkml, torvalds, arjanv,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

On Thu, 22 Dec 2005, Andrew Morton wrote:

> Christoph Hellwig <hch@infradead.org> wrote:
> > I really don't get why you hate mutex primitives so much.
> 
> I've just spelled out in considerable detail why this work is premature. 
> How can you not "get" it?  Why do I have to keep saying the same thing in
> different ways?  It's really quite simple.
> 
> 
> So here is permutation #4:
> 
> If we can optimise semaphores for speed and space, the mutexes are
> *unneeded*.

How can't you get the fact that semaphores could _never_ be as simple as 
mutexes?  This is a theoritical impossibility, which maybe turns out not 
to be so true on x86, but which is damn true on ARM where the fast path 
(the common case of a mutex) is significantly more efficient.

Semaphores _require_ an atomic decrement, mutexes do not.  On some 
architectures that makes a huge difference.

> And I think we _should_ optimise semaphores for speed and space.  Don't you?

No one disagrees with that.

> If we can do that, there is no point at all in adding a new lock type which
> has no speed advantage and no space advantage and which has less
> functionality than semaphores.

The very point is that mutexes will always have a speed advantage by 
nature.

> And, repeating myself yet again: if we can demonstrate that it is not
> feasible to optimise semaphores to the same performance and space efficiency
> of mutexes then (and only then) we have a reason for adding mutexes.

I spent the whole week making that demonstration repeatedly.  Why are 
you ignoring me?


Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 14:24                   ` Nicolas Pitre
@ 2005-12-23 14:51                     ` Andrew Morton
  2005-12-23 14:57                       ` Russell King
                                         ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-23 14:51 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, jes,
	zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

Nicolas Pitre <nico@cam.org> wrote:
>
> How can't you get the fact that semaphores could _never_ be as simple as 
> mutexes?  This is a theoritical impossibility, which maybe turns out not 
> to be so true on x86, but which is damn true on ARM where the fast path 
> (the common case of a mutex) is significantly more efficient.
>

I did notice your comments.  I'll grant that mutexes will save some tens of
fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
very important.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 14:51                     ` Andrew Morton
@ 2005-12-23 14:57                       ` Russell King
  2005-12-23 15:04                         ` Xavier Bestel
  2005-12-23 15:00                       ` Steven Rostedt
  2006-01-03 17:54                       ` Abhijit Bhopatkar
  2 siblings, 1 reply; 62+ messages in thread
From: Russell King @ 2005-12-23 14:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds,
	arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak

On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > How can't you get the fact that semaphores could _never_ be as simple as 
> > mutexes?  This is a theoritical impossibility, which maybe turns out not 
> > to be so true on x86, but which is damn true on ARM where the fast path 
> > (the common case of a mutex) is significantly more efficient.
> 
> I did notice your comments.  I'll grant that mutexes will save some tens of
> fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
> very important.

Wow.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 14:57                       ` Russell King
@ 2005-12-23 15:04                         ` Xavier Bestel
  2005-12-23 15:27                           ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Xavier Bestel @ 2005-12-23 15:04 UTC (permalink / raw)
  To: Russell King
  Cc: Andrew Morton, Nicolas Pitre, hch, alan, arjan, mingo,
	linux-kernel, torvalds, arjanv, jes, zwane, oleg, dhowells, bcrl,
	rostedt, ak

On Fri, 2005-12-23 at 15:57, Russell King wrote:
> On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> > Nicolas Pitre <nico@cam.org> wrote:
> > > How can't you get the fact that semaphores could _never_ be as simple as 
> > > mutexes?  This is a theoritical impossibility, which maybe turns out not 
> > > to be so true on x86, but which is damn true on ARM where the fast path 
> > > (the common case of a mutex) is significantly more efficient.
> > 
> > I did notice your comments.  I'll grant that mutexes will save some tens of
> > fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
> > very important.
> 
> Wow.

Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom
cycles are important and ARM is king.

	Xav



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 15:04                         ` Xavier Bestel
@ 2005-12-23 15:27                           ` Andrew Morton
  0 siblings, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-23 15:27 UTC (permalink / raw)
  To: Xavier Bestel
  Cc: rmk+lkml, nico, hch, alan, arjan, mingo, linux-kernel, torvalds,
	arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak

Xavier Bestel <xavier.bestel@free.fr> wrote:
>
> On Fri, 2005-12-23 at 15:57, Russell King wrote:
> > On Fri, Dec 23, 2005 at 06:51:18AM -0800, Andrew Morton wrote:
> > > Nicolas Pitre <nico@cam.org> wrote:
> > > > How can't you get the fact that semaphores could _never_ be as simple as 
> > > > mutexes?  This is a theoritical impossibility, which maybe turns out not 
> > > > to be so true on x86, but which is damn true on ARM where the fast path 
> > > > (the common case of a mutex) is significantly more efficient.
> > > 
> > > I did notice your comments.  I'll grant that mutexes will save some tens of
> > > fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
> > > very important.
> > 
> > Wow.
> 
> Yes, wow. Andrew doesn't seem aware of embedded linux people, for whom
> cycles are important and ARM is king.
> 

Please, spare me the rhetoric.

Adding a new locking primitive is a big deal.  We need good reasons for
doing it.  And no, I don't think a few cycles on ARM is good enough.  I'd
be very surprised if it was measurable - the busiest semaphore is probably
i_sem and when it's taken we're also doing heavy filesystem operations
which would swamp any benefit.  And if we're going to churn i_sem then we
perhaps should turn it into an rwsem so we can perform concurrent lookups
(at least).  We do disk I/O with that thing held.

Look, I'm not wildly against mutexes - it's not exactly a large amount of
code.  I just think they're fairly pointless and I regret that we seem to
be diving into adding yet another locking primitive without having fully
investigated improving the existing one.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 14:51                     ` Andrew Morton
  2005-12-23 14:57                       ` Russell King
@ 2005-12-23 15:00                       ` Steven Rostedt
  2006-01-03 17:54                       ` Abhijit Bhopatkar
  2 siblings, 0 replies; 62+ messages in thread
From: Steven Rostedt @ 2005-12-23 15:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds,
	arjanv, jes, zwane, oleg, dhowells, bcrl, ak, rmk+lkml


On Fri, 23 Dec 2005, Andrew Morton wrote:

> Nicolas Pitre <nico@cam.org> wrote:
> >
> > How can't you get the fact that semaphores could _never_ be as simple as
> > mutexes?  This is a theoritical impossibility, which maybe turns out not
> > to be so true on x86, but which is damn true on ARM where the fast path
> > (the common case of a mutex) is significantly more efficient.
> >
>
> I did notice your comments.  I'll grant that mutexes will save some tens of
> fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
> very important.
>

"minor architecture"?  Granted, I don't know of any ARM desktops or
servers, but there's a large number of ARM devices out in the real world.
Or are we giving up on Linux being an embedded OS?

-- Steve


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-23 14:51                     ` Andrew Morton
  2005-12-23 14:57                       ` Russell King
  2005-12-23 15:00                       ` Steven Rostedt
@ 2006-01-03 17:54                       ` Abhijit Bhopatkar
  2 siblings, 0 replies; 62+ messages in thread
From: Abhijit Bhopatkar @ 2006-01-03 17:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nicolas Pitre, hch, alan, arjan, mingo, linux-kernel, torvalds,
	arjanv, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

> > How can't you get the fact that semaphores could _never_ be as simple as
> > mutexes?  This is a theoritical impossibility, which maybe turns out not
> > to be so true on x86, but which is damn true on ARM where the fast path
> > (the common case of a mutex) is significantly more efficient.
> >
>
> I did notice your comments.  I'll grant that mutexes will save some tens  of
> fastpath cycles on one minor architecture.  Sorry, but that doesn't seem
> very important.

Heh !! i can't find words so i will just spell the emotion....
COMPLAIN HARD

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 23:34               ` Christoph Hellwig
       [not found]                 ` <32801.10.10.10.28.1135295357.squirrel@linux1>
       [not found]                 ` <20051222221311.2f6056ec.akpm@osdl.org>
@ 2005-12-25 16:08                 ` Roman Zippel
  2005-12-25 22:54                   ` Ingo Molnar
  2005-12-25 23:04                   ` Andrew Morton
  2 siblings, 2 replies; 62+ messages in thread
From: Roman Zippel @ 2005-12-25 16:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Alan Cox, arjan, mingo, linux-kernel, torvalds,
	arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak,
	rmk+lkml

Hi,

On Friday 23 December 2005 00:34, Christoph Hellwig wrote:

> c) semaphores are total overkill for 99% percent of the users.  Remember
> this thing about optimizing for the common case?

Semaphores are not that different from mutexes.
What makes me suspicious is the large difference in the test results, that 
either means something is wrong with the test or something is wrong with the 
semaphores. From reading the discussion I still don't really know, why the 
improvements to mutexes can't be applied to semaphores. I also haven't hardly 
seen any discussion about why semaphores the way they are. Linus did suspect 
there is a wakeup bug in the semaphore, but there was no conclusive followup 
to that.
IMO there are still too many questions open, so I can understand Andrew. We 
may only cover up the problem instead of fixing it. I understand that mutexes 
have advantages, but if we compare them to semaphores it should be a fair 
comparison, otherwise people start to think semaphores are something bad. The 
majority of the discussion has been about microoptimisation, but on some 
archs non-debug mutexes and semaphores may very well be the same thing.

bye, Roman

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 16:08                 ` Roman Zippel
@ 2005-12-25 22:54                   ` Ingo Molnar
  2005-12-26 21:49                     ` Roman Zippel
  2005-12-25 23:04                   ` Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-25 22:54 UTC (permalink / raw)
  To: Roman Zippel
  Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, linux-kernel,
	torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt,
	ak, rmk+lkml

* Roman Zippel <zippel@linux-m68k.org> wrote:

> > c) semaphores are total overkill for 99% percent of the users.  Remember
> > this thing about optimizing for the common case?
> 
> [...] I also haven't hardly seen any discussion about why semaphores 
> the way they are. Linus did suspect there is a wakeup bug in the 
> semaphore, but there was no conclusive followup to that.

no conclusive follow-up because ... they are too complex for people to 
answer such questions off the cuff? Something so frequently used in 
trivial ways should have the complexity of that typical use, not the 
complexity of the theoretical use. There is no problem with semaphores, 
other than that they are not being used as semaphores all that often.

for which i think there is a rather simple practical reason: if i want 
to control a counted resource within the kernel, it is rarely the 
simplest solution to use a semaphore for it, because a semaphore cannot 
be used to protect data structures in the 'resource is available' stage 
[i.e. when the semaphore count is above zero]. It does decrement the 
counter atomically, but that is just half of the job i have to do.

to control (allocate/free) the resource i _have to_ add some other 
locking mechanism anyway in most cases (a spinlock most likely, to 
protect the internal list and other internal state) - at which point 
it's simpler and faster to simply add a counter and a waitqueue to those 
existing internal variables, than to add a separate locking object to 
around (or within) the whole construct.

semaphores would be nice in theory, if there was a way to attach the 
'decrement counter atomically' operation to another set of atomic ops, 
like list_del() or list_add(), making the whole thing transactional.  
[this would also be a wholly new API, so it only applies to semaphores 
as a concept, not our actual semaphore incarnation] So i see the 
theoretical beauty of semaphores, but in practice, CPUs force us to work 
with much simpler constructs.

there are some exceptions: e.g. when the resource is _nothing else_ but 
a count.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 22:54                   ` Ingo Molnar
@ 2005-12-26 21:49                     ` Roman Zippel
  0 siblings, 0 replies; 62+ messages in thread
From: Roman Zippel @ 2005-12-26 21:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Andrew Morton, Alan Cox, arjan, linux-kernel,
	torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt,
	ak, rmk+lkml

Hi,

On Sunday 25 December 2005 23:54, Ingo Molnar wrote:

> > [...] I also haven't hardly seen any discussion about why semaphores
> > the way they are. Linus did suspect there is a wakeup bug in the
> > semaphore, but there was no conclusive followup to that.
>
> no conclusive follow-up because ... they are too complex for people to
> answer such questions off the cuff? Something so frequently used in
> trivial ways should have the complexity of that typical use, not the
> complexity of the theoretical use. There is no problem with semaphores,
> other than that they are not being used as semaphores all that often.

It shouldn't be that out of the blue for you. I don't mind the whole concept 
of mutexes and I agree that that's what most semaphores are used for. Please 
stop for a moment trying to sell mutexes, the basic question I'd like to get 
answered is, what is the worst-case scenerio if we convert everything to 
mutexes?
To make it very clear: I'm not arguing against mutexes, I only want a look at 
the complete picture, I don't only want to see the undoubted advantages, but 
also what are the risks? Is the semaphore wakeup behaviour really only a bug 
or does it fix some problem that just nobody remembers (and maybe even 
doesn't exist anymore)? What about the fairness issues mentioned, how easy is 
it to starve waiters?
Ingo, you're working on it already for a while, so I would expect you already 
thought about possible problems already, so why don't you take a look at the 
risks for us instead of just explaining the advantages? What are the chances 
we end up with semaphores just under a different name? Are there other 
possible problems, which then can be only solved e.g. by adding priority 
inheritance?
The point of this is to be prepared for any predictable problem, since this 
change in its consequence is rather huge and we don't have the luxury of a 
single development tree anymore, which is used by most developers.

bye, Roman

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 16:08                 ` Roman Zippel
  2005-12-25 22:54                   ` Ingo Molnar
@ 2005-12-25 23:04                   ` Andrew Morton
  2005-12-25 23:22                     ` Ingo Molnar
                                       ` (2 more replies)
  1 sibling, 3 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-25 23:04 UTC (permalink / raw)
  To: Roman Zippel
  Cc: hch, alan, arjan, mingo, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

Roman Zippel <zippel@linux-m68k.org> wrote:
>
> IMO there are still too many questions open, so I can understand Andrew. We 
> may only cover up the problem instead of fixing it. I understand that mutexes 
> have advantages, but if we compare them to semaphores it should be a fair 
> comparison, otherwise people start to think semaphores are something bad. The 
> majority of the discussion has been about microoptimisation, but on some 
> archs non-debug mutexes and semaphores may very well be the same thing.

Ingo has told me offline that he thinks that we can indeed remove the
current semaphore implementation.

90% of existing semaphore users get migrated to mutexes.

9% of current semaphore users get migrated to completions.

The remaining 1% of semaphore users are using the counting feature.  We
reimplement that in a mostly-arch-independent fashion and then remove the
current semaphore implementation.  Note that there are no sequencing
dependencies in all the above.

It's a lot of churn, but we'll end up with a better end result and a
somewhat-net-simpler kernel, so I'm happy.

One side point on semaphores and mutexes: the so-called "fast path" is
generally not performance-critical, because we just don't take them at high
frequencies.  Any workload which involves taking a semaphore at more than
50,000-100,000 times/second tends to have ghastly overscheduling failure
scenarios on SMP.  So people hit those scenarios and the code gets
converted to a lockless algorithm or to use spinlocking.

For example, for a while ext3/JBD was doing 200,000 context-switches per
second due to taking lock_super() at high frequencies.  When I converted
the whole fs to use spin locking throughout the performance in some
workloads went up by 1000%.

Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
is _designed_ to hit that lock hard.  So if/when i_sem is converted to a
mutex, I figure that the benefits to ARM in that workload will be about a
0.01% performance increase.  ie: about two hours' worth of Moore's law in a
dopey microbenchmark.

For these reasons, I think that with sleeping locks, the fastpath is
realtively unimportant.  What _is_ important is not screwing up the
slowpath and not taking the lock at too high a frequency.  Because either
one will cause overscheduling which is a grossly worse problem.

Also, there's very little point in adding lots of tricky arch-dependent
code and generally mucking up the kernel source to squeeze the last drop of
performance out of the sleeping lock fastpath.  Because if those changes
actually make a difference, we've already lost - the code needs to be
changed to use spinlocking.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 23:04                   ` Andrew Morton
@ 2005-12-25 23:22                     ` Ingo Molnar
  2005-12-26 10:35                       ` Andrew Morton
  2005-12-26  0:33                     ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek
  2005-12-26 15:29                     ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre
  2 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-25 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Zippel, hch, alan, arjan, linux-kernel, torvalds, arjanv,
	nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

* Andrew Morton <akpm@osdl.org> wrote:

> One side point on semaphores and mutexes: the so-called "fast path" is 
> generally not performance-critical, because we just don't take them at 
> high frequencies.  Any workload which involves taking a semaphore at 
> more than 50,000-100,000 times/second tends to have ghastly 
> overscheduling failure scenarios on SMP.  So people hit those 
> scenarios and the code gets converted to a lockless algorithm or to 
> use spinlocking.
> 
> For example, for a while ext3/JBD was doing 200,000 context-switches 
> per second due to taking lock_super() at high frequencies.  When I 
> converted the whole fs to use spin locking throughout the performance 
> in some workloads went up by 1000%.

actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock 
conversion pain could have been avoided by converting ext3 to the mutex 
code. Mutexes definitely do not overschedule, even in very high 
frequency lock/unlock scenarios. They behave and perform quite close to 
spinlocks. (which property is obviously a must for the -rt kernel, where 
all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes - 
providing a big playground for locking constructs)

hm, can you see any easy way for me to test my bold assertion on ext3, 
by somehow moving/hacking it back to semaphores? I could then apply the 
MUTEX_DEBUG_FULL mechanism to it and redo those ext3 semaphore vs.  
spinlock benchmarks on a couple of boxes, and add a mutex column to the 
table of numbers.

also, often it's simpler to use a sleeping lock than to use a spinlock 
for something, so we want to have that option open.

	Ingo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 23:22                     ` Ingo Molnar
@ 2005-12-26 10:35                       ` Andrew Morton
  2005-12-26 10:42                         ` Arjan van de Ven
  2005-12-27 14:42                         ` Ingo Molnar
  0 siblings, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-26 10:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: zippel, hch, alan, arjan, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

Ingo Molnar <mingo@elte.hu> wrote:
>
> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > One side point on semaphores and mutexes: the so-called "fast path" is 
> > generally not performance-critical, because we just don't take them at 
> > high frequencies.  Any workload which involves taking a semaphore at 
> > more than 50,000-100,000 times/second tends to have ghastly 
> > overscheduling failure scenarios on SMP.  So people hit those 
> > scenarios and the code gets converted to a lockless algorithm or to 
> > use spinlocking.
> > 
> > For example, for a while ext3/JBD was doing 200,000 context-switches 
> > per second due to taking lock_super() at high frequencies.  When I 
> > converted the whole fs to use spin locking throughout the performance 
> > in some workloads went up by 1000%.
> 
> actually, i'm 99.9% certain [ ;-) ] that all that ext3 spinlock 
> conversion pain could have been avoided by converting ext3 to the mutex 
> code. Mutexes definitely do not overschedule, even in very high 
> frequency lock/unlock scenarios. They behave and perform quite close to 
> spinlocks. (which property is obviously a must for the -rt kernel, where 
> all spinlocks, rwlocks, seqlocks, rwsems and semaphores are mutexes - 
> providing a big playground for locking constructs)

hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
of a short spin is much less than the cost of a sleep/wakeup.  The machine
was doing 100,000 - 200,000 context switches per second.

> hm, can you see any easy way for me to test my bold assertion on ext3, 
> by somehow moving/hacking it back to semaphores?

Not really.  The problem was most apparent after the lock_kernel() removal
patches.  The first thing a CPU hit when it entered the fs was previously
lock_kernel().  That became lock_super() and performance went down the
tubes.  From memory, the bad kernel was tip-of-tree as of Memorial Weekend
2003 ;)

I guess you could re-add all the lock_super()s as per 2.5.x's ext3/jbd,
check that it sucks running SDET on 8-way then implement the lock_super()s
via a mutex.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 10:35                       ` Andrew Morton
@ 2005-12-26 10:42                         ` Arjan van de Ven
  2005-12-26 11:11                           ` Andrew Morton
  2005-12-26 18:15                           ` Linus Torvalds
  2005-12-27 14:42                         ` Ingo Molnar
  1 sibling, 2 replies; 62+ messages in thread
From: Arjan van de Ven @ 2005-12-26 10:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, zippel, hch, alan, linux-kernel, torvalds, arjanv,
	nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml


> hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
> of a short spin is much less than the cost of a sleep/wakeup.  The machine
> was doing 100,000 - 200,000 context switches per second.

interesting.. this might be a good indication that a "spin a bit first"
mutex slowpath for some locks might be worth implementing...



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 10:42                         ` Arjan van de Ven
@ 2005-12-26 11:11                           ` Andrew Morton
  2005-12-26 17:15                             ` Mike Galbraith
  2005-12-27  0:32                             ` David Lang
  2005-12-26 18:15                           ` Linus Torvalds
  1 sibling, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-26 11:11 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

Arjan van de Ven <arjan@infradead.org> wrote:
>
> 
> > hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
> > of a short spin is much less than the cost of a sleep/wakeup.  The machine
> > was doing 100,000 - 200,000 context switches per second.
> 
> interesting.. this might be a good indication that a "spin a bit first"
> mutex slowpath for some locks might be worth implementing...

If we see a workload which is triggering such high context switch rates,
maybe.  But I don't think we've seen any such for a long time.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 11:11                           ` Andrew Morton
@ 2005-12-26 17:15                             ` Mike Galbraith
  2005-12-26 17:44                               ` Lee Revell
  2005-12-27  0:32                             ` David Lang
  1 sibling, 1 reply; 62+ messages in thread
From: Mike Galbraith @ 2005-12-26 17:15 UTC (permalink / raw)
  To: Andrew Morton, Arjan van de Ven
  Cc: mingo, zippel, hch, alan, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote:
>Arjan van de Ven <arjan@infradead.org> wrote:
> >
> >
> > > hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
> > > of a short spin is much less than the cost of a sleep/wakeup.  The 
> machine
> > > was doing 100,000 - 200,000 context switches per second.
> >
> > interesting.. this might be a good indication that a "spin a bit first"
> > mutex slowpath for some locks might be worth implementing...
>
>If we see a workload which is triggering such high context switch rates,
>maybe.  But I don't think we've seen any such for a long time.

Hmm.  Is there a real workload where such a high context switch rate is 
necessary?  Every time I've seen a high (100,000 - 200,000 is beyond absurd 
on my little box, but...) context switch rate, it's been because something 
sucked.

         -Mike   


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 17:15                             ` Mike Galbraith
@ 2005-12-26 17:44                               ` Lee Revell
  0 siblings, 0 replies; 62+ messages in thread
From: Lee Revell @ 2005-12-26 17:44 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andrew Morton, Arjan van de Ven, mingo, zippel, hch, alan,
	linux-kernel, torvalds, arjanv, nico, jes, zwane, oleg, dhowells,
	bcrl, rostedt, ak, rmk+lkml

On Mon, 2005-12-26 at 18:15 +0100, Mike Galbraith wrote:
> At 03:11 AM 12/26/2005 -0800, Andrew Morton wrote:
> >Arjan van de Ven <arjan@infradead.org> wrote:
> > >
> > >
> > > > hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
> > > > of a short spin is much less than the cost of a sleep/wakeup.  The 
> > machine
> > > > was doing 100,000 - 200,000 context switches per second.
> > >
> > > interesting.. this might be a good indication that a "spin a bit first"
> > > mutex slowpath for some locks might be worth implementing...
> >
> >If we see a workload which is triggering such high context switch rates,
> >maybe.  But I don't think we've seen any such for a long time.
> 
> Hmm.  Is there a real workload where such a high context switch rate is 
> necessary?  Every time I've seen a high (100,000 - 200,000 is beyond absurd 
> on my little box, but...) context switch rate, it's been because something 
> sucked.

I can trivially produce 20K per second on my little sub Ghz box so 100K
on a busy server is certainly plausible.  Especially if for the purposes
of this discussion we are also worried about -rt + IRQ threading where
each IRQ costs two context switches (more if it raises a softirq).

Lee


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 11:11                           ` Andrew Morton
  2005-12-26 17:15                             ` Mike Galbraith
@ 2005-12-27  0:32                             ` David Lang
  1 sibling, 0 replies; 62+ messages in thread
From: David Lang @ 2005-12-27  0:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, mingo, zippel, hch, alan, linux-kernel,
	torvalds, arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt,
	ak, rmk+lkml

On Mon, 26 Dec 2005, Andrew Morton wrote:

> Arjan van de Ven <arjan@infradead.org> wrote:
>>
>>
>>> hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
>>> of a short spin is much less than the cost of a sleep/wakeup.  The machine
>>> was doing 100,000 - 200,000 context switches per second.
>>
>> interesting.. this might be a good indication that a "spin a bit first"
>> mutex slowpath for some locks might be worth implementing...
>
> If we see a workload which is triggering such high context switch rates,
> maybe.  But I don't think we've seen any such for a long time.
>

how does 'spin a bit' interact with virutal CPU's (HT and equivalent).

it would seem to me that on a multi true-core system the spin-a-bit is a 
win becouse it allows the other CPU's to release the lock, but on virtual 
CPU's all the spinning would do is to delay the other virtual CPU's 
running and therefor delay the lock getting released.

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 10:42                         ` Arjan van de Ven
  2005-12-26 11:11                           ` Andrew Morton
@ 2005-12-26 18:15                           ` Linus Torvalds
  1 sibling, 0 replies; 62+ messages in thread
From: Linus Torvalds @ 2005-12-26 18:15 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrew Morton, Ingo Molnar, zippel, hch, alan, linux-kernel,
	arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak,
	rmk+lkml



On Mon, 26 Dec 2005, Arjan van de Ven wrote:
> 
> > hm.  16 CPUs hitting the same semaphore at great arrival rates.  The cost
> > of a short spin is much less than the cost of a sleep/wakeup.  The machine
> > was doing 100,000 - 200,000 context switches per second.
> 
> interesting.. this might be a good indication that a "spin a bit first"
> mutex slowpath for some locks might be worth implementing...

No, please don't. 

Almost always it's a huge sign that the locking is totally broken.

And yes, the fix was to _fix_ the ext3 locking. Not to make semaphores 
spin. The ext3 locking was using a semaphore for totally the wrong 
reasons, it made zero sense at all.

I think it's fixed now.

		Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-26 10:35                       ` Andrew Morton
  2005-12-26 10:42                         ` Arjan van de Ven
@ 2005-12-27 14:42                         ` Ingo Molnar
  2005-12-27 23:02                           ` Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2005-12-27 14:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> > hm, can you see any easy way for me to test my bold assertion on ext3, 
> > by somehow moving/hacking it back to semaphores?
> 
> Not really.  The problem was most apparent after the lock_kernel() 
> removal patches.  The first thing a CPU hit when it entered the fs was 
> previously lock_kernel().  That became lock_super() and performance 
> went down the tubes.  From memory, the bad kernel was tip-of-tree as 
> of Memorial Weekend 2003 ;)
> 
> I guess you could re-add all the lock_super()s as per 2.5.x's 
> ext3/jbd, check that it sucks running SDET on 8-way then implement the 
> lock_super()s via a mutex.

ok - does the patch below look roughly ok as a really bad (but 
functional) hack to restore that old behavior, for ext2?

	Ingo

 fs/ext2/balloc.c |   19 ++++++++++++++++---
 fs/ext2/ialloc.c |   15 +++++++++++++--
 2 files changed, 30 insertions(+), 6 deletions(-)

Index: linux/fs/ext2/balloc.c
===================================================================
--- linux.orig/fs/ext2/balloc.c
+++ linux/fs/ext2/balloc.c
@@ -139,7 +139,7 @@ static void release_blocks(struct super_
 	}
 }
 
-static int group_reserve_blocks(struct ext2_sb_info *sbi, int group_no,
+static int group_reserve_blocks(struct super_block *sb, struct ext2_sb_info *sbi, int group_no,
 	struct ext2_group_desc *desc, struct buffer_head *bh, int count)
 {
 	unsigned free_blocks;
@@ -147,12 +147,15 @@ static int group_reserve_blocks(struct e
 	if (!desc->bg_free_blocks_count)
 		return 0;
 
+	lock_super(sb);
 	spin_lock(sb_bgl_lock(sbi, group_no));
 	free_blocks = le16_to_cpu(desc->bg_free_blocks_count);
 	if (free_blocks < count)
 		count = free_blocks;
 	desc->bg_free_blocks_count = cpu_to_le16(free_blocks - count);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
+	unlock_super(sb);
+
 	mark_buffer_dirty(bh);
 	return count;
 }
@@ -164,10 +167,12 @@ static void group_release_blocks(struct 
 		struct ext2_sb_info *sbi = EXT2_SB(sb);
 		unsigned free_blocks;
 
+		lock_super(sb);
 		spin_lock(sb_bgl_lock(sbi, group_no));
 		free_blocks = le16_to_cpu(desc->bg_free_blocks_count);
 		desc->bg_free_blocks_count = cpu_to_le16(free_blocks + count);
 		spin_unlock(sb_bgl_lock(sbi, group_no));
+		unlock_super(sb);
 		sb->s_dirt = 1;
 		mark_buffer_dirty(bh);
 	}
@@ -234,6 +239,7 @@ do_more:
 			    "Block = %lu, count = %lu",
 			    block, count);
 
+	lock_super(sb);
 	for (i = 0, group_freed = 0; i < count; i++) {
 		if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
 						bit + i, bitmap_bh->b_data)) {
@@ -243,6 +249,7 @@ do_more:
 			group_freed++;
 		}
 	}
+	unlock_super(sb);
 
 	mark_buffer_dirty(bitmap_bh);
 	if (sb->s_flags & MS_SYNCHRONOUS)
@@ -377,7 +384,7 @@ int ext2_new_block(struct inode *inode, 
 		goto io_error;
 	}
 
-	group_alloc = group_reserve_blocks(sbi, group_no, desc,
+	group_alloc = group_reserve_blocks(sb, sbi, group_no, desc,
 					gdp_bh, es_alloc);
 	if (group_alloc) {
 		ret_block = ((goal - le32_to_cpu(es->s_first_data_block)) %
@@ -389,8 +396,10 @@ int ext2_new_block(struct inode *inode, 
 		
 		ext2_debug("goal is at %d:%d.\n", group_no, ret_block);
 
+		lock_super(sb);
 		ret_block = grab_block(sb_bgl_lock(sbi, group_no),
 				bitmap_bh->b_data, group_size, ret_block);
+		unlock_super(sb);
 		if (ret_block >= 0)
 			goto got_block;
 		group_release_blocks(sb, group_no, desc, gdp_bh, group_alloc);
@@ -413,7 +422,7 @@ retry:
 		desc = ext2_get_group_desc(sb, group_no, &gdp_bh);
 		if (!desc)
 			goto io_error;
-		group_alloc = group_reserve_blocks(sbi, group_no, desc,
+		group_alloc = group_reserve_blocks(sb, sbi, group_no, desc,
 						gdp_bh, es_alloc);
 	}
 	if (!group_alloc) {
@@ -425,8 +434,10 @@ retry:
 	if (!bitmap_bh)
 		goto io_error;
 
+	lock_super(sb);
 	ret_block = grab_block(sb_bgl_lock(sbi, group_no), bitmap_bh->b_data,
 				group_size, 0);
+	unlock_super(sb);
 	if (ret_block < 0) {
 		/*
 		 * If a free block counter is corrupted we can loop inifintely.
@@ -485,12 +496,14 @@ got_block:
 	if (group_alloc && !*prealloc_count) {
 		unsigned n;
 
+		lock_super(sb);
 		for (n = 0; n < group_alloc && ++ret_block < group_size; n++) {
 			if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group_no),
 						ret_block,
 						(void*) bitmap_bh->b_data))
  				break;
 		}
+		unlock_super(sb);
 		*prealloc_block = block + 1;
 		*prealloc_count = n;
 		es_alloc -= n;
Index: linux/fs/ext2/ialloc.c
===================================================================
--- linux.orig/fs/ext2/ialloc.c
+++ linux/fs/ext2/ialloc.c
@@ -75,6 +75,7 @@ static void ext2_release_inode(struct su
 		return;
 	}
 
+	lock_super(sb);
 	spin_lock(sb_bgl_lock(EXT2_SB(sb), group));
 	desc->bg_free_inodes_count =
 		cpu_to_le16(le16_to_cpu(desc->bg_free_inodes_count) + 1);
@@ -82,6 +83,7 @@ static void ext2_release_inode(struct su
 		desc->bg_used_dirs_count =
 			cpu_to_le16(le16_to_cpu(desc->bg_used_dirs_count) - 1);
 	spin_unlock(sb_bgl_lock(EXT2_SB(sb), group));
+	unlock_super(sb);
 	if (dir)
 		percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter);
 	sb->s_dirt = 1;
@@ -148,12 +150,16 @@ void ext2_free_inode (struct inode * ino
 		goto error_return;
 
 	/* Ok, now we can actually update the inode bitmaps.. */
+	lock_super(sb);
 	if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
-				bit, (void *) bitmap_bh->b_data))
+				bit, (void *) bitmap_bh->b_data)) {
+		unlock_super(sb);
 		ext2_error (sb, "ext2_free_inode",
 			      "bit already cleared for inode %lu", ino);
-	else
+	} else {
+		unlock_super(sb);
 		ext2_release_inode(sb, block_group, is_directory);
+	}
 	mark_buffer_dirty(bitmap_bh);
 	if (sb->s_flags & MS_SYNCHRONOUS)
 		sync_dirty_buffer(bitmap_bh);
@@ -507,8 +513,10 @@ repeat_in_this_group:
 				group = 0;
 			continue;
 		}
+		lock_super(sb);
 		if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group),
 						ino, bitmap_bh->b_data)) {
+			unlock_super(sb);
 			/* we lost this inode */
 			if (++ino >= EXT2_INODES_PER_GROUP(sb)) {
 				/* this group is exhausted, try next group */
@@ -519,6 +527,7 @@ repeat_in_this_group:
 			/* try to find free inode in the same group */
 			goto repeat_in_this_group;
 		}
+		unlock_super(sb);
 		goto got;
 	}
 
@@ -547,6 +556,7 @@ got:
 	if (S_ISDIR(mode))
 		percpu_counter_inc(&sbi->s_dirs_counter);
 
+	lock_super(sb);
 	spin_lock(sb_bgl_lock(sbi, group));
 	gdp->bg_free_inodes_count =
                 cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
@@ -560,6 +570,7 @@ got:
 			sbi->s_debts[group]--;
 	}
 	spin_unlock(sb_bgl_lock(sbi, group));
+	unlock_super(sb);
 
 	sb->s_dirt = 1;
 	mark_buffer_dirty(bh2);

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-27 14:42                         ` Ingo Molnar
@ 2005-12-27 23:02                           ` Andrew Morton
  0 siblings, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2005-12-27 23:02 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
>
> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > > hm, can you see any easy way for me to test my bold assertion on ext3, 
> > > by somehow moving/hacking it back to semaphores?
> > 
> > Not really.  The problem was most apparent after the lock_kernel() 
> > removal patches.  The first thing a CPU hit when it entered the fs was 
> > previously lock_kernel().  That became lock_super() and performance 
> > went down the tubes.  From memory, the bad kernel was tip-of-tree as 
> > of Memorial Weekend 2003 ;)
> > 
> > I guess you could re-add all the lock_super()s as per 2.5.x's 
> > ext3/jbd, check that it sucks running SDET on 8-way then implement the 
> > lock_super()s via a mutex.
> 
> ok - does the patch below look roughly ok as a really bad (but 
> functional) hack to restore that old behavior, for ext2?
> 

Hard to tell ;) 2.5.20's ext2 had 7 lock_super()s whereas for some reason
this patch adds 12...

I don't recall whether ext2 suffered wild context switches as badly as ext3
did.  It becomes pretty obvious in testing.

The really bad workload was SDET, which isn't available to mortals.  So
some hunting might be neded to find a suitable alternative.  dbench would be
a good start I guess.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)
  2005-12-25 23:04                   ` Andrew Morton
  2005-12-25 23:22                     ` Ingo Molnar
@ 2005-12-26  0:33                     ` Pavel Machek
  2006-01-05 15:30                       ` Andi Kleen
  2005-12-26 15:29                     ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre
  2 siblings, 1 reply; 62+ messages in thread
From: Pavel Machek @ 2005-12-26  0:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Zippel, hch, alan, arjan, mingo, linux-kernel, torvalds,
	arjanv, nico, jes, zwane, oleg, dhowells, bcrl, rostedt, ak,
	rmk+lkml

Hi!

> Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> is _designed_ to hit that lock hard.  So if/when i_sem is converted to a
> mutex, I figure that the benefits to ARM in that workload will be about a
> 0.01% performance increase.  ie: about two hours' worth of Moore's law in a
> dopey microbenchmark.

:-) Expressing performance increases in Moore's hours seems like
neat trick. OTOH I do not think it is valid any more. Single-threaded
performance stopped increasing 2 years ago AFAICS. Plus people are
pushing Linux onto smaller machines, that were unavailable 2 years
ago.


							Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)
  2005-12-26  0:33                     ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek
@ 2006-01-05 15:30                       ` Andi Kleen
  2006-01-05 19:08                         ` Pavel Machek
  0 siblings, 1 reply; 62+ messages in thread
From: Andi Kleen @ 2006-01-05 15:30 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

On Monday 26 December 2005 01:33, Pavel Machek wrote:

[cc list from hell trimmed down]

> > Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> > does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> > is _designed_ to hit that lock hard.  So if/when i_sem is converted to a
> > mutex, I figure that the benefits to ARM in that workload will be about a
> > 0.01% performance increase.  ie: about two hours' worth of Moore's law in a
> > dopey microbenchmark.

Moore's law actually doesn't say anything about performance increases,
just about the number of transistors available.
> 
> :-) Expressing performance increases in Moore's hours seems like
> neat trick. OTOH I do not think it is valid any more. Single-threaded
> performance stopped increasing 2 years ago AFAICS.

It's not true. e.g. a 2.6 Ghz FX-57 is significantly faster than the
top end CPU you could get 2 years ago. And I'm sure this years CPUs
will be still faster than last years.

> Plus people are 
> pushing Linux onto smaller machines, that were unavailable 2 years
> ago.

Even smaller systems are still getting faster.

-Andi

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: Moore's law (was Re: [patch 0/9] mutex subsystem, -V4)
  2006-01-05 15:30                       ` Andi Kleen
@ 2006-01-05 19:08                         ` Pavel Machek
  0 siblings, 0 replies; 62+ messages in thread
From: Pavel Machek @ 2006-01-05 19:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Thu 05-01-06 16:30:59, Andi Kleen wrote:
> On Monday 26 December 2005 01:33, Pavel Machek wrote:
> 
> [cc list from hell trimmed down]

> > :-) Expressing performance increases in Moore's hours seems like
> > neat trick. OTOH I do not think it is valid any more. Single-threaded
> > performance stopped increasing 2 years ago AFAICS.
> 
> It's not true. e.g. a 2.6 Ghz FX-57 is significantly faster than the
> top end CPU you could get 2 years ago. And I'm sure this years CPUs
> will be still faster than last years.

Well, but it is no longer 2x as fast...

> > Plus people are 
> > pushing Linux onto smaller machines, that were unavailable 2 years
> > ago.
> 
> Even smaller systems are still getting faster.

Yep. OTOH this is the first time when *average* linux box is getting
*slower* -- mostly due to massive ammount of cheap, linux-based
wifi access points on the market.

(Average linux desktop box is still gettting faster, though) 
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-25 23:04                   ` Andrew Morton
  2005-12-25 23:22                     ` Ingo Molnar
  2005-12-26  0:33                     ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek
@ 2005-12-26 15:29                     ` Nicolas Pitre
  2 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-26 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Roman Zippel, hch, alan, arjan, mingo, lkml, torvalds, arjanv,
	jes, zwane, oleg, dhowells, bcrl, rostedt, ak, rmk+lkml

On Sun, 25 Dec 2005, Andrew Morton wrote:

> Ingo has told me offline that he thinks that we can indeed remove the
> current semaphore implementation.
> 
> 90% of existing semaphore users get migrated to mutexes.
> 
> 9% of current semaphore users get migrated to completions.
> 
> The remaining 1% of semaphore users are using the counting feature.  We
> reimplement that in a mostly-arch-independent fashion and then remove the
> current semaphore implementation.  Note that there are no sequencing
> dependencies in all the above.

Great!  I'm glad you're amenable to such changes.  Especially the 
"remove the current semaphore implementation" part many people are 
complaining about for all sorts of reasons.

> Another example: Ingo's VFS stresstest which is hitting i_sem hard: it only
> does ~8000 ops/sec on an 8-way, and it's an artificial microbenchmark which
> is _designed_ to hit that lock hard.  So if/when i_sem is converted to a
> mutex, I figure that the benefits to ARM in that workload will be about a
> 0.01% performance increase.  ie: about two hours' worth of Moore's law in a
> dopey microbenchmark.

There is not only the question of cycles.  There is also kernel text 
size.  And _that_ is significant.  You could argue that only adding a 
function call for every mutex lock/unlock cannot be made smaller.  
However on ARM a function call is quite a costly operation with frame 
pointers, (and frame pointers are on by default on ARM otherwise the 
kernel is undebuggable).  So given that current semaphore fast path is 
inlined to save on the function call overhead, and given that on ARM the 
transition from semaphores to mutexes will shrink that from 8 
instruction down to 3 instructions, or even 2 instructions in some 
cases, will not only save cycles, but a bunch of kernel .text bytes as 
well without compromize on the cycle count.  This is therefore an all 
win situation for ARM.

> Also, there's very little point in adding lots of tricky arch-dependent
> code and generally mucking up the kernel source to squeeze the last drop of
> performance out of the sleeping lock fastpath.  Because if those changes
> actually make a difference, we've already lost - the code needs to be
> changed to use spinlocking.

Consider my text size argument above which is something I believe you 
value more.

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:44         ` Andrew Morton
  2005-12-22 14:11           ` Alan Cox
@ 2005-12-22 15:46           ` Thomas Gleixner
  2005-12-22 17:40             ` Linus Torvalds
  2005-12-22 17:17           ` Christoph Hellwig
  2 siblings, 1 reply; 62+ messages in thread
From: Thomas Gleixner @ 2005-12-22 15:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak,
	rmk+lkml

On Thu, 2005-12-22 at 05:44 -0800, Andrew Morton wrote:

> There's no need for two sets of behaviour.  What I'm saying is that we
> could add similar debugging features to semaphores (if useful).  Yes, we
> would have to tell the kernel at the source level to defeat that debugging
> if a particular lock isn't being used as a mutex.  That's rather less
> intrusive than adding a whole new type of lock.  But I'd question the value
> even of doing that, given the general historical non-bugginess of existing
> semaphore users.

Semaphores need a counter, mutexes only a binary representation of the
locked/unlocked state, which makes it possible to use instructions like
xchg, swap, test_and_set_bit depending on what atomic operations are
available on the architecture. On many architectures this is more
efficient than the counter based implementation.

Also the wakeup rules allow smaller and faster implementations for
mutexes.

	tglx



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 15:46           ` Thomas Gleixner
@ 2005-12-22 17:40             ` Linus Torvalds
  2005-12-22 20:09               ` Steven Rostedt
  0 siblings, 1 reply; 62+ messages in thread
From: Linus Torvalds @ 2005-12-22 17:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Arjan van de Ven, mingo, linux-kernel, arjanv,
	nico, jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak,
	rmk+lkml

On Thu, 22 Dec 2005, Thomas Gleixner wrote:
> 
> Semaphores need a counter, mutexes only a binary representation of the
> locked/unlocked state

Actually, that's not true.

A _spinlock_ only needs a binary representation of the locked/unlocked 
state.

A mutex needs a _ternary_ representation. It needs an additional 
"contention" state to tell the wakeup that extra action is needed.

If you don't handle contention (and do extra action all the time), you're 
screwed from a performance standpoint.

			Linus

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 17:40             ` Linus Torvalds
@ 2005-12-22 20:09               ` Steven Rostedt
  0 siblings, 0 replies; 62+ messages in thread
From: Steven Rostedt @ 2005-12-22 20:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Andrew Morton, Arjan van de Ven, mingo,
	linux-kernel, arjanv, nico, jes, zwane, oleg, dhowells, alan,
	bcrl, hch, ak, rmk+lkml


On Thu, 22 Dec 2005, Linus Torvalds wrote:
>
>
> On Thu, 22 Dec 2005, Thomas Gleixner wrote:
> >
> > Semaphores need a counter, mutexes only a binary representation of the
> > locked/unlocked state
>
> Actually, that's not true.
>
> A _spinlock_ only needs a binary representation of the locked/unlocked
> state.
>
> A mutex needs a _ternary_ representation. It needs an additional
> "contention" state to tell the wakeup that extra action is needed.

True, and that's exactly what Ingo has.

1 unlocked, 0 locked, -1 locked with waiters.  But it still works well
with xchg.

-- Steve

>
> If you don't handle contention (and do extra action all the time), you're
> screwed from a performance standpoint.
>
> 			Linus
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:44         ` Andrew Morton
  2005-12-22 14:11           ` Alan Cox
  2005-12-22 15:46           ` Thomas Gleixner
@ 2005-12-22 17:17           ` Christoph Hellwig
  2 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2005-12-22 17:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Arjan van de Ven, mingo, linux-kernel, torvalds, arjanv, nico,
	jes, zwane, oleg, dhowells, alan, bcrl, rostedt, hch, ak,
	rmk+lkml

On Thu, Dec 22, 2005 at 05:44:13AM -0800, Andrew Morton wrote:
> > You keep saying 10 times "so please enhance semaphores to do this".
> 
> Well I think diligence requires that we be able to demonstrate why it's not
> possible.
> 
> It's plain dumb for us to justify a fancy-pants new system based on
> features which we could have added to the old one, no?

But why should we add features to the semaphores.  There's very little
users of those real semaphore semantics, and they could do with a generic,
all-C implementation because they are not important fast-pathes.  OTOH
we have lots of places needing plain mutex semantics, that are important
fastpathes.  Let's optimize for the common case instead of the corner
case.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:07     ` Andrew Morton
  2005-12-22 13:23       ` Arjan van de Ven
@ 2005-12-22 15:19       ` Nicolas Pitre
  2005-12-22 21:43       ` Paul Mackerras
  2 siblings, 0 replies; 62+ messages in thread
From: Nicolas Pitre @ 2005-12-22 15:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, jes, zwane, oleg,
	dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml

On Thu, 22 Dec 2005, Andrew Morton wrote:

> I'd prefer to see mutexes compared with semaphores after you've put as much
> work into improving semaphores as you have into developing mutexes.

There is a fundamental difference between semaphores and mutexes.  The 
semaphore semantics _require_ atomic increments/decrements where mutexes 
do not.  This makes a huge difference on ARM where 99% of all ARM 
processors out there can only perform atomic swaps which is sufficient 
for mutexes but insufficient for semaphores.  Therefore on ARM 
performing an atomic increment/decrement (the semaphore fast 
path) requires extra costly locking 
and .text space (23 cycles over 8 instructions) while the mutex fast 
path is about 2-3 instructions and 
needs 7-8 cycles.  I bet many other architectures are in the same camp.

Nicolas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [patch 0/9] mutex subsystem, -V4
  2005-12-22 13:07     ` Andrew Morton
  2005-12-22 13:23       ` Arjan van de Ven
  2005-12-22 15:19       ` Nicolas Pitre
@ 2005-12-22 21:43       ` Paul Mackerras
  2 siblings, 0 replies; 62+ messages in thread
From: Paul Mackerras @ 2005-12-22 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, linux-kernel, torvalds, arjanv, nico, jes, zwane,
	oleg, dhowells, alan, bcrl, rostedt, hch, ak, rmk+lkml

Andrew Morton writes:

> Ingo Molnar <mingo@elte.hu> wrote:
> >  - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes, 
> >    'struct mutex' is 16 bytes. A smaller structure size means less RAM 
> >    footprint, and better CPU-cache utilization.
> 
> Because of the .sleepers thing.  Perhaps a revamped semaphore wouldn't need
> thsat field?

Note that semaphores on 32-bit PPC are only 16 bytes already, since I
got rid of the sleepers field ages ago.  The fast path is just
atomic_dec/inc, and the slow path needs an atomic op that does

	x = max(x, 0) + inc

atomically and returns the old value of x.  That's easy to do with
lwarx/stwcx (just two more instructions than an atomic inc), and can
also be done quite straightforwardly with cmpxchg.  Alpha, mips, s390
and sparc64 also use this scheme.

In fact I would go so far as to say that I cannot see how it would be
possible to do a mutex any smaller or faster than a counting semaphore
on these architectures.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2006-01-05 19:11 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-22 11:41 [patch 0/9] mutex subsystem, -V4 Ingo Molnar
2005-12-22 11:53 ` Christoph Hellwig
2005-12-22 12:45   ` Ingo Molnar
2005-12-22 15:34   ` Nicolas Pitre
2005-12-22 15:40     ` Ingo Molnar
2005-12-22 16:32       ` Nicolas Pitre
2005-12-22 16:44         ` Ingo Molnar
2005-12-22 16:58           ` Russell King
2005-12-22 21:04             ` Ingo Molnar
2005-12-22 21:26               ` Russell King
2005-12-22 21:27               ` Nicolas Pitre
2005-12-22 21:37                 ` [patch 1/2] mutex subsystem: basic per arch fast path primitives Nicolas Pitre
2005-12-22 21:53                   ` Christoph Hellwig
2005-12-22 21:40                 ` [patch 2/2] mutex subsystem: use the per architecture fast path lock_unlock defines Nicolas Pitre
2005-12-22 21:54                 ` [patch 0/9] mutex subsystem, -V4 Ingo Molnar
2005-12-22 16:58           ` Nicolas Pitre
2005-12-22 17:20     ` Christoph Hellwig
2005-12-22 17:33     ` Steven Rostedt
2005-12-22 18:24       ` Nicolas Pitre
2005-12-22 11:54 ` Andrew Morton
2005-12-22 12:20   ` Ingo Molnar
2005-12-22 13:07     ` Andrew Morton
2005-12-22 13:23       ` Arjan van de Ven
2005-12-22 13:44         ` Andrew Morton
2005-12-22 14:11           ` Alan Cox
2005-12-22 23:30             ` Andrew Morton
2005-12-22 23:34               ` Christoph Hellwig
     [not found]                 ` <32801.10.10.10.28.1135295357.squirrel@linux1>
2005-12-22 23:49                   ` Sean
2005-12-22 23:53                     ` Randy.Dunlap
     [not found]                       ` <50572.10.10.10.28.1135296023.squirrel@linux1>
2005-12-23  0:00                         ` Sean
2005-12-23  0:00                       ` Steven Rostedt
     [not found]                 ` <20051222221311.2f6056ec.akpm@osdl.org>
2005-12-23 14:24                   ` Nicolas Pitre
2005-12-23 14:51                     ` Andrew Morton
2005-12-23 14:57                       ` Russell King
2005-12-23 15:04                         ` Xavier Bestel
2005-12-23 15:27                           ` Andrew Morton
2005-12-23 15:00                       ` Steven Rostedt
2006-01-03 17:54                       ` Abhijit Bhopatkar
2005-12-25 16:08                 ` Roman Zippel
2005-12-25 22:54                   ` Ingo Molnar
2005-12-26 21:49                     ` Roman Zippel
2005-12-25 23:04                   ` Andrew Morton
2005-12-25 23:22                     ` Ingo Molnar
2005-12-26 10:35                       ` Andrew Morton
2005-12-26 10:42                         ` Arjan van de Ven
2005-12-26 11:11                           ` Andrew Morton
2005-12-26 17:15                             ` Mike Galbraith
2005-12-26 17:44                               ` Lee Revell
2005-12-27  0:32                             ` David Lang
2005-12-26 18:15                           ` Linus Torvalds
2005-12-27 14:42                         ` Ingo Molnar
2005-12-27 23:02                           ` Andrew Morton
2005-12-26  0:33                     ` Moore's law (was Re: [patch 0/9] mutex subsystem, -V4) Pavel Machek
2006-01-05 15:30                       ` Andi Kleen
2006-01-05 19:08                         ` Pavel Machek
2005-12-26 15:29                     ` [patch 0/9] mutex subsystem, -V4 Nicolas Pitre
2005-12-22 15:46           ` Thomas Gleixner
2005-12-22 17:40             ` Linus Torvalds
2005-12-22 20:09               ` Steven Rostedt
2005-12-22 17:17           ` Christoph Hellwig
2005-12-22 15:19       ` Nicolas Pitre
2005-12-22 21:43       ` Paul Mackerras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox