All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
@ 2005-12-19  1:35 Ingo Molnar
  2005-12-19 17:49 ` Zwane Mwaikambo
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2005-12-19  1:35 UTC (permalink / raw)
  To: linux-kernel, Linus Torvalds, Andrew Morton
  Cc: Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson


add two new atomic ops to x86_64: atomic_dec_call_if_negative() and 
atomic_inc_call_if_nonpositive(), which are conditional-call-if atomic 
operations. Needed by the new mutex code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

----

 include/asm-x86_64/atomic.h |   56 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 56 insertions(+)

Index: linux/include/asm-x86_64/atomic.h
===================================================================
--- linux.orig/include/asm-x86_64/atomic.h
+++ linux/include/asm-x86_64/atomic.h
@@ -203,6 +203,62 @@ static __inline__ int atomic_sub_return(
 #define atomic_inc_return(v)  (atomic_add_return(1,v))
 #define atomic_dec_return(v)  (atomic_sub_return(1,v))
 
+/**
+ * atomic_dec_call_if_negative - decrement and call function if negative
+ * @v: pointer of type atomic_t
+ * @fn: function to call if the result is negative
+ *
+ * Atomically decrements @v and calls a function if the result is negative.
+ * NOTE: the function is type-checked and must be a fastcall.
+ */
+#define atomic_dec_call_if_negative(v, fn_name)				\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = fn_name;			\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "decl (%%rdi)\n"  					\
+		"js 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#fn_name"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:							\
+		:"D" (v)						\
+		:"memory");						\
+} while (0)
+
+/**
+ * atomic_inc_call_if_nonpositive - increment and call function if nonpositive
+ * @v: pointer of type atomic_t
+ * @fn: function to call if the result is nonpositive
+ *
+ * Atomically increments @v and calls a function if the result is nonpositive.
+ * NOTE: the function is type-checked and must be a fastcall.
+ */
+#define atomic_inc_call_if_nonpositive(v, fn_name)			\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = fn_name;			\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "incl (%%rdi)\n"  					\
+		"jle 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#fn_name"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:							\
+		:"D" (v)						\
+		:"memory");						\
+} while (0)
+
 /* An 64bit atomic type */
 
 typedef struct { volatile long counter; } atomic64_t;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-19  1:35 [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch Ingo Molnar
@ 2005-12-19 17:49 ` Zwane Mwaikambo
  2005-12-19 20:58   ` David Woodhouse
  2005-12-20  4:29   ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Zwane Mwaikambo @ 2005-12-19 17:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

On Mon, 19 Dec 2005, Ingo Molnar wrote:

> +#define atomic_dec_call_if_negative(v, fn_name)				\
> +do {									\
> +	fastcall void (*__tmp)(atomic_t *) = fn_name;			\
> +									\
> +	(void)__tmp;							\
> +	typecheck(atomic_t *, v);					\
> +									\
> +	__asm__ __volatile__(						\
> +		LOCK "decl (%%rdi)\n"  					\
> +		"js 2f\n"						\
> +		"1:\n"							\
> +		LOCK_SECTION_START("")					\
> +		"2: call "#fn_name"\n\t"				\
> +		"jmp 1b\n"						\
> +		LOCK_SECTION_END					\
> +		:							\
> +		:"D" (v)						\
> +		:"memory");						\
> +} while (0)

Hi Ingo,
	Doesn't this corrupt caller saved registers?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-19 17:49 ` Zwane Mwaikambo
@ 2005-12-19 20:58   ` David Woodhouse
  2005-12-20  4:31     ` Ingo Molnar
  2005-12-20  4:29   ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2005-12-19 20:58 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson

On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> Hi Ingo,
>         Doesn't this corrupt caller saved registers?

Looks like it. I _really_ don't like calling functions from inline asm.
It's not nice. Can't we use atomic_dec_return() for this?

-- 
dwmw2


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-19 17:49 ` Zwane Mwaikambo
  2005-12-19 20:58   ` David Woodhouse
@ 2005-12-20  4:29   ` Ingo Molnar
  1 sibling, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2005-12-20  4:29 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson


* Zwane Mwaikambo <zwane@arm.linux.org.uk> wrote:

> On Mon, 19 Dec 2005, Ingo Molnar wrote:
> 
> > +#define atomic_dec_call_if_negative(v, fn_name)				\
> > +do {									\
> > +	fastcall void (*__tmp)(atomic_t *) = fn_name;			\
> > +									\
> > +	(void)__tmp;							\
> > +	typecheck(atomic_t *, v);					\
> > +									\
> > +	__asm__ __volatile__(						\
> > +		LOCK "decl (%%rdi)\n"  					\
> > +		"js 2f\n"						\
> > +		"1:\n"							\
> > +		LOCK_SECTION_START("")					\
> > +		"2: call "#fn_name"\n\t"				\
> > +		"jmp 1b\n"						\
> > +		LOCK_SECTION_END					\
> > +		:							\
> > +		:"D" (v)						\
> > +		:"memory");						\
> > +} while (0)
> 
> Hi Ingo,
> 	Doesn't this corrupt caller saved registers?

good catch - i correctly marked them clobbered on i386, but not on 
x86_64. The function using this code uses nothing else, so this bug 
caused no real corruption - but that was just pure luck.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-19 20:58   ` David Woodhouse
@ 2005-12-20  4:31     ` Ingo Molnar
  2005-12-20  6:30       ` Nicolas Pitre
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2005-12-20  4:31 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Zwane Mwaikambo, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson


* David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > Hi Ingo,
> >         Doesn't this corrupt caller saved registers?
> 
> Looks like it. I _really_ don't like calling functions from inline 
> asm. It's not nice. Can't we use atomic_dec_return() for this?

we can use atomic_dec_return(), but that will add one more instruction 
to the fastpath. OTOH, atomic_dec_return() is available on every 
architecture, so it's a really tempting thing. I'll experiment with it.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20  4:31     ` Ingo Molnar
@ 2005-12-20  6:30       ` Nicolas Pitre
  2005-12-20  8:12         ` Nick Piggin
  2005-12-21  4:16         ` Nicolas Pitre
  0 siblings, 2 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20  6:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Woodhouse, Zwane Mwaikambo, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, Andi Kleen, David Howells, Alexander Viro,
	Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Ingo Molnar wrote:

> 
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > Hi Ingo,
> > >         Doesn't this corrupt caller saved registers?
> > 
> > Looks like it. I _really_ don't like calling functions from inline 
> > asm. It's not nice. Can't we use atomic_dec_return() for this?
> 
> we can use atomic_dec_return(), but that will add one more instruction 
> to the fastpath. OTOH, atomic_dec_return() is available on every 
> architecture, so it's a really tempting thing. I'll experiment with it.

Please consider using (a variant of) xchg() instead.  Although 
atomic_dec() is available on all architectures, its implementation is 
far from being the most efficient thing to do for them all.  For 
example, see my discussion about swp on ARM:

	http://www.ussg.iu.edu/hypermail/linux/kernel/0512.2/0727.html

What would be the most efficient implementation on ARM might look like:

static inline void mutex_lock(struct mutex *m)
{
        if (unlikely(atomic_xchg(&m->count, 0) != 1))
                __mutex_lock_nonatomic(m);
}

static inline void mutex_unlock(struct mutex *m)
{
	if (unlikely(atomic_xchg(&m->count, 1) == -1))
		__mutex_unlock_nonatomic(m);
}

Yet we might want to use special wrappers with non-standard calling 
convention for getting to the contention handlers 
(__mutex_lock_nonatomic() and __mutex_unlock_nonatomic() in this case).

Furthermore trying to make atomic_inc_call_if_nonpositive() looks like a 
generic thing is rather counter productive.  It is likely to be useful 
to the mutex code only anyway, so why not make it implicit what the 
contention handlers are?  This will allow for each architectures to 
implement the mutex interface with the best fast path they can come 
with.

I'd propose this:

First, rename some functions to make it clearer what they are used for:

	s/__mutex_lock_nonatomic/__mutex_lock_contended/
	s/__mutex_unlock_nonatomic/__mutex_unlock_contended/

Next, please make it possible for architecture specific implementation 
of mutex_lock(), mutex_unlock(), mutex_trylock(), and so on to exist.  
A default implementation in include/asm-generic/mutex.h should probably 
exist and the two examples above is certainly a good start.

mutex_lock should have the following definition: it should try to lock 
the mutex (set the count to 0, or any value < 1).  If the count was 1 
prior setting it to 0 then the lock is successful and we're done.  
Otherwise it should call __mutex_lock_contended.  Knowing the previous 
value and setting the new value must be done atomically. Whether it uses 
atomic_xchg() or atomic_dec_return(), or even open code some clever 
assembly trick to achieve that like your i386 
atomic_dec_call_if_negative() implementation is the architecture's own 
business.  I can imagine the ARM implementation which inlined fast path 
would be between 3 and 4 instructions only.  But that's possible only if 
the implementation allows any flexibility in achieving the above.

Similarly, mutex_unlock should set the mutex count to 1.  It also should 
call __mutex_unlock_contended if the count wasn't 0 before.  And again 
the atomic nature of the count (which might be renamed to "state" since 
"count" is misleading for a mutex) must be preserved of course.  Again 
the architecture is free to implement it in whatever way as long as it 
has this behavior.

The main header file for mutex users would be include/linux/mutex.h.  
If mutex debugging is enabled, then architecture fast path is ignored so 
mutex(lock() would simply become an alias for __mutex_lock_contended() 
or whatever wrapper you have for that purpose.  If debugging is disabled 
only then is asm/mutex.h included from linux/mutex.h to get the 
architecture fast path code (which is possibly including 
asm-generic/mutex.h with reasonnable reference/default implementations).

What do you think?


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20  6:30       ` Nicolas Pitre
@ 2005-12-20  8:12         ` Nick Piggin
  2005-12-20 14:10           ` Nicolas Pitre
  2005-12-21  4:16         ` Nicolas Pitre
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2005-12-20  8:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Ingo Molnar wrote:
> 
> 
>>* David Woodhouse <dwmw2@infradead.org> wrote:
>>
>>
>>>On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
>>>
>>>>Hi Ingo,
>>>>        Doesn't this corrupt caller saved registers?
>>>
>>>Looks like it. I _really_ don't like calling functions from inline 
>>>asm. It's not nice. Can't we use atomic_dec_return() for this?
>>
>>we can use atomic_dec_return(), but that will add one more instruction 
>>to the fastpath. OTOH, atomic_dec_return() is available on every 
>>architecture, so it's a really tempting thing. I'll experiment with it.
> 
> 
> Please consider using (a variant of) xchg() instead.  Although 
> atomic_dec() is available on all architectures, its implementation is 
> far from being the most efficient thing to do for them all.  For 
> example, see my discussion about swp on ARM:
> 

Considering that on UP, the arm should not need to disable interrupts
for this function (or has someone refuted Linus?), how about:

#ifndef CONFIG_SMP
typedef struct { volatile int counter; } mutex_counter_t;

static inline int mutex_counter_dec_return(mutex_counter *v)
{
     return --v->counter;
}

...
#else
#define mutex_counter_t atomic_t
...
#endif

Or does that get too bulky or have other problems?

MP ARMs should have an adequate atomic_dec_return.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20  8:12         ` Nick Piggin
@ 2005-12-20 14:10           ` Nicolas Pitre
  2005-12-20 14:12             ` Nick Piggin
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 14:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Ingo Molnar wrote:
> > 
> > 
> > > * David Woodhouse <dwmw2@infradead.org> wrote:
> > > 
> > > 
> > > > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > > 
> > > > > Hi Ingo,
> > > > >        Doesn't this corrupt caller saved registers?
> > > > 
> > > > Looks like it. I _really_ don't like calling functions from inline asm.
> > > > It's not nice. Can't we use atomic_dec_return() for this?
> > > 
> > > we can use atomic_dec_return(), but that will add one more instruction to
> > > the fastpath. OTOH, atomic_dec_return() is available on every
> > > architecture, so it's a really tempting thing. I'll experiment with it.
> > 
> > 
> > Please consider using (a variant of) xchg() instead.  Although atomic_dec()
> > is available on all architectures, its implementation is far from being the
> > most efficient thing to do for them all.  For example, see my discussion
> > about swp on ARM:
> > 
> 
> Considering that on UP, the arm should not need to disable interrupts
> for this function (or has someone refuted Linus?), how about:

Kernel preemption.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:10           ` Nicolas Pitre
@ 2005-12-20 14:12             ` Nick Piggin
  2005-12-20 14:35               ` Nicolas Pitre
                                 ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Nick Piggin @ 2005-12-20 14:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Nick Piggin wrote:

>>Considering that on UP, the arm should not need to disable interrupts
>>for this function (or has someone refuted Linus?), how about:
> 
> 
> Kernel preemption.
> 

preempt_disable() ?

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:12             ` Nick Piggin
@ 2005-12-20 14:35               ` Nicolas Pitre
  2005-12-20 15:05                 ` Nick Piggin
  2005-12-20 18:27                 ` Linus Torvalds
  2005-12-20 18:25               ` Linus Torvalds
  2005-12-21  6:20               ` Ingo Molnar
  2 siblings, 2 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 14:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Nick Piggin wrote:
> 
> > > Considering that on UP, the arm should not need to disable interrupts
> > > for this function (or has someone refuted Linus?), how about:
> > 
> > 
> > Kernel preemption.
> > 
> 
> preempt_disable() ?

Sure, and we're now more costly than the current implementation with irq 
disabling.

If we go with simple mutexes that's because there is a gain, even a huge 
one on ARM, especially for the fast uncontended case.  If you guys 
insist on making things so generic and rigid then there is no gain 
anymore worth the bother.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:35               ` Nicolas Pitre
@ 2005-12-20 15:05                 ` Nick Piggin
  2005-12-20 16:35                   ` Nicolas Pitre
  2005-12-20 18:27                 ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2005-12-20 15:05 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

Nicolas Pitre wrote:
> On Wed, 21 Dec 2005, Nick Piggin wrote:
> 
> 
>>Nicolas Pitre wrote:
>>
>>>On Tue, 20 Dec 2005, Nick Piggin wrote:
>>
>>>>Considering that on UP, the arm should not need to disable interrupts
>>>>for this function (or has someone refuted Linus?), how about:
>>>
>>>
>>>Kernel preemption.
>>>
>>
>>preempt_disable() ?
> 
> 
> Sure, and we're now more costly than the current implementation with irq 
> disabling.
> 

Why? It is just a plain increment of a location that will almost certainly
be in cache. I can't see how it would be more than half the cost of the
irq implementation (based on looking at your measurements). How do you
figure?

Also, preempt_disable is a very frequent operation on preempt kernels so
if you have CONFIG_PREEMPT then you don't care about preempt_disable in
down() (and if you do then you are calling down too often).

> If we go with simple mutexes that's because there is a gain, even a huge 
> one on ARM, especially for the fast uncontended case.  If you guys 
> insist on making things so generic and rigid then there is no gain 
> anymore worth the bother.
> 

I guess there is no bother for you, but maintaining code for 1 generic
platform versus two dozen architectures is a huge win for many.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 15:05                 ` Nick Piggin
@ 2005-12-20 16:35                   ` Nicolas Pitre
  2005-12-20 19:20                     ` Russell King
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 16:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Wed, 21 Dec 2005, Nick Piggin wrote:
> > 
> > 
> > > Nicolas Pitre wrote:
> > > 
> > > > On Tue, 20 Dec 2005, Nick Piggin wrote:
> > > 
> > > > > Considering that on UP, the arm should not need to disable interrupts
> > > > > for this function (or has someone refuted Linus?), how about:
> > > > 
> > > > 
> > > > Kernel preemption.
> > > > 
> > > 
> > > preempt_disable() ?
> > 
> > 
> > Sure, and we're now more costly than the current implementation with irq
> > disabling.
> > 
> 
> Why? It is just a plain increment of a location that will almost certainly
> be in cache. I can't see how it would be more than half the cost of the
> irq implementation (based on looking at your measurements). How do you
> figure?

ARM is a load/store architecture.  So the preempt_count has to be 
loaded, incremented, and stored back (3 instructions just there).  Then 
the preempt_count itself isn't straight there, it is reached with the 
thread_info pointer which requires 2 additional instructions to 
generate.  So we're already up to 5 instructions while the disabling of 
interrupts takes 3.

OK now we want to decrement the semaphore count.  One load, one 
decrement, one store, i.e. 3 more instructions.  We're up to 8.

Oh and we're still supposed to be in a fast path.  And we even aren't 
ready to decide if we acquired the semaphore just yet.

Now we have to call preempt_enable().  Given that preempt_count() will 
use 2 instructions again to get to the thread_info pointer, let's say 
for the demonstration that we instead open coded our own 
preempt_enable() and preempt_disable() so to be able to cache the 
thread_info pointer amongst those two.  So let's forget about those 2 
additional instructions.

Let's also suppose that we preserved the original preempt count so that 
we don't have to re-load and decrement it which would have used 2 other 
additional instructions.  Good!  But register pressure is increasing, 
and btw we're using completely non generic kernel infrastructure at this 
point.

We nevertheless still have to store the original preempt count back.  
One instruction i.e. we're up to 9.

And with 9 instructions we're already worse than the implementation with 
interrupt disabling.  Maybe not in terms of cycles but hey we're not 
done yet!

Any preempt_enable() equivalent look at the TIF_NEED_RESCHED flag 
(luckily we still have the thread_info pointer handy since we're not 
using test_thread_flag() but loading the flag directly i.e. one 
instruction), testing it (one instruction), conditionally branching to 
preempt_schedule (one instruction).  Now up to 12 instructions.  Amd to 
make things even worse: since the flag has to be tested right after it 
has been loaded you must account for result delays (more cycles) for the 
load.

OK, did we acquire that damn semaphore or not?  Oh since we clobbered 
the processor's condition flag while testing the TIF_NEED_RESCHED flag 
we are now forced to test that semaphore count again to see if it went 
negative: one instruction.  And finally branch to the contention routine 
if the semaphore was already locked: one instruction.

So 14 instructions total with preemption disabling, and that's with the 
best implementation possible by open coding everything instead of 
relying on generic functions/macros.

Compare that to my 4 (or 3 when gcc can cse a constant) 
instructions needed for the mutex equivalent.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:12             ` Nick Piggin
  2005-12-20 14:35               ` Nicolas Pitre
@ 2005-12-20 18:25               ` Linus Torvalds
  2005-12-21  6:20               ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2005-12-20 18:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nicolas Pitre, Ingo Molnar, David Woodhouse, Zwane Mwaikambo,
	lkml, Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, Andi Kleen, David Howells, Alexander Viro,
	Oleg Nesterov, Paul Jackson



On Wed, 21 Dec 2005, Nick Piggin wrote:

> Nicolas Pitre wrote:
> > On Tue, 20 Dec 2005, Nick Piggin wrote:
> 
> > > Considering that on UP, the arm should not need to disable interrupts
> > > for this function (or has someone refuted Linus?), how about:
> > 
> > 
> > Kernel preemption.
> > 
> 
> preempt_disable() ?

Yes. It should almost always be faster than physically disabling 
interrupts anyway. Even if it's more instructions.

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:35               ` Nicolas Pitre
  2005-12-20 15:05                 ` Nick Piggin
@ 2005-12-20 18:27                 ` Linus Torvalds
  2005-12-20 19:34                   ` Russell King
  2005-12-20 19:49                   ` Nicolas Pitre
  1 sibling, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2005-12-20 18:27 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Nick Piggin, Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, Andi Kleen, David Howells, Alexander Viro,
	Oleg Nesterov, Paul Jackson



On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> Sure, and we're now more costly than the current implementation with irq 
> disabling.

Do the timing. It may be more instructions, but I think it was you 
yourself that timed the current thing at 23 cycles, no?

Special registers are almost always slower than nicely cached accesses 
(and they all basically will be).

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 16:35                   ` Nicolas Pitre
@ 2005-12-20 19:20                     ` Russell King
  2005-12-20 19:32                       ` Arjan van de Ven
                                         ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Russell King @ 2005-12-20 19:20 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Nick Piggin, Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> So 14 instructions total with preemption disabling, and that's with the 
> best implementation possible by open coding everything instead of 
> relying on generic functions/macros.

I agree with your analysis Nicolas.

However, don't forget to compare this with our existing semaphore
implementation which is 9 instructions, or 8 for the SMP version.

In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
than our semaphores.

Forcing architectures down the "lets make everything generic" path
does not always hack it.  It can't do by its very nature.  Generic
implementations are *always* sub-optimal and it is pretty clear
that any gain that mutexes _may_ give is completely wasted on ARM
by the overhead of having a generic framework imposed upon us.

So, to sum up, if this path is persued, mutexes will be a bloody
disaster on ARM.  We'd be far better off sticking to semaphores
and ignoring this mutex stuff.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:20                     ` Russell King
@ 2005-12-20 19:32                       ` Arjan van de Ven
  2005-12-20 19:37                         ` Russell King
  2005-12-20 19:59                         ` Nicolas Pitre
  2005-12-20 19:43                       ` Steven Rostedt
  2005-12-20 19:55                       ` Nicolas Pitre
  2 siblings, 2 replies; 34+ messages in thread
From: Arjan van de Ven @ 2005-12-20 19:32 UTC (permalink / raw)
  To: Russell King
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson


> So, to sum up, if this path is persued, mutexes will be a bloody
> disaster on ARM.  We'd be far better off sticking to semaphores
> and ignoring this mutex stuff.

on x86 the fastpath is the same for both basically.. is there a
fundamental reason it can't be for ARM?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 18:27                 ` Linus Torvalds
@ 2005-12-20 19:34                   ` Russell King
  2005-12-20 20:10                     ` Linus Torvalds
  2005-12-20 19:49                   ` Nicolas Pitre
  1 sibling, 1 reply; 34+ messages in thread
From: Russell King @ 2005-12-20 19:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, Dec 20, 2005 at 10:27:12AM -0800, Linus Torvalds wrote:
> On Tue, 20 Dec 2005, Nicolas Pitre wrote:
> >
> > Sure, and we're now more costly than the current implementation with irq 
> > disabling.
> 
> Do the timing. It may be more instructions, but I think it was you 
> yourself that timed the current thing at 23 cycles, no?

That's PXA, which is Intel.  Most other ARM CPUs are far faster
than that at IRQ disable.  Typically you're looking at 6 cycles
to disable + 3 cycles to re-enable.

However, Nico's analysis of 14 instructions vs 9 instructions
pretty much paints the picture - those 14 instructions for the
preempt_disable _will_ be more heavy weight than Nico's idea.

Also, Nico has an alternative idea for mutexes which does not
involve decrementing or incrementing - it's an atomic swap.
That works out at about the same cycle count on non-Intel ARM
CPUs as the present semaphore path.  I'm willing to bet that
it will be faster than the present semaphore path on Intel ARM
CPUs.

Here's the cycle counts for ARM926, assuming hot caches and the
failure path not taken for the existing semaphore code:

mrs     ip, cpsr		2
orr     lr, ip, #128		1
msr     cpsr_c, lr		3
ldr     lr, [%0]		2
subs    lr, lr, %1		1
str     lr, [%0]		1
msr     cpsr_c, ip		3
movmi   ip, %0			1
blmi    failure			1
			total:	15 cycles

Here's nico's version (with a couple of fixes to ensure we don't
schedule if preempt count is non-zero):

        mov     r0, sp, lsr #13				1
        mov     r0, r0, lsl #13				1
        ldr     r1, [r0, #PREEMPT_COUNT]		2
        add     r2, r1, #1				1
        str     r2, [r0, #PREEMPT_COUNT]		1
        ldr     r4, [r3]				2
        sub     r4, r4, #1				1
        str     r4, [r3]				1
        str     r1, [r0, #PREEMPT_COUNT]		1
        teq     r1, #0					1
        bne     no_preempt_check			1
        ldr     r1, [r0, #FLAGS]			2
        tst     r1, #TIF_NEED_RESCHED			1
        blne    schedule				1
no_preempt_check:
        cmp     r4, #0					1
        blmi    failed					1
						total:	19 cycles

That's a 26% increase in the cost of a mutex implemented this way
over a plain semaphore.

Hence, mutexes implemented this way will be _more_ costly.
Significantly so.  Enough to make them worthless.

I think the approach needs completely rethinking.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:32                       ` Arjan van de Ven
@ 2005-12-20 19:37                         ` Russell King
  2005-12-20 19:59                         ` Nicolas Pitre
  1 sibling, 0 replies; 34+ messages in thread
From: Russell King @ 2005-12-20 19:37 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson

On Tue, Dec 20, 2005 at 08:32:35PM +0100, Arjan van de Ven wrote:
> 
> > So, to sum up, if this path is persued, mutexes will be a bloody
> > disaster on ARM.  We'd be far better off sticking to semaphores
> > and ignoring this mutex stuff.
> 
> on x86 the fastpath is the same for both basically.. is there a
> fundamental reason it can't be for ARM?

If we're talking about implementing mutexes as a semaphore, then they
will be identical.  But what's the point of that?  It's a semaphore
which is just called a mutex.

So you might as well just save the patch noise and do nothing instead.
It seems to me that you'll get _exactly_ the same result with a lot
less effort.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:20                     ` Russell King
  2005-12-20 19:32                       ` Arjan van de Ven
@ 2005-12-20 19:43                       ` Steven Rostedt
  2005-12-20 19:57                         ` Russell King
  2005-12-20 20:35                         ` Horst von Brand
  2005-12-20 19:55                       ` Nicolas Pitre
  2 siblings, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2005-12-20 19:43 UTC (permalink / raw)
  To: Russell King
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson


On Tue, 20 Dec 2005, Russell King wrote:

> On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> > So 14 instructions total with preemption disabling, and that's with the
> > best implementation possible by open coding everything instead of
> > relying on generic functions/macros.
>
> I agree with your analysis Nicolas.
>
> However, don't forget to compare this with our existing semaphore
> implementation which is 9 instructions, or 8 for the SMP version.
>
> In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
> than our semaphores.
>
> Forcing architectures down the "lets make everything generic" path
> does not always hack it.  It can't do by its very nature.  Generic
> implementations are *always* sub-optimal and it is pretty clear
> that any gain that mutexes _may_ give is completely wasted on ARM
> by the overhead of having a generic framework imposed upon us.
>
> So, to sum up, if this path is persued, mutexes will be a bloody
> disaster on ARM.  We'd be far better off sticking to semaphores
> and ignoring this mutex stuff.
>

So what's wrong with having the generic code, and for those with a fast
semapore add an arch specific?

#define mutex_lock down
#define mutex_unlock up
#define mutex_trylock(x) (!down_trylock(x))

Until the mutex code is updated to a fast arch specific implementation.

Let me restate, that the generic code should not be this, but each arch
can have this if they already went through great lengths in making a fast
semaphore.

Heck put the above defines in the generic code, with a define

linux/mutex.h:

#ifdef HAVE_ARCH_MUTEX
#include <asm/mutex.h>
#else

#ifdef HAVE_FAST_SEMAPHORE

#define <defines here>

#else

generic code here

#endif /* HAVE_FAST_SEMAPHORE */

#endif /* HAVE_ARCH_MUTEX */

-- Steve


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 18:27                 ` Linus Torvalds
  2005-12-20 19:34                   ` Russell King
@ 2005-12-20 19:49                   ` Nicolas Pitre
  1 sibling, 0 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, Andi Kleen, David Howells, Alexander Viro,
	Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Linus Torvalds wrote:

> 
> 
> On Tue, 20 Dec 2005, Nicolas Pitre wrote:
> >
> > Sure, and we're now more costly than the current implementation with irq 
> > disabling.
> 
> Do the timing. It may be more instructions, but I think it was you 
> yourself that timed the current thing at 23 cycles, no?

Yes.  And with my hand-optimized assembly version using 
(the equivalent of) preempt_disable/preempt_enable I get 20 cycles over 
14 instructions.  If I let gcc do it using the standard macros it gets 
worse.

The irq disabling implementation spends 23 cycles over 8 instructions.

The mutex swp-based implementation spends 8 cycles over 4 instructions.

If we make the preempt_disable/enable implementation out of line that'll 
reduce the .text size bloat, but it'LL add more cycles due to the 
additional two branches plus preserve of the return address making it 
above 23 cycles for sure.

> Special registers are almost always slower than nicely cached accesses 
> (and they all basically will be).

Sure, but do the numbers above tell you anything else?  It seems pretty 
obvious to me that simple mutexes are a real gain.  Suffice for the 
implementation to let architecture do their best to lock the mutex and 
fall back to generic code when there is contention (just like we already 
do for semaphores).


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:20                     ` Russell King
  2005-12-20 19:32                       ` Arjan van de Ven
  2005-12-20 19:43                       ` Steven Rostedt
@ 2005-12-20 19:55                       ` Nicolas Pitre
  2 siblings, 0 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 19:55 UTC (permalink / raw)
  To: Russell King
  Cc: Nick Piggin, Ingo Molnar, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Russell King wrote:

> On Tue, Dec 20, 2005 at 11:35:22AM -0500, Nicolas Pitre wrote:
> > So 14 instructions total with preemption disabling, and that's with the 
> > best implementation possible by open coding everything instead of 
> > relying on generic functions/macros.
> 
> I agree with your analysis Nicolas.
> 
> However, don't forget to compare this with our existing semaphore
> implementation which is 9 instructions, or 8 for the SMP version.
> 
> In total, this means that mutexes will be _FAR MORE EXPENSIVE_ on ARM
> than our semaphores.

I don't follow you at all.  I'm arguing that mutexes are much less 
expensive than any semaphore implementation (except on SMP where 
semaphores and mutexes will probably look the same).

> Forcing architectures down the "lets make everything generic" path
> does not always hack it.  It can't do by its very nature.  Generic
> implementations are *always* sub-optimal and it is pretty clear
> that any gain that mutexes _may_ give is completely wasted on ARM
> by the overhead of having a generic framework imposed upon us.

Agreed.  And that's why I'm suggesting that the mutex locking/unlocking 
fast path should be architecture specific.  And to that effect I'm 
working on a patch against Ingo's mutex code to illustrate my point.

What should still remain generic is the contention fallback.  That's 
where the actual complexity lives and that part doesn't have to be 
optimized for best inline performances.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:43                       ` Steven Rostedt
@ 2005-12-20 19:57                         ` Russell King
  2005-12-20 20:35                         ` Horst von Brand
  1 sibling, 0 replies; 34+ messages in thread
From: Russell King @ 2005-12-20 19:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, Dec 20, 2005 at 02:43:30PM -0500, Steven Rostedt wrote:
> So what's wrong with having the generic code, and for those with a fast
> semapore add an arch specific?
> 
> #define mutex_lock down
> #define mutex_unlock up
> #define mutex_trylock(x) (!down_trylock(x))
> 
> Until the mutex code is updated to a fast arch specific implementation.
> 
> Let me restate, that the generic code should not be this, but each arch
> can have this if they already went through great lengths in making a fast
> semaphore.

I have no problem with this since we can then use Nico's swp-based
implementation.  Great!  What seems to be happening though is that
there's a move to make these operations be generic across all
architectures.

What both Nico and myself have demonstrated is that if architectures
are placed into the generic strait-jacket, any alleged performance
benefit of mutexes is completely swamped, which in turn makes the
whole mutex idea entirely pointless.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:32                       ` Arjan van de Ven
  2005-12-20 19:37                         ` Russell King
@ 2005-12-20 19:59                         ` Nicolas Pitre
  1 sibling, 0 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 19:59 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Russell King, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson

On Tue, 20 Dec 2005, Arjan van de Ven wrote:

> on x86 the fastpath is the same for both basically.. is there a
> fundamental reason it can't be for ARM?

ARM prior ARM architecture level 6 (which means about 99% of all ARM 
processors in the field currently) cannot do an atomic 
decrement/increment.  It must be done manually with interrupt disabled.

The only truly atomic instruction it has is a swap which lays itself 
pretty well for mutex support.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:34                   ` Russell King
@ 2005-12-20 20:10                     ` Linus Torvalds
  2005-12-20 21:18                       ` Nicolas Pitre
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2005-12-20 20:10 UTC (permalink / raw)
  To: Russell King
  Cc: Nicolas Pitre, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson



On Tue, 20 Dec 2005, Russell King wrote:
> 
> Also, Nico has an alternative idea for mutexes which does not
> involve decrementing or incrementing - it's an atomic swap.
> That works out at about the same cycle count on non-Intel ARM
> CPUs as the present semaphore path.  I'm willing to bet that
> it will be faster than the present semaphore path on Intel ARM
> CPUs.

Don't be so sure, especially not in the future.

An atomic "swap" operation is, from a CPU design standpoint, fundamentally 
more expensive that a "load + store".

Now, most ARM architectures don't notice this, because they are all 
in-order, and not SMP-aware anyway. No suble memory ordering, no nothing. 
Which is the only case when "swap" basically becomes a cheap "load+store".

What I'm trying to say is that a plain "load + store" is almost always 
going to be the best option in the long run.

It's also almost certainly always the best option for UP + non-preempt, 
for both present and future CPU's. The reason is simply that a 
microarchitecture will _always_ be optimized for that case, since it's 
pretty much by definition the common situation.

Is preemption even the common case on ARM? I'd assume not. Why are people 
so interested in the preemption case? IOW, why don't you just do

	ldr  lr,[%0]
	subs lr, lr, %1
	str  lr,[%0]
	blmi failure

as the _base_ timings, since that should be the common case. That's the 
drop-dead fastest UP case.

There's an additional advantage of the regular load/store case: if some 
CPU has scheduling issues, you can actually write this out as C code (with 
an extra empty ASM to make sure that the compiler doesn't move anything 
out of the critical region). Again, that probably doesn't matter on most 
ARM chips, but in the general case it sure does matter.

(Btw, inlining _any_ of these except perhaps the above trivial case, is 
probably wrong. None of the ARM chips tend to have caches all that big, I 
bet).

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 19:43                       ` Steven Rostedt
  2005-12-20 19:57                         ` Russell King
@ 2005-12-20 20:35                         ` Horst von Brand
  1 sibling, 0 replies; 34+ messages in thread
From: Horst von Brand @ 2005-12-20 20:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Russell King, Nicolas Pitre, Nick Piggin, Ingo Molnar,
	David Woodhouse, Zwane Mwaikambo, lkml, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov,
	Paul Jackson

Steven Rostedt <rostedt@goodmis.org> wrote:

[...]

> Let me restate, that the generic code should not be this, but each arch
> can have this if they already went through great lengths in making a fast
> semaphore.
> 
> Heck put the above defines in the generic code, with a define
> 
> linux/mutex.h:
> 
> #ifdef HAVE_ARCH_MUTEX
> #include <asm/mutex.h>
> #else
> 
> #ifdef HAVE_FAST_SEMAPHORE
> 
> #define <defines here>
> 
> #else
> 
> generic code here

Anything to go here could/should very well be in the above arch-specific
file. Saves you a #define ;-)

> #endif /* HAVE_FAST_SEMAPHORE */
> #endif /* HAVE_ARCH_MUTEX */
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 20:10                     ` Linus Torvalds
@ 2005-12-20 21:18                       ` Nicolas Pitre
  2005-12-20 22:04                         ` Linus Torvalds
  2005-12-21  2:12                         ` Nick Piggin
  0 siblings, 2 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Linus Torvalds wrote:

> 
> 
> On Tue, 20 Dec 2005, Russell King wrote:
> > 
> > Also, Nico has an alternative idea for mutexes which does not
> > involve decrementing or incrementing - it's an atomic swap.
> > That works out at about the same cycle count on non-Intel ARM
> > CPUs as the present semaphore path.  I'm willing to bet that
> > it will be faster than the present semaphore path on Intel ARM
> > CPUs.
> 
> Don't be so sure, especially not in the future.

I think we agree.  But we still live in the present.

> An atomic "swap" operation is, from a CPU design standpoint, fundamentally 
> more expensive that a "load + store".
> 
> Now, most ARM architectures don't notice this, because they are all 
> in-order, and not SMP-aware anyway. No suble memory ordering, no nothing. 
> Which is the only case when "swap" basically becomes a cheap "load+store".

Indeed.

> What I'm trying to say is that a plain "load + store" is almost always 
> going to be the best option in the long run.
> 
> It's also almost certainly always the best option for UP + non-preempt, 
> for both present and future CPU's. The reason is simply that a 
> microarchitecture will _always_ be optimized for that case, since it's 
> pretty much by definition the common situation.

Which is why each architecture must always have the option of providing 
its own fast path implementation according to a number of factors that 
cannot be made into a generic layer.  But this is the same issue whether 
we talk about semaphores or mutexes.

> Is preemption even the common case on ARM? I'd assume not. Why are people 
> so interested in the preemption case?

Because embedded people want it.  ARM is also about the only 
architecture besides x86 that currently sees most of the RT work for the 
same reason.  And yes preemption does make a difference.

> IOW, why don't you just do
> 
> 	ldr  lr,[%0]
> 	subs lr, lr, %1
> 	str  lr,[%0]
> 	blmi failure
> 
> as the _base_ timings, since that should be the common case. That's the 
> drop-dead fastest UP case.

The above is 5 cycles.  About the same as the preemption-safe swp-based 
mutex implementation on non-Intel ARM.  It is broken wrt interrupts when 
the swp is not.  It doesn't work with preemption while the swp 
implementation is potentially smaller with cse and it works in all 
cases.

I mean...... what is it with mutexes that you dislike to the point of 
bending backward that far, and even after seeing the numbers, with such 
a semaphore implementation that _I_ even wouldn't trust people to use 
correctly?  Yes we should use complete() from interrupt handlers but I 
can bet you that a lot of people are still using up() there, and with 
the current up() implementation it just _works_ at least on ARM.

> (Btw, inlining _any_ of these except perhaps the above trivial case, is 
> probably wrong. None of the ARM chips tend to have caches all that big, I 
> bet).

On XScale you should add 2 cycles to branch to the out of line code and 
2 other cycles to branch back.

The ARM mutex implementation can save that and have an extremely small 
inlined footprint already.

Again what do you dislike so much about mutexes?  It must not have much 
to do with technical issues at this point.  ;-)


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 21:18                       ` Nicolas Pitre
@ 2005-12-20 22:04                         ` Linus Torvalds
  2005-12-20 22:19                           ` Steven Rostedt
                                             ` (2 more replies)
  2005-12-21  2:12                         ` Nick Piggin
  1 sibling, 3 replies; 34+ messages in thread
From: Linus Torvalds @ 2005-12-20 22:04 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Russell King, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson



On Tue, 20 Dec 2005, Nicolas Pitre wrote:
> 
> I mean...... what is it with mutexes that you dislike to the point of 
> bending backward that far, and even after seeing the numbers, with such 
> a semaphore implementation that _I_ even wouldn't trust people to use 
> correctly?

Quite frankly, what has disgusted me about this mutex discussion is the 
totally specious arguments for the new mutexes that just rubs me entirely 
the wrong way.

If it had _started_ with a mutex implementation that was faster, simpler, 
and didn't rename the old and working semaphores, I'd have been perfectly 
fine with it.

As it is, the discussion has been pretty much everything but that. 

And then people who argue about single cycles, end up dismissing the 
single cycles when I argue that "ld+st" is faster - like you just did.

Be consistent, dammit. If single cycles matter, they matter. If they 
don't, then the existing code is better, since it's existing and works. 
You can't have it both ways.

In other words: if people didn't mix up issues that had nothing to do with 
this into it, I'd be happier. I've already said that a mutex that does 
_not_ replace semaphore (and doesn't mess with naming) is acceptable. 

We've done that before. But do it RIGHT, dammit. And don't mix existing 
semaphores into it (for example, completions didn't change any old users).

			Linus	

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 22:04                         ` Linus Torvalds
@ 2005-12-20 22:19                           ` Steven Rostedt
  2005-12-20 22:43                           ` Nicolas Pitre
  2005-12-21  6:00                           ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2005-12-20 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Russell King, Nick Piggin, Ingo Molnar,
	David Woodhouse, Zwane Mwaikambo, lkml, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson


On Tue, 20 Dec 2005, Linus Torvalds wrote:
>
> In other words: if people didn't mix up issues that had nothing to do with
> this into it, I'd be happier. I've already said that a mutex that does
> _not_ replace semaphore (and doesn't mess with naming) is acceptable.
>

At least I'm very happy with Linus' decision.  Can we now end this thread,
and move forward.  You may bring up your issues in what comes next.

Thank you,

-- Steve


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 22:04                         ` Linus Torvalds
  2005-12-20 22:19                           ` Steven Rostedt
@ 2005-12-20 22:43                           ` Nicolas Pitre
  2005-12-21  6:00                           ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-20 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Russell King, Nick Piggin, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

On Tue, 20 Dec 2005, Linus Torvalds wrote:

> Quite frankly, what has disgusted me about this mutex discussion is the 
> totally specious arguments for the new mutexes that just rubs me entirely 
> the wrong way.
>
[...]
>
> In other words: if people didn't mix up issues that had nothing to do with 
> this into it, I'd be happier. I've already said that a mutex that does 
> _not_ replace semaphore (and doesn't mess with naming) is acceptable. 

Oh if it is so we are in _violent_ agreement then.  I don't dispute that 
at all and I pretty agree with a separate namespace for mutexes.  
Actually I think no one contested that either.

Current semaphores can be migrated to mutexes individually when that 
makes sense to do so, separately.

With regards to my _implementation_ concerns about the proposed mutex 
patches I guess I can discuss them with Ingo (and an actual patch is 
coming to fix them).


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 21:18                       ` Nicolas Pitre
  2005-12-20 22:04                         ` Linus Torvalds
@ 2005-12-21  2:12                         ` Nick Piggin
  1 sibling, 0 replies; 34+ messages in thread
From: Nick Piggin @ 2005-12-21  2:12 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Linus Torvalds, Russell King, Ingo Molnar, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson

Nicolas Pitre wrote:
> On Tue, 20 Dec 2005, Linus Torvalds wrote:

>>IOW, why don't you just do
>>
>>	ldr  lr,[%0]
>>	subs lr, lr, %1
>>	str  lr,[%0]
>>	blmi failure
>>
>>as the _base_ timings, since that should be the common case. That's the 
>>drop-dead fastest UP case.
> 
> 
> The above is 5 cycles.  About the same as the preemption-safe swp-based 
> mutex implementation on non-Intel ARM.  It is broken wrt interrupts when 
> the swp is not.

How it is broken WRT interrupts? (sorry, I haven't had it explained to me
in words of two syllables or less)

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20  6:30       ` Nicolas Pitre
  2005-12-20  8:12         ` Nick Piggin
@ 2005-12-21  4:16         ` Nicolas Pitre
  2005-12-21  6:26           ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2005-12-21  4:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Steven Rostedt, David Howells

On Tue, 20 Dec 2005, Nicolas Pitre wrote:

> On Tue, 20 Dec 2005, Ingo Molnar wrote:
> 
> > 
> > * David Woodhouse <dwmw2@infradead.org> wrote:
> > 
> > > On Mon, 2005-12-19 at 09:49 -0800, Zwane Mwaikambo wrote:
> > > > Hi Ingo,
> > > >         Doesn't this corrupt caller saved registers?
> > > 
> > > Looks like it. I _really_ don't like calling functions from inline 
> > > asm. It's not nice. Can't we use atomic_dec_return() for this?
> > 
> > we can use atomic_dec_return(), but that will add one more instruction 
> > to the fastpath. OTOH, atomic_dec_return() is available on every 
> > architecture, so it's a really tempting thing. I'll experiment with it.
> 
> Please consider using (a variant of) xchg() instead.  Although 
> atomic_dec() is available on all architectures, its implementation is 
> far from being the most efficient thing to do for them all.

Actually, the best thing to do is to let the architecture do what is the 
most efficient.  Please consider this patch that adds the flexibility 
for any architecture to use the optimal mechanism it has for atomic 
locking.  It also let it decide whether inlining the fast path is a good 
thing or not.  This patch should leave i386 unchanged.  It applies on 
top of your last posted mutex patch serie.

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -14,6 +14,8 @@
 #include <asm/atomic.h>
 #include <linux/list.h>
 #include <linux/spinlock_types.h>
+#include <linux/linkage.h>
+#include <asm/mutex.h>
 
 /*
  * Simple, straightforward mutexes with strict semantics:
@@ -93,12 +95,49 @@ extern void FASTCALL(__mutex_init(struct
 #define DEFINE_MUTEX(mutexname) \
 	struct mutex mutexname = __MUTEX_INITIALIZER(mutexname)
 
+/*
+ * We can speed up the lock-acquire, but only if the architecture
+ * supports it and if there's no debugging state
+ * to be set up (!DEBUG_MUTEXESS).
+ */
+#ifdef CONFIG_DEBUG_MUTEXESS
+#undef MUTEX_LOCKLESS_FASTPATH
+#endif
+
+/*
+ * We can inline the lock-acquire, but only if the architecture
+ * wants it, and if MUTEX_LOCKLESS_FASTPATH is active.
+ */
+#ifndef MUTEX_LOCKLESS_FASTPATH
+#undef  MUTEX_INLINE_FASTPATH
+#endif
+
+/* out of line contention handlers */
+extern void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));
+extern void FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
+
+#ifdef MUTEX_INLINE_FASTPATH
+
+#define mutex_lock(lock) \
+	__arch_fast_mutex_lock(&lock->count, __mutex_lock_noinline)
+#define mutex_unlock(lock) \
+	__arch_fast_mutex_unlock(&lock->count, __mutex_unlock_noinline)
+#define mutex_trylock(lock) \
+	__arch_fast_mutex_trylock(&lock->count)
+#define mutex_is_locked(lock) \
+	({ mb(); atomic_read(&lock->count) != 1; })
+
+#else
+
 extern void FASTCALL(mutex_lock(struct mutex *lock));
-extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
-extern int FASTCALL(mutex_trylock(struct mutex *lock));
 extern void FASTCALL(mutex_unlock(struct mutex *lock));
+extern int FASTCALL(mutex_trylock(struct mutex *lock));
 extern int FASTCALL(mutex_is_locked(struct mutex *lock));
 
+#endif
+
+extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
+
 /*
  * Debugging variant of mutexes. The only difference is that they accept
  * the semaphore APIs too:
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -21,20 +21,6 @@
 #include <linux/interrupt.h>
 
 /*
- * We can speed up the lock-acquire, if the architecture
- * supports cmpxchg and if there's no debugging state
- * to be set up (!DEBUG_MUTEXESS).
- *
- * trick: we can use cmpxchg on the release side too, if bit
- * 0 of lock->owner is set if there is at least a single pending
- * task in the wait_list. This way the release atomic-fastpath
- * can be a mirror image of the acquire path:
- */
-#if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_MUTEXESS)
-# define MUTEX_LOCKLESS_FASTPATH
-#endif
-
-/*
  * In the debug case we carry the caller's instruction pointer into
  * other functions, but we dont want the function argument overhead
  * in the nondebug case - hence these macros:
@@ -373,9 +359,7 @@ repeat:
 static int __mutex_trylock(struct mutex *lock __IP_DECL__)
 {
 #ifdef MUTEX_LOCKLESS_FASTPATH
-	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
-		return 1;
-	return 0;
+	return __arch_fast_mutex_trylock(&lock->count);
 #else
 	struct thread_info *ti = current_thread_info();
 	unsigned long flags;
@@ -397,19 +381,6 @@ static int __mutex_trylock(struct mutex 
 #endif
 }
 
-int fastcall mutex_is_locked(struct mutex *lock)
-{
-	mb();
-	return atomic_read(&lock->count) != 1;
-}
-
-EXPORT_SYMBOL_GPL(mutex_is_locked);
-
-int __sched fastcall mutex_trylock(struct mutex *lock)
-{
-	return __mutex_trylock(lock __CALLER_IP__);
-}
-
 /*
  * Release the lock:
  */
@@ -457,7 +428,6 @@ static inline void __mutex_unlock_nonato
  * We want the atomic op come first, to make sure the
  * branch is predicted as default-untaken:
  */
-static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count));
 
 /*
  * The locking fastpath is the 1->0 transition from
@@ -465,39 +435,41 @@ static __sched void FASTCALL(__mutex_loc
  */
 static inline void __mutex_lock_atomic(struct mutex *lock)
 {
-	atomic_dec_call_if_negative(&lock->count, __mutex_lock_noinline);
+	__arch_fast_mutex_lock(&lock->count, __mutex_lock_noinline);
 }
 
-static fastcall __sched void __mutex_lock_noinline(atomic_t *lock_count)
+fastcall __sched void __mutex_lock_noinline(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
 	__mutex_lock_nonatomic(lock);
 }
 
+EXPORT_SYMBOL_GPL(__mutex_lock_noinline);
+
 static inline void __mutex_lock(struct mutex *lock)
 {
 	__mutex_lock_atomic(lock);
 }
 
-static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count));
-
 /*
  * The unlocking fastpath is the 0->1 transition from
  * 'locked' into 'unlocked' state:
  */
 static inline void __mutex_unlock_atomic(struct mutex *lock)
 {
-	atomic_inc_call_if_nonpositive(&lock->count, __mutex_unlock_noinline);
+	__arch_fast_mutex_unlock(&lock->count, __mutex_unlock_noinline);
 }
 
-static fastcall void __sched __mutex_unlock_noinline(atomic_t *lock_count)
+fastcall void __sched __mutex_unlock_noinline(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
 	__mutex_unlock_nonatomic(lock);
 }
 
+EXPORT_SYMBOL_GPL(__mutex_unlock_noinline);
+
 static inline void __mutex_unlock(struct mutex *lock)
 {
 	__mutex_unlock_atomic(lock);
@@ -517,6 +489,8 @@ static inline void __mutex_unlock(struct
 
 #endif
 
+#ifndef MUTEX_INLINE_FASTPATH
+
 void __sched fastcall mutex_lock(struct mutex *lock)
 {
 	__mutex_lock(lock __CALLER_IP__);
@@ -532,6 +506,23 @@ void __sched fastcall mutex_unlock(struc
 
 EXPORT_SYMBOL_GPL(mutex_unlock);
 
+int __sched fastcall mutex_trylock(struct mutex *lock)
+{
+	return __mutex_trylock(lock __CALLER_IP__);
+}
+
+EXPORT_SYMBOL_GPL(mutex_trylock);
+
+int fastcall mutex_is_locked(struct mutex *lock)
+{
+	mb();
+	return atomic_read(&lock->count) != 1;
+}
+
+EXPORT_SYMBOL_GPL(mutex_is_locked);
+
+#endif
+
 int __sched fastcall mutex_lock_interruptible(struct mutex *lock)
 {
 	return __mutex_lock_interruptible(lock, 0 __CALLER_IP__);
Index: linux-2.6/include/asm-generic/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/mutex.h
@@ -0,0 +1,22 @@
+/*
+ *  linux/include/asm-generic/mutex.h
+ *
+ *  Reference implementation for lockless fast path mutex operations
+ */
+
+#define MUTEX_LOCKLESS_FASTPATH
+
+#define __arch_fast_mutex_lock(count, contention_fn)			\
+do {									\
+	if (atomic_xchg(count, 0) != 1)					\
+		contention_fn(count);					\
+} while (0)
+
+#define __arch_fast_mutex_unlock(count, contention_fn)			\
+do {									\
+	if (atomic_xchg(count, 1) != 0)					\
+		contention_fn(count);					\
+} while (0)
+
+#define __arch_fast_mutex_trylock(count)				\
+	(atomic_cmpxchg(count, 1, 0) == 1)
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-i386/mutex.h
@@ -0,0 +1,44 @@
+#define MUTEX_LOCKLESS_FASTPATH
+
+#define __arch_fast_mutex_lock(v, contention)				\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention;		\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "decl (%%eax)\n"  					\
+		"js 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:							\
+		:"a" (v)						\
+		:"memory","cx","dx");					\
+} while (0)
+
+#define __arch_fast_mutex_unlock(v, contention)				\
+do {									\
+	fastcall void (*__tmp)(atomic_t *) = contention;		\
+									\
+	(void)__tmp;							\
+	typecheck(atomic_t *, v);					\
+									\
+	__asm__ __volatile__(						\
+		LOCK "incl (%%eax)\n"  					\
+		"jle 2f\n"						\
+		"1:\n"							\
+		LOCK_SECTION_START("")					\
+		"2: call "#contention"\n\t"				\
+		"jmp 1b\n"						\
+		LOCK_SECTION_END					\
+		:							\
+		:"a" (v)						\
+		:"memory","cx","dx");					\
+} while (0)
+
+#define __arch_fast_mutex_trylock(v)					\
+	(atomic_cmpxchg(v, 1, 0) == 1)
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-arm/mutex.h
@@ -0,0 +1,9 @@
+#include <asm/system.h>
+
+#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
+
+#include <asm-generic/mutex.h>
+
+#ifndef swp_is_buggy
+#define MUTEX_INLINE_FASTPATH
+#endif


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 22:04                         ` Linus Torvalds
  2005-12-20 22:19                           ` Steven Rostedt
  2005-12-20 22:43                           ` Nicolas Pitre
@ 2005-12-21  6:00                           ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2005-12-21  6:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Russell King, Nick Piggin, David Woodhouse,
	Zwane Mwaikambo, lkml, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov, Paul Jackson


* Linus Torvalds <torvalds@osdl.org> wrote:

> If it had _started_ with a mutex implementation that was faster, 
> simpler, and didn't rename the old and working semaphores, I'd have 
> been perfectly fine with it.

oh, i'm totally OK with not doing the renames and leaving semaphores 
alone!

Just in case it wasnt clear: i very much expected that the migration 
helper patches would be controverial, and that they would probably not 
go upstream. [Christoph said last week that they were fit for an 
obfuscated C contest, not the kernel, and i didnt expect this sentiment 
to change overnight.] Look at my patch order:

 xfs-mutex-namespace-collision-fix.patch

 add-atomic-xchg-i386.patch
 add-atomic-xchg-x86_64.patch
 add-atomic-xchg-ia64.patch
 add-atomic-call-func-i386.patch
 add-atomic-call-func-x86_64.patch
 add-atomic-call-func-ia64.patch
 add-atomic-call-wrappers-all-other-arches.patch

 mutex-core.patch

 mutex-debug.patch
 mutex-debug-more.patch

 mutex-migration-helper-i386.patch    # not upstream from here
 mutex-migration-helper-x86_64.patch
 mutex-migration-helper-ia64.patch
 mutex-migration-helper-core.patch

 sx8-sem2completions.patch
 cpu5wdt-sem2completions.patch
 ide-gendev-sem-to-completion.patch
 loop-to-completions.patch

 arch-semaphores.patch

The first 11 patches are finegrained and contain _zero_ of the migration 
and rename stuff. I specifically created the patch-series in such a way, 
so that we could simply chop off the last few patches.

in the future i'll only send patches up to mutex-debug-more.patch, as 
suggested by Christoph and you as well. So there's really no 
controversy. Btw., that was true for my first queue already, as noticed 
by Christoph [*].
 
Basically all of the activity in the last 2 days was in the first 11 
patches. I'll send an updated queue later today.

	Ingo

[*] the migration helpers were incredibly useful for pulling this off.
    Without the wide exposure of mutexes to existing semaphore users i'd
    not have been able to profile the system, to measure the impact the
    effects of mutexes on performance. I'd also not have been able to
    say what percentage of semaphores could move over to mutexes. We 
    could also not have carried the mutex implementation in the -rt tree 
    for more than a year, in which year millions of lines were changed 
    in the upstream kernel! It would have been simply impossible to even 
    attempt this, without the type-sensitive APIs and the minimal 
    renames to categorize semaphores.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-20 14:12             ` Nick Piggin
  2005-12-20 14:35               ` Nicolas Pitre
  2005-12-20 18:25               ` Linus Torvalds
@ 2005-12-21  6:20               ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2005-12-21  6:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Nicolas Pitre, David Woodhouse, Zwane Mwaikambo, lkml,
	Linus Torvalds, Andrew Morton, Arjan van de Ven, Steven Rostedt,
	Alan Cox, Christoph Hellwig, Andi Kleen, David Howells,
	Alexander Viro, Oleg Nesterov, Paul Jackson


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >>Considering that on UP, the arm should not need to disable interrupts
> >>for this function (or has someone refuted Linus?), how about:
> >
> >Kernel preemption.
> 
> preempt_disable() ?

please take a look at kernel/mutex.c, there's a define at the top of the 
file:

// #define MUTEX_IRQ_SAFE

which, if off, makes the mutex code use preempt_disable() and 
preempt_enable() to make it preemption-safe. If it's on, the mutex 
implementation uses IRQ flags.

in my current tree i've already eliminated this define, and have 
switched the code to use preempt_disable()/preempt_enable() exclusively, 
because preempt_*() is equally fast on all platforms, while IRQ disable 
costs vary largely. (and they are rarely faster than preempt_disable()).

my current tree also provides a mechanism for architectures to hand-code 
the mutex lock and unlock fastpath, if they choose to do so. So i think 
we can really stop the cycle counting.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch
  2005-12-21  4:16         ` Nicolas Pitre
@ 2005-12-21  6:26           ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2005-12-21  6:26 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Steven Rostedt, David Howells


* Nicolas Pitre <nico@cam.org> wrote:

> > Please consider using (a variant of) xchg() instead.  Although 
> > atomic_dec() is available on all architectures, its implementation is 
> > far from being the most efficient thing to do for them all.
> 
> Actually, the best thing to do is to let the architecture do what is 
> the most efficient. [...]

i have already added something similar. Furthermore, i have also 
eliminated the CMPXCHG requirement from MUTEX_FASTPATH, which should 
open ARM up to the generic lockless fastpath. (but you can still choose 
to reimplement it in the architecture) I'll send a new queue later 
today.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2005-12-21 11:59 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-12-19  1:35 [patch 04/15] Generic Mutex Subsystem, add-atomic-call-func-x86_64.patch Ingo Molnar
2005-12-19 17:49 ` Zwane Mwaikambo
2005-12-19 20:58   ` David Woodhouse
2005-12-20  4:31     ` Ingo Molnar
2005-12-20  6:30       ` Nicolas Pitre
2005-12-20  8:12         ` Nick Piggin
2005-12-20 14:10           ` Nicolas Pitre
2005-12-20 14:12             ` Nick Piggin
2005-12-20 14:35               ` Nicolas Pitre
2005-12-20 15:05                 ` Nick Piggin
2005-12-20 16:35                   ` Nicolas Pitre
2005-12-20 19:20                     ` Russell King
2005-12-20 19:32                       ` Arjan van de Ven
2005-12-20 19:37                         ` Russell King
2005-12-20 19:59                         ` Nicolas Pitre
2005-12-20 19:43                       ` Steven Rostedt
2005-12-20 19:57                         ` Russell King
2005-12-20 20:35                         ` Horst von Brand
2005-12-20 19:55                       ` Nicolas Pitre
2005-12-20 18:27                 ` Linus Torvalds
2005-12-20 19:34                   ` Russell King
2005-12-20 20:10                     ` Linus Torvalds
2005-12-20 21:18                       ` Nicolas Pitre
2005-12-20 22:04                         ` Linus Torvalds
2005-12-20 22:19                           ` Steven Rostedt
2005-12-20 22:43                           ` Nicolas Pitre
2005-12-21  6:00                           ` Ingo Molnar
2005-12-21  2:12                         ` Nick Piggin
2005-12-20 19:49                   ` Nicolas Pitre
2005-12-20 18:25               ` Linus Torvalds
2005-12-21  6:20               ` Ingo Molnar
2005-12-21  4:16         ` Nicolas Pitre
2005-12-21  6:26           ` Ingo Molnar
2005-12-20  4:29   ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.