public inbox for linux-kernel@vger.kernel.org
* [patch 00/11] mutex subsystem, -V7
@ 2005-12-23 16:16 Ingo Molnar
  2005-12-24  5:15 ` Nicolas Pitre
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-23 16:16 UTC (permalink / raw)
  To: lkml
  Cc: Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre,
	Jes Sorensen, Zwane Mwaikambo, Oleg Nesterov, David Howells,
	Alan Cox, Benjamin LaHaise, Steven Rostedt, Christoph Hellwig,
	Andi Kleen, Russell King

this is version -V7 of the generic mutex subsystem. It consists of the 
following 11 patches:

  add-atomic-xchg.patch
  mutex-generic-asm-implementations.patch
  mutex-asm-mutex.h-i386.patch
  mutex-asm-mutex.h-x86_64.patch
  mutex-asm-mutex.h-arm.patch
  mutex-arch-mutex-h.patch
  mutex-core.patch
  mutex-docs.patch
  mutex-debug.patch
  mutex-debug-more.patch
  xfs-mutex-namespace-collision-fix.patch

the patches are against Linus' latest GIT tree, and they should work 
fine on every Linux architecture.

Changes since -V6:

 33 files changed, 515 insertions(+), 360 deletions(-)

- added asm-arm/mutex.h ARM mutex fastpath implementation,
  by Nicolas Pitre.

- as per Linus' suggestion, split up the mutex debugging code and 
  prototypes into 4 separate files: kernel/mutex.c, kernel/mutex.h, 
  kernel/mutex-debug.c and kernel/mutex-debug.h, and made the debugging 
  code build as kernel/mutex-debug.o. As a result mutex.c got smaller 
  and easier to read. This also eliminated an #ifdef.

- added a new "NULL mutex fastpath implementation" via
  asm-generic/mutex-null.h, which is now used by the debugging code.  
  This got rid of two ugly #ifdef's in mutex.c, and removed code as 
  well.

- comment cleanups by Nicolas Pitre.

- more doc/comment additions and updates.

- lots of code and style cleanups in various mutex headers.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/11] mutex subsystem, -V7
  2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
@ 2005-12-24  5:15 ` Nicolas Pitre
  2005-12-24  5:23   ` Nicolas Pitre
  2005-12-26 19:24 ` Nicolas Pitre
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-24  5:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Fri, 23 Dec 2005, Ingo Molnar wrote:

> this is version -V7 of the generic mutex subsystem. It consists of the 
> following 11 patches:

Here's another one, allowing architecture-specific trylock 
implementations.  New generic xchg-based and ARMv6 implementations are 
provided, while everything else remains the same.

Signed-off-by: Nicolas Pitre <nico@cam.org>

Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()		0
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	int prev = atomic_xchg(count, 0);
+
+	if (unlikely(prev < 0)) {
+		/*
+		 * The lock was marked contended so we must restore that
+		 * state. If while doing so we get back a prev value of 1
+		 * then we just own it.
+		 *
+		 * In all cases this has the potential to trigger the
+		 * slowpath for the owner's unlock path - but this is not
+		 * a big problem in practice.
+		 */
+		prev = atomic_xchg(count, -1);
+		if (prev < 0)
+			prev = 0;
+	}
+
+	return prev;
+}
+
 #endif
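For reference, the xchg-based trylock semantics above can be exercised in plain user-space C, with C11 atomics standing in for the kernel's atomic_t and atomic_xchg (a sketch only, not kernel code):

```c
#include <stdatomic.h>
#include <assert.h>

/*
 * User-space sketch of the xchg-based __mutex_trylock above.
 * Count values: 1 = unlocked, 0 = locked, negative = locked with
 * waiters pending.
 */
static int sketch_trylock(atomic_int *count)
{
	int prev = atomic_exchange(count, 0);

	if (prev < 0) {
		/* The lock was marked contended: restore that state.
		 * If we get back 1 here the owner unlocked in between
		 * and we now own the mutex. */
		prev = atomic_exchange(count, -1);
		if (prev < 0)
			prev = 0;
	}

	return prev;
}
```

Note how a failed attempt on a contended mutex leaves the count at -1, satisfying the "must not leave it at 0" requirement from the kerneldoc, at the cost of possibly forcing the owner's unlock through the slowpath.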
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	/*
+	 * We have two variants here. The cmpxchg based one is the best one
+	 * because it never induces a false contention state.
+	 * If not available we fall back to the atomic_dec_return variant.
+	 *
+	 * The atomic_dec_return variant might end up making the counter
+	 * negative in the failure case, which may trigger the slowpath
+	 * for the owner's unlock path - but this is not a big problem
+	 * in practice.
+	 */
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+#else
+	if (likely(atomic_dec_return(count) == 0))
+		return 1;
+#endif
+	return 0;
+}
+
 #endif
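The two i386 variants above can likewise be sketched in user-space C (C11 atomics standing in for the kernel primitives; an illustration, not kernel code):

```c
#include <stdatomic.h>
#include <assert.h>

/*
 * cmpxchg variant: 1 -> 0 only if the count was exactly 1; a failed
 * attempt has no side effect on the counter.
 */
static int trylock_cmpxchg(atomic_int *count)
{
	int expected = 1;
	return atomic_compare_exchange_strong(count, &expected, 0);
}

/*
 * dec_return variant: always decrements, so a failed attempt can
 * drive the counter negative, forcing the owner's unlock through
 * the slowpath ("false contention").
 */
static int trylock_dec(atomic_int *count)
{
	return atomic_fetch_sub(count, 1) - 1 == 0;
}
```

The side-effect difference is exactly why the cmpxchg variant is preferred when the architecture provides it.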
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+	else
+		return 0;
+}
+
 #endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
 	return -EINTR;
 }
 
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- *   negative in the failure case, which may trigger the slowpath
- *   for the owner's unlock path - but this is not a big problem
- *   in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
 /***
  * mutex_trylock - try acquire the mutex, without waiting
  * @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
  */
 int fastcall mutex_trylock(struct mutex *lock)
 {
-#ifdef __HAVE_ARCH_CMPXCHG
-	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
-		return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+	return __mutex_trylock(&lock->count);
 #else
-	if (atomic_dec_return(&lock->count) == 0)
-		return 1;
-#endif
-	return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+	/*
+	 * In the debug case we take the spinlock and check whether we can
+	 * get the lock:
+	 */
 	struct thread_info *ti = current_thread_info();
 	int ret = 0;
 
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex 
 	spin_unlock_mutex(&lock->wait_lock);
 
 	return ret;
+#endif
 }
 
-#endif /* CONFIG_DEBUG_MUTEXES */
-
 /*
  * Release the lock, slowpath:
  */
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do {									\
  */
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+	int __ex_flag, __res;
+	__asm__ (
+	"ldrex	%0, [%2]\n\t"
+	"subs	%0, %0, #1\n\t"
+	"strexeq %1, %0, [%2]"
+	: "=&r" (__res), "=&r" (__ex_flag)
+	: "r" (&count->counter)
+	: "cc","memory" );
+	__res |= __ex_flag;
+	return __res == 0;
+}
+
 #endif
 #endif
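The "incomplete atomic decrement" semantics of the ldrex/strex sequence can be approximated in portable C (a sketch with C11 atomics, not the real ARM code):

```c
#include <stdatomic.h>
#include <assert.h>

/*
 * Portable approximation of the ARMv6 ldrex/strex trylock above.
 * Like the assembly it has two failure modes: if the decremented
 * value is not zero the store is never attempted, and the store
 * itself fails if another CPU modified the word in between (modelled
 * here with a compare-exchange; the real strex can additionally fail
 * spuriously, which is still a correct trylock failure).
 */
static int sketch_trylock_armv6(atomic_int *count)
{
	int old = atomic_load(count);		/* ldrex   */
	if (old - 1 != 0)			/* subs    */
		return 0;
	return atomic_compare_exchange_strong(count, &old, 0); /* strexeq */
}
```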


* Re: [patch 00/11] mutex subsystem, -V7
  2005-12-24  5:15 ` Nicolas Pitre
@ 2005-12-24  5:23   ` Nicolas Pitre
  0 siblings, 0 replies; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-24  5:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Sat, 24 Dec 2005, Nicolas Pitre wrote:

> Here's another one, allowing architecture-specific trylock 
> implementations.  New generic xchg-based and ARMv6 implementations are 
> provided, while everything else remains the same.

Sorry, the previous patch was missing one file (mutex-dec.h).

Signed-off-by: Nicolas Pitre <nico@cam.org>

Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()		0
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	int prev = atomic_xchg(count, 0);
+
+	if (unlikely(prev < 0)) {
+		/*
+		 * The lock was marked contended so we must restore that
+		 * state. If while doing so we get back a prev value of 1
+		 * then we just own it.
+		 *
+		 * In all cases this has the potential to trigger the
+		 * slowpath for the owner's unlock path - but this is not
+		 * a big problem in practice.
+		 */
+		prev = atomic_xchg(count, -1);
+		if (prev < 0)
+			prev = 0;
+	}
+
+	return prev;
+}
+
 #endif
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	/*
+	 * We have two variants here. The cmpxchg based one is the best one
+	 * because it never induces a false contention state.
+	 * If not available we fall back to the atomic_dec_return variant.
+	 *
+	 * The atomic_dec_return variant might end up making the counter
+	 * negative in the failure case, which may trigger the slowpath
+	 * for the owner's unlock path - but this is not a big problem
+	 * in practice.
+	 */
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+#else
+	if (likely(atomic_dec_return(count) == 0))
+		return 1;
+#endif
+	return 0;
+}
+
 #endif
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+	else
+		return 0;
+}
+
 #endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
 	return -EINTR;
 }
 
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- *   negative in the failure case, which may trigger the slowpath
- *   for the owner's unlock path - but this is not a big problem
- *   in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
 /***
  * mutex_trylock - try acquire the mutex, without waiting
  * @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
  */
 int fastcall mutex_trylock(struct mutex *lock)
 {
-#ifdef __HAVE_ARCH_CMPXCHG
-	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
-		return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+	return __mutex_trylock(&lock->count);
 #else
-	if (atomic_dec_return(&lock->count) == 0)
-		return 1;
-#endif
-	return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+	/*
+	 * In the debug case we take the spinlock and check whether we can
+	 * get the lock:
+	 */
 	struct thread_info *ti = current_thread_info();
 	int ret = 0;
 
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex 
 	spin_unlock_mutex(&lock->wait_lock);
 
 	return ret;
+#endif
 }
 
-#endif /* CONFIG_DEBUG_MUTEXES */
-
 /*
  * Release the lock, slowpath:
  */
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do {									\
  */
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+	int __ex_flag, __res;
+	__asm__ (
+	"ldrex	%0, [%2]\n\t"
+	"subs	%0, %0, #1\n\t"
+	"strexeq %1, %0, [%2]"
+	: "=&r" (__res), "=&r" (__ex_flag)
+	: "r" (&count->counter)
+	: "cc","memory" );
+	__res |= __ex_flag;
+	return __res == 0;
+}
+
 #endif
 #endif
Index: linux-2.6/include/asm-generic/mutex-dec.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-dec.h
+++ linux-2.6/include/asm-generic/mutex-dec.h
@@ -63,4 +63,40 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()		1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	/*
+	 * We have two variants here. The cmpxchg based one is the best one
+	 * because it never induces a false contention state.  It is included
+	 * here because architectures using the inc/dec algorithms over the
+	 * xchg ones are much more likely to support cmpxchg natively.
+	 * If not we fall back to the atomic_dec_return variant.
+	 *
+	 * The atomic_dec_return variant might end up making the counter
+	 * negative in the failure case, which may trigger the slowpath
+	 * for the owner's unlock path - but this is not a big problem
+	 * in practice.
+	 */
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+#else
+	if (likely(atomic_dec_return(count) == 0))
+		return 1;
+#endif
+	return 0;
+}
+
 #endif


* Re: [patch 00/11] mutex subsystem, -V7
  2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
  2005-12-24  5:15 ` Nicolas Pitre
@ 2005-12-26 19:24 ` Nicolas Pitre
  2005-12-26 19:25 ` [patch 1/3] mutex subsystem: trylock Nicolas Pitre
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-26 19:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King


Ingo,

Coming next are 3 patches to your mutex code that I'd like you to 
consider.  They cover my remaining requests for the mutex code.

The first patch is a resend of my trylock rework allowing for a pure 
xchg-based implementation.

The second one gives architectures the ability to inline the lock/unlock 
fastpaths.  As explained in another mail, this is almost always 
beneficial on ARM.

And the last patch makes mutex_is_locked() always inlined since I can't 
find a reason why it wouldn't be beneficial to do so, even on x86.

I hope we can agree on them so that I can go back to hiding behind the 
ARM-specific parts of the kernel again.  :-)


Nicolas


* [patch 1/3] mutex subsystem: trylock
  2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
  2005-12-24  5:15 ` Nicolas Pitre
  2005-12-26 19:24 ` Nicolas Pitre
@ 2005-12-26 19:25 ` Nicolas Pitre
  2005-12-27 11:51   ` Ingo Molnar
  2005-12-27 12:05   ` Arjan van de Ven
  2005-12-26 19:25 ` [patch 2/3] mutex subsystem: fastpath inlining Nicolas Pitre
  2005-12-26 19:26 ` [patch 3/3] mutex subsystem: inline mutex_is_locked() Nicolas Pitre
  4 siblings, 2 replies; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-26 19:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King


In the spirit of uniformity, this patch provides architecture-specific
trylock implementations.  This allows for a pure xchg-based implementation,
as well as an optimized ARMv6 one.

On i386 and x86_64 things remain the same.

Signed-off-by: Nicolas Pitre <nico@cam.org>

Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -69,4 +69,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()		0
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	int prev = atomic_xchg(count, 0);
+
+	if (unlikely(prev < 0)) {
+		/*
+		 * The lock was marked contended so we must restore that
+		 * state. If while doing so we get back a prev value of 1
+		 * then we just own it.
+		 *
+		 * In all cases this has the potential to trigger the
+		 * slowpath for the owner's unlock path - but this is not
+		 * a big problem in practice.
+		 */
+		prev = atomic_xchg(count, -1);
+		if (prev < 0)
+			prev = 0;
+	}
+
+	return prev;
+}
+
 #endif
Index: linux-2.6/include/asm-i386/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-i386/mutex.h
+++ linux-2.6/include/asm-i386/mutex.h
@@ -99,4 +99,38 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	/*
+	 * We have two variants here. The cmpxchg based one is the best one
+	 * because it never induces a false contention state.
+	 * If not available we fall back to the atomic_dec_return variant.
+	 *
+	 * The atomic_dec_return variant might end up making the counter
+	 * negative in the failure case, which may trigger the slowpath
+	 * for the owner's unlock path - but this is not a big problem
+	 * in practice.
+	 */
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+#else
+	if (likely(atomic_dec_return(count) == 0))
+		return 1;
+#endif
+	return 0;
+}
+
 #endif
Index: linux-2.6/include/asm-x86_64/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/mutex.h
+++ linux-2.6/include/asm-x86_64/mutex.h
@@ -71,4 +71,21 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to 0 and return 1 (success), or return 0 (failure)
+ * if it wasn't 1 originally.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+	else
+		return 0;
+}
+
 #endif
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -219,19 +219,6 @@ __mutex_lock_interruptible_nonatomic(str
 	return -EINTR;
 }
 
-/*
- * We have three mutex_trylock() variants. The cmpxchg based one is
- * the best one (because it has no side-effect on mutex_unlock()),
- * but cmpxchg is not available on every architecture, so we also
- * provide an atomic_dec_return based variant too. The debug variant
- * takes the internal lock.
- *
- * [ The atomic_dec_return variant might end up making the counter
- *   negative in the failure case, which may trigger the slowpath
- *   for the owner's unlock path - but this is not a big problem
- *   in practice. ]
- */
-#ifndef CONFIG_DEBUG_MUTEXES
 /***
  * mutex_trylock - try acquire the mutex, without waiting
  * @lock: the mutex to be acquired
@@ -248,24 +235,13 @@ __mutex_lock_interruptible_nonatomic(str
  */
 int fastcall mutex_trylock(struct mutex *lock)
 {
-#ifdef __HAVE_ARCH_CMPXCHG
-	if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
-		return 1;
+#ifndef CONFIG_DEBUG_MUTEXES
+	return __mutex_trylock(&lock->count);
 #else
-	if (atomic_dec_return(&lock->count) == 0)
-		return 1;
-#endif
-	return 0;
-}
-
-#else /* CONFIG_DEBUG_MUTEXES: */
-
-/*
- * In the debug case we take the spinlock and check whether we can
- * get the lock:
- */
-int fastcall mutex_trylock(struct mutex *lock)
-{
+	/*
+	 * In the debug case we take the spinlock and check whether we can
+	 * get the lock:
+	 */
 	struct thread_info *ti = current_thread_info();
 	int ret = 0;
 
@@ -280,10 +256,9 @@ int fastcall mutex_trylock(struct mutex 
 	spin_unlock_mutex(&lock->wait_lock);
 
 	return ret;
+#endif
 }
 
-#endif /* CONFIG_DEBUG_MUTEXES */
-
 /*
  * Release the lock, slowpath:
  */
Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -98,5 +98,31 @@ do {									\
  */
 #define __mutex_slowpath_needs_to_unlock()	1
 
+/*
+ * For __mutex_trylock we use another construct which could be described
+ * as an "incomplete atomic decrement" or a "single value cmpxchg" since
+ * it has two modes of failure:
+ *
+ * 1) if the exclusive store fails we fail, and
+ *
+ * 2) if the decremented value is not zero we don't even attempt the store.
+ *
+ * This provides the needed trylock semantics like cmpxchg would, but it is
+ * lighter and less generic than a true cmpxchg implementation.
+ */
+static inline int __mutex_trylock(atomic_t *count)
+{
+	int __ex_flag, __res;
+	__asm__ (
+	"ldrex	%0, [%2]\n\t"
+	"subs	%0, %0, #1\n\t"
+	"strexeq %1, %0, [%2]"
+	: "=&r" (__res), "=&r" (__ex_flag)
+	: "r" (&count->counter)
+	: "cc","memory" );
+	__res |= __ex_flag;
+	return __res == 0;
+}
+
 #endif
 #endif
Index: linux-2.6/include/asm-generic/mutex-dec.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-dec.h
+++ linux-2.6/include/asm-generic/mutex-dec.h
@@ -63,4 +63,40 @@ do {									\
 
 #define __mutex_slowpath_needs_to_unlock()		1
 
+/**
+ * __mutex_trylock - try to acquire the mutex, without waiting
+ *
+ *  @count: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ */
+static inline int
+__mutex_trylock(atomic_t *count)
+{
+	/*
+	 * We have two variants here. The cmpxchg based one is the best one
+	 * because it never induces a false contention state.  It is included
+	 * here because architectures using the inc/dec algorithms over the
+	 * xchg ones are much more likely to support cmpxchg natively.
+	 * If not we fall back to the atomic_dec_return variant.
+	 *
+	 * The atomic_dec_return variant might end up making the counter
+	 * negative in the failure case, which may trigger the slowpath
+	 * for the owner's unlock path - but this is not a big problem
+	 * in practice.
+	 */
+#ifdef __HAVE_ARCH_CMPXCHG
+	if (likely(atomic_cmpxchg(count, 1, 0) == 1))
+		return 1;
+#else
+	if (likely(atomic_dec_return(count) == 0))
+		return 1;
+#endif
+	return 0;
+}
+
 #endif


* [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
                   ` (2 preceding siblings ...)
  2005-12-26 19:25 ` [patch 1/3] mutex subsystem: trylock Nicolas Pitre
@ 2005-12-26 19:25 ` Nicolas Pitre
  2005-12-27 11:55   ` Ingo Molnar
  2005-12-26 19:26 ` [patch 3/3] mutex subsystem: inline mutex_is_locked() Nicolas Pitre
  4 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-26 19:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King


Some architectures, notably ARM, might benefit from inlining the mutex
fast paths.  This patch allows for that when debugging is off; when
debugging is enabled, inlining is unconditionally disabled.
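The idea can be sketched in plain C (an illustration of the pattern, not the kernel API; `sketch_*` names are made up for the example):

```c
#include <stdatomic.h>
#include <assert.h>

struct sketch_mutex { atomic_int count; };	/* 1 = unlocked, 0 = locked */

/*
 * Out-of-line slowpath, kept out of every call site; the kernel
 * version would put the caller to sleep instead of spinning.
 * (noinline is a GCC extension, as in the kernel itself.)
 */
static void __attribute__((noinline))
sketch_lock_slowpath(struct sketch_mutex *m)
{
	int expected;
	do {
		expected = 1;
	} while (!atomic_compare_exchange_weak(&m->count, &expected, 0));
}

/*
 * Inlined fastpath, as MUTEX_INLINE_FASTPATH would arrange: the
 * uncontended 1 -> 0 transition costs one atomic op at the call site,
 * and only the contended case pays for a function call.
 */
static inline void sketch_lock(struct sketch_mutex *m)
{
	int expected = 1;
	if (!atomic_compare_exchange_strong(&m->count, &expected, 0))
		sketch_lock_slowpath(m);
}

static inline void sketch_unlock(struct sketch_mutex *m)
{
	atomic_store(&m->count, 1);
}
```

Whether the per-call-site code-size cost is worth it is exactly the architecture-specific trade-off the patch leaves to MUTEX_INLINE_FASTPATH.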

Signed-off-by: Nicolas Pitre <nico@cam.org>

Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -8,6 +8,12 @@
 #ifndef _ASM_MUTEX_H
 #define _ASM_MUTEX_H
 
+/* On ARM it is best to inline the mutex fast paths, unless swp is broken. */
+#include <asm/system.h>
+#ifndef swp_is_buggy
+# define MUTEX_INLINE_FASTPATH
+#endif
+
 #if __LINUX_ARM_ARCH__ < 6
 /* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
 # include <asm-generic/mutex-xchg.h>
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -16,16 +16,10 @@
 #include <linux/spinlock.h>
 #include <linux/interrupt.h>
 
-/*
- * In the DEBUG case we are using the "NULL fastpath" for mutexes,
- * which forces all calls into the slowpath:
- */
 #ifdef CONFIG_DEBUG_MUTEXES
 # include "mutex-debug.h"
-# include <asm-generic/mutex-null.h>
 #else
 # include "mutex.h"
-# include <asm/mutex.h>
 #endif
 
 /***
@@ -219,6 +213,7 @@ __mutex_lock_interruptible_nonatomic(str
 	return -EINTR;
 }
 
+#ifndef MUTEX_INLINE_FASTPATH
 /***
  * mutex_trylock - try acquire the mutex, without waiting
  * @lock: the mutex to be acquired
@@ -258,6 +253,7 @@ int fastcall mutex_trylock(struct mutex 
 	return ret;
 #endif
 }
+#endif
 
 /*
  * Release the lock, slowpath:
@@ -293,8 +289,16 @@ static inline void __mutex_unlock_nonato
  * We want the atomic ops to come first in the kernel image, to make
  * sure the branch is predicted by the CPU as default-untaken:
  */
-static __sched void FASTCALL(__mutex_lock_noinline(atomic_t *lock_count
-						   __IP_DECL__));
+
+#ifndef MUTEX_INLINE_FASTPATH
+/* if we don't inline fast paths, then make those static */
+static void fastcall __sched
+__mutex_lock_noinline(atomic_t *lock_count __IP_DECL__);
+static int fastcall __sched
+__mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__);
+static void fastcall __sched
+__mutex_unlock_noinline(atomic_t *lock_count __IP_DECL__);
+#endif
 
 /*
  * The locking fastpath is the 1->0 transition from 'unlocked' into
@@ -305,7 +309,7 @@ static inline void __mutex_lock_atomic(s
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
 }
 
-static void fastcall __sched
+void fastcall __sched
 __mutex_lock_noinline(atomic_t *lock_count __IP_DECL__)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
@@ -318,16 +322,13 @@ static inline void __mutex_lock(struct m
 	__mutex_lock_atomic(lock __IP__);
 }
 
-static int fastcall __sched
-__mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__);
-
 static inline int __mutex_lock_interruptible(struct mutex *lock __IP_DECL__)
 {
 	return __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_noinline);
 }
 
-static int fastcall __sched
+int fastcall __sched
 __mutex_lock_interruptible_noinline(atomic_t *lock_count __IP_DECL__)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
@@ -335,9 +336,6 @@ __mutex_lock_interruptible_noinline(atom
 	return __mutex_lock_interruptible_nonatomic(lock __IP__);
 }
 
-static void __sched FASTCALL(__mutex_unlock_noinline(atomic_t *lock_count
-						     __IP_DECL__));
-
 /*
  * The unlocking fastpath is the 0->1 transition from 'locked' into
  * 'unlocked' state:
@@ -347,8 +345,8 @@ static inline void __mutex_unlock_atomic
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
 }
 
-static void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count
-						     __IP_DECL__)
+void fastcall __sched
+__mutex_unlock_noinline(atomic_t *lock_count __IP_DECL__)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
@@ -360,6 +358,8 @@ static inline void __mutex_unlock(struct
 	__mutex_unlock_atomic(lock __IP__);
 }
 
+#ifndef MUTEX_INLINE_FASTPATH
+
 /***
  * mutex_lock - acquire the mutex
  * @lock: the mutex to be acquired
@@ -422,6 +422,8 @@ EXPORT_SYMBOL(mutex_lock);
 EXPORT_SYMBOL(mutex_unlock);
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
+#endif
+
 /***
  * mutex_init - initialize the mutex
  * @lock: the mutex to be initialized
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -92,10 +92,53 @@ struct mutex_waiter {
 
 extern void FASTCALL(__mutex_init(struct mutex *lock, const char *name));
 
+/*
+ * In the DEBUG case we are using the "NULL fastpath" for mutexes,
+ * which forces all calls into the slowpath. We also unconditionally disable
+ * any fastpath inlining.
+ */
+#ifdef CONFIG_DEBUG_MUTEXES
+# include <asm-generic/mutex-null.h>
+# undef MUTEX_INLINE_FASTPATH
+#else
+# include <asm/mutex.h>
+#endif
+
+#ifdef MUTEX_INLINE_FASTPATH
+
+#include <linux/sched.h>
+extern void fastcall __sched __mutex_lock_noinline(atomic_t *lock_count);
+extern int fastcall __sched __mutex_lock_interruptible_noinline(atomic_t *lock_count);
+extern void fastcall __sched __mutex_unlock_noinline(atomic_t *lock_count);
+
+static inline int mutex_trylock(struct mutex *lock)
+{
+	return __mutex_trylock(&lock->count);
+}
+
+static inline void mutex_lock(struct mutex *lock)
+{
+	__mutex_fastpath_lock(&lock->count, __mutex_lock_noinline);
+}
+
+static inline int mutex_lock_interruptible(struct mutex *lock)
+{
+	return __mutex_fastpath_lock_retval
+		(&lock->count, __mutex_lock_interruptible_noinline);
+}
+
+static inline void mutex_unlock(struct mutex *lock)
+{
+	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_noinline);
+}
+
+#else
 extern void FASTCALL(mutex_lock(struct mutex *lock));
 extern int FASTCALL(mutex_lock_interruptible(struct mutex *lock));
 extern int FASTCALL(mutex_trylock(struct mutex *lock));
 extern void FASTCALL(mutex_unlock(struct mutex *lock));
+#endif
+
 extern int FASTCALL(mutex_is_locked(struct mutex *lock));
 
 #endif

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [patch 3/3] mutex subsystem: inline mutex_is_locked()
  2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
                   ` (3 preceding siblings ...)
  2005-12-26 19:25 ` [patch 2/3] mutex subsystem: fastpath inlining Nicolas Pitre
@ 2005-12-26 19:26 ` Nicolas Pitre
  2005-12-27 11:37   ` Ingo Molnar
  4 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-26 19:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King


There is currently no advantage to not always inlining mutex_is_locked,
even on x86.

Signed-off-by: Nicolas Pitre <nico@cam.org>

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -22,19 +22,6 @@
 # include "mutex.h"
 #endif
 
-/***
- * mutex_is_locked - is the mutex locked
- * @lock: the mutex to be queried
- *
- * Returns 1 if the mutex is locked, 0 if unlocked.
- */
-int fastcall mutex_is_locked(struct mutex *lock)
-{
-	return atomic_read(&lock->count) != 1;
-}
-
-EXPORT_SYMBOL(mutex_is_locked);
-
 /*
  * Block on a lock - add ourselves to the list of waiters.
  * Called with lock->wait_lock held.
Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -139,6 +139,9 @@ extern int FASTCALL(mutex_trylock(struct
 extern void FASTCALL(mutex_unlock(struct mutex *lock));
 #endif
 
-extern int FASTCALL(mutex_is_locked(struct mutex *lock));
+static inline int fastcall mutex_is_locked(struct mutex *lock)
+{
+	return atomic_read(&lock->count) != 1;
+}
 
 #endif


* Re: [patch 3/3] mutex subsystem: inline mutex_is_locked()
  2005-12-26 19:26 ` [patch 3/3] mutex subsystem: inline mutex_is_locked() Nicolas Pitre
@ 2005-12-27 11:37   ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-27 11:37 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> There is currently no advantage to not always inlining 
> mutex_is_locked, even on x86.

thanks, applied.

	Ingo


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-26 19:25 ` [patch 1/3] mutex subsystem: trylock Nicolas Pitre
@ 2005-12-27 11:51   ` Ingo Molnar
  2005-12-27 20:47     ` Nicolas Pitre
  2005-12-27 12:05   ` Arjan van de Ven
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-27 11:51 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> In the spirit of uniformity, this patch provides architecture specific 
> trylock implementations.  This allows for a pure xchg-based 
> implementation, as well as an optimized ARMv6 implementation.

hm, i dont really like the xchg variant:

> +static inline int
> +__mutex_trylock(atomic_t *count)
> +{
> +	int prev = atomic_xchg(count, 0);
> +
> +	if (unlikely(prev < 0)) {
> +		/*
> +		 * The lock was marked contended so we must restore that
> +		 * state. If while doing so we get back a prev value of 1
> +		 * then we just own it.
> +		 *
> +		 * In all cases this has the potential to trigger the
> +		 * slowpath for the owner's unlock path - but this is not
> +		 * a big problem in practice.
> +		 */
> +		prev = atomic_xchg(count, -1);
> +		if (prev < 0)
> +			prev = 0;
> +	}

here we go to great trouble trying to avoid the 'slowpath', while we 
unconditionally force the next unlock into the slowpath! So we have not 
won anything. (on a cycle count basis it's probably even a net loss)

The same applies to atomic_dec_return() based trylock variant. Only the 
cmpxchg based one, and the optimized ARM variant should be used as a 
'fastpath'.

another thing is that this solution still leaves that ugly #ifdef in 
kernel/mutex.c.

so i took a different solution that solves both problems. The API 
towards architectures is now:

 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))

note the new 'fn' parameter: this enables mutex-null.h to do a simple:

 #define __mutex_fastpath_trylock(count, fn_name)        fn_name(count)

and the final ugly debugging-related #ifdef is gone from kernel/mutex.c!  
Furthermore, both the mutex-xchg.h and the mutex-dec.h non-cmpxchg 
variants will fall back to the spinlock implementation.

	Ingo


* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-26 19:25 ` [patch 2/3] mutex subsystem: fastpath inlining Nicolas Pitre
@ 2005-12-27 11:55   ` Ingo Molnar
  2005-12-27 21:59     ` Nicolas Pitre
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-27 11:55 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> Some architectures, notably ARM for instance, might benefit from 
> inlining the mutex fast paths. [...]

what is the effect on text size? Could you post the before- and 
after-patch vmlinux 'size kernel/test.o' output in the nondebug case, 
with Arjan's latest 'convert a couple of semaphore users to mutexes' 
patch applied? [make sure you've got enough of those users compiled in, 
so that the inlining cost is truly measured. Perhaps also do 
before/after 'size' output of a few affected .o files, without mixing 
kernel/mutex.o into it, like vmlinux does.]

	Ingo


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-26 19:25 ` [patch 1/3] mutex subsystem: trylock Nicolas Pitre
  2005-12-27 11:51   ` Ingo Molnar
@ 2005-12-27 12:05   ` Arjan van de Ven
  2005-12-27 13:15     ` Ingo Molnar
  2005-12-29  3:22     ` Nicolas Pitre
  1 sibling, 2 replies; 29+ messages in thread
From: Arjan van de Ven @ 2005-12-27 12:05 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: linux-kernel

On Mon, 2005-12-26 at 14:25 -0500, Nicolas Pitre wrote:
> Index: linux-2.6/include/asm-arm/mutex.h
> ===================================================================
> --- linux-2.6.orig/include/asm-arm/mutex.h
> +++ linux-2.6/include/asm-arm/mutex.h
> @@ -98,5 +98,31 @@ do {									\
>   */
>  #define __mutex_slowpath_needs_to_unlock()	1
>  
> +/*
> + * For __mutex_trylock we use another construct which could be described
> + * as an "incomplete atomic decrement" or a "single value cmpxchg" since
> + * it has two modes of failure:
> + *
> + * 1) if the exclusive store fails we fail, and
> + *
> + * 2) if the decremented value is not zero we don't even attempt the store.


btw I really think that 1) is wrong. trylock should do everything it can
to get the semaphore short of sleeping. Just because some cacheline got
written to (which might even be shared!) in the middle of the atomic op
is not a good enough reason to fail the trylock imho. Going into the
slowpath.. fine. But here it's a quality of implementation issue; you
COULD get the semaphore without sleeping (at least probably, you'd have
to retry to know for sure) but because something wrote to the same
cacheline as the lock... no. that's just not good enough.. sorry.




* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-27 12:05   ` Arjan van de Ven
@ 2005-12-27 13:15     ` Ingo Molnar
  2005-12-29  4:06       ` Nicolas Pitre
  2005-12-29  3:22     ` Nicolas Pitre
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-27 13:15 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Nicolas Pitre, linux-kernel


* Arjan van de Ven <arjan@infradead.org> wrote:

> > + * 1) if the exclusive store fails we fail, and
> > + *
> > + * 2) if the decremented value is not zero we don't even attempt the store.
> 
> 
> btw I really think that 1) is wrong. trylock should do everything it 
> can to get the semaphore short of sleeping. Just because some 
> cacheline got written to (which might even be shared!) in the middle 
> of the atomic op is not a good enough reason to fail the trylock imho. 
> Going into the slowpath.. fine. But here it's a quality of 
> implementation issue; you COULD get the semaphore without sleeping (at 
> least probably, you'd have to retry to know for sure) but because 
> something wrote to the same cacheline as the lock... no. that's just 
> not good enough.. sorry.

point. I solved this in my tree by calling the generic trylock <fn> if 
there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus 
the call is under unlikely()), and should thus still enable the fast 
implementation.

	Ingo


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-27 11:51   ` Ingo Molnar
@ 2005-12-27 20:47     ` Nicolas Pitre
  2005-12-28  7:48       ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-27 20:47 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Tue, 27 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > In the spirit of uniformity, this patch provides architecture specific 
> > trylock implementations.  This allows for a pure xchg-based 
> > implementation, as well as an optimized ARMv6 implementation.
> 
> hm, i dont really like the xchg variant:
> 
> > +static inline int
> > +__mutex_trylock(atomic_t *count)
> > +{
> > +	int prev = atomic_xchg(count, 0);
> > +
> > +	if (unlikely(prev < 0)) {
> > +		/*
> > +		 * The lock was marked contended so we must restore that
> > +		 * state. If while doing so we get back a prev value of 1
> > +		 * then we just own it.
> > +		 *
> > +		 * In all cases this has the potential to trigger the
> > +		 * slowpath for the owner's unlock path - but this is not
> > +		 * a big problem in practice.
> > +		 */
> > +		prev = atomic_xchg(count, -1);
> > +		if (prev < 0)
> > +			prev = 0;
> > +	}
> 
> here we go to great trouble trying to avoid the 'slowpath', while we 
> unconditionally force the next unlock into the slowpath! So we have not 
> won anything. (on a cycle count basis it's probably even a net loss)

I disagree.  There is no unconditional forcing into the slow path 
at all. Please consider again.

1) We try to lock with 0.
   If the previous value was 1 (likely) or 0 (less likely) we return
   that value right away.  We therefore cover 2 out of 3 cases already 
   with no more instructions or cycles than the successful fastpath 
   lock.  Note the if (unlikely(prev < 0)) above that handles those two 
   cases with a single test; if it gets inlined, the outer code testing 
   the return value can rely on that same test to decide what to do, 
   since the condition flags are already set (gcc apparently gets it 
   right, on ARM at least).  This takes care of the likely cases.

2) If the previous value was neither 1 nor 0 then the lock was already 
   contended.  We agree that "already locked with contention" is the 
   least likely of the 3 cases, right?  So this case is unlikely to be 
   hit at all.  Still, since we just changed the lock from contended to 
   simply locked, we have to mark it contended again, otherwise we risk 
   never waking up the waiters.  If by chance (and you'd have to be 
   extremely lucky, since the window is really small) the value changed 
   from 0 to 1 (remember we set it to 0 above) before we swap it back 
   to -1, then we simply own the lock.  If it was still 0 then we merely 
   restore the contended state it had before.  And if it became -1 then 
   setting it to -1 changes nothing.  In all cases (whether the owner 
   unlocked it (from 0 to 1), or someone else even came along to lock 
   it (from 0 to 1 to 0)), we know that there are sleeping waiters that 
   the previous owner failed to wake up because we made the value 0 for 
   a while.  So whether we are the new owner or someone else is, a 
   waiter still needs to be woken upon unlock regardless.

So, as you can see, my xchg-based trylock doesn't force anything in 
99.999% of the cases.  The only possibility for a false contention state 
would involve:

 - process A locking the mutex (set to 0)

 - process B locking the mutex and failing (now set to -1)

 - process C attempting a trylock (new 0, prev = -1)

 - process A unlocking (from 0 to 1 -> no waking process B)

 - process D locking (from 1 to 0)

 - process E locking and failing (from 0 to -1)

 - process D unlocking (it is -1 so wake up process B, leave it to -1)

 - process B unlocking (it is -1 so wake up process E, set to 0)

 - process E unlock (from 0 to 1)

 - process C, midway into trylock, restore contention flag (from 1 to -1)

 - when process C unlocks, it branches to the slow path needlessly.

Do you really think we should care about such a twisted scenario?

> The same applies to atomic_dec_return() based trylock variant. Only the 
> cmpxchg based one, and the optimized ARM variant should be used as a 
> 'fastpath'.

On ARM prior to version 6, a cmpxchg is exactly what we need to avoid as 
much as possible, for the same reason I've been advocating xchg.  There, 
performing a cmpxchg is as costly as an atomic_dec, and both are more 
costly and bigger than my xchg-based trylock algorithm.  And contrary to 
the atomic_dec-based trylock, the xchg-based one does not 
unconditionally force any unlock into the slow path.




> Furthermore, both the mutex-xchg.h and the mutex-dec.h non-cmpxchg 
> variant will fall back to the spinlock implementation.

Given the above demonstration I really think you should use my 
xchg-based trylock in mutex-xchg.h.  At least on ARM it is _still_ 
cheaper and smaller than the preemption disable implied by a spinlock.


Nicolas


* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-27 11:55   ` Ingo Molnar
@ 2005-12-27 21:59     ` Nicolas Pitre
  2005-12-28  7:41       ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-27 21:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Tue, 27 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > Some architectures, notably ARM for instance, might benefit from 
> > inlining the mutex fast paths. [...]
> 
> what is the effect on text size? Could you post the before- and 
> after-patch vmlinux 'size kernel/test.o' output in the nondebug case, 
> with Arjan's latest 'convert a couple of semaphore users to mutexes' 
> patch applied? [make sure you've got enough of those users compiled in, 
> so that the inlining cost is truly measured. Perhaps also do 
> before/after 'size' output of a few affected .o files, without mixing 
> kernel/mutex.o into it, like vmlinux does.]

Theory should be convincing enough.  First of all, all the semaphore 
fast paths are always inlined currently, on all architectures I've 
looked at.  A down() fast path is always looking like this:

        mrs     ip, cpsr
        orr     lr, ip, #128
        msr     cpsr_c, lr
        ldr     lr, [r0]
        subs    lr, lr, #1
        str     lr, [r0]
        msr     cpsr_c, ip
        movmi   ip, r0
        blmi    __down_failed

So our starting point for comparison is 9 instructions for every down() 
occurence in the kernel.  Same thing for up().  Every instruction is 
invariably 4 bytes.

Now let's look at the typical mutex_lock():

        mov     r4, #0
        swp     r3, r4, [r0]
        cmp     r3, #1
        blne    __mutex_lock_noinline

This is 4 instructions.  Furthermore, the first "mov r4, #0" can often 
be eliminated when gcc can CSE the constant 0 from another register.  
We're talking about 3 instructions then, down from 9!

We therefore save between 20 and 24 bytes of kernel .text for every 
down() and every up() simply by going with mutexes.

Now if mutex_lock and mutex_unlock were not inlined, the above 3 or 4 
instructions would become one or two per call site, which is still a 
gain in space, though not as large as the one provided by the move from 
semaphores to mutexes.  It would however be more costly in terms of 
cycles, since a function prologue and epilogue are somewhat costly on 
ARM, especially with frame pointers enabled (I'll let RMK elaborate on 
his reasons for not disabling them).

And for mutex_lock_interruptible(), the inlined fastpath is not bigger 
than the non-inlined one, considering that the return value has to be 
tested (the test is done twice in the non-inlined case: once inside the 
function, and once outside of it) while the inlined version needs only 
one test.  They are therefore equivalent in terms of space.


Nicolas


* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-27 21:59     ` Nicolas Pitre
@ 2005-12-28  7:41       ` Ingo Molnar
  2005-12-29  2:53         ` Nicolas Pitre
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-28  7:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> > * Nicolas Pitre <nico@cam.org> wrote:
> > 
> > > Some architectures, notably ARM for instance, might benefit from 
> > > inlining the mutex fast paths. [...]
> > 
> > what is the effect on text size? Could you post the before- and 
> > after-patch vmlinux 'size kernel/test.o' output in the nondebug case, 
> > with Arjan's latest 'convert a couple of semaphore users to mutexes' 
> > patch applied? [make sure you've got enough of those users compiled in, 
> > so that the inlining cost is truly measured. Perhaps also do 
> > before/after 'size' output of a few affected .o files, without mixing 
> > kernel/mutex.o into it, like vmlinux does.]
> 
> Theory should be convincing enough. [...]

please provide actual measurements (just a simple pre-patch and 
post-patch 'size' output of vmlinux is enough), so that we can see the 
inlining cost. Note that x86 went to a non-inlined fastpath _despite_ 
having a compact CISC semaphore fastpath.

	Ingo


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-27 20:47     ` Nicolas Pitre
@ 2005-12-28  7:48       ` Ingo Molnar
  2005-12-28  8:13         ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-28  7:48 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> > here we go to great trouble trying to avoid the 'slowpath', while we 
> > unconditionally force the next unlock into the slowpath! So we have 
> > not won anything. (on a cycle count basis it's probably even a net 
> > loss)
> 
> I disagree.  [...elaborate analysis of the code ...]

you are right, it should work fine, and should be optimal. I'll add your 
xchg variant to mutex-xchg.h.

	Ingo


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-28  7:48       ` Ingo Molnar
@ 2005-12-28  8:13         ` Ingo Molnar
  2005-12-28 16:29           ` Nicolas Pitre
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-28  8:13 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Ingo Molnar <mingo@elte.hu> wrote:

> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > > here we go to great trouble trying to avoid the 'slowpath', while we 
> > > unconditionally force the next unlock into the slowpath! So we have 
> > > not won anything. (on a cycle count basis it's probably even a net 
> > > loss)
> > 
> > I disagree.  [...elaborate analysis of the code ...]
> 
> you are right, it should work fine, and should be optimal. I'll add 
> your xchg variant to mutex-xchg.h.

the patch below adds it, and it boots fine on x86 with mutex.c hacked to 
include asm-generic/mutex-xchg.h.

	Ingo

Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -82,7 +82,25 @@ do {									\
 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))
 {
-	return fn(count);
+	int prev = atomic_xchg(count, 0);
+
+	if (unlikely(prev < 0)) {
+		/*
+		 * The lock was marked contended so we must restore that
+		 * state. If while doing so we get back a prev value of 1
+		 * then we just own it.
+		 *
+		 * [ In the rare case of the mutex going to 1 and then to 0
+		 *   in this few-instructions window, this has the potential
+		 *   to trigger the slowpath for the owner's unlock path, but
+		 *   that's not a problem in practice. ]
+		 */
+		prev = atomic_xchg(count, -1);
+		if (prev < 0)
+			prev = 0;
+	}
+
+	return prev;
 }
 
 #endif


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-28  8:13         ` Ingo Molnar
@ 2005-12-28 16:29           ` Nicolas Pitre
  2005-12-28 17:09             ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-28 16:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Wed, 28 Dec 2005, Ingo Molnar wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > * Nicolas Pitre <nico@cam.org> wrote:
> > 
> > > > here we go to great trouble trying to avoid the 'slowpath', while we 
> > > > unconditionally force the next unlock into the slowpath! So we have 
> > > > not won anything. (on a cycle count basis it's probably even a net 
> > > > loss)
> > > 
> > > I disagree.  [...elaborate analysis of the code ...]
> > 
> > you are right, it should work fine, and should be optimal. I'll add 
> > your xchg variant to mutex-xchg.h.
> 
> the patch below adds it, and it boots fine on x86 with mutex.c hacked to 
> include asm-generic/mutex-xchg.h.

Here's an additional patch to fix some comments, and to add a small 
optimization.

Index: linux-2.6/include/asm-generic/mutex-xchg.h
===================================================================
--- linux-2.6.orig/include/asm-generic/mutex-xchg.h
+++ linux-2.6/include/asm-generic/mutex-xchg.h
@@ -75,9 +75,14 @@ do {									\
  *  @count: pointer of type atomic_t
  *  @fn: spinlock based trylock implementation
  *
- * We call the spinlock based generic variant, because atomic_xchg() is
- * too destructive to provide a simpler trylock implementation than the
- * spinlock based one.
+ * Change the count from 1 to a value lower than 1, and return 0 (failure)
+ * if it wasn't 1 originally, or return 1 (success) otherwise. This function
+ * MUST leave the value lower than 1 even when the "1" assertion wasn't true.
+ * Additionally, if the value was < 0 originally, this function must not leave
+ * it at 0 on failure.
+ *
+ * If the architecture has no effective trylock variant, it should call the
+ * <fn> spinlock-based trylock variant unconditionally.
  */
 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fn)(atomic_t *))
@@ -90,12 +95,13 @@ __mutex_fastpath_trylock(atomic_t *count
 		 * state. If while doing so we get back a prev value of 1
 		 * then we just own it.
 		 *
-		 * [ In the rare case of the mutex going to 1 and then to 0
-		 *   in this few-instructions window, this has the potential
-		 *   to trigger the slowpath for the owner's unlock path, but
-		 *   that's not a problem in practice. ]
+		 * [ In the rare case of the mutex going to 1, to 0, to -1
+		 *   and then back to 0 in this few-instructions window,
+		 *   this has the potential to trigger the slowpath for the
+		 *   owner's unlock path needlessly, but that's not a problem
+		 *   in practice. ]
 		 */
-		prev = atomic_xchg(count, -1);
+		prev = atomic_xchg(count, prev);
 		if (prev < 0)
 			prev = 0;
 	}


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-28 16:29           ` Nicolas Pitre
@ 2005-12-28 17:09             ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-28 17:09 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> > the patch below adds it, and it boots fine on x86 with mutex.c hacked to 
> > include asm-generic/mutex-xchg.h.
> 
> Here's an additional patch to fix some comments, and to add a small 
> optimization.

thanks, applied.

	Ingo


* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-28  7:41       ` Ingo Molnar
@ 2005-12-29  2:53         ` Nicolas Pitre
  2005-12-29  8:41           ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-29  2:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King

On Wed, 28 Dec 2005, Ingo Molnar wrote:

> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > > * Nicolas Pitre <nico@cam.org> wrote:
> > > 
> > > > Some architectures, notably ARM for instance, might benefit from 
> > > > inlining the mutex fast paths. [...]
> > > 
> > > what is the effect on text size? Could you post the before- and 
> > > after-patch vmlinux 'size kernel/test.o' output in the nondebug case, 
> > > with Arjan's latest 'convert a couple of semaphore users to mutexes' 
> > > patch applied? [make sure you've got enough of those users compiled in, 
> > > so that the inlining cost is truly measured. Perhaps also do 
> > > before/after 'size' output of a few affected .o files, without mixing 
> > > kernel/mutex.o into it, like vmlinux does.]
> > 
> > Theory should be convincing enough. [...]
> 
> please provide actual measurements (just a simple pre-patch and 
> post-patch 'size' output of vmlinux is enough), so that we can see the 
> inlining cost.

This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n, 
therefore using the current semaphore code:

   text	   data	    bss	    dec	    hex	filename
1821108	 287792	  88264	2197164	 2186ac	vmlinux

Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with 
mutexes:

   text	   data	    bss	    dec	    hex	filename
1797108	 287568	  88172	2172848	 2127b0	vmlinux

Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:

   text	   data	    bss	    dec	    hex	filename
1807824	 287136	  88172	2183132	 214fdc	vmlinux

This last case is not the smallest, but it is the fastest.

> Note that x86 went to a non-inlined fastpath _despite_ 
> having a compact CISC semaphore fastpath.

The function call overhead on x86 is less significant than on ARM, 
so always calling out-of-line code might be sensible in that case.


Nicolas


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-27 12:05   ` Arjan van de Ven
  2005-12-27 13:15     ` Ingo Molnar
@ 2005-12-29  3:22     ` Nicolas Pitre
  1 sibling, 0 replies; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-29  3:22 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: lkml

On Tue, 27 Dec 2005, Arjan van de Ven wrote:

> btw I really think that 1) is wrong. trylock should do everything it can
> to get the semaphore short of sleeping. Just because some cacheline got
> written to (which might even be shared!) in the middle of the atomic op
> is not a good enough reason to fail the trylock imho. Going into the
> slowpath.. fine. But here it's a quality of implementation issue; you
> COULD get the semaphore without sleeping (at least probably, you'd have
> to retry to know for sure) but because something wrote to the same
> cacheline as the lock... no. that's just not good enough.. sorry.

Well... actually it is not clear from the documentation how the 
exclusive monitor tags accesses (there is a lot of implementation-specific 
latitude).  So you are right that it could cause too many false-negative 
results.


Nicolas


* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-27 13:15     ` Ingo Molnar
@ 2005-12-29  4:06       ` Nicolas Pitre
  2005-12-29  8:33         ` Ingo Molnar
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-29  4:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Arjan van de Ven, lkml

On Tue, 27 Dec 2005, Ingo Molnar wrote:

> 
> * Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > > + * 1) if the exclusive store fails we fail, and
> > > + *
> > > + * 2) if the decremented value is not zero we don't even attempt the store.
> > 
> > 
> > btw I really think that 1) is wrong. trylock should do everything it 
> > can to get the semaphore short of sleeping. Just because some 
> > cacheline got written to (which might even be shared!) in the middle 
> > of the atomic op is not a good enough reason to fail the trylock imho. 
> > Going into the slowpath.. fine. But here it's a quality of 
> > implementation issue; you COULD get the semaphore without sleeping (at 
> > least probably, you'd have to retry to know for sure) but because 
> > something wrote to the same cacheline as the lock... no. that's just 
> > not good enough.. sorry.
> 
> point. I solved this in my tree by calling the generic trylock <fn> if 
> there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus 
> the call is under unlikely()), and should thus still enable the fast 
> implementation.

I'd solve it like this instead (on top of your latest patches):

Index: linux-2.6/include/asm-arm/mutex.h
===================================================================
--- linux-2.6.orig/include/asm-arm/mutex.h
+++ linux-2.6/include/asm-arm/mutex.h
@@ -110,12 +110,7 @@ do {									\
 
 /*
  * For __mutex_fastpath_trylock we use another construct which could be
- * described as an "incomplete atomic decrement" or a "single value cmpxchg"
- * since it has two modes of failure:
- *
- * 1) if the exclusive store fails we fail, and
- *
- * 2) if the decremented value is not zero we don't even attempt the store.
+ * described as a "single value cmpxchg".
  *
  * This provides the needed trylock semantics like cmpxchg would, but it is
  * lighter and less generic than a true cmpxchg implementation.
@@ -123,27 +118,22 @@ do {									\
 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fn_name)(atomic_t *))
 {
-	int __ex_flag, __res;
+	int __ex_flag, __res, __orig;
 
 	__asm__ (
 
-		"ldrex		%0, [%2]	\n"
-		"subs		%0, %0, #1	\n"
-		"strexeq	%1, %0, [%2]	\n"
+		"1: ldrex	%0, [%3]	\n"
+		"subs		%1, %0, #1	\n"
+		"strexeq	%2, %1, [%3]	\n"
+		"movlt		%0, #0		\n"
+		"cmpeq		%2, #0		\n"
+		"bgt		1b		\n"
 
-		: "=&r" (__res), "=&r" (__ex_flag)
+		: "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
 		: "r" (&count->counter)
 		: "cc", "memory" );
 
-	/*
-	 * We must not return a synthetic 'failure' if the conditional
-	 * did not succeed - drop back into the generic slowpath if
-	 * this happens (should be rare):
-	 */
-	if (unlikely(__ex_flag))
-		return fn_name(count);
-
-	return __res == 0;
+	return __orig;
 }
 
 #endif


Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-29  4:06       ` Nicolas Pitre
@ 2005-12-29  8:33         ` Ingo Molnar
  2005-12-29  9:01           ` Nick Piggin
  2005-12-29 16:46           ` Nicolas Pitre
  0 siblings, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-29  8:33 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Arjan van de Ven, lkml


* Nicolas Pitre <nico@cam.org> wrote:

> > > > + * 1) if the exclusive store fails we fail, and
> > > > + *
> > > > + * 2) if the decremented value is not zero we don't even attempt the store.
> > > 
> > > 
> > > btw I really think that 1) is wrong. trylock should do everything it 
> > > can to get the semaphore short of sleeping. Just because some 
> > > cacheline got written to (which might even be shared!) in the middle 
> > > of the atomic op is not a good enough reason to fail the trylock imho. 
> > > Going into the slowpath.. fine. But here it's a quality of 
> > > implementation issue; you COULD get the semaphore without sleeping (at 
> > > least probably, you'd have to retry to know for sure) but because 
> > > something wrote to the same cacheline as the lock... no. that's just 
> > > not good enough.. sorry.
> > 
> > point. I solved this in my tree by calling the generic trylock <fn> if 
> > there's an __ex_flag failure in the ARMv6 case. Should be rare (and thus 
> > the call is under unlikely()), and should thus still enable the fast 
> > implementation.
> 
> I'd solve it like this instead (on top of your latest patches):

thanks, applied.

> +		"1: ldrex	%0, [%3]	\n"
> +		"subs		%1, %0, #1	\n"
> +		"strexeq	%2, %1, [%3]	\n"
> +		"movlt		%0, #0		\n"
> +		"cmpeq		%2, #0		\n"
> +		"bgt		1b		\n"

so we are back to what is in essence a cmpxchg implementation?

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-29  2:53         ` Nicolas Pitre
@ 2005-12-29  8:41           ` Ingo Molnar
  2006-01-06 21:20             ` Nicolas Pitre
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-29  8:41 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: lkml, Arjan van de Ven, Russell King


* Nicolas Pitre <nico@cam.org> wrote:

> This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n, 
> therefore using the current semaphore code:
> 
>    text	   data	    bss	    dec	    hex	filename
> 1821108	 287792	  88264	2197164	 2186ac	vmlinux
> 
> Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with 
> mutexes:
> 
>    text	   data	    bss	    dec	    hex	filename
> 1797108	 287568	  88172	2172848	 2127b0	vmlinux
> 
> Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
> 
>    text	   data	    bss	    dec	    hex	filename
> 1807824	 287136	  88172	2183132	 214fdc	vmlinux
> 
> This last case is not the smallest, but it is the fastest.

i.e. 1.3% text savings from going to mutexes, and inlining them again 
gives up 0.5% of that. We've uninlined stuff for a smaller gain in the 
past ...

> > Note that x86 went to a non-inlined fastpath _despite_ 
> > having a compact CISC semaphore fastpath.
> 
> The function call overhead on x86 is less significant than the ARM 
> one, so always calling out of line code might be sensible in that 
> case.

i'm highly doubtful we should do that. The spinlock APIs are 4 times 
more frequent than mutexes are ever going to be, still they too are 
mostly out of line. (and we only inline the unlock portions that are a 
space win!) Can you measure any significant difference in performance?  
(e.g. lat_pipe triggers the mutex fastpath, in DEBUG_MUTEX_FULL=y mode) 

the performance won by inlining is often offset by the performance cost 
of the higher icache footprint. (and ARM CPUs don't have that large 
caches to begin with)

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-29  8:33         ` Ingo Molnar
@ 2005-12-29  9:01           ` Nick Piggin
  2005-12-29 17:15             ` Nicolas Pitre
  2005-12-29 16:46           ` Nicolas Pitre
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2005-12-29  9:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nicolas Pitre, Arjan van de Ven, lkml

Ingo Molnar wrote:
> * Nicolas Pitre <nico@cam.org> wrote:

>>+		"1: ldrex	%0, [%3]	\n"
>>+		"subs		%1, %0, #1	\n"
>>+		"strexeq	%2, %1, [%3]	\n"
>>+		"movlt		%0, #0		\n"
>>+		"cmpeq		%2, #0		\n"
>>+		"bgt		1b		\n"
> 
> 
> so we are back to what is in essence a cmpxchg implementation?
> 

FWIW, I still think we should go for an open-coded "cmpxchg" variant
for UP that disables preempt, and an atomic_cmpxchg variant for SMP.

- good generic implementations
- the UP version is faster than atomic_xchg for non preempt on ARM
- if you really are counting cycles, you'd probably have preempt off
- if you've got preempt on then the preempt_ operations in semaphores
   would be the least of your worries (how about spinlocks?)

Rather than straight out introducing lots of ugliness and complexity
for something that actually slows down the speed critical !preempt
case (but is unlikely to be measurable either way).

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-29  8:33         ` Ingo Molnar
  2005-12-29  9:01           ` Nick Piggin
@ 2005-12-29 16:46           ` Nicolas Pitre
  1 sibling, 0 replies; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-29 16:46 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Arjan van de Ven, lkml

On Thu, 29 Dec 2005, Ingo Molnar wrote:

> > I'd solve it like this instead (on top of your latest patches):
> 
> thanks, applied.
> 
> > +		"1: ldrex	%0, [%3]	\n"
> > +		"subs		%1, %0, #1	\n"
> > +		"strexeq	%2, %1, [%3]	\n"
> > +		"movlt		%0, #0		\n"
> > +		"cmpeq		%2, #0		\n"
> > +		"bgt		1b		\n"
> 
> so we are back to what is in essence a cmpxchg implementation?

A limited cmpxchg.  Using the generic cmpxchg for this would need either 
9 or 10 instructions instead of 6.


Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-29  9:01           ` Nick Piggin
@ 2005-12-29 17:15             ` Nicolas Pitre
  2005-12-30  2:05               ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Nicolas Pitre @ 2005-12-29 17:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Arjan van de Ven, lkml

On Thu, 29 Dec 2005, Nick Piggin wrote:

> FWIW, I still think we should go for an open-coded "cmpxchg" variant
> for UP that disables preempt, and an atomic_cmpxchg variant for SMP.
> 
> - good generic implementations
> - the UP version is faster than atomic_xchg for non preempt on ARM
> - if you really are counting cycles, you'd probably have preempt off
> - if you've got preempt on then the preempt_ operations in semaphores
>   would be the least of your worries (how about spinlocks?)
> 
> Rather than straight out introducing lots of ugliness and complexity
> for something that actually slows down the speed critical !preempt
> case (but is unlikely to be measurable either way).

I provided you with the demonstration last week:

- the non SMP (ARM version < 6) is using xchg.

- xchg on ARM version < 6 is always faster and smaller than any 
  preemption disable.

- xchg on ARM is the same size as the smallest solution you may think of
  when preemption is disabled, and never slower (prove me otherwise if 
  you wish).

- all xchg based primitives are "generic" code already.

And I think you didn't look at the overall patch set before talking 
about "lots of ugliness", did you?  The fact is that the code, 
especially the core code, is much cleaner now than it was when 
everything was seemingly "generic".  The current work on architecture 
abstractions still allows optimizations in a way that is much cleaner 
and more controlled than what happened with the semaphore code, and the 
debugging code can even take advantage of it without polluting the core 
code.

It happens that i386, x86_64 and ARM (if v6 or above) currently have 
their own tweaks to save space and/or cycles in a pretty strictly 
defined way.  The effort is currently spent on making sure that if 
other architectures want to use one of their own tricks, they can do it 
without affecting the core code, which remains 95% of the whole thing 
(which is not the case for semaphores); and the currently provided 
architecture-specific versions are _never_ bigger nor slower than any 
generic version would be.  Otherwise what would be the point?  


Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 1/3] mutex subsystem: trylock
  2005-12-29 17:15             ` Nicolas Pitre
@ 2005-12-30  2:05               ` Nick Piggin
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2005-12-30  2:05 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Ingo Molnar, Arjan van de Ven, lkml

Nicolas Pitre wrote:

> I provided you with the demonstration last week:
> 

I didn't really find it convincing.

> - the non SMP (ARM version < 6) is using xchg.
> 
> - xchg on ARM version < 6 is always faster and smaller than any 
>   preemption disable.
> 

Maybe true, but as I said, if you have preemption enabled, then there
are far more preempt_xxx operations in other places than semaphores,
which impact both size and speed.

> - xchg on ARM is the same size as the smallest solution you may think of
>   when preemption is disabled and never slower (prove me otherwise? if 
>   you wish).
> 

I was going from your numbers where you said it was IIRC a cycle faster.

> - all xchg based primitives are "generic" code already.
> 

And yet there is a need for architecture specific code. Also having
xchg and cmpxchg variants means that you have two different sets of
possible interleavings of instructions, and xchg has much more subtle
cases as you demonstrated.

> And I think you didn't look at the overall patch set before talking 
> about "lots of ugliness", did you?  The fact is that the code, 

Yes I did and I think it is more ugly than my proposal would be.

> especially the core code, is much cleaner now than it was when 
> everything was seemingly "generic" since the current work on 
> architecture abstractions still allows optimizations in a way that is so 
> much cleaner and controlled than what happened with the semaphore code, 
> and the debugging code can even take advantage of it without polluting 
> the core code.
> 
> It happens that i386, x86_64 and ARM (if v6 or above) currently have 
> their own tweaks to save space and/or cycles in a pretty strictly 
> defined way.  The effort is currently spent on making sure if other 
> architectures want to use one of their own tricks for those they can do 
> it without affecting the core code which remains 95% of the whole thing 
> (which is not the case for semaphores), and the currently provided 
> architecture specific versions are _never_ bigger nor slower than any 
> generic version would be.  Otherwise what would be the point?  
> 

I don't think size is a great argument, because the operations should
be out of line anyway to save icache (your numbers showed a pretty
large increase when they're inlined).

And as for speed, I'm not arguing that you can't save a couple of
cycles, but I'm weighing that against the maintainability of a generic
implementation which has definite advantages. Whereas I don't think you
could demonstrate any real speed improvement for saving a few cycles
from slightly faster semaphore operations on CONFIG_PREEMPT kernels?

If someone ever did find the need for an arch specific variant that
really offers advantages, then there is nothing to stop that being
added.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 2/3] mutex subsystem: fastpath inlining
  2005-12-29  8:41           ` Ingo Molnar
@ 2006-01-06 21:20             ` Nicolas Pitre
  0 siblings, 0 replies; 29+ messages in thread
From: Nicolas Pitre @ 2006-01-06 21:20 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: lkml, Arjan van de Ven, Russell King


Sorry for the delay...

On Thu, 29 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <nico@cam.org> wrote:
> 
> > This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n, 
> > therefore using the current semaphore code:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1821108	 287792	  88264	2197164	 2186ac	vmlinux
> > 
> > Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with 
> > mutexes:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1797108	 287568	  88172	2172848	 2127b0	vmlinux
> > 
> > Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1807824	 287136	  88172	2183132	 214fdc	vmlinux
> > 
> > This last case is not the smallest, but it is the fastest.
> 
> i.e. 1.3% text savings from going to mutexes, and inlining them again 
> gives up 0.5% of that. We've uninlined stuff for a smaller gain in the 
> past ...
> 
> > > Note that x86 went to a non-inlined fastpath _despite_ 
> > > having a compact CISC semaphore fastpath.
> > 
> > The function call overhead on x86 is less significant than the ARM 
> > one, so always calling out of line code might be sensible in that 
> > case.
> 
> i'm highly doubtful we should do that. The spinlock APIs are 4 times 
> more frequent than mutexes are ever going to be, still they too are 
> mostly out of line. (and we only inline the unlock portions that are a 
> space win!) Can you measure any significant difference in performance?  
> (e.g. lat_pipe triggers the mutex fastpath, in DEBUG_MUTEX_FULL=y mode) 

Here it is.

With the default non inlined fast path:

Pipe latency: 14.2669 microseconds
Pipe latency: 14.2625 microseconds
Pipe latency: 14.2655 microseconds
Pipe latency: 14.2670 microseconds

Then, with fast paths inlined:

Pipe latency: 13.9483 microseconds
Pipe latency: 13.9409 microseconds
Pipe latency: 13.9468 microseconds
Pipe latency: 13.9529 microseconds

So inlining the mutex fast path is more than 2% faster for the whole 
test.

Is that worth the 0.5% increase in kernel size in your opinion?

Note: I modified lat_pipe to use two threads instead of two processes 
since on ARM switching mm requires a full cache flush due to the VIVT 
cache, and doing so skyrockets the pipe latency up to 480 microsecs with 
the mutex difference lost in the noise.


Nicolas

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2006-01-06 21:20 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-23 16:16 [patch 00/11] mutex subsystem, -V7 Ingo Molnar
2005-12-24  5:15 ` Nicolas Pitre
2005-12-24  5:23   ` Nicolas Pitre
2005-12-26 19:24 ` Nicolas Pitre
2005-12-26 19:25 ` [patch 1/3] mutex subsystem: trylock Nicolas Pitre
2005-12-27 11:51   ` Ingo Molnar
2005-12-27 20:47     ` Nicolas Pitre
2005-12-28  7:48       ` Ingo Molnar
2005-12-28  8:13         ` Ingo Molnar
2005-12-28 16:29           ` Nicolas Pitre
2005-12-28 17:09             ` Ingo Molnar
2005-12-27 12:05   ` Arjan van de Ven
2005-12-27 13:15     ` Ingo Molnar
2005-12-29  4:06       ` Nicolas Pitre
2005-12-29  8:33         ` Ingo Molnar
2005-12-29  9:01           ` Nick Piggin
2005-12-29 17:15             ` Nicolas Pitre
2005-12-30  2:05               ` Nick Piggin
2005-12-29 16:46           ` Nicolas Pitre
2005-12-29  3:22     ` Nicolas Pitre
2005-12-26 19:25 ` [patch 2/3] mutex subsystem: fastpath inlining Nicolas Pitre
2005-12-27 11:55   ` Ingo Molnar
2005-12-27 21:59     ` Nicolas Pitre
2005-12-28  7:41       ` Ingo Molnar
2005-12-29  2:53         ` Nicolas Pitre
2005-12-29  8:41           ` Ingo Molnar
2006-01-06 21:20             ` Nicolas Pitre
2005-12-26 19:26 ` [patch 3/3] mutex subsystem: inline mutex_is_locked() Nicolas Pitre
2005-12-27 11:37   ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox