linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v3 0/6] Add support for pldw instruction on v7 MP cores
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

Hello,

This is version three of the patches I have previously posted here:

v1: http://lists.infradead.org/pipermail/linux-arm-kernel/2013-July/185663.html
v2: http://lists.infradead.org/pipermail/linux-arm-kernel/2013-July/186411.html

This version is simply a rebase on top of v3.12-rc1, since I concentrated
on getting my memory barrier improvements into 3.12 ahead of this series.
I'm aiming to get this merged for 3.13.

All feedback welcome,

Will


Will Deacon (6):
  ARM: prefetch: remove redundant "cc" clobber
  ARM: smp_on_up: move inline asm ALT_SMP patching macro out of
    spinlock.h
  ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
  ARM: locks: prefetch the destination word for write prior to strex
  ARM: atomics: prefetch the destination word for write prior to strex
  ARM: bitops: prefetch the destination word for write prior to strex

 arch/arm/include/asm/atomic.h         |  7 +++++++
 arch/arm/include/asm/processor.h      | 33 +++++++++++++++++++++++++--------
 arch/arm/include/asm/spinlock.h       | 28 ++++++++++++++--------------
 arch/arm/include/asm/spinlock_types.h |  2 +-
 arch/arm/include/asm/unified.h        |  4 ++++
 arch/arm/lib/bitops.h                 |  5 +++++
 6 files changed, 56 insertions(+), 23 deletions(-)

-- 
1.8.2.2


* [PATCH v3 1/6] ARM: prefetch: remove redundant "cc" clobber
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

The pld instruction does not affect the condition flags, so don't bother
clobbering them.
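
For illustration, a minimal sketch (not taken from the patch) of a case
where the clobber really is required: flag-setting instructions such as
adds must tell the compiler about the condition flags, whereas pld never
touches them.

	/* Hypothetical example: adds updates the NZCV flags, so "cc" is
	 * needed here, unlike the pld case. */
	static inline int add_and_set_flags(int a, int b)
	{
		int res;
		__asm__ __volatile__("adds	%0, %1, %2"
				     : "=r" (res)
				     : "r" (a), "r" (b)
				     : "cc");
		return res;
	}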

Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index 413f387..514a989 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -97,9 +97,7 @@ static inline void prefetch(const void *ptr)
 {
 	__asm__ __volatile__(
 		"pld\t%a0"
-		:
-		: "p" (ptr)
-		: "cc");
+		:: "p" (ptr));
 }
 
 #define ARCH_HAS_PREFETCHW
-- 
1.8.2.2


* [PATCH v3 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

Patching UP/SMP alternatives inside inline assembly blocks is useful
outside of the spinlock implementation, which is where it is currently
used (for sev and wfe).

This patch lifts the macro into processor.h and gives it a scarier name
to (a) avoid conflicts in the global namespace and (b) try and deter its
usage unless you "know what you're doing". The W macro for generating
wide instructions when targeting Thumb-2 is also made available under
the name WASM, to reduce the potential for conflicts with other headers.
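
As a rough sketch of the intended usage outside spinlock.h (the function
name below is made up for illustration), the two macros combine to emit
an SMP-only instruction that the boot code patches to a nop when running
on a uniprocessor system:

	/* Illustrative only: sev on SMP, patched to nop on UP.  WASM()
	 * picks the wide (.w) encoding on Thumb-2 so that both
	 * alternatives are the same 4 bytes in size. */
	static inline void send_event_example(void)
	{
		__asm__ __volatile__(__ALT_SMP_ASM(WASM(sev), WASM(nop)));
	}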

Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 12 ++++++++++++
 arch/arm/include/asm/spinlock.h  | 15 ++++-----------
 arch/arm/include/asm/unified.h   |  4 ++++
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index 514a989..26164c9 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -22,6 +22,7 @@
 #include <asm/hw_breakpoint.h>
 #include <asm/ptrace.h>
 #include <asm/types.h>
+#include <asm/unified.h>
 
 #ifdef __KERNEL__
 #define STACK_TOP	((current->personality & ADDR_LIMIT_32BIT) ? \
@@ -87,6 +88,17 @@ unsigned long get_wchan(struct task_struct *p);
 #define KSTK_EIP(tsk)	task_pt_regs(tsk)->ARM_pc
 #define KSTK_ESP(tsk)	task_pt_regs(tsk)->ARM_sp
 
+#ifdef CONFIG_SMP
+#define __ALT_SMP_ASM(smp, up)						\
+	"9998:	" smp "\n"						\
+	"	.pushsection \".alt.smp.init\", \"a\"\n"		\
+	"	.long	9998b\n"					\
+	"	" up "\n"						\
+	"	.popsection\n"
+#else
+#define __ALT_SMP_ASM(smp, up)	up
+#endif
+
 /*
  * Prefetching support - only ARMv5.
  */
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4f2c280..e1ce452 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -11,15 +11,7 @@
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
  * extensions, so when running on UP, we have to patch these instructions away.
  */
-#define ALT_SMP(smp, up)					\
-	"9998:	" smp "\n"					\
-	"	.pushsection \".alt.smp.init\", \"a\"\n"	\
-	"	.long	9998b\n"				\
-	"	" up "\n"					\
-	"	.popsection\n"
-
 #ifdef CONFIG_THUMB2_KERNEL
-#define SEV		ALT_SMP("sev.w", "nop.w")
 /*
  * For Thumb-2, special care is needed to ensure that the conditional WFE
  * instruction really does assemble to exactly 4 bytes (as required by
@@ -31,17 +23,18 @@
  * the assembler won't change IT instructions which are explicitly present
  * in the input.
  */
-#define WFE(cond)	ALT_SMP(		\
+#define WFE(cond)	__ALT_SMP_ASM(		\
 	"it " cond "\n\t"			\
 	"wfe" cond ".n",			\
 						\
 	"nop.w"					\
 )
 #else
-#define SEV		ALT_SMP("sev", "nop")
-#define WFE(cond)	ALT_SMP("wfe" cond, "nop")
+#define WFE(cond)	__ALT_SMP_ASM("wfe" cond, "nop")
 #endif
 
+#define SEV		__ALT_SMP_ASM(WASM(sev), WASM(nop))
+
 static inline void dsb_sev(void)
 {
 #if __LINUX_ARM_ARCH__ >= 7
diff --git a/arch/arm/include/asm/unified.h b/arch/arm/include/asm/unified.h
index f5989f4..b88beab 100644
--- a/arch/arm/include/asm/unified.h
+++ b/arch/arm/include/asm/unified.h
@@ -38,6 +38,8 @@
 #ifdef __ASSEMBLY__
 #define W(instr)	instr.w
 #define BSYM(sym)	sym + 1
+#else
+#define WASM(instr)	#instr ".w"
 #endif
 
 #else	/* !CONFIG_THUMB2_KERNEL */
@@ -50,6 +52,8 @@
 #ifdef __ASSEMBLY__
 #define W(instr)	instr
 #define BSYM(sym)	sym
+#else
+#define WASM(instr)	#instr
 #endif
 
 #endif	/* CONFIG_THUMB2_KERNEL */
-- 
1.8.2.2


* [PATCH v3 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

SMP ARMv7 CPUs implement the pldw instruction, which allows them to
prefetch data cachelines in an exclusive state.

This patch defines the prefetchw macro using pldw for CPUs that support
it.
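
For context, a small usage sketch (the structure and function below are
hypothetical): callers simply issue the prefetch on the address they are
about to write, and the appropriate instruction is chosen at build/boot
time.

	struct example_counter {	/* hypothetical */
		unsigned long count;
	};

	static inline void example_bump(struct example_counter *c)
	{
		/* pldw on SMP ARMv7+ kernels (patched back to pld when
		 * running on a UP system); other configurations keep the
		 * generic prefetchw() fallback from <linux/prefetch.h>. */
		prefetchw(&c->count);
		c->count++;
	}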

Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/processor.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/arm/include/asm/processor.h b/arch/arm/include/asm/processor.h
index 26164c9..c3d5fc1 100644
--- a/arch/arm/include/asm/processor.h
+++ b/arch/arm/include/asm/processor.h
@@ -112,12 +112,19 @@ static inline void prefetch(const void *ptr)
 		:: "p" (ptr));
 }
 
+#if __LINUX_ARM_ARCH__ >= 7 && defined(CONFIG_SMP)
 #define ARCH_HAS_PREFETCHW
-#define prefetchw(ptr)	prefetch(ptr)
-
-#define ARCH_HAS_SPINLOCK_PREFETCH
-#define spin_lock_prefetch(x) do { } while (0)
-
+static inline void prefetchw(const void *ptr)
+{
+	__asm__ __volatile__(
+		".arch_extension	mp\n"
+		__ALT_SMP_ASM(
+			WASM(pldw)		"\t%a0",
+			WASM(pld)		"\t%a0"
+		)
+		:: "p" (ptr));
+}
+#endif
 #endif
 
 #define HAVE_ARCH_PICK_MMAP_LAYOUT
-- 
1.8.2.2


* [PATCH v3 4/6] ARM: locks: prefetch the destination word for write prior to strex
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our {spin,read,write}_[try]lock implementations with
pldw instructions (on CPUs which support them) to try and grab the line
in exclusive state from the start. arch_rwlock_t is changed to avoid
using a volatile member, since this generates compiler warnings when
falling back on the __builtin_prefetch intrinsic, which expects a const
void * argument.
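
For reference, the generic fallback in <linux/prefetch.h> (shown roughly,
for illustration) is what trips the warning when handed a volatile member:

	/* Roughly what the generic header provides when the architecture
	 * does not define ARCH_HAS_PREFETCHW: */
	#ifndef ARCH_HAS_PREFETCHW
	#define prefetchw(x)	__builtin_prefetch(x, 1)   /* 1 => prefetch for write */
	#endif

	/* __builtin_prefetch() takes a const void *, so prefetchw(&rw->lock)
	 * on a volatile lock member makes GCC warn about the discarded
	 * volatile qualifier -- hence the switch to a plain u32 together
	 * with ACCESS_ONCE() in the two *_can_lock() macros. */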

Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/spinlock.h       | 13 ++++++++++---
 arch/arm/include/asm/spinlock_types.h |  2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index e1ce452..4999007 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -5,7 +5,7 @@
 #error SMP not supported on pre-ARMv6 CPUs
 #endif
 
-#include <asm/processor.h>
+#include <linux/prefetch.h>
 
 /*
  * sev and wfe are ARMv6K extensions.  Uniprocessor ARMv6 may not have the K
@@ -70,6 +70,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 	u32 newval;
 	arch_spinlock_t lockval;
 
+	prefetchw(&lock->slock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%3]\n"
 "	add	%1, %0, %4\n"
@@ -93,6 +94,7 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	unsigned long contended, res;
 	u32 slock;
 
+	prefetchw(&lock->slock);
 	do {
 		__asm__ __volatile__(
 		"	ldrex	%0, [%3]\n"
@@ -145,6 +147,7 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 {
 	unsigned long tmp;
 
+	prefetchw(&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%1]\n"
 "	teq	%0, #0\n"
@@ -163,6 +166,7 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
 {
 	unsigned long contended, res;
 
+	prefetchw(&rw->lock);
 	do {
 		__asm__ __volatile__(
 		"	ldrex	%0, [%2]\n"
@@ -196,7 +200,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
 }
 
 /* write_can_lock - would write_trylock() succeed? */
-#define arch_write_can_lock(x)		((x)->lock == 0)
+#define arch_write_can_lock(x)		(ACCESS_ONCE((x)->lock) == 0)
 
 /*
  * Read locks are a bit more hairy:
@@ -214,6 +218,7 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
 {
 	unsigned long tmp, tmp2;
 
+	prefetchw(&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%2]\n"
 "	adds	%0, %0, #1\n"
@@ -234,6 +239,7 @@ static inline void arch_read_unlock(arch_rwlock_t *rw)
 
 	smp_mb();
 
+	prefetchw(&rw->lock);
 	__asm__ __volatile__(
 "1:	ldrex	%0, [%2]\n"
 "	sub	%0, %0, #1\n"
@@ -252,6 +258,7 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
 {
 	unsigned long contended, res;
 
+	prefetchw(&rw->lock);
 	do {
 		__asm__ __volatile__(
 		"	ldrex	%0, [%2]\n"
@@ -273,7 +280,7 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
 }
 
 /* read_can_lock - would read_trylock() succeed? */
-#define arch_read_can_lock(x)		((x)->lock < 0x80000000)
+#define arch_read_can_lock(x)		(ACCESS_ONCE((x)->lock) < 0x80000000)
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/arm/include/asm/spinlock_types.h b/arch/arm/include/asm/spinlock_types.h
index b262d2f..47663fc 100644
--- a/arch/arm/include/asm/spinlock_types.h
+++ b/arch/arm/include/asm/spinlock_types.h
@@ -25,7 +25,7 @@ typedef struct {
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
 
 typedef struct {
-	volatile unsigned int lock;
+	u32 lock;
 } arch_rwlock_t;
 
 #define __ARCH_RW_LOCK_UNLOCKED		{ 0 }
-- 
1.8.2.2


* [PATCH v3 5/6] ARM: atomics: prefetch the destination word for write prior to strex
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our atomic access implementations with pldw
instructions (on CPUs which support them) to try and grab the line in
exclusive state from the start. Only the barrier-less functions are
updated, since memory barriers can limit the usefulness of prefetching
data.
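
To make the pattern concrete, here is a condensed sketch of what one of
the updated barrier-less ops boils down to after this change (illustrative
only -- see the diff below for the real thing):

	static inline void sketch_atomic_inc(atomic_t *v)
	{
		unsigned long tmp;
		int result;

		prefetchw(&v->counter);	/* pldw: ask for the line exclusively */
		__asm__ __volatile__(
	"1:	ldrex	%0, [%3]\n"	/* load-exclusive the counter */
	"	add	%0, %0, #1\n"
	"	strex	%1, %0, [%3]\n"	/* store-exclusive; %1 != 0 on failure */
	"	teq	%1, #0\n"
	"	bne	1b"
		: "=&r" (result), "=&r" (tmp), "+Qo" (v->counter)
		: "r" (&v->counter)
		: "cc");
	}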

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/atomic.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index da1c77d..55ffc3b 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -12,6 +12,7 @@
 #define __ASM_ARM_ATOMIC_H
 
 #include <linux/compiler.h>
+#include <linux/prefetch.h>
 #include <linux/types.h>
 #include <linux/irqflags.h>
 #include <asm/barrier.h>
@@ -41,6 +42,7 @@ static inline void atomic_add(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	prefetchw(&v->counter);
 	__asm__ __volatile__("@ atomic_add\n"
 "1:	ldrex	%0, [%3]\n"
 "	add	%0, %0, %4\n"
@@ -79,6 +81,7 @@ static inline void atomic_sub(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	prefetchw(&v->counter);
 	__asm__ __volatile__("@ atomic_sub\n"
 "1:	ldrex	%0, [%3]\n"
 "	sub	%0, %0, %4\n"
@@ -138,6 +141,7 @@ static inline void atomic_clear_mask(unsigned long mask, unsigned long *addr)
 {
 	unsigned long tmp, tmp2;
 
+	prefetchw(addr);
 	__asm__ __volatile__("@ atomic_clear_mask\n"
 "1:	ldrex	%0, [%3]\n"
 "	bic	%0, %0, %4\n"
@@ -283,6 +287,7 @@ static inline void atomic64_set(atomic64_t *v, u64 i)
 {
 	u64 tmp;
 
+	prefetchw(&v->counter);
 	__asm__ __volatile__("@ atomic64_set\n"
 "1:	ldrexd	%0, %H0, [%2]\n"
 "	strexd	%0, %3, %H3, [%2]\n"
@@ -299,6 +304,7 @@ static inline void atomic64_add(u64 i, atomic64_t *v)
 	u64 result;
 	unsigned long tmp;
 
+	prefetchw(&v->counter);
 	__asm__ __volatile__("@ atomic64_add\n"
 "1:	ldrexd	%0, %H0, [%3]\n"
 "	adds	%0, %0, %4\n"
@@ -339,6 +345,7 @@ static inline void atomic64_sub(u64 i, atomic64_t *v)
 	u64 result;
 	unsigned long tmp;
 
+	prefetchw(&v->counter);
 	__asm__ __volatile__("@ atomic64_sub\n"
 "1:	ldrexd	%0, %H0, [%3]\n"
 "	subs	%0, %0, %4\n"
-- 
1.8.2.2


* [PATCH v3 6/6] ARM: bitops: prefetch the destination word for write prior to strex
From: Will Deacon @ 2013-09-17 13:29 UTC
  To: linux-arm-kernel

The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.

This patch prefixes our atomic bitops implementation with prefetchw,
to try and grab the line in exclusive state from the start. The testop
macro is left alone, since the barrier semantics limit the usefulness
of prefetching data.
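
As a rough C-level picture of where the new prefetch sits relative to the
bit/word arithmetic in the assembly below (purely illustrative -- the real
implementation is the assembly macro in the diff, and the final store here
merely stands in for the ldrex/strex retry loop):

	static inline void sketch_set_bit(unsigned int nr, unsigned long *word)
	{
		unsigned long mask = 1UL << (nr & 31);	/* bit offset */

		word += nr >> 5;		/* word containing the bit */
		prefetchw(word);		/* new: warm the line for write */

		/* ...then the usual ldrex / orr (or bic, eor) / strex
		 * retry loop, as in the atomics patch: */
		*word |= mask;			/* non-atomic stand-in */
	}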

Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/lib/bitops.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h
index d6408d1..e0c68d5 100644
--- a/arch/arm/lib/bitops.h
+++ b/arch/arm/lib/bitops.h
@@ -10,6 +10,11 @@ UNWIND(	.fnstart	)
 	and	r3, r0, #31		@ Get bit offset
 	mov	r0, r0, lsr #5
 	add	r1, r1, r0, lsl #2	@ Get word offset
+#if __LINUX_ARM_ARCH__ >= 7
+	.arch_extension	mp
+	ALT_SMP(W(pldw)	[r1])
+	ALT_UP(W(nop))
+#endif
 	mov	r3, r2, lsl r3
 1:	ldrex	r2, [r1]
 	\instr	r2, r2, r3
-- 
1.8.2.2


* [PATCH v3 5/6] ARM: atomics: prefetch the destination word for write prior to strex
From: Nicolas Pitre @ 2013-09-17 18:09 UTC
  To: linux-arm-kernel

On Tue, 17 Sep 2013, Will Deacon wrote:

> The cost of changing a cacheline from shared to exclusive state can be
> significant, especially when this is triggered by an exclusive store,
> since it may result in having to retry the transaction.
> 
> This patch prefixes our atomic access implementations with pldw
> instructions (on CPUs which support them) to try and grab the line in
> exclusive state from the start. Only the barrier-less functions are
> updated, since memory barriers can limit the usefulness of prefetching
> data.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Acked-by: Nicolas Pitre <nico@linaro.org>

By the way, did you measure significant performance improvements with 
those patches?



* [PATCH v3 5/6] ARM: atomics: prefetch the destination word for write prior to strex
From: Will Deacon @ 2013-09-18  9:54 UTC
  To: linux-arm-kernel

On Tue, Sep 17, 2013 at 07:09:35PM +0100, Nicolas Pitre wrote:
> On Tue, 17 Sep 2013, Will Deacon wrote:
> 
> > The cost of changing a cacheline from shared to exclusive state can be
> > significant, especially when this is triggered by an exclusive store,
> > since it may result in having to retry the transaction.
> > 
> > This patch prefixes our atomic access implementations with pldw
> > instructions (on CPUs which support them) to try and grab the line in
> > exclusive state from the start. Only the barrier-less functions are
> > updated, since memory barriers can limit the usefulness of prefetching
> > data.
> > 
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> Acked-by: Nicolas Pitre <nico@linaro.org>

Thanks, Nicolas.

> By the way, did you measure significant performance improvements with 
> those patches?

Yep. The latest version shows around a 3% hackbench boost with 3.12-rc1
on my TC2.

Will


Thread overview: 9+ messages
2013-09-17 13:29 [PATCH v3 0/6] Add support for pldw instruction on v7 MP cores Will Deacon
2013-09-17 13:29 ` [PATCH v3 1/6] ARM: prefetch: remove redundant "cc" clobber Will Deacon
2013-09-17 13:29 ` [PATCH v3 2/6] ARM: smp_on_up: move inline asm ALT_SMP patching macro out of spinlock.h Will Deacon
2013-09-17 13:29 ` [PATCH v3 3/6] ARM: prefetch: add support for prefetchw using pldw on SMP ARMv7+ CPUs Will Deacon
2013-09-17 13:29 ` [PATCH v3 4/6] ARM: locks: prefetch the destination word for write prior to strex Will Deacon
2013-09-17 13:29 ` [PATCH v3 5/6] ARM: atomics: prefetch the destination word for write prior to strex Will Deacon
2013-09-17 18:09   ` Nicolas Pitre
2013-09-18  9:54     ` Will Deacon
2013-09-17 13:29 ` [PATCH v3 6/6] ARM: bitops: prefetch the destination word for write prior to strex Will Deacon
