* [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
Hi all,
This patch series adds Linux kernel support for the new atomic
instructions introduced as part of the Large System Extensions (LSE)
in ARMv8.1.
Whilst the new instructions can be configured out at compile time via
the CONFIG_ARM64_LSE_ATOMICS option, we make heavy use of alternative
patching so that the same kernel Image continues to boot anywhere,
regardless of CPU support.
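
Roughly speaking, the implementation selection boils down to the
following sketch (distilled from patches 5 and 6; the runtime patching
then happens on top of the LSE header):

/* Sketch only -- see <asm/atomic.h> in patches 5 and 6 for the real thing */
#if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE)
#include <asm/atomic_lse.h>	/* LSE call sites, patched at boot per CPU */
#else
#include <asm/atomic_ll_sc.h>	/* plain inlined LL/SC sequences */
#endif
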
I've tested this on Juno, Seattle and Hikey (none of which have the new
instructions) and also on the FastModel with 8.1 enabled.
Feedback welcome,
Will
--->8
Will Deacon (18):
arm64: cpufeature.h: add missing #include of kernel.h
arm64: atomics: move ll/sc atomics into separate header file
arm64: elf: advertise 8.1 atomic instructions as new hwcap
arm64: alternatives: add cpu feature for lse atomics
arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics
arm64: atomics: patch in lse instructions when supported by the CPU
arm64: locks: patch in lse instructions when supported by the CPU
arm64: bitops: patch in lse instructions when supported by the CPU
arm64: xchg: patch in lse instructions when supported by the CPU
arm64: cmpxchg: patch in lse instructions when supported by the CPU
arm64: cmpxchg_dbl: patch in lse instructions when supported by the
CPU
arm64: cmpxchg: avoid "cc" clobber in ll/sc routines
arm64: cmpxchg: avoid memory barrier on comparison failure
arm64: atomics: tidy up common atomic{,64}_* macros
arm64: atomics: prefetch the destination word for write prior to stxr
arm64: atomics: implement atomic{,64}_cmpxchg using cmpxchg
arm64: atomic64_dec_if_positive: fix incorrect branch condition
arm64: kconfig: select HAVE_CMPXCHG_LOCAL
arch/arm64/Kconfig | 13 ++
arch/arm64/Makefile | 13 +-
arch/arm64/include/asm/atomic.h | 262 ++++++-------------------------
arch/arm64/include/asm/atomic_ll_sc.h | 237 +++++++++++++++++++++++++++++
arch/arm64/include/asm/atomic_lse.h | 279 ++++++++++++++++++++++++++++++++++
arch/arm64/include/asm/cmpxchg.h | 192 +++++++++--------------
arch/arm64/include/asm/cpufeature.h | 5 +-
arch/arm64/include/asm/futex.h | 2 +
arch/arm64/include/asm/lse.h | 58 +++++++
arch/arm64/include/asm/spinlock.h | 132 ++++++++++++----
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/setup.c | 18 +++
arch/arm64/lib/Makefile | 13 ++
arch/arm64/lib/atomic_ll_sc.c | 3 +
arch/arm64/lib/bitops.S | 45 +++---
15 files changed, 893 insertions(+), 380 deletions(-)
create mode 100644 arch/arm64/include/asm/atomic_ll_sc.h
create mode 100644 arch/arm64/include/asm/atomic_lse.h
create mode 100644 arch/arm64/include/asm/lse.h
create mode 100644 arch/arm64/lib/atomic_ll_sc.c
--
2.1.4
* [PATCH 01/18] arm64: cpufeature.h: add missing #include of kernel.h
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
cpufeature.h makes use of DECLARE_BITMAP, which in turn relies on the
BITS_TO_LONGS and DIV_ROUND_UP macros.
This patch includes kernel.h in cpufeature.h to prevent all users having
to do the same thing.
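
For reference, DECLARE_BITMAP() expands roughly as follows (a
simplified sketch of the generic definitions, not the kernel's exact
text), which is where the dependency on BITS_TO_LONGS and DIV_ROUND_UP
comes from:

/* Simplified sketch of the macro chain */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))		/* linux/kernel.h */
#define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, 8 * sizeof(long))
#define DECLARE_BITMAP(name, bits)	unsigned long name[BITS_TO_LONGS(bits)]

/* e.g. a four-entry capability bitmap (illustrative name only): */
DECLARE_BITMAP(example_caps, 4);	/* -> unsigned long example_caps[1]; */
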
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/cpufeature.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index c1044218a63a..eb09f1ee8036 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -30,6 +30,8 @@
#ifndef __ASSEMBLY__
+#include <linux/kernel.h>
+
struct arm64_cpu_capabilities {
const char *desc;
u16 capability;
--
2.1.4
* [PATCH 02/18] arm64: atomics: move ll/sc atomics into separate header file
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
In preparation for the Large System Extension (LSE) atomic instructions
introduced by ARM v8.1, move the current exclusive load/store (LL/SC)
atomics into their own header file.
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic.h | 162 +--------------------------
arch/arm64/include/asm/atomic_ll_sc.h | 205 ++++++++++++++++++++++++++++++++++
2 files changed, 207 insertions(+), 160 deletions(-)
create mode 100644 arch/arm64/include/asm/atomic_ll_sc.h
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 7047051ded40..9467450a5c03 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -30,6 +30,8 @@
#ifdef __KERNEL__
+#include <asm/atomic_ll_sc.h>
+
/*
* On ARM, ordinary assignment (str instruction) doesn't clear the local
* strex/ldrex monitor on some implementations. The reason we can use it for
@@ -38,79 +40,6 @@
#define atomic_read(v) ACCESS_ONCE((v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
-/*
- * AArch64 UP and SMP safe atomic ops. We use load exclusive and
- * store exclusive to ensure that these are atomic. We may loop
- * to ensure that the update happens.
- */
-
-#define ATOMIC_OP(op, asm_op) \
-static inline void atomic_##op(int i, atomic_t *v) \
-{ \
- unsigned long tmp; \
- int result; \
- \
- asm volatile("// atomic_" #op "\n" \
-"1: ldxr %w0, %2\n" \
-" " #asm_op " %w0, %w0, %w3\n" \
-" stxr %w1, %w0, %2\n" \
-" cbnz %w1, 1b" \
- : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
- : "Ir" (i)); \
-} \
-
-#define ATOMIC_OP_RETURN(op, asm_op) \
-static inline int atomic_##op##_return(int i, atomic_t *v) \
-{ \
- unsigned long tmp; \
- int result; \
- \
- asm volatile("// atomic_" #op "_return\n" \
-"1: ldxr %w0, %2\n" \
-" " #asm_op " %w0, %w0, %w3\n" \
-" stlxr %w1, %w0, %2\n" \
-" cbnz %w1, 1b" \
- : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
- : "Ir" (i) \
- : "memory"); \
- \
- smp_mb(); \
- return result; \
-}
-
-#define ATOMIC_OPS(op, asm_op) \
- ATOMIC_OP(op, asm_op) \
- ATOMIC_OP_RETURN(op, asm_op)
-
-ATOMIC_OPS(add, add)
-ATOMIC_OPS(sub, sub)
-
-#undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
-#undef ATOMIC_OP
-
-static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
-{
- unsigned long tmp;
- int oldval;
-
- smp_mb();
-
- asm volatile("// atomic_cmpxchg\n"
-"1: ldxr %w1, %2\n"
-" cmp %w1, %w3\n"
-" b.ne 2f\n"
-" stxr %w0, %w4, %2\n"
-" cbnz %w0, 1b\n"
-"2:"
- : "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
- : "Ir" (old), "r" (new)
- : "cc");
-
- smp_mb();
- return oldval;
-}
-
#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
static inline int __atomic_add_unless(atomic_t *v, int a, int u)
@@ -142,95 +71,8 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
#define atomic64_read(v) ACCESS_ONCE((v)->counter)
#define atomic64_set(v,i) (((v)->counter) = (i))
-#define ATOMIC64_OP(op, asm_op) \
-static inline void atomic64_##op(long i, atomic64_t *v) \
-{ \
- long result; \
- unsigned long tmp; \
- \
- asm volatile("// atomic64_" #op "\n" \
-"1: ldxr %0, %2\n" \
-" " #asm_op " %0, %0, %3\n" \
-" stxr %w1, %0, %2\n" \
-" cbnz %w1, 1b" \
- : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
- : "Ir" (i)); \
-} \
-
-#define ATOMIC64_OP_RETURN(op, asm_op) \
-static inline long atomic64_##op##_return(long i, atomic64_t *v) \
-{ \
- long result; \
- unsigned long tmp; \
- \
- asm volatile("// atomic64_" #op "_return\n" \
-"1: ldxr %0, %2\n" \
-" " #asm_op " %0, %0, %3\n" \
-" stlxr %w1, %0, %2\n" \
-" cbnz %w1, 1b" \
- : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
- : "Ir" (i) \
- : "memory"); \
- \
- smp_mb(); \
- return result; \
-}
-
-#define ATOMIC64_OPS(op, asm_op) \
- ATOMIC64_OP(op, asm_op) \
- ATOMIC64_OP_RETURN(op, asm_op)
-
-ATOMIC64_OPS(add, add)
-ATOMIC64_OPS(sub, sub)
-
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
-#undef ATOMIC64_OP
-
-static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
-{
- long oldval;
- unsigned long res;
-
- smp_mb();
-
- asm volatile("// atomic64_cmpxchg\n"
-"1: ldxr %1, %2\n"
-" cmp %1, %3\n"
-" b.ne 2f\n"
-" stxr %w0, %4, %2\n"
-" cbnz %w0, 1b\n"
-"2:"
- : "=&r" (res), "=&r" (oldval), "+Q" (ptr->counter)
- : "Ir" (old), "r" (new)
- : "cc");
-
- smp_mb();
- return oldval;
-}
-
#define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
-static inline long atomic64_dec_if_positive(atomic64_t *v)
-{
- long result;
- unsigned long tmp;
-
- asm volatile("// atomic64_dec_if_positive\n"
-"1: ldxr %0, %2\n"
-" subs %0, %0, #1\n"
-" b.mi 2f\n"
-" stlxr %w1, %0, %2\n"
-" cbnz %w1, 1b\n"
-" dmb ish\n"
-"2:"
- : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
- :
- : "cc", "memory");
-
- return result;
-}
-
static inline int atomic64_add_unless(atomic64_t *v, long a, long u)
{
long c, old;
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
new file mode 100644
index 000000000000..aef70f2d4cb8
--- /dev/null
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -0,0 +1,205 @@
+/*
+ * Based on arch/arm/include/asm/atomic.h
+ *
+ * Copyright (C) 1996 Russell King.
+ * Copyright (C) 2002 Deep Blue Solutions Ltd.
+ * Copyright (C) 2012 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ASM_ATOMIC_LL_SC_H
+#define __ASM_ATOMIC_LL_SC_H
+
+/*
+ * AArch64 UP and SMP safe atomic ops. We use load exclusive and
+ * store exclusive to ensure that these are atomic. We may loop
+ * to ensure that the update happens.
+ *
+ * NOTE: these functions do *not* follow the PCS and must explicitly
+ * save any clobbered registers other than x0 (regardless of return
+ * value). This is achieved through -fcall-saved-* compiler flags for
+ * this file, which unfortunately don't work on a per-function basis
+ * (the optimize attribute silently ignores these options).
+ */
+
+#ifndef __LL_SC_INLINE
+#define __LL_SC_INLINE static inline
+#endif
+
+#ifndef __LL_SC_PREFIX
+#define __LL_SC_PREFIX(x) x
+#endif
+
+#define ATOMIC_OP(op, asm_op) \
+__LL_SC_INLINE void \
+__LL_SC_PREFIX(atomic_##op(int i, atomic_t *v)) \
+{ \
+ unsigned long tmp; \
+ int result; \
+ \
+ asm volatile("// atomic_" #op "\n" \
+"1: ldxr %w0, %2\n" \
+" " #asm_op " %w0, %w0, %w3\n" \
+" stxr %w1, %w0, %2\n" \
+" cbnz %w1, 1b" \
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
+ : "Ir" (i)); \
+} \
+
+#define ATOMIC_OP_RETURN(op, asm_op) \
+__LL_SC_INLINE int \
+__LL_SC_PREFIX(atomic_##op##_return(int i, atomic_t *v)) \
+{ \
+ unsigned long tmp; \
+ int result; \
+ \
+ asm volatile("// atomic_" #op "_return\n" \
+"1: ldxr %w0, %2\n" \
+" " #asm_op " %w0, %w0, %w3\n" \
+" stlxr %w1, %w0, %2\n" \
+" cbnz %w1, 1b" \
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
+ : "Ir" (i) \
+ : "memory"); \
+ \
+ smp_mb(); \
+ return result; \
+}
+
+#define ATOMIC_OPS(op, asm_op) \
+ ATOMIC_OP(op, asm_op) \
+ ATOMIC_OP_RETURN(op, asm_op)
+
+ATOMIC_OPS(add, add)
+ATOMIC_OPS(sub, sub)
+
+#undef ATOMIC_OPS
+#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP
+
+__LL_SC_INLINE int
+__LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
+{
+ unsigned long tmp;
+ int oldval;
+
+ smp_mb();
+
+ asm volatile("// atomic_cmpxchg\n"
+"1: ldxr %w1, %2\n"
+" cmp %w1, %w3\n"
+" b.ne 2f\n"
+" stxr %w0, %w4, %2\n"
+" cbnz %w0, 1b\n"
+"2:"
+ : "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
+ : "Ir" (old), "r" (new)
+ : "cc");
+
+ smp_mb();
+ return oldval;
+}
+
+#define ATOMIC64_OP(op, asm_op) \
+__LL_SC_INLINE void \
+__LL_SC_PREFIX(atomic64_##op(long i, atomic64_t *v)) \
+{ \
+ long result; \
+ unsigned long tmp; \
+ \
+ asm volatile("// atomic64_" #op "\n" \
+"1: ldxr %0, %2\n" \
+" " #asm_op " %0, %0, %3\n" \
+" stxr %w1, %0, %2\n" \
+" cbnz %w1, 1b" \
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
+ : "Ir" (i)); \
+} \
+
+#define ATOMIC64_OP_RETURN(op, asm_op) \
+__LL_SC_INLINE long \
+__LL_SC_PREFIX(atomic64_##op##_return(long i, atomic64_t *v)) \
+{ \
+ long result; \
+ unsigned long tmp; \
+ \
+ asm volatile("// atomic64_" #op "_return\n" \
+"1: ldxr %0, %2\n" \
+" " #asm_op " %0, %0, %3\n" \
+" stlxr %w1, %0, %2\n" \
+" cbnz %w1, 1b" \
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
+ : "Ir" (i) \
+ : "memory"); \
+ \
+ smp_mb(); \
+ return result; \
+}
+
+#define ATOMIC64_OPS(op, asm_op) \
+ ATOMIC64_OP(op, asm_op) \
+ ATOMIC64_OP_RETURN(op, asm_op)
+
+ATOMIC64_OPS(add, add)
+ATOMIC64_OPS(sub, sub)
+
+#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN
+#undef ATOMIC64_OP
+
+__LL_SC_INLINE long
+__LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
+{
+ long oldval;
+ unsigned long res;
+
+ smp_mb();
+
+ asm volatile("// atomic64_cmpxchg\n"
+"1: ldxr %1, %2\n"
+" cmp %1, %3\n"
+" b.ne 2f\n"
+" stxr %w0, %4, %2\n"
+" cbnz %w0, 1b\n"
+"2:"
+ : "=&r" (res), "=&r" (oldval), "+Q" (ptr->counter)
+ : "Ir" (old), "r" (new)
+ : "cc");
+
+ smp_mb();
+ return oldval;
+}
+
+__LL_SC_INLINE long
+__LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
+{
+ long result;
+ unsigned long tmp;
+
+ asm volatile("// atomic64_dec_if_positive\n"
+"1: ldxr %0, %2\n"
+" subs %0, %0, #1\n"
+" b.mi 2f\n"
+" stlxr %w1, %0, %2\n"
+" cbnz %w1, 1b\n"
+" dmb ish\n"
+"2:"
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
+ :
+ : "cc", "memory");
+
+ return result;
+}
+
+#endif /* __ASM_ATOMIC_LL_SC_H */
--
2.1.4
* [PATCH 03/18] arm64: elf: advertise 8.1 atomic instructions as new hwcap
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
The ARM v8.1 architecture introduces new atomic instructions to the A64
instruction set for things like cmpxchg, so advertise their availability
to userspace using a hwcap.
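
Userspace can then test for the feature in the usual way; a minimal,
hypothetical example (assuming a libc that provides getauxval()):

#include <stdio.h>
#include <sys/auxv.h>			/* getauxval(), AT_HWCAP */

#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS	(1 << 8)	/* matches the new uapi definition */
#endif

int main(void)
{
	if (getauxval(AT_HWCAP) & HWCAP_ATOMICS)
		printf("CPU implements the ARMv8.1 LSE atomics\n");
	else
		printf("LSE atomics not available\n");
	return 0;
}
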
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/setup.c | 14 ++++++++++++++
2 files changed, 15 insertions(+)
diff --git a/arch/arm64/include/uapi/asm/hwcap.h b/arch/arm64/include/uapi/asm/hwcap.h
index 73cf0f54d57c..361c8a8ef55f 100644
--- a/arch/arm64/include/uapi/asm/hwcap.h
+++ b/arch/arm64/include/uapi/asm/hwcap.h
@@ -27,5 +27,6 @@
#define HWCAP_SHA1 (1 << 5)
#define HWCAP_SHA2 (1 << 6)
#define HWCAP_CRC32 (1 << 7)
+#define HWCAP_ATOMICS (1 << 8)
#endif /* _UAPI__ASM_HWCAP_H */
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index f3067d4d4e35..c7fd2c946374 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -280,6 +280,19 @@ static void __init setup_processor(void)
if (block && !(block & 0x8))
elf_hwcap |= HWCAP_CRC32;
+ block = (features >> 20) & 0xf;
+ if (!(block & 0x8)) {
+ switch (block) {
+ default:
+ case 2:
+ elf_hwcap |= HWCAP_ATOMICS;
+ case 1:
+ /* RESERVED */
+ case 0:
+ break;
+ }
+ }
+
#ifdef CONFIG_COMPAT
/*
* ID_ISAR5_EL1 carries similar information as above, but pertaining to
@@ -456,6 +469,7 @@ static const char *hwcap_str[] = {
"sha1",
"sha2",
"crc32",
+ "atomics",
NULL
};
--
2.1.4
* [PATCH 04/18] arm64: alternatives: add cpu feature for lse atomics
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
Add a CPU feature for the LSE atomic instructions, so that they can be
patched in at runtime when we detect that they are supported.
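
The capability is just a bit in the kernel's capability bitmap; a
sketch of how it can be consumed elsewhere (cpus_have_cap() is the
assumed generic test helper from <asm/cpufeature.h>, and the
alternatives framework in patch 6 keys off the same
ARM64_CPU_FEAT_LSE_ATOMICS number):

#include <linux/init.h>
#include <linux/printk.h>
#include <asm/cpufeature.h>

/* Hypothetical example: report the capability after setup_processor()
 * has set it via cpus_set_cap(), as in the hunk below. */
static void __init report_lse_atomics(void)
{
	if (cpus_have_cap(ARM64_CPU_FEAT_LSE_ATOMICS))
		pr_info("CPU supports the LSE atomic instructions\n");
}
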
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/cpufeature.h | 3 ++-
arch/arm64/kernel/setup.c | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index eb09f1ee8036..d58db9d3b4fa 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -25,8 +25,9 @@
#define ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE 1
#define ARM64_WORKAROUND_845719 2
#define ARM64_HAS_SYSREG_GIC_CPUIF 3
+#define ARM64_CPU_FEAT_LSE_ATOMICS 4
-#define ARM64_NCAPS 4
+#define ARM64_NCAPS 5
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index c7fd2c946374..5b170df96aaf 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -286,6 +286,7 @@ static void __init setup_processor(void)
default:
case 2:
elf_hwcap |= HWCAP_ATOMICS;
+ cpus_set_cap(ARM64_CPU_FEAT_LSE_ATOMICS);
case 1:
/* RESERVED */
case 0:
--
2.1.4
* [PATCH 05/18] arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
In order to patch in the new atomic instructions at runtime, we need to
generate wrappers around the out-of-line exclusive load/store atomics.
This patch adds a new Kconfig option, CONFIG_ARM64_LSE_ATOMICS, which
causes our atomic functions to branch to the out-of-line ll/sc
implementations. To avoid the register spill overhead of the PCS, the
out-of-line functions are compiled with specific compiler flags to
force out-of-line save/restore of any registers that are usually
caller-saved.
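
Concretely, the out-of-line copies are produced by re-including the
ll/sc header with the prefix/export hooks overridden; a commented
sketch of the mechanism (the real file is lib/atomic_ll_sc.c in the
diff below):

/*
 * Sketch of the out-of-line build. atomic_lse.h, pulled in via
 * <asm/atomic.h>, overrides the hooks so that re-including the ll/sc
 * header here emits real, exported functions rather than inlines:
 *
 *   #define __LL_SC_INLINE                     (drop 'static inline')
 *   #define __LL_SC_PREFIX(x)  __ll_sc_##x     (distinct symbol names)
 *   #define __LL_SC_EXPORT(x)  EXPORT_SYMBOL(__LL_SC_PREFIX(x))
 */
#include <asm/atomic.h>		/* defines the hooks via atomic_lse.h */
#define __ARM64_IN_ATOMIC_IMPL
#include <asm/atomic_ll_sc.h>	/* emits __ll_sc_atomic_add() and friends */
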
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/Kconfig | 12 +++
arch/arm64/include/asm/atomic.h | 9 ++
arch/arm64/include/asm/atomic_ll_sc.h | 19 +++-
arch/arm64/include/asm/atomic_lse.h | 181 ++++++++++++++++++++++++++++++++++
arch/arm64/lib/Makefile | 13 +++
arch/arm64/lib/atomic_ll_sc.c | 3 +
6 files changed, 235 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/include/asm/atomic_lse.h
create mode 100644 arch/arm64/lib/atomic_ll_sc.c
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0f6edb14b7e4..682782ab6936 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -664,6 +664,18 @@ config SETEND_EMULATION
If unsure, say Y
endif
+config ARM64_LSE_ATOMICS
+ bool "ARMv8.1 atomic instructions"
+ help
+ As part of the Large System Extensions, ARMv8.1 introduces new
+ atomic instructions that are designed specifically to scale in
+ very large systems.
+
+ Say Y here to make use of these instructions for the in-kernel
+ atomic routines. This incurs a small overhead on CPUs that do
+ not support these instructions and requires the kernel to be
+ built with binutils >= 2.25.
+
endmenu
menu "Boot options"
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 9467450a5c03..955cc14f3ce4 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -21,6 +21,7 @@
#define __ASM_ATOMIC_H
#include <linux/compiler.h>
+#include <linux/stringify.h>
#include <linux/types.h>
#include <asm/barrier.h>
@@ -30,7 +31,15 @@
#ifdef __KERNEL__
+#define __ARM64_IN_ATOMIC_IMPL
+
+#ifdef CONFIG_ARM64_LSE_ATOMICS
+#include <asm/atomic_lse.h>
+#else
#include <asm/atomic_ll_sc.h>
+#endif
+
+#undef __ARM64_IN_ATOMIC_IMPL
/*
* On ARM, ordinary assignment (str instruction) doesn't clear the local
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index aef70f2d4cb8..024b892dbc6a 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -21,6 +21,10 @@
#ifndef __ASM_ATOMIC_LL_SC_H
#define __ASM_ATOMIC_LL_SC_H
+#ifndef __ARM64_IN_ATOMIC_IMPL
+#error "please don't include this file directly"
+#endif
+
/*
* AArch64 UP and SMP safe atomic ops. We use load exclusive and
* store exclusive to ensure that these are atomic. We may loop
@@ -41,6 +45,10 @@
#define __LL_SC_PREFIX(x) x
#endif
+#ifndef __LL_SC_EXPORT
+#define __LL_SC_EXPORT(x)
+#endif
+
#define ATOMIC_OP(op, asm_op) \
__LL_SC_INLINE void \
__LL_SC_PREFIX(atomic_##op(int i, atomic_t *v)) \
@@ -56,6 +64,7 @@ __LL_SC_PREFIX(atomic_##op(int i, atomic_t *v)) \
: "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
: "Ir" (i)); \
} \
+__LL_SC_EXPORT(atomic_##op);
#define ATOMIC_OP_RETURN(op, asm_op) \
__LL_SC_INLINE int \
@@ -75,7 +84,8 @@ __LL_SC_PREFIX(atomic_##op##_return(int i, atomic_t *v)) \
\
smp_mb(); \
return result; \
-}
+} \
+__LL_SC_EXPORT(atomic_##op##_return);
#define ATOMIC_OPS(op, asm_op) \
ATOMIC_OP(op, asm_op) \
@@ -110,6 +120,7 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
smp_mb();
return oldval;
}
+__LL_SC_EXPORT(atomic_cmpxchg);
#define ATOMIC64_OP(op, asm_op) \
__LL_SC_INLINE void \
@@ -126,6 +137,7 @@ __LL_SC_PREFIX(atomic64_##op(long i, atomic64_t *v)) \
: "=&r" (result), "=&r" (tmp), "+Q" (v->counter) \
: "Ir" (i)); \
} \
+__LL_SC_EXPORT(atomic64_##op);
#define ATOMIC64_OP_RETURN(op, asm_op) \
__LL_SC_INLINE long \
@@ -145,7 +157,8 @@ __LL_SC_PREFIX(atomic64_##op##_return(long i, atomic64_t *v)) \
\
smp_mb(); \
return result; \
-}
+} \
+__LL_SC_EXPORT(atomic64_##op##_return);
#define ATOMIC64_OPS(op, asm_op) \
ATOMIC64_OP(op, asm_op) \
@@ -180,6 +193,7 @@ __LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
smp_mb();
return oldval;
}
+__LL_SC_EXPORT(atomic64_cmpxchg);
__LL_SC_INLINE long
__LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
@@ -201,5 +215,6 @@ __LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
return result;
}
+__LL_SC_EXPORT(atomic64_dec_if_positive);
#endif /* __ASM_ATOMIC_LL_SC_H */
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
new file mode 100644
index 000000000000..68ff1a8a7492
--- /dev/null
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -0,0 +1,181 @@
+/*
+ * Based on arch/arm/include/asm/atomic.h
+ *
+ * Copyright (C) 1996 Russell King.
+ * Copyright (C) 2002 Deep Blue Solutions Ltd.
+ * Copyright (C) 2012 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef __ASM_ATOMIC_LSE_H
+#define __ASM_ATOMIC_LSE_H
+
+#ifndef __ARM64_IN_ATOMIC_IMPL
+#error "please don't include this file directly"
+#endif
+
+/* Move the ll/sc atomics out-of-line */
+#define __LL_SC_INLINE
+#define __LL_SC_PREFIX(x) __ll_sc_##x
+#define __LL_SC_EXPORT(x) EXPORT_SYMBOL(__LL_SC_PREFIX(x))
+
+/* Macros for constructing calls to out-of-line ll/sc atomics */
+#define __LL_SC_SAVE_LR(r) "mov\t" #r ", x30\n"
+#define __LL_SC_RESTORE_LR(r) "mov\tx30, " #r "\n"
+#define __LL_SC_CALL(op) \
+ "bl\t" __stringify(__LL_SC_PREFIX(atomic_##op)) "\n"
+#define __LL_SC_CALL64(op) \
+ "bl\t" __stringify(__LL_SC_PREFIX(atomic64_##op)) "\n"
+
+#define ATOMIC_OP(op, asm_op) \
+static inline void atomic_##op(int i, atomic_t *v) \
+{ \
+ unsigned long lr; \
+ register int w0 asm ("w0") = i; \
+ register atomic_t *x1 asm ("x1") = v; \
+ \
+ asm volatile( \
+ __LL_SC_SAVE_LR(%0) \
+ __LL_SC_CALL(op) \
+ __LL_SC_RESTORE_LR(%0) \
+ : "=&r" (lr), "+r" (w0), "+Q" (v->counter) \
+ : "r" (x1)); \
+} \
+
+#define ATOMIC_OP_RETURN(op, asm_op) \
+static inline int atomic_##op##_return(int i, atomic_t *v) \
+{ \
+ unsigned long lr; \
+ register int w0 asm ("w0") = i; \
+ register atomic_t *x1 asm ("x1") = v; \
+ \
+ asm volatile( \
+ __LL_SC_SAVE_LR(%0) \
+ __LL_SC_CALL(op##_return) \
+ __LL_SC_RESTORE_LR(%0) \
+ : "=&r" (lr), "+r" (w0) \
+ : "r" (x1) \
+ : "memory"); \
+ \
+ return w0; \
+}
+
+#define ATOMIC_OPS(op, asm_op) \
+ ATOMIC_OP(op, asm_op) \
+ ATOMIC_OP_RETURN(op, asm_op)
+
+ATOMIC_OPS(add, add)
+ATOMIC_OPS(sub, sub)
+
+#undef ATOMIC_OPS
+#undef ATOMIC_OP_RETURN
+#undef ATOMIC_OP
+
+static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
+{
+ unsigned long lr;
+ register unsigned long x0 asm ("x0") = (unsigned long)ptr;
+ register int w1 asm ("w1") = old;
+ register int w2 asm ("w2") = new;
+
+ asm volatile(
+ __LL_SC_SAVE_LR(%0)
+ __LL_SC_CALL(cmpxchg)
+ __LL_SC_RESTORE_LR(%0)
+ : "=&r" (lr), "+r" (x0)
+ : "r" (w1), "r" (w2)
+ : "cc", "memory");
+
+ return x0;
+}
+
+#define ATOMIC64_OP(op, asm_op) \
+static inline void atomic64_##op(long i, atomic64_t *v) \
+{ \
+ unsigned long lr; \
+ register long x0 asm ("x0") = i; \
+ register atomic64_t *x1 asm ("x1") = v; \
+ \
+ asm volatile( \
+ __LL_SC_SAVE_LR(%0) \
+ __LL_SC_CALL64(op) \
+ __LL_SC_RESTORE_LR(%0) \
+ : "=&r" (lr), "+r" (x0), "+Q" (v->counter) \
+ : "r" (x1)); \
+} \
+
+#define ATOMIC64_OP_RETURN(op, asm_op) \
+static inline long atomic64_##op##_return(long i, atomic64_t *v) \
+{ \
+ unsigned long lr; \
+ register long x0 asm ("x0") = i; \
+ register atomic64_t *x1 asm ("x1") = v; \
+ \
+ asm volatile( \
+ __LL_SC_SAVE_LR(%0) \
+ __LL_SC_CALL64(op##_return) \
+ __LL_SC_RESTORE_LR(%0) \
+ : "=&r" (lr), "+r" (x0) \
+ : "r" (x1) \
+ : "memory"); \
+ \
+ return x0; \
+}
+
+#define ATOMIC64_OPS(op, asm_op) \
+ ATOMIC64_OP(op, asm_op) \
+ ATOMIC64_OP_RETURN(op, asm_op)
+
+ATOMIC64_OPS(add, add)
+ATOMIC64_OPS(sub, sub)
+
+#undef ATOMIC64_OPS
+#undef ATOMIC64_OP_RETURN
+#undef ATOMIC64_OP
+
+static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
+{
+ unsigned long lr;
+ register unsigned long x0 asm ("x0") = (unsigned long)ptr;
+ register long x1 asm ("x1") = old;
+ register long x2 asm ("x2") = new;
+
+ asm volatile(
+ __LL_SC_SAVE_LR(%0)
+ __LL_SC_CALL64(cmpxchg)
+ __LL_SC_RESTORE_LR(%0)
+ : "=&r" (lr), "+r" (x0)
+ : "r" (x1), "r" (x2)
+ : "cc", "memory");
+
+ return x0;
+}
+
+static inline long atomic64_dec_if_positive(atomic64_t *v)
+{
+ unsigned long lr;
+ register unsigned long x0 asm ("x0") = (unsigned long)v;
+
+ asm volatile(
+ __LL_SC_SAVE_LR(%0)
+ __LL_SC_CALL64(dec_if_positive)
+ __LL_SC_RESTORE_LR(%0)
+ : "=&r" (lr), "+r" (x0)
+ :
+ : "cc", "memory");
+
+ return x0;
+}
+
+#endif /* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index d98d3e39879e..1a811ecf71da 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,3 +3,16 @@ lib-y := bitops.o clear_user.o delay.o copy_from_user.o \
clear_page.o memchr.o memcpy.o memmove.o memset.o \
memcmp.o strcmp.o strncmp.o strlen.o strnlen.o \
strchr.o strrchr.o
+
+# Tell the compiler to treat all general purpose registers as
+# callee-saved, which allows for efficient runtime patching of the bl
+# instruction in the caller with an atomic instruction when supported by
+# the CPU. Result and argument registers are handled correctly, based on
+# the function prototype.
+lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
+CFLAGS_atomic_ll_sc.o := -fcall-used-x0 -ffixed-x1 -ffixed-x2 \
+ -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6 \
+ -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9 \
+ -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12 \
+ -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15 \
+ -fcall-saved-x16 -fcall-saved-x17 -fcall-saved-x18
diff --git a/arch/arm64/lib/atomic_ll_sc.c b/arch/arm64/lib/atomic_ll_sc.c
new file mode 100644
index 000000000000..b0c538b0da28
--- /dev/null
+++ b/arch/arm64/lib/atomic_ll_sc.c
@@ -0,0 +1,3 @@
+#include <asm/atomic.h>
+#define __ARM64_IN_ATOMIC_IMPL
+#include <asm/atomic_ll_sc.h>
--
2.1.4
* [PATCH 06/18] arm64: atomics: patch in lse instructions when supported by the CPU
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of atomic_t and atomic64_t
routines so that the call-site for the out-of-line ll/sc sequences is
patched with an LSE atomic instruction when we detect that
the CPU supports it.
If binutils is not recent enough to assemble the LSE instructions, then
the ll/sc sequences are inlined as though CONFIG_ARM64_LSE_ATOMICS=n.
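
Since the LL/SC fallback at each call site is three instructions (save
lr, branch-and-link, restore lr), the LSE replacement is padded with
nops so that both alternatives occupy the same space; a simplified,
de-macroed sketch of what atomic_add() becomes (see the full diff
below for the real version):

#include <linux/types.h>
#include <asm/lse.h>

/* Sketch of a runtime-patched call site: by default the three LL/SC
 * instructions run; on CPUs with LSE the boot-time patcher rewrites
 * them to nop/stadd/nop in place. */
static inline void sketch_atomic_add(int i, atomic_t *v)
{
	unsigned long tmp;
	register int w0 asm ("w0") = i;
	register atomic_t *x1 asm ("x1") = v;

	asm volatile(ARM64_LSE_ATOMIC_INSN(
	/* LL/SC: save lr, call the out-of-line routine, restore lr */
	"	mov	%[tmp], x30\n"
	"	bl	__ll_sc_atomic_add\n"
	"	mov	x30, %[tmp]\n",
	/* LSE: a single STADD, padded to the same length */
	"	nop\n"
	"	stadd	%w[i], %[v]\n"
	"	nop")
	: [tmp] "=&r" (tmp), [i] "+r" (w0), [v] "+Q" (v->counter)
	: "r" (x1));
}
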
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/Makefile | 13 +-
arch/arm64/include/asm/atomic.h | 4 +-
arch/arm64/include/asm/atomic_ll_sc.h | 12 --
arch/arm64/include/asm/atomic_lse.h | 274 ++++++++++++++++++++--------------
arch/arm64/include/asm/lse.h | 36 +++++
arch/arm64/kernel/setup.c | 3 +
6 files changed, 214 insertions(+), 128 deletions(-)
create mode 100644 arch/arm64/include/asm/lse.h
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 4d2a925998f9..fa23c0dc3e77 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -17,7 +17,18 @@ GZFLAGS :=-9
KBUILD_DEFCONFIG := defconfig
-KBUILD_CFLAGS += -mgeneral-regs-only
+# Check for binutils support for specific extensions
+lseinstr := $(call as-instr,.arch_extension lse,-DCONFIG_AS_LSE=1)
+
+ifeq ($(CONFIG_ARM64_LSE_ATOMICS), y)
+ ifeq ($(lseinstr),)
+$(warning LSE atomics not supported by binutils)
+ endif
+endif
+
+KBUILD_CFLAGS += -mgeneral-regs-only $(lseinstr)
+KBUILD_AFLAGS += $(lseinstr)
+
ifeq ($(CONFIG_CPU_BIG_ENDIAN), y)
KBUILD_CPPFLAGS += -mbig-endian
AS += -EB
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 955cc14f3ce4..cb53efa23f62 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -21,11 +21,11 @@
#define __ASM_ATOMIC_H
#include <linux/compiler.h>
-#include <linux/stringify.h>
#include <linux/types.h>
#include <asm/barrier.h>
#include <asm/cmpxchg.h>
+#include <asm/lse.h>
#define ATOMIC_INIT(i) { (i) }
@@ -33,7 +33,7 @@
#define __ARM64_IN_ATOMIC_IMPL
-#ifdef CONFIG_ARM64_LSE_ATOMICS
+#if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE)
#include <asm/atomic_lse.h>
#else
#include <asm/atomic_ll_sc.h>
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 024b892dbc6a..9cf298914ac3 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -37,18 +37,6 @@
* (the optimize attribute silently ignores these options).
*/
-#ifndef __LL_SC_INLINE
-#define __LL_SC_INLINE static inline
-#endif
-
-#ifndef __LL_SC_PREFIX
-#define __LL_SC_PREFIX(x) x
-#endif
-
-#ifndef __LL_SC_EXPORT
-#define __LL_SC_EXPORT(x)
-#endif
-
#define ATOMIC_OP(op, asm_op) \
__LL_SC_INLINE void \
__LL_SC_PREFIX(atomic_##op(int i, atomic_t *v)) \
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 68ff1a8a7492..d59780350514 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -25,138 +25,172 @@
#error "please don't include this file directly"
#endif
-/* Move the ll/sc atomics out-of-line */
-#define __LL_SC_INLINE
-#define __LL_SC_PREFIX(x) __ll_sc_##x
-#define __LL_SC_EXPORT(x) EXPORT_SYMBOL(__LL_SC_PREFIX(x))
-
-/* Macros for constructing calls to out-of-line ll/sc atomics */
-#define __LL_SC_SAVE_LR(r) "mov\t" #r ", x30\n"
-#define __LL_SC_RESTORE_LR(r) "mov\tx30, " #r "\n"
-#define __LL_SC_CALL(op) \
- "bl\t" __stringify(__LL_SC_PREFIX(atomic_##op)) "\n"
-#define __LL_SC_CALL64(op) \
- "bl\t" __stringify(__LL_SC_PREFIX(atomic64_##op)) "\n"
-
-#define ATOMIC_OP(op, asm_op) \
-static inline void atomic_##op(int i, atomic_t *v) \
-{ \
- unsigned long lr; \
- register int w0 asm ("w0") = i; \
- register atomic_t *x1 asm ("x1") = v; \
- \
- asm volatile( \
- __LL_SC_SAVE_LR(%0) \
- __LL_SC_CALL(op) \
- __LL_SC_RESTORE_LR(%0) \
- : "=&r" (lr), "+r" (w0), "+Q" (v->counter) \
- : "r" (x1)); \
-} \
-
-#define ATOMIC_OP_RETURN(op, asm_op) \
-static inline int atomic_##op##_return(int i, atomic_t *v) \
-{ \
- unsigned long lr; \
- register int w0 asm ("w0") = i; \
- register atomic_t *x1 asm ("x1") = v; \
- \
- asm volatile( \
- __LL_SC_SAVE_LR(%0) \
- __LL_SC_CALL(op##_return) \
- __LL_SC_RESTORE_LR(%0) \
- : "=&r" (lr), "+r" (w0) \
- : "r" (x1) \
- : "memory"); \
- \
- return w0; \
+#define __LL_SC_ATOMIC(op, tmp) \
+ __LL_SC_SAVE_LR(tmp) \
+ __LL_SC_CALL(atomic_##op) \
+ __LL_SC_RESTORE_LR(tmp)
+
+static inline void atomic_add(int i, atomic_t *v)
+{
+ unsigned long tmp;
+ register int w0 asm ("w0") = i;
+ register atomic_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(add, %[tmp]),
+ " nop\n"
+ " stadd %w[i], %[v]\n"
+ " nop")
+ : [tmp] "=&r" (tmp), [i] "+r" (w0), [v] "+Q" (v->counter)
+ : "r" (x1));
}
-#define ATOMIC_OPS(op, asm_op) \
- ATOMIC_OP(op, asm_op) \
- ATOMIC_OP_RETURN(op, asm_op)
+static inline int atomic_add_return(int i, atomic_t *v)
+{
+ unsigned long tmp;
+ register int w0 asm ("w0") = i;
+ register atomic_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(add_return, %[tmp]),
+ " nop\n"
+ " ldaddal %w[i], %w[tmp], %[v]\n"
+ " add %w[i], %w[i], %w[tmp]")
+ : [tmp] "=&r" (tmp), [i] "+r" (w0), [v] "+Q" (v->counter)
+ : "r" (x1)
+ : "memory");
+
+ return w0;
+}
-ATOMIC_OPS(add, add)
-ATOMIC_OPS(sub, sub)
+static inline void atomic_sub(int i, atomic_t *v)
+{
+ unsigned long tmp;
+ register int w0 asm ("w0") = i;
+ register atomic_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(sub, %[tmp]),
+ " neg %w[i], %w[i]\n"
+ " stadd %w[i], %[v]\n"
+ " nop")
+ : [tmp] "=&r" (tmp), [i] "+r" (w0), [v] "+Q" (v->counter)
+ : "r" (x1));
+}
-#undef ATOMIC_OPS
-#undef ATOMIC_OP_RETURN
-#undef ATOMIC_OP
+static inline int atomic_sub_return(int i, atomic_t *v)
+{
+ unsigned long tmp;
+ register int w0 asm ("w0") = i;
+ register atomic_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(sub_return, %[tmp]),
+ " neg %w[i], %w[i]\n"
+ " ldaddal %w[i], %w[tmp], %[v]\n"
+ " add %w[i], %w[i], %w[tmp]")
+ : [tmp] "=&r" (tmp), [i] "+r" (w0), [v] "+Q" (v->counter)
+ : "r" (x1)
+ : "memory");
+
+ return w0;
+}
static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
{
- unsigned long lr;
+ unsigned long tmp;
register unsigned long x0 asm ("x0") = (unsigned long)ptr;
register int w1 asm ("w1") = old;
register int w2 asm ("w2") = new;
- asm volatile(
- __LL_SC_SAVE_LR(%0)
- __LL_SC_CALL(cmpxchg)
- __LL_SC_RESTORE_LR(%0)
- : "=&r" (lr), "+r" (x0)
- : "r" (w1), "r" (w2)
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(cmpxchg, %[tmp]),
+ " mov %w[tmp], %w[old]\n"
+ " casal %w[tmp], %w[new], %[v]\n"
+ " mov %w[ret], %w[tmp]")
+ : [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
+ : [old] "r" (w1), [new] "r" (w2)
: "cc", "memory");
return x0;
}
-#define ATOMIC64_OP(op, asm_op) \
-static inline void atomic64_##op(long i, atomic64_t *v) \
-{ \
- unsigned long lr; \
- register long x0 asm ("x0") = i; \
- register atomic64_t *x1 asm ("x1") = v; \
- \
- asm volatile( \
- __LL_SC_SAVE_LR(%0) \
- __LL_SC_CALL64(op) \
- __LL_SC_RESTORE_LR(%0) \
- : "=&r" (lr), "+r" (x0), "+Q" (v->counter) \
- : "r" (x1)); \
-} \
-
-#define ATOMIC64_OP_RETURN(op, asm_op) \
-static inline long atomic64_##op##_return(long i, atomic64_t *v) \
-{ \
- unsigned long lr; \
- register long x0 asm ("x0") = i; \
- register atomic64_t *x1 asm ("x1") = v; \
- \
- asm volatile( \
- __LL_SC_SAVE_LR(%0) \
- __LL_SC_CALL64(op##_return) \
- __LL_SC_RESTORE_LR(%0) \
- : "=&r" (lr), "+r" (x0) \
- : "r" (x1) \
- : "memory"); \
- \
- return x0; \
+#undef __LL_SC_ATOMIC
+
+#define __LL_SC_ATOMIC64(op, tmp) \
+ __LL_SC_SAVE_LR(tmp) \
+ __LL_SC_CALL(atomic64_##op) \
+ __LL_SC_RESTORE_LR(tmp)
+
+static inline void atomic64_add(long i, atomic64_t *v)
+{
+ unsigned long tmp;
+ register long x0 asm ("x0") = i;
+ register atomic64_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(add, %[tmp]),
+ " nop\n"
+ " stadd %[i], %[v]\n"
+ " nop")
+ : [tmp] "=&r" (tmp), [i] "+r" (x0), [v] "+Q" (v->counter)
+ : "r" (x1));
}
-#define ATOMIC64_OPS(op, asm_op) \
- ATOMIC64_OP(op, asm_op) \
- ATOMIC64_OP_RETURN(op, asm_op)
+static inline long atomic64_add_return(long i, atomic64_t *v)
+{
+ unsigned long tmp;
+ register long x0 asm ("x0") = i;
+ register atomic64_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(add_return, %[tmp]),
+ " nop\n"
+ " ldaddal %[i], %[tmp], %[v]\n"
+ " add %[i], %[i], %[tmp]")
+ : [tmp] "=&r" (tmp), [i] "+r" (x0), [v] "+Q" (v->counter)
+ : "r" (x1)
+ : "memory");
-ATOMIC64_OPS(add, add)
-ATOMIC64_OPS(sub, sub)
+ return x0;
+}
+
+static inline void atomic64_sub(long i, atomic64_t *v)
+{
+ unsigned long tmp;
+ register long x0 asm ("x0") = i;
+ register atomic64_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(sub, %[tmp]),
+ " neg %[i], %[i]\n"
+ " stadd %[i], %[v]\n"
+ " nop")
+ : [tmp] "=&r" (tmp), [i] "+r" (x0), [v] "+Q" (v->counter)
+ : "r" (x1));
+}
-#undef ATOMIC64_OPS
-#undef ATOMIC64_OP_RETURN
-#undef ATOMIC64_OP
+static inline long atomic64_sub_return(long i, atomic64_t *v)
+{
+ unsigned long tmp;
+ register long x0 asm ("x0") = i;
+ register atomic64_t *x1 asm ("x1") = v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(sub_return, %[tmp]),
+ " neg %[i], %[i]\n"
+ " ldaddal %[i], %[tmp], %[v]\n"
+ " add %[i], %[i], %[tmp]")
+ : [tmp] "=&r" (tmp), [i] "+r" (x0), [v] "+Q" (v->counter)
+ : "r" (x1)
+ : "memory");
+ return x0;
+}
static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
{
- unsigned long lr;
+ unsigned long tmp;
register unsigned long x0 asm ("x0") = (unsigned long)ptr;
register long x1 asm ("x1") = old;
register long x2 asm ("x2") = new;
- asm volatile(
- __LL_SC_SAVE_LR(%0)
- __LL_SC_CALL64(cmpxchg)
- __LL_SC_RESTORE_LR(%0)
- : "=&r" (lr), "+r" (x0)
- : "r" (x1), "r" (x2)
+ asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(cmpxchg, %[tmp]),
+ " mov %[tmp], %[old]\n"
+ " casal %[tmp], %[new], %[v]\n"
+ " mov %[ret], %[tmp]")
+ : [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
+ : [old] "r" (x1), [new] "r" (x2)
: "cc", "memory");
return x0;
@@ -164,18 +198,32 @@ static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
static inline long atomic64_dec_if_positive(atomic64_t *v)
{
- unsigned long lr;
- register unsigned long x0 asm ("x0") = (unsigned long)v;
-
- asm volatile(
- __LL_SC_SAVE_LR(%0)
- __LL_SC_CALL64(dec_if_positive)
- __LL_SC_RESTORE_LR(%0)
- : "=&r" (lr), "+r" (x0)
+ unsigned long tmp;
+ register long x0 asm ("x0") = (long)v;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " nop\n"
+ __LL_SC_ATOMIC64(dec_if_positive, %[tmp])
+ " nop\n"
+ " nop\n"
+ " nop",
+ /* LSE atomics */
+ "1: ldr %[tmp], %[v]\n"
+ " subs %[ret], %[tmp], #1\n"
+ " b.mi 2f\n"
+ " casal %[tmp], %[ret], %[v]\n"
+ " sub %[tmp], %[tmp], #1\n"
+ " sub %[tmp], %[tmp], %[ret]\n"
+ " cbnz %[tmp], 1b\n"
+ "2:")
+ : [ret] "+&r" (x0), [tmp] "=&r" (tmp), [v] "+Q" (v->counter)
:
: "cc", "memory");
return x0;
}
+#undef __LL_SC_ATOMIC64
+
#endif /* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
new file mode 100644
index 000000000000..c4e88334c07d
--- /dev/null
+++ b/arch/arm64/include/asm/lse.h
@@ -0,0 +1,36 @@
+#ifndef __ASM_LSE_H
+#define __ASM_LSE_H
+
+#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS)
+
+#include <linux/stringify.h>
+
+#include <asm/alternative.h>
+#include <asm/cpufeature.h>
+
+__asm__(".arch_extension lse");
+
+/* Move the ll/sc atomics out-of-line */
+#define __LL_SC_INLINE
+#define __LL_SC_PREFIX(x) __ll_sc_##x
+#define __LL_SC_EXPORT(x) EXPORT_SYMBOL(__LL_SC_PREFIX(x))
+
+/* Macros for constructing calls to out-of-line ll/sc atomics */
+#define __LL_SC_SAVE_LR(r) "mov\t" #r ", x30\n"
+#define __LL_SC_RESTORE_LR(r) "mov\tx30, " #r "\n"
+#define __LL_SC_CALL(op) "bl\t" __stringify(__LL_SC_PREFIX(op)) "\n"
+
+/* In-line patching at runtime */
+#define ARM64_LSE_ATOMIC_INSN(llsc, lse) \
+ ALTERNATIVE(llsc, lse, ARM64_CPU_FEAT_LSE_ATOMICS)
+
+#else
+
+#define __LL_SC_INLINE static inline
+#define __LL_SC_PREFIX(x) x
+#define __LL_SC_EXPORT(x)
+
+#define ARM64_LSE_ATOMIC_INSN(llsc, lse) llsc
+
+#endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
+#endif /* __ASM_LSE_H */
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 5b170df96aaf..930a353b868c 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -287,6 +287,9 @@ static void __init setup_processor(void)
case 2:
elf_hwcap |= HWCAP_ATOMICS;
cpus_set_cap(ARM64_CPU_FEAT_LSE_ATOMICS);
+ if (IS_ENABLED(CONFIG_AS_LSE) &&
+ IS_ENABLED(CONFIG_ARM64_LSE_ATOMICS))
+ pr_info("LSE atomics supported\n");
case 1:
/* RESERVED */
case 0:
--
2.1.4
* [PATCH 07/18] arm64: locks: patch in lse instructions when supported by the CPU
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of our locking functions so that
LSE atomic instructions are used for spinlocks and rwlocks.
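
For the ticket spinlock, the LSE version amounts to grabbing a ticket
with a single atomic add (LDADDA) and then spinning until the owner
field catches up; a C-level sketch of the idea (illustrative names
only -- the real arch_spin_lock() below does this in assembly on the
combined owner/next word, using the wfe/sevl idiom instead of a busy
loop):

/* Conceptual sketch of a ticket lock acquire */
struct sketch_ticket_lock {
	unsigned short owner;	/* ticket currently being served */
	unsigned short next;	/* next ticket to hand out */
};

static void sketch_ticket_lock_acquire(struct sketch_ticket_lock *lock)
{
	/* atomically take a ticket: fetch-and-increment 'next' */
	unsigned short ticket =
		__atomic_fetch_add(&lock->next, 1, __ATOMIC_ACQUIRE);

	/* wait until it is our turn */
	while (__atomic_load_n(&lock->owner, __ATOMIC_ACQUIRE) != ticket)
		;
}
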
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/spinlock.h | 132 +++++++++++++++++++++++++++++---------
1 file changed, 103 insertions(+), 29 deletions(-)
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cee128732435..7a1e852263be 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -16,6 +16,7 @@
#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H
+#include <asm/lse.h>
#include <asm/spinlock_types.h>
#include <asm/processor.h>
@@ -38,11 +39,21 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
asm volatile(
/* Atomically increment the next ticket. */
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
" prfm pstl1strm, %3\n"
"1: ldaxr %w0, %3\n"
" add %w1, %w0, %w5\n"
" stxr %w2, %w1, %3\n"
-" cbnz %w2, 1b\n"
+" cbnz %w2, 1b\n",
+ /* LSE atomics */
+" mov %w2, %w5\n"
+" ldadda %w2, %w0, %3\n"
+" nop\n"
+" nop\n"
+" nop\n"
+ )
+
/* Did we get the lock? */
" eor %w1, %w0, %w0, ror #16\n"
" cbz %w1, 3f\n"
@@ -67,15 +78,25 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
unsigned int tmp;
arch_spinlock_t lockval;
- asm volatile(
-" prfm pstl1strm, %2\n"
-"1: ldaxr %w0, %2\n"
-" eor %w1, %w0, %w0, ror #16\n"
-" cbnz %w1, 2f\n"
-" add %w0, %w0, %3\n"
-" stxr %w1, %w0, %2\n"
-" cbnz %w1, 1b\n"
-"2:"
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " prfm pstl1strm, %2\n"
+ "1: ldaxr %w0, %2\n"
+ " eor %w1, %w0, %w0, ror #16\n"
+ " cbnz %w1, 2f\n"
+ " add %w0, %w0, %3\n"
+ " stxr %w1, %w0, %2\n"
+ " cbnz %w1, 1b\n"
+ "2:",
+ /* LSE atomics */
+ " ldar %w0, %2\n"
+ " eor %w1, %w0, %w0, ror #16\n"
+ " cbnz %w1, 1f\n"
+ " add %w1, %w0, %3\n"
+ " casa %w0, %w1, %2\n"
+ " and %w1, %w1, #0xffff\n"
+ " eor %w1, %w1, %w0, lsr #16\n"
+ "1:")
: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
: "I" (1 << TICKET_SHIFT)
: "memory");
@@ -85,10 +106,19 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
- asm volatile(
-" stlrh %w1, %0\n"
- : "=Q" (lock->owner)
- : "r" (lock->owner + 1)
+ unsigned long tmp;
+
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " ldr %w1, %0\n"
+ " add %w1, %w1, #1\n"
+ " stlrh %w1, %0",
+ /* LSE atomics */
+ " mov %w1, #1\n"
+ " nop\n"
+ " staddlh %w1, %0")
+ : "=Q" (lock->owner), "=&r" (tmp)
+ :
: "memory");
}
@@ -125,11 +155,19 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
asm volatile(
" sevl\n"
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: wfe\n"
"2: ldaxr %w0, %1\n"
" cbnz %w0, 1b\n"
" stxr %w0, %w2, %1\n"
- " cbnz %w0, 2b\n"
+ " cbnz %w0, 2b",
+ /* LSE atomics */
+ "1: wfe\n"
+ " mov %w0, wzr\n"
+ " casa %w0, %w2, %1\n"
+ " nop\n"
+ " cbnz %w0, 1b")
: "=&r" (tmp), "+Q" (rw->lock)
: "r" (0x80000000)
: "memory");
@@ -139,11 +177,16 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
{
unsigned int tmp;
- asm volatile(
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
" ldaxr %w0, %1\n"
" cbnz %w0, 1f\n"
" stxr %w0, %w2, %1\n"
- "1:\n"
+ "1:",
+ /* LSE atomics */
+ " mov %w0, wzr\n"
+ " casa %w0, %w2, %1\n"
+ " nop")
: "=&r" (tmp), "+Q" (rw->lock)
: "r" (0x80000000)
: "memory");
@@ -153,9 +196,10 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
static inline void arch_write_unlock(arch_rwlock_t *rw)
{
- asm volatile(
- " stlr %w1, %0\n"
- : "=Q" (rw->lock) : "r" (0) : "memory");
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ " stlr wzr, %0",
+ " swpl wzr, wzr, %0")
+ : "=Q" (rw->lock) :: "memory");
}
/* write_can_lock - would write_trylock() succeed? */
@@ -172,6 +216,10 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
*
* The memory barriers are implicit with the load-acquire and store-release
* instructions.
+ *
+ * Note that in UNDEFINED cases, such as unlocking a lock twice, the LL/SC
+ * and LSE implementations may exhibit different behaviour (although this
+ * will have no effect on lockdep).
*/
static inline void arch_read_lock(arch_rwlock_t *rw)
{
@@ -179,26 +227,43 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
asm volatile(
" sevl\n"
+ ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: wfe\n"
"2: ldaxr %w0, %2\n"
" add %w0, %w0, #1\n"
" tbnz %w0, #31, 1b\n"
" stxr %w1, %w0, %2\n"
- " cbnz %w1, 2b\n"
+ " nop\n"
+ " cbnz %w1, 2b",
+ /* LSE atomics */
+ "1: wfe\n"
+ "2: ldr %w0, %2\n"
+ " adds %w1, %w0, #1\n"
+ " tbnz %w1, #31, 1b\n"
+ " casa %w0, %w1, %2\n"
+ " sbc %w0, %w1, %w0\n"
+ " cbnz %w0, 2b")
: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
:
- : "memory");
+ : "cc", "memory");
}
static inline void arch_read_unlock(arch_rwlock_t *rw)
{
unsigned int tmp, tmp2;
- asm volatile(
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: ldxr %w0, %2\n"
" sub %w0, %w0, #1\n"
" stlxr %w1, %w0, %2\n"
- " cbnz %w1, 1b\n"
+ " cbnz %w1, 1b",
+ /* LSE atomics */
+ " movn %w0, #0\n"
+ " nop\n"
+ " nop\n"
+ " staddl %w0, %2")
: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
:
: "memory");
@@ -206,17 +271,26 @@ static inline void arch_read_unlock(arch_rwlock_t *rw)
static inline int arch_read_trylock(arch_rwlock_t *rw)
{
- unsigned int tmp, tmp2 = 1;
+ unsigned int tmp, tmp2;
- asm volatile(
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
+ " mov %w1, #1\n"
" ldaxr %w0, %2\n"
" add %w0, %w0, #1\n"
" tbnz %w0, #31, 1f\n"
" stxr %w1, %w0, %2\n"
- "1:\n"
- : "=&r" (tmp), "+r" (tmp2), "+Q" (rw->lock)
+ "1:",
+ /* LSE atomics */
+ " ldr %w0, %2\n"
+ " adds %w1, %w0, #1\n"
+ " tbnz %w1, #31, 1f\n"
+ " casa %w0, %w1, %2\n"
+ " sbc %w1, %w1, %w0\n"
+ "1:")
+ : "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
:
- : "memory");
+ : "cc", "memory");
return !tmp2;
}
--
2.1.4
* [PATCH 08/18] arm64: bitops: patch in lse instructions when supported by the CPU
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of our bitops functions so that
LSE atomic instructions are used instead.
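
Each bitop splits the bit number into a 64-bit word offset and a bit
offset within that word, builds a mask and then performs the atomic
operation on that word (STSET/STCLR/STEOR, or the LDSETAL/LDCLRAL/
LDEORAL forms when the old value is needed); a C-level sketch of the
address arithmetic done at the top of each routine in bitops.S below
(illustrative helper name):

/* Sketch of what the first few instructions of each bitop compute:
 * bits 5:0 of 'nr' select the bit, the remaining bits select the word. */
static unsigned long *sketch_bit_word(unsigned int nr, unsigned long *addr,
				      unsigned long *mask)
{
	*mask = 1UL << (nr & 63);	/* lsl  x3, x2, x3         */
	return addr + (nr >> 6);	/* add  x1, x1, x0, lsr #3 */
}
/* set_bit() is then an atomic OR of *mask into that word, which is
 * exactly what a single STSET instruction provides. */
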
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/lse.h | 26 ++++++++++++++++++++++++--
arch/arm64/lib/bitops.S | 43 ++++++++++++++++++++++++-------------------
2 files changed, 48 insertions(+), 21 deletions(-)
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index c4e88334c07d..1d5190be5ecd 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -4,9 +4,21 @@
#if defined(CONFIG_AS_LSE) && defined(CONFIG_ARM64_LSE_ATOMICS)
#include <linux/stringify.h>
+#include <asm/cpufeature.h>
+
+#ifdef __ASSEMBLER__
+
+#include <asm/alternative-asm.h>
+
+.arch_extension lse
+
+.macro alt_lse, llsc, lse
+ alternative_insn "\llsc", "\lse", ARM64_CPU_FEAT_LSE_ATOMICS
+.endm
+
+#else /* __ASSEMBLER__ */
#include <asm/alternative.h>
-#include <asm/cpufeature.h>
__asm__(".arch_extension lse");
@@ -24,7 +36,16 @@ __asm__(".arch_extension lse");
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) \
ALTERNATIVE(llsc, lse, ARM64_CPU_FEAT_LSE_ATOMICS)
-#else
+#endif /* __ASSEMBLER__ */
+#else /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
+
+#ifdef __ASSEMBLER__
+
+.macro alt_lse, llsc, lse
+ \llsc
+.endm
+
+#else /* __ASSEMBLER__ */
#define __LL_SC_INLINE static inline
#define __LL_SC_PREFIX(x) x
@@ -32,5 +53,6 @@ __asm__(".arch_extension lse");
#define ARM64_LSE_ATOMIC_INSN(llsc, lse) llsc
+#endif /* __ASSEMBLER__ */
#endif /* CONFIG_AS_LSE && CONFIG_ARM64_LSE_ATOMICS */
#endif /* __ASM_LSE_H */
diff --git a/arch/arm64/lib/bitops.S b/arch/arm64/lib/bitops.S
index 7dac371cc9a2..bc18457c2bba 100644
--- a/arch/arm64/lib/bitops.S
+++ b/arch/arm64/lib/bitops.S
@@ -18,52 +18,57 @@
#include <linux/linkage.h>
#include <asm/assembler.h>
+#include <asm/lse.h>
/*
* x0: bits 5:0 bit offset
* bits 31:6 word offset
* x1: address
*/
- .macro bitop, name, instr
+ .macro bitop, name, llsc, lse
ENTRY( \name )
and w3, w0, #63 // Get bit offset
eor w0, w0, w3 // Clear low bits
mov x2, #1
add x1, x1, x0, lsr #3 // Get word offset
lsl x3, x2, x3 // Create mask
-1: ldxr x2, [x1]
- \instr x2, x2, x3
- stxr w0, x2, [x1]
- cbnz w0, 1b
+
+alt_lse "1: ldxr x2, [x1]", "\lse x3, [x1]"
+alt_lse " \llsc x2, x2, x3", "nop"
+alt_lse " stxr w0, x2, [x1]", "nop"
+alt_lse " cbnz w0, 1b", "nop"
+
ret
ENDPROC(\name )
.endm
- .macro testop, name, instr
+ .macro testop, name, llsc, lse
ENTRY( \name )
and w3, w0, #63 // Get bit offset
eor w0, w0, w3 // Clear low bits
mov x2, #1
add x1, x1, x0, lsr #3 // Get word offset
lsl x4, x2, x3 // Create mask
-1: ldxr x2, [x1]
- lsr x0, x2, x3 // Save old value of bit
- \instr x2, x2, x4 // toggle bit
- stlxr w5, x2, [x1]
- cbnz w5, 1b
- dmb ish
+
+alt_lse "1: ldxr x2, [x1]", "\lse x4, x2, [x1]"
+ lsr x0, x2, x3
+alt_lse " \llsc x2, x2, x4", "nop"
+alt_lse " stlxr w5, x2, [x1]", "nop"
+alt_lse " cbnz w5, 1b", "nop"
+alt_lse " dmb ish", "nop"
+
and x0, x0, #1
-3: ret
+ ret
ENDPROC(\name )
.endm
/*
* Atomic bit operations.
*/
- bitop change_bit, eor
- bitop clear_bit, bic
- bitop set_bit, orr
+ bitop change_bit, eor, steor
+ bitop clear_bit, bic, stclr
+ bitop set_bit, orr, stset
- testop test_and_change_bit, eor
- testop test_and_clear_bit, bic
- testop test_and_set_bit, orr
+ testop test_and_change_bit, eor, ldeoral
+ testop test_and_clear_bit, bic, ldclral
+ testop test_and_set_bit, orr, ldsetal
--
2.1.4
* [PATCH 09/18] arm64: xchg: patch in lse instructions when supported by the CPU
From: Will Deacon @ 2015-07-13 9:25 UTC
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of our xchg primitives so that
the LSE swp instruction (yes, you read right!) is used instead.
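
The semantics seen by callers are unchanged: xchg() atomically stores
the new value and returns the previous one for 1/2/4/8-byte objects;
with LSE the whole read-modify-write collapses into a single SWPAL. A
hypothetical usage example (illustrative names only):

#include <linux/atomic.h>
#include <linux/types.h>

static unsigned long sketch_flag;

/* Returns true only for the caller that flips the flag from 0 to 1.
 * Under LSE the exchange is one swpal instruction. */
static bool sketch_try_claim(void)
{
	return xchg(&sketch_flag, 1UL) == 0;
}
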
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/cmpxchg.h | 38 +++++++++++++++++++++++++++++++++-----
1 file changed, 33 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index d8c25b7b18fb..d0cce8068902 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -22,6 +22,7 @@
#include <linux/mmdebug.h>
#include <asm/barrier.h>
+#include <asm/lse.h>
static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
{
@@ -29,37 +30,65 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
switch (size) {
case 1:
- asm volatile("// __xchg1\n"
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: ldxrb %w0, %2\n"
" stlxrb %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
+ " dmb ish",
+ /* LSE atomics */
+ " nop\n"
+ " swpalb %w3, %w0, %2\n"
+ " nop\n"
+ " nop")
: "=&r" (ret), "=&r" (tmp), "+Q" (*(u8 *)ptr)
: "r" (x)
: "memory");
break;
case 2:
- asm volatile("// __xchg2\n"
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: ldxrh %w0, %2\n"
" stlxrh %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
+ " dmb ish",
+ /* LSE atomics */
+ " nop\n"
+ " swpalh %w3, %w0, %2\n"
+ " nop\n"
+ " nop")
: "=&r" (ret), "=&r" (tmp), "+Q" (*(u16 *)ptr)
: "r" (x)
: "memory");
break;
case 4:
- asm volatile("// __xchg4\n"
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: ldxr %w0, %2\n"
" stlxr %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
+ " dmb ish",
+ /* LSE atomics */
+ " nop\n"
+ " swpal %w3, %w0, %2\n"
+ " nop\n"
+ " nop")
: "=&r" (ret), "=&r" (tmp), "+Q" (*(u32 *)ptr)
: "r" (x)
: "memory");
break;
case 8:
- asm volatile("// __xchg8\n"
+ asm volatile(ARM64_LSE_ATOMIC_INSN(
+ /* LL/SC */
"1: ldxr %0, %2\n"
" stlxr %w1, %3, %2\n"
" cbnz %w1, 1b\n"
+ " dmb ish",
+ /* LSE atomics */
+ " nop\n"
+ " swpal %3, %0, %2\n"
+ " nop\n"
+ " nop")
: "=&r" (ret), "=&r" (tmp), "+Q" (*(u64 *)ptr)
: "r" (x)
: "memory");
@@ -68,7 +97,6 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
BUILD_BUG();
}
- smp_mb();
return ret;
}
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 10/18] arm64: cmpxchg: patch in lse instructions when supported by the CPU
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (8 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 09/18] arm64: xchg: " Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 11/18] arm64: cmpxchg_dbl: " Will Deacon
` (7 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of our cmpxchg primitives so that
the LSE cas instruction is used instead.
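As a usage sketch (hypothetical code, not from the patch), the C-level
compare-and-swap loop is untouched; the cmpxchg() below now compiles down to
a single cas on LSE-capable CPUs instead of an ldxr/stxr retry sequence:

	/* hypothetical lock-free stack push built on cmpxchg() */
	struct node {
		struct node *next;
	};

	static void push(struct node **head, struct node *new)
	{
		struct node *old;

		do {
			old = READ_ONCE(*head);
			new->next = old;
		} while (cmpxchg(head, old, new) != old);
	}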
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic.h | 3 +-
arch/arm64/include/asm/atomic_ll_sc.h | 38 ++++++++++++++++
arch/arm64/include/asm/atomic_lse.h | 40 +++++++++++++++++
arch/arm64/include/asm/cmpxchg.h | 84 ++++++++---------------------------
4 files changed, 99 insertions(+), 66 deletions(-)
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index cb53efa23f62..ee32776d926c 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -24,7 +24,6 @@
#include <linux/types.h>
#include <asm/barrier.h>
-#include <asm/cmpxchg.h>
#include <asm/lse.h>
#define ATOMIC_INIT(i) { (i) }
@@ -41,6 +40,8 @@
#undef __ARM64_IN_ATOMIC_IMPL
+#include <asm/cmpxchg.h>
+
/*
* On ARM, ordinary assignment (str instruction) doesn't clear the local
* strex/ldrex monitor on some implementations. The reason we can use it for
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 9cf298914ac3..b4298f9a898f 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -205,4 +205,42 @@ __LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
}
__LL_SC_EXPORT(atomic64_dec_if_positive);
+#define __CMPXCHG_CASE(w, sz, name, mb, cl) \
+__LL_SC_INLINE unsigned long \
+__LL_SC_PREFIX(__cmpxchg_case_##name(volatile void *ptr, \
+ unsigned long old, \
+ unsigned long new)) \
+{ \
+ unsigned long tmp, oldval; \
+ \
+ asm volatile( \
+ " " #mb "\n" \
+ "1: ldxr" #sz "\t%" #w "[oldval], %[v]\n" \
+ " eor %" #w "[tmp], %" #w "[oldval], %" #w "[old]\n" \
+ " cbnz %" #w "[tmp], 2f\n" \
+ " stxr" #sz "\t%w[tmp], %" #w "[new], %[v]\n" \
+ " cbnz %w[tmp], 1b\n" \
+ " " #mb "\n" \
+ " mov %" #w "[oldval], %" #w "[old]\n" \
+ "2:" \
+ : [tmp] "=&r" (tmp), [oldval] "=&r" (oldval), \
+ [v] "+Q" (*(unsigned long *)ptr) \
+ : [old] "Lr" (old), [new] "r" (new) \
+ : cl); \
+ \
+ return oldval; \
+} \
+__LL_SC_EXPORT(__cmpxchg_case_##name);
+
+__CMPXCHG_CASE(w, b, 1, , )
+__CMPXCHG_CASE(w, h, 2, , )
+__CMPXCHG_CASE(w, , 4, , )
+__CMPXCHG_CASE( , , 8, , )
+__CMPXCHG_CASE(w, b, mb_1, dmb ish, "memory")
+__CMPXCHG_CASE(w, h, mb_2, dmb ish, "memory")
+__CMPXCHG_CASE(w, , mb_4, dmb ish, "memory")
+__CMPXCHG_CASE( , , mb_8, dmb ish, "memory")
+
+#undef __CMPXCHG_CASE
+
#endif /* __ASM_ATOMIC_LL_SC_H */
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index d59780350514..a87ed3af6ced 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -226,4 +226,44 @@ static inline long atomic64_dec_if_positive(atomic64_t *v)
#undef __LL_SC_ATOMIC64
+#define __LL_SC_CMPXCHG(op, tmp) \
+ __LL_SC_SAVE_LR(tmp) \
+ __LL_SC_CALL(__cmpxchg_case_##op) \
+ __LL_SC_RESTORE_LR(tmp)
+
+#define __CMPXCHG_CASE(w, sz, name, mb, cl) \
+static inline unsigned long __cmpxchg_case_##name(volatile void *ptr, \
+ unsigned long old, \
+ unsigned long new) \
+{ \
+ unsigned long tmp; \
+ register unsigned long x0 asm ("x0") = (unsigned long)ptr; \
+ register unsigned long x1 asm ("x1") = old; \
+ register unsigned long x2 asm ("x2") = new; \
+ \
+ asm volatile( \
+ ARM64_LSE_ATOMIC_INSN(__LL_SC_CMPXCHG(name, %[tmp]), \
+ " mov %" #w "[tmp], %" #w "[old]\n" \
+ " cas" #mb #sz "\t%" #w "[tmp], %" #w "[new], %[v]\n" \
+ " mov %" #w "[ret], %" #w "[tmp]") \
+ : [tmp] "=&r" (tmp), [ret] "+r" (x0), \
+ [v] "+Q" (*(unsigned long *)ptr) \
+ : [old] "r" (x1), [new] "r" (x2) \
+ : cl); \
+ \
+ return x0; \
+}
+
+__CMPXCHG_CASE(w, b, 1, , )
+__CMPXCHG_CASE(w, h, 2, , )
+__CMPXCHG_CASE(w, , 4, , )
+__CMPXCHG_CASE( , , 8, , )
+__CMPXCHG_CASE(w, b, mb_1, al, "memory")
+__CMPXCHG_CASE(w, h, mb_2, al, "memory")
+__CMPXCHG_CASE(w, , mb_4, al, "memory")
+__CMPXCHG_CASE( , , mb_8, al, "memory")
+
+#undef __LL_SC_CMPXCHG
+#undef __CMPXCHG_CASE
+
#endif /* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index d0cce8068902..60a558127cef 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -21,6 +21,7 @@
#include <linux/bug.h>
#include <linux/mmdebug.h>
+#include <asm/atomic.h>
#include <asm/barrier.h>
#include <asm/lse.h>
@@ -111,74 +112,20 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
unsigned long new, int size)
{
- unsigned long oldval = 0, res;
-
switch (size) {
case 1:
- do {
- asm volatile("// __cmpxchg1\n"
- " ldxrb %w1, %2\n"
- " mov %w0, #0\n"
- " cmp %w1, %w3\n"
- " b.ne 1f\n"
- " stxrb %w0, %w4, %2\n"
- "1:\n"
- : "=&r" (res), "=&r" (oldval), "+Q" (*(u8 *)ptr)
- : "Ir" (old), "r" (new)
- : "cc");
- } while (res);
- break;
-
+ return __cmpxchg_case_1(ptr, old, new);
case 2:
- do {
- asm volatile("// __cmpxchg2\n"
- " ldxrh %w1, %2\n"
- " mov %w0, #0\n"
- " cmp %w1, %w3\n"
- " b.ne 1f\n"
- " stxrh %w0, %w4, %2\n"
- "1:\n"
- : "=&r" (res), "=&r" (oldval), "+Q" (*(u16 *)ptr)
- : "Ir" (old), "r" (new)
- : "cc");
- } while (res);
- break;
-
+ return __cmpxchg_case_2(ptr, old, new);
case 4:
- do {
- asm volatile("// __cmpxchg4\n"
- " ldxr %w1, %2\n"
- " mov %w0, #0\n"
- " cmp %w1, %w3\n"
- " b.ne 1f\n"
- " stxr %w0, %w4, %2\n"
- "1:\n"
- : "=&r" (res), "=&r" (oldval), "+Q" (*(u32 *)ptr)
- : "Ir" (old), "r" (new)
- : "cc");
- } while (res);
- break;
-
+ return __cmpxchg_case_4(ptr, old, new);
case 8:
- do {
- asm volatile("// __cmpxchg8\n"
- " ldxr %1, %2\n"
- " mov %w0, #0\n"
- " cmp %1, %3\n"
- " b.ne 1f\n"
- " stxr %w0, %4, %2\n"
- "1:\n"
- : "=&r" (res), "=&r" (oldval), "+Q" (*(u64 *)ptr)
- : "Ir" (old), "r" (new)
- : "cc");
- } while (res);
- break;
-
+ return __cmpxchg_case_8(ptr, old, new);
default:
BUILD_BUG();
}
- return oldval;
+ unreachable();
}
#define system_has_cmpxchg_double() 1
@@ -229,13 +176,20 @@ static inline int __cmpxchg_double_mb(volatile void *ptr1, volatile void *ptr2,
static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
unsigned long new, int size)
{
- unsigned long ret;
-
- smp_mb();
- ret = __cmpxchg(ptr, old, new, size);
- smp_mb();
+ switch (size) {
+ case 1:
+ return __cmpxchg_case_mb_1(ptr, old, new);
+ case 2:
+ return __cmpxchg_case_mb_2(ptr, old, new);
+ case 4:
+ return __cmpxchg_case_mb_4(ptr, old, new);
+ case 8:
+ return __cmpxchg_case_mb_8(ptr, old, new);
+ default:
+ BUILD_BUG();
+ }
- return ret;
+ unreachable();
}
#define cmpxchg(ptr, o, n) \
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 11/18] arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (9 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 10/18] arm64: cmpxchg: " Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines Will Deacon
` (6 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
On CPUs which support the LSE atomic instructions introduced in ARMv8.1,
it makes sense to use them in preference to ll/sc sequences.
This patch introduces runtime patching of our cmpxchg_double primitives
so that the LSE casp instruction is used instead.
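A minimal usage sketch, assuming the calling convention introduced further
down in this patch (the structure and helper are hypothetical):

	/* two adjacent, naturally-aligned longs, e.g. a freelist/counter pair */
	struct pair {
		unsigned long first;
		unsigned long second;
	};

	static int replace_pair(struct pair *p,
				unsigned long o1, unsigned long o2,
				unsigned long n1, unsigned long n2)
	{
		/* non-zero on success; becomes casp on LSE-capable CPUs */
		return cmpxchg_double(&p->first, &p->second, o1, o2, n1, n2);
	}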
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic_ll_sc.h | 34 ++++++++++++++++++
arch/arm64/include/asm/atomic_lse.h | 45 +++++++++++++++++++++++
arch/arm64/include/asm/cmpxchg.h | 68 +++++++++--------------------------
3 files changed, 96 insertions(+), 51 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index b4298f9a898f..77d3aabf52ad 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -243,4 +243,38 @@ __CMPXCHG_CASE( , , mb_8, dmb ish, "memory")
#undef __CMPXCHG_CASE
+#define __CMPXCHG_DBL(name, mb, cl) \
+__LL_SC_INLINE int \
+__LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1, \
+ unsigned long old2, \
+ unsigned long new1, \
+ unsigned long new2, \
+ volatile void *ptr)) \
+{ \
+ unsigned long tmp, ret; \
+ \
+ asm volatile("// __cmpxchg_double" #name "\n" \
+ " " #mb "\n" \
+ "1: ldxp %0, %1, %2\n" \
+ " eor %0, %0, %3\n" \
+ " eor %1, %1, %4\n" \
+ " orr %1, %0, %1\n" \
+ " cbnz %1, 2f\n" \
+ " stxp %w0, %5, %6, %2\n" \
+ " cbnz %w0, 1b\n" \
+ " " #mb "\n" \
+ "2:" \
+ : "=&r" (tmp), "=&r" (ret), "+Q" (*(unsigned long *)ptr) \
+ : "r" (old1), "r" (old2), "r" (new1), "r" (new2) \
+ : cl); \
+ \
+ return ret; \
+} \
+__LL_SC_EXPORT(__cmpxchg_double##name);
+
+__CMPXCHG_DBL( , , )
+__CMPXCHG_DBL(_mb, dmb ish, "memory")
+
+#undef __CMPXCHG_DBL
+
#endif /* __ASM_ATOMIC_LL_SC_H */
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index a87ed3af6ced..b1f55aac6d96 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -266,4 +266,49 @@ __CMPXCHG_CASE( , , mb_8, al, "memory")
#undef __LL_SC_CMPXCHG
#undef __CMPXCHG_CASE
+#define __LL_SC_CMPXCHG_DBL(op, tmp) \
+ __LL_SC_SAVE_LR(tmp) \
+ __LL_SC_CALL(__cmpxchg_double##op) \
+ __LL_SC_RESTORE_LR(tmp)
+
+#define __CMPXCHG_DBL(name, mb, cl) \
+static inline int __cmpxchg_double##name(unsigned long old1, \
+ unsigned long old2, \
+ unsigned long new1, \
+ unsigned long new2, \
+ volatile void *ptr) \
+{ \
+ unsigned long tmp; \
+ unsigned long oldval1 = old1; \
+ unsigned long oldval2 = old2; \
+ register unsigned long x0 asm ("x0") = old1; \
+ register unsigned long x1 asm ("x1") = old2; \
+ register unsigned long x2 asm ("x2") = new1; \
+ register unsigned long x3 asm ("x3") = new2; \
+ register unsigned long x4 asm ("x4") = (unsigned long)ptr; \
+ \
+ asm volatile(ARM64_LSE_ATOMIC_INSN( \
+ /* LL/SC */ \
+ " nop\n" \
+ __LL_SC_CMPXCHG_DBL(name, %[tmp]), \
+ /* LSE atomics */ \
+ " casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
+ " eor %[old1], %[old1], %[oldval1]\n" \
+ " eor %[old2], %[old2], %[oldval2]\n" \
+ " orr %[old1], %[old1], %[old2]") \
+ : [tmp] "=&r" (tmp), [old1] "+r" (x0), [old2] "+r" (x1), \
+ [v] "+Q" (*(unsigned long *)ptr) \
+ : [new1] "r" (x2), [new2] "r" (x3), [ptr] "r" (x4), \
+ [oldval1] "r" (oldval1), [oldval2] "r" (oldval2) \
+ : cl); \
+ \
+ return x0; \
+}
+
+__CMPXCHG_DBL( , , )
+__CMPXCHG_DBL(_mb, al, "memory")
+
+#undef __LL_SC_CMPXCHG_DBL
+#undef __CMPXCHG_DBL
+
#endif /* __ASM_ATOMIC_LSE_H */
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index 60a558127cef..f70212629d02 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -128,51 +128,6 @@ static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
unreachable();
}
-#define system_has_cmpxchg_double() 1
-
-static inline int __cmpxchg_double(volatile void *ptr1, volatile void *ptr2,
- unsigned long old1, unsigned long old2,
- unsigned long new1, unsigned long new2, int size)
-{
- unsigned long loop, lost;
-
- switch (size) {
- case 8:
- VM_BUG_ON((unsigned long *)ptr2 - (unsigned long *)ptr1 != 1);
- do {
- asm volatile("// __cmpxchg_double8\n"
- " ldxp %0, %1, %2\n"
- " eor %0, %0, %3\n"
- " eor %1, %1, %4\n"
- " orr %1, %0, %1\n"
- " mov %w0, #0\n"
- " cbnz %1, 1f\n"
- " stxp %w0, %5, %6, %2\n"
- "1:\n"
- : "=&r"(loop), "=&r"(lost), "+Q" (*(u64 *)ptr1)
- : "r" (old1), "r"(old2), "r"(new1), "r"(new2));
- } while (loop);
- break;
- default:
- BUILD_BUG();
- }
-
- return !lost;
-}
-
-static inline int __cmpxchg_double_mb(volatile void *ptr1, volatile void *ptr2,
- unsigned long old1, unsigned long old2,
- unsigned long new1, unsigned long new2, int size)
-{
- int ret;
-
- smp_mb();
- ret = __cmpxchg_double(ptr1, ptr2, old1, old2, new1, new2, size);
- smp_mb();
-
- return ret;
-}
-
static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
unsigned long new, int size)
{
@@ -210,21 +165,32 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
__ret; \
})
+#define system_has_cmpxchg_double() 1
+
+#define __cmpxchg_double_check(ptr1, ptr2) \
+({ \
+ if (sizeof(*(ptr1)) != 8) \
+ BUILD_BUG(); \
+ VM_BUG_ON((unsigned long *)(ptr2) - (unsigned long *)(ptr1) != 1); \
+})
+
#define cmpxchg_double(ptr1, ptr2, o1, o2, n1, n2) \
({\
int __ret;\
- __ret = __cmpxchg_double_mb((ptr1), (ptr2), (unsigned long)(o1), \
- (unsigned long)(o2), (unsigned long)(n1), \
- (unsigned long)(n2), sizeof(*(ptr1)));\
+ __cmpxchg_double_check(ptr1, ptr2); \
+ __ret = !__cmpxchg_double_mb((unsigned long)(o1), (unsigned long)(o2), \
+ (unsigned long)(n1), (unsigned long)(n2), \
+ ptr1); \
__ret; \
})
#define cmpxchg_double_local(ptr1, ptr2, o1, o2, n1, n2) \
({\
int __ret;\
- __ret = __cmpxchg_double((ptr1), (ptr2), (unsigned long)(o1), \
- (unsigned long)(o2), (unsigned long)(n1), \
- (unsigned long)(n2), sizeof(*(ptr1)));\
+ __cmpxchg_double_check(ptr1, ptr2); \
+ __ret = !__cmpxchg_double((unsigned long)(o1), (unsigned long)(o2), \
+ (unsigned long)(n1), (unsigned long)(n2), \
+ ptr1); \
__ret; \
})
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (10 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 11/18] arm64: cmpxchg_dbl: " Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-21 17:16 ` Catalin Marinas
2015-07-13 9:25 ` [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure Will Deacon
` (5 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
We can perform the cmpxchg comparison using eor and cbnz which avoids
the "cc" clobber for the ll/sc case and consequently for the LSE case
where we may have to fall back on the ll/sc code at runtime.
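The idea, sketched outside the kernel macros (illustrative only): an eor-based
comparison never writes the flags, so the asm no longer needs a condition-code
clobber, whereas cmp/b.ne does:

	static inline unsigned long values_differ(unsigned long a, unsigned long b)
	{
		unsigned long res;

		/* res == 0 iff a == b; NZCV is untouched, so no "cc" clobber */
		asm("eor %0, %1, %2" : "=r" (res) : "r" (a), "r" (b));
		return res;	/* a cbnz consumes this in the real loops */
	}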
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic_ll_sc.h | 14 ++++++--------
arch/arm64/include/asm/atomic_lse.h | 4 ++--
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 77d3aabf52ad..d21091bae901 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -96,14 +96,13 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
asm volatile("// atomic_cmpxchg\n"
"1: ldxr %w1, %2\n"
-" cmp %w1, %w3\n"
-" b.ne 2f\n"
+" eor %w0, %w1, %w3\n"
+" cbnz %w0, 2f\n"
" stxr %w0, %w4, %2\n"
" cbnz %w0, 1b\n"
"2:"
: "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
- : "Ir" (old), "r" (new)
- : "cc");
+ : "Lr" (old), "r" (new));
smp_mb();
return oldval;
@@ -169,14 +168,13 @@ __LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
asm volatile("// atomic64_cmpxchg\n"
"1: ldxr %1, %2\n"
-" cmp %1, %3\n"
-" b.ne 2f\n"
+" eor %0, %1, %3\n"
+" cbnz %w0, 2f\n"
" stxr %w0, %4, %2\n"
" cbnz %w0, 1b\n"
"2:"
: "=&r" (res), "=&r" (oldval), "+Q" (ptr->counter)
- : "Ir" (old), "r" (new)
- : "cc");
+ : "Lr" (old), "r" (new));
smp_mb();
return oldval;
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index b1f55aac6d96..7adee6656d42 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -105,7 +105,7 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
" mov %w[ret], %w[tmp]")
: [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
: [old] "r" (w1), [new] "r" (w2)
- : "cc", "memory");
+ : "memory");
return x0;
}
@@ -191,7 +191,7 @@ static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
" mov %[ret], %[tmp]")
: [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
: [old] "r" (x1), [new] "r" (x2)
- : "cc", "memory");
+ : "memory");
return x0;
}
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (11 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 10:28 ` Peter Zijlstra
2015-07-13 9:25 ` [PATCH 14/18] arm64: atomics: tidy up common atomic{,64}_* macros Will Deacon
` (4 subsequent siblings)
17 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
cmpxchg doesn't require memory barrier semantics when the value
comparison fails, so make the barrier conditional on success.
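A sketch of the caller pattern this relies on (the helper is hypothetical):
the usual cmpxchg loop simply retries on failure, so full ordering is only
required once the compare-and-swap has actually succeeded:

	static void bump(atomic_t *v)
	{
		int old, new;

		do {
			old = atomic_read(v);
			new = old + 1;		/* any computed value */
		} while (atomic_cmpxchg(v, old, new) != old);
		/* only this, successful, iteration needs the full barrier */
	}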
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic_ll_sc.h | 48 ++++++++++++++++-------------------
1 file changed, 22 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index d21091bae901..fb26f2b1f300 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -92,19 +92,18 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
unsigned long tmp;
int oldval;
- smp_mb();
-
asm volatile("// atomic_cmpxchg\n"
"1: ldxr %w1, %2\n"
" eor %w0, %w1, %w3\n"
" cbnz %w0, 2f\n"
-" stxr %w0, %w4, %2\n"
+" stlxr %w0, %w4, %2\n"
" cbnz %w0, 1b\n"
+" dmb ish\n"
"2:"
: "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
- : "Lr" (old), "r" (new));
+ : "Lr" (old), "r" (new)
+ : "memory");
- smp_mb();
return oldval;
}
__LL_SC_EXPORT(atomic_cmpxchg);
@@ -164,19 +163,18 @@ __LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
long oldval;
unsigned long res;
- smp_mb();
-
asm volatile("// atomic64_cmpxchg\n"
"1: ldxr %1, %2\n"
" eor %0, %1, %3\n"
" cbnz %w0, 2f\n"
-" stxr %w0, %4, %2\n"
+" stlxr %w0, %4, %2\n"
" cbnz %w0, 1b\n"
+" dmb ish\n"
"2:"
: "=&r" (res), "=&r" (oldval), "+Q" (ptr->counter)
- : "Lr" (old), "r" (new));
+ : "Lr" (old), "r" (new)
+ : "memory");
- smp_mb();
return oldval;
}
__LL_SC_EXPORT(atomic64_cmpxchg);
@@ -203,7 +201,7 @@ __LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
}
__LL_SC_EXPORT(atomic64_dec_if_positive);
-#define __CMPXCHG_CASE(w, sz, name, mb, cl) \
+#define __CMPXCHG_CASE(w, sz, name, mb, rel, cl) \
__LL_SC_INLINE unsigned long \
__LL_SC_PREFIX(__cmpxchg_case_##name(volatile void *ptr, \
unsigned long old, \
@@ -212,11 +210,10 @@ __LL_SC_PREFIX(__cmpxchg_case_##name(volatile void *ptr, \
unsigned long tmp, oldval; \
\
asm volatile( \
- " " #mb "\n" \
"1: ldxr" #sz "\t%" #w "[oldval], %[v]\n" \
" eor %" #w "[tmp], %" #w "[oldval], %" #w "[old]\n" \
" cbnz %" #w "[tmp], 2f\n" \
- " stxr" #sz "\t%w[tmp], %" #w "[new], %[v]\n" \
+ " st" #rel "xr" #sz "\t%w[tmp], %" #w "[new], %[v]\n" \
" cbnz %w[tmp], 1b\n" \
" " #mb "\n" \
" mov %" #w "[oldval], %" #w "[old]\n" \
@@ -230,18 +227,18 @@ __LL_SC_PREFIX(__cmpxchg_case_##name(volatile void *ptr, \
} \
__LL_SC_EXPORT(__cmpxchg_case_##name);
-__CMPXCHG_CASE(w, b, 1, , )
-__CMPXCHG_CASE(w, h, 2, , )
-__CMPXCHG_CASE(w, , 4, , )
-__CMPXCHG_CASE( , , 8, , )
-__CMPXCHG_CASE(w, b, mb_1, dmb ish, "memory")
-__CMPXCHG_CASE(w, h, mb_2, dmb ish, "memory")
-__CMPXCHG_CASE(w, , mb_4, dmb ish, "memory")
-__CMPXCHG_CASE( , , mb_8, dmb ish, "memory")
+__CMPXCHG_CASE(w, b, 1, , , )
+__CMPXCHG_CASE(w, h, 2, , , )
+__CMPXCHG_CASE(w, , 4, , , )
+__CMPXCHG_CASE( , , 8, , , )
+__CMPXCHG_CASE(w, b, mb_1, dmb ish, l, "memory")
+__CMPXCHG_CASE(w, h, mb_2, dmb ish, l, "memory")
+__CMPXCHG_CASE(w, , mb_4, dmb ish, l, "memory")
+__CMPXCHG_CASE( , , mb_8, dmb ish, l, "memory")
#undef __CMPXCHG_CASE
-#define __CMPXCHG_DBL(name, mb, cl) \
+#define __CMPXCHG_DBL(name, mb, rel, cl) \
__LL_SC_INLINE int \
__LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1, \
unsigned long old2, \
@@ -252,13 +249,12 @@ __LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1, \
unsigned long tmp, ret; \
\
asm volatile("// __cmpxchg_double" #name "\n" \
- " " #mb "\n" \
"1: ldxp %0, %1, %2\n" \
" eor %0, %0, %3\n" \
" eor %1, %1, %4\n" \
" orr %1, %0, %1\n" \
" cbnz %1, 2f\n" \
- " stxp %w0, %5, %6, %2\n" \
+ " st" #rel "xp %w0, %5, %6, %2\n" \
" cbnz %w0, 1b\n" \
" " #mb "\n" \
"2:" \
@@ -270,8 +266,8 @@ __LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1, \
} \
__LL_SC_EXPORT(__cmpxchg_double##name);
-__CMPXCHG_DBL( , , )
-__CMPXCHG_DBL(_mb, dmb ish, "memory")
+__CMPXCHG_DBL( , , , )
+__CMPXCHG_DBL(_mb, dmb ish, l, "memory")
#undef __CMPXCHG_DBL
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 14/18] arm64: atomics: tidy up common atomic{,64}_* macros
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (12 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 15/18] arm64: atomics: prefetch the destination word for write prior to stxr Will Deacon
` (3 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
The common (i.e. identical for ll/sc and lse) atomic macros in atomic.h
are needlessly different for atomic_t and atomic64_t.
This patch tidies up the definitions to make them consistent across the
two atomic types and factors out common code such as the add_unless
implementation based on cmpxchg.
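For illustration (hypothetical caller, not part of the patch), the factored
helper preserves the familiar semantics, e.g. the try-get-a-reference idiom:

	static bool try_get(atomic_t *refcount)
	{
		/* increment unless the count has already dropped to zero */
		return atomic_inc_not_zero(refcount);
	}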
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic.h | 92 +++++++++++++++++------------------------
1 file changed, 38 insertions(+), 54 deletions(-)
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index ee32776d926c..51816ab2312d 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -26,8 +26,6 @@
#include <asm/barrier.h>
#include <asm/lse.h>
-#define ATOMIC_INIT(i) { (i) }
-
#ifdef __KERNEL__
#define __ARM64_IN_ATOMIC_IMPL
@@ -42,67 +40,53 @@
#include <asm/cmpxchg.h>
-/*
- * On ARM, ordinary assignment (str instruction) doesn't clear the local
- * strex/ldrex monitor on some implementations. The reason we can use it for
- * atomic_set() is the clrex or dummy strex done on every exception return.
- */
-#define atomic_read(v) ACCESS_ONCE((v)->counter)
-#define atomic_set(v,i) (((v)->counter) = (i))
-
-#define atomic_xchg(v, new) (xchg(&((v)->counter), new))
-
-static inline int __atomic_add_unless(atomic_t *v, int a, int u)
-{
- int c, old;
-
- c = atomic_read(v);
- while (c != u && (old = atomic_cmpxchg((v), c, c + a)) != c)
- c = old;
- return c;
-}
+#define ___atomic_add_unless(v, a, u, sfx) \
+({ \
+ typeof((v)->counter) c, old; \
+ \
+ c = atomic##sfx##_read(v); \
+ while (c != (u) && \
+ (old = atomic##sfx##_cmpxchg((v), c, c + (a))) != c) \
+ c = old; \
+ c; \
+ })
-#define atomic_inc(v) atomic_add(1, v)
-#define atomic_dec(v) atomic_sub(1, v)
+#define ATOMIC_INIT(i) { (i) }
-#define atomic_inc_and_test(v) (atomic_add_return(1, v) == 0)
-#define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0)
-#define atomic_inc_return(v) (atomic_add_return(1, v))
-#define atomic_dec_return(v) (atomic_sub_return(1, v))
-#define atomic_sub_and_test(i, v) (atomic_sub_return(i, v) == 0)
+#define atomic_read(v) READ_ONCE((v)->counter)
+#define atomic_set(v, i) (((v)->counter) = (i))
+#define atomic_xchg(v, new) xchg(&((v)->counter), (new))
-#define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
+#define atomic_inc(v) atomic_add(1, (v))
+#define atomic_dec(v) atomic_sub(1, (v))
+#define atomic_inc_return(v) atomic_add_return(1, (v))
+#define atomic_dec_return(v) atomic_sub_return(1, (v))
+#define atomic_inc_and_test(v) (atomic_inc_return(v) == 0)
+#define atomic_dec_and_test(v) (atomic_dec_return(v) == 0)
+#define atomic_sub_and_test(i, v) (atomic_sub_return((i), (v)) == 0)
+#define atomic_add_negative(i, v) (atomic_add_return((i), (v)) < 0)
+#define __atomic_add_unless(v, a, u) ___atomic_add_unless(v, a, u,)
/*
* 64-bit atomic operations.
*/
-#define ATOMIC64_INIT(i) { (i) }
-
-#define atomic64_read(v) ACCESS_ONCE((v)->counter)
-#define atomic64_set(v,i) (((v)->counter) = (i))
-
-#define atomic64_xchg(v, new) (xchg(&((v)->counter), new))
-
-static inline int atomic64_add_unless(atomic64_t *v, long a, long u)
-{
- long c, old;
-
- c = atomic64_read(v);
- while (c != u && (old = atomic64_cmpxchg((v), c, c + a)) != c)
- c = old;
+#define ATOMIC64_INIT ATOMIC_INIT
+#define atomic64_read atomic_read
+#define atomic64_set atomic_set
+#define atomic64_xchg atomic_xchg
+
+#define atomic64_inc(v) atomic64_add(1, (v))
+#define atomic64_dec(v) atomic64_sub(1, (v))
+#define atomic64_inc_return(v) atomic64_add_return(1, (v))
+#define atomic64_dec_return(v) atomic64_sub_return(1, (v))
+#define atomic64_inc_and_test(v) (atomic64_inc_return(v) == 0)
+#define atomic64_dec_and_test(v) (atomic64_dec_return(v) == 0)
+#define atomic64_sub_and_test(i, v) (atomic64_sub_return((i), (v)) == 0)
+#define atomic64_add_negative(i, v) (atomic64_add_return((i), (v)) < 0)
+#define atomic64_add_unless(v, a, u) (___atomic_add_unless(v, a, u, 64) != u)
- return c != u;
-}
+#define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
-#define atomic64_add_negative(a, v) (atomic64_add_return((a), (v)) < 0)
-#define atomic64_inc(v) atomic64_add(1LL, (v))
-#define atomic64_inc_return(v) atomic64_add_return(1LL, (v))
-#define atomic64_inc_and_test(v) (atomic64_inc_return(v) == 0)
-#define atomic64_sub_and_test(a, v) (atomic64_sub_return((a), (v)) == 0)
-#define atomic64_dec(v) atomic64_sub(1LL, (v))
-#define atomic64_dec_return(v) atomic64_sub_return(1LL, (v))
-#define atomic64_dec_and_test(v) (atomic64_dec_return((v)) == 0)
-#define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1LL, 0LL)
#endif
#endif
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 15/18] arm64: atomics: prefetch the destination word for write prior to stxr
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (13 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 14/18] arm64: atomics: tidy up common atomic{,64}_* macros Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 16/18] arm64: atomics: implement atomic{, 64}_cmpxchg using cmpxchg Will Deacon
` (2 subsequent siblings)
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
The cost of changing a cacheline from shared to exclusive state can be
significant, especially when this is triggered by an exclusive store,
since it may result in having to retry the transaction.
This patch makes use of prfm to prefetch cachelines for write prior to
ldxr/stxr loops when using the ll/sc atomic routines.
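The shape of the resulting sequence, shown as a standalone sketch rather than
the patch's exact asm:

	static inline void example_inc(atomic_t *v)
	{
		unsigned long status;
		int val;

		asm volatile(
		"	prfm	pstl1strm, %2\n"	/* take the line for write early */
		"1:	ldxr	%w0, %2\n"
		"	add	%w0, %w0, #1\n"
		"	stxr	%w1, %w0, %2\n"
		"	cbnz	%w1, 1b"
		: "=&r" (val), "=&r" (status), "+Q" (v->counter));
	}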
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic_ll_sc.h | 9 +++++++++
arch/arm64/include/asm/cmpxchg.h | 8 ++++++++
arch/arm64/include/asm/futex.h | 2 ++
arch/arm64/lib/bitops.S | 2 ++
4 files changed, 21 insertions(+)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index fb26f2b1f300..652877fefae6 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -45,6 +45,7 @@ __LL_SC_PREFIX(atomic_##op(int i, atomic_t *v)) \
int result; \
\
asm volatile("// atomic_" #op "\n" \
+" prfm pstl1strm, %2\n" \
"1: ldxr %w0, %2\n" \
" " #asm_op " %w0, %w0, %w3\n" \
" stxr %w1, %w0, %2\n" \
@@ -62,6 +63,7 @@ __LL_SC_PREFIX(atomic_##op##_return(int i, atomic_t *v)) \
int result; \
\
asm volatile("// atomic_" #op "_return\n" \
+" prfm pstl1strm, %2\n" \
"1: ldxr %w0, %2\n" \
" " #asm_op " %w0, %w0, %w3\n" \
" stlxr %w1, %w0, %2\n" \
@@ -93,6 +95,7 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
int oldval;
asm volatile("// atomic_cmpxchg\n"
+" prfm pstl1strm, %2\n"
"1: ldxr %w1, %2\n"
" eor %w0, %w1, %w3\n"
" cbnz %w0, 2f\n"
@@ -116,6 +119,7 @@ __LL_SC_PREFIX(atomic64_##op(long i, atomic64_t *v)) \
unsigned long tmp; \
\
asm volatile("// atomic64_" #op "\n" \
+" prfm pstl1strm, %2\n" \
"1: ldxr %0, %2\n" \
" " #asm_op " %0, %0, %3\n" \
" stxr %w1, %0, %2\n" \
@@ -133,6 +137,7 @@ __LL_SC_PREFIX(atomic64_##op##_return(long i, atomic64_t *v)) \
unsigned long tmp; \
\
asm volatile("// atomic64_" #op "_return\n" \
+" prfm pstl1strm, %2\n" \
"1: ldxr %0, %2\n" \
" " #asm_op " %0, %0, %3\n" \
" stlxr %w1, %0, %2\n" \
@@ -164,6 +169,7 @@ __LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
unsigned long res;
asm volatile("// atomic64_cmpxchg\n"
+" prfm pstl1strm, %2\n"
"1: ldxr %1, %2\n"
" eor %0, %1, %3\n"
" cbnz %w0, 2f\n"
@@ -186,6 +192,7 @@ __LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
unsigned long tmp;
asm volatile("// atomic64_dec_if_positive\n"
+" prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" subs %0, %0, #1\n"
" b.mi 2f\n"
@@ -210,6 +217,7 @@ __LL_SC_PREFIX(__cmpxchg_case_##name(volatile void *ptr, \
unsigned long tmp, oldval; \
\
asm volatile( \
+ " prfm pstl1strm, %2\n" \
"1: ldxr" #sz "\t%" #w "[oldval], %[v]\n" \
" eor %" #w "[tmp], %" #w "[oldval], %" #w "[old]\n" \
" cbnz %" #w "[tmp], 2f\n" \
@@ -249,6 +257,7 @@ __LL_SC_PREFIX(__cmpxchg_double##name(unsigned long old1, \
unsigned long tmp, ret; \
\
asm volatile("// __cmpxchg_double" #name "\n" \
+ " prfm pstl1strm, %2\n" \
"1: ldxp %0, %1, %2\n" \
" eor %0, %0, %3\n" \
" eor %1, %1, %4\n" \
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index f70212629d02..7bfda0944c9b 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -33,12 +33,14 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
case 1:
asm volatile(ARM64_LSE_ATOMIC_INSN(
/* LL/SC */
+ " prfm pstl1strm, %2\n"
"1: ldxrb %w0, %2\n"
" stlxrb %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
" dmb ish",
/* LSE atomics */
" nop\n"
+ " nop\n"
" swpalb %w3, %w0, %2\n"
" nop\n"
" nop")
@@ -49,12 +51,14 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
case 2:
asm volatile(ARM64_LSE_ATOMIC_INSN(
/* LL/SC */
+ " prfm pstl1strm, %2\n"
"1: ldxrh %w0, %2\n"
" stlxrh %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
" dmb ish",
/* LSE atomics */
" nop\n"
+ " nop\n"
" swpalh %w3, %w0, %2\n"
" nop\n"
" nop")
@@ -65,12 +69,14 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
case 4:
asm volatile(ARM64_LSE_ATOMIC_INSN(
/* LL/SC */
+ " prfm pstl1strm, %2\n"
"1: ldxr %w0, %2\n"
" stlxr %w1, %w3, %2\n"
" cbnz %w1, 1b\n"
" dmb ish",
/* LSE atomics */
" nop\n"
+ " nop\n"
" swpal %w3, %w0, %2\n"
" nop\n"
" nop")
@@ -81,12 +87,14 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
case 8:
asm volatile(ARM64_LSE_ATOMIC_INSN(
/* LL/SC */
+ " prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" stlxr %w1, %3, %2\n"
" cbnz %w1, 1b\n"
" dmb ish",
/* LSE atomics */
" nop\n"
+ " nop\n"
" swpal %3, %0, %2\n"
" nop\n"
" nop")
diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 74069b3bd919..a681608faf9a 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -24,6 +24,7 @@
#define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg) \
asm volatile( \
+" prfm pstl1strm, %2\n" \
"1: ldxr %w1, %2\n" \
insn "\n" \
"2: stlxr %w3, %w0, %2\n" \
@@ -112,6 +113,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
return -EFAULT;
asm volatile("// futex_atomic_cmpxchg_inatomic\n"
+" prfm pstl1strm, %2\n"
"1: ldxr %w1, %2\n"
" sub %w3, %w1, %w4\n"
" cbnz %w3, 3f\n"
diff --git a/arch/arm64/lib/bitops.S b/arch/arm64/lib/bitops.S
index bc18457c2bba..43ac736baa5b 100644
--- a/arch/arm64/lib/bitops.S
+++ b/arch/arm64/lib/bitops.S
@@ -31,6 +31,7 @@ ENTRY( \name )
eor w0, w0, w3 // Clear low bits
mov x2, #1
add x1, x1, x0, lsr #3 // Get word offset
+alt_lse " prfm pstl1strm, [x1]", "nop"
lsl x3, x2, x3 // Create mask
alt_lse "1: ldxr x2, [x1]", "\lse x3, [x1]"
@@ -48,6 +49,7 @@ ENTRY( \name )
eor w0, w0, w3 // Clear low bits
mov x2, #1
add x1, x1, x0, lsr #3 // Get word offset
+alt_lse " prfm pstl1strm, [x1]", "nop"
lsl x4, x2, x3 // Create mask
alt_lse "1: ldxr x2, [x1]", "\lse x4, x2, [x1]"
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 16/18] arm64: atomics: implement atomic{, 64}_cmpxchg using cmpxchg
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (14 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 15/18] arm64: atomics: prefetch the destination word for write prior to stxr Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 17/18] arm64: atomic64_dec_if_positive: fix incorrect branch condition Will Deacon
2015-07-13 9:25 ` [PATCH 18/18] arm64: kconfig: select HAVE_CMPXCHG_LOCAL Will Deacon
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
We don't need duplicate cmpxchg implementations, so use cmpxchg to
implement atomic{,64}_cmpxchg, like we do for xchg already.
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic.h | 2 ++
arch/arm64/include/asm/atomic_ll_sc.h | 46 -----------------------------------
arch/arm64/include/asm/atomic_lse.h | 35 --------------------------
3 files changed, 2 insertions(+), 81 deletions(-)
diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
index 51816ab2312d..b4eff63be0ff 100644
--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -56,6 +56,7 @@
#define atomic_read(v) READ_ONCE((v)->counter)
#define atomic_set(v, i) (((v)->counter) = (i))
#define atomic_xchg(v, new) xchg(&((v)->counter), (new))
+#define atomic_cmpxchg(v, old, new) cmpxchg(&((v)->counter), (old), (new))
#define atomic_inc(v) atomic_add(1, (v))
#define atomic_dec(v) atomic_sub(1, (v))
@@ -74,6 +75,7 @@
#define atomic64_read atomic_read
#define atomic64_set atomic_set
#define atomic64_xchg atomic_xchg
+#define atomic64_cmpxchg atomic_cmpxchg
#define atomic64_inc(v) atomic64_add(1, (v))
#define atomic64_dec(v) atomic64_sub(1, (v))
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index 652877fefae6..cbaedf9afb2f 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -88,29 +88,6 @@ ATOMIC_OPS(sub, sub)
#undef ATOMIC_OP_RETURN
#undef ATOMIC_OP
-__LL_SC_INLINE int
-__LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
-{
- unsigned long tmp;
- int oldval;
-
- asm volatile("// atomic_cmpxchg\n"
-" prfm pstl1strm, %2\n"
-"1: ldxr %w1, %2\n"
-" eor %w0, %w1, %w3\n"
-" cbnz %w0, 2f\n"
-" stlxr %w0, %w4, %2\n"
-" cbnz %w0, 1b\n"
-" dmb ish\n"
-"2:"
- : "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
- : "Lr" (old), "r" (new)
- : "memory");
-
- return oldval;
-}
-__LL_SC_EXPORT(atomic_cmpxchg);
-
#define ATOMIC64_OP(op, asm_op) \
__LL_SC_INLINE void \
__LL_SC_PREFIX(atomic64_##op(long i, atomic64_t *v)) \
@@ -163,29 +140,6 @@ ATOMIC64_OPS(sub, sub)
#undef ATOMIC64_OP
__LL_SC_INLINE long
-__LL_SC_PREFIX(atomic64_cmpxchg(atomic64_t *ptr, long old, long new))
-{
- long oldval;
- unsigned long res;
-
- asm volatile("// atomic64_cmpxchg\n"
-" prfm pstl1strm, %2\n"
-"1: ldxr %1, %2\n"
-" eor %0, %1, %3\n"
-" cbnz %w0, 2f\n"
-" stlxr %w0, %4, %2\n"
-" cbnz %w0, 1b\n"
-" dmb ish\n"
-"2:"
- : "=&r" (res), "=&r" (oldval), "+Q" (ptr->counter)
- : "Lr" (old), "r" (new)
- : "memory");
-
- return oldval;
-}
-__LL_SC_EXPORT(atomic64_cmpxchg);
-
-__LL_SC_INLINE long
__LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
{
long result;
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 7adee6656d42..6a2bbdfcf290 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -92,24 +92,6 @@ static inline int atomic_sub_return(int i, atomic_t *v)
return w0;
}
-static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
-{
- unsigned long tmp;
- register unsigned long x0 asm ("x0") = (unsigned long)ptr;
- register int w1 asm ("w1") = old;
- register int w2 asm ("w2") = new;
-
- asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC(cmpxchg, %[tmp]),
- " mov %w[tmp], %w[old]\n"
- " casal %w[tmp], %w[new], %[v]\n"
- " mov %w[ret], %w[tmp]")
- : [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
- : [old] "r" (w1), [new] "r" (w2)
- : "memory");
-
- return x0;
-}
-
#undef __LL_SC_ATOMIC
#define __LL_SC_ATOMIC64(op, tmp) \
@@ -178,23 +160,6 @@ static inline long atomic64_sub_return(long i, atomic64_t *v)
return x0;
}
-static inline long atomic64_cmpxchg(atomic64_t *ptr, long old, long new)
-{
- unsigned long tmp;
- register unsigned long x0 asm ("x0") = (unsigned long)ptr;
- register long x1 asm ("x1") = old;
- register long x2 asm ("x2") = new;
-
- asm volatile(ARM64_LSE_ATOMIC_INSN(__LL_SC_ATOMIC64(cmpxchg, %[tmp]),
- " mov %[tmp], %[old]\n"
- " casal %[tmp], %[new], %[v]\n"
- " mov %[ret], %[tmp]")
- : [tmp] "=&r" (tmp), [ret] "+r" (x0), [v] "+Q" (ptr->counter)
- : [old] "r" (x1), [new] "r" (x2)
- : "memory");
-
- return x0;
-}
static inline long atomic64_dec_if_positive(atomic64_t *v)
{
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 17/18] arm64: atomic64_dec_if_positive: fix incorrect branch condition
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (15 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 16/18] arm64: atomics: implement atomic{, 64}_cmpxchg using cmpxchg Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 18/18] arm64: kconfig: select HAVE_CMPXCHG_LOCAL Will Deacon
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
If we attempt atomic64_dec_if_positive() on LLONG_MIN, we will underflow
and incorrectly decide that the original parameter was positive.
This patch fixes the broken condition code so that we handle this
corner case correctly.
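A worked example of the corner case (values shown for illustration):

	v = LLONG_MIN (0x8000000000000000)
	subs	%0, %0, #1	->	0x7fffffffffffffff, N == 0, V == 1

	b.mi	(taken when N == 1):	not taken, so the wrapped, positive
					value is stored and returned
	b.lt	(taken when N != V):	taken, so we correctly bail out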
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/include/asm/atomic_ll_sc.h | 2 +-
arch/arm64/include/asm/atomic_lse.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
index cbaedf9afb2f..1a9ae9197a9f 100644
--- a/arch/arm64/include/asm/atomic_ll_sc.h
+++ b/arch/arm64/include/asm/atomic_ll_sc.h
@@ -149,7 +149,7 @@ __LL_SC_PREFIX(atomic64_dec_if_positive(atomic64_t *v))
" prfm pstl1strm, %2\n"
"1: ldxr %0, %2\n"
" subs %0, %0, #1\n"
-" b.mi 2f\n"
+" b.lt 2f\n"
" stlxr %w1, %0, %2\n"
" cbnz %w1, 1b\n"
" dmb ish\n"
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 6a2bbdfcf290..82926657f6af 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -176,7 +176,7 @@ static inline long atomic64_dec_if_positive(atomic64_t *v)
/* LSE atomics */
"1: ldr %[tmp], %[v]\n"
" subs %[ret], %[tmp], #1\n"
- " b.mi 2f\n"
+ " b.lt 2f\n"
" casal %[tmp], %[ret], %[v]\n"
" sub %[tmp], %[tmp], #1\n"
" sub %[tmp], %[tmp], %[ret]\n"
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 18/18] arm64: kconfig: select HAVE_CMPXCHG_LOCAL
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
` (16 preceding siblings ...)
2015-07-13 9:25 ` [PATCH 17/18] arm64: atomic64_dec_if_positive: fix incorrect branch condition Will Deacon
@ 2015-07-13 9:25 ` Will Deacon
17 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 9:25 UTC (permalink / raw)
To: linux-arm-kernel
We implement an optimised cmpxchg_local macro, so let the kernel know.
Reviewed-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 682782ab6936..5a0646466622 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -53,6 +53,7 @@ config ARM64
select HAVE_C_RECORDMCOUNT
select HAVE_CC_STACKPROTECTOR
select HAVE_CMPXCHG_DOUBLE
+ select HAVE_CMPXCHG_LOCAL
select HAVE_DEBUG_BUGVERBOSE
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_API_DEBUG
--
2.1.4
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 9:25 ` [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure Will Deacon
@ 2015-07-13 10:28 ` Peter Zijlstra
2015-07-13 11:22 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2015-07-13 10:28 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 10:25:14AM +0100, Will Deacon wrote:
> cmpxchg doesn't require memory barrier semantics when the value
> comparison fails, so make the barrier conditional on success.
So this isn't actually a documented condition.
I would very much like a few extra words on this, that you've indeed
audited cmpxchg() users and preferably even a Documentation/ update to
match.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 10:28 ` Peter Zijlstra
@ 2015-07-13 11:22 ` Will Deacon
2015-07-13 13:39 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-13 11:22 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 11:28:48AM +0100, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 10:25:14AM +0100, Will Deacon wrote:
> > cmpxchg doesn't require memory barrier semantics when the value
> > comparison fails, so make the barrier conditional on success.
>
> So this isn't actually a documented condition.
There's *something* in Documentation/atomic_ops.txt (literally, the last
sentence in the file) but it's terrible at best:
"Note that this also means that for the case where the counter
is not dropping to zero, there are no memory ordering
requirements."
[you probably want to see the example for some context]
> I would very much like a few extra words on this, that you've indeed
> audited cmpxchg() users and preferably even a Documentation/ update to
> match.
Happy to update the docs. In terms of code audit, I couldn't find any
cmpxchg users that do something along the lines of "if the comparison
fails, don't loop, and instead do something to an independent address,
without barrier semantics that must be observed after the failed CAS":
- Most (as in, it's hard to find other cases) users just loop until
success, so there's no issue there.
- One use-case with work on the failure path is stats update (e.g.
drivers/net/ethernet/intel/ixgbe/ixgbe.h), but barrier semantics
aren't required here anyway.
- Another use-case is where you optimistically try a cmpxchg, then
fall back on a lock if you fail (e.g. slub and cmpxchg_double).
- Some other archs appear to do the same trick (alpha and powerpc).
So I'm confident with this change, but agree that a Docs update would
be beneficial. Something like below, or do you want some additional text,
too?
Will
--->8
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 13feb697271f..18fc860df1be 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2383,9 +2383,7 @@ about the state (old or new) implies an SMP-conditional general memory barrier
explicit lock operations, described later). These include:
xchg();
- cmpxchg();
atomic_xchg(); atomic_long_xchg();
- atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_inc_return(); atomic_long_inc_return();
atomic_dec_return(); atomic_long_dec_return();
atomic_add_return(); atomic_long_add_return();
@@ -2398,7 +2396,9 @@ explicit lock operations, described later). These include:
test_and_clear_bit();
test_and_change_bit();
- /* when succeeds (returns 1) */
+ /* when succeeds */
+ cmpxchg();
+ atomic_cmpxchg(); atomic_long_cmpxchg();
atomic_add_unless(); atomic_long_add_unless();
These are used for such things as implementing ACQUIRE-class and RELEASE-class
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 11:22 ` Will Deacon
@ 2015-07-13 13:39 ` Peter Zijlstra
2015-07-13 14:52 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2015-07-13 13:39 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 12:22:27PM +0100, Will Deacon wrote:
> On Mon, Jul 13, 2015 at 11:28:48AM +0100, Peter Zijlstra wrote:
> > On Mon, Jul 13, 2015 at 10:25:14AM +0100, Will Deacon wrote:
> > > cmpxchg doesn't require memory barrier semantics when the value
> > > comparison fails, so make the barrier conditional on success.
> >
> > So this isn't actually a documented condition.
>
> There's *something* in Documentation/atomic_ops.txt (literally, the last
> sentence in the file) but it's terrible at best:
>
> "Note that this also means that for the case where the counter
> is not dropping to zero, there are no memory ordering
> requirements."
>
> [you probably want to see the example for some context]
>
> > I would very much like a few extra words on this, that you've indeed
> > audited cmpxchg() users and preferably even a Documentation/ update to
> > match.
>
> Happy to update the docs. In terms of code audit, I couldn't find any
> cmpxchg users that do something along the lines of "if the comparison
> fails, don't loop, and instead do something to an independent address,
> without barrier semantics that must be observed after the failed CAS":
>
> - Most (as in, it's hard to find other cases) users just loop until
> success, so there's no issue there.
>
> - One use-case with work on the failure path is stats update (e.g.
> drivers/net/ethernet/intel/ixgbe/ixgbe.h), but barrier semantics
> aren't required here anyway.
>
> - Another use-case is where you optimistically try a cmpxchg, then
> fall back on a lock if you fail (e.g. slub and cmpxchg_double).
>
> - Some other archs appear to do the same trick (alpha and powerpc).
>
> So I'm confident with this change, but agree that a Docs update would
> be beneficial. Something like below, or do you want some additional text,
> too?
How about kernel/locking/qspinlock_paravirt.h:__pv_queued_spin_unlock()?
In that case we rely on the full memory barrier of the failed cmpxchg()
to order the load of &l->locked vs the content of node.
So in pv_wait_head() we:

	pv_hash(lock)
	MB
	->locked = _SLOW_VAL
And in __pv_queued_spin_unlock() we fail the cmpxchg when _SLOW_VAL and
rely on the barrier to ensure we observe the results of pv_hash().
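Paraphrasing the unlock side as a sketch (not the verbatim kernel code, and
with details elided):

	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;				/* fast path: nothing hashed */

	/*
	 * The failed cmpxchg observed _Q_SLOW_VAL; pv_unhash() then reads
	 * the hash entry written by pv_hash(), so those reads need to be
	 * ordered after the read of ->locked above.
	 */
	node = pv_unhash(lock);
	pv_kick(node->cpu);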
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 13:39 ` Peter Zijlstra
@ 2015-07-13 14:52 ` Will Deacon
2015-07-13 15:32 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-13 14:52 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 02:39:12PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 12:22:27PM +0100, Will Deacon wrote:
> > Happy to update the docs. In terms of code audit, I couldn't find any
> > cmpxchg users that do something along the lines of "if the comparison
> > fails, don't loop, and instead do something to an independent address,
> > without barrier semantics that must be observed after the failed CAS":
> >
> > - Most (as in, it's hard to find other cases) users just loop until
> > success, so there's no issue there.
> >
> > - One use-case with work on the failure path is stats update (e.g.
> > drivers/net/ethernet/intel/ixgbe/ixgbe.h), but barrier semantics
> > aren't required here anyway.
> >
> > - Another use-case is where you optimistically try a cmpxchg, then
> > fall back on a lock if you fail (e.g. slub and cmpxchg_double).
> >
> > - Some other archs appear to do the same trick (alpha and powerpc).
> >
> > So I'm confident with this change, but agree that a Docs update would
> > be beneficial. Something like below, or do you want some additional text,
> > too?
>
> How about kernel/locking/qspinlock_paravirt.h:__pv_queued_spin_unlock()
>
> In that case we rely on the full memory barrier of the failed cmpxchg()
> to order the load of &l->locked vs the content of node.
>
> So in pv_wait_head() we:
>
> pv_hash(lock)
> MB
> ->locked = _SLOW_VAL
>
> And in __pv_queued_spin_unlock() we fail the cmpxchg when _SLOW_VAL and
> rely on the barrier to ensure we observe the results of pv_hash().
That's an interesting case, and I think it's also broken on Alpha and Power
(which don't use this code). It's fun actually, because a failed cmpxchg
on those architectures gives you the barrier *before* the cmpxchg, but not
the one afterwards so it doesn't actually help here.
So there's three options afaict:
(1) Document failed cmpxchg as having ACQUIRE semantics, and change this
patch (and propose changes for Alpha and Power).
-or-
(2) Change pv_unhash to use fake dependency ordering across the hash.
-or-
(3) Put down an smp_rmb() between the cmpxchg and pv_unhash
The first two sound horrible, so I'd err towards 3, particularly as this
is x86-only code atm and I don't think it will have an effect there.
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 14:52 ` Will Deacon
@ 2015-07-13 15:32 ` Peter Zijlstra
2015-07-13 15:58 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2015-07-13 15:32 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 03:52:25PM +0100, Will Deacon wrote:
> That's an interesting case, and I think it's also broken on Alpha and Power
> (which don't use this code). It's fun actually, because a failed cmpxchg
> on those architectures gives you the barrier *before* the cmpxchg, but not
> the one afterwards so it doesn't actually help here.
>
> So there's three options afaict:
>
> (1) Document failed cmpxchg as having ACQUIRE semantics, and change this
> patch (and propose changes for Alpha and Power).
>
> -or-
>
> (2) Change pv_unhash to use fake dependency ordering across the hash.
>
> -or-
>
> (3) Put down an smp_rmb() between the cmpxchg and pv_unhash
>
> The first two sound horrible, so I'd err towards 3, particularly as this
> is x86-only code atm and I don't think it will have an effect there.
Right, I would definitely go for 3, but it does show there is code out
there :/
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure
2015-07-13 15:32 ` Peter Zijlstra
@ 2015-07-13 15:58 ` Will Deacon
0 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-13 15:58 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 04:32:26PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 13, 2015 at 03:52:25PM +0100, Will Deacon wrote:
> > That's an interesting case, and I think it's also broken on Alpha and Power
> > (which don't use this code). It's fun actually, because a failed cmpxchg
> > on those architectures gives you the barrier *before* the cmpxchg, but not
> > the one afterwards so it doesn't actually help here.
> >
> > So there's three options afaict:
> >
> > (1) Document failed cmpxchg as having ACQUIRE semantics, and change this
> > patch (and propose changes for Alpha and Power).
> >
> > -or-
> >
> > (2) Change pv_unhash to use fake dependency ordering across the hash.
> >
> > -or-
> >
> > (3) Put down an smp_rmb() between the cmpxchg and pv_unhash
> >
> > The first two sound horrible, so I'd err towards 3, particularly as this
> > is x86-only code atm and I don't think it will have an effect there.
>
> Right, I would definitely go for 3, but it does show there is code out
> there :/
Yeah... but I think it's rare enough that I'd be willing to call it a bug
and fix it up. Especially as the code in question is both (a) new and (b)
only built for x86 atm (which doesn't have any of these issues).
FWIW, patch below. A future change would be making the cmpxchg a
cmpxchg_release, which looks good in the unlock path and makes the need
for the smp_rmb more obvious imo.
Anyway, one step at a time.
Will
--->8
commit e24f911487db52898b7f0567a9701e93d3c3f13a
Author: Will Deacon <will.deacon@arm.com>
Date: Mon Jul 13 16:46:59 2015 +0100
locking/pvqspinlock: order pv_unhash after cmpxchg on unlock slowpath
When we unlock in __pv_queued_spin_unlock, a failed cmpxchg on the lock
value indicates that we need to take the slow-path and unhash the
corresponding node blocked on the lock.
Since a failed cmpxchg does not provide any memory-ordering guarantees,
it is possible that the node data could be read before the cmpxchg on
weakly-ordered architectures and therefore return a stale value, leading
to hash corruption and/or a BUG().
This patch adds an smp_rmb() following the failed cmpxchg operation, so
that the unhashing is ordered after the lock has been checked.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 04ab18151cc8..f216200dea3e 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -296,6 +296,13 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
return;
/*
+ * A failed cmpxchg doesn't provide any memory-ordering guarantees,
+ * so we need a barrier to order the read of the node data in
+ * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
+ */
+ smp_rmb();
+
+ /*
* Since the above failed to release, this must be the SLOW path.
* Therefore start by looking up the blocked node and unhashing it.
*/
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 03/18] arm64: elf: advertise 8.1 atomic instructions as new hwcap
2015-07-13 9:25 ` [PATCH 03/18] arm64: elf: advertise 8.1 atomic instructions as new hwcap Will Deacon
@ 2015-07-17 13:48 ` Catalin Marinas
2015-07-17 13:57 ` Russell King - ARM Linux
0 siblings, 1 reply; 35+ messages in thread
From: Catalin Marinas @ 2015-07-17 13:48 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 10:25:04AM +0100, Will Deacon wrote:
> #endif /* _UAPI__ASM_HWCAP_H */
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index f3067d4d4e35..c7fd2c946374 100644
> --- a/arch/arm64/kernel/setup.c
> +++ b/arch/arm64/kernel/setup.c
> @@ -280,6 +280,19 @@ static void __init setup_processor(void)
> if (block && !(block & 0x8))
> elf_hwcap |= HWCAP_CRC32;
>
> + block = (features >> 20) & 0xf;
> + if (!(block & 0x8)) {
> + switch (block) {
> + default:
> + case 2:
> + elf_hwcap |= HWCAP_ATOMICS;
> + case 1:
> + /* RESERVED */
> + case 0:
> + break;
> + }
> + }
At some point, we should move the elf_hwcap setting to the cpu features
infrastructure. The PAN patch series introduces an "enable" method for
detected CPU features (can be cleaned up for 4.4).
--
Catalin
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 03/18] arm64: elf: advertise 8.1 atomic instructions as new hwcap
2015-07-17 13:48 ` Catalin Marinas
@ 2015-07-17 13:57 ` Russell King - ARM Linux
0 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2015-07-17 13:57 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Jul 17, 2015 at 02:48:46PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2015 at 10:25:04AM +0100, Will Deacon wrote:
> > #endif /* _UAPI__ASM_HWCAP_H */
> > diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> > index f3067d4d4e35..c7fd2c946374 100644
> > --- a/arch/arm64/kernel/setup.c
> > +++ b/arch/arm64/kernel/setup.c
> > @@ -280,6 +280,19 @@ static void __init setup_processor(void)
> > if (block && !(block & 0x8))
> > elf_hwcap |= HWCAP_CRC32;
> >
> > + block = (features >> 20) & 0xf;
> > + if (!(block & 0x8)) {
> > + switch (block) {
> > + default:
> > + case 2:
> > + elf_hwcap |= HWCAP_ATOMICS;
> > + case 1:
> > + /* RESERVED */
> > + case 0:
> > + break;
> > + }
> > + }
>
> At some point, we should move the elf_hwcap setting to the cpu features
> infrastructure. The PAN patch series introduces an "enable" method for
> detected CPU features (can be cleaned up for 4.4).
On 32-bit ARM, we have this accessor:
cpuid_feature_extract_field()
which is there to properly deal with sign extending the 4-bit values, and
avoids all the if (!(block & 8)) { crap.
The above could then become the much simpler:
	block = cpuid_feature_extract_field(...);
	if (block > 0)
		elf_hwcap |= HWCAP_CRC32;

	block = cpuid_feature_extract_field(isarN, 20);
	if (block > 1)
		elf_hwcap |= HWCAP_ATOMICS;
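For reference, a minimal sketch of what a sign-extending extractor could look like for
the 64-bit ID registers (illustrative only; the helper that eventually landed on arm64
may differ in name and signature), plus a usage fragment for the atomics field quoted
above:

static inline int cpuid_feature_extract_field(u64 features, int field)
{
	/* shift the 4-bit field up to bits [63:60], then arithmetic-shift it back down */
	return (s64)(features << (64 - 4 - field)) >> 60;
}

	/* usage fragment, e.g. for the atomics field (features >> 20 in the quoted setup.c) */
	if (cpuid_feature_extract_field(features, 20) > 1)
		elf_hwcap |= HWCAP_ATOMICS;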
--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 05/18] arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics
2015-07-13 9:25 ` [PATCH 05/18] arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics Will Deacon
@ 2015-07-17 16:32 ` Catalin Marinas
2015-07-17 17:25 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Catalin Marinas @ 2015-07-17 16:32 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 10:25:06AM +0100, Will Deacon wrote:
> In order to patch in the new atomic instructions at runtime, we need to
> generate wrappers around the out-of-line exclusive load/store atomics.
>
> This patch adds a new Kconfig option, CONFIG_ARM64_LSE_ATOMICS, which
> causes our atomic functions to branch to the out-of-line ll/sc
> implementations. To avoid the register spill overhead of the PCS, the
> out-of-line functions are compiled with specific compiler flags to
> force out-of-line save/restore of any registers that are usually
> caller-saved.
I'm still trying to get my head around those -ffixed -fcall-used
options.
> --- /dev/null
> +++ b/arch/arm64/include/asm/atomic_lse.h
> @@ -0,0 +1,181 @@
> +/*
> + * Based on arch/arm/include/asm/atomic.h
> + *
> + * Copyright (C) 1996 Russell King.
> + * Copyright (C) 2002 Deep Blue Solutions Ltd.
> + * Copyright (C) 2012 ARM Ltd.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef __ASM_ATOMIC_LSE_H
> +#define __ASM_ATOMIC_LSE_H
> +
> +#ifndef __ARM64_IN_ATOMIC_IMPL
> +#error "please don't include this file directly"
> +#endif
> +
> +/* Move the ll/sc atomics out-of-line */
> +#define __LL_SC_INLINE
> +#define __LL_SC_PREFIX(x) __ll_sc_##x
> +#define __LL_SC_EXPORT(x) EXPORT_SYMBOL(__LL_SC_PREFIX(x))
> +
> +/* Macros for constructing calls to out-of-line ll/sc atomics */
> +#define __LL_SC_SAVE_LR(r) "mov\t" #r ", x30\n"
> +#define __LL_SC_RESTORE_LR(r) "mov\tx30, " #r "\n"
> +#define __LL_SC_CALL(op) \
> + "bl\t" __stringify(__LL_SC_PREFIX(atomic_##op)) "\n"
> +#define __LL_SC_CALL64(op) \
> + "bl\t" __stringify(__LL_SC_PREFIX(atomic64_##op)) "\n"
> +
> +#define ATOMIC_OP(op, asm_op) \
> +static inline void atomic_##op(int i, atomic_t *v) \
> +{ \
> + unsigned long lr; \
> + register int w0 asm ("w0") = i; \
> + register atomic_t *x1 asm ("x1") = v; \
> + \
> + asm volatile( \
> + __LL_SC_SAVE_LR(%0) \
> + __LL_SC_CALL(op) \
> + __LL_SC_RESTORE_LR(%0) \
> + : "=&r" (lr), "+r" (w0), "+Q" (v->counter) \
> + : "r" (x1)); \
> +} \
Since that's an inline function, in most cases we wouldn't need to
save/restore LR for a BL call, it may already be on the stack of the
including functions. Can we just not tell gcc that LR is clobbered by
this asm and it makes its own decision about saving/restoring?
As for v->counter, could we allocate it in callee-saved registers
already and avoid the -ffixed etc. options.
But note that I'm still trying to understand all these tricks, so I may
be wrong.
--
Catalin
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 05/18] arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics
2015-07-17 16:32 ` Catalin Marinas
@ 2015-07-17 17:25 ` Will Deacon
0 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-17 17:25 UTC (permalink / raw)
To: linux-arm-kernel
Hi Catalin,
On Fri, Jul 17, 2015 at 05:32:20PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2015 at 10:25:06AM +0100, Will Deacon wrote:
> > In order to patch in the new atomic instructions at runtime, we need to
> > generate wrappers around the out-of-line exclusive load/store atomics.
> >
> > This patch adds a new Kconfig option, CONFIG_ARM64_LSE_ATOMICS, which
> > causes our atomic functions to branch to the out-of-line ll/sc
> > implementations. To avoid the register spill overhead of the PCS, the
> > out-of-line functions are compiled with specific compiler flags to
> > force out-of-line save/restore of any registers that are usually
> > caller-saved.
>
> I'm still trying to get my head around those -ffixed -fcall-used
> options.
Yeah, they're pretty funky, but note that x86 does similar tricks for
some of its patching too (see ARCH_HWEIGHT_CFLAGS).
> > +#define ATOMIC_OP(op, asm_op) \
> > +static inline void atomic_##op(int i, atomic_t *v) \
> > +{ \
> > + unsigned long lr; \
> > + register int w0 asm ("w0") = i; \
> > + register atomic_t *x1 asm ("x1") = v; \
> > + \
> > + asm volatile( \
> > + __LL_SC_SAVE_LR(%0) \
> > + __LL_SC_CALL(op) \
> > + __LL_SC_RESTORE_LR(%0) \
> > + : "=&r" (lr), "+r" (w0), "+Q" (v->counter) \
> > + : "r" (x1)); \
> > +} \
>
> Since that's an inline function, in most cases we wouldn't need to
> save/restore LR for a BL call, it may already be on the stack of the
> including functions. Can we just not tell gcc that LR is clobbered by
> this asm and it makes its own decision about saving/restoring?
If we put lr in the clobber list, then it will get saved/restored by GCC
even when we are using the LSE atomics and don't touch lr at all. Also
note that later on the temporary register used to hold lr for the
out-of-line case is used as part of the LSE atomic, so there's no real
cost to having it.
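For concreteness, the ATOMIC_OP() quoted above hand-expands to roughly the following
for a single op (an illustrative expansion, not preprocessor output from the patch; the
symbol name follows the __LL_SC_PREFIX above). The lr temporary is what stands in for
the clobber:

static inline void atomic_add(int i, atomic_t *v)
{
	unsigned long lr;
	register int w0 asm ("w0") = i;
	register atomic_t *x1 asm ("x1") = v;

	asm volatile(
	"	mov	%0, x30\n"		/* stash lr in a temporary */
	"	bl	__ll_sc_atomic_add\n"	/* out-of-line ll/sc body */
	"	mov	x30, %0\n"		/* restore lr */
	: "=&r" (lr), "+r" (w0), "+Q" (v->counter)
	: "r" (x1));
}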
> As for v->counter, could we allocate it in callee-saved registers
> already and avoid the -ffixed etc. options.
The issue with that is when we don't use LSE and want to in-line the
ll/sc variants. Also, the weird compiler options also apply to any
temporary variables that the out-of-line code uses, so we'd need knowledge
of that here in order to allocate registers correctly (and then I have no
idea how you'd unpack things on the other side).
My first stab at this tried to specify fcall-used on a
per-function-prototype basis using target attributes, but GCC just silently
ignores those :(
> But note that I'm still trying to understand all these tricks, so I may
> be wrong.
Sorry for all the tricks, but it's the best I could come up with whilst
still generating decent disassembly for all cases. You get used to it
after a bit.
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 07/18] arm64: locks: patch in lse instructions when supported by the CPU
2015-07-13 9:25 ` [PATCH 07/18] arm64: locks: " Will Deacon
@ 2015-07-21 16:53 ` Catalin Marinas
2015-07-21 17:29 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Catalin Marinas @ 2015-07-21 16:53 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 10:25:08AM +0100, Will Deacon wrote:
> @@ -67,15 +78,25 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
> unsigned int tmp;
> arch_spinlock_t lockval;
>
> - asm volatile(
> -" prfm pstl1strm, %2\n"
> -"1: ldaxr %w0, %2\n"
> -" eor %w1, %w0, %w0, ror #16\n"
> -" cbnz %w1, 2f\n"
> -" add %w0, %w0, %3\n"
> -" stxr %w1, %w0, %2\n"
> -" cbnz %w1, 1b\n"
> -"2:"
> + asm volatile(ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> + " prfm pstl1strm, %2\n"
> + "1: ldaxr %w0, %2\n"
> + " eor %w1, %w0, %w0, ror #16\n"
> + " cbnz %w1, 2f\n"
> + " add %w0, %w0, %3\n"
> + " stxr %w1, %w0, %2\n"
> + " cbnz %w1, 1b\n"
> + "2:",
> + /* LSE atomics */
> + " ldar %w0, %2\n"
Do we still need an acquire if we fail to take the lock?
> + " eor %w1, %w0, %w0, ror #16\n"
> + " cbnz %w1, 1f\n"
> + " add %w1, %w0, %3\n"
> + " casa %w0, %w1, %2\n"
> + " and %w1, %w1, #0xffff\n"
> + " eor %w1, %w1, %w0, lsr #16\n"
> + "1:")
> : "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> : "I" (1 << TICKET_SHIFT)
> : "memory");
I wonder if this is any faster with LSE. CAS would have to re-load the
lock but we already have the value loaded (though most likely the reload
will be from cache and we save a cbnz that can be mispredicted). I guess
we'll re-test when we get some real hardware.
Does prfm help in any way with the LDAR?
> @@ -85,10 +106,19 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> - asm volatile(
> -" stlrh %w1, %0\n"
> - : "=Q" (lock->owner)
> - : "r" (lock->owner + 1)
> + unsigned long tmp;
> +
> + asm volatile(ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> + " ldr %w1, %0\n"
> + " add %w1, %w1, #1\n"
> + " stlrh %w1, %0",
> + /* LSE atomics */
> + " mov %w1, #1\n"
> + " nop\n"
> + " staddlh %w1, %0")
> + : "=Q" (lock->owner), "=&r" (tmp)
> + :
> : "memory");
> }
>
> @@ -125,11 +155,19 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>
> asm volatile(
> " sevl\n"
> + ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> "1: wfe\n"
> "2: ldaxr %w0, %1\n"
> " cbnz %w0, 1b\n"
> " stxr %w0, %w2, %1\n"
> - " cbnz %w0, 2b\n"
> + " cbnz %w0, 2b",
> + /* LSE atomics */
> + "1: wfe\n"
> + " mov %w0, wzr\n"
> + " casa %w0, %w2, %1\n"
> + " nop\n"
> + " cbnz %w0, 1b")
> : "=&r" (tmp), "+Q" (rw->lock)
> : "r" (0x80000000)
> : "memory");
With WFE in the LL/SC case, we rely on LDAXR to set the exclusive
monitor and an event would be generated every time it gets cleared. With
CAS, we no longer have this behaviour, so what guarantees a SEV?
> @@ -139,11 +177,16 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
> {
> unsigned int tmp;
>
> - asm volatile(
> + asm volatile(ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> " ldaxr %w0, %1\n"
> " cbnz %w0, 1f\n"
> " stxr %w0, %w2, %1\n"
Not a comment to your patch but to the original code: don't we need a
branch back to ldaxr if stxr fails, like we do for arch_spin_trylock?
The same comment for arch_read_trylock.
> - "1:\n"
> + "1:",
> + /* LSE atomics */
> + " mov %w0, wzr\n"
> + " casa %w0, %w2, %1\n"
> + " nop")
> : "=&r" (tmp), "+Q" (rw->lock)
> : "r" (0x80000000)
> : "memory");
> @@ -153,9 +196,10 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
>
> static inline void arch_write_unlock(arch_rwlock_t *rw)
> {
> - asm volatile(
> - " stlr %w1, %0\n"
> - : "=Q" (rw->lock) : "r" (0) : "memory");
> + asm volatile(ARM64_LSE_ATOMIC_INSN(
> + " stlr wzr, %0",
> + " swpl wzr, wzr, %0")
> + : "=Q" (rw->lock) :: "memory");
Is this any better than just STLR? We don't need the memory read.
> @@ -172,6 +216,10 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
> *
> * The memory barriers are implicit with the load-acquire and store-release
> * instructions.
> + *
> + * Note that in UNDEFINED cases, such as unlocking a lock twice, the LL/SC
> + * and LSE implementations may exhibit different behaviour (although this
> + * will have no effect on lockdep).
> */
> static inline void arch_read_lock(arch_rwlock_t *rw)
> {
> @@ -179,26 +227,43 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
>
> asm volatile(
> " sevl\n"
> + ARM64_LSE_ATOMIC_INSN(
> + /* LL/SC */
> "1: wfe\n"
> "2: ldaxr %w0, %2\n"
> " add %w0, %w0, #1\n"
> " tbnz %w0, #31, 1b\n"
> " stxr %w1, %w0, %2\n"
> - " cbnz %w1, 2b\n"
> + " nop\n"
> + " cbnz %w1, 2b",
> + /* LSE atomics */
> + "1: wfe\n"
> + "2: ldr %w0, %2\n"
> + " adds %w1, %w0, #1\n"
> + " tbnz %w1, #31, 1b\n"
> + " casa %w0, %w1, %2\n"
> + " sbc %w0, %w1, %w0\n"
> + " cbnz %w0, 2b")
> : "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
> :
> - : "memory");
> + : "cc", "memory");
Same comment here about WFE.
--
Catalin
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines
2015-07-13 9:25 ` [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines Will Deacon
@ 2015-07-21 17:16 ` Catalin Marinas
2015-07-21 17:32 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Catalin Marinas @ 2015-07-21 17:16 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Jul 13, 2015 at 10:25:13AM +0100, Will Deacon wrote:
> We can perform the cmpxchg comparison using eor and cbnz which avoids
> the "cc" clobber for the ll/sc case and consequently for the LSE case
> where we may have to fall-back on the ll/sc code at runtime.
>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> arch/arm64/include/asm/atomic_ll_sc.h | 14 ++++++--------
> arch/arm64/include/asm/atomic_lse.h | 4 ++--
> 2 files changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
> index 77d3aabf52ad..d21091bae901 100644
> --- a/arch/arm64/include/asm/atomic_ll_sc.h
> +++ b/arch/arm64/include/asm/atomic_ll_sc.h
> @@ -96,14 +96,13 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
>
> asm volatile("// atomic_cmpxchg\n"
> "1: ldxr %w1, %2\n"
> -" cmp %w1, %w3\n"
> -" b.ne 2f\n"
> +" eor %w0, %w1, %w3\n"
> +" cbnz %w0, 2f\n"
> " stxr %w0, %w4, %2\n"
> " cbnz %w0, 1b\n"
> "2:"
> : "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
> - : "Ir" (old), "r" (new)
> - : "cc");
> + : "Lr" (old), "r" (new));
For the LL/SC case, does this make things any slower? We replace a cmp +
b.ne with two arithmetic ops (eor and cbnz, unless the latter is somehow
smarter). I don't think the condition flags usually need to be preserved
across an asm statement, so the "cc" clobber probably didn't make much
difference anyway.
--
Catalin
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 07/18] arm64: locks: patch in lse instructions when supported by the CPU
2015-07-21 16:53 ` Catalin Marinas
@ 2015-07-21 17:29 ` Will Deacon
2015-07-23 13:39 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-21 17:29 UTC (permalink / raw)
To: linux-arm-kernel
Hi Catalin,
On Tue, Jul 21, 2015 at 05:53:39PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2015 at 10:25:08AM +0100, Will Deacon wrote:
> > @@ -67,15 +78,25 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
> > unsigned int tmp;
> > arch_spinlock_t lockval;
> >
> > - asm volatile(
> > -" prfm pstl1strm, %2\n"
> > -"1: ldaxr %w0, %2\n"
> > -" eor %w1, %w0, %w0, ror #16\n"
> > -" cbnz %w1, 2f\n"
> > -" add %w0, %w0, %3\n"
> > -" stxr %w1, %w0, %2\n"
> > -" cbnz %w1, 1b\n"
> > -"2:"
> > + asm volatile(ARM64_LSE_ATOMIC_INSN(
> > + /* LL/SC */
> > + " prfm pstl1strm, %2\n"
> > + "1: ldaxr %w0, %2\n"
> > + " eor %w1, %w0, %w0, ror #16\n"
> > + " cbnz %w1, 2f\n"
> > + " add %w0, %w0, %3\n"
> > + " stxr %w1, %w0, %2\n"
> > + " cbnz %w1, 1b\n"
> > + "2:",
> > + /* LSE atomics */
> > + " ldar %w0, %2\n"
>
> Do we still need an acquire if we fail to take the lock?
Good point, I think we can drop that. Read-after-read ordering is sufficient
to order LDR -> CASA in a trylock loop.
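Concretely, that would make the LSE fast path look something like the sketch below
(illustrative only; the types, operands and the !tmp return follow the quoted
arch_spin_trylock, and the plain LDR relies on the CASA providing the acquire when the
lock is actually taken):

static inline int arch_spin_trylock(arch_spinlock_t *lock)
{
	unsigned int tmp;
	arch_spinlock_t lockval;

	asm volatile(
	"	ldr	%w0, %2\n"			/* probe the lock, no acquire needed */
	"	eor	%w1, %w0, %w0, ror #16\n"	/* owner == next? */
	"	cbnz	%w1, 1f\n"			/* contended: fail */
	"	add	%w1, %w0, %3\n"
	"	casa	%w0, %w1, %2\n"			/* acquire on success */
	"	and	%w1, %w1, #0xffff\n"
	"	eor	%w1, %w1, %w0, lsr #16\n"	/* did the CAS win the ticket? */
	"1:"
	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
	: "I" (1 << TICKET_SHIFT)
	: "memory");

	return !tmp;
}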
>
> > + " eor %w1, %w0, %w0, ror #16\n"
> > + " cbnz %w1, 1f\n"
> > + " add %w1, %w0, %3\n"
> > + " casa %w0, %w1, %2\n"
> > + " and %w1, %w1, #0xffff\n"
> > + " eor %w1, %w1, %w0, lsr #16\n"
> > + "1:")
> > : "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> > : "I" (1 << TICKET_SHIFT)
> > : "memory");
>
> I wonder if this is any faster with LSE. CAS would have to re-load the
> lock but we already have the value loaded (though most likely the reload
> will be from cache and we save a cbnz that can be mispredicted). I guess
> we'll re-test when we get some real hardware.
Yeah, I'll definitely want to evaluate this on real hardware when it shows
up. For now, I'm basically trying to avoid explicit dirtying at L1, so
the CAS at least gives hardware the chance to go far in the face of
contention.
> Does prfm help in any way with the LDAR?
We'd need to prfm for write, which would end up forcing everything near
(which I'm avoiding for now).
> > @@ -85,10 +106,19 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
> >
> > static inline void arch_spin_unlock(arch_spinlock_t *lock)
> > {
> > - asm volatile(
> > -" stlrh %w1, %0\n"
> > - : "=Q" (lock->owner)
> > - : "r" (lock->owner + 1)
> > + unsigned long tmp;
> > +
> > + asm volatile(ARM64_LSE_ATOMIC_INSN(
> > + /* LL/SC */
> > + " ldr %w1, %0\n"
> > + " add %w1, %w1, #1\n"
> > + " stlrh %w1, %0",
> > + /* LSE atomics */
> > + " mov %w1, #1\n"
> > + " nop\n"
> > + " staddlh %w1, %0")
> > + : "=Q" (lock->owner), "=&r" (tmp)
> > + :
> > : "memory");
> > }
> >
> > @@ -125,11 +155,19 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
> >
> > asm volatile(
> > " sevl\n"
> > + ARM64_LSE_ATOMIC_INSN(
> > + /* LL/SC */
> > "1: wfe\n"
> > "2: ldaxr %w0, %1\n"
> > " cbnz %w0, 1b\n"
> > " stxr %w0, %w2, %1\n"
> > - " cbnz %w0, 2b\n"
> > + " cbnz %w0, 2b",
> > + /* LSE atomics */
> > + "1: wfe\n"
> > + " mov %w0, wzr\n"
> > + " casa %w0, %w2, %1\n"
> > + " nop\n"
> > + " cbnz %w0, 1b")
> > : "=&r" (tmp), "+Q" (rw->lock)
> > : "r" (0x80000000)
> > : "memory");
>
> With WFE in the LL/SC case, we rely on LDAXR to set the exclusive
> monitor and an event would be generated every time it gets cleared. With
> CAS, we no longer have this behaviour, so what guarantees a SEV?
My understanding was that failed CAS will set the exclusive monitor, but
what I have for a spec doesn't actually comment on this behaviour. I'll
go digging...
> > @@ -139,11 +177,16 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
> > {
> > unsigned int tmp;
> >
> > - asm volatile(
> > + asm volatile(ARM64_LSE_ATOMIC_INSN(
> > + /* LL/SC */
> > " ldaxr %w0, %1\n"
> > " cbnz %w0, 1f\n"
> > " stxr %w0, %w2, %1\n"
>
> Not a comment to your patch but to the original code: don't we need a
> branch back to ldaxr if stxr fails, like we do for arch_spin_trylock?
> The same comment for arch_read_trylock.
I don't think Linux specifies the behaviour here, to be honest. C11
distinguishes between "weak" and "strong" cmpxchg to try and address this
sort of thing. Since we're not sure, I suppose we should loop (like we
do for arch/arm/).
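As a rough idea, a looping ll/sc write_trylock along the lines of arch/arm/ could look
like the sketch below (not something from this series; register usage follows the
quoted code, and the stxr is simply retried rather than treated as failure to take the
lock):

static inline int arch_write_trylock(arch_rwlock_t *rw)
{
	unsigned int tmp;

	asm volatile(
	"1:	ldaxr	%w0, %1\n"
	"	cbnz	%w0, 2f\n"		/* held or has readers: fail */
	"	stxr	%w0, %w2, %1\n"
	"	cbnz	%w0, 1b\n"		/* lost the exclusive: retry */
	"2:"
	: "=&r" (tmp), "+Q" (rw->lock)
	: "r" (0x80000000)
	: "memory");

	return !tmp;
}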
Bear in mind that I'm currently moving us over to the qrwlock once I've
got this out of the way and added generic support for relaxed atomics,
which will result in the deletion of all of this anyway.
>
> > - "1:\n"
> > + "1:",
> > + /* LSE atomics */
> > + " mov %w0, wzr\n"
> > + " casa %w0, %w2, %1\n"
> > + " nop")
> > : "=&r" (tmp), "+Q" (rw->lock)
> > : "r" (0x80000000)
> > : "memory");
> > @@ -153,9 +196,10 @@ static inline int arch_write_trylock(arch_rwlock_t *rw)
> >
> > static inline void arch_write_unlock(arch_rwlock_t *rw)
> > {
> > - asm volatile(
> > - " stlr %w1, %0\n"
> > - : "=Q" (rw->lock) : "r" (0) : "memory");
> > + asm volatile(ARM64_LSE_ATOMIC_INSN(
> > + " stlr wzr, %0",
> > + " swpl wzr, wzr, %0")
> > + : "=Q" (rw->lock) :: "memory");
>
> Is this any better than just STLR? We don't need the memory read.
Again, I'm trying to avoid explicit dirtying at L1.
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines
2015-07-21 17:16 ` Catalin Marinas
@ 2015-07-21 17:32 ` Will Deacon
0 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2015-07-21 17:32 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Jul 21, 2015 at 06:16:07PM +0100, Catalin Marinas wrote:
> On Mon, Jul 13, 2015 at 10:25:13AM +0100, Will Deacon wrote:
> > We can perform the cmpxchg comparison using eor and cbnz which avoids
> > the "cc" clobber for the ll/sc case and consequently for the LSE case
> > where we may have to fall-back on the ll/sc code at runtime.
> >
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> > arch/arm64/include/asm/atomic_ll_sc.h | 14 ++++++--------
> > arch/arm64/include/asm/atomic_lse.h | 4 ++--
> > 2 files changed, 8 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
> > index 77d3aabf52ad..d21091bae901 100644
> > --- a/arch/arm64/include/asm/atomic_ll_sc.h
> > +++ b/arch/arm64/include/asm/atomic_ll_sc.h
> > @@ -96,14 +96,13 @@ __LL_SC_PREFIX(atomic_cmpxchg(atomic_t *ptr, int old, int new))
> >
> > asm volatile("// atomic_cmpxchg\n"
> > "1: ldxr %w1, %2\n"
> > -" cmp %w1, %w3\n"
> > -" b.ne 2f\n"
> > +" eor %w0, %w1, %w3\n"
> > +" cbnz %w0, 2f\n"
> > " stxr %w0, %w4, %2\n"
> > " cbnz %w0, 1b\n"
> > "2:"
> > : "=&r" (tmp), "=&r" (oldval), "+Q" (ptr->counter)
> > - : "Ir" (old), "r" (new)
> > - : "cc");
> > + : "Lr" (old), "r" (new));
>
> For the LL/SC case, does this make things any slower? We replace a cmp +
> b.ne with two arithmetic ops (eor and cbnz, unless the latter is somehow
> smarter). I don't think the condition flags usually need to be preserved
> across an asm statement, so the "cc" clobber probably didn't make much
> difference anyway.
I doubt you can measure it either way. The main reason for changing this
was for consistency with other, similar code and improved readability
(since otherwise we have a mystery "cc" clobber in the LSE version).
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 07/18] arm64: locks: patch in lse instructions when supported by the CPU
2015-07-21 17:29 ` Will Deacon
@ 2015-07-23 13:39 ` Will Deacon
2015-07-23 14:14 ` Catalin Marinas
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2015-07-23 13:39 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Jul 21, 2015 at 06:29:18PM +0100, Will Deacon wrote:
> On Tue, Jul 21, 2015 at 05:53:39PM +0100, Catalin Marinas wrote:
> > On Mon, Jul 13, 2015 at 10:25:08AM +0100, Will Deacon wrote:
> > > @@ -125,11 +155,19 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
> > >
> > > asm volatile(
> > > " sevl\n"
> > > + ARM64_LSE_ATOMIC_INSN(
> > > + /* LL/SC */
> > > "1: wfe\n"
> > > "2: ldaxr %w0, %1\n"
> > > " cbnz %w0, 1b\n"
> > > " stxr %w0, %w2, %1\n"
> > > - " cbnz %w0, 2b\n"
> > > + " cbnz %w0, 2b",
> > > + /* LSE atomics */
> > > + "1: wfe\n"
> > > + " mov %w0, wzr\n"
> > > + " casa %w0, %w2, %1\n"
> > > + " nop\n"
> > > + " cbnz %w0, 1b")
> > > : "=&r" (tmp), "+Q" (rw->lock)
> > > : "r" (0x80000000)
> > > : "memory");
> >
> > With WFE in the LL/SC case, we rely on LDAXR to set the exclusive
> > monitor and an event would be generated every time it gets cleared. With
> > CAS, we no longer have this behaviour, so what guarantees a SEV?
>
> My understanding was that failed CAS will set the exclusive monitor, but
> what I have for a spec doesn't actually comment on this behaviour. I'll
> go digging...
... and the winner is: not me! We do need an LDXR to set the exclusive
monitor and doing that without introducing races is slightly confusing.
Here's what I now have for write_lock (read_lock is actually pretty simple):
static inline void arch_write_lock(arch_rwlock_t *rw)
{
	unsigned int tmp;

	asm volatile(ARM64_LSE_ATOMIC_INSN(
	/* LL/SC */
	"	sevl\n"
	"1:	wfe\n"
	"2:	ldaxr	%w0, %1\n"
	"	cbnz	%w0, 1b\n"
	"	stxr	%w0, %w2, %1\n"
	"	cbnz	%w0, 2b\n"
	"	nop",
	/* LSE atomics */
	"1:	mov	%w0, wzr\n"
	"2:	casa	%w0, %w2, %1\n"
	"	cbz	%w0, 3f\n"
	"	ldxr	%w0, %1\n"
	"	cbz	%w0, 2b\n"
	"	wfe\n"
	"	b	1b\n"
	"3:")
	: "=&r" (tmp), "+Q" (rw->lock)
	: "r" (0x80000000)
	: "memory");
}
What do you reckon?
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 07/18] arm64: locks: patch in lse instructions when supported by the CPU
2015-07-23 13:39 ` Will Deacon
@ 2015-07-23 14:14 ` Catalin Marinas
0 siblings, 0 replies; 35+ messages in thread
From: Catalin Marinas @ 2015-07-23 14:14 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Jul 23, 2015 at 02:39:35PM +0100, Will Deacon wrote:
> On Tue, Jul 21, 2015 at 06:29:18PM +0100, Will Deacon wrote:
> > On Tue, Jul 21, 2015 at 05:53:39PM +0100, Catalin Marinas wrote:
> > > On Mon, Jul 13, 2015 at 10:25:08AM +0100, Will Deacon wrote:
> > > > @@ -125,11 +155,19 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
> > > >
> > > > asm volatile(
> > > > " sevl\n"
> > > > + ARM64_LSE_ATOMIC_INSN(
> > > > + /* LL/SC */
> > > > "1: wfe\n"
> > > > "2: ldaxr %w0, %1\n"
> > > > " cbnz %w0, 1b\n"
> > > > " stxr %w0, %w2, %1\n"
> > > > - " cbnz %w0, 2b\n"
> > > > + " cbnz %w0, 2b",
> > > > + /* LSE atomics */
> > > > + "1: wfe\n"
> > > > + " mov %w0, wzr\n"
> > > > + " casa %w0, %w2, %1\n"
> > > > + " nop\n"
> > > > + " cbnz %w0, 1b")
> > > > : "=&r" (tmp), "+Q" (rw->lock)
> > > > : "r" (0x80000000)
> > > > : "memory");
> > >
> > > With WFE in the LL/SC case, we rely on LDAXR to set the exclusive
> > > monitor and an event would be generated every time it gets cleared. With
> > > CAS, we no longer have this behaviour, so what guarantees a SEV?
> >
> > My understanding was that failed CAS will set the exclusive monitor, but
> > what I have for a spec doesn't actually comment on this behaviour. I'll
> > go digging...
>
> ... and the winner is: not me! We do need an LDXR to set the exclusive
> monitor and doing that without introducing races is slightly confusing.
>
> Here's what I now have for write_lock (read_lock is actually pretty simple):
>
> static inline void arch_write_lock(arch_rwlock_t *rw)
> {
> unsigned int tmp;
>
> asm volatile(ARM64_LSE_ATOMIC_INSN(
> /* LL/SC */
> " sevl\n"
> "1: wfe\n"
> "2: ldaxr %w0, %1\n"
> " cbnz %w0, 1b\n"
> " stxr %w0, %w2, %1\n"
> " cbnz %w0, 2b\n"
> " nop",
> /* LSE atomics */
> "1: mov %w0, wzr\n"
> "2: casa %w0, %w2, %1\n"
> " cbz %w0, 3f\n"
> " ldxr %w0, %1\n"
> " cbz %w0, 2b\n"
> " wfe\n"
> " b 1b\n"
> "3:")
> : "=&r" (tmp), "+Q" (rw->lock)
> : "r" (0x80000000)
> : "memory");
> }
>
> What do you reckon?
It looks fine. I thought I could reduce the number of branches but I
still end up with 3. At least the no-contention case should be fast.
--
Catalin
^ permalink raw reply [flat|nested] 35+ messages in thread
Thread overview: 35+ messages
2015-07-13 9:25 [PATCH 00/18] arm64: support for 8.1 LSE atomic instructions Will Deacon
2015-07-13 9:25 ` [PATCH 01/18] arm64: cpufeature.h: add missing #include of kernel.h Will Deacon
2015-07-13 9:25 ` [PATCH 02/18] arm64: atomics: move ll/sc atomics into separate header file Will Deacon
2015-07-13 9:25 ` [PATCH 03/18] arm64: elf: advertise 8.1 atomic instructions as new hwcap Will Deacon
2015-07-17 13:48 ` Catalin Marinas
2015-07-17 13:57 ` Russell King - ARM Linux
2015-07-13 9:25 ` [PATCH 04/18] arm64: alternatives: add cpu feature for lse atomics Will Deacon
2015-07-13 9:25 ` [PATCH 05/18] arm64: introduce CONFIG_ARM64_LSE_ATOMICS as fallback to ll/sc atomics Will Deacon
2015-07-17 16:32 ` Catalin Marinas
2015-07-17 17:25 ` Will Deacon
2015-07-13 9:25 ` [PATCH 06/18] arm64: atomics: patch in lse instructions when supported by the CPU Will Deacon
2015-07-13 9:25 ` [PATCH 07/18] arm64: locks: " Will Deacon
2015-07-21 16:53 ` Catalin Marinas
2015-07-21 17:29 ` Will Deacon
2015-07-23 13:39 ` Will Deacon
2015-07-23 14:14 ` Catalin Marinas
2015-07-13 9:25 ` [PATCH 08/18] arm64: bitops: " Will Deacon
2015-07-13 9:25 ` [PATCH 09/18] arm64: xchg: " Will Deacon
2015-07-13 9:25 ` [PATCH 10/18] arm64: cmpxchg: " Will Deacon
2015-07-13 9:25 ` [PATCH 11/18] arm64: cmpxchg_dbl: " Will Deacon
2015-07-13 9:25 ` [PATCH 12/18] arm64: cmpxchg: avoid "cc" clobber in ll/sc routines Will Deacon
2015-07-21 17:16 ` Catalin Marinas
2015-07-21 17:32 ` Will Deacon
2015-07-13 9:25 ` [PATCH 13/18] arm64: cmpxchg: avoid memory barrier on comparison failure Will Deacon
2015-07-13 10:28 ` Peter Zijlstra
2015-07-13 11:22 ` Will Deacon
2015-07-13 13:39 ` Peter Zijlstra
2015-07-13 14:52 ` Will Deacon
2015-07-13 15:32 ` Peter Zijlstra
2015-07-13 15:58 ` Will Deacon
2015-07-13 9:25 ` [PATCH 14/18] arm64: atomics: tidy up common atomic{,64}_* macros Will Deacon
2015-07-13 9:25 ` [PATCH 15/18] arm64: atomics: prefetch the destination word for write prior to stxr Will Deacon
2015-07-13 9:25 ` [PATCH 16/18] arm64: atomics: implement atomic{, 64}_cmpxchg using cmpxchg Will Deacon
2015-07-13 9:25 ` [PATCH 17/18] arm64: atomic64_dec_if_positive: fix incorrect branch condition Will Deacon
2015-07-13 9:25 ` [PATCH 18/18] arm64: kconfig: select HAVE_CMPXCHG_LOCAL Will Deacon