[PATCH 0/5] x86/percpu semantics and fixes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] x86/percpu semantics and fixes
@ 2019-02-27 10:12 Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel

Hi all,

This is a collection of x86/percpu changes that I had pending and got reminded
of by Linus' comment yesterday about __this_cpu_xchg().

This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.

Built and boot tested with CONFIG_DEBUG_PREEMPT=y.

---
 arch/x86/include/asm/irq_regs.h |   4 +-
 arch/x86/include/asm/percpu.h   | 236 +++++++++++++++++++++-------------------
 arch/x86/include/asm/smp.h      |   3 +-
 arch/x86/mm/tlb.c               |  62 +++++------
 include/linux/smp.h             |  45 +++++---
 kernel/sched/fair.c             |   5 +-
 6 files changed, 193 insertions(+), 162 deletions(-)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
  2019-02-27 16:14   ` Linus Torvalds
  2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit

Nadav Amit reported that commit:

  b59167ac7baf ("x86/percpu: Fix this_cpu_read()")

added a bunch of constraints to all sorts of code; and while some of
that was correct and desired, some of that seems superfluous.

The thing is, the this_cpu_*() operations are defined IRQ-safe, this
means the values are subject to change from IRQs, and thus must be
reloaded.

Also (the generic form):

  local_irq_save()
  __this_cpu_read()
  local_irq_restore()

would not allow the re-use of previous values; if by nothing else,
then the barrier()s implied by local_irq_*().

Which raises the point that percpu_from_op() and the other also need
that volatile.

OTOH __this_cpu_*() operations are not IRQ-safe and assume external
preempt/IRQ disabling and could thus be allowed more room for
optimization.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/percpu.h |  224 +++++++++++++++++++++---------------------
 1 file changed, 112 insertions(+), 112 deletions(-)

--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -87,7 +87,7 @@
  * don't give an lvalue though). */
 extern void __bad_percpu_size(void);
 
-#define percpu_to_op(op, var, val)			\
+#define percpu_to_op(qual, op, var, val)		\
 do {							\
 	typedef typeof(var) pto_T__;			\
 	if (0) {					\
@@ -97,22 +97,22 @@ do {							\
 	}						\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b %1,"__percpu_arg(0)		\
+		asm qual (op "b %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "qi" ((pto_T__)(val)));		\
 		break;					\
 	case 2:						\
-		asm(op "w %1,"__percpu_arg(0)		\
+		asm qual (op "w %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 4:						\
-		asm(op "l %1,"__percpu_arg(0)		\
+		asm qual (op "l %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "ri" ((pto_T__)(val)));		\
 		break;					\
 	case 8:						\
-		asm(op "q %1,"__percpu_arg(0)		\
+		asm qual (op "q %1,"__percpu_arg(0)	\
 		    : "+m" (var)			\
 		    : "re" ((pto_T__)(val)));		\
 		break;					\
@@ -124,7 +124,7 @@ do {							\
  * Generate a percpu add to memory instruction and optimize code
  * if one is added or subtracted.
  */
-#define percpu_add_op(var, val)						\
+#define percpu_add_op(qual, var, val)					\
 do {									\
 	typedef typeof(var) pao_T__;					\
 	const int pao_ID__ = (__builtin_constant_p(val) &&		\
@@ -138,41 +138,41 @@ do {									\
 	switch (sizeof(var)) {						\
 	case 1:								\
 		if (pao_ID__ == 1)					\
-			asm("incb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incb "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decb "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decb "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addb %1, "__percpu_arg(0)			\
+			asm qual ("addb %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "qi" ((pao_T__)(val)));			\
 		break;							\
 	case 2:								\
 		if (pao_ID__ == 1)					\
-			asm("incw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incw "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decw "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decw "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addw %1, "__percpu_arg(0)			\
+			asm qual ("addw %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 4:								\
 		if (pao_ID__ == 1)					\
-			asm("incl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incl "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decl "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decl "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addl %1, "__percpu_arg(0)			\
+			asm qual ("addl %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "ri" ((pao_T__)(val)));			\
 		break;							\
 	case 8:								\
 		if (pao_ID__ == 1)					\
-			asm("incq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("incq "__percpu_arg(0) : "+m" (var));	\
 		else if (pao_ID__ == -1)				\
-			asm("decq "__percpu_arg(0) : "+m" (var));	\
+			asm qual ("decq "__percpu_arg(0) : "+m" (var));	\
 		else							\
-			asm("addq %1, "__percpu_arg(0)			\
+			asm qual ("addq %1, "__percpu_arg(0)		\
 			    : "+m" (var)				\
 			    : "re" ((pao_T__)(val)));			\
 		break;							\
@@ -180,27 +180,27 @@ do {									\
 	}								\
 } while (0)
 
-#define percpu_from_op(op, var)				\
+#define percpu_from_op(qual, op, var)			\
 ({							\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm volatile(op "b "__percpu_arg(1)",%0"\
+		asm qual (op "b "__percpu_arg(1)",%0"	\
 		    : "=q" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 2:						\
-		asm volatile(op "w "__percpu_arg(1)",%0"\
+		asm qual (op "w "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 4:						\
-		asm volatile(op "l "__percpu_arg(1)",%0"\
+		asm qual (op "l "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
 	case 8:						\
-		asm volatile(op "q "__percpu_arg(1)",%0"\
+		asm qual (op "q "__percpu_arg(1)",%0"	\
 		    : "=r" (pfo_ret__)			\
 		    : "m" (var));			\
 		break;					\
@@ -238,23 +238,23 @@ do {									\
 	pfo_ret__;					\
 })
 
-#define percpu_unary_op(op, var)			\
+#define percpu_unary_op(qual, op, var)			\
 ({							\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm(op "b "__percpu_arg(0)		\
+		asm qual (op "b "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 2:						\
-		asm(op "w "__percpu_arg(0)		\
+		asm qual (op "w "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 4:						\
-		asm(op "l "__percpu_arg(0)		\
+		asm qual (op "l "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	case 8:						\
-		asm(op "q "__percpu_arg(0)		\
+		asm qual (op "q "__percpu_arg(0)	\
 		    : "+m" (var));			\
 		break;					\
 	default: __bad_percpu_size();			\
@@ -264,27 +264,27 @@ do {									\
 /*
  * Add return operation
  */
-#define percpu_add_return_op(var, val)					\
+#define percpu_add_return_op(qual, var, val)				\
 ({									\
 	typeof(var) paro_ret__ = val;					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("xaddb %0, "__percpu_arg(1)				\
+		asm qual ("xaddb %0, "__percpu_arg(1)			\
 			    : "+q" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 2:								\
-		asm("xaddw %0, "__percpu_arg(1)				\
+		asm qual ("xaddw %0, "__percpu_arg(1)			\
 			    : "+r" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 4:								\
-		asm("xaddl %0, "__percpu_arg(1)				\
+		asm qual ("xaddl %0, "__percpu_arg(1)			\
 			    : "+r" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
 	case 8:								\
-		asm("xaddq %0, "__percpu_arg(1)				\
+		asm qual ("xaddq %0, "__percpu_arg(1)			\
 			    : "+re" (paro_ret__), "+m" (var)		\
 			    : : "memory");				\
 		break;							\
@@ -299,13 +299,13 @@ do {									\
  * expensive due to the implied lock prefix.  The processor cannot prefetch
  * cachelines if xchg is used.
  */
-#define percpu_xchg_op(var, nval)					\
+#define percpu_xchg_op(qual, var, nval)					\
 ({									\
 	typeof(var) pxo_ret__;						\
 	typeof(var) pxo_new__ = (nval);					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("\n\tmov "__percpu_arg(1)",%%al"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%al"		\
 		    "\n1:\tcmpxchgb %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -313,7 +313,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 2:								\
-		asm("\n\tmov "__percpu_arg(1)",%%ax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%ax"		\
 		    "\n1:\tcmpxchgw %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -321,7 +321,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 4:								\
-		asm("\n\tmov "__percpu_arg(1)",%%eax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%eax"		\
 		    "\n1:\tcmpxchgl %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -329,7 +329,7 @@ do {									\
 			    : "memory");				\
 		break;							\
 	case 8:								\
-		asm("\n\tmov "__percpu_arg(1)",%%rax"			\
+		asm qual ("\n\tmov "__percpu_arg(1)",%%rax"		\
 		    "\n1:\tcmpxchgq %2, "__percpu_arg(1)		\
 		    "\n\tjnz 1b"					\
 			    : "=&a" (pxo_ret__), "+m" (var)		\
@@ -345,32 +345,32 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(var, oval, nval)				\
+#define percpu_cmpxchg_op(qual, var, oval, nval)			\
 ({									\
 	typeof(var) pco_ret__;						\
 	typeof(var) pco_old__ = (oval);					\
 	typeof(var) pco_new__ = (nval);					\
 	switch (sizeof(var)) {						\
 	case 1:								\
-		asm("cmpxchgb %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgb %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "q" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 2:								\
-		asm("cmpxchgw %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgw %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 4:								\
-		asm("cmpxchgl %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgl %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
 		break;							\
 	case 8:								\
-		asm("cmpxchgq %2, "__percpu_arg(1)			\
+		asm qual ("cmpxchgq %2, "__percpu_arg(1)		\
 			    : "=a" (pco_ret__), "+m" (var)		\
 			    : "r" (pco_new__), "0" (pco_old__)		\
 			    : "memory");				\
@@ -391,58 +391,58 @@ do {									\
  */
 #define this_cpu_read_stable(var)	percpu_stable_op("mov", var)
 
-#define raw_cpu_read_1(pcp)		percpu_from_op("mov", pcp)
-#define raw_cpu_read_2(pcp)		percpu_from_op("mov", pcp)
-#define raw_cpu_read_4(pcp)		percpu_from_op("mov", pcp)
-
-#define raw_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_1(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_add_2(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_add_4(pcp, val)		percpu_add_op((pcp), val)
-#define raw_cpu_and_1(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_2(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_and_4(pcp, val)		percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(pcp, val)
-#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(pcp, val)
-
-#define this_cpu_read_1(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_read_2(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_read_4(pcp)		percpu_from_op("mov", pcp)
-#define this_cpu_write_1(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_2(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_write_4(pcp, val)	percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_1(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_add_2(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_add_4(pcp, val)	percpu_add_op((pcp), val)
-#define this_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
-#define this_cpu_or_1(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_or_2(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_or_4(pcp, val)		percpu_to_op("or", (pcp), val)
-#define this_cpu_xchg_1(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(pcp, nval)
-
-#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_1(pcp)		percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_2(pcp)		percpu_from_op(, "mov", pcp)
+#define raw_cpu_read_4(pcp)		percpu_from_op(, "mov", pcp)
+
+#define raw_cpu_write_1(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_2(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_write_4(pcp, val)	percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_1(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_add_2(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_add_4(pcp, val)		percpu_add_op(, (pcp), val)
+#define raw_cpu_and_1(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_2(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_and_4(pcp, val)		percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_1(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_2(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_or_4(pcp, val)		percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(, pcp, val)
+#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(, pcp, val)
+
+#define this_cpu_read_1(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_2(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_read_4(pcp)		percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_1(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_2(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_write_4(pcp, val)	percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_1(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_2(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_add_4(pcp, val)	percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_1(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_2(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_and_4(pcp, val)	percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_1(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_2(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_or_4(pcp, val)		percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_xchg_1(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(volatile, pcp, nval)
+
+#define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -466,23 +466,23 @@ do {									\
  * 32 bit must fall back to generic operations.
  */
 #ifdef CONFIG_X86_64
-#define raw_cpu_read_8(pcp)			percpu_from_op("mov", pcp)
-#define raw_cpu_write_8(pcp, val)		percpu_to_op("mov", (pcp), val)
-#define raw_cpu_add_8(pcp, val)			percpu_add_op((pcp), val)
-#define raw_cpu_and_8(pcp, val)			percpu_to_op("and", (pcp), val)
-#define raw_cpu_or_8(pcp, val)			percpu_to_op("or", (pcp), val)
-#define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(pcp, val)
-#define raw_cpu_xchg_8(pcp, nval)		percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-
-#define this_cpu_read_8(pcp)			percpu_from_op("mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op("mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op((pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op("and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op("or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
+#define raw_cpu_read_8(pcp)			percpu_from_op(, "mov", pcp)
+#define raw_cpu_write_8(pcp, val)		percpu_to_op(, "mov", (pcp), val)
+#define raw_cpu_add_8(pcp, val)			percpu_add_op(, (pcp), val)
+#define raw_cpu_and_8(pcp, val)			percpu_to_op(, "and", (pcp), val)
+#define raw_cpu_or_8(pcp, val)			percpu_to_op(, "or", (pcp), val)
+#define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(, pcp, val)
+#define raw_cpu_xchg_8(pcp, nval)		percpu_xchg_op(, pcp, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(, pcp, oval, nval)
+
+#define this_cpu_read_8(pcp)			percpu_from_op(volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)		percpu_to_op(volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)		percpu_add_op(volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)		percpu_to_op(volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)			percpu_to_op(volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(volatile, pcp, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(volatile, pcp, oval, nval)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
@ 2019-02-27 16:14   ` Linus Torvalds
  2019-02-27 16:48     ` Peter Zijlstra
  2019-02-27 17:57     ` Nadav Amit
  0 siblings, 2 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
	Nadav Amit, Linux List Kernel Mailing, Nadav Amit

[-- Attachment #1: Type: text/plain, Size: 1078 bytes --]

On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Nadav Amit reported that commit:
>
>   b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>
> added a bunch of constraints to all sorts of code; and while some of
> that was correct and desired, some of that seems superfluous.

Hmm.

I have the strong feeling that we should instead relax this_cpu_read()
again a bit.

In particular, making it "asm volatile" really is a big hammer
approach. It's worth noting that the *other* this_cpu_xyz ops don't
even do that.

I would suggest that instead of making "this_cpu_read()" be asm
volatile, we mark it as potentially changing the memory location it is
touching - the same way the modify/write ops do.

That still means that the read will be forced (like READ_ONCE()), but
allows gcc a bit more flexibility in instruction scheduling, I think.

Trivial (but entirely untested) patch attached.

That said, I didn't actually check how it affects code generation.
Nadav, would you check the code sequences you originally noticed?

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1348 bytes --]

 arch/x86/include/asm/percpu.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1a19d11cfbbd..63b0f361533f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -185,24 +185,24 @@ do {									\
 	typeof(var) pfo_ret__;				\
 	switch (sizeof(var)) {				\
 	case 1:						\
-		asm volatile(op "b "__percpu_arg(1)",%0"\
-		    : "=q" (pfo_ret__)			\
-		    : "m" (var));			\
+		asm(op "b "__percpu_arg(1)",%0"\
+		    : "=q" (pfo_ret__),			\
+		      "+m" (var));			\
 		break;					\
 	case 2:						\
-		asm volatile(op "w "__percpu_arg(1)",%0"\
-		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		asm(op "w "__percpu_arg(1)",%0"\
+		    : "=r" (pfo_ret__),			\
+		      "+m" (var));			\
 		break;					\
 	case 4:						\
-		asm volatile(op "l "__percpu_arg(1)",%0"\
-		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		asm(op "l "__percpu_arg(1)",%0"\
+		    : "=r" (pfo_ret__),			\
+		      "+m" (var));			\
 		break;					\
 	case 8:						\
-		asm volatile(op "q "__percpu_arg(1)",%0"\
-		    : "=r" (pfo_ret__)			\
-		    : "m" (var));			\
+		asm(op "q "__percpu_arg(1)",%0"\
+		    : "=r" (pfo_ret__),			\
+		      "+m" (var));			\
 		break;					\
 	default: __bad_percpu_size();			\
 	}						\

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 16:14   ` Linus Torvalds
@ 2019-02-27 16:48     ` Peter Zijlstra
  2019-02-27 17:17       ` Linus Torvalds
  2019-02-27 17:57     ` Nadav Amit
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
	Nadav Amit, Linux List Kernel Mailing, Nadav Amit

On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Nadav Amit reported that commit:
> >
> >   b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
> >
> > added a bunch of constraints to all sorts of code; and while some of
> > that was correct and desired, some of that seems superfluous.
> 
> Hmm.
> 
> I have the strong feeling that we should instead relax this_cpu_read()
> again a bit.
> 
> In particular, making it "asm volatile" really is a big hammer
> approach. It's worth noting that the *other* this_cpu_xyz ops don't
> even do that.

Right, this patch 'fixes' that :-)

> I would suggest that instead of making "this_cpu_read()" be asm
> volatile, we mark it as potentially changing the memory location it is
> touching - the same way the modify/write ops do.
> 
> That still means that the read will be forced (like READ_ONCE()), but
> allows gcc a bit more flexibility in instruction scheduling, I think.

Ah, fair enough, I'll spin a version of this patch with "+m" for
this_cpu and "m" for raw_cpu.

> That said, I didn't actually check how it affects code generation.
> Nadav, would you check the code sequences you originally noticed?

Much of it was the ONCE behaviour defeating CSE I think, but yes, it
would be good to have another look.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 16:48     ` Peter Zijlstra
@ 2019-02-27 17:17       ` Linus Torvalds
  2019-02-27 17:34         ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
	Nadav Amit, Linux List Kernel Mailing, Nadav Amit

On Wed, Feb 27, 2019 at 8:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote:
> > In particular, making it "asm volatile" really is a big hammer
> > approach. It's worth noting that the *other* this_cpu_xyz ops don't
> > even do that.
>
> Right, this patch 'fixes' that :-)

Yeah, but I do hate seeing these patches that randomly add double
underscore versions just because the regular one is so bad.

So I'd much rather improve the regular this_cpu_read() instead, and
hope that we don't need to have this kind of big difference between
that and the double-underscore version.

I'm not convinced that the "asm volatile" is right on the _other_ ops
either. You added them to cmpxchg and xadd too, and it's not obvious
that they should have them. They have the "memory" clobber to order
them wrt actual memory ops, why are they now "asm volatile"?

So I don't really like this patch that randomly adds volatile to
things, and then removes it from one special case that I don't think
should have had it in the first place.

It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad.

                      Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 17:17       ` Linus Torvalds
@ 2019-02-27 17:34         ` Peter Zijlstra
  2019-02-27 17:38           ` Linus Torvalds
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 17:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
	Nadav Amit, Linux List Kernel Mailing, Nadav Amit

On Wed, Feb 27, 2019 at 09:17:42AM -0800, Linus Torvalds wrote:
> It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad.

Ah, all I wanted was the ONCE thing and my inline asm foo sucks. If +m
gets us that, awesome.

But the ONCE thing defeats CSE (on purpose!) so for code-gen that
likes/wants that we then have to use the __this_cpu crud.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 17:34         ` Peter Zijlstra
@ 2019-02-27 17:38           ` Linus Torvalds
  0 siblings, 0 replies; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski,
	Nadav Amit, Linux List Kernel Mailing, Nadav Amit

On Wed, Feb 27, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> But the ONCE thing defeats CSE (on purpose!) so for code-gen that
> likes/wants that we then have to use the __this_cpu crud.

Right. But I do think that if you want CSE on a percpu access, you
might want to write it as such (ie load the value once and then re-use
the value).

But I'm not sure exactly which accesses Nadav noticed to be a problem,
so.. I guess I should just look at the other patches, but I found them
rather nasty and ad-hoc too so I just skimmed them with an "eww" ;)

                  Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 16:14   ` Linus Torvalds
  2019-02-27 16:48     ` Peter Zijlstra
@ 2019-02-27 17:57     ` Nadav Amit
  2019-02-27 18:55       ` Nadav Amit
  2019-02-27 19:41       ` Linus Torvalds
  1 sibling, 2 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 17:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> Nadav Amit reported that commit:
>> 
>>  b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>> 
>> added a bunch of constraints to all sorts of code; and while some of
>> that was correct and desired, some of that seems superfluous.
> 
> Trivial (but entirely untested) patch attached.
> 
> That said, I didn't actually check how it affects code generation.
> Nadav, would you check the code sequences you originally noticed?

The original issue was raised while I was looking into a dropped patch of
Matthew Wilcox that caused code size increase [1]. As a result I noticed
that Peter’s patch caused big changes to the generated assembly across the
kernel - I did not have a specific scenario that I cared about.

The patch you sent (“+m/-volatile”) does increase the code size by 1728
bytes. Although code size is not the only metric for “code optimization”,
the original patch of Peter (“volatile”) only increased the code size by 201
bytes. Peter’s original change also affected only 72 functions vs 228 that
impacted by the new patch.

I’ll have a look at some specific function assembly, but overall, the “+m”
approach might prevent even more code optimizations than the “volatile” one.

I’ll send an example or two later.

Regards,
Nadav

[1] https://marc.info/?l=linux-mm&m=154341370216693&w=2

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 17:57     ` Nadav Amit
@ 2019-02-27 18:55       ` Nadav Amit
  2019-02-27 19:41       ` Linus Torvalds
  1 sibling, 0 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

> On Feb 27, 2019, at 9:57 AM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> 
>> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>> Nadav Amit reported that commit:
>>> 
>>> b59167ac7baf ("x86/percpu: Fix this_cpu_read()")
>>> 
>>> added a bunch of constraints to all sorts of code; and while some of
>>> that was correct and desired, some of that seems superfluous.
>> 
>> Trivial (but entirely untested) patch attached.
>> 
>> That said, I didn't actually check how it affects code generation.
>> Nadav, would you check the code sequences you originally noticed?
> 
> The original issue was raised while I was looking into a dropped patch of
> Matthew Wilcox that caused code size increase [1]. As a result I noticed
> that Peter’s patch caused big changes to the generated assembly across the
> kernel - I did not have a specific scenario that I cared about.
> 
> The patch you sent (“+m/-volatile”) does increase the code size by 1728
> bytes. Although code size is not the only metric for “code optimization”,
> the original patch of Peter (“volatile”) only increased the code size by 201
> bytes. Peter’s original change also affected only 72 functions vs 228 that
> impacted by the new patch.
> 
> I’ll have a look at some specific function assembly, but overall, the “+m”
> approach might prevent even more code optimizations than the “volatile” one.
> 
> I’ll send an example or two later.

Here is one example:

Dump of assembler code for function event_filter_pid_sched_wakeup_probe_pre:
   0xffffffff8117c510 <+0>:	push   %rbp
   0xffffffff8117c511 <+1>:	mov    %rsp,%rbp
   0xffffffff8117c514 <+4>:	push   %rbx
   0xffffffff8117c515 <+5>:	mov    0x28(%rdi),%rax
   0xffffffff8117c519 <+9>:	mov    %gs:0x78(%rax),%dl
   0xffffffff8117c51d <+13>:	test   %dl,%dl
   0xffffffff8117c51f <+15>:	je     0xffffffff8117c535 <event_filter_pid_sched_wakeup_probe_pre+37>
   0xffffffff8117c521 <+17>:	mov    %rdi,%rax
   0xffffffff8117c524 <+20>:	mov    0x78(%rdi),%rdi
   0xffffffff8117c528 <+24>:	mov    0x28(%rax),%rbx   # REDUNDANT
   0xffffffff8117c52c <+28>:	callq  0xffffffff81167830 <trace_ignore_this_task>
   0xffffffff8117c531 <+33>:	mov    %al,%gs:0x78(%rbx)
   0xffffffff8117c535 <+37>:	pop    %rbx
   0xffffffff8117c536 <+38>:	pop    %rbp
   0xffffffff8117c537 <+39>:	retq   

The instruction at 0xffffffff8117c528 is redundant, and does not exist
without the recent patch. It seems to be a result of no-strict-aliasing,
which due to the new "memory write” (“+m”) causes the compiler to re-read
the data.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 17:57     ` Nadav Amit
  2019-02-27 18:55       ` Nadav Amit
@ 2019-02-27 19:41       ` Linus Torvalds
  2019-03-08 13:35         ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 19:41 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
>
> I’ll have a look at some specific function assembly, but overall, the “+m”
> approach might prevent even more code optimizations than the “volatile” one.

Ok, that being the case, let's forget that patch.

I still wonder about the added volatiles to the xadd/cmpxchg cases,
which already had the "memory" clobber which should make the volatile
immaterial..

                 Linus

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 19:41       ` Linus Torvalds
@ 2019-03-08 13:35         ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 13:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nadav Amit, Ingo Molnar, Borislav Petkov, Thomas Gleixner,
	Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

On Wed, Feb 27, 2019 at 11:41:31AM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
> >
> > I’ll have a look at some specific function assembly, but overall, the “+m”
> > approach might prevent even more code optimizations than the “volatile” one.
> 
> Ok, that being the case, let's forget that patch.
> 
> I still wonder about the added volatiles to the xadd/cmpxchg cases,
> which already had the "memory" clobber which should make the volatile
> immaterial..

That was mostly me being OCD style consistent; but also notice that the
atomic ops also often have volatile even though they have a memory
clobber.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 2/5] x86/percpu: Relax smp_processor_id()
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit

Nadav reported that since this_cpu_read() became asm-volatile, many
smp_processor_id() users generated worse code due to the extra
constraints.

However since smp_processor_id() is reading a stable value, we can use
__this_cpu_read().

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/smp.h |    3 ++-
 include/linux/smp.h        |   45 +++++++++++++++++++++++++++++++--------------
 2 files changed, 33 insertions(+), 15 deletions(-)

--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -162,7 +162,8 @@ __visible void smp_call_function_single_
  * from the initial startup. We map APIC_BASE very early in page_setup(),
  * so this is correct in the x86 case.
  */
-#define raw_smp_processor_id() (this_cpu_read(cpu_number))
+#define raw_smp_processor_id()  this_cpu_read(cpu_number)
+#define __smp_processor_id() __this_cpu_read(cpu_number)
 
 #ifdef CONFIG_X86_32
 extern int safe_smp_processor_id(void);
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -181,29 +181,46 @@ static inline int get_boot_cpu_id(void)
 
 #endif /* !SMP */
 
-/*
- * smp_processor_id(): get the current CPU ID.
+/**
+ * raw_processor_id() - get the current (unstable) CPU id
+ *
+ * For then you know what you are doing and need an unstable
+ * CPU id.
+ */
+
+/**
+ * smp_processor_id() - get the current (stable) CPU id
+ *
+ * This is the normal accessor to the CPU id and should be used
+ * whenever possible.
  *
- * if DEBUG_PREEMPT is enabled then we check whether it is
- * used in a preemption-safe way. (smp_processor_id() is safe
- * if it's used in a preemption-off critical section, or in
- * a thread that is bound to the current CPU.)
+ * The CPU id is stable when:
+ *
+ *  - IRQs are disabled;
+ *  - preemption is disabled;
+ *  - the task is CPU affine.
  *
- * NOTE: raw_smp_processor_id() is for internal use only
- * (smp_processor_id() is the preferred variant), but in rare
- * instances it might also be used to turn off false positives
- * (i.e. smp_processor_id() use that the debugging code reports but
- * which use for some reason is legal). Don't use this to hack around
- * the warning message, as your code might not work under PREEMPT.
+ * When CONFIG_DEBUG_PREEMPT; we verify these assumption and WARN
+ * when smp_processor_id() is used when the CPU id is not stable.
  */
+
+/*
+ * Allow the architecture to differentiate between a stable and unstable read.
+ * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
+ * regular asm read for the stable.
+ */
+#ifndef __smp_processor_id
+#define __smp_processor_id(x) raw_smp_processor_id(x)
+#endif
+
 #ifdef CONFIG_DEBUG_PREEMPT
   extern unsigned int debug_smp_processor_id(void);
 # define smp_processor_id() debug_smp_processor_id()
 #else
-# define smp_processor_id() raw_smp_processor_id()
+# define smp_processor_id() __smp_processor_id()
 #endif
 
-#define get_cpu()		({ preempt_disable(); smp_processor_id(); })
+#define get_cpu()		({ preempt_disable(); __smp_processor_id(); })
 #define put_cpu()		preempt_enable()
 
 /*



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs()
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit

Nadav reported that since the this_cpu_*() ops got asm-volatile
constraints on, code generation suffered for do_IRQ(), but since this
is all with IRQs disabled we can use __this_cpu_*().

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/irq_regs.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/irq_regs.h
+++ b/arch/x86/include/asm/irq_regs.h
@@ -16,7 +16,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_re
 
 static inline struct pt_regs *get_irq_regs(void)
 {
-	return this_cpu_read(irq_regs);
+	return __this_cpu_read(irq_regs);
 }
 
 static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
@@ -24,7 +24,7 @@ static inline struct pt_regs *set_irq_re
 	struct pt_regs *old_regs;
 
 	old_regs = get_irq_regs();
-	this_cpu_write(irq_regs, new_regs);
+	__this_cpu_write(irq_regs, new_regs);
 
 	return old_regs;
 }



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
  2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit

Almost all of this is ran with IRQs disabled and therefore doesn't
need the extra constraints on the this_cpu_*() ops, use __this_cpu_*()
to alleviate this.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/tlb.c |   62 +++++++++++++++++++++++++++---------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -58,15 +58,15 @@ static void clear_asid_other(void)
 
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		/* Do not need to flush the current asid */
-		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+		if (asid == __this_cpu_read(cpu_tlbstate.loaded_mm_asid))
 			continue;
 		/*
 		 * Make sure the next time we go to switch to
 		 * this asid, we do a flush:
 		 */
-		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+		__this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
 	}
-	this_cpu_write(cpu_tlbstate.invalidate_other, false);
+	__this_cpu_write(cpu_tlbstate.invalidate_other, false);
 }
 
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
@@ -83,16 +83,16 @@ static void choose_new_asid(struct mm_st
 		return;
 	}
 
-	if (this_cpu_read(cpu_tlbstate.invalidate_other))
+	if (__this_cpu_read(cpu_tlbstate.invalidate_other))
 		clear_asid_other();
 
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
-		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
+		if (__this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
 			continue;
 
 		*new_asid = asid;
-		*need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
+		*need_flush = (__this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
 			       next_tlb_gen);
 		return;
 	}
@@ -101,10 +101,10 @@ static void choose_new_asid(struct mm_st
 	 * We don't currently own an ASID slot on this CPU.
 	 * Allocate a slot.
 	 */
-	*new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
+	*new_asid = __this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1;
 	if (*new_asid >= TLB_NR_DYN_ASIDS) {
 		*new_asid = 0;
-		this_cpu_write(cpu_tlbstate.next_asid, 1);
+		__this_cpu_write(cpu_tlbstate.next_asid, 1);
 	}
 	*need_flush = true;
 }
@@ -245,7 +245,7 @@ static void cond_ibpb(struct task_struct
 		 * cpu_tlbstate.last_user_mm_ibpb for comparison.
 		 */
 		next_mm = mm_mangle_tif_spec_ib(next);
-		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
+		prev_mm = __this_cpu_read(cpu_tlbstate.last_user_mm_ibpb);
 
 		/*
 		 * Issue IBPB only if the mm's are different and one or
@@ -255,7 +255,7 @@ static void cond_ibpb(struct task_struct
 		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
 			indirect_branch_prediction_barrier();
 
-		this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
+		__this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm);
 	}
 
 	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
@@ -264,9 +264,9 @@ static void cond_ibpb(struct task_struct
 		 * different context than the user space task which ran
 		 * last on this CPU.
 		 */
-		if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
+		if (__this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) {
 			indirect_branch_prediction_barrier();
-			this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
+			__this_cpu_write(cpu_tlbstate.last_user_mm, next->mm);
 		}
 	}
 }
@@ -274,9 +274,9 @@ static void cond_ibpb(struct task_struct
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			struct task_struct *tsk)
 {
-	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
-	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+	struct mm_struct *real_prev = __this_cpu_read(cpu_tlbstate.loaded_mm);
+	u16 prev_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	bool was_lazy = __this_cpu_read(cpu_tlbstate.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.is_lazy, false);
+	__this_cpu_write(cpu_tlbstate.is_lazy, false);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
@@ -330,7 +330,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * memory barrier and core serializing instruction.
 	 */
 	if (real_prev == next) {
-		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
+		VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
 			   next->context.ctx_id);
 
 		/*
@@ -358,7 +358,7 @@ void switch_mm_irqs_off(struct mm_struct
 		 */
 		smp_mb();
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-		if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
+		if (__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
 				next_tlb_gen)
 			return;
 
@@ -406,13 +406,13 @@ void switch_mm_irqs_off(struct mm_struct
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
 		/* Let nmi_uaccess_okay() know that we're changing CR3. */
-		this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
+		__this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
 		barrier();
 	}
 
 	if (need_flush) {
-		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
-		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+		__this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+		__this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
 		load_new_mm_cr3(next->pgd, new_asid, true);
 
 		/*
@@ -435,8 +435,8 @@ void switch_mm_irqs_off(struct mm_struct
 	/* Make sure we write CR3 before loaded_mm. */
 	barrier();
 
-	this_cpu_write(cpu_tlbstate.loaded_mm, next);
-	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+	__this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	__this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
 
 	if (next != real_prev) {
 		load_mm_cr4(next);
@@ -529,10 +529,10 @@ static void flush_tlb_func_common(const
 	 * - f->new_tlb_gen: the generation that the requester of the flush
 	 *                   wants us to catch up to.
 	 */
-	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	struct mm_struct *loaded_mm = __this_cpu_read(cpu_tlbstate.loaded_mm);
+	u32 loaded_mm_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
-	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	u64 local_tlb_gen = __this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
 
 	/* This code cannot presently handle being reentered. */
 	VM_WARN_ON(!irqs_disabled());
@@ -540,10 +540,10 @@ static void flush_tlb_func_common(const
 	if (unlikely(loaded_mm == &init_mm))
 		return;
 
-	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
+	VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+	if (__this_cpu_read(cpu_tlbstate.is_lazy)) {
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
@@ -631,7 +631,7 @@ static void flush_tlb_func_common(const
 	}
 
 	/* Both paths above update our state to mm_tlb_gen. */
-	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
+	__this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
 static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
@@ -647,7 +647,7 @@ static void flush_tlb_func_remote(void *
 
 	inc_irq_stat(irq_tlb_count);
 
-	if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+	if (f->mm && f->mm != __this_cpu_read(cpu_tlbstate.loaded_mm))
 		return;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
@@ -749,7 +749,7 @@ void flush_tlb_mm_range(struct mm_struct
 		info.end = TLB_FLUSH_ALL;
 	}
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+	if (mm == __this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		VM_WARN_ON(irqs_disabled());
 		local_irq_disable();
 		flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock()
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
@ 2019-02-27 10:12 ` Peter Zijlstra
  2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit

Nadav reported that code-gen changed because of the this_cpu_*()
constraints, avoid this for select_idle_cpu() because that runs with
preemption (and IRQs) disabled anyway.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6160,6 +6160,7 @@ static int select_idle_cpu(struct task_s
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int this = smp_processor_id();
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6183,7 +6184,7 @@ static int select_idle_cpu(struct task_s
 			nr = 4;
 	}
 
-	time = local_clock();
+	time = cpu_clock(this);
 
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
 		if (!--nr)
@@ -6194,7 +6195,7 @@ static int select_idle_cpu(struct task_s
 			break;
 	}
 
-	time = local_clock() - time;
+	time = cpu_clock(this) - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;



^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg()
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
@ 2019-02-27 10:24 ` Peter Zijlstra
  2019-02-27 14:43   ` Peter Zijlstra
  2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
  2019-03-08 14:50 ` Peter Zijlstra
  7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:24 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel


And because it's one of _those_ days, I forgot to include one patch...

---
Subject: x86/percpu: Optimize raw_cpu_xchg()
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Feb 27 11:09:56 CET 2019

Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
loop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/percpu.h |   18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -407,9 +407,21 @@ do {									\
 #define raw_cpu_or_1(pcp, val)		percpu_to_op(, "or", (pcp), val)
 #define raw_cpu_or_2(pcp, val)		percpu_to_op(, "or", (pcp), val)
 #define raw_cpu_or_4(pcp, val)		percpu_to_op(, "or", (pcp), val)
-#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(, pcp, val)
-#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(, pcp, val)
+
+/*
+ * raw_cpu_xchg() can use a load-store since it is not required to be
+ * IRQ-safe.
+ */
+#define raw_percpu_xchg_op(var, nval)					\
+({									\
+	typeof(var) pxo_ret__ = raw_cpu_read(var);			\
+	raw_cpu_write(var, (nval));					\
+	pxo_ret__;							\
+})
+
+#define raw_cpu_xchg_1(pcp, val)	raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_2(pcp, val)	raw_percpu_xchg_op(pcp, val)
+#define raw_cpu_xchg_4(pcp, val)	raw_percpu_xchg_op(pcp, val)
 
 #define this_cpu_read_1(pcp)		percpu_from_op(volatile, "mov", pcp)
 #define this_cpu_read_2(pcp)		percpu_from_op(volatile, "mov", pcp)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg()
  2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
@ 2019-02-27 14:43   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 14:43 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel

On Wed, Feb 27, 2019 at 11:24:45AM +0100, Peter Zijlstra wrote:
> 
> And because it's one of _those_ days, I forgot to include one patch...
> 
> ---
> Subject: x86/percpu: Optimize raw_cpu_xchg()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed Feb 27 11:09:56 CET 2019
> 
> Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
> this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
> loop.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/percpu.h |   18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> --- a/arch/x86/include/asm/percpu.h
> +++ b/arch/x86/include/asm/percpu.h
> @@ -407,9 +407,21 @@ do {									\
>  #define raw_cpu_or_1(pcp, val)		percpu_to_op(, "or", (pcp), val)
>  #define raw_cpu_or_2(pcp, val)		percpu_to_op(, "or", (pcp), val)
>  #define raw_cpu_or_4(pcp, val)		percpu_to_op(, "or", (pcp), val)
> -#define raw_cpu_xchg_1(pcp, val)	percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_2(pcp, val)	percpu_xchg_op(, pcp, val)
> -#define raw_cpu_xchg_4(pcp, val)	percpu_xchg_op(, pcp, val)
> +
> +/*
> + * raw_cpu_xchg() can use a load-store since it is not required to be
> + * IRQ-safe.
> + */
> +#define raw_percpu_xchg_op(var, nval)					\
> +({									\
> +	typeof(var) pxo_ret__ = raw_cpu_read(var);			\
> +	raw_cpu_write(var, (nval));					\
> +	pxo_ret__;							\
> +})
> +
> +#define raw_cpu_xchg_1(pcp, val)	raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_2(pcp, val)	raw_percpu_xchg_op(pcp, val)
> +#define raw_cpu_xchg_4(pcp, val)	raw_percpu_xchg_op(pcp, val)
>  
>  #define this_cpu_read_1(pcp)		percpu_from_op(volatile, "mov", pcp)
>  #define this_cpu_read_2(pcp)		percpu_from_op(volatile, "mov", pcp)

And yes, I just added raw_cpu_xchg_8... *sigh*

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
@ 2019-02-27 23:16 ` Nadav Amit
  2019-03-08 14:50 ` Peter Zijlstra
  7 siblings, 0 replies; 23+ messages in thread
From: Nadav Amit @ 2019-02-27 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds@linux-foundation.org, mingo@kernel.org, bp@alien8.de,
	tglx@linutronix.de, luto@kernel.org, linux-kernel@vger.kernel.org

> On Feb 27, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> Hi all,
> 
> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
> 
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
> 
> Built and boot tested with CONFIG_DEBUG_PREEMPT=y.

Overall this series affects 70 functions and shortens the code by 326 bytes.
_local_bh_enable() for example is shorten by 14 bytes (26%).

I must admit that I although I pointed some of these issues before, I am not
sure whether they are really important...

Recently, I tried to see how to make the compiler to generate “better code”
from Linux. I sprinkled “pure” attribute on many common function (e.g.,
page_rmapping()), sg_next()); sprinkled const-attribute on some others
(e.g., jiffies_to_msecs()); created a const-alias variable so the compiler
would consider kaslr variables and sme_me_mask as constant after
initialization, and so on.

I was then looking at the changed code, and while some functions were
shorter and some longer, many common functions did look “better”. The only
problem was that any benchmark that I did not show any measurable impact.

So perhaps it is a matter of measurement, but eventually right now there is
no clean win.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
                   ` (6 preceding siblings ...)
  2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
@ 2019-03-08 14:50 ` Peter Zijlstra
  2019-03-08 19:35   ` Nadav Amit
  7 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 14:50 UTC (permalink / raw)
  To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 21199 bytes --]

On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:

> This is a collection of x86/percpu changes that I had pending and got reminded
> of by Linus' comment yesterday about __this_cpu_xchg().
> 
> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.

(Sorry; this is going to have _wide_ output)

OK, so what I did is I build 4 kernels (O=defconfig-build{,1,2,3}) with
resp that many patches of this series applied.

When I look at just the vmlinux size output:

$ size defconfig-build*/vmlinux
text    data     bss     dec     hex filename
19540631        5040164 1871944 26452739        193a303 defconfig-build/vmlinux
19540635        5040164 1871944 26452743        193a307 defconfig-build1/vmlinux
19540685        5040164 1871944 26452793        193a339 defconfig-build2/vmlinux
19540685        5040164 1871944 26452793        193a339 defconfig-build3/vmlinux

Things appear to get slightly larger; however when I look in more
detail using my (newly written compare script, find attached), I get
things like:

$ ./compare.sh defconfig-build defconfig-build1
arch/x86/mm/fault.o                                     12850      12818   -32
kernel/power/process.o                                   3586       3706   +120
kernel/locking/rtmutex.o                                 1687       1671   -16
kernel/sched/core.o                                      7127       7181   +54
kernel/time/tick-sched.o                                 8941       8837   -104
kernel/exit.o                                             310        385   +75
kernel/softirq.o                                         1217       1199   -18
kernel/workqueue.o                                       3240       3288   +48
net/ipv6/tcp_ipv6.o                                     25434      25345   -89
net/ipv4/tcp_ipv4.o                                       301        305   +4
                                             total    4768226    4768268   +42

When we look at just tick-sched.o:

$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o
can_stop_idle_tick.isra.14                                146        139   -7

we see a totally different number ?!

$ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o can_stop_idle_tick.isra.14
0000 0000000000000680 <can_stop_idle_tick.isra.14>:                                                     | 0000 0000000000000680 <can_stop_idle_tick.isra.14>:
0000  680:      53                      push   %rbx                                                     | 0000  680:    53                      push   %rbx
0001  681:      89 f8                   mov    %edi,%eax                                                | 0001  681:    89 f8                   mov    %edi,%eax
0003  683:      48 0f a3 05 00 00 00    bt     %rax,0x0(%rip)        # 68b <can_stop_id                 | 0003  683:    48 0f a3 05 00 00 00    bt     %rax,0x0(%rip)        # 68b <can_stop_id
000a  68a:      00                                                                                      | 000a  68a:    00
0007                    687: R_X86_64_PC32      __cpu_online_mask-0x4                                   | 0007                  687: R_X86_64_PC32      __cpu_online_mask-0x4
000b  68b:      0f 92 c3                setb   %bl                                                      | 000b  68b:    0f 92 c3                setb   %bl
000e  68e:      73 67                   jae    6f7 <can_stop_idle_tick.isra.14+0x77>                    \ 000e  68e:    73 48                   jae    6d8 <can_stop_idle_tick.isra.14+0x58>
0010  690:      8b 06                   mov    (%rsi),%eax                                              | 0010  690:    8b 06                   mov    (%rsi),%eax
0012  692:      85 c0                   test   %eax,%eax                                                | 0012  692:    85 c0                   test   %eax,%eax
0014  694:      74 21                   je     6b7 <can_stop_idle_tick.isra.14+0x37>                    | 0014  694:    74 21                   je     6b7 <can_stop_idle_tick.isra.14+0x37>
0016  696:      65 48 8b 04 25 00 00    mov    %gs:0x0,%rax                                             | 0016  696:    65 48 8b 04 25 00 00    mov    %gs:0x0,%rax
001d  69d:      00 00                                                                                   | 001d  69d:    00 00
001b                    69b: R_X86_64_32S       current_task                                            | 001b                  69b: R_X86_64_32S       current_task
001f  69f:      48 8b 00                mov    (%rax),%rax                                              | 001f  69f:    48 8b 00                mov    (%rax),%rax
0022  6a2:      a8 08                   test   $0x8,%al                                                 | 0022  6a2:    a8 08                   test   $0x8,%al
0024  6a4:      75 11                   jne    6b7 <can_stop_idle_tick.isra.14+0x37>                    | 0024  6a4:    75 11                   jne    6b7 <can_stop_idle_tick.isra.14+0x37>
0026  6a6:      65 66 8b 05 00 00 00    mov    %gs:0x0(%rip),%ax        # 6ae <can_stop                 \ 0026  6a6:    65 66 8b 35 00 00 00    mov    %gs:0x0(%rip),%si        # 6ae <can_stop
002d  6ad:      00                                                                                      | 002d  6ad:    00
002a                    6aa: R_X86_64_PC32      irq_stat-0x4                                            | 002a                  6aa: R_X86_64_PC32      irq_stat-0x4
002e  6ae:      66 85 c0                test   %ax,%ax                                                  \ 002e  6ae:    66 85 f6                test   %si,%si
0031  6b1:      75 0a                   jne    6bd <can_stop_idle_tick.isra.14+0x3d>                    | 0031  6b1:    75 0a                   jne    6bd <can_stop_idle_tick.isra.14+0x3d>
0033  6b3:      89 d8                   mov    %ebx,%eax                                                | 0033  6b3:    89 d8                   mov    %ebx,%eax
0035  6b5:      5b                      pop    %rbx                                                     | 0035  6b5:    5b                      pop    %rbx
0036  6b6:      c3                      retq                                                            | 0036  6b6:    c3                      retq
0037  6b7:      31 db                   xor    %ebx,%ebx                                                | 0037  6b7:    31 db                   xor    %ebx,%ebx
0039  6b9:      89 d8                   mov    %ebx,%eax                                                | 0039  6b9:    89 d8                   mov    %ebx,%eax
003b  6bb:      5b                      pop    %rbx                                                     | 003b  6bb:    5b                      pop    %rbx
003c  6bc:      c3                      retq                                                            | 003c  6bc:    c3                      retq
003d  6bd:      31 db                   xor    %ebx,%ebx                                                | 003d  6bd:    31 db                   xor    %ebx,%ebx
003f  6bf:      83 3d 00 00 00 00 09    cmpl   $0x9,0x0(%rip)        # 6c6 <can_stop_id                 | 003f  6bf:    83 3d 00 00 00 00 09    cmpl   $0x9,0x0(%rip)        # 6c6 <can_stop_id
0041                    6c1: R_X86_64_PC32      .bss-0x5                                                | 0041                  6c1: R_X86_64_PC32      .bss-0x5
0046  6c6:      7f eb                   jg     6b3 <can_stop_idle_tick.isra.14+0x33>                    | 0046  6c6:    7f eb                   jg     6b3 <can_stop_idle_tick.isra.14+0x33>
0048  6c8:      65 66 8b 05 00 00 00    mov    %gs:0x0(%rip),%ax        # 6d0 <can_stop                 \ 0048  6c8:    0f b7 f6                movzwl %si,%esi
004f  6cf:      00                                                                                      \ 004b  6cb:    f7 c6 ff fd 00 00       test   $0xfdff,%esi
004c                    6cc: R_X86_64_PC32      irq_stat-0x4                                            \ 0051  6d1:    74 e0                   je     6b3 <can_stop_idle_tick.isra.14+0x33>
0050  6d0:      a9 ff fd 00 00          test   $0xfdff,%eax                                             \ 0053  6d3:    e9 00 00 00 00          jmpq   6d8 <can_stop_idle_tick.isra.14+0x58>
0055  6d5:      74 dc                   je     6b3 <can_stop_idle_tick.isra.14+0x33>                    \ 0054                  6d4: R_X86_64_PC32      .text.unlikely-0x4
0057  6d7:      65 66 8b 35 00 00 00    mov    %gs:0x0(%rip),%si        # 6df <can_stop                 \ 0058  6d8:    3b 3d 00 00 00 00       cmp    0x0(%rip),%edi        # 6de <can_stop_id
005e  6de:      00                                                                                      \ 005a                  6da: R_X86_64_PC32      tick_do_timer_cpu-0x4
005b                    6db: R_X86_64_PC32      irq_stat-0x4                                            \ 005e  6de:    75 0a                   jne    6ea <can_stop_idle_tick.isra.14+0x6a>
005f  6df:      48 c7 c7 00 00 00 00    mov    $0x0,%rdi                                                \ 0060  6e0:    c7 05 00 00 00 00 ff    movl   $0xffffffff,0x0(%rip)        # 6ea <can_
0062                    6e2: R_X86_64_32S       .rodata.str1.8                                          \ 0067  6e7:    ff ff ff
0066  6e6:      0f b7 f6                movzwl %si,%esi                                                 \ 0062                  6e2: R_X86_64_PC32      tick_do_timer_cpu-0x8
0069  6e9:      e8 00 00 00 00          callq  6ee <can_stop_idle_tick.isra.14+0x6e>                    \ 006a  6ea:    48 c7 02 00 00 00 00    movq   $0x0,(%rdx)
006a                    6ea: R_X86_64_PLT32     printk-0x4                                              \ 0071  6f1:    eb c0                   jmp    6b3 <can_stop_idle_tick.isra.14+0x33>
006e  6ee:      83 05 00 00 00 00 01    addl   $0x1,0x0(%rip)        # 6f5 <can_stop_id                 \ 0073  6f3:    66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
0070                    6f0: R_X86_64_PC32      .bss-0x5                                                \ 007a  6fa:    00 00 00 00
0075  6f5:      eb bc                   jmp    6b3 <can_stop_idle_tick.isra.14+0x33>                    \ 007e  6fe:    66 90                   xchg   %ax,%ax
0077  6f7:      3b 3d 00 00 00 00       cmp    0x0(%rip),%edi        # 6fd <can_stop_id                 \ fffffffffffff980
0079                    6f9: R_X86_64_PC32      tick_do_timer_cpu-0x4                                   \ 0000 0000000000000000 <can_stop_idle_tick.isra.14.cold.23>:
007d  6fd:      75 0a                   jne    709 <can_stop_idle_tick.isra.14+0x89>                    \ 0000    0:    48 c7 c7 00 00 00 00    mov    $0x0,%rdi
007f  6ff:      c7 05 00 00 00 00 ff    movl   $0xffffffff,0x0(%rip)        # 709 <can_                 \ 0003                  3: R_X86_64_32S .rodata.str1.8
0086  706:      ff ff ff                                                                                \ 0007    7:    e8 00 00 00 00          callq  c <can_stop_idle_tick.isra.14.cold.23+0x
0081                    701: R_X86_64_PC32      tick_do_timer_cpu-0x8                                   \ 0008                  8: R_X86_64_PLT32       printk-0x4
0089  709:      48 c7 02 00 00 00 00    movq   $0x0,(%rdx)                                              \ 000c    c:    83 05 00 00 00 00 01    addl   $0x1,0x0(%rip)        # 13 <can_stop_idl
0090  710:      eb a1                   jmp    6b3 <can_stop_idle_tick.isra.14+0x33>                    \ 000e                  e: R_X86_64_PC32        .bss-0x5
0092  712:      66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)                                \ 0013   13:    e9 00 00 00 00          jmpq   18 <__setup_setup_tick_nohz>
0099  719:      00 00 00 00                                                                             \ 0014                  14: R_X86_64_PC32       .text+0x6af
009d  71d:      0f 1f 00                nopl   (%rax)                                                   \

And we see that GCC created a .cold. subfunction because the first patch
removed the volatile from __this_cpu_read() and could thus move it.


Similarly the second patch; which removes volatile from
smp_processor_id():

$ ./compare.sh defconfig-build1 defconfig-build2
arch/x86/events/amd/ibs.o                                 667        757   +90
arch/x86/kernel/cpu/mce/core.o                           2677       2696   +19
arch/x86/kernel/cpu/mce/therm_throt.o                     508        527   +19
arch/x86/kernel/cpu/mtrr/generic.o                       9523       9203   -320
arch/x86/kernel/acpi/sleep.o                             3152       3088   -64
arch/x86/kernel/nmi.o                                     338        562   +224
arch/x86/kernel/process.o                                1554       1586   +32
arch/x86/kernel/tsc_sync.o                               5591       5377   -214
kernel/irq/spurious.o                                    5835       5771   -64
kernel/irq/cpuhotplug.o                                  2253       2189   -64
kernel/time/clocksource.o                                 480        593   +113
                                             total    4768268    4768039   -229

we get smaller total executable sections; and even when there is growth:

$ ./compare.sh defconfig-build1 defconfig-build2 arch/x86/events/amd/ibs.o setup_APIC_ibs
0000 0000000000000420 <setup_APIC_ibs>:                                                                 | 0000 0000000000000420 <setup_APIC_ibs>:
0000  420:      53                      push   %rbx                                                     | 0000  420:    53                      push   %rbx
0001  421:      b9 3a 10 01 c0          mov    $0xc001103a,%ecx                                         | 0001  421:    b9 3a 10 01 c0          mov    $0xc001103a,%ecx
0006  426:      0f 32                   rdmsr                                                           | 0006  426:    0f 32                   rdmsr
0008  428:      48 c1 e2 20             shl    $0x20,%rdx                                               | 0008  428:    48 c1 e2 20             shl    $0x20,%rdx
000c  42c:      48 89 d3                mov    %rdx,%rbx                                                | 000c  42c:    48 89 d3                mov    %rdx,%rbx
000f  42f:      48 09 c3                or     %rax,%rbx                                                | 000f  42f:    48 09 c3                or     %rax,%rbx
0012  432:      0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)                                         | 0012  432:    0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
0017  437:      f6 c7 01                test   $0x1,%bh                                                 | 0017  437:    f6 c7 01                test   $0x1,%bh
001a  43a:      74 2a                   je     466 <setup_APIC_ibs+0x46>                                \ 001a  43a:    0f 84 00 00 00 00       je     440 <setup_APIC_ibs+0x20>
001c  43c:      89 df                   mov    %ebx,%edi                                                \ 001c                  43c: R_X86_64_PC32      .text.unlikely-0x4
001e  43e:      31 c9                   xor    %ecx,%ecx                                                \ 0020  440:    89 df                   mov    %ebx,%edi
0020  440:      31 f6                   xor    %esi,%esi                                                \ 0022  442:    31 c9                   xor    %ecx,%ecx
0022  442:      ba 04 00 00 00          mov    $0x4,%edx                                                \ 0024  444:    31 f6                   xor    %esi,%esi
0027  447:      83 e7 0f                and    $0xf,%edi                                                \ 0026  446:    ba 04 00 00 00          mov    $0x4,%edx
002a  44a:      e8 00 00 00 00          callq  44f <setup_APIC_ibs+0x2f>                                \ 002b  44b:    83 e7 0f                and    $0xf,%edi
002b                    44b: R_X86_64_PLT32     setup_APIC_eilvt-0x4                                    \ 002e  44e:    e8 00 00 00 00          callq  453 <setup_APIC_ibs+0x33>
002f  44f:      85 c0                   test   %eax,%eax                                                \ 002f                  44f: R_X86_64_PLT32     setup_APIC_eilvt-0x4
0031  451:      75 13                   jne    466 <setup_APIC_ibs+0x46>                                \ 0033  453:    85 c0                   test   %eax,%eax
0033  453:      5b                      pop    %rbx                                                     \ 0035  455:    0f 85 00 00 00 00       jne    45b <setup_APIC_ibs+0x3b>
0034  454:      c3                      retq                                                            \ 0037                  457: R_X86_64_PC32      .text.unlikely-0x4
0035  455:      31 d2                   xor    %edx,%edx                                                \ 003b  45b:    5b                      pop    %rbx
0037  457:      48 89 de                mov    %rbx,%rsi                                                \ 003c  45c:    c3                      retq
003a  45a:      bf 3a 10 01 c0          mov    $0xc001103a,%edi                                         \ 003d  45d:    31 d2                   xor    %edx,%edx
003f  45f:      e8 00 00 00 00          callq  464 <setup_APIC_ibs+0x44>                                \ 003f  45f:    48 89 de                mov    %rbx,%rsi
0040                    460: R_X86_64_PLT32     do_trace_read_msr-0x4                                   \ 0042  462:    bf 3a 10 01 c0          mov    $0xc001103a,%edi
0044  464:      eb d1                   jmp    437 <setup_APIC_ibs+0x17>                                \ 0047  467:    e8 00 00 00 00          callq  46c <setup_APIC_ibs+0x4c>
0046  466:      65 8b 35 00 00 00 00    mov    %gs:0x0(%rip),%esi        # 46d <setup_A                 \ 0048                  468: R_X86_64_PLT32     do_trace_read_msr-0x4
0049                    469: R_X86_64_PC32      cpu_number-0x4                                          \ 004c  46c:    eb c9                   jmp    437 <setup_APIC_ibs+0x17>
004d  46d:      48 c7 c7 00 00 00 00    mov    $0x0,%rdi                                                \ 004e  46e:    66 90                   xchg   %ax,%ax
0050                    470: R_X86_64_32S       .rodata.str1.8                                          \ fffffffffffffbe0
0054  474:      5b                      pop    %rbx                                                     \ 0000 0000000000000000 <setup_APIC_ibs.cold.9>:
0055  475:      e9 00 00 00 00          jmpq   47a <setup_APIC_ibs+0x5a>                                \ 0000    0:    48 c7 c7 00 00 00 00    mov    $0x0,%rdi
0056                    476: R_X86_64_PLT32     printk-0x4                                              \ 0003                  3: R_X86_64_32S .rodata.str1.8
005a  47a:      66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)                                         \ 0007    7:    5b                      pop    %rbx
fffffffffffffbe0                                                                                        \ 0008    8:    65 8b 35 00 00 00 00    mov    %gs:0x0(%rip),%esi        # f <setup_API
                                                                                                        \ 000b                  b: R_X86_64_PC32        cpu_number-0x4
                                                                                                        \ 000f    f:    e9 00 00 00 00          jmpq   14 <force_ibs_eilvt_setup.cold.10>
                                                                                                        \ 0010                  10: R_X86_64_PLT32      printk-0x4
                                                                                                        \ 0000

It is because of cold subfunction creation; with a reduction in side of
the regular path.

The third build included patches 3 and 4 (because they don't much
overlap); and give some meagre savings:

$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/kernel/irq.o
do_IRQ                                                    195        187   -8
smp_x86_platform_ipi                                      234        222   -12
smp_kvm_posted_intr_ipi                                    74         66   -8
smp_kvm_posted_intr_wakeup_ipi                             86         78   -8
smp_kvm_posted_intr_nested_ipi                             74         66   -8
$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/mm/tlb.o
flush_tlb_func_common.constprop.13                        728        719   -9
switch_mm_irqs_off                                       1528       1524   -4



Now, I realize you particularly hate the tlb patch; and I'll see if I
can get these same savings with a few less __ added.

But in general, I think these patches are worth it. esp. since I've
already done them :-)

[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 1750 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 14:50 ` Peter Zijlstra
@ 2019-03-08 19:35   ` Nadav Amit
  2019-03-08 20:56     ` Peter Zijlstra
  2019-03-08 22:48     ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Nadav Amit @ 2019-03-08 19:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
	luto@kernel.org, linux-kernel@vger.kernel.org

> On Mar 8, 2019, at 6:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
> 
>> This is a collection of x86/percpu changes that I had pending and got reminded
>> of by Linus' comment yesterday about __this_cpu_xchg().
>> 
>> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
> 
> (Sorry; this is going to have _wide_ output)
> 
> OK, so what I did is I build 4 kernels (O=defconfig-build{,1,2,3}) with
> resp that many patches of this series applied.
> 
> When I look at just the vmlinux size output:
> 
> $ size defconfig-build*/vmlinux
> text    data     bss     dec     hex filename
> 19540631        5040164 1871944 26452739        193a303 defconfig-build/vmlinux
> 19540635        5040164 1871944 26452743        193a307 defconfig-build1/vmlinux
> 19540685        5040164 1871944 26452793        193a339 defconfig-build2/vmlinux
> 19540685        5040164 1871944 26452793        193a339 defconfig-build3/vmlinux
> 
> Things appear to get slightly larger; however when I look in more
> detail using my (newly written compare script, find attached), I get
> things like:

Nice script! I keep asking myself how comparing two binaries can provide
some “number” to indicate how “good” the binary is (at least relatively to
another one) - either during compilation or after. Code size, as you show,
is the wrong metric.

Anyhow, I am a little disappointed (and surprised) that in most cases that I
played with, this kind of optimizations have marginal impact on performance
at best, even when the binary changes “a lot” and when microbenchmarks are
used.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 19:35   ` Nadav Amit
@ 2019-03-08 20:56     ` Peter Zijlstra
  2019-03-10 12:46       ` Peter Zijlstra
  2019-03-08 22:48     ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 20:56 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
	luto@kernel.org, linux-kernel@vger.kernel.org

On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:

> Nice script! I keep asking myself how comparing two binaries can provide
> some “number” to indicate how “good” the binary is (at least relatively to
> another one) - either during compilation or after. Code size, as you show,
> is the wrong metric.

Right; I'm still pondering other metrics, like:

  readelf -WS | grep AX | grep -v -e init -e exit -e altinstr -e unlikely -e fixup

which is only 'fast' path text.

> Anyhow, I am a little disappointed (and surprised) that in most cases that I
> played with, this kind of optimizations have marginal impact on performance
> at best, even when the binary changes “a lot” and when microbenchmarks are
> used.

Right, but if we don't care, it'll be death by 1000 cuts.

Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?

I'm thinking of a variant of Linus' patch, but I'm confused about the
above.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 20:56     ` Peter Zijlstra
@ 2019-03-10 12:46       ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-10 12:46 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
	luto@kernel.org, linux-kernel@vger.kernel.org

On Fri, Mar 08, 2019 at 09:56:37PM +0100, Peter Zijlstra wrote:

> Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?
> 
> I'm thinking of a variant of Linus' patch, but I'm confused about the
> above.

So whatever I tried with +m only made things worse and always affects
thousands of symbols.

Now, afaict the whole percpu_stable_op thing is an ugly hack becaues some
earlier compiler would not CSE the regular percpu_from_op. But since it
does do that today; esp. after my first patch, I tried implementing
this_cpu_read_stable() with percpu_from_op() (no volatile, obv).

That also affects _lots_ of sites, but also significantly shrinks the
kernel image.

It's 2307 symbols affected, but:

17642871        2157438  747808 20548117        1398a15 defconfig-build1/vmlinux.o (patch 1)
17639081        2157438  747808 20544327        1397b47 defconfig-build0/vmlinux.o (patch 1 - percpu_stable_op)

So I think I'll add a patch removing percpu_stable_op and all its users.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 19:35   ` Nadav Amit
  2019-03-08 20:56     ` Peter Zijlstra
@ 2019-03-08 22:48     ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 22:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
	luto@kernel.org, linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 151 bytes --]

On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script!

Find a new one; this one is fast enough to run a symbol diff on vmlinux.o


[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 2221 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-03-10 12:47 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
2019-02-27 16:14   ` Linus Torvalds
2019-02-27 16:48     ` Peter Zijlstra
2019-02-27 17:17       ` Linus Torvalds
2019-02-27 17:34         ` Peter Zijlstra
2019-02-27 17:38           ` Linus Torvalds
2019-02-27 17:57     ` Nadav Amit
2019-02-27 18:55       ` Nadav Amit
2019-02-27 19:41       ` Linus Torvalds
2019-03-08 13:35         ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
2019-02-27 14:43   ` Peter Zijlstra
2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
2019-03-08 14:50 ` Peter Zijlstra
2019-03-08 19:35   ` Nadav Amit
2019-03-08 20:56     ` Peter Zijlstra
2019-03-10 12:46       ` Peter Zijlstra
2019-03-08 22:48     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox