* [PATCH 0/5] x86/percpu semantics and fixes
@ 2019-02-27 10:12 Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
` (7 more replies)
0 siblings, 8 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw)
To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel
Hi all,
This is a collection of x86/percpu changes I had pending; Linus' comment
yesterday about __this_cpu_xchg() reminded me of them.
This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
Built and boot tested with CONFIG_DEBUG_PREEMPT=y.
---
arch/x86/include/asm/irq_regs.h | 4 +-
arch/x86/include/asm/percpu.h | 236 +++++++++++++++++++++-------------------
arch/x86/include/asm/smp.h | 3 +-
arch/x86/mm/tlb.c | 62 +++++------
include/linux/smp.h | 45 +++++---
kernel/sched/fair.c | 5 +-
6 files changed, 193 insertions(+), 162 deletions(-)
^ permalink raw reply [flat|nested] 23+ messages in thread* [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra @ 2019-02-27 10:12 ` Peter Zijlstra 2019-02-27 16:14 ` Linus Torvalds 2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra ` (6 subsequent siblings) 7 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit Nadav Amit reported that commit: b59167ac7baf ("x86/percpu: Fix this_cpu_read()") added a bunch of constraints to all sorts of code; and while some of that was correct and desired, some of that seems superfluous. The thing is, the this_cpu_*() operations are defined IRQ-safe, this means the values are subject to change from IRQs, and thus must be reloaded. Also (the generic form): local_irq_save() __this_cpu_read() local_irq_restore() would not allow the re-use of previous values; if by nothing else, then the barrier()s implied by local_irq_*(). Which raises the point that percpu_from_op() and the other also need that volatile. OTOH __this_cpu_*() operations are not IRQ-safe and assume external preempt/IRQ disabling and could thus be allowed more room for optimization. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/percpu.h | 224 +++++++++++++++++++++--------------------- 1 file changed, 112 insertions(+), 112 deletions(-) --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -87,7 +87,7 @@ * don't give an lvalue though). 
*/ extern void __bad_percpu_size(void); -#define percpu_to_op(op, var, val) \ +#define percpu_to_op(qual, op, var, val) \ do { \ typedef typeof(var) pto_T__; \ if (0) { \ @@ -97,22 +97,22 @@ do { \ } \ switch (sizeof(var)) { \ case 1: \ - asm(op "b %1,"__percpu_arg(0) \ + asm qual (op "b %1,"__percpu_arg(0) \ : "+m" (var) \ : "qi" ((pto_T__)(val))); \ break; \ case 2: \ - asm(op "w %1,"__percpu_arg(0) \ + asm qual (op "w %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pto_T__)(val))); \ break; \ case 4: \ - asm(op "l %1,"__percpu_arg(0) \ + asm qual (op "l %1,"__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pto_T__)(val))); \ break; \ case 8: \ - asm(op "q %1,"__percpu_arg(0) \ + asm qual (op "q %1,"__percpu_arg(0) \ : "+m" (var) \ : "re" ((pto_T__)(val))); \ break; \ @@ -124,7 +124,7 @@ do { \ * Generate a percpu add to memory instruction and optimize code * if one is added or subtracted. */ -#define percpu_add_op(var, val) \ +#define percpu_add_op(qual, var, val) \ do { \ typedef typeof(var) pao_T__; \ const int pao_ID__ = (__builtin_constant_p(val) && \ @@ -138,41 +138,41 @@ do { \ switch (sizeof(var)) { \ case 1: \ if (pao_ID__ == 1) \ - asm("incb "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incb "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decb "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decb "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addb %1, "__percpu_arg(0) \ + asm qual ("addb %1, "__percpu_arg(0) \ : "+m" (var) \ : "qi" ((pao_T__)(val))); \ break; \ case 2: \ if (pao_ID__ == 1) \ - asm("incw "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incw "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decw "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decw "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addw %1, "__percpu_arg(0) \ + asm qual ("addw %1, "__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pao_T__)(val))); \ break; \ case 4: \ if (pao_ID__ == 1) \ - asm("incl "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incl 
"__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decl "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decl "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addl %1, "__percpu_arg(0) \ + asm qual ("addl %1, "__percpu_arg(0) \ : "+m" (var) \ : "ri" ((pao_T__)(val))); \ break; \ case 8: \ if (pao_ID__ == 1) \ - asm("incq "__percpu_arg(0) : "+m" (var)); \ + asm qual ("incq "__percpu_arg(0) : "+m" (var)); \ else if (pao_ID__ == -1) \ - asm("decq "__percpu_arg(0) : "+m" (var)); \ + asm qual ("decq "__percpu_arg(0) : "+m" (var)); \ else \ - asm("addq %1, "__percpu_arg(0) \ + asm qual ("addq %1, "__percpu_arg(0) \ : "+m" (var) \ : "re" ((pao_T__)(val))); \ break; \ @@ -180,27 +180,27 @@ do { \ } \ } while (0) -#define percpu_from_op(op, var) \ +#define percpu_from_op(qual, op, var) \ ({ \ typeof(var) pfo_ret__; \ switch (sizeof(var)) { \ case 1: \ - asm volatile(op "b "__percpu_arg(1)",%0"\ + asm qual (op "b "__percpu_arg(1)",%0" \ : "=q" (pfo_ret__) \ : "m" (var)); \ break; \ case 2: \ - asm volatile(op "w "__percpu_arg(1)",%0"\ + asm qual (op "w "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ case 4: \ - asm volatile(op "l "__percpu_arg(1)",%0"\ + asm qual (op "l "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ case 8: \ - asm volatile(op "q "__percpu_arg(1)",%0"\ + asm qual (op "q "__percpu_arg(1)",%0" \ : "=r" (pfo_ret__) \ : "m" (var)); \ break; \ @@ -238,23 +238,23 @@ do { \ pfo_ret__; \ }) -#define percpu_unary_op(op, var) \ +#define percpu_unary_op(qual, op, var) \ ({ \ switch (sizeof(var)) { \ case 1: \ - asm(op "b "__percpu_arg(0) \ + asm qual (op "b "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 2: \ - asm(op "w "__percpu_arg(0) \ + asm qual (op "w "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 4: \ - asm(op "l "__percpu_arg(0) \ + asm qual (op "l "__percpu_arg(0) \ : "+m" (var)); \ break; \ case 8: \ - asm(op "q "__percpu_arg(0) \ + asm qual (op "q "__percpu_arg(0) \ : "+m" (var)); \ break; \ 
default: __bad_percpu_size(); \ @@ -264,27 +264,27 @@ do { \ /* * Add return operation */ -#define percpu_add_return_op(var, val) \ +#define percpu_add_return_op(qual, var, val) \ ({ \ typeof(var) paro_ret__ = val; \ switch (sizeof(var)) { \ case 1: \ - asm("xaddb %0, "__percpu_arg(1) \ + asm qual ("xaddb %0, "__percpu_arg(1) \ : "+q" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 2: \ - asm("xaddw %0, "__percpu_arg(1) \ + asm qual ("xaddw %0, "__percpu_arg(1) \ : "+r" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 4: \ - asm("xaddl %0, "__percpu_arg(1) \ + asm qual ("xaddl %0, "__percpu_arg(1) \ : "+r" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ case 8: \ - asm("xaddq %0, "__percpu_arg(1) \ + asm qual ("xaddq %0, "__percpu_arg(1) \ : "+re" (paro_ret__), "+m" (var) \ : : "memory"); \ break; \ @@ -299,13 +299,13 @@ do { \ * expensive due to the implied lock prefix. The processor cannot prefetch * cachelines if xchg is used. */ -#define percpu_xchg_op(var, nval) \ +#define percpu_xchg_op(qual, var, nval) \ ({ \ typeof(var) pxo_ret__; \ typeof(var) pxo_new__ = (nval); \ switch (sizeof(var)) { \ case 1: \ - asm("\n\tmov "__percpu_arg(1)",%%al" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%al" \ "\n1:\tcmpxchgb %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -313,7 +313,7 @@ do { \ : "memory"); \ break; \ case 2: \ - asm("\n\tmov "__percpu_arg(1)",%%ax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%ax" \ "\n1:\tcmpxchgw %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -321,7 +321,7 @@ do { \ : "memory"); \ break; \ case 4: \ - asm("\n\tmov "__percpu_arg(1)",%%eax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%eax" \ "\n1:\tcmpxchgl %2, "__percpu_arg(1) \ "\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -329,7 +329,7 @@ do { \ : "memory"); \ break; \ case 8: \ - asm("\n\tmov "__percpu_arg(1)",%%rax" \ + asm qual ("\n\tmov "__percpu_arg(1)",%%rax" \ "\n1:\tcmpxchgq %2, "__percpu_arg(1) \ 
"\n\tjnz 1b" \ : "=&a" (pxo_ret__), "+m" (var) \ @@ -345,32 +345,32 @@ do { \ * cmpxchg has no such implied lock semantics as a result it is much * more efficient for cpu local operations. */ -#define percpu_cmpxchg_op(var, oval, nval) \ +#define percpu_cmpxchg_op(qual, var, oval, nval) \ ({ \ typeof(var) pco_ret__; \ typeof(var) pco_old__ = (oval); \ typeof(var) pco_new__ = (nval); \ switch (sizeof(var)) { \ case 1: \ - asm("cmpxchgb %2, "__percpu_arg(1) \ + asm qual ("cmpxchgb %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "q" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 2: \ - asm("cmpxchgw %2, "__percpu_arg(1) \ + asm qual ("cmpxchgw %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 4: \ - asm("cmpxchgl %2, "__percpu_arg(1) \ + asm qual ("cmpxchgl %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ break; \ case 8: \ - asm("cmpxchgq %2, "__percpu_arg(1) \ + asm qual ("cmpxchgq %2, "__percpu_arg(1) \ : "=a" (pco_ret__), "+m" (var) \ : "r" (pco_new__), "0" (pco_old__) \ : "memory"); \ @@ -391,58 +391,58 @@ do { \ */ #define this_cpu_read_stable(var) percpu_stable_op("mov", var) -#define raw_cpu_read_1(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_read_2(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_read_4(pcp) percpu_from_op("mov", pcp) - -#define raw_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_add_1(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_add_2(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_add_4(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val) 
-#define raw_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(pcp, val) -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(pcp, val) -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(pcp, val) - -#define this_cpu_read_1(pcp) percpu_from_op("mov", pcp) -#define this_cpu_read_2(pcp) percpu_from_op("mov", pcp) -#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp) -#define this_cpu_write_1(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_write_2(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_write_4(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_add_1(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_add_2(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_add_4(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval) - -#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) - -#define this_cpu_add_return_1(pcp, val) 
percpu_add_return_op(pcp, val) -#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) +#define raw_cpu_read_1(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_read_2(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_read_4(pcp) percpu_from_op(, "mov", pcp) + +#define raw_cpu_write_1(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_write_2(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_write_4(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_add_1(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_add_2(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_add_4(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_and_1(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_and_2(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_and_4(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val) +#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val) +#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val) + +#define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_read_4(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_write_1(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_write_2(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_write_4(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) 
+#define this_cpu_add_1(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_add_2(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_add_4(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_and_1(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_and_2(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_and_4(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_or_1(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_or_2(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_or_4(pcp, val) percpu_to_op(volatile, "or", (pcp), val) +#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(volatile, pcp, nval) + +#define raw_cpu_add_return_1(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_add_return_2(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_add_return_4(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) +#define raw_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) +#define raw_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) + +#define this_cpu_add_return_1(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_add_return_2(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_add_return_4(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) +#define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) +#define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) #ifdef CONFIG_X86_CMPXCHG64 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2) \ @@ -466,23 +466,23 @@ do { \ * 32 bit must fall 
back to generic operations. */ #ifdef CONFIG_X86_64 -#define raw_cpu_read_8(pcp) percpu_from_op("mov", pcp) -#define raw_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val) -#define raw_cpu_add_8(pcp, val) percpu_add_op((pcp), val) -#define raw_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val) -#define raw_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) -#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val) -#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) - -#define this_cpu_read_8(pcp) percpu_from_op("mov", pcp) -#define this_cpu_write_8(pcp, val) percpu_to_op("mov", (pcp), val) -#define this_cpu_add_8(pcp, val) percpu_add_op((pcp), val) -#define this_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val) -#define this_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) -#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val) -#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) +#define raw_cpu_read_8(pcp) percpu_from_op(, "mov", pcp) +#define raw_cpu_write_8(pcp, val) percpu_to_op(, "mov", (pcp), val) +#define raw_cpu_add_8(pcp, val) percpu_add_op(, (pcp), val) +#define raw_cpu_and_8(pcp, val) percpu_to_op(, "and", (pcp), val) +#define raw_cpu_or_8(pcp, val) percpu_to_op(, "or", (pcp), val) +#define raw_cpu_add_return_8(pcp, val) percpu_add_return_op(, pcp, val) +#define raw_cpu_xchg_8(pcp, nval) percpu_xchg_op(, pcp, nval) +#define raw_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(, pcp, oval, nval) + +#define this_cpu_read_8(pcp) percpu_from_op(volatile, "mov", pcp) +#define this_cpu_write_8(pcp, val) percpu_to_op(volatile, "mov", (pcp), val) +#define this_cpu_add_8(pcp, val) percpu_add_op(volatile, (pcp), val) +#define this_cpu_and_8(pcp, val) percpu_to_op(volatile, "and", (pcp), val) +#define this_cpu_or_8(pcp, val) percpu_to_op(volatile, "or", 
(pcp), val) +#define this_cpu_add_return_8(pcp, val) percpu_add_return_op(volatile, pcp, val) +#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(volatile, pcp, nval) +#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(volatile, pcp, oval, nval) /* * Pretty complex macro to generate cmpxchg16 instruction. The instruction ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra @ 2019-02-27 16:14 ` Linus Torvalds 2019-02-27 16:48 ` Peter Zijlstra 2019-02-27 17:57 ` Nadav Amit 0 siblings, 2 replies; 23+ messages in thread From: Linus Torvalds @ 2019-02-27 16:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Nadav Amit, Linux List Kernel Mailing, Nadav Amit [-- Attachment #1: Type: text/plain, Size: 1078 bytes --] On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote: > > Nadav Amit reported that commit: > > b59167ac7baf ("x86/percpu: Fix this_cpu_read()") > > added a bunch of constraints to all sorts of code; and while some of > that was correct and desired, some of that seems superfluous. Hmm. I have the strong feeling that we should instead relax this_cpu_read() again a bit. In particular, making it "asm volatile" really is a big hammer approach. It's worth noting that the *other* this_cpu_xyz ops don't even do that. I would suggest that instead of making "this_cpu_read()" be asm volatile, we mark it as potentially changing the memory location it is touching - the same way the modify/write ops do. That still means that the read will be forced (like READ_ONCE()), but allows gcc a bit more flexibility in instruction scheduling, I think. Trivial (but entirely untested) patch attached. That said, I didn't actually check how it affects code generation. Nadav, would you check the code sequences you originally noticed? 
Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1348 bytes --] arch/x86/include/asm/percpu.h | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 1a19d11cfbbd..63b0f361533f 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -185,24 +185,24 @@ do { \ typeof(var) pfo_ret__; \ switch (sizeof(var)) { \ case 1: \ - asm volatile(op "b "__percpu_arg(1)",%0"\ - : "=q" (pfo_ret__) \ - : "m" (var)); \ + asm(op "b "__percpu_arg(1)",%0"\ + : "=q" (pfo_ret__), \ + "+m" (var)); \ break; \ case 2: \ - asm volatile(op "w "__percpu_arg(1)",%0"\ - : "=r" (pfo_ret__) \ - : "m" (var)); \ + asm(op "w "__percpu_arg(1)",%0"\ + : "=r" (pfo_ret__), \ + "+m" (var)); \ break; \ case 4: \ - asm volatile(op "l "__percpu_arg(1)",%0"\ - : "=r" (pfo_ret__) \ - : "m" (var)); \ + asm(op "l "__percpu_arg(1)",%0"\ + : "=r" (pfo_ret__), \ + "+m" (var)); \ break; \ case 8: \ - asm volatile(op "q "__percpu_arg(1)",%0"\ - : "=r" (pfo_ret__) \ - : "m" (var)); \ + asm(op "q "__percpu_arg(1)",%0"\ + : "=r" (pfo_ret__), \ + "+m" (var)); \ break; \ default: __bad_percpu_size(); \ } \ ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 16:14 ` Linus Torvalds @ 2019-02-27 16:48 ` Peter Zijlstra 2019-02-27 17:17 ` Linus Torvalds 2019-02-27 17:57 ` Nadav Amit 1 sibling, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 16:48 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Nadav Amit, Linux List Kernel Mailing, Nadav Amit On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote: > On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > Nadav Amit reported that commit: > > > > b59167ac7baf ("x86/percpu: Fix this_cpu_read()") > > > > added a bunch of constraints to all sorts of code; and while some of > > that was correct and desired, some of that seems superfluous. > > Hmm. > > I have the strong feeling that we should instead relax this_cpu_read() > again a bit. > > In particular, making it "asm volatile" really is a big hammer > approach. It's worth noting that the *other* this_cpu_xyz ops don't > even do that. Right, this patch 'fixes' that :-) > I would suggest that instead of making "this_cpu_read()" be asm > volatile, we mark it as potentially changing the memory location it is > touching - the same way the modify/write ops do. > > That still means that the read will be forced (like READ_ONCE()), but > allows gcc a bit more flexibility in instruction scheduling, I think. Ah, fair enough, I'll spin a version of this patch with "+m" for this_cpu and "m" for raw_cpu. > That said, I didn't actually check how it affects code generation. > Nadav, would you check the code sequences you originally noticed? Much of it was the ONCE behaviour defeating CSE I think, but yes, it would be good to have another look. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 16:48 ` Peter Zijlstra @ 2019-02-27 17:17 ` Linus Torvalds 2019-02-27 17:34 ` Peter Zijlstra 0 siblings, 1 reply; 23+ messages in thread From: Linus Torvalds @ 2019-02-27 17:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Nadav Amit, Linux List Kernel Mailing, Nadav Amit On Wed, Feb 27, 2019 at 8:48 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Feb 27, 2019 at 08:14:09AM -0800, Linus Torvalds wrote: > > In particular, making it "asm volatile" really is a big hammer > > approach. It's worth noting that the *other* this_cpu_xyz ops don't > > even do that. > > Right, this patch 'fixes' that :-) Yeah, but I do hate seeing these patches that randomly add double underscore versions just because the regular one is so bad. So I'd much rather improve the regular this_cpu_read() instead, and hope that we don't need to have this kind of big difference between that and the double-underscore version. I'm not convinced that the "asm volatile" is right on the _other_ ops either. You added them to cmpxchg and xadd too, and it's not obvious that they should have them. They have the "memory" clobber to order them wrt actual memory ops, why are they now "asm volatile"? So I don't really like this patch that randomly adds volatile to things, and then removes it from one special case that I don't think should have had it in the first place. It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad. Linus ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 17:17 ` Linus Torvalds @ 2019-02-27 17:34 ` Peter Zijlstra 2019-02-27 17:38 ` Linus Torvalds 0 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 17:34 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Nadav Amit, Linux List Kernel Mailing, Nadav Amit On Wed, Feb 27, 2019 at 09:17:42AM -0800, Linus Torvalds wrote: > It all seems pretty ad-hoc, and we already _know_ that "asm volatile" is bad. Ah, all I wanted was the ONCE thing and my inline asm foo sucks. If +m gets us that, awesome. But the ONCE thing defeats CSE (on purpose!) so for code-gen that likes/wants that we then have to use the __this_cpu crud. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 17:34 ` Peter Zijlstra @ 2019-02-27 17:38 ` Linus Torvalds 0 siblings, 0 replies; 23+ messages in thread From: Linus Torvalds @ 2019-02-27 17:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Nadav Amit, Linux List Kernel Mailing, Nadav Amit On Wed, Feb 27, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > But the ONCE thing defeats CSE (on purpose!) so for code-gen that > likes/wants that we then have to use the __this_cpu crud. Right. But I do think that if you want CSE on a percpu access, you might want to write it as such (ie load the value once and then re-use the value). But I'm not sure exactly which accesses Nadav noticed to be a problem, so.. I guess I should just look at the other patches, but I found them rather nasty and ad-hoc too so I just skimmed them with an "eww" ;) Linus ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 16:14 ` Linus Torvalds 2019-02-27 16:48 ` Peter Zijlstra @ 2019-02-27 17:57 ` Nadav Amit 2019-02-27 18:55 ` Nadav Amit 2019-02-27 19:41 ` Linus Torvalds 1 sibling, 2 replies; 23+ messages in thread From: Nadav Amit @ 2019-02-27 17:57 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox > On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote: >> Nadav Amit reported that commit: >> >> b59167ac7baf ("x86/percpu: Fix this_cpu_read()") >> >> added a bunch of constraints to all sorts of code; and while some of >> that was correct and desired, some of that seems superfluous. > > Trivial (but entirely untested) patch attached. > > That said, I didn't actually check how it affects code generation. > Nadav, would you check the code sequences you originally noticed? The original issue was raised while I was looking into a dropped patch of Matthew Wilcox that caused code size increase [1]. As a result I noticed that Peter’s patch caused big changes to the generated assembly across the kernel - I did not have a specific scenario that I cared about. The patch you sent (“+m/-volatile”) does increase the code size by 1728 bytes. Although code size is not the only metric for “code optimization”, the original patch of Peter (“volatile”) only increased the code size by 201 bytes. Peter’s original change also affected only 72 functions vs 228 that impacted by the new patch. I’ll have a look at some specific function assembly, but overall, the “+m” approach might prevent even more code optimizations than the “volatile” one. I’ll send an example or two later. 
Regards, Nadav [1] https://marc.info/?l=linux-mm&m=154341370216693&w=2 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() 2019-02-27 17:57 ` Nadav Amit @ 2019-02-27 18:55 ` Nadav Amit 2019-02-27 19:41 ` Linus Torvalds 1 sibling, 0 replies; 23+ messages in thread From: Nadav Amit @ 2019-02-27 18:55 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox > On Feb 27, 2019, at 9:57 AM, Nadav Amit <namit@vmware.com> wrote: > >> On Feb 27, 2019, at 8:14 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> On Wed, Feb 27, 2019 at 2:16 AM Peter Zijlstra <peterz@infradead.org> wrote: >>> Nadav Amit reported that commit: >>> >>> b59167ac7baf ("x86/percpu: Fix this_cpu_read()") >>> >>> added a bunch of constraints to all sorts of code; and while some of >>> that was correct and desired, some of that seems superfluous. >> >> Trivial (but entirely untested) patch attached. >> >> That said, I didn't actually check how it affects code generation. >> Nadav, would you check the code sequences you originally noticed? > > The original issue was raised while I was looking into a dropped patch of > Matthew Wilcox that caused code size increase [1]. As a result I noticed > that Peter’s patch caused big changes to the generated assembly across the > kernel - I did not have a specific scenario that I cared about. > > The patch you sent (“+m/-volatile”) does increase the code size by 1728 > bytes. Although code size is not the only metric for “code optimization”, > the original patch of Peter (“volatile”) only increased the code size by 201 > bytes. Peter’s original change also affected only 72 functions vs 228 that > impacted by the new patch. > > I’ll have a look at some specific function assembly, but overall, the “+m” > approach might prevent even more code optimizations than the “volatile” one. > > I’ll send an example or two later. 
Here is one example: Dump of assembler code for function event_filter_pid_sched_wakeup_probe_pre: 0xffffffff8117c510 <+0>: push %rbp 0xffffffff8117c511 <+1>: mov %rsp,%rbp 0xffffffff8117c514 <+4>: push %rbx 0xffffffff8117c515 <+5>: mov 0x28(%rdi),%rax 0xffffffff8117c519 <+9>: mov %gs:0x78(%rax),%dl 0xffffffff8117c51d <+13>: test %dl,%dl 0xffffffff8117c51f <+15>: je 0xffffffff8117c535 <event_filter_pid_sched_wakeup_probe_pre+37> 0xffffffff8117c521 <+17>: mov %rdi,%rax 0xffffffff8117c524 <+20>: mov 0x78(%rdi),%rdi 0xffffffff8117c528 <+24>: mov 0x28(%rax),%rbx # REDUNDANT 0xffffffff8117c52c <+28>: callq 0xffffffff81167830 <trace_ignore_this_task> 0xffffffff8117c531 <+33>: mov %al,%gs:0x78(%rbx) 0xffffffff8117c535 <+37>: pop %rbx 0xffffffff8117c536 <+38>: pop %rbp 0xffffffff8117c537 <+39>: retq The instruction at 0xffffffff8117c528 is redundant, and does not exist without the recent patch. It seems to be a result of no-strict-aliasing, which due to the new "memory write” (“+m”) causes the compiler to re-read the data. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 17:57 ` Nadav Amit
  2019-02-27 18:55 ` Nadav Amit
@ 2019-02-27 19:41 ` Linus Torvalds
  2019-03-08 13:35 ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Linus Torvalds @ 2019-02-27 19:41 UTC (permalink / raw)
To: Nadav Amit
Cc: Peter Zijlstra, Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
>
> I’ll have a look at some specific function assembly, but overall, the “+m”
> approach might prevent even more code optimizations than the “volatile” one.

Ok, that being the case, let's forget that patch.

I still wonder about the added volatiles to the xadd/cmpxchg cases,
which already had the "memory" clobber that should make the volatile
immaterial.

            Linus

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  2019-02-27 19:41 ` Linus Torvalds
@ 2019-03-08 13:35 ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 13:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nadav Amit, Ingo Molnar, Borislav Petkov, Thomas Gleixner, Andrew Lutomirski, Linux List Kernel Mailing, Matthew Wilcox

On Wed, Feb 27, 2019 at 11:41:31AM -0800, Linus Torvalds wrote:
> On Wed, Feb 27, 2019 at 9:57 AM Nadav Amit <namit@vmware.com> wrote:
> >
> > I’ll have a look at some specific function assembly, but overall, the “+m”
> > approach might prevent even more code optimizations than the “volatile” one.
>
> Ok, that being the case, let's forget that patch.
>
> I still wonder about the added volatiles to the xadd/cmpxchg cases,
> which already had the "memory" clobber which should make the volatile
> immaterial..

That was mostly me being OCD-style consistent; but also notice that the
atomic ops also often have volatile even though they have a memory
clobber.

^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/5] x86/percpu: Relax smp_processor_id() 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra 2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra @ 2019-02-27 10:12 ` Peter Zijlstra 2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra ` (5 subsequent siblings) 7 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit Nadav reported that since this_cpu_read() became asm-volatile, many smp_processor_id() users generated worse code due to the extra constraints. However since smp_processor_id() is reading a stable value, we can use __this_cpu_read(). Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/smp.h | 3 ++- include/linux/smp.h | 45 +++++++++++++++++++++++++++++++-------------- 2 files changed, 33 insertions(+), 15 deletions(-) --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -162,7 +162,8 @@ __visible void smp_call_function_single_ * from the initial startup. We map APIC_BASE very early in page_setup(), * so this is correct in the x86 case. */ -#define raw_smp_processor_id() (this_cpu_read(cpu_number)) +#define raw_smp_processor_id() this_cpu_read(cpu_number) +#define __smp_processor_id() __this_cpu_read(cpu_number) #ifdef CONFIG_X86_32 extern int safe_smp_processor_id(void); --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -181,29 +181,46 @@ static inline int get_boot_cpu_id(void) #endif /* !SMP */ -/* - * smp_processor_id(): get the current CPU ID. +/** + * raw_processor_id() - get the current (unstable) CPU id + * + * For then you know what you are doing and need an unstable + * CPU id. 
+ */ + +/** + * smp_processor_id() - get the current (stable) CPU id + * + * This is the normal accessor to the CPU id and should be used + * whenever possible. * - * if DEBUG_PREEMPT is enabled then we check whether it is - * used in a preemption-safe way. (smp_processor_id() is safe - * if it's used in a preemption-off critical section, or in - * a thread that is bound to the current CPU.) + * The CPU id is stable when: + * + * - IRQs are disabled; + * - preemption is disabled; + * - the task is CPU affine. * - * NOTE: raw_smp_processor_id() is for internal use only - * (smp_processor_id() is the preferred variant), but in rare - * instances it might also be used to turn off false positives - * (i.e. smp_processor_id() use that the debugging code reports but - * which use for some reason is legal). Don't use this to hack around - * the warning message, as your code might not work under PREEMPT. + * When CONFIG_DEBUG_PREEMPT; we verify these assumption and WARN + * when smp_processor_id() is used when the CPU id is not stable. */ + +/* + * Allow the architecture to differentiate between a stable and unstable read. + * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a + * regular asm read for the stable. + */ +#ifndef __smp_processor_id +#define __smp_processor_id(x) raw_smp_processor_id(x) +#endif + #ifdef CONFIG_DEBUG_PREEMPT extern unsigned int debug_smp_processor_id(void); # define smp_processor_id() debug_smp_processor_id() #else -# define smp_processor_id() raw_smp_processor_id() +# define smp_processor_id() __smp_processor_id() #endif -#define get_cpu() ({ preempt_disable(); smp_processor_id(); }) +#define get_cpu() ({ preempt_disable(); __smp_processor_id(); }) #define put_cpu() preempt_enable() /* ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra 2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra 2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra @ 2019-02-27 10:12 ` Peter Zijlstra 2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra ` (4 subsequent siblings) 7 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit Nadav reported that since the this_cpu_*() ops got asm-volatile constraints on, code generation suffered for do_IRQ(), but since this is all with IRQs disabled we can use __this_cpu_*(). Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/irq_regs.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/arch/x86/include/asm/irq_regs.h +++ b/arch/x86/include/asm/irq_regs.h @@ -16,7 +16,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_re static inline struct pt_regs *get_irq_regs(void) { - return this_cpu_read(irq_regs); + return __this_cpu_read(irq_regs); } static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs) @@ -24,7 +24,7 @@ static inline struct pt_regs *set_irq_re struct pt_regs *old_regs; old_regs = get_irq_regs(); - this_cpu_write(irq_regs, new_regs); + __this_cpu_write(irq_regs, new_regs); return old_regs; } ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra ` (2 preceding siblings ...) 2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra @ 2019-02-27 10:12 ` Peter Zijlstra 2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra ` (3 subsequent siblings) 7 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit Almost all of this is ran with IRQs disabled and therefore doesn't need the extra constraints on the this_cpu_*() ops, use __this_cpu_*() to alleviate this. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/mm/tlb.c | 62 +++++++++++++++++++++++++++--------------------------- 1 file changed, 31 insertions(+), 31 deletions(-) --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -58,15 +58,15 @@ static void clear_asid_other(void) for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) { /* Do not need to flush the current asid */ - if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid)) + if (asid == __this_cpu_read(cpu_tlbstate.loaded_mm_asid)) continue; /* * Make sure the next time we go to switch to * this asid, we do a flush: */ - this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0); + __this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0); } - this_cpu_write(cpu_tlbstate.invalidate_other, false); + __this_cpu_write(cpu_tlbstate.invalidate_other, false); } atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1); @@ -83,16 +83,16 @@ static void choose_new_asid(struct mm_st return; } - if (this_cpu_read(cpu_tlbstate.invalidate_other)) + if (__this_cpu_read(cpu_tlbstate.invalidate_other)) clear_asid_other(); for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) { - if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) != + if 
(__this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) != next->context.ctx_id) continue; *new_asid = asid; - *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) < + *need_flush = (__this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) < next_tlb_gen); return; } @@ -101,10 +101,10 @@ static void choose_new_asid(struct mm_st * We don't currently own an ASID slot on this CPU. * Allocate a slot. */ - *new_asid = this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1; + *new_asid = __this_cpu_add_return(cpu_tlbstate.next_asid, 1) - 1; if (*new_asid >= TLB_NR_DYN_ASIDS) { *new_asid = 0; - this_cpu_write(cpu_tlbstate.next_asid, 1); + __this_cpu_write(cpu_tlbstate.next_asid, 1); } *need_flush = true; } @@ -245,7 +245,7 @@ static void cond_ibpb(struct task_struct * cpu_tlbstate.last_user_mm_ibpb for comparison. */ next_mm = mm_mangle_tif_spec_ib(next); - prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_ibpb); + prev_mm = __this_cpu_read(cpu_tlbstate.last_user_mm_ibpb); /* * Issue IBPB only if the mm's are different and one or @@ -255,7 +255,7 @@ static void cond_ibpb(struct task_struct (next_mm | prev_mm) & LAST_USER_MM_IBPB) indirect_branch_prediction_barrier(); - this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm); + __this_cpu_write(cpu_tlbstate.last_user_mm_ibpb, next_mm); } if (static_branch_unlikely(&switch_mm_always_ibpb)) { @@ -264,9 +264,9 @@ static void cond_ibpb(struct task_struct * different context than the user space task which ran * last on this CPU. 
*/ - if (this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) { + if (__this_cpu_read(cpu_tlbstate.last_user_mm) != next->mm) { indirect_branch_prediction_barrier(); - this_cpu_write(cpu_tlbstate.last_user_mm, next->mm); + __this_cpu_write(cpu_tlbstate.last_user_mm, next->mm); } } } @@ -274,9 +274,9 @@ static void cond_ibpb(struct task_struct void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { - struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); - u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); - bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); + struct mm_struct *real_prev = __this_cpu_read(cpu_tlbstate.loaded_mm); + u16 prev_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid); + bool was_lazy = __this_cpu_read(cpu_tlbstate.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct __flush_tlb_all(); } #endif - this_cpu_write(cpu_tlbstate.is_lazy, false); + __this_cpu_write(cpu_tlbstate.is_lazy, false); /* * The membarrier system call requires a full memory barrier and @@ -330,7 +330,7 @@ void switch_mm_irqs_off(struct mm_struct * memory barrier and core serializing instruction. */ if (real_prev == next) { - VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != + VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != next->context.ctx_id); /* @@ -358,7 +358,7 @@ void switch_mm_irqs_off(struct mm_struct */ smp_mb(); next_tlb_gen = atomic64_read(&next->context.tlb_gen); - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == + if (__this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == next_tlb_gen) return; @@ -406,13 +406,13 @@ void switch_mm_irqs_off(struct mm_struct choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); /* Let nmi_uaccess_okay() know that we're changing CR3. 
*/ - this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING); + __this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING); barrier(); } if (need_flush) { - this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id); - this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen); + __this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id); + __this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen); load_new_mm_cr3(next->pgd, new_asid, true); /* @@ -435,8 +435,8 @@ void switch_mm_irqs_off(struct mm_struct /* Make sure we write CR3 before loaded_mm. */ barrier(); - this_cpu_write(cpu_tlbstate.loaded_mm, next); - this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid); + __this_cpu_write(cpu_tlbstate.loaded_mm, next); + __this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid); if (next != real_prev) { load_mm_cr4(next); @@ -529,10 +529,10 @@ static void flush_tlb_func_common(const * - f->new_tlb_gen: the generation that the requester of the flush * wants us to catch up to. */ - struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); - u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + struct mm_struct *loaded_mm = __this_cpu_read(cpu_tlbstate.loaded_mm); + u32 loaded_mm_asid = __this_cpu_read(cpu_tlbstate.loaded_mm_asid); u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen); - u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen); + u64 local_tlb_gen = __this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen); /* This code cannot presently handle being reentered. 
*/ VM_WARN_ON(!irqs_disabled()); @@ -540,10 +540,10 @@ static void flush_tlb_func_common(const if (unlikely(loaded_mm == &init_mm)) return; - VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) != + VM_WARN_ON(__this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) != loaded_mm->context.ctx_id); - if (this_cpu_read(cpu_tlbstate.is_lazy)) { + if (__this_cpu_read(cpu_tlbstate.is_lazy)) { /* * We're in lazy mode. We need to at least flush our * paging-structure cache to avoid speculatively reading @@ -631,7 +631,7 @@ static void flush_tlb_func_common(const } /* Both paths above update our state to mm_tlb_gen. */ - this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen); + __this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen); } static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason) @@ -647,7 +647,7 @@ static void flush_tlb_func_remote(void * inc_irq_stat(irq_tlb_count); - if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm)) + if (f->mm && f->mm != __this_cpu_read(cpu_tlbstate.loaded_mm)) return; count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); @@ -749,7 +749,7 @@ void flush_tlb_mm_range(struct mm_struct info.end = TLB_FLUSH_ALL; } - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + if (mm == __this_cpu_read(cpu_tlbstate.loaded_mm)) { VM_WARN_ON(irqs_disabled()); local_irq_disable(); flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra ` (3 preceding siblings ...) 2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra @ 2019-02-27 10:12 ` Peter Zijlstra 2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra ` (2 subsequent siblings) 7 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:12 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit, peterz; +Cc: linux-kernel, Nadav Amit Nadav reported that code-gen changed because of the this_cpu_*() constraints, avoid this for select_idle_cpu() because that runs with preemption (and IRQs) disabled anyway. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/fair.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6160,6 +6160,7 @@ static int select_idle_cpu(struct task_s u64 time, cost; s64 delta; int cpu, nr = INT_MAX; + int this = smp_processor_id(); this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); if (!this_sd) @@ -6183,7 +6184,7 @@ static int select_idle_cpu(struct task_s nr = 4; } - time = local_clock(); + time = cpu_clock(this); for_each_cpu_wrap(cpu, sched_domain_span(sd), target) { if (!--nr) @@ -6194,7 +6195,7 @@ static int select_idle_cpu(struct task_s break; } - time = local_clock() - time; + time = cpu_clock(this) - time; cost = this_sd->avg_scan_cost; delta = (s64)(time - cost) / 8; this_sd->avg_scan_cost += delta; ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra ` (4 preceding siblings ...) 2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra @ 2019-02-27 10:24 ` Peter Zijlstra 2019-02-27 14:43 ` Peter Zijlstra 2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit 2019-03-08 14:50 ` Peter Zijlstra 7 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 10:24 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel And because it's one of _those_ days, I forgot to include one patch... --- Subject: x86/percpu: Optimize raw_cpu_xchg() From: Peter Zijlstra <peterz@infradead.org> Date: Wed Feb 27 11:09:56 CET 2019 Since raw_cpu_xchg() doesn't need to be IRQ-safe, like this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg loop. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/percpu.h | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -407,9 +407,21 @@ do { \ #define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val) #define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val) #define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val) -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val) -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val) -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val) + +/* + * raw_cpu_xchg() can use a load-store since it is not required to be + * IRQ-safe. 
+ */ +#define raw_percpu_xchg_op(var, nval) \ +({ \ + typeof(var) pxo_ret__ = raw_cpu_read(var); \ + raw_cpu_write(var, (nval)); \ + pxo_ret__; \ +}) + +#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val) +#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val) +#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val) #define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp) #define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() 2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra @ 2019-02-27 14:43 ` Peter Zijlstra 0 siblings, 0 replies; 23+ messages in thread From: Peter Zijlstra @ 2019-02-27 14:43 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel On Wed, Feb 27, 2019 at 11:24:45AM +0100, Peter Zijlstra wrote: > > And because it's one of _those_ days, I forgot to include one patch... > > --- > Subject: x86/percpu: Optimize raw_cpu_xchg() > From: Peter Zijlstra <peterz@infradead.org> > Date: Wed Feb 27 11:09:56 CET 2019 > > Since raw_cpu_xchg() doesn't need to be IRQ-safe, like > this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg > loop. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > arch/x86/include/asm/percpu.h | 18 +++++++++++++++--- > 1 file changed, 15 insertions(+), 3 deletions(-) > > --- a/arch/x86/include/asm/percpu.h > +++ b/arch/x86/include/asm/percpu.h > @@ -407,9 +407,21 @@ do { \ > #define raw_cpu_or_1(pcp, val) percpu_to_op(, "or", (pcp), val) > #define raw_cpu_or_2(pcp, val) percpu_to_op(, "or", (pcp), val) > #define raw_cpu_or_4(pcp, val) percpu_to_op(, "or", (pcp), val) > -#define raw_cpu_xchg_1(pcp, val) percpu_xchg_op(, pcp, val) > -#define raw_cpu_xchg_2(pcp, val) percpu_xchg_op(, pcp, val) > -#define raw_cpu_xchg_4(pcp, val) percpu_xchg_op(, pcp, val) > + > +/* > + * raw_cpu_xchg() can use a load-store since it is not required to be > + * IRQ-safe. 
> + */ > +#define raw_percpu_xchg_op(var, nval) \ > +({ \ > + typeof(var) pxo_ret__ = raw_cpu_read(var); \ > + raw_cpu_write(var, (nval)); \ > + pxo_ret__; \ > +}) > + > +#define raw_cpu_xchg_1(pcp, val) raw_percpu_xchg_op(pcp, val) > +#define raw_cpu_xchg_2(pcp, val) raw_percpu_xchg_op(pcp, val) > +#define raw_cpu_xchg_4(pcp, val) raw_percpu_xchg_op(pcp, val) > > #define this_cpu_read_1(pcp) percpu_from_op(volatile, "mov", pcp) > #define this_cpu_read_2(pcp) percpu_from_op(volatile, "mov", pcp) And yes, I just added raw_cpu_xchg_8... *sigh* ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/5] x86/percpu semantics and fixes 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra ` (5 preceding siblings ...) 2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra @ 2019-02-27 23:16 ` Nadav Amit 2019-03-08 14:50 ` Peter Zijlstra 7 siblings, 0 replies; 23+ messages in thread From: Nadav Amit @ 2019-02-27 23:16 UTC (permalink / raw) To: Peter Zijlstra Cc: torvalds@linux-foundation.org, mingo@kernel.org, bp@alien8.de, tglx@linutronix.de, luto@kernel.org, linux-kernel@vger.kernel.org > On Feb 27, 2019, at 2:12 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > Hi all, > > This is a collection of x86/percpu changes that I had pending and got reminded > of by Linus' comment yesterday about __this_cpu_xchg(). > > This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'. > > Built and boot tested with CONFIG_DEBUG_PREEMPT=y. Overall this series affects 70 functions and shortens the code by 326 bytes. _local_bh_enable() for example is shorten by 14 bytes (26%). I must admit that I although I pointed some of these issues before, I am not sure whether they are really important... Recently, I tried to see how to make the compiler to generate “better code” from Linux. I sprinkled “pure” attribute on many common function (e.g., page_rmapping()), sg_next()); sprinkled const-attribute on some others (e.g., jiffies_to_msecs()); created a const-alias variable so the compiler would consider kaslr variables and sme_me_mask as constant after initialization, and so on. I was then looking at the changed code, and while some functions were shorter and some longer, many common functions did look “better”. The only problem was that any benchmark that I did not show any measurable impact. So perhaps it is a matter of measurement, but eventually right now there is no clean win. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/5] x86/percpu semantics and fixes 2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra ` (6 preceding siblings ...) 2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit @ 2019-03-08 14:50 ` Peter Zijlstra 2019-03-08 19:35 ` Nadav Amit 7 siblings, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2019-03-08 14:50 UTC (permalink / raw) To: torvalds, mingo, bp, tglx, luto, namit; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 21199 bytes --] On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote: > This is a collection of x86/percpu changes that I had pending and got reminded > of by Linus' comment yesterday about __this_cpu_xchg(). > > This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'. (Sorry; this is going to have _wide_ output) OK, so what I did is I build 4 kernels (O=defconfig-build{,1,2,3}) with resp that many patches of this series applied. When I look at just the vmlinux size output: $ size defconfig-build*/vmlinux text data bss dec hex filename 19540631 5040164 1871944 26452739 193a303 defconfig-build/vmlinux 19540635 5040164 1871944 26452743 193a307 defconfig-build1/vmlinux 19540685 5040164 1871944 26452793 193a339 defconfig-build2/vmlinux 19540685 5040164 1871944 26452793 193a339 defconfig-build3/vmlinux Things appear to get slightly larger; however when I look in more detail using my (newly written compare script, find attached), I get things like: $ ./compare.sh defconfig-build defconfig-build1 arch/x86/mm/fault.o 12850 12818 -32 kernel/power/process.o 3586 3706 +120 kernel/locking/rtmutex.o 1687 1671 -16 kernel/sched/core.o 7127 7181 +54 kernel/time/tick-sched.o 8941 8837 -104 kernel/exit.o 310 385 +75 kernel/softirq.o 1217 1199 -18 kernel/workqueue.o 3240 3288 +48 net/ipv6/tcp_ipv6.o 25434 25345 -89 net/ipv4/tcp_ipv4.o 301 305 +4 total 4768226 4768268 +42 When we look at just tick-sched.o: $ ./compare.sh defconfig-build defconfig-build1 
kernel/time/tick-sched.o can_stop_idle_tick.isra.14 146 139 -7 we see a totally different number ?! $ ./compare.sh defconfig-build defconfig-build1 kernel/time/tick-sched.o can_stop_idle_tick.isra.14 0000 0000000000000680 <can_stop_idle_tick.isra.14>: | 0000 0000000000000680 <can_stop_idle_tick.isra.14>: 0000 680: 53 push %rbx | 0000 680: 53 push %rbx 0001 681: 89 f8 mov %edi,%eax | 0001 681: 89 f8 mov %edi,%eax 0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id | 0003 683: 48 0f a3 05 00 00 00 bt %rax,0x0(%rip) # 68b <can_stop_id 000a 68a: 00 | 000a 68a: 00 0007 687: R_X86_64_PC32 __cpu_online_mask-0x4 | 0007 687: R_X86_64_PC32 __cpu_online_mask-0x4 000b 68b: 0f 92 c3 setb %bl | 000b 68b: 0f 92 c3 setb %bl 000e 68e: 73 67 jae 6f7 <can_stop_idle_tick.isra.14+0x77> \ 000e 68e: 73 48 jae 6d8 <can_stop_idle_tick.isra.14+0x58> 0010 690: 8b 06 mov (%rsi),%eax | 0010 690: 8b 06 mov (%rsi),%eax 0012 692: 85 c0 test %eax,%eax | 0012 692: 85 c0 test %eax,%eax 0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37> | 0014 694: 74 21 je 6b7 <can_stop_idle_tick.isra.14+0x37> 0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax | 0016 696: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 001d 69d: 00 00 | 001d 69d: 00 00 001b 69b: R_X86_64_32S current_task | 001b 69b: R_X86_64_32S current_task 001f 69f: 48 8b 00 mov (%rax),%rax | 001f 69f: 48 8b 00 mov (%rax),%rax 0022 6a2: a8 08 test $0x8,%al | 0022 6a2: a8 08 test $0x8,%al 0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37> | 0024 6a4: 75 11 jne 6b7 <can_stop_idle_tick.isra.14+0x37> 0026 6a6: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6ae <can_stop \ 0026 6a6: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6ae <can_stop 002d 6ad: 00 | 002d 6ad: 00 002a 6aa: R_X86_64_PC32 irq_stat-0x4 | 002a 6aa: R_X86_64_PC32 irq_stat-0x4 002e 6ae: 66 85 c0 test %ax,%ax \ 002e 6ae: 66 85 f6 test %si,%si 0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d> | 0031 6b1: 75 0a jne 6bd <can_stop_idle_tick.isra.14+0x3d> 0033 6b3: 89 d8 
mov %ebx,%eax | 0033 6b3: 89 d8 mov %ebx,%eax 0035 6b5: 5b pop %rbx | 0035 6b5: 5b pop %rbx 0036 6b6: c3 retq | 0036 6b6: c3 retq 0037 6b7: 31 db xor %ebx,%ebx | 0037 6b7: 31 db xor %ebx,%ebx 0039 6b9: 89 d8 mov %ebx,%eax | 0039 6b9: 89 d8 mov %ebx,%eax 003b 6bb: 5b pop %rbx | 003b 6bb: 5b pop %rbx 003c 6bc: c3 retq | 003c 6bc: c3 retq 003d 6bd: 31 db xor %ebx,%ebx | 003d 6bd: 31 db xor %ebx,%ebx 003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id | 003f 6bf: 83 3d 00 00 00 00 09 cmpl $0x9,0x0(%rip) # 6c6 <can_stop_id 0041 6c1: R_X86_64_PC32 .bss-0x5 | 0041 6c1: R_X86_64_PC32 .bss-0x5 0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33> | 0046 6c6: 7f eb jg 6b3 <can_stop_idle_tick.isra.14+0x33> 0048 6c8: 65 66 8b 05 00 00 00 mov %gs:0x0(%rip),%ax # 6d0 <can_stop \ 0048 6c8: 0f b7 f6 movzwl %si,%esi 004f 6cf: 00 \ 004b 6cb: f7 c6 ff fd 00 00 test $0xfdff,%esi 004c 6cc: R_X86_64_PC32 irq_stat-0x4 \ 0051 6d1: 74 e0 je 6b3 <can_stop_idle_tick.isra.14+0x33> 0050 6d0: a9 ff fd 00 00 test $0xfdff,%eax \ 0053 6d3: e9 00 00 00 00 jmpq 6d8 <can_stop_idle_tick.isra.14+0x58> 0055 6d5: 74 dc je 6b3 <can_stop_idle_tick.isra.14+0x33> \ 0054 6d4: R_X86_64_PC32 .text.unlikely-0x4 0057 6d7: 65 66 8b 35 00 00 00 mov %gs:0x0(%rip),%si # 6df <can_stop \ 0058 6d8: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6de <can_stop_id 005e 6de: 00 \ 005a 6da: R_X86_64_PC32 tick_do_timer_cpu-0x4 005b 6db: R_X86_64_PC32 irq_stat-0x4 \ 005e 6de: 75 0a jne 6ea <can_stop_idle_tick.isra.14+0x6a> 005f 6df: 48 c7 c7 00 00 00 00 mov $0x0,%rdi \ 0060 6e0: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 6ea <can_ 0062 6e2: R_X86_64_32S .rodata.str1.8 \ 0067 6e7: ff ff ff 0066 6e6: 0f b7 f6 movzwl %si,%esi \ 0062 6e2: R_X86_64_PC32 tick_do_timer_cpu-0x8 0069 6e9: e8 00 00 00 00 callq 6ee <can_stop_idle_tick.isra.14+0x6e> \ 006a 6ea: 48 c7 02 00 00 00 00 movq $0x0,(%rdx) 006a 6ea: R_X86_64_PLT32 printk-0x4 \ 0071 6f1: eb c0 jmp 6b3 <can_stop_idle_tick.isra.14+0x33> 006e 6ee: 83 05 00 00 00 
00 01 addl $0x1,0x0(%rip) # 6f5 <can_stop_id \ 0073 6f3: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) 0070 6f0: R_X86_64_PC32 .bss-0x5 \ 007a 6fa: 00 00 00 00 0075 6f5: eb bc jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 007e 6fe: 66 90 xchg %ax,%ax 0077 6f7: 3b 3d 00 00 00 00 cmp 0x0(%rip),%edi # 6fd <can_stop_id \ fffffffffffff980 0079 6f9: R_X86_64_PC32 tick_do_timer_cpu-0x4 \ 0000 0000000000000000 <can_stop_idle_tick.isra.14.cold.23>: 007d 6fd: 75 0a jne 709 <can_stop_idle_tick.isra.14+0x89> \ 0000 0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 007f 6ff: c7 05 00 00 00 00 ff movl $0xffffffff,0x0(%rip) # 709 <can_ \ 0003 3: R_X86_64_32S .rodata.str1.8 0086 706: ff ff ff \ 0007 7: e8 00 00 00 00 callq c <can_stop_idle_tick.isra.14.cold.23+0x 0081 701: R_X86_64_PC32 tick_do_timer_cpu-0x8 \ 0008 8: R_X86_64_PLT32 printk-0x4 0089 709: 48 c7 02 00 00 00 00 movq $0x0,(%rdx) \ 000c c: 83 05 00 00 00 00 01 addl $0x1,0x0(%rip) # 13 <can_stop_idl 0090 710: eb a1 jmp 6b3 <can_stop_idle_tick.isra.14+0x33> \ 000e e: R_X86_64_PC32 .bss-0x5 0092 712: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1) \ 0013 13: e9 00 00 00 00 jmpq 18 <__setup_setup_tick_nohz> 0099 719: 00 00 00 00 \ 0014 14: R_X86_64_PC32 .text+0x6af 009d 71d: 0f 1f 00 nopl (%rax) \ And we see that GCC created a .cold. subfunction because the first patch removed the volatile from __this_cpu_read() and could thus move it. 
Similarly the second patch, which removes volatile from smp_processor_id():

$ ./compare.sh defconfig-build1 defconfig-build2
arch/x86/events/amd/ibs.o                            667     757    +90
arch/x86/kernel/cpu/mce/core.o                      2677    2696    +19
arch/x86/kernel/cpu/mce/therm_throt.o                508     527    +19
arch/x86/kernel/cpu/mtrr/generic.o                  9523    9203   -320
arch/x86/kernel/acpi/sleep.o                        3152    3088    -64
arch/x86/kernel/nmi.o                                338     562   +224
arch/x86/kernel/process.o                           1554    1586    +32
arch/x86/kernel/tsc_sync.o                          5591    5377   -214
kernel/irq/spurious.o                               5835    5771    -64
kernel/irq/cpuhotplug.o                             2253    2189    -64
kernel/time/clocksource.o                            480     593   +113
total                                            4768268 4768039   -229

we get smaller total executable sections; and even when there is growth:

$ ./compare.sh defconfig-build1 defconfig-build2 arch/x86/events/amd/ibs.o setup_APIC_ibs

[side-by-side objdump -dr diff of setup_APIC_ibs, before vs. after patch 2:
 the printk()/cpu_number error path moved out of line into a new
 setup_APIC_ibs.cold.9 subfunction in .text.unlikely]

It is because of cold subfunction creation; with a reduction in size of the
regular path.
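The `.cold` splitting seen in these diffs can be reproduced with any out-of-line unlikely path. A hedged sketch follows; the function is hypothetical, and whether GCC actually splits it depends on the compiler version and flags (typically -O2 with hot/cold partitioning enabled):

```c
#include <stdio.h>
#include <assert.h>

/* GCC at -O2 may move the __builtin_expect(..., 0) branch into a separate
 * hot_path.cold subfunction in .text.unlikely, shrinking the regular path,
 * which is the effect the size diffs above are showing. */
static int hot_path(int x)
{
	if (__builtin_expect(x < 0, 0)) {	/* cold: error reporting */
		fprintf(stderr, "bad input %d\n", x);
		return -1;
	}
	return x * 2;				/* hot: stays in .text */
}
```

Inspecting the object with `objdump -dr` would then show `hot_path.cold` as its own symbol, just like `setup_APIC_ibs.cold.9` above.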
The third build included patches 3 and 4 (because they don't much overlap);
and give some meagre savings:

$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/kernel/irq.o
do_IRQ                                               195     187     -8
smp_x86_platform_ipi                                 234     222    -12
smp_kvm_posted_intr_ipi                               74      66     -8
smp_kvm_posted_intr_wakeup_ipi                        86      78     -8
smp_kvm_posted_intr_nested_ipi                        74      66     -8

$ ./compare.sh defconfig-build2 defconfig-build3 arch/x86/mm/tlb.o
flush_tlb_func_common.constprop.13                   728     719     -9
switch_mm_irqs_off                                  1528    1524     -4

Now, I realize you particularly hate the tlb patch; and I'll see if I can
get these same savings with a few less __ added.

But in general, I think these patches are worth it. esp. since I've already
done them :-)

[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 1750 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
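The compare.sh attachment itself is not reproduced in the archive. A rough sketch of a per-symbol size diff in the same spirit could look like the following; the real script almost certainly differs, and the function names here are made up:

```shell
#!/bin/sh
# Sketch of a per-symbol text-size diff, in the spirit of the attached
# compare.sh (which is not shown here, so details are guessed).

# Emit sorted "symbol size" pairs for the text symbols of an object file.
symbol_sizes() {
	nm -S --radix=d "$1" |
	awk 'NF == 4 && $3 ~ /^[tT]$/ { print $4, $2 + 0 }' |
	sort
}

# Join two such lists and print every symbol whose size changed, with delta.
diff_sizes() {
	join "$1" "$2" |
	awk '$2 != $3 { printf "%-40s %7d %7d %+d\n", $1, $2, $3, $3 - $2 }'
}
```

Usage would be something like `symbol_sizes build1/foo.o > a; symbol_sizes build2/foo.o > b; diff_sizes a b`, giving output shaped like the tables quoted in this thread.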
* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 14:50             ` Peter Zijlstra
@ 2019-03-08 19:35               ` Nadav Amit
  2019-03-08 20:56                 ` Peter Zijlstra
  2019-03-08 22:48                 ` Peter Zijlstra
  0 siblings, 2 replies; 23+ messages in thread
From: Nadav Amit @ 2019-03-08 19:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
    luto@kernel.org, linux-kernel@vger.kernel.org

> On Mar 8, 2019, at 6:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Feb 27, 2019 at 11:12:52AM +0100, Peter Zijlstra wrote:
>
>> This is a collection of x86/percpu changes that I had pending and got reminded
>> of by Linus' comment yesterday about __this_cpu_xchg().
>>
>> This tidies up the x86/percpu primitives and fixes a bunch of 'fallout'.
>
> (Sorry; this is going to have _wide_ output)
>
> OK, so what I did is I built 4 kernels (O=defconfig-build{,1,2,3}) with
> resp. that many patches of this series applied.
>
> When I look at just the vmlinux size output:
>
> $ size defconfig-build*/vmlinux
>     text     data      bss       dec      hex  filename
> 19540631  5040164  1871944  26452739  193a303  defconfig-build/vmlinux
> 19540635  5040164  1871944  26452743  193a307  defconfig-build1/vmlinux
> 19540685  5040164  1871944  26452793  193a339  defconfig-build2/vmlinux
> 19540685  5040164  1871944  26452793  193a339  defconfig-build3/vmlinux
>
> Things appear to get slightly larger; however when I look in more
> detail using my (newly written) compare script (find attached), I get
> things like:

Nice script! I keep asking myself how comparing two binaries can provide
some “number” to indicate how “good” the binary is (at least relatively to
another one) - either during compilation or after. Code size, as you show,
is the wrong metric.

Anyhow, I am a little disappointed (and surprised) that in most cases that
I played with, this kind of optimization has marginal impact on performance
at best, even when the binary changes “a lot” and when microbenchmarks are
used.
* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 19:35               ` Nadav Amit
@ 2019-03-08 20:56                 ` Peter Zijlstra
  2019-03-10 12:46                   ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 20:56 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
    luto@kernel.org, linux-kernel@vger.kernel.org

On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script! I keep asking myself how comparing two binaries can provide
> some “number” to indicate how “good” the binary is (at least relatively to
> another one) - either during compilation or after. Code size, as you show,
> is the wrong metric.

Right; I'm still pondering other metrics, like:

  readelf -WS | grep AX | grep -v -e init -e exit -e altinstr -e unlikely -e fixup

which is only 'fast' path text.

> Anyhow, I am a little disappointed (and surprised) that in most cases that
> I played with, this kind of optimization has marginal impact on performance
> at best, even when the binary changes “a lot” and when microbenchmarks are
> used.

Right, but if we don't care, it'll be death by 1000 cuts.

Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?

I'm thinking of a variant of Linus' patch, but I'm confused about the
above.
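The 'fast path text' metric sketched in the readelf pipeline above could be fleshed out roughly as follows. This is a sketch, not a tested kernel tool; the helper name is made up and the readelf column parsing is a best guess against GNU readelf -WS output:

```shell
#!/bin/sh
# Sum the sizes of AX (alloc+execute) sections, skipping init/exit/
# altinstr/unlikely/fixup sections, i.e. count only 'fast' path text.

# stdin: output of `readelf -WS file.o`; stdout: total bytes (decimal).
sum_ax_bytes() {
	grep 'AX' |
	grep -v -e init -e exit -e altinstr -e unlikely -e fixup |
	awk -F'] ' '
	# readelf prints section sizes in hex; convert by hand for
	# portability (POSIX awk has no strtonum).
	function hex(s,   i, n) {
		n = 0
		for (i = 1; i <= length(s); i++)
			n = n * 16 + index("0123456789abcdef",
					   substr(tolower(s), i, 1)) - 1
		return n
	}
	NF > 1 {
		split($2, f, /[ \t]+/)	# name type addr off size ...
		total += hex(f[5])
	}
	END { print total + 0 }'
}
```

Comparing `readelf -WS build1/vmlinux.o | sum_ax_bytes` against the same for build2 would then give a single hot-text number per kernel.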
* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 20:56                 ` Peter Zijlstra
@ 2019-03-10 12:46                   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-10 12:46 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
    luto@kernel.org, linux-kernel@vger.kernel.org

On Fri, Mar 08, 2019 at 09:56:37PM +0100, Peter Zijlstra wrote:
> Anyway, can anybody explain percpu_stable_op() vs percpu_from_op() ?
>
> I'm thinking of a variant of Linus' patch, but I'm confused about the
> above.

So whatever I tried with +m only made things worse and always affects
thousands of symbols.

Now, afaict the whole percpu_stable_op thing is an ugly hack because some
earlier compiler would not CSE the regular percpu_from_op. But since it
does do that today, esp. after my first patch, I tried implementing
this_cpu_read_stable() with percpu_from_op() (no volatile, obv).

That also affects _lots_ of sites, but also significantly shrinks the
kernel image. It's 2307 symbols affected, but:

17642871  2157438  747808  20548117  1398a15  defconfig-build1/vmlinux.o (patch 1)
17639081  2157438  747808  20544327  1397b47  defconfig-build0/vmlinux.o (patch 1 - percpu_stable_op)

So I think I'll add a patch removing percpu_stable_op and all its users.
* Re: [PATCH 0/5] x86/percpu semantics and fixes
  2019-03-08 19:35               ` Nadav Amit
  2019-03-08 20:56                 ` Peter Zijlstra
@ 2019-03-08 22:48                 ` Peter Zijlstra
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2019-03-08 22:48 UTC (permalink / raw)
To: Nadav Amit
Cc: Linus Torvalds, Ingo Molnar, Borislav Petkov, tglx@linutronix.de,
    luto@kernel.org, linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 151 bytes --]

On Fri, Mar 08, 2019 at 07:35:17PM +0000, Nadav Amit wrote:
> Nice script!

Find a new one; this one is fast enough to run a symbol diff on vmlinux.o

[-- Attachment #2: compare.sh --]
[-- Type: application/x-sh, Size: 2221 bytes --]
Thread overview: 23+ messages
2019-02-27 10:12 [PATCH 0/5] x86/percpu semantics and fixes Peter Zijlstra
2019-02-27 10:12 ` [PATCH 1/5] x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}() Peter Zijlstra
2019-02-27 16:14 ` Linus Torvalds
2019-02-27 16:48 ` Peter Zijlstra
2019-02-27 17:17 ` Linus Torvalds
2019-02-27 17:34 ` Peter Zijlstra
2019-02-27 17:38 ` Linus Torvalds
2019-02-27 17:57 ` Nadav Amit
2019-02-27 18:55 ` Nadav Amit
2019-02-27 19:41 ` Linus Torvalds
2019-03-08 13:35 ` Peter Zijlstra
2019-02-27 10:12 ` [PATCH 2/5] x86/percpu: Relax smp_processor_id() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 3/5] x86/percpu, x86/irq: Relax {set,get}_irq_regs() Peter Zijlstra
2019-02-27 10:12 ` [PATCH 4/5] x86/percpu, x86/tlb: Relax cpu_tlbstate accesses Peter Zijlstra
2019-02-27 10:12 ` [PATCH 5/5] x86/percpu, sched/fair: Avoid local_clock() Peter Zijlstra
2019-02-27 10:24 ` [PATCH 6/5] x86/percpu: Optimize raw_cpu_xchg() Peter Zijlstra
2019-02-27 14:43 ` Peter Zijlstra
2019-02-27 23:16 ` [PATCH 0/5] x86/percpu semantics and fixes Nadav Amit
2019-03-08 14:50 ` Peter Zijlstra
2019-03-08 19:35 ` Nadav Amit
2019-03-08 20:56 ` Peter Zijlstra
2019-03-10 12:46 ` Peter Zijlstra
2019-03-08 22:48 ` Peter Zijlstra