[PATCH] x86: fix and improve percpu_cmpxchg{8,16}b

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double()
@ 2011-12-14  8:33 Jan Beulich
  2011-12-15  9:55 ` [tip:x86/asm] x86: Fix " tip-bot for Jan Beulich
  2011-12-16 15:39 ` [PATCH] x86: fix " H. Peter Anvin
  0 siblings, 2 replies; 6+ messages in thread
From: Jan Beulich @ 2011-12-14  8:33 UTC (permalink / raw)
  To: mingo, tglx, hpa; +Cc: linux-kernel

They had several problems/shortcomings:

Only the first memory operand was mentioned in the 2x32bit asm()
operands, and 2x64-bit version had a memory clobber. The first allowed
the compiler to not recognize the need to re-load the data in case it
had it cached in some register, and the second was overly destructive.

The memory operand in the 2x32-bit asm() was declared to only be an
output.

The types of the local copies of the old and new values were incorrect
(as in other per-CPU ops, the types of the per-CPU variables accessed
should be used here, to make sure the respective types are compatible).

The __dummy variable was pointless (and needlessly initialized in the
2x32-bit case), given that local copies of the inputs already exist.

The 2x64-bit variant forced the address of the first object into %rsi,
even though this is needed only for the call to the emulation function.
The real cmpxchg16b can operate on an memory.

At once also change the return value type to what it really is -
'bool'.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

---
 arch/x86/include/asm/percpu.h |   53 ++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 32 deletions(-)

--- 3.2-rc5/arch/x86/include/asm/percpu.h
+++ 3.2-rc5-x86-cpu-cmpxchg-double/arch/x86/include/asm/percpu.h
@@ -451,23 +451,20 @@ do {									\
 #endif /* !CONFIG_M386 */
 
 #ifdef CONFIG_X86_CMPXCHG64
-#define percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)			\
+#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
 ({									\
-	char __ret;							\
-	typeof(o1) __o1 = o1;						\
-	typeof(o1) __n1 = n1;						\
-	typeof(o2) __o2 = o2;						\
-	typeof(o2) __n2 = n2;						\
-	typeof(o2) __dummy = n2;					\
+	bool __ret;							\
+	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
+	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
 	asm volatile("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"	\
-		    : "=a"(__ret), "=m" (pcp1), "=d"(__dummy)		\
-		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+		    : "=a" (__ret), "+m" (pcp1), "+m" (pcp2), "+d" (__o2) \
+		    :  "b" (__n1), "c" (__n2), "a" (__o1));		\
 	__ret;								\
 })
 
-#define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
-#define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define __this_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
+#define this_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
+#define irqsafe_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
 #endif /* CONFIG_X86_CMPXCHG64 */
 
 /*
@@ -508,31 +505,23 @@ do {									\
  * it in software.  The address used in the cmpxchg16 instruction must be
  * aligned to a 16 byte boundary.
  */
-#ifdef CONFIG_SMP
-#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" ASM_NOP3
-#else
-#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" ASM_NOP2
-#endif
-#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)			\
+#define percpu_cmpxchg16b_double(pcp1, pcp2, o1, o2, n1, n2)		\
 ({									\
-	char __ret;							\
-	typeof(o1) __o1 = o1;						\
-	typeof(o1) __n1 = n1;						\
-	typeof(o2) __o2 = o2;						\
-	typeof(o2) __n2 = n2;						\
-	typeof(o2) __dummy;						\
-	alternative_io(CMPXCHG16B_EMU_CALL,				\
-		       "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t",	\
+	bool __ret;							\
+	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
+	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
+	alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
+		       "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t",	\
 		       X86_FEATURE_CX16,				\
-		       ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),		\
-		       "S" (&pcp1), "b"(__n1), "c"(__n2),		\
-		       "a"(__o1), "d"(__o2) : "memory");		\
+		       ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),		\
+				   "+m" (pcp2), "+d" (__o2)),		\
+		       "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");	\
 	__ret;								\
 })
 
-#define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
-#define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
+#define __this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
+#define this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
+#define irqsafe_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
 
 #endif
 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tip:x86/asm] x86: Fix and improve percpu_cmpxchg{8,16}b_double()
  2011-12-14  8:33 [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double() Jan Beulich
@ 2011-12-15  9:55 ` tip-bot for Jan Beulich
  2011-12-15 22:52   ` Christoph Lameter
  2011-12-16 15:39 ` [PATCH] x86: fix " H. Peter Anvin
  1 sibling, 1 reply; 6+ messages in thread
From: tip-bot for Jan Beulich @ 2011-12-15  9:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, jbeulich, cl,
	dhowells, akpm, JBeulich, tglx, mingo

Commit-ID:  cebef5beed3de3037de85a521495897256b2c5da
Gitweb:     http://git.kernel.org/tip/cebef5beed3de3037de85a521495897256b2c5da
Author:     Jan Beulich <JBeulich@suse.com>
AuthorDate: Wed, 14 Dec 2011 08:33:25 +0000
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 15 Dec 2011 08:17:14 +0100

x86: Fix and improve percpu_cmpxchg{8,16}b_double()

They had several problems/shortcomings:

Only the first memory operand was mentioned in the 2x32bit asm()
operands, and 2x64-bit version had a memory clobber. The first
allowed the compiler to not recognize the need to re-load the
data in case it had it cached in some register, and the second
was overly destructive.

The memory operand in the 2x32-bit asm() was declared to only be
an output.

The types of the local copies of the old and new values were
incorrect (as in other per-CPU ops, the types of the per-CPU
variables accessed should be used here, to make sure the
respective types are compatible).

The __dummy variable was pointless (and needlessly initialized
in the 2x32-bit case), given that local copies of the inputs
already exist.

The 2x64-bit variant forced the address of the first object into
%rsi, even though this is needed only for the call to the
emulation function. The real cmpxchg16b can operate on an
memory.

At once also change the return value type to what it really is -
'bool'.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4EE86D6502000078000679FE@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/percpu.h |   53 ++++++++++++++++------------------------
 1 files changed, 21 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 3470c9d..529bf07e 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -451,23 +451,20 @@ do {									\
 #endif /* !CONFIG_M386 */
 
 #ifdef CONFIG_X86_CMPXCHG64
-#define percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)			\
+#define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
 ({									\
-	char __ret;							\
-	typeof(o1) __o1 = o1;						\
-	typeof(o1) __n1 = n1;						\
-	typeof(o2) __o2 = o2;						\
-	typeof(o2) __n2 = n2;						\
-	typeof(o2) __dummy = n2;					\
+	bool __ret;							\
+	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
+	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
 	asm volatile("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"	\
-		    : "=a"(__ret), "=m" (pcp1), "=d"(__dummy)		\
-		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+		    : "=a" (__ret), "+m" (pcp1), "+m" (pcp2), "+d" (__o2) \
+		    :  "b" (__n1), "c" (__n2), "a" (__o1));		\
 	__ret;								\
 })
 
-#define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
-#define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
+#define __this_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
+#define this_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
+#define irqsafe_cpu_cmpxchg_double_4	percpu_cmpxchg8b_double
 #endif /* CONFIG_X86_CMPXCHG64 */
 
 /*
@@ -508,31 +505,23 @@ do {									\
  * it in software.  The address used in the cmpxchg16 instruction must be
  * aligned to a 16 byte boundary.
  */
-#ifdef CONFIG_SMP
-#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" ASM_NOP3
-#else
-#define CMPXCHG16B_EMU_CALL "call this_cpu_cmpxchg16b_emu\n\t" ASM_NOP2
-#endif
-#define percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)			\
+#define percpu_cmpxchg16b_double(pcp1, pcp2, o1, o2, n1, n2)		\
 ({									\
-	char __ret;							\
-	typeof(o1) __o1 = o1;						\
-	typeof(o1) __n1 = n1;						\
-	typeof(o2) __o2 = o2;						\
-	typeof(o2) __n2 = n2;						\
-	typeof(o2) __dummy;						\
-	alternative_io(CMPXCHG16B_EMU_CALL,				\
-		       "cmpxchg16b " __percpu_prefix "(%%rsi)\n\tsetz %0\n\t",	\
+	bool __ret;							\
+	typeof(pcp1) __o1 = (o1), __n1 = (n1);				\
+	typeof(pcp2) __o2 = (o2), __n2 = (n2);				\
+	alternative_io("leaq %P1,%%rsi\n\tcall this_cpu_cmpxchg16b_emu\n\t", \
+		       "cmpxchg16b " __percpu_arg(1) "\n\tsetz %0\n\t",	\
 		       X86_FEATURE_CX16,				\
-		       ASM_OUTPUT2("=a"(__ret), "=d"(__dummy)),		\
-		       "S" (&pcp1), "b"(__n1), "c"(__n2),		\
-		       "a"(__o1), "d"(__o2) : "memory");		\
+		       ASM_OUTPUT2("=a" (__ret), "+m" (pcp1),		\
+				   "+m" (pcp2), "+d" (__o2)),		\
+		       "b" (__n1), "c" (__n2), "a" (__o1) : "rsi");	\
 	__ret;								\
 })
 
-#define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
-#define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
+#define __this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
+#define this_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
+#define irqsafe_cpu_cmpxchg_double_8	percpu_cmpxchg16b_double
 
 #endif
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [tip:x86/asm] x86: Fix and improve percpu_cmpxchg{8,16}b_double()
  2011-12-15  9:55 ` [tip:x86/asm] x86: Fix " tip-bot for Jan Beulich
@ 2011-12-15 22:52   ` Christoph Lameter
  2011-12-16  8:48     ` Jan Beulich
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Lameter @ 2011-12-15 22:52 UTC (permalink / raw)
  To: jbeulich, mingo, hpa, linux-kernel, torvalds, a.p.zijlstra,
	dhowells, akpm, Thomas Gleixner, Ingo Molnar
  Cc: linux-tip-commits

On Thu, 15 Dec 2011, tip-bot for Jan Beulich wrote:

> x86: Fix and improve percpu_cmpxchg{8,16}b_double()

Great. Thanks. I wish you would have been there when I had to put this
together. Had a difficult time finding relevant docs etc on how to exactly
do this. And the failure messages from the inline asm parser were not that
helpful.

Acked-by: Christoph Lameter <cl@linux.com>

> The __dummy variable was pointless (and needlessly initialized
> in the 2x32-bit case), given that local copies of the inputs
> already exist.

Hmm... I had some failures if I did not specify that dummy in the
inline asm. Does this work for all gcc versions?

> The 2x64-bit variant forced the address of the first object into
> %rsi, even though this is needed only for the call to the
> emulation function. The real cmpxchg16b can operate on an
> memory.

Yup. Good idea to code the load into the alternative code path to avoid
the cmpxchg of the primary code path to be restricted to %si register.

You dropped the padding with NOPs. Are the instructions on both paths
always the same length?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [tip:x86/asm] x86: Fix and improve percpu_cmpxchg{8,16}b_double()
  2011-12-15 22:52   ` Christoph Lameter
@ 2011-12-16  8:48     ` Jan Beulich
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2011-12-16  8:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: a.p.zijlstra, Ingo Molnar, Thomas Gleixner, akpm, torvalds,
	dhowells, mingo, linux-kernel, linux-tip-commits, hpa

>>> On 15.12.11 at 23:52, Christoph Lameter <cl@linux.com> wrote:
> On Thu, 15 Dec 2011, tip-bot for Jan Beulich wrote:
>> The __dummy variable was pointless (and needlessly initialized
>> in the 2x32-bit case), given that local copies of the inputs
>> already exist.
> 
> Hmm... I had some failures if I did not specify that dummy in the
> inline asm. Does this work for all gcc versions?

You need to have the output go somewhere, but using the already
existing __o2 for this purpose is possible and sufficient.

>> The 2x64-bit variant forced the address of the first object into
>> %rsi, even though this is needed only for the call to the
>> emulation function. The real cmpxchg16b can operate on an
>> memory.
> 
> Yup. Good idea to code the load into the alternative code path to avoid
> the cmpxchg of the primary code path to be restricted to %si register.
> 
> You dropped the padding with NOPs. Are the instructions on both paths
> always the same length?

leaq and cmpxchg16b use the same operand, so their modrm
encoding (including eventual SIB and immediate) are the same.

leaq is REX+opcode+modrm...
cmpxchg16b (SEG+)REX+0x0f+opcode+modrm

so the latter is two bytes longer than the former (one byte in UP).

call being five bytes vs setz %al being 3 bytes makes it that
lea+call are one byte longer than cmpxchg16b+setz in the UP case,
but I think this is tolerable. If not, the operand of lea could be
made %rip-relative for the case where the operand is a direct
access (i.e. become %1 instead of %P1).

Jan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double()
  2011-12-14  8:33 [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double() Jan Beulich
  2011-12-15  9:55 ` [tip:x86/asm] x86: Fix " tip-bot for Jan Beulich
@ 2011-12-16 15:39 ` H. Peter Anvin
  2011-12-18  8:43   ` Ingo Molnar
  1 sibling, 1 reply; 6+ messages in thread
From: H. Peter Anvin @ 2011-12-16 15:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: mingo, tglx, linux-kernel

On 12/14/2011 12:33 AM, Jan Beulich wrote:
> 
> At once also change the return value type to what it really is -
> 'bool'.
> 

I have a hazy memory of "bool" not working correctly in inline assembly,
but it might be ancient history by now.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double()
  2011-12-16 15:39 ` [PATCH] x86: fix " H. Peter Anvin
@ 2011-12-18  8:43   ` Ingo Molnar
  0 siblings, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2011-12-18  8:43 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: Jan Beulich, tglx, linux-kernel


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/14/2011 12:33 AM, Jan Beulich wrote:
> > 
> > At once also change the return value type to what it really is -
> > 'bool'.
> > 
> 
> I have a hazy memory of "bool" not working correctly in inline assembly,
> but it might be ancient history by now.

I too have memory of badness - but i think it was something 
about suboptimal code generated. We'll see any stability 
problems soon enough, now that it's committed :-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-12-18  8:45 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-14  8:33 [PATCH] x86: fix and improve percpu_cmpxchg{8,16}b_double() Jan Beulich
2011-12-15  9:55 ` [tip:x86/asm] x86: Fix " tip-bot for Jan Beulich
2011-12-15 22:52   ` Christoph Lameter
2011-12-16  8:48     ` Jan Beulich
2011-12-16 15:39 ` [PATCH] x86: fix " H. Peter Anvin
2011-12-18  8:43   ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox