From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>,
akpm@linux-foundation.org, Pekka Enberg <penberg@cs.helsinki.fi>,
linux-kernel@vger.kernel.org,
Eric Dumazet <eric.dumazet@gmail.com>,
Arjan van de Ven <arjan@infradead.org>,
hpa@zytor.com
Subject: Re: [cpuops cmpxchg V1 2/4] x86: this_cpu_cmpxchg and this_cpu_xchg operations
Date: Wed, 8 Dec 2010 13:17:36 -0500 [thread overview]
Message-ID: <20101208181736.GC30693@Krystal> (raw)
In-Reply-To: <alpine.DEB.2.00.1012081207250.26943@router.home>
* Christoph Lameter (cl@linux.com) wrote:
> Alternate approach: Could also use cmpxchg for xchg..
>
>
> Subject: cpuops: Use cmpxchg for xchg to avoid lock semantics
>
> Cmpxchg has a lower cycle count than xchg because xchg carries implied lock semantics.
>
> Simulate the xchg through cmpxchg for the cpu ops.
Hi Christoph,
Can you show whether this provides savings in terms of:
- instruction cache footprint
- cycles required to run
- large-scale impact on the branch prediction buffers
Given that this targets per-cpu data only, the additional impact on cache-line
exchange traffic of using cmpxchg over xchg (cache-line not grabbed as exclusive
by the initial read) should not really matter.
I'm CCing Arjan and HPA because they might have some interesting insight into
the performance impact of a lock-prefixed xchg vs. a local cmpxchg loop.
Thanks,
Mathieu
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> ---
> arch/x86/include/asm/percpu.h | 68 +++++++-----------------------------------
> 1 file changed, 12 insertions(+), 56 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/percpu.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-08 11:43:50.000000000 -0600
> +++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-08 12:00:21.000000000 -0600
> @@ -212,48 +212,6 @@ do { \
> ret__; \
> })
>
> -/*
> - * Beware: xchg on x86 has an implied lock prefix. There will be the cost of
> - * full lock semantics even though they are not needed.
> - */
> -#define percpu_xchg_op(var, nval) \
> -({ \
> - typeof(var) __ret; \
> - typeof(var) __new = (nval); \
> - switch (sizeof(var)) { \
> - case 1: \
> - asm("xchgb %2, "__percpu_arg(1) \
> - : "=a" (__ret), "+m" (var) \
> - : "q" (__new) \
> - : "memory"); \
> - break; \
> - case 2: \
> - asm("xchgw %2, "__percpu_arg(1) \
> - : "=a" (__ret), "+m" (var) \
> - : "r" (__new) \
> - : "memory"); \
> - break; \
> - case 4: \
> - asm("xchgl %2, "__percpu_arg(1) \
> - : "=a" (__ret), "+m" (var) \
> - : "r" (__new) \
> - : "memory"); \
> - break; \
> - case 8: \
> - asm("xchgq %2, "__percpu_arg(1) \
> - : "=a" (__ret), "+m" (var) \
> - : "r" (__new) \
> - : "memory"); \
> - break; \
> - default: __bad_percpu_size(); \
> - } \
> - __ret; \
> -})
> -
> -/*
> - * cmpxchg has no such implied lock semantics as a result it is much
> - * more efficient for cpu local operations.
> - */
> #define percpu_cmpxchg_op(var, oval, nval) \
> ({ \
> typeof(var) __ret; \
> @@ -412,16 +370,6 @@ do { \
> #define irqsafe_cpu_xor_2(pcp, val) percpu_to_op("xor", (pcp), val)
> #define irqsafe_cpu_xor_4(pcp, val) percpu_to_op("xor", (pcp), val)
>
> -#define __this_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define __this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define __this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define this_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define irqsafe_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define irqsafe_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define irqsafe_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval)
> -
> #ifndef CONFIG_M386
> #define __this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val)
> #define __this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val)
> @@ -489,16 +437,24 @@ do { \
> #define __this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val)
> #define this_cpu_add_return_8(pcp, val) percpu_add_return_op(pcp, val)
>
> -#define __this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval)
> -#define irqsafe_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval)
> -
> #define __this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
> #define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
> #define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
>
> #endif
>
> +#define this_cpu_xchg(pcp, val) \
> +({ \
> + typeof(val) __o; \
> + do { \
> + __o = __this_cpu_read(pcp); \
> + } while (this_cpu_cmpxchg(pcp, __o, val) != __o); \
> + __o; \
> +})
> +
> +#define __this_cpu_xchg this_cpu_xchg
> +#define irqsafe_cpu_xchg this_cpu_xchg
> +
> /* This is not atomic against other CPUs -- CPU preemption needs to be off */
> #define x86_test_and_clear_bit_percpu(bit, var) \
> ({ \
>
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com