From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <51E23DE2.2040908@redhat.com>
Date: Sun, 14 Jul 2013 07:57:54 +0200
From: Paolo Bonzini <pbonzini@redhat.com>
To: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Cc: qemu-devel@nongnu.org, Richard Henderson
Subject: Re: [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops
In-Reply-To: <1373770422-5111-1-git-send-email-pingfank@linux.vnet.ibm.com>
References: <1373770422-5111-1-git-send-email-pingfank@linux.vnet.ibm.com>

On 14/07/2013 04:53, Liu Ping Fan wrote:
> The atomic inc/dec ops on a refcnt are frequent, and their idiom needs
> no seq_cst ordering.  So, to get better performance, it is worth
> adopting the _relaxed rather than the _seq_cst memory model for them.
>
> We resort to gcc builtins.  If gcc supports the C11 memory model, the
> __atomic_* builtins are used; otherwise the __sync_* builtins.
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

No, not at all. :(

First of all, I'd like to understand how you benchmarked this.  For
inc/dec, relaxed vs. seq_cst has no effect except on PPC and ARM.  And
if the refcount ops are frequent enough, I strongly suspect cacheline
bouncing has a bigger effect than the memory barriers.

Second, it is wrong, because you need a further read memory barrier when
you are removing the last reference: the thread that drops the last
reference and frees the object must first observe all the writes that
the other threads made to it (see the sketch at the end of this mail).

Third, it makes the API completely non-orthogonal, and "tend to be
exceptions" is not a justification.  The justification here could be,
rather than performance, not having to remember how to use
atomic_fetch_dec on the unref side.  I don't really buy that, but if you
really care, do something like

  #define atomic_ref(ptr, field) \
      __atomic_fetch_add(&((ptr)->field), 1, __ATOMIC_RELAXED)

  #define atomic_unref(ptr, field, releasefn) ( \
      __atomic_fetch_add(&((ptr)->field), -1, __ATOMIC_RELEASE) == 1 \
      ? (__atomic_thread_fence(__ATOMIC_ACQUIRE), (releasefn)(ptr), true) \
      : false)

(note the trailing ", true", so that both arms of the ?: have the same
type even when releasefn returns void), i.e. define a new interface
similar to kref_get/kref_put and, while you are at it, optimize it.
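For instance, a caller could then look like this (Buffer, buffer_free
and use_buffer are placeholders I am making up here, not existing QEMU
code):

  typedef struct Buffer {
      int refcnt;
      char *data;
  } Buffer;

  static void buffer_free(Buffer *buf)
  {
      g_free(buf->data);
      g_free(buf);
  }

  static void use_buffer(Buffer *buf)
  {
      atomic_ref(buf, refcnt);    /* taking a ref needs no ordering */

      /* ... read or write buf->data ... */

      /* The release/acquire pair hidden inside atomic_unref ensures
       * that buffer_free() sees every write the other threads made
       * to the buffer before they dropped their own references. */
      atomic_unref(buf, refcnt, buffer_free);
  }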
Paolo

> ---
> v2:
>   update atomics.txt
>   fix the dependency MACRO
> ---
>  docs/atomics.txt      |  4 ++++
>  include/qemu/atomic.h | 14 ++++++++++++++
>  2 files changed, 18 insertions(+)
>
> diff --git a/docs/atomics.txt b/docs/atomics.txt
> index 6f2997b..c8ad31d 100644
> --- a/docs/atomics.txt
> +++ b/docs/atomics.txt
> @@ -45,6 +45,10 @@ consistency*, where "the result of any execution is the same as if the
>  operations of all the processors were executed in some sequential order,
>  and the operations of each individual processor appear in this sequence
>  in the order specified by its program".
> +Note that atomic_inc/_dec and atomic_fetch_inc/_dec tend to be exceptions.
> +Once gcc provides atomic_* builtins, they adopt *relaxed* memory order.
> +The reason is that these pairs are used by the refcnt's ops which is frequent
> +and has more effect on performance.
>
>  qemu/atomic.h provides the following set of atomic read-modify-write
>  operations:
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 0aa8913..e8353d2 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -183,8 +183,15 @@
>  #endif
>
>  /* Provide shorter names for GCC atomic builtins. */
> +#ifdef __ATOMIC_RELAXED
> +/* C11 memory_order_relaxed */
> +#define atomic_fetch_inc(ptr)  __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED)
> +#define atomic_fetch_dec(ptr)  __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED)
> +#else
> +/* close to C11 memory_order_seq_cst */
>  #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
>  #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
> +#endif
>  #define atomic_fetch_add  __sync_fetch_and_add
>  #define atomic_fetch_sub  __sync_fetch_and_sub
>  #define atomic_fetch_and  __sync_fetch_and_and
> @@ -192,8 +199,15 @@
>  #define atomic_cmpxchg  __sync_val_compare_and_swap
>
>  /* And even shorter names that return void. */
> +#ifdef __ATOMIC_RELAXED
> +/* C11 memory_order_relaxed */
> +#define atomic_inc(ptr)  ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED))
> +#define atomic_dec(ptr)  ((void) __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED))
> +#else
> +/* close to C11 memory_order_seq_cst */
>  #define atomic_inc(ptr)  ((void) __sync_fetch_and_add(ptr, 1))
>  #define atomic_dec(ptr)  ((void) __sync_fetch_and_add(ptr, -1))
> +#endif
>  #define atomic_add(ptr, n)  ((void) __sync_fetch_and_add(ptr, n))
>  #define atomic_sub(ptr, n)  ((void) __sync_fetch_and_sub(ptr, n))
>  #define atomic_and(ptr, n)  ((void) __sync_fetch_and_and(ptr, n))
>
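P.S.: to make the "further read memory barrier" point above concrete,
here is the same release/acquire pairing written out as a standalone
function (a sketch with invented names, not code from the patch or from
QEMU):

  #include <stdlib.h>

  typedef struct Obj {
      int refcnt;
      int val;
  } Obj;

  static void obj_put(Obj *obj)
  {
      /* Release: every write this thread made to obj happens before
       * the decrement that the other threads will observe. */
      if (__atomic_fetch_add(&obj->refcnt, -1, __ATOMIC_RELEASE) == 1) {
          /* Acquire: pairs with the releases above.  Without this
           * fence, the freeing thread could reclaim obj before the
           * other threads' writes to obj->val became visible, which
           * is exactly what a fully relaxed decrement would allow. */
          __atomic_thread_fence(__ATOMIC_ACQUIRE);
          free(obj);
      }
  }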