* [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops @ 2013-07-14 2:53 Liu Ping Fan 2013-07-14 5:57 ` Paolo Bonzini 0 siblings, 1 reply; 4+ messages in thread From: Liu Ping Fan @ 2013-07-14 2:53 UTC (permalink / raw) To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson Refcnt's atomic inc/dec ops are frequent and its idiom need no seq_cst order. So to get better performance, it worth to adopt _relaxed other than _seq_cst memory model on them. We resort to gcc builtins. If gcc supports C11 memory model, __atomic_* buitlins is used, otherwise __sync_* builtins. Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> --- v2: update atomics.txt fix the dependency MACRO --- docs/atomics.txt | 4 ++++ include/qemu/atomic.h | 14 ++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/docs/atomics.txt b/docs/atomics.txt index 6f2997b..c8ad31d 100644 --- a/docs/atomics.txt +++ b/docs/atomics.txt @@ -45,6 +45,10 @@ consistency*, where "the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program". +Note that atomic_inc/_dec and atomic_fetch_inc/_dec tend to be exceptions. +Once gcc provides atomic_* builtins, they adopt *relaxed* memory order. +The reason is that these pairs are used by the refcnt's ops which is frequent +and has more effect on performance. qemu/atomic.h provides the following set of atomic read-modify-write operations: diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h index 0aa8913..e8353d2 100644 --- a/include/qemu/atomic.h +++ b/include/qemu/atomic.h @@ -183,8 +183,15 @@ #endif /* Provide shorter names for GCC atomic builtins. */ +#ifdef __ATOMIC_RELAXED +/* C11 memory_order_relaxed */ +#define atomic_fetch_inc(ptr) __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED) +#define atomic_fetch_dec(ptr) __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED) +#else +/* close to C11 memory_order_seq_cst */ #define atomic_fetch_inc(ptr) __sync_fetch_and_add(ptr, 1) #define atomic_fetch_dec(ptr) __sync_fetch_and_add(ptr, -1) +#endif #define atomic_fetch_add __sync_fetch_and_add #define atomic_fetch_sub __sync_fetch_and_sub #define atomic_fetch_and __sync_fetch_and_and @@ -192,8 +199,15 @@ #define atomic_cmpxchg __sync_val_compare_and_swap /* And even shorter names that return void. */ +#ifdef __ATOMIC_RELAXED +/* C11 memory_order_relaxed */ +#define atomic_inc(ptr) ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED)) +#define atomic_dec(ptr) ((void) __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED)) +#else +/* close to C11 memory_order_seq_cst */ #define atomic_inc(ptr) ((void) __sync_fetch_and_add(ptr, 1)) #define atomic_dec(ptr) ((void) __sync_fetch_and_add(ptr, -1)) +#endif #define atomic_add(ptr, n) ((void) __sync_fetch_and_add(ptr, n)) #define atomic_sub(ptr, n) ((void) __sync_fetch_and_sub(ptr, n)) #define atomic_and(ptr, n) ((void) __sync_fetch_and_and(ptr, n)) -- 1.8.1.4 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops 2013-07-14 2:53 [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops Liu Ping Fan @ 2013-07-14 5:57 ` Paolo Bonzini 2013-07-14 10:23 ` liu ping fan 0 siblings, 1 reply; 4+ messages in thread From: Paolo Bonzini @ 2013-07-14 5:57 UTC (permalink / raw) To: Liu Ping Fan; +Cc: qemu-devel, Richard Henderson Il 14/07/2013 04:53, Liu Ping Fan ha scritto: > Refcnt's atomic inc/dec ops are frequent and its idiom need no seq_cst > order. So to get better performance, it worth to adopt _relaxed > other than _seq_cst memory model on them. > > We resort to gcc builtins. If gcc supports C11 memory model, __atomic_* > buitlins is used, otherwise __sync_* builtins. > > Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> No, not at all. :( First of all, I'd like to understand how you benchmarked this. For inc/dec, relaxed vs. seq_cst has no effect except on PPC and ARM. And if the refcount ops are frequent enough, I strongly suspect cacheline bouncing has a bigger effect than the memory barriers. Second, it is wrong because you need a further read memory barrier when you are removing the last reference Third, it is making the API completely unorthogonal, and "tend to be exceptions" is not a justification. The justification here could be, rather than the performance, having to remember how to use atomic_fetch_dec in the unref side. I don't really buy that, but if you really care, do something like #define atomic_ref(ptr, field) \ __atomic_fetch_add(&((ptr)->field), 1, __ATOMIC_RELAXED) #define atomic_unref(ptr, field, releasefn) ( \ __atomic_fetch_add(&((ptr)->field), -1, __ATOMIC_RELEASE) == 1 \ ? (__atomic_thread_fence(__ATOMIC_ACQUIRE), (releasefn)(ptr)) : false) i.e. define a new interface similar to kref_get/kref_put and, since you are at it, optimize it. Paolo > --- > v2: > update atomics.txt > fix the dependency MACRO > --- > docs/atomics.txt | 4 ++++ > include/qemu/atomic.h | 14 ++++++++++++++ > 2 files changed, 18 insertions(+) > > diff --git a/docs/atomics.txt b/docs/atomics.txt > index 6f2997b..c8ad31d 100644 > --- a/docs/atomics.txt > +++ b/docs/atomics.txt > @@ -45,6 +45,10 @@ consistency*, where "the result of any execution is the same as if the > operations of all the processors were executed in some sequential order, > and the operations of each individual processor appear in this sequence > in the order specified by its program". > +Note that atomic_inc/_dec and atomic_fetch_inc/_dec tend to be exceptions. > +Once gcc provides atomic_* builtins, they adopt *relaxed* memory order. > +The reason is that these pairs are used by the refcnt's ops which is frequent > +and has more effect on performance. > > qemu/atomic.h provides the following set of atomic read-modify-write > operations: > diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h > index 0aa8913..e8353d2 100644 > --- a/include/qemu/atomic.h > +++ b/include/qemu/atomic.h > @@ -183,8 +183,15 @@ > #endif > > /* Provide shorter names for GCC atomic builtins. */ > +#ifdef __ATOMIC_RELAXED > +/* C11 memory_order_relaxed */ > +#define atomic_fetch_inc(ptr) __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED) > +#define atomic_fetch_dec(ptr) __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED) > +#else > +/* close to C11 memory_order_seq_cst */ > #define atomic_fetch_inc(ptr) __sync_fetch_and_add(ptr, 1) > #define atomic_fetch_dec(ptr) __sync_fetch_and_add(ptr, -1) > +#endif > #define atomic_fetch_add __sync_fetch_and_add > #define atomic_fetch_sub __sync_fetch_and_sub > #define atomic_fetch_and __sync_fetch_and_and > @@ -192,8 +199,15 @@ > #define atomic_cmpxchg __sync_val_compare_and_swap > > /* And even shorter names that return void. */ > +#ifdef __ATOMIC_RELAXED > +/* C11 memory_order_relaxed */ > +#define atomic_inc(ptr) ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_RELAXED)) > +#define atomic_dec(ptr) ((void) __atomic_fetch_add(ptr, -1, __ATOMIC_RELAXED)) > +#else > +/* close to C11 memory_order_seq_cst */ > #define atomic_inc(ptr) ((void) __sync_fetch_and_add(ptr, 1)) > #define atomic_dec(ptr) ((void) __sync_fetch_and_add(ptr, -1)) > +#endif > #define atomic_add(ptr, n) ((void) __sync_fetch_and_add(ptr, n)) > #define atomic_sub(ptr, n) ((void) __sync_fetch_and_sub(ptr, n)) > #define atomic_and(ptr, n) ((void) __sync_fetch_and_and(ptr, n)) > ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops 2013-07-14 5:57 ` Paolo Bonzini @ 2013-07-14 10:23 ` liu ping fan 2013-07-15 10:39 ` Paolo Bonzini 0 siblings, 1 reply; 4+ messages in thread From: liu ping fan @ 2013-07-14 10:23 UTC (permalink / raw) To: Paolo Bonzini; +Cc: qemu-devel, Richard Henderson On Sun, Jul 14, 2013 at 1:57 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > Il 14/07/2013 04:53, Liu Ping Fan ha scritto: >> Refcnt's atomic inc/dec ops are frequent and its idiom need no seq_cst >> order. So to get better performance, it worth to adopt _relaxed >> other than _seq_cst memory model on them. >> >> We resort to gcc builtins. If gcc supports C11 memory model, __atomic_* >> buitlins is used, otherwise __sync_* builtins. >> >> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> > > No, not at all. :( > > First of all, I'd like to understand how you benchmarked this. For > inc/dec, relaxed vs. seq_cst has no effect except on PPC and ARM. And Not based on benchmark. Just based on the code releated with barrier implement, since PPC and ARM can just lock cacheline without flushing write buffer. > if the refcount ops are frequent enough, I strongly suspect cacheline > bouncing has a bigger effect than the memory barriers. > When out of biglock, object_ref/unref to pin the Device will be quite often, and can it be marked "frequent"? Or how can we say something is frequent? > Second, it is wrong because you need a further read memory barrier when > you are removing the last reference > Oh, yes, the last one. > Third, it is making the API completely unorthogonal, and "tend to be > exceptions" is not a justification. > > The justification here could be, rather than the performance, having to > remember how to use atomic_fetch_dec in the unref side. I don't really > buy that, but if you really care, do something like > > #define atomic_ref(ptr, field) \ > __atomic_fetch_add(&((ptr)->field), 1, __ATOMIC_RELAXED) > #define atomic_unref(ptr, field, releasefn) ( \ > __atomic_fetch_add(&((ptr)->field), -1, __ATOMIC_RELEASE) == 1 \ > ? (__atomic_thread_fence(__ATOMIC_ACQUIRE), (releasefn)(ptr)) : false) > > i.e. define a new interface similar to kref_get/kref_put and, since you > are at it, optimize it. > Thanks, a abstract layer for refct is what I need. Regards Pingfan ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops 2013-07-14 10:23 ` liu ping fan @ 2013-07-15 10:39 ` Paolo Bonzini 0 siblings, 0 replies; 4+ messages in thread From: Paolo Bonzini @ 2013-07-15 10:39 UTC (permalink / raw) To: liu ping fan; +Cc: qemu-devel, Richard Henderson Il 14/07/2013 12:23, liu ping fan ha scritto: >> if the refcount ops are frequent enough, I strongly suspect cacheline >> bouncing has a bigger effect than the memory barriers. >> > When out of biglock, object_ref/unref to pin the Device will be quite > often, and can it be marked "frequent"? Or how can we say something is > frequent? I didn't say it is not frequent. I said I suspect (it _is_ just a suspicion, not the result of a benchmark, but at least I said so...) that "cacheline bouncing has a bigger effect than the memory barriers" and thus the API would not have such a dramatic impact. >> Third, it is making the API completely unorthogonal, and "tend to be >> exceptions" is not a justification. >> >> The justification here could be, rather than the performance, having to >> remember how to use atomic_fetch_dec in the unref side. I don't really >> buy that, but if you really care, do something like >> >> #define atomic_ref(ptr, field) >> #define atomic_unref(ptr, field, releasefn) >> >> i.e. define a new interface similar to kref_get/kref_put and, since you >> are at it, optimize it. >> > Thanks, a abstract layer for refct is what I need. If someone cares enough to review your patch (which _must_ come with documentation), that's fine for me. I don't think it's worthwhile, but others may disagree. Paolo ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2013-07-15 10:39 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-07-14 2:53 [Qemu-devel] [PATCH v2] atomic: using memory_order_relaxed for refcnt inc/dec ops Liu Ping Fan 2013-07-14 5:57 ` Paolo Bonzini 2013-07-14 10:23 ` liu ping fan 2013-07-15 10:39 ` Paolo Bonzini
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).