* [PATCH 0/4] Update definition of cmpxchg() under CodeSamples
@ 2018-12-11 15:39 Akira Yokosawa
2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:39 UTC (permalink / raw)
To: Paul E. McKenney, Junchang Wang; +Cc: perfbook, Akira Yokosawa
Subject: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples
Hi Paul and Junchang,
Based on the earlier correspondence, I prepared a patch set to update
the definition of cmpxchg().
Patch #1 adds a simple litmus test of cmpxchg().
Patch #2 fix cmpxchg() in .../litmus/api.h by specifying strong
__atomic_compare_exchange_n)\(). Results of litmus tests are presented
in the commit log.
Patch #3 fix cmpxchg() in .../api-gcc.h in the same manner as patch #2.
Patch #4 is not intended to be applied to master, but provides a weak
variant of cmpxchg() and count_lim_atomic_weak.c.
I did some experiments on PPC (10 runs each).
count_lim_atomic 1 uperf, ns/update (as of v2018.12.08a):
56.6572
56.4972
56.5504
56.4972
56.6305
56.5238
56.6038
56.5238
56.5504
56.5238
count_lim_atomic 1 uperf, ns/update (as of Patch #4):
38.3264
38.1073
38.2287
38.2287
38.2653
38.3142
38.1194
38.1801
38.4986
38.2531
count_lim_atomic_weak 1 uperf, ns/update (as of Patch #4):
47.3186
47.3747
47.4121
47.5248
47.3934
47.3373
47.3186
47.4121
47.6002
47.3186
So count_lim_atomic (strong semantics) is around 9ns/update
faster than count_lim_atomic_weak, and is around 18ns/update
faster than count_lim_atomic of v2018.12.08a.
This contradicts Junchang's view. The difference looks like
due to removal of ternary operation in cmpxchg().
Or I might be missing something.
Junchang, can you please experiment on PPC and ARM?
Thanks, Akira
--
Akira Yokosawa (4):
CodeSamples: Add C-cmpxchg.litmus
CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg()
CodeSamples: Fix definition of cmpxchg() in api-gcc.h
[EXP] CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()
CodeSamples/api-pthreads/api-gcc.h | 19 ++-
CodeSamples/count/Makefile | 4 +
CodeSamples/count/count_lim_atomic_weak.c | 226 +++++++++++++++++++++++++++++
CodeSamples/formal/litmus/C-cmpxchg.litmus | 24 +++
CodeSamples/formal/litmus/api.h | 2 +-
5 files changed, 272 insertions(+), 3 deletions(-)
create mode 100644 CodeSamples/count/count_lim_atomic_weak.c
create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus
--
2.7.4
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa @ 2018-12-11 15:40 ` Akira Yokosawa 2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa ` (3 subsequent siblings) 4 siblings, 0 replies; 16+ messages in thread From: Akira Yokosawa @ 2018-12-11 15:40 UTC (permalink / raw) To: Paul E. McKenney, Junchang Wang; +Cc: perfbook From d724a9b7ffb4b6d648290864c80cd1542297fa40 Mon Sep 17 00:00:00 2001 From: Akira Yokosawa <akiyks@gmail.com> Date: Mon, 10 Dec 2018 20:39:28 +0900 Subject: [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Signed-off-by: Akira Yokosawa <akiyks@gmail.com> --- CodeSamples/formal/litmus/C-cmpxchg.litmus | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus diff --git a/CodeSamples/formal/litmus/C-cmpxchg.litmus b/CodeSamples/formal/litmus/C-cmpxchg.litmus new file mode 100644 index 0000000..9579d4a --- /dev/null +++ b/CodeSamples/formal/litmus/C-cmpxchg.litmus @@ -0,0 +1,24 @@ +C C-cmpxchg +{ +} + +{ +#include "api.h" +} + +P0(int *x) +{ + int r1; + + r1 = cmpxchg(x, 1, 2); +} + +P1(int *x) +{ + int r1; + + r1 = cmpxchg(x, 0, 1); +} + +locations [1:r1] +exists (0:r1=1 /\ ~x=2) -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa 2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa @ 2018-12-11 15:42 ` Akira Yokosawa 2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa ` (2 subsequent siblings) 4 siblings, 0 replies; 16+ messages in thread From: Akira Yokosawa @ 2018-12-11 15:42 UTC (permalink / raw) To: Paul E. McKenney, Junchang Wang; +Cc: perfbook From 2796715f236d5c152f6e559422e072b705e3d237 Mon Sep 17 00:00:00 2001 From: Akira Yokosawa <akiyks@gmail.com> Date: Tue, 11 Dec 2018 20:49:16 +0900 Subject: [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Output of C-cmpxchg.litmus by litmus7 on PPC (old definition): ---------------------------------------------------- Test C-cmpxchg Allowed Histogram (4 states) 2789 :>0:r1=0; 1:r1=0; x=0; 50452053:>0:r1=0; 1:r1=0; x=1; 36003 *>0:r1=1; 1:r1=0; x=1; 49509155:>0:r1=1; 1:r1=0; x=2; Ok Witnesses Positive: 36003, Negative: 99963997 Condition exists (0:r1=1 /\ not (x=2)) is validated Hash=3f625d680407ff822fb56f1a80834291 Observation C-cmpxchg Sometimes 36003 99963997 Time C-cmpxchg 43.37 ---------------------------------------------------- Output of C-cmpxchg.litmus by litmus7 on PPC (new definition): ---------------------------------------------------- Test C-cmpxchg Allowed Histogram (2 states) 50766497:>0:r1=0; 1:r1=0; x=1; 49233503:>0:r1=1; 1:r1=0; x=2; No Witnesses Positive: 0, Negative: 100000000 Condition exists (0:r1=1 /\ not (x=2)) is NOT validated Hash=3f625d680407ff822fb56f1a80834291 Observation C-cmpxchg Never 0 100000000 Time C-cmpxchg 35.20 ---------------------------------------------------- herd7 says: ---------------------------------------------------- Test C-cmpxchg Allowed States 2 0:r1=0; 1:r1=0; x=1; 0:r1=1; 1:r1=0; x=2; No Witnesses Positive: 0 Negative: 2 Condition exists (0:r1=1 /\ not (x=2)) Observation C-cmpxchg Never 0 2 Time C-cmpxchg 0.01 Hash=1720691890ea69e35615997626680d6f ---------------------------------------------------- klitmus7 on PPC says: ---------------------------------------------------- Test C-cmpxchg Allowed Histogram (2 states) 2019624 :>0:r1=0; 1:r1=0; x=1; 1980376 :>0:r1=1; 1:r1=0; x=2; No Witnesses Positive: 0, Negative: 4000000 Condition exists (0:r1=1 /\ not (x=2)) is NOT validated Hash=1720691890ea69e35615997626680d6f Observation C-cmpxchg Never 0 4000000 Time C-cmpxchg 0.76 ---------------------------------------------------- Signed-off-by: Akira Yokosawa <akiyks@gmail.com> --- CodeSamples/formal/litmus/api.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CodeSamples/formal/litmus/api.h b/CodeSamples/formal/litmus/api.h index 8bf0ba6..bb1e1c2 100644 --- a/CodeSamples/formal/litmus/api.h +++ b/CodeSamples/formal/litmus/api.h @@ -16,7 +16,7 @@ #define smp_load_acquire(x) __atomic_load_n(x, __ATOMIC_ACQUIRE) #define cmpxchg(x, o, n) ({ \ typeof(o) __old = (o); \ - __atomic_compare_exchange_n((x), &__old, (n), 1, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ + __atomic_compare_exchange_n((x), &__old, (n), 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ __old; \ }) #endif -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa 2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa 2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa @ 2018-12-11 15:42 ` Akira Yokosawa 2018-12-12 16:01 ` Junchang Wang 2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa 2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney 4 siblings, 1 reply; 16+ messages in thread From: Akira Yokosawa @ 2018-12-11 15:42 UTC (permalink / raw) To: Paul E. McKenney, Junchang Wang; +Cc: perfbook From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 From: Akira Yokosawa <akiyks@gmail.com> Date: Tue, 11 Dec 2018 21:37:11 +0900 Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Do the same change as CodeSamples/formal/litmus/api.h. Signed-off-by: Akira Yokosawa <akiyks@gmail.com> --- CodeSamples/api-pthreads/api-gcc.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h index 3afe340..b66f4b9 100644 --- a/CodeSamples/api-pthreads/api-gcc.h +++ b/CodeSamples/api-pthreads/api-gcc.h @@ -168,8 +168,9 @@ struct __xchg_dummy { ({ \ typeof(*ptr) _____actual = (o); \ \ - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ + _____actual; \ }) static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa @ 2018-12-12 16:01 ` Junchang Wang 2018-12-13 15:33 ` Akira Yokosawa 0 siblings, 1 reply; 16+ messages in thread From: Junchang Wang @ 2018-12-12 16:01 UTC (permalink / raw) To: Akira Yokosawa, Paul E. McKenney; +Cc: perfbook On 12/11/18 11:42 PM, Akira Yokosawa wrote: > From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 > From: Akira Yokosawa <akiyks@gmail.com> > Date: Tue, 11 Dec 2018 21:37:11 +0900 > Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h > > Do the same change as CodeSamples/formal/litmus/api.h. > > Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > --- > CodeSamples/api-pthreads/api-gcc.h | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > index 3afe340..b66f4b9 100644 > --- a/CodeSamples/api-pthreads/api-gcc.h > +++ b/CodeSamples/api-pthreads/api-gcc.h > @@ -168,8 +168,9 @@ struct __xchg_dummy { > ({ \ > typeof(*ptr) _____actual = (o); \ > \ > - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ > - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ > + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ > + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ > + _____actual; \ > }) > Hi Akira, Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak ./count_lim_atomic 64 uperf ns/update: 290 ./count_lim_atomic_weak 64 uperf ns/update: 301 # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak ./count_lim_atomic 64 uperf ns/update: 316 ./count_lim_atomic_weak 64 uperf ns/update: 302 ./count_lim_atomic 120 uperf ns/update: 630 ./count_lim_atomic_weak 120 uperf ns/update: 568 The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. --Junchang > static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-12 16:01 ` Junchang Wang @ 2018-12-13 15:33 ` Akira Yokosawa 2018-12-14 14:32 ` Junchang Wang 2018-12-15 15:10 ` Akira Yokosawa 0 siblings, 2 replies; 16+ messages in thread From: Akira Yokosawa @ 2018-12-13 15:33 UTC (permalink / raw) To: Junchang Wang; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: > On 12/11/18 11:42 PM, Akira Yokosawa wrote: >> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 >> From: Akira Yokosawa <akiyks@gmail.com> >> Date: Tue, 11 Dec 2018 21:37:11 +0900 >> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h >> >> Do the same change as CodeSamples/formal/litmus/api.h. >> >> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> >> --- >> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h >> index 3afe340..b66f4b9 100644 >> --- a/CodeSamples/api-pthreads/api-gcc.h >> +++ b/CodeSamples/api-pthreads/api-gcc.h >> @@ -168,8 +168,9 @@ struct __xchg_dummy { >> ({ \ >> typeof(*ptr) _____actual = (o); \ >> \ >> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ >> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ >> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ >> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ >> + _____actual; \ >> }) >> > > Hi Akira, > > Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: > > # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak > > ./count_lim_atomic 64 uperf > ns/update: 290 > > ./count_lim_atomic_weak 64 uperf > ns/update: 301 > > > # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak > > ./count_lim_atomic 64 uperf > ns/update: 316 > > ./count_lim_atomic_weak 64 uperf > ns/update: 302 > > ./count_lim_atomic 120 uperf > ns/update: 630 > > ./count_lim_atomic_weak 120 uperf > ns/update: 568 > > The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. > > In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. Hi Junchang, In Linux-kernel speak, Documentation/core-api/atomic.rst says: -------------------------------------------------------------------------- atomic_xchg must provide explicit memory barriers around the operation. :: int atomic_cmpxchg(atomic_t *v, int old, int new); This performs an atomic compare exchange operation on the atomic value v, with the given old and new values. Like all atomic_xxx operations, atomic_cmpxchg will only satisfy its atomicity semantics as long as all other accesses of \*v are performed through atomic_xxx operations. atomic_cmpxchg must provide explicit memory barriers around the operation, although if the comparison fails then no memory ordering guarantees are required. [snip] The routines xchg() and cmpxchg() must provide the same exact memory-barrier semantics as the atomic and bit operations returning values. -------------------------------------------------------------------------- The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this lack of requirement. On x86_64, __atomic_compare_exchange_n() is translated to the same code in both cases (with the help of litmus7's cross compiling): #START _litmus_P1 xorl %eax, %eax movl $0, 4(%rsp) lock cmpxchgl %r10d, (%rdx) je .L36 movl %eax, 4(%rsp) .L36: movl 4(%rsp), %eax There is no difference between the code with __ATOMIC_RELAXED and the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, there is no memory barrier instruction emitted. On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, the code looks like: #START _litmus_P1 sync .L34: lwarx 7,0,9 cmpwi 0,7,0 bne 0,.L35 stwcx. 5,0,9 bne 0,.L34 isync .L35: , OTOH with __ATOMIC_SEQ_CST as 2nd argument: #START _litmus_P1 sync .L34: lwarx 7,0,9 cmpwi 0,7,0 bne 0,.L35 stwcx. 5,0,9 bne 0,.L34 .L35: isync See the difference of position of label .L35. (Note that we are talking about strong version of cmpxchg().) Does the above example make sense to you? Thanks, Akira > > > --Junchang > > > >> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) >> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-13 15:33 ` Akira Yokosawa @ 2018-12-14 14:32 ` Junchang Wang 2018-12-15 14:58 ` Akira Yokosawa 2018-12-15 15:10 ` Akira Yokosawa 1 sibling, 1 reply; 16+ messages in thread From: Junchang Wang @ 2018-12-14 14:32 UTC (permalink / raw) To: Akira Yokosawa; +Cc: Paul E. McKenney, perfbook On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote: > > On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: > > On 12/11/18 11:42 PM, Akira Yokosawa wrote: > >> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 > >> From: Akira Yokosawa <akiyks@gmail.com> > >> Date: Tue, 11 Dec 2018 21:37:11 +0900 > >> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h > >> > >> Do the same change as CodeSamples/formal/litmus/api.h. > >> > >> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > >> --- > >> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- > >> 1 file changed, 3 insertions(+), 2 deletions(-) > >> > >> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > >> index 3afe340..b66f4b9 100644 > >> --- a/CodeSamples/api-pthreads/api-gcc.h > >> +++ b/CodeSamples/api-pthreads/api-gcc.h > >> @@ -168,8 +168,9 @@ struct __xchg_dummy { > >> ({ \ > >> typeof(*ptr) _____actual = (o); \ > >> \ > >> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ > >> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ > >> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ > >> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ > >> + _____actual; \ > >> }) > >> > > > > Hi Akira, > > > > Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: > > > > # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak > > > > ./count_lim_atomic 64 uperf > > ns/update: 290 > > > > ./count_lim_atomic_weak 64 uperf > > ns/update: 301 > > > > > > # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak > > > > ./count_lim_atomic 64 uperf > > ns/update: 316 > > > > ./count_lim_atomic_weak 64 uperf > > ns/update: 302 > > > > ./count_lim_atomic 120 uperf > > ns/update: 630 > > > > ./count_lim_atomic_weak 120 uperf > > ns/update: 568 > > > > The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. > > > > In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. > > Hi Junchang, > > In Linux-kernel speak, Documentation/core-api/atomic.rst says: > > -------------------------------------------------------------------------- > atomic_xchg must provide explicit memory barriers around the operation. :: > > int atomic_cmpxchg(atomic_t *v, int old, int new); > > This performs an atomic compare exchange operation on the atomic value v, > with the given old and new values. Like all atomic_xxx operations, > atomic_cmpxchg will only satisfy its atomicity semantics as long as all > other accesses of \*v are performed through atomic_xxx operations. > > atomic_cmpxchg must provide explicit memory barriers around the operation, > although if the comparison fails then no memory ordering guarantees are > required. Hi Akira, Thanks for the link which is helpful. > > [snip] > > The routines xchg() and cmpxchg() must provide the same exact > memory-barrier semantics as the atomic and bit operations returning > values. > -------------------------------------------------------------------------- > > The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this > lack of requirement. > > On x86_64, __atomic_compare_exchange_n() is translated to the same code > in both cases (with the help of litmus7's cross compiling): > > #START _litmus_P1 > xorl %eax, %eax > movl $0, 4(%rsp) > lock cmpxchgl %r10d, (%rdx) > je .L36 > movl %eax, 4(%rsp) > .L36: > movl 4(%rsp), %eax > > There is no difference between the code with __ATOMIC_RELAXED and > the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, > there is no memory barrier instruction emitted. My understanding is that x86 is using the TSO memory model, such that it is unnecessary to add extra barriers. Is that right? > > On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, > the code looks like: > > #START _litmus_P1 > sync > .L34: > lwarx 7,0,9 > cmpwi 0,7,0 > bne 0,.L35 > stwcx. 5,0,9 > bne 0,.L34 > isync > .L35: > > , OTOH with __ATOMIC_SEQ_CST as 2nd argument: > > #START _litmus_P1 > sync > .L34: > lwarx 7,0,9 > cmpwi 0,7,0 > bne 0,.L35 > stwcx. 5,0,9 > bne 0,.L34 > .L35: > isync > > See the difference of position of label .L35. > (Note that we are talking about strong version of cmpxchg().) > > Does the above example make sense to you? > Yes. It makes sense. For curiosity, I checked the assembly code of weak atomic_cmpxchg (the fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown below: #START _litmus_P3 sync lwarx 9,0,8 cmpwi 0,9,1 bne 0,.L34 stwcx. 4,0,8 .L34: The code shows that the weak atomic_cmpxchg fails because (1) the content of *ptr is not equal to argument old, or (2) the store operation fails because this thread loses reservation for the memory location referenced by ptr. In contrast, the strong atomic_cmpxchg (its assembly code is shown below) contains a while-loop and fails only if (1) *ptr is not equal to argument old. So the weak atomic_cmpxchg could fail "spuriously" and it is the caller's responsibility to retry it. #START _litmus_P1 sync .L34: lwarx 7,0,9 cmpwi 0,7,0 bne 0,.L35 stwcx. 5,0,9 bne 0,.L34 .L35: isync So for the performance comparison, my hypothesis is that the weak version may perform slightly better at high levels of contention because if there is a "spuriously" failure, atomic_cmpxchg returns immediately, which gives other threads chances to successfully perform their atomic_cmpxchg instructions. For other cases, the strong atomic_cmpxchg works pretty well because it avoids rebuilding arguments old and new. Does that make sense? Thanks, --Junchang > Thanks, Akira > > > > > > > --Junchang > > > > > > > >> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > >> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-14 14:32 ` Junchang Wang @ 2018-12-15 14:58 ` Akira Yokosawa 2018-12-16 0:55 ` Junchang Wang 0 siblings, 1 reply; 16+ messages in thread From: Akira Yokosawa @ 2018-12-15 14:58 UTC (permalink / raw) To: Junchang Wang; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa On 2018/12/14 22:32:22 +0800, Junchang Wang wrote: > On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote: >> >> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: >>> On 12/11/18 11:42 PM, Akira Yokosawa wrote: >>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 >>>> From: Akira Yokosawa <akiyks@gmail.com> >>>> Date: Tue, 11 Dec 2018 21:37:11 +0900 >>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h >>>> >>>> Do the same change as CodeSamples/formal/litmus/api.h. >>>> >>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> >>>> --- >>>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h >>>> index 3afe340..b66f4b9 100644 >>>> --- a/CodeSamples/api-pthreads/api-gcc.h >>>> +++ b/CodeSamples/api-pthreads/api-gcc.h >>>> @@ -168,8 +168,9 @@ struct __xchg_dummy { >>>> ({ \ >>>> typeof(*ptr) _____actual = (o); \ >>>> \ >>>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ >>>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ >>>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ >>>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ >>>> + _____actual; \ >>>> }) >>>> >>> >>> Hi Akira, >>> >>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: >>> >>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak >>> >>> ./count_lim_atomic 64 uperf >>> ns/update: 290 >>> >>> ./count_lim_atomic_weak 64 uperf >>> ns/update: 301 >>> >>> >>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak >>> >>> ./count_lim_atomic 64 uperf >>> ns/update: 316 >>> >>> ./count_lim_atomic_weak 64 uperf >>> ns/update: 302 >>> >>> ./count_lim_atomic 120 uperf >>> ns/update: 630 >>> >>> ./count_lim_atomic_weak 120 uperf >>> ns/update: 568 >>> >>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. >>> >>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. >> >> Hi Junchang, >> >> In Linux-kernel speak, Documentation/core-api/atomic.rst says: >> >> -------------------------------------------------------------------------- >> atomic_xchg must provide explicit memory barriers around the operation. :: >> >> int atomic_cmpxchg(atomic_t *v, int old, int new); >> >> This performs an atomic compare exchange operation on the atomic value v, >> with the given old and new values. Like all atomic_xxx operations, >> atomic_cmpxchg will only satisfy its atomicity semantics as long as all >> other accesses of \*v are performed through atomic_xxx operations. >> >> atomic_cmpxchg must provide explicit memory barriers around the operation, >> although if the comparison fails then no memory ordering guarantees are >> required. > > Hi Akira, > > Thanks for the link which is helpful. > >> >> [snip] >> >> The routines xchg() and cmpxchg() must provide the same exact >> memory-barrier semantics as the atomic and bit operations returning >> values. >> -------------------------------------------------------------------------- >> >> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this >> lack of requirement. >> >> On x86_64, __atomic_compare_exchange_n() is translated to the same code >> in both cases (with the help of litmus7's cross compiling): >> >> #START _litmus_P1 >> xorl %eax, %eax >> movl $0, 4(%rsp) >> lock cmpxchgl %r10d, (%rdx) >> je .L36 >> movl %eax, 4(%rsp) >> .L36: >> movl 4(%rsp), %eax >> >> There is no difference between the code with __ATOMIC_RELAXED and >> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, >> there is no memory barrier instruction emitted. > > My understanding is that x86 is using the TSO memory model, such that > it is unnecessary to add extra barriers. Is that right? I think so. > >> >> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, >> the code looks like: >> >> #START _litmus_P1 >> sync >> .L34: >> lwarx 7,0,9 >> cmpwi 0,7,0 >> bne 0,.L35 >> stwcx. 5,0,9 >> bne 0,.L34 >> isync >> .L35: >> >> , OTOH with __ATOMIC_SEQ_CST as 2nd argument: >> >> #START _litmus_P1 >> sync >> .L34: >> lwarx 7,0,9 >> cmpwi 0,7,0 >> bne 0,.L35 >> stwcx. 5,0,9 >> bne 0,.L34 >> .L35: >> isync >> >> See the difference of position of label .L35. >> (Note that we are talking about strong version of cmpxchg().) >> >> Does the above example make sense to you? >> > > Yes. It makes sense. > > For curiosity, I checked the assembly code of weak atomic_cmpxchg (the > fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown > below: > > #START _litmus_P3 > sync > lwarx 9,0,8 > cmpwi 0,9,1 > bne 0,.L34 > stwcx. 4,0,8 > .L34: > > The code shows that the weak atomic_cmpxchg fails because (1) the > content of *ptr is not equal to argument old, or (2) the store > operation fails because this thread loses reservation for the memory > location referenced by ptr. In contrast, the strong atomic_cmpxchg > (its assembly code is shown below) contains a while-loop and fails > only if (1) *ptr is not equal to argument old. So the weak > atomic_cmpxchg could fail "spuriously" and it is the caller's > responsibility to retry it. > > #START _litmus_P1 > sync > .L34: > lwarx 7,0,9 > cmpwi 0,7,0 > bne 0,.L35 > stwcx. 5,0,9 > bne 0,.L34 > .L35: > isync > > So for the performance comparison, my hypothesis is that the weak > version may perform slightly better at high levels of contention > because if there is a "spuriously" failure, atomic_cmpxchg returns > immediately, which gives other threads chances to successfully perform > their atomic_cmpxchg instructions. For other cases, the strong> atomic_cmpxchg works pretty well because it avoids rebuilding > arguments old and new. Does that make sense? Well, as atomic_cmpxchg() is supposed to be inlined, it is not easy to tell how the code of count_lim_atomic.c will be optimized in the end. The performance can vary depending on compiler version. The point of this sample code is that it scales much better than count_atomic.c. In the end, we need to minimize the contention of atomic accesses, don't we? Thanks, Akira > > Thanks, > --Junchang > >> Thanks, Akira >> >>> >>> >>> --Junchang >>> >>> >>> >>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) >>>> >> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-15 14:58 ` Akira Yokosawa @ 2018-12-16 0:55 ` Junchang Wang 0 siblings, 0 replies; 16+ messages in thread From: Junchang Wang @ 2018-12-16 0:55 UTC (permalink / raw) To: Akira Yokosawa; +Cc: Paul E. McKenney, perfbook On Sat, Dec 15, 2018 at 10:58 PM Akira Yokosawa <akiyks@gmail.com> wrote: > > On 2018/12/14 22:32:22 +0800, Junchang Wang wrote: > > On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote: > >> > >> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: > >>> On 12/11/18 11:42 PM, Akira Yokosawa wrote: > >>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 > >>>> From: Akira Yokosawa <akiyks@gmail.com> > >>>> Date: Tue, 11 Dec 2018 21:37:11 +0900 > >>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h > >>>> > >>>> Do the same change as CodeSamples/formal/litmus/api.h. > >>>> > >>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > >>>> --- > >>>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- > >>>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > >>>> index 3afe340..b66f4b9 100644 > >>>> --- a/CodeSamples/api-pthreads/api-gcc.h > >>>> +++ b/CodeSamples/api-pthreads/api-gcc.h > >>>> @@ -168,8 +168,9 @@ struct __xchg_dummy { > >>>> ({ \ > >>>> typeof(*ptr) _____actual = (o); \ > >>>> \ > >>>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ > >>>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ > >>>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ > >>>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ > >>>> + _____actual; \ > >>>> }) > >>>> > >>> > >>> Hi Akira, > >>> > >>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: > >>> > >>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak > >>> > >>> ./count_lim_atomic 64 uperf > >>> ns/update: 290 > >>> > >>> ./count_lim_atomic_weak 64 uperf > >>> ns/update: 301 > >>> > >>> > >>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak > >>> > >>> ./count_lim_atomic 64 uperf > >>> ns/update: 316 > >>> > >>> ./count_lim_atomic_weak 64 uperf > >>> ns/update: 302 > >>> > >>> ./count_lim_atomic 120 uperf > >>> ns/update: 630 > >>> > >>> ./count_lim_atomic_weak 120 uperf > >>> ns/update: 568 > >>> > >>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. > >>> > >>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. > >> > >> Hi Junchang, > >> > >> In Linux-kernel speak, Documentation/core-api/atomic.rst says: > >> > >> -------------------------------------------------------------------------- > >> atomic_xchg must provide explicit memory barriers around the operation. :: > >> > >> int atomic_cmpxchg(atomic_t *v, int old, int new); > >> > >> This performs an atomic compare exchange operation on the atomic value v, > >> with the given old and new values. Like all atomic_xxx operations, > >> atomic_cmpxchg will only satisfy its atomicity semantics as long as all > >> other accesses of \*v are performed through atomic_xxx operations. > >> > >> atomic_cmpxchg must provide explicit memory barriers around the operation, > >> although if the comparison fails then no memory ordering guarantees are > >> required. > > > > Hi Akira, > > > > Thanks for the link which is helpful. > > > >> > >> [snip] > >> > >> The routines xchg() and cmpxchg() must provide the same exact > >> memory-barrier semantics as the atomic and bit operations returning > >> values. > >> -------------------------------------------------------------------------- > >> > >> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this > >> lack of requirement. > >> > >> On x86_64, __atomic_compare_exchange_n() is translated to the same code > >> in both cases (with the help of litmus7's cross compiling): > >> > >> #START _litmus_P1 > >> xorl %eax, %eax > >> movl $0, 4(%rsp) > >> lock cmpxchgl %r10d, (%rdx) > >> je .L36 > >> movl %eax, 4(%rsp) > >> .L36: > >> movl 4(%rsp), %eax > >> > >> There is no difference between the code with __ATOMIC_RELAXED and > >> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, > >> there is no memory barrier instruction emitted. > > > > My understanding is that x86 is using the TSO memory model, such that > > it is unnecessary to add extra barriers. Is that right? > > I think so. > > > > >> > >> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, > >> the code looks like: > >> > >> #START _litmus_P1 > >> sync > >> .L34: > >> lwarx 7,0,9 > >> cmpwi 0,7,0 > >> bne 0,.L35 > >> stwcx. 5,0,9 > >> bne 0,.L34 > >> isync > >> .L35: > >> > >> , OTOH with __ATOMIC_SEQ_CST as 2nd argument: > >> > >> #START _litmus_P1 > >> sync > >> .L34: > >> lwarx 7,0,9 > >> cmpwi 0,7,0 > >> bne 0,.L35 > >> stwcx. 5,0,9 > >> bne 0,.L34 > >> .L35: > >> isync > >> > >> See the difference of position of label .L35. > >> (Note that we are talking about strong version of cmpxchg().) > >> > >> Does the above example make sense to you? > >> > > > > Yes. It makes sense. > > > > For curiosity, I checked the assembly code of weak atomic_cmpxchg (the > > fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown > > below: > > > > #START _litmus_P3 > > sync > > lwarx 9,0,8 > > cmpwi 0,9,1 > > bne 0,.L34 > > stwcx. 4,0,8 > > .L34: > > > > The code shows that the weak atomic_cmpxchg fails because (1) the > > content of *ptr is not equal to argument old, or (2) the store > > operation fails because this thread loses reservation for the memory > > location referenced by ptr. In contrast, the strong atomic_cmpxchg > > (its assembly code is shown below) contains a while-loop and fails > > only if (1) *ptr is not equal to argument old. So the weak > > atomic_cmpxchg could fail "spuriously" and it is the caller's > > responsibility to retry it. > > > > #START _litmus_P1 > > sync > > .L34: > > lwarx 7,0,9 > > cmpwi 0,7,0 > > bne 0,.L35 > > stwcx. 5,0,9 > > bne 0,.L34 > > .L35: > > isync > > > > So for the performance comparison, my hypothesis is that the weak > > version may perform slightly better at high levels of contention > > because if there is a "spuriously" failure, atomic_cmpxchg returns > > immediately, which gives other threads chances to successfully perform > > their atomic_cmpxchg instructions. For other cases, the strong> atomic_cmpxchg works pretty well because it avoids rebuilding > > arguments old and new. Does that make sense? > > Well, as atomic_cmpxchg() is supposed to be inlined, > it is not easy to tell how the code of count_lim_atomic.c will be > optimized in the end. > The performance can vary depending on compiler version. Got it. Thanks a lot :-). --Junchang > > The point of this sample code is that it scales much better than > count_atomic.c. In the end, we need to minimize the contention > of atomic accesses, don't we? > > Thanks, Akira > > > > > Thanks, > > --Junchang > > > >> Thanks, Akira > >> > >>> > >>> > >>> --Junchang > >>> > >>> > >>> > >>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > >>>> > >> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-13 15:33 ` Akira Yokosawa 2018-12-14 14:32 ` Junchang Wang @ 2018-12-15 15:10 ` Akira Yokosawa 2018-12-15 19:37 ` Paul E. McKenney 1 sibling, 1 reply; 16+ messages in thread From: Akira Yokosawa @ 2018-12-15 15:10 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa Hi Paul, There is something I want to make sure. Please see inline comment below. On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote: > On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: >> On 12/11/18 11:42 PM, Akira Yokosawa wrote: >>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 >>> From: Akira Yokosawa <akiyks@gmail.com> >>> Date: Tue, 11 Dec 2018 21:37:11 +0900 >>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h >>> >>> Do the same change as CodeSamples/formal/litmus/api.h. >>> >>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> >>> --- >>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h >>> index 3afe340..b66f4b9 100644 >>> --- a/CodeSamples/api-pthreads/api-gcc.h >>> +++ b/CodeSamples/api-pthreads/api-gcc.h >>> @@ -168,8 +168,9 @@ struct __xchg_dummy { >>> ({ \ >>> typeof(*ptr) _____actual = (o); \ >>> \ >>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ >>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ >>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ >>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ >>> + _____actual; \ >>> }) >>> >> >> Hi Akira, >> >> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: >> >> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak >> >> ./count_lim_atomic 64 uperf >> ns/update: 290 >> >> ./count_lim_atomic_weak 64 uperf >> ns/update: 301 >> >> >> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak >> >> ./count_lim_atomic 64 uperf >> ns/update: 316 >> >> ./count_lim_atomic_weak 64 uperf >> ns/update: 302 >> >> ./count_lim_atomic 120 uperf >> ns/update: 630 >> >> ./count_lim_atomic_weak 120 uperf >> ns/update: 568 >> >> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. >> >> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. > > Hi Junchang, > > In Linux-kernel speak, Documentation/core-api/atomic.rst says: > > -------------------------------------------------------------------------- > atomic_xchg must provide explicit memory barriers around the operation. :: > > int atomic_cmpxchg(atomic_t *v, int old, int new); > > This performs an atomic compare exchange operation on the atomic value v, > with the given old and new values. Like all atomic_xxx operations, > atomic_cmpxchg will only satisfy its atomicity semantics as long as all > other accesses of \*v are performed through atomic_xxx operations. > > atomic_cmpxchg must provide explicit memory barriers around the operation, > although if the comparison fails then no memory ordering guarantees are > required. > > [snip] > > The routines xchg() and cmpxchg() must provide the same exact > memory-barrier semantics as the atomic and bit operations returning > values. > -------------------------------------------------------------------------- > > The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this > lack of requirement. > > On x86_64, __atomic_compare_exchange_n() is translated to the same code > in both cases (with the help of litmus7's cross compiling): > > #START _litmus_P1 > xorl %eax, %eax > movl $0, 4(%rsp) > lock cmpxchgl %r10d, (%rdx) > je .L36 > movl %eax, 4(%rsp) > .L36: > movl 4(%rsp), %eax > > There is no difference between the code with __ATOMIC_RELAXED and > the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, > there is no memory barrier instruction emitted. > > On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, > the code looks like: > > #START _litmus_P1 > sync > .L34: > lwarx 7,0,9 > cmpwi 0,7,0 > bne 0,.L35 > stwcx. 5,0,9 > bne 0,.L34 > isync > .L35: > > , OTOH with __ATOMIC_SEQ_CST as 2nd argument: > > #START _litmus_P1 > sync > .L34: > lwarx 7,0,9 > cmpwi 0,7,0 > bne 0,.L35 > stwcx. 5,0,9 > bne 0,.L34 > .L35: > isync > In this asm code snippets, the barrier instruction at the end of cmpxchg() is "isync". In arch/powerpc/include/asm/synch.h of Linux kernel source tree, both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are defined as '"\n" stringify_in_c(sync) "\n"', which will result in "sync". IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST, is it? Thanks, Akira > See the difference of position of label .L35. > (Note that we are talking about strong version of cmpxchg().) > > Does the above example make sense to you? > > Thanks, Akira > >> >> >> --Junchang >> >> >> >>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) >>> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-15 15:10 ` Akira Yokosawa @ 2018-12-15 19:37 ` Paul E. McKenney 2018-12-15 23:42 ` Akira Yokosawa 0 siblings, 1 reply; 16+ messages in thread From: Paul E. McKenney @ 2018-12-15 19:37 UTC (permalink / raw) To: Akira Yokosawa; +Cc: Junchang Wang, perfbook On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote: > Hi Paul, > > There is something I want to make sure. > Please see inline comment below. > > On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote: > > On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: > >> On 12/11/18 11:42 PM, Akira Yokosawa wrote: > >>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 > >>> From: Akira Yokosawa <akiyks@gmail.com> > >>> Date: Tue, 11 Dec 2018 21:37:11 +0900 > >>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h > >>> > >>> Do the same change as CodeSamples/formal/litmus/api.h. > >>> > >>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > >>> --- > >>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- > >>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > >>> index 3afe340..b66f4b9 100644 > >>> --- a/CodeSamples/api-pthreads/api-gcc.h > >>> +++ b/CodeSamples/api-pthreads/api-gcc.h > >>> @@ -168,8 +168,9 @@ struct __xchg_dummy { > >>> ({ \ > >>> typeof(*ptr) _____actual = (o); \ > >>> \ > >>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ > >>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ > >>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ > >>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ > >>> + _____actual; \ > >>> }) > >>> > >> > >> Hi Akira, > >> > >> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: > >> > >> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak > >> > >> ./count_lim_atomic 64 uperf > >> ns/update: 290 > >> > >> ./count_lim_atomic_weak 64 uperf > >> ns/update: 301 > >> > >> > >> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak > >> > >> ./count_lim_atomic 64 uperf > >> ns/update: 316 > >> > >> ./count_lim_atomic_weak 64 uperf > >> ns/update: 302 > >> > >> ./count_lim_atomic 120 uperf > >> ns/update: 630 > >> > >> ./count_lim_atomic_weak 120 uperf > >> ns/update: 568 > >> > >> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. > >> > >> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. > > > > Hi Junchang, > > > > In Linux-kernel speak, Documentation/core-api/atomic.rst says: > > > > -------------------------------------------------------------------------- > > atomic_xchg must provide explicit memory barriers around the operation. :: > > > > int atomic_cmpxchg(atomic_t *v, int old, int new); > > > > This performs an atomic compare exchange operation on the atomic value v, > > with the given old and new values. Like all atomic_xxx operations, > > atomic_cmpxchg will only satisfy its atomicity semantics as long as all > > other accesses of \*v are performed through atomic_xxx operations. > > > > atomic_cmpxchg must provide explicit memory barriers around the operation, > > although if the comparison fails then no memory ordering guarantees are > > required. > > > > [snip] > > > > The routines xchg() and cmpxchg() must provide the same exact > > memory-barrier semantics as the atomic and bit operations returning > > values. > > -------------------------------------------------------------------------- > > > > The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this > > lack of requirement. > > > > On x86_64, __atomic_compare_exchange_n() is translated to the same code > > in both cases (with the help of litmus7's cross compiling): > > > > #START _litmus_P1 > > xorl %eax, %eax > > movl $0, 4(%rsp) > > lock cmpxchgl %r10d, (%rdx) > > je .L36 > > movl %eax, 4(%rsp) > > .L36: > > movl 4(%rsp), %eax > > > > There is no difference between the code with __ATOMIC_RELAXED and > > the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, > > there is no memory barrier instruction emitted. > > > > On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, > > the code looks like: > > > > #START _litmus_P1 > > sync > > .L34: > > lwarx 7,0,9 > > cmpwi 0,7,0 > > bne 0,.L35 > > stwcx. 5,0,9 > > bne 0,.L34 > > isync > > .L35: > > > > , OTOH with __ATOMIC_SEQ_CST as 2nd argument: > > > > #START _litmus_P1 > > sync > > .L34: > > lwarx 7,0,9 > > cmpwi 0,7,0 > > bne 0,.L35 > > stwcx. 5,0,9 > > bne 0,.L34 > > .L35: > > isync > > > > In this asm code snippets, the barrier instruction at the end of > cmpxchg() is "isync". > > In arch/powerpc/include/asm/synch.h of Linux kernel source tree, > both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are > defined as '"\n" stringify_in_c(sync) "\n"', which will result > in "sync". > > IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST, > is it? By itself, no, but in combination with the "sync" instruction at the beginning, combined with the ordering of other __ATOMIC_SEQ_CST operations, it actually is sufficient. For more information, please see: http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html And this is an update that is mostly irrelevant on modern hardware: http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html Note also that an lwsync instruction can be substituted for each isync, which can sometimes provide better performance. Thanx, Paul > Thanks, Akira > > > See the difference of position of label .L35. > > (Note that we are talking about strong version of cmpxchg().) > > > > Does the above example make sense to you? > > > > Thanks, Akira > > > >> > >> > >> --Junchang > >> > >> > >> > >>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > >>> > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-15 19:37 ` Paul E. McKenney @ 2018-12-15 23:42 ` Akira Yokosawa 2018-12-16 0:54 ` Paul E. McKenney 0 siblings, 1 reply; 16+ messages in thread From: Akira Yokosawa @ 2018-12-15 23:42 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa On 2018/12/15 11:37:44 -0800, Paul E. McKenney wrote: > On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote: >> Hi Paul, >> >> There is something I want to make sure. >> Please see inline comment below. >> >> On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote: >>> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: >>>> On 12/11/18 11:42 PM, Akira Yokosawa wrote: >>>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 >>>>> From: Akira Yokosawa <akiyks@gmail.com> >>>>> Date: Tue, 11 Dec 2018 21:37:11 +0900 >>>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h >>>>> >>>>> Do the same change as CodeSamples/formal/litmus/api.h. >>>>> >>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> >>>>> --- >>>>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h >>>>> index 3afe340..b66f4b9 100644 >>>>> --- a/CodeSamples/api-pthreads/api-gcc.h >>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h >>>>> @@ -168,8 +168,9 @@ struct __xchg_dummy { >>>>> ({ \ >>>>> typeof(*ptr) _____actual = (o); \ >>>>> \ >>>>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ >>>>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ >>>>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ >>>>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ >>>>> + _____actual; \ >>>>> }) >>>>> >>>> >>>> Hi Akira, >>>> >>>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: >>>> >>>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak >>>> >>>> ./count_lim_atomic 64 uperf >>>> ns/update: 290 >>>> >>>> ./count_lim_atomic_weak 64 uperf >>>> ns/update: 301 >>>> >>>> >>>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak >>>> >>>> ./count_lim_atomic 64 uperf >>>> ns/update: 316 >>>> >>>> ./count_lim_atomic_weak 64 uperf >>>> ns/update: 302 >>>> >>>> ./count_lim_atomic 120 uperf >>>> ns/update: 630 >>>> >>>> ./count_lim_atomic_weak 120 uperf >>>> ns/update: 568 >>>> >>>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. >>>> >>>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. >>> >>> Hi Junchang, >>> >>> In Linux-kernel speak, Documentation/core-api/atomic.rst says: >>> >>> -------------------------------------------------------------------------- >>> atomic_xchg must provide explicit memory barriers around the operation. :: >>> >>> int atomic_cmpxchg(atomic_t *v, int old, int new); >>> >>> This performs an atomic compare exchange operation on the atomic value v, >>> with the given old and new values. Like all atomic_xxx operations, >>> atomic_cmpxchg will only satisfy its atomicity semantics as long as all >>> other accesses of \*v are performed through atomic_xxx operations. >>> >>> atomic_cmpxchg must provide explicit memory barriers around the operation, >>> although if the comparison fails then no memory ordering guarantees are >>> required. >>> >>> [snip] >>> >>> The routines xchg() and cmpxchg() must provide the same exact >>> memory-barrier semantics as the atomic and bit operations returning >>> values. >>> -------------------------------------------------------------------------- >>> >>> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this >>> lack of requirement. >>> >>> On x86_64, __atomic_compare_exchange_n() is translated to the same code >>> in both cases (with the help of litmus7's cross compiling): >>> >>> #START _litmus_P1 >>> xorl %eax, %eax >>> movl $0, 4(%rsp) >>> lock cmpxchgl %r10d, (%rdx) >>> je .L36 >>> movl %eax, 4(%rsp) >>> .L36: >>> movl 4(%rsp), %eax >>> >>> There is no difference between the code with __ATOMIC_RELAXED and >>> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, >>> there is no memory barrier instruction emitted. >>> >>> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, >>> the code looks like: >>> >>> #START _litmus_P1 >>> sync >>> .L34: >>> lwarx 7,0,9 >>> cmpwi 0,7,0 >>> bne 0,.L35 >>> stwcx. 5,0,9 >>> bne 0,.L34 >>> isync >>> .L35: >>> >>> , OTOH with __ATOMIC_SEQ_CST as 2nd argument: >>> >>> #START _litmus_P1 >>> sync >>> .L34: >>> lwarx 7,0,9 >>> cmpwi 0,7,0 >>> bne 0,.L35 >>> stwcx. 5,0,9 >>> bne 0,.L34 >>> .L35: >>> isync >>> >> >> In this asm code snippets, the barrier instruction at the end of >> cmpxchg() is "isync". >> >> In arch/powerpc/include/asm/synch.h of Linux kernel source tree, >> both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are >> defined as '"\n" stringify_in_c(sync) "\n"', which will result >> in "sync". >> >> IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST, >> is it? > > By itself, no, but in combination with the "sync" instruction at > the beginning, combined with the ordering of other __ATOMIC_SEQ_CST > operations, it actually is sufficient. For more information, please > see: > > http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html > > And this is an update that is mostly irrelevant on modern hardware: > > http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html > > Note also that an lwsync instruction can be substituted for each isync, > which can sometimes provide better performance. Now I'm wondering why PPC_ATOMIC_EXIT_BARRIER is defined as '"\n" stringify_in_c(sync) "\n"' rather than '"\n" stringify_in_c(isync) "\n"' (or '"\n" stringify_in_c(LWSYNC) "\n"') in arch/powerpc/include/asm/synch.h. ??? Thanks, Akira > > Thanx, Paul > >> Thanks, Akira >> >>> See the difference of position of label .L35. >>> (Note that we are talking about strong version of cmpxchg().) >>> >>> Does the above example make sense to you? >>> >>> Thanks, Akira >>> >>>> >>>> >>>> --Junchang >>>> >>>> >>>> >>>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) >>>>> >>> >> > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h 2018-12-15 23:42 ` Akira Yokosawa @ 2018-12-16 0:54 ` Paul E. McKenney 0 siblings, 0 replies; 16+ messages in thread From: Paul E. McKenney @ 2018-12-16 0:54 UTC (permalink / raw) To: Akira Yokosawa; +Cc: Junchang Wang, perfbook On Sun, Dec 16, 2018 at 08:42:56AM +0900, Akira Yokosawa wrote: > On 2018/12/15 11:37:44 -0800, Paul E. McKenney wrote: > > On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote: > >> Hi Paul, > >> > >> There is something I want to make sure. > >> Please see inline comment below. > >> > >> On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote: > >>> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote: > >>>> On 12/11/18 11:42 PM, Akira Yokosawa wrote: > >>>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001 > >>>>> From: Akira Yokosawa <akiyks@gmail.com> > >>>>> Date: Tue, 11 Dec 2018 21:37:11 +0900 > >>>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h > >>>>> > >>>>> Do the same change as CodeSamples/formal/litmus/api.h. > >>>>> > >>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > >>>>> --- > >>>>> CodeSamples/api-pthreads/api-gcc.h | 5 +++-- > >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > >>>>> index 3afe340..b66f4b9 100644 > >>>>> --- a/CodeSamples/api-pthreads/api-gcc.h > >>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h > >>>>> @@ -168,8 +168,9 @@ struct __xchg_dummy { > >>>>> ({ \ > >>>>> typeof(*ptr) _____actual = (o); \ > >>>>> \ > >>>>> - __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \ > >>>>> - __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \ > >>>>> + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \ > >>>>> + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \ > >>>>> + _____actual; \ > >>>>> }) > >>>>> > >>>> > >>>> Hi Akira, > >>>> > >>>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results: > >>>> > >>>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak > >>>> > >>>> ./count_lim_atomic 64 uperf > >>>> ns/update: 290 > >>>> > >>>> ./count_lim_atomic_weak 64 uperf > >>>> ns/update: 301 > >>>> > >>>> > >>>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak > >>>> > >>>> ./count_lim_atomic 64 uperf > >>>> ns/update: 316 > >>>> > >>>> ./count_lim_atomic_weak 64 uperf > >>>> ns/update: 302 > >>>> > >>>> ./count_lim_atomic 120 uperf > >>>> ns/update: 630 > >>>> > >>>> ./count_lim_atomic_weak 120 uperf > >>>> ns/update: 568 > >>>> > >>>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior. > >>>> > >>>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks. > >>> > >>> Hi Junchang, > >>> > >>> In Linux-kernel speak, Documentation/core-api/atomic.rst says: > >>> > >>> -------------------------------------------------------------------------- > >>> atomic_xchg must provide explicit memory barriers around the operation. :: > >>> > >>> int atomic_cmpxchg(atomic_t *v, int old, int new); > >>> > >>> This performs an atomic compare exchange operation on the atomic value v, > >>> with the given old and new values. Like all atomic_xxx operations, > >>> atomic_cmpxchg will only satisfy its atomicity semantics as long as all > >>> other accesses of \*v are performed through atomic_xxx operations. > >>> > >>> atomic_cmpxchg must provide explicit memory barriers around the operation, > >>> although if the comparison fails then no memory ordering guarantees are > >>> required. > >>> > >>> [snip] > >>> > >>> The routines xchg() and cmpxchg() must provide the same exact > >>> memory-barrier semantics as the atomic and bit operations returning > >>> values. > >>> -------------------------------------------------------------------------- > >>> > >>> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this > >>> lack of requirement. > >>> > >>> On x86_64, __atomic_compare_exchange_n() is translated to the same code > >>> in both cases (with the help of litmus7's cross compiling): > >>> > >>> #START _litmus_P1 > >>> xorl %eax, %eax > >>> movl $0, 4(%rsp) > >>> lock cmpxchgl %r10d, (%rdx) > >>> je .L36 > >>> movl %eax, 4(%rsp) > >>> .L36: > >>> movl 4(%rsp), %eax > >>> > >>> There is no difference between the code with __ATOMIC_RELAXED and > >>> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see, > >>> there is no memory barrier instruction emitted. > >>> > >>> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter, > >>> the code looks like: > >>> > >>> #START _litmus_P1 > >>> sync > >>> .L34: > >>> lwarx 7,0,9 > >>> cmpwi 0,7,0 > >>> bne 0,.L35 > >>> stwcx. 5,0,9 > >>> bne 0,.L34 > >>> isync > >>> .L35: > >>> > >>> , OTOH with __ATOMIC_SEQ_CST as 2nd argument: > >>> > >>> #START _litmus_P1 > >>> sync > >>> .L34: > >>> lwarx 7,0,9 > >>> cmpwi 0,7,0 > >>> bne 0,.L35 > >>> stwcx. 5,0,9 > >>> bne 0,.L34 > >>> .L35: > >>> isync > >>> > >> > >> In this asm code snippets, the barrier instruction at the end of > >> cmpxchg() is "isync". > >> > >> In arch/powerpc/include/asm/synch.h of Linux kernel source tree, > >> both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are > >> defined as '"\n" stringify_in_c(sync) "\n"', which will result > >> in "sync". > >> > >> IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST, > >> is it? > > > > By itself, no, but in combination with the "sync" instruction at > > the beginning, combined with the ordering of other __ATOMIC_SEQ_CST > > operations, it actually is sufficient. For more information, please > > see: > > > > http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html > > > > And this is an update that is mostly irrelevant on modern hardware: > > > > http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html > > > > Note also that an lwsync instruction can be substituted for each isync, > > which can sometimes provide better performance. > > Now I'm wondering why PPC_ATOMIC_EXIT_BARRIER is defined as > '"\n" stringify_in_c(sync) "\n"' rather than > '"\n" stringify_in_c(isync) "\n"' (or '"\n" stringify_in_c(LWSYNC) "\n"') > in arch/powerpc/include/asm/synch.h. > > ??? Interactions with spin_is_locked(), if I remember correctly. It is not clear to me that this was the right way to go, though. Thanx, Paul > Thanks, Akira > > > > > Thanx, Paul > > > >> Thanks, Akira > >> > >>> See the difference of position of label .L35. > >>> (Note that we are talking about strong version of cmpxchg().) > >>> > >>> Does the above example make sense to you? > >>> > >>> Thanks, Akira > >>> > >>>> > >>>> > >>>> --Junchang > >>>> > >>>> > >>>> > >>>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > >>>>> > >>> > >> > > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa ` (2 preceding siblings ...) 2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa @ 2018-12-11 15:44 ` Akira Yokosawa 2018-12-12 15:48 ` Junchang Wang 2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney 4 siblings, 1 reply; 16+ messages in thread From: Akira Yokosawa @ 2018-12-11 15:44 UTC (permalink / raw) To: Paul E. McKenney, Junchang Wang; +Cc: perfbook From 61a010b7c34c1b6c05b3883637ccdf2438409408 Mon Sep 17 00:00:00 2001 From: Akira Yokosawa <akiyks@gmail.com> Date: Tue, 11 Dec 2018 21:14:31 +0900 Subject: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() For comparison, add weak (spurious-failure permitting) ones in the names of cmpxchg_weak() and atomic_cmpxchg_weak(). count_lim_atomic_weak.c is a copy of count_lim_atomic.c but uses atomic_cmpxchg_weak(). Signed-off-by: Akira Yokosawa <akiyks@gmail.com> --- CodeSamples/api-pthreads/api-gcc.h | 14 ++ CodeSamples/count/Makefile | 4 + CodeSamples/count/count_lim_atomic_weak.c | 226 ++++++++++++++++++++++++++++++ 3 files changed, 244 insertions(+) create mode 100644 CodeSamples/count/count_lim_atomic_weak.c diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h index b66f4b9..b27facd 100644 --- a/CodeSamples/api-pthreads/api-gcc.h +++ b/CodeSamples/api-pthreads/api-gcc.h @@ -178,6 +178,20 @@ static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) return cmpxchg(&v->counter, old, new); } +#define cmpxchg_weak(ptr, o, n) \ +({ \ + typeof(*ptr) _____actual = (o); \ + typeof(*ptr) _____o = _____actual; \ + \ + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \ + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? _____o : _____o+1 ; \ +}) + +static __inline__ int atomic_cmpxchg_weak(atomic_t *v, int old, int new) +{ + return cmpxchg_weak(&v->counter, old, new); +} + #define xchg(ptr, v) __atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST) #define atomic_xchg(ptr, v) \ __atomic_exchange_n(&(ptr)->counter, (v), __ATOMIC_SEQ_CST) diff --git a/CodeSamples/count/Makefile b/CodeSamples/count/Makefile index 481eb3f..52eb466 100644 --- a/CodeSamples/count/Makefile +++ b/CodeSamples/count/Makefile @@ -22,6 +22,7 @@ PROGS = count_atomic \ count_lim \ count_lim_app \ count_lim_atomic \ + count_lim_atomic_weak \ count_lim_sig \ count_limd \ count_nonatomic \ @@ -65,6 +66,9 @@ count_lim_app: count_lim_app.c ../api.h limtorture.h count_lim_atomic: count_lim_atomic.c ../api.h limtorture.h $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic count_lim_atomic.c -lpthread +count_lim_atomic_weak: count_lim_atomic_weak.c ../api.h limtorture.h + $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic_weak count_lim_atomic_weak.c -lpthread + count_lim_sig: count_lim_sig.c ../api.h limtorture.h $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_sig count_lim_sig.c -lpthread diff --git a/CodeSamples/count/count_lim_atomic_weak.c b/CodeSamples/count/count_lim_atomic_weak.c new file mode 100644 index 0000000..42953f2 --- /dev/null +++ b/CodeSamples/count/count_lim_atomic_weak.c @@ -0,0 +1,226 @@ +/* + * count_lim_atomic_weak.c: simple limit counter that uses weak variant + * of atomic operations to steal from other threads. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, you can access it online at + * http://www.gnu.org/licenses/gpl-2.0.html. + * + * Copyright (c) 2009 Paul E. McKenney, IBM Corporation. + */ + +#include "../api.h" + +static void balance_count(void); + +atomic_t __thread counterandmax = ATOMIC_INIT(0); //\lnlbl{var:candmax} +unsigned long globalcountmax = 1 << 25; //\lnlbl{var:def:b} +unsigned long globalcount = 0; +unsigned long globalreserve = 0; +atomic_t *counterp[NR_THREADS] = { NULL }; +DEFINE_SPINLOCK(gblcnt_mutex); //\lnlbl{var:def:e} +#define CM_BITS (sizeof(atomic_t) * 4) //\lnlbl{var:CM_BITS} +#define MAX_COUNTERMAX ((1 << CM_BITS) - 1) //\lnlbl{var:MAX_CMAX} + +static __inline__ void //\lnlbl{split_int:b} +split_counterandmax_int(int cami, int *c, int *cm) +{ + *c = (cami >> CM_BITS) & MAX_COUNTERMAX; //\lnlbl{split_int:msh} + *cm = cami & MAX_COUNTERMAX; //\lnlbl{split_int:lsh} +} //\lnlbl{split_int:e} + +static __inline__ void //\lnlbl{split:b} +split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)//\lnlbl{split:func} +{ + unsigned int cami = atomic_read(cam); //\lnlbl{split:int} + + *old = cami; //\lnlbl{split:old} + split_counterandmax_int(cami, c, cm); //\lnlbl{split:split_int} +} //\lnlbl{split:e} + +static __inline__ int merge_counterandmax(int c, int cm)//\lnlbl{merge:b} +{ + unsigned int cami; + + cami = (c << CM_BITS) | cm; //\lnlbl{merge:merge} + return ((int)cami); +} //\lnlbl{merge:e} + +static void globalize_count(void) //\lnlbl{globalize:b} +{ + int c; + int cm; + int old; + + split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{globalize:split} + globalcount += c; + globalreserve -= cm; + old = merge_counterandmax(0, 0); + atomic_set(&counterandmax, old); +} //\lnlbl{globalize:e} + +static void flush_local_count(void) //\lnlbl{flush:b} +{ + int c; + int cm; + int old; + int t; + int zero; + + if (globalreserve == 0) //\lnlbl{flush:checkrsv} + return; //\lnlbl{flush:return:n} + zero = merge_counterandmax(0, 0); //\lnlbl{flush:initzero} + for_each_thread(t) //\lnlbl{flush:loop:b} + if (counterp[t] != NULL) { //\lnlbl{flush:checkp} + old = atomic_xchg(counterp[t], zero);//\lnlbl{flush:atmxchg} + split_counterandmax_int(old, &c, &cm);//\lnlbl{flush:split} + globalcount += c; //\lnlbl{flush:glbcnt} + globalreserve -= cm; //\lnlbl{flush:glbrsv} + } //\lnlbl{flush:loop:e} +} //\lnlbl{flush:e} + +int add_count(unsigned long delta) //\lnlbl{add:b} +{ + int c; + int cm; + int old; + int new; + + do { //\lnlbl{add:fast:b} + split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{add:split} + if (delta > MAX_COUNTERMAX || c + delta > cm)//\lnlbl{add:check} + goto slowpath; //\lnlbl{add:goto} + new = merge_counterandmax(c + delta, cm);//\lnlbl{add:merge} + } while (atomic_cmpxchg_weak(&counterandmax, //\lnlbl{add:atmcmpex} + old, new) != old); //\lnlbl{add:loop:e} + return 1; //\lnlbl{add:return:fs} +slowpath: //\lnlbl{add:slow:b} + spin_lock(&gblcnt_mutex); //\lnlbl{add:acquire} + globalize_count(); //\lnlbl{add:globalize} + if (globalcountmax - globalcount - //\lnlbl{add:checkglb:b} + globalreserve < delta) { //\lnlbl{add:checkglb:e} + flush_local_count(); //\lnlbl{add:flush} + if (globalcountmax - globalcount - //\lnlbl{add:checkglb:nb} + globalreserve < delta) { //\lnlbl{add:checkglb:ne} + spin_unlock(&gblcnt_mutex); //\lnlbl{add:release:f} + return 0; //\lnlbl{add:return:sf} + } + } + globalcount += delta; //\lnlbl{add:addglb} + balance_count(); //\lnlbl{add:balance} + spin_unlock(&gblcnt_mutex); //\lnlbl{add:release:s} + return 1; //\lnlbl{add:return:ss} +} //\lnlbl{add:e} + +int sub_count(unsigned long delta) //\lnlbl{sub:b} +{ + int c; + int cm; + int old; + int new; + + do { //\lnlbl{sub:fast:b} + split_counterandmax(&counterandmax, &old, &c, &cm); + if (delta > c) + goto slowpath; + new = merge_counterandmax(c - delta, cm); + } while (atomic_cmpxchg_weak(&counterandmax, + old, new) != old); + return 1; //\lnlbl{sub:fast:e} + slowpath: //\lnlbl{sub:slow:b} + spin_lock(&gblcnt_mutex); + globalize_count(); + if (globalcount < delta) { + flush_local_count(); + if (globalcount < delta) { + spin_unlock(&gblcnt_mutex); + return 0; + } + } + globalcount -= delta; + balance_count(); + spin_unlock(&gblcnt_mutex); + return 1; //\lnlbl{sub:slow:e} +} //\lnlbl{sub:e} + +unsigned long read_count(void) //\lnlbl{b} +{ + int c; + int cm; + int old; + int t; + unsigned long sum; + + spin_lock(&gblcnt_mutex); //\lnlbl{acquire} + sum = globalcount; //\lnlbl{initsum} + for_each_thread(t) //\lnlbl{loop:b} + if (counterp[t] != NULL) { + split_counterandmax(counterp[t], &old, &c, &cm);//\lnlbl{split} + sum += c; + } //\lnlbl{loop:e} + spin_unlock(&gblcnt_mutex); //\lnlbl{release} + return sum; //\lnlbl{return} +} //\lnlbl{e} + +void count_init(void) +{ +} + +static void balance_count(void) //\lnlbl{balance:b} +{ + int c; + int cm; + int old; + unsigned long limit; + + limit = globalcountmax - globalcount - + globalreserve; + limit /= num_online_threads(); + if (limit > MAX_COUNTERMAX) + cm = MAX_COUNTERMAX; + else + cm = limit; + globalreserve += cm; + c = cm / 2; + if (c > globalcount) + c = globalcount; + globalcount -= c; + old = merge_counterandmax(c, cm); + atomic_set(&counterandmax, old); //\lnlbl{balance:atmcset} +} //\lnlbl{balance:e} + +void count_register_thread(void) //\lnlbl{register:b} +{ + int idx = smp_thread_id(); + + spin_lock(&gblcnt_mutex); + counterp[idx] = &counterandmax; + spin_unlock(&gblcnt_mutex); +} + +void count_unregister_thread(int nthreadsexpected) //\lnlbl{unregister:b} +{ + int idx = smp_thread_id(); + + spin_lock(&gblcnt_mutex); + globalize_count(); + counterp[idx] = NULL; + spin_unlock(&gblcnt_mutex); +} + +void count_cleanup(void) +{ +} + +#define NEED_REGISTER_THREAD +#include "limtorture.h" -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() 2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa @ 2018-12-12 15:48 ` Junchang Wang 0 siblings, 0 replies; 16+ messages in thread From: Junchang Wang @ 2018-12-12 15:48 UTC (permalink / raw) To: Akira Yokosawa, Paul E. McKenney; +Cc: perfbook On 12/11/18 11:44 PM, Akira Yokosawa wrote: > From 61a010b7c34c1b6c05b3883637ccdf2438409408 Mon Sep 17 00:00:00 2001 > From: Akira Yokosawa <akiyks@gmail.com> > Date: Tue, 11 Dec 2018 21:14:31 +0900 > Subject: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() > > For comparison, add weak (spurious-failure permitting) ones in the > names of cmpxchg_weak() and atomic_cmpxchg_weak(). > count_lim_atomic_weak.c is a copy of count_lim_atomic.c but uses > atomic_cmpxchg_weak(). > > Signed-off-by: Akira Yokosawa <akiyks@gmail.com> > --- > CodeSamples/api-pthreads/api-gcc.h | 14 ++ > CodeSamples/count/Makefile | 4 + > CodeSamples/count/count_lim_atomic_weak.c | 226 ++++++++++++++++++++++++++++++ > 3 files changed, 244 insertions(+) > create mode 100644 CodeSamples/count/count_lim_atomic_weak.c > > diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h > index b66f4b9..b27facd 100644 > --- a/CodeSamples/api-pthreads/api-gcc.h > +++ b/CodeSamples/api-pthreads/api-gcc.h > @@ -178,6 +178,20 @@ static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new) > return cmpxchg(&v->counter, old, new); > } > > +#define cmpxchg_weak(ptr, o, n) \ > +({ \ > + typeof(*ptr) _____actual = (o); \ > + typeof(*ptr) _____o = _____actual; \ > + \ > + __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \ > + __ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? _____o : _____o+1 ; \ > +}) > + Hi Akira, Quit an interesting topic. I did some experiments and please see below. We changed the implementation of cmpxchg() heavily. So let's use the latest two versions---count_lim_atomic based on your latest patch [3/4] and count_lim_atomic_weak based on patch [4/4]---to compare apple to apple. I ran these two programs on an 8-core PPC server. The duration of each run is set to 2.4 seconds. The first observation is that the result reported in your last mail (patch [0/4]) is correct; for each update operation, count_lim_atomic is about 10 nanoseconds faster than count_lim_atomic_weak, no matter how many concurrent updaters are running. For example, when I created 64 updaters, I got the follow results: ./count_lim_atomic 64 uperf ns/update: 290 ./count_lim_atomic_weak 64 uperf ns/update: 301 Following is my analysis. I found that the first reason is because cmpxchg_weak is using a temporal variable _____o to avoid side-effects of (o). If we switch cmpxchg_weak to the older version (shown below. I know it has flaws, but let's forget about it temporarily :-)), count_lim_atomic_weak runs as fast as count_lim_atomic. #define cmpxchg_weak(ptr, o, n) \ ({ \ typeof(*ptr) _____actual = (o); \ \ __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \ __ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? (o) : (n) ; \ }) The second possible reason could be more interesting. I discussed that in the reply to your mail [Patch 3/4]. Regards, --Junchang > +static __inline__ int atomic_cmpxchg_weak(atomic_t *v, int old, int new) > +{ > + return cmpxchg_weak(&v->counter, old, new); > +} > + > #define xchg(ptr, v) __atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST) > #define atomic_xchg(ptr, v) \ > __atomic_exchange_n(&(ptr)->counter, (v), __ATOMIC_SEQ_CST) > diff --git a/CodeSamples/count/Makefile b/CodeSamples/count/Makefile > index 481eb3f..52eb466 100644 > --- a/CodeSamples/count/Makefile > +++ b/CodeSamples/count/Makefile > @@ -22,6 +22,7 @@ PROGS = count_atomic \ > count_lim \ > count_lim_app \ > count_lim_atomic \ > + count_lim_atomic_weak \ > count_lim_sig \ > count_limd \ > count_nonatomic \ > @@ -65,6 +66,9 @@ count_lim_app: count_lim_app.c ../api.h limtorture.h > count_lim_atomic: count_lim_atomic.c ../api.h limtorture.h > $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic count_lim_atomic.c -lpthread > > +count_lim_atomic_weak: count_lim_atomic_weak.c ../api.h limtorture.h > + $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic_weak count_lim_atomic_weak.c -lpthread > + > count_lim_sig: count_lim_sig.c ../api.h limtorture.h > $(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_sig count_lim_sig.c -lpthread > > diff --git a/CodeSamples/count/count_lim_atomic_weak.c b/CodeSamples/count/count_lim_atomic_weak.c > new file mode 100644 > index 0000000..42953f2 > --- /dev/null > +++ b/CodeSamples/count/count_lim_atomic_weak.c > @@ -0,0 +1,226 @@ > +/* > + * count_lim_atomic_weak.c: simple limit counter that uses weak variant > + * of atomic operations to steal from other threads. > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, you can access it online at > + * http://www.gnu.org/licenses/gpl-2.0.html. > + * > + * Copyright (c) 2009 Paul E. McKenney, IBM Corporation. > + */ > + > +#include "../api.h" > + > +static void balance_count(void); > + > +atomic_t __thread counterandmax = ATOMIC_INIT(0); //\lnlbl{var:candmax} > +unsigned long globalcountmax = 1 << 25; //\lnlbl{var:def:b} > +unsigned long globalcount = 0; > +unsigned long globalreserve = 0; > +atomic_t *counterp[NR_THREADS] = { NULL }; > +DEFINE_SPINLOCK(gblcnt_mutex); //\lnlbl{var:def:e} > +#define CM_BITS (sizeof(atomic_t) * 4) //\lnlbl{var:CM_BITS} > +#define MAX_COUNTERMAX ((1 << CM_BITS) - 1) //\lnlbl{var:MAX_CMAX} > + > +static __inline__ void //\lnlbl{split_int:b} > +split_counterandmax_int(int cami, int *c, int *cm) > +{ > + *c = (cami >> CM_BITS) & MAX_COUNTERMAX; //\lnlbl{split_int:msh} > + *cm = cami & MAX_COUNTERMAX; //\lnlbl{split_int:lsh} > +} //\lnlbl{split_int:e} > + > +static __inline__ void //\lnlbl{split:b} > +split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)//\lnlbl{split:func} > +{ > + unsigned int cami = atomic_read(cam); //\lnlbl{split:int} > + > + *old = cami; //\lnlbl{split:old} > + split_counterandmax_int(cami, c, cm); //\lnlbl{split:split_int} > +} //\lnlbl{split:e} > + > +static __inline__ int merge_counterandmax(int c, int cm)//\lnlbl{merge:b} > +{ > + unsigned int cami; > + > + cami = (c << CM_BITS) | cm; //\lnlbl{merge:merge} > + return ((int)cami); > +} //\lnlbl{merge:e} > + > +static void globalize_count(void) //\lnlbl{globalize:b} > +{ > + int c; > + int cm; > + int old; > + > + split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{globalize:split} > + globalcount += c; > + globalreserve -= cm; > + old = merge_counterandmax(0, 0); > + atomic_set(&counterandmax, old); > +} //\lnlbl{globalize:e} > + > +static void flush_local_count(void) //\lnlbl{flush:b} > +{ > + int c; > + int cm; > + int old; > + int t; > + int zero; > + > + if (globalreserve == 0) //\lnlbl{flush:checkrsv} > + return; //\lnlbl{flush:return:n} > + zero = merge_counterandmax(0, 0); //\lnlbl{flush:initzero} > + for_each_thread(t) //\lnlbl{flush:loop:b} > + if (counterp[t] != NULL) { //\lnlbl{flush:checkp} > + old = atomic_xchg(counterp[t], zero);//\lnlbl{flush:atmxchg} > + split_counterandmax_int(old, &c, &cm);//\lnlbl{flush:split} > + globalcount += c; //\lnlbl{flush:glbcnt} > + globalreserve -= cm; //\lnlbl{flush:glbrsv} > + } //\lnlbl{flush:loop:e} > +} //\lnlbl{flush:e} > + > +int add_count(unsigned long delta) //\lnlbl{add:b} > +{ > + int c; > + int cm; > + int old; > + int new; > + > + do { //\lnlbl{add:fast:b} > + split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{add:split} > + if (delta > MAX_COUNTERMAX || c + delta > cm)//\lnlbl{add:check} > + goto slowpath; //\lnlbl{add:goto} > + new = merge_counterandmax(c + delta, cm);//\lnlbl{add:merge} > + } while (atomic_cmpxchg_weak(&counterandmax, //\lnlbl{add:atmcmpex} > + old, new) != old); //\lnlbl{add:loop:e} > + return 1; //\lnlbl{add:return:fs} > +slowpath: //\lnlbl{add:slow:b} > + spin_lock(&gblcnt_mutex); //\lnlbl{add:acquire} > + globalize_count(); //\lnlbl{add:globalize} > + if (globalcountmax - globalcount - //\lnlbl{add:checkglb:b} > + globalreserve < delta) { //\lnlbl{add:checkglb:e} > + flush_local_count(); //\lnlbl{add:flush} > + if (globalcountmax - globalcount - //\lnlbl{add:checkglb:nb} > + globalreserve < delta) { //\lnlbl{add:checkglb:ne} > + spin_unlock(&gblcnt_mutex); //\lnlbl{add:release:f} > + return 0; //\lnlbl{add:return:sf} > + } > + } > + globalcount += delta; //\lnlbl{add:addglb} > + balance_count(); //\lnlbl{add:balance} > + spin_unlock(&gblcnt_mutex); //\lnlbl{add:release:s} > + return 1; //\lnlbl{add:return:ss} > +} //\lnlbl{add:e} > + > +int sub_count(unsigned long delta) //\lnlbl{sub:b} > +{ > + int c; > + int cm; > + int old; > + int new; > + > + do { //\lnlbl{sub:fast:b} > + split_counterandmax(&counterandmax, &old, &c, &cm); > + if (delta > c) > + goto slowpath; > + new = merge_counterandmax(c - delta, cm); > + } while (atomic_cmpxchg_weak(&counterandmax, > + old, new) != old); > + return 1; //\lnlbl{sub:fast:e} > + slowpath: //\lnlbl{sub:slow:b} > + spin_lock(&gblcnt_mutex); > + globalize_count(); > + if (globalcount < delta) { > + flush_local_count(); > + if (globalcount < delta) { > + spin_unlock(&gblcnt_mutex); > + return 0; > + } > + } > + globalcount -= delta; > + balance_count(); > + spin_unlock(&gblcnt_mutex); > + return 1; //\lnlbl{sub:slow:e} > +} //\lnlbl{sub:e} > + > +unsigned long read_count(void) //\lnlbl{b} > +{ > + int c; > + int cm; > + int old; > + int t; > + unsigned long sum; > + > + spin_lock(&gblcnt_mutex); //\lnlbl{acquire} > + sum = globalcount; //\lnlbl{initsum} > + for_each_thread(t) //\lnlbl{loop:b} > + if (counterp[t] != NULL) { > + split_counterandmax(counterp[t], &old, &c, &cm);//\lnlbl{split} > + sum += c; > + } //\lnlbl{loop:e} > + spin_unlock(&gblcnt_mutex); //\lnlbl{release} > + return sum; //\lnlbl{return} > +} //\lnlbl{e} > + > +void count_init(void) > +{ > +} > + > +static void balance_count(void) //\lnlbl{balance:b} > +{ > + int c; > + int cm; > + int old; > + unsigned long limit; > + > + limit = globalcountmax - globalcount - > + globalreserve; > + limit /= num_online_threads(); > + if (limit > MAX_COUNTERMAX) > + cm = MAX_COUNTERMAX; > + else > + cm = limit; > + globalreserve += cm; > + c = cm / 2; > + if (c > globalcount) > + c = globalcount; > + globalcount -= c; > + old = merge_counterandmax(c, cm); > + atomic_set(&counterandmax, old); //\lnlbl{balance:atmcset} > +} //\lnlbl{balance:e} > + > +void count_register_thread(void) //\lnlbl{register:b} > +{ > + int idx = smp_thread_id(); > + > + spin_lock(&gblcnt_mutex); > + counterp[idx] = &counterandmax; > + spin_unlock(&gblcnt_mutex); > +} > + > +void count_unregister_thread(int nthreadsexpected) //\lnlbl{unregister:b} > +{ > + int idx = smp_thread_id(); > + > + spin_lock(&gblcnt_mutex); > + globalize_count(); > + counterp[idx] = NULL; > + spin_unlock(&gblcnt_mutex); > +} > + > +void count_cleanup(void) > +{ > +} > + > +#define NEED_REGISTER_THREAD > +#include "limtorture.h" > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa ` (3 preceding siblings ...) 2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa @ 2018-12-11 17:23 ` Paul E. McKenney 4 siblings, 0 replies; 16+ messages in thread From: Paul E. McKenney @ 2018-12-11 17:23 UTC (permalink / raw) To: Akira Yokosawa; +Cc: Junchang Wang, perfbook On Wed, Dec 12, 2018 at 12:39:38AM +0900, Akira Yokosawa wrote: > Subject: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples > > Hi Paul and Junchang, > > Based on the earlier correspondence, I prepared a patch set to update > the definition of cmpxchg(). > > Patch #1 adds a simple litmus test of cmpxchg(). > Patch #2 fix cmpxchg() in .../litmus/api.h by specifying strong > __atomic_compare_exchange_n)\(). Results of litmus tests are presented > in the commit log. > Patch #3 fix cmpxchg() in .../api-gcc.h in the same manner as patch #2. I applied and pushed #1-3, thank you! Thanx, Paul > Patch #4 is not intended to be applied to master, but provides a weak > variant of cmpxchg() and count_lim_atomic_weak.c. > > I did some experiments on PPC (10 runs each). > > count_lim_atomic 1 uperf, ns/update (as of v2018.12.08a): > 56.6572 > 56.4972 > 56.5504 > 56.4972 > 56.6305 > 56.5238 > 56.6038 > 56.5238 > 56.5504 > 56.5238 > > count_lim_atomic 1 uperf, ns/update (as of Patch #4): > 38.3264 > 38.1073 > 38.2287 > 38.2287 > 38.2653 > 38.3142 > 38.1194 > 38.1801 > 38.4986 > 38.2531 > > count_lim_atomic_weak 1 uperf, ns/update (as of Patch #4): > 47.3186 > 47.3747 > 47.4121 > 47.5248 > 47.3934 > 47.3373 > 47.3186 > 47.4121 > 47.6002 > 47.3186 > > So count_lim_atomic (strong semantics) is around 9ns/update > faster than count_lim_atomic_weak, and is around 18ns/update > faster than count_lim_atomic of v2018.12.08a. > > This contradicts Junchang's view. The difference looks like > due to removal of ternary operation in cmpxchg(). > > Or I might be missing something. > > Junchang, can you please experiment on PPC and ARM? > > Thanks, Akira > -- > Akira Yokosawa (4): > CodeSamples: Add C-cmpxchg.litmus > CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() > CodeSamples: Fix definition of cmpxchg() in api-gcc.h > [EXP] CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() > > CodeSamples/api-pthreads/api-gcc.h | 19 ++- > CodeSamples/count/Makefile | 4 + > CodeSamples/count/count_lim_atomic_weak.c | 226 +++++++++++++++++++++++++++++ > CodeSamples/formal/litmus/C-cmpxchg.litmus | 24 +++ > CodeSamples/formal/litmus/api.h | 2 +- > 5 files changed, 272 insertions(+), 3 deletions(-) > create mode 100644 CodeSamples/count/count_lim_atomic_weak.c > create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus > > -- > 2.7.4 > ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2018-12-16 0:55 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa 2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa 2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa 2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa 2018-12-12 16:01 ` Junchang Wang 2018-12-13 15:33 ` Akira Yokosawa 2018-12-14 14:32 ` Junchang Wang 2018-12-15 14:58 ` Akira Yokosawa 2018-12-16 0:55 ` Junchang Wang 2018-12-15 15:10 ` Akira Yokosawa 2018-12-15 19:37 ` Paul E. McKenney 2018-12-15 23:42 ` Akira Yokosawa 2018-12-16 0:54 ` Paul E. McKenney 2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa 2018-12-12 15:48 ` Junchang Wang 2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox