[PATCH 0/4] Update definition of cmpxchg() under CodeSamples

Discussions of the Parallel Programming book
 help / color / mirror / Atom feed

* [PATCH 0/4] Update definition of cmpxchg() under CodeSamples
@ 2018-12-11 15:39 Akira Yokosawa
  2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:39 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook, Akira Yokosawa

Subject: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples

Hi Paul and Junchang,

Based on the earlier correspondence, I prepared a patch set to update
the definition of cmpxchg().

Patch #1 adds a simple litmus test of cmpxchg().
Patch #2 fix cmpxchg() in .../litmus/api.h by specifying strong
__atomic_compare_exchange_n)\(). Results of litmus tests are presented
in the commit log.
Patch #3 fix cmpxchg() in .../api-gcc.h in the same manner as patch #2.
Patch #4 is not intended to be applied to master, but provides a weak
variant of cmpxchg() and count_lim_atomic_weak.c.

I did some experiments on PPC (10 runs each).

count_lim_atomic 1 uperf, ns/update (as of v2018.12.08a):
56.6572
56.4972
56.5504
56.4972
56.6305
56.5238
56.6038
56.5238
56.5504
56.5238

count_lim_atomic 1 uperf, ns/update (as of Patch #4):
38.3264
38.1073
38.2287
38.2287
38.2653
38.3142
38.1194
38.1801
38.4986
38.2531

count_lim_atomic_weak 1 uperf, ns/update (as of Patch #4):
47.3186
47.3747
47.4121
47.5248
47.3934
47.3373
47.3186
47.4121
47.6002
47.3186

So count_lim_atomic (strong semantics) is around 9ns/update
faster than count_lim_atomic_weak, and is around 18ns/update
faster than count_lim_atomic of v2018.12.08a.

This contradicts Junchang's view. The difference looks like
due to removal of ternary operation in cmpxchg().

Or I might be missing something.

Junchang, can you please experiment on PPC and ARM?

        Thanks, Akira
--
Akira Yokosawa (4):
  CodeSamples: Add C-cmpxchg.litmus
  CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg()
  CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  [EXP] CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()

 CodeSamples/api-pthreads/api-gcc.h         |  19 ++-
 CodeSamples/count/Makefile                 |   4 +
 CodeSamples/count/count_lim_atomic_weak.c  | 226 +++++++++++++++++++++++++++++
 CodeSamples/formal/litmus/C-cmpxchg.litmus |  24 +++
 CodeSamples/formal/litmus/api.h            |   2 +-
 5 files changed, 272 insertions(+), 3 deletions(-)
 create mode 100644 CodeSamples/count/count_lim_atomic_weak.c
 create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus

-- 
2.7.4

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus
  2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
@ 2018-12-11 15:40 ` Akira Yokosawa
  2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:40 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook

From d724a9b7ffb4b6d648290864c80cd1542297fa40 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Mon, 10 Dec 2018 20:39:28 +0900
Subject: [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 CodeSamples/formal/litmus/C-cmpxchg.litmus | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)
 create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus

diff --git a/CodeSamples/formal/litmus/C-cmpxchg.litmus b/CodeSamples/formal/litmus/C-cmpxchg.litmus
new file mode 100644
index 0000000..9579d4a
--- /dev/null
+++ b/CodeSamples/formal/litmus/C-cmpxchg.litmus
@@ -0,0 +1,24 @@
+C C-cmpxchg
+{
+}
+
+{
+#include "api.h"
+}
+
+P0(int *x)
+{
+	int r1;
+
+	r1 = cmpxchg(x, 1, 2);
+}
+
+P1(int *x)
+{
+	int r1;
+
+	r1 = cmpxchg(x, 0, 1);
+}
+
+locations [1:r1]
+exists (0:r1=1 /\ ~x=2)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg()
  2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
  2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa
@ 2018-12-11 15:42 ` Akira Yokosawa
  2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:42 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook

From 2796715f236d5c152f6e559422e072b705e3d237 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Tue, 11 Dec 2018 20:49:16 +0900
Subject: [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg()

Output of C-cmpxchg.litmus by litmus7 on PPC (old definition):
----------------------------------------------------
Test C-cmpxchg Allowed
Histogram (4 states)
2789  :>0:r1=0; 1:r1=0; x=0;
50452053:>0:r1=0; 1:r1=0; x=1;
36003 *>0:r1=1; 1:r1=0; x=1;
49509155:>0:r1=1; 1:r1=0; x=2;
Ok

Witnesses
Positive: 36003, Negative: 99963997
Condition exists (0:r1=1 /\ not (x=2)) is validated
Hash=3f625d680407ff822fb56f1a80834291
Observation C-cmpxchg Sometimes 36003 99963997
Time C-cmpxchg 43.37
----------------------------------------------------

Output of C-cmpxchg.litmus by litmus7 on PPC (new definition):
----------------------------------------------------
Test C-cmpxchg Allowed
Histogram (2 states)
50766497:>0:r1=0; 1:r1=0; x=1;
49233503:>0:r1=1; 1:r1=0; x=2;
No

Witnesses
Positive: 0, Negative: 100000000
Condition exists (0:r1=1 /\ not (x=2)) is NOT validated
Hash=3f625d680407ff822fb56f1a80834291
Observation C-cmpxchg Never 0 100000000
Time C-cmpxchg 35.20
----------------------------------------------------

herd7 says:
----------------------------------------------------
Test C-cmpxchg Allowed
States 2
0:r1=0; 1:r1=0; x=1;
0:r1=1; 1:r1=0; x=2;
No
Witnesses
Positive: 0 Negative: 2
Condition exists (0:r1=1 /\ not (x=2))
Observation C-cmpxchg Never 0 2
Time C-cmpxchg 0.01
Hash=1720691890ea69e35615997626680d6f
----------------------------------------------------

klitmus7 on PPC says:
----------------------------------------------------
Test C-cmpxchg Allowed
Histogram (2 states)
2019624 :>0:r1=0; 1:r1=0; x=1;
1980376 :>0:r1=1; 1:r1=0; x=2;
No

Witnesses
Positive: 0, Negative: 4000000
Condition exists (0:r1=1 /\ not (x=2)) is NOT validated
Hash=1720691890ea69e35615997626680d6f
Observation C-cmpxchg Never 0 4000000
Time C-cmpxchg 0.76
----------------------------------------------------

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 CodeSamples/formal/litmus/api.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CodeSamples/formal/litmus/api.h b/CodeSamples/formal/litmus/api.h
index 8bf0ba6..bb1e1c2 100644
--- a/CodeSamples/formal/litmus/api.h
+++ b/CodeSamples/formal/litmus/api.h
@@ -16,7 +16,7 @@
 #define smp_load_acquire(x) __atomic_load_n(x, __ATOMIC_ACQUIRE)
 #define cmpxchg(x, o, n) ({ \
 	typeof(o) __old = (o); \
-	__atomic_compare_exchange_n((x), &__old, (n), 1, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
+	__atomic_compare_exchange_n((x), &__old, (n), 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
 	__old; \
 })
 #endif
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
  2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa
  2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa
@ 2018-12-11 15:42 ` Akira Yokosawa
  2018-12-12 16:01   ` Junchang Wang
  2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa
  2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney
  4 siblings, 1 reply; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:42 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook

From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Tue, 11 Dec 2018 21:37:11 +0900
Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h

Do the same change as CodeSamples/formal/litmus/api.h.

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 CodeSamples/api-pthreads/api-gcc.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
index 3afe340..b66f4b9 100644
--- a/CodeSamples/api-pthreads/api-gcc.h
+++ b/CodeSamples/api-pthreads/api-gcc.h
@@ -168,8 +168,9 @@ struct __xchg_dummy {
 ({ \
 	typeof(*ptr) _____actual = (o); \
 	\
-	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
-			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
+	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
+			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
+	_____actual; \
 })
 
 static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa
@ 2018-12-12 16:01   ` Junchang Wang
  2018-12-13 15:33     ` Akira Yokosawa
  0 siblings, 1 reply; 16+ messages in thread
From: Junchang Wang @ 2018-12-12 16:01 UTC (permalink / raw)
  To: Akira Yokosawa, Paul E. McKenney; +Cc: perfbook

On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa <akiyks@gmail.com>
> Date: Tue, 11 Dec 2018 21:37:11 +0900
> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> 
> Do the same change as CodeSamples/formal/litmus/api.h.
> 
> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> ---
>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> index 3afe340..b66f4b9 100644
> --- a/CodeSamples/api-pthreads/api-gcc.h
> +++ b/CodeSamples/api-pthreads/api-gcc.h
> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>  ({ \
>  	typeof(*ptr) _____actual = (o); \
>  	\
> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> +	_____actual; \
>  })
>  

Hi Akira,

Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:

# If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak

./count_lim_atomic 64 uperf
ns/update: 290

./count_lim_atomic_weak 64 uperf
ns/update: 301

# and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak

./count_lim_atomic 64 uperf
ns/update: 316

./count_lim_atomic_weak 64 uperf
ns/update: 302

./count_lim_atomic 120 uperf
ns/update: 630

./count_lim_atomic_weak 120 uperf
ns/update: 568

The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.

In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.

--Junchang

>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-12 16:01   ` Junchang Wang
@ 2018-12-13 15:33     ` Akira Yokosawa
  2018-12-14 14:32       ` Junchang Wang
  2018-12-15 15:10       ` Akira Yokosawa
  0 siblings, 2 replies; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-13 15:33 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa

On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
>> From: Akira Yokosawa <akiyks@gmail.com>
>> Date: Tue, 11 Dec 2018 21:37:11 +0900
>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>>
>> Do the same change as CodeSamples/formal/litmus/api.h.
>>
>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>> ---
>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
>> index 3afe340..b66f4b9 100644
>> --- a/CodeSamples/api-pthreads/api-gcc.h
>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>>  ({ \
>>  	typeof(*ptr) _____actual = (o); \
>>  	\
>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
>> +	_____actual; \
>>  })
>>  
> 
> Hi Akira,
> 
> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
> 
> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> 
> ./count_lim_atomic 64 uperf
> ns/update: 290
> 
> ./count_lim_atomic_weak 64 uperf
> ns/update: 301
> 
> 
> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> 
> ./count_lim_atomic 64 uperf
> ns/update: 316
> 
> ./count_lim_atomic_weak 64 uperf
> ns/update: 302
> 
> ./count_lim_atomic 120 uperf
> ns/update: 630
> 
> ./count_lim_atomic_weak 120 uperf
> ns/update: 568
> 
> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
> 
> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.

Hi Junchang,

In Linux-kernel speak, Documentation/core-api/atomic.rst says:

--------------------------------------------------------------------------
atomic_xchg must provide explicit memory barriers around the operation. ::

	int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value v,
with the given old and new values. Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of \*v are performed through atomic_xxx operations.

atomic_cmpxchg must provide explicit memory barriers around the operation,
although if the comparison fails then no memory ordering guarantees are
required.

[snip]

The routines xchg() and cmpxchg() must provide the same exact
memory-barrier semantics as the atomic and bit operations returning
values.
--------------------------------------------------------------------------

The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
lack of requirement.

On x86_64, __atomic_compare_exchange_n() is translated to the same code
in both cases  (with the help of litmus7's cross compiling):

#START _litmus_P1
	xorl	%eax, %eax
	movl	$0, 4(%rsp)
	lock cmpxchgl	%r10d, (%rdx)
	je	.L36
	movl	%eax, 4(%rsp)
.L36:
	movl	4(%rsp), %eax

There is no difference between the code with __ATOMIC_RELAXED and
the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
there is no memory barrier instruction emitted.

On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
the code looks like:

#START _litmus_P1
        sync
.L34:
        lwarx 7,0,9
        cmpwi 0,7,0
        bne 0,.L35
        stwcx. 5,0,9
        bne 0,.L34
        isync
.L35:

, OTOH with __ATOMIC_SEQ_CST as 2nd argument:

#START _litmus_P1
        sync
.L34:
        lwarx 7,0,9
        cmpwi 0,7,0
        bne 0,.L35
        stwcx. 5,0,9
        bne 0,.L34
.L35:
        isync

See the difference of position of label .L35.
(Note that we are talking about strong version of cmpxchg().)

Does the above example make sense to you?

        Thanks, Akira

> 
> 
> --Junchang
> 
> 
> 
>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-13 15:33     ` Akira Yokosawa
@ 2018-12-14 14:32       ` Junchang Wang
  2018-12-15 14:58         ` Akira Yokosawa
  2018-12-15 15:10       ` Akira Yokosawa
  1 sibling, 1 reply; 16+ messages in thread
From: Junchang Wang @ 2018-12-14 14:32 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul E. McKenney, perfbook

On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> > On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> >> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> >> From: Akira Yokosawa <akiyks@gmail.com>
> >> Date: Tue, 11 Dec 2018 21:37:11 +0900
> >> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> >>
> >> Do the same change as CodeSamples/formal/litmus/api.h.
> >>
> >> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >> ---
> >>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> >> index 3afe340..b66f4b9 100644
> >> --- a/CodeSamples/api-pthreads/api-gcc.h
> >> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >> @@ -168,8 +168,9 @@ struct __xchg_dummy {
> >>  ({ \
> >>      typeof(*ptr) _____actual = (o); \
> >>      \
> >> -    __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >> -                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >> +    __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> >> +                    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> >> +    _____actual; \
> >>  })
> >>
> >
> > Hi Akira,
> >
> > Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
> >
> > # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> >
> > ./count_lim_atomic 64 uperf
> > ns/update: 290
> >
> > ./count_lim_atomic_weak 64 uperf
> > ns/update: 301
> >
> >
> > # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> >
> > ./count_lim_atomic 64 uperf
> > ns/update: 316
> >
> > ./count_lim_atomic_weak 64 uperf
> > ns/update: 302
> >
> > ./count_lim_atomic 120 uperf
> > ns/update: 630
> >
> > ./count_lim_atomic_weak 120 uperf
> > ns/update: 568
> >
> > The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
> >
> > In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
>
> Hi Junchang,
>
> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
>
> --------------------------------------------------------------------------
> atomic_xchg must provide explicit memory barriers around the operation. ::
>
>         int atomic_cmpxchg(atomic_t *v, int old, int new);
>
> This performs an atomic compare exchange operation on the atomic value v,
> with the given old and new values. Like all atomic_xxx operations,
> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> other accesses of \*v are performed through atomic_xxx operations.
>
> atomic_cmpxchg must provide explicit memory barriers around the operation,
> although if the comparison fails then no memory ordering guarantees are
> required.

Hi Akira,

Thanks for the link which is helpful.

>
> [snip]
>
> The routines xchg() and cmpxchg() must provide the same exact
> memory-barrier semantics as the atomic and bit operations returning
> values.
> --------------------------------------------------------------------------
>
> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
> lack of requirement.
>
> On x86_64, __atomic_compare_exchange_n() is translated to the same code
> in both cases  (with the help of litmus7's cross compiling):
>
> #START _litmus_P1
>         xorl    %eax, %eax
>         movl    $0, 4(%rsp)
>         lock cmpxchgl   %r10d, (%rdx)
>         je      .L36
>         movl    %eax, 4(%rsp)
> .L36:
>         movl    4(%rsp), %eax
>
> There is no difference between the code with __ATOMIC_RELAXED and
> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
> there is no memory barrier instruction emitted.

My understanding is that x86 is using the TSO memory model, such that
it is unnecessary to add extra barriers. Is that right?

>
> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
> the code looks like:
>
> #START _litmus_P1
>         sync
> .L34:
>         lwarx 7,0,9
>         cmpwi 0,7,0
>         bne 0,.L35
>         stwcx. 5,0,9
>         bne 0,.L34
>         isync
> .L35:
>
> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
>
> #START _litmus_P1
>         sync
> .L34:
>         lwarx 7,0,9
>         cmpwi 0,7,0
>         bne 0,.L35
>         stwcx. 5,0,9
>         bne 0,.L34
> .L35:
>         isync
>
> See the difference of position of label .L35.
> (Note that we are talking about strong version of cmpxchg().)
>
> Does the above example make sense to you?
>

Yes. It makes sense.

For curiosity, I checked the assembly code of weak atomic_cmpxchg (the
fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown
below:

#START _litmus_P3
        sync
        lwarx 9,0,8
        cmpwi 0,9,1
        bne 0,.L34
        stwcx. 4,0,8
.L34:

The code shows that the weak atomic_cmpxchg fails because (1) the
content of *ptr is not equal to argument old, or (2) the store
operation fails because this thread loses reservation for the memory
location referenced by ptr. In contrast, the strong atomic_cmpxchg
(its assembly code is shown below) contains a while-loop and fails
only if (1) *ptr is not equal to argument old. So the weak
atomic_cmpxchg could fail "spuriously" and it is the caller's
responsibility to retry it.

#START _litmus_P1
        sync
.L34:
        lwarx 7,0,9
        cmpwi 0,7,0
        bne 0,.L35
        stwcx. 5,0,9
        bne 0,.L34
.L35:
        isync

So for the performance comparison, my hypothesis is that the weak
version may perform slightly better at high levels of contention
because if there is a "spuriously" failure, atomic_cmpxchg returns
immediately, which gives other threads chances to successfully perform
their atomic_cmpxchg instructions. For other cases, the strong
atomic_cmpxchg works pretty well because it avoids rebuilding
arguments old and new. Does that make sense?

Thanks,
--Junchang

>         Thanks, Akira
>
> >
> >
> > --Junchang
> >
> >
> >
> >>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-14 14:32       ` Junchang Wang
@ 2018-12-15 14:58         ` Akira Yokosawa
  2018-12-16  0:55           ` Junchang Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-15 14:58 UTC (permalink / raw)
  To: Junchang Wang; +Cc: Paul E. McKenney, perfbook, Akira Yokosawa

On 2018/12/14 22:32:22 +0800, Junchang Wang wrote:
> On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote:
>>
>> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
>>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
>>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
>>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>>>>
>>>> Do the same change as CodeSamples/formal/litmus/api.h.
>>>>
>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>> ---
>>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
>>>> index 3afe340..b66f4b9 100644
>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>>>>  ({ \
>>>>      typeof(*ptr) _____actual = (o); \
>>>>      \
>>>> -    __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>> -                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>>>> +    __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
>>>> +                    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
>>>> +    _____actual; \
>>>>  })
>>>>
>>>
>>> Hi Akira,
>>>
>>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
>>>
>>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
>>>
>>> ./count_lim_atomic 64 uperf
>>> ns/update: 290
>>>
>>> ./count_lim_atomic_weak 64 uperf
>>> ns/update: 301
>>>
>>>
>>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
>>>
>>> ./count_lim_atomic 64 uperf
>>> ns/update: 316
>>>
>>> ./count_lim_atomic_weak 64 uperf
>>> ns/update: 302
>>>
>>> ./count_lim_atomic 120 uperf
>>> ns/update: 630
>>>
>>> ./count_lim_atomic_weak 120 uperf
>>> ns/update: 568
>>>
>>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
>>>
>>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
>>
>> Hi Junchang,
>>
>> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
>>
>> --------------------------------------------------------------------------
>> atomic_xchg must provide explicit memory barriers around the operation. ::
>>
>>         int atomic_cmpxchg(atomic_t *v, int old, int new);
>>
>> This performs an atomic compare exchange operation on the atomic value v,
>> with the given old and new values. Like all atomic_xxx operations,
>> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
>> other accesses of \*v are performed through atomic_xxx operations.
>>
>> atomic_cmpxchg must provide explicit memory barriers around the operation,
>> although if the comparison fails then no memory ordering guarantees are
>> required.
> 
> Hi Akira,
> 
> Thanks for the link which is helpful.
> 
>>
>> [snip]
>>
>> The routines xchg() and cmpxchg() must provide the same exact
>> memory-barrier semantics as the atomic and bit operations returning
>> values.
>> --------------------------------------------------------------------------
>>
>> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
>> lack of requirement.
>>
>> On x86_64, __atomic_compare_exchange_n() is translated to the same code
>> in both cases  (with the help of litmus7's cross compiling):
>>
>> #START _litmus_P1
>>         xorl    %eax, %eax
>>         movl    $0, 4(%rsp)
>>         lock cmpxchgl   %r10d, (%rdx)
>>         je      .L36
>>         movl    %eax, 4(%rsp)
>> .L36:
>>         movl    4(%rsp), %eax
>>
>> There is no difference between the code with __ATOMIC_RELAXED and
>> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
>> there is no memory barrier instruction emitted.
> 
> My understanding is that x86 is using the TSO memory model, such that
> it is unnecessary to add extra barriers. Is that right?

I think so.

> 
>>
>> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
>> the code looks like:
>>
>> #START _litmus_P1
>>         sync
>> .L34:
>>         lwarx 7,0,9
>>         cmpwi 0,7,0
>>         bne 0,.L35
>>         stwcx. 5,0,9
>>         bne 0,.L34
>>         isync
>> .L35:
>>
>> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
>>
>> #START _litmus_P1
>>         sync
>> .L34:
>>         lwarx 7,0,9
>>         cmpwi 0,7,0
>>         bne 0,.L35
>>         stwcx. 5,0,9
>>         bne 0,.L34
>> .L35:
>>         isync
>>
>> See the difference of position of label .L35.
>> (Note that we are talking about strong version of cmpxchg().)
>>
>> Does the above example make sense to you?
>>
> 
> Yes. It makes sense.
> 
> For curiosity, I checked the assembly code of weak atomic_cmpxchg (the
> fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown
> below:
> 
> #START _litmus_P3
>         sync
>         lwarx 9,0,8
>         cmpwi 0,9,1
>         bne 0,.L34
>         stwcx. 4,0,8
> .L34:
> 
> The code shows that the weak atomic_cmpxchg fails because (1) the
> content of *ptr is not equal to argument old, or (2) the store
> operation fails because this thread loses reservation for the memory
> location referenced by ptr. In contrast, the strong atomic_cmpxchg
> (its assembly code is shown below) contains a while-loop and fails
> only if (1) *ptr is not equal to argument old. So the weak
> atomic_cmpxchg could fail "spuriously" and it is the caller's
> responsibility to retry it.
> 
> #START _litmus_P1
>         sync
> .L34:
>         lwarx 7,0,9
>         cmpwi 0,7,0
>         bne 0,.L35
>         stwcx. 5,0,9
>         bne 0,.L34
> .L35:
>         isync
> 
> So for the performance comparison, my hypothesis is that the weak
> version may perform slightly better at high levels of contention
> because if there is a "spuriously" failure, atomic_cmpxchg returns
> immediately, which gives other threads chances to successfully perform
> their atomic_cmpxchg instructions. For other cases, the strong> atomic_cmpxchg works pretty well because it avoids rebuilding
> arguments old and new. Does that make sense?

Well, as atomic_cmpxchg() is supposed to be inlined,
it is not easy to tell how the code of count_lim_atomic.c will be
optimized in the end.
The performance can vary depending on compiler version.

The point of this sample code is that it scales much better than
count_atomic.c. In the end, we need to minimize the contention
of atomic accesses, don't we?

        Thanks, Akira

> 
> Thanks,
> --Junchang
> 
>>         Thanks, Akira
>>
>>>
>>>
>>> --Junchang
>>>
>>>
>>>
>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-15 14:58         ` Akira Yokosawa
@ 2018-12-16  0:55           ` Junchang Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Junchang Wang @ 2018-12-16  0:55 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Paul E. McKenney, perfbook

On Sat, Dec 15, 2018 at 10:58 PM Akira Yokosawa <akiyks@gmail.com> wrote:
>
> On 2018/12/14 22:32:22 +0800, Junchang Wang wrote:
> > On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@gmail.com> wrote:
> >>
> >> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> >>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> >>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> >>>> From: Akira Yokosawa <akiyks@gmail.com>
> >>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
> >>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> >>>>
> >>>> Do the same change as CodeSamples/formal/litmus/api.h.
> >>>>
> >>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>>> ---
> >>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
> >>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> >>>> index 3afe340..b66f4b9 100644
> >>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
> >>>>  ({ \
> >>>>      typeof(*ptr) _____actual = (o); \
> >>>>      \
> >>>> -    __atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>> -                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >>>> +    __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> >>>> +                    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> >>>> +    _____actual; \
> >>>>  })
> >>>>
> >>>
> >>> Hi Akira,
> >>>
> >>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
> >>>
> >>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> >>>
> >>> ./count_lim_atomic 64 uperf
> >>> ns/update: 290
> >>>
> >>> ./count_lim_atomic_weak 64 uperf
> >>> ns/update: 301
> >>>
> >>>
> >>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> >>>
> >>> ./count_lim_atomic 64 uperf
> >>> ns/update: 316
> >>>
> >>> ./count_lim_atomic_weak 64 uperf
> >>> ns/update: 302
> >>>
> >>> ./count_lim_atomic 120 uperf
> >>> ns/update: 630
> >>>
> >>> ./count_lim_atomic_weak 120 uperf
> >>> ns/update: 568
> >>>
> >>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
> >>>
> >>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
> >>
> >> Hi Junchang,
> >>
> >> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
> >>
> >> --------------------------------------------------------------------------
> >> atomic_xchg must provide explicit memory barriers around the operation. ::
> >>
> >>         int atomic_cmpxchg(atomic_t *v, int old, int new);
> >>
> >> This performs an atomic compare exchange operation on the atomic value v,
> >> with the given old and new values. Like all atomic_xxx operations,
> >> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> >> other accesses of \*v are performed through atomic_xxx operations.
> >>
> >> atomic_cmpxchg must provide explicit memory barriers around the operation,
> >> although if the comparison fails then no memory ordering guarantees are
> >> required.
> >
> > Hi Akira,
> >
> > Thanks for the link which is helpful.
> >
> >>
> >> [snip]
> >>
> >> The routines xchg() and cmpxchg() must provide the same exact
> >> memory-barrier semantics as the atomic and bit operations returning
> >> values.
> >> --------------------------------------------------------------------------
> >>
> >> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
> >> lack of requirement.
> >>
> >> On x86_64, __atomic_compare_exchange_n() is translated to the same code
> >> in both cases  (with the help of litmus7's cross compiling):
> >>
> >> #START _litmus_P1
> >>         xorl    %eax, %eax
> >>         movl    $0, 4(%rsp)
> >>         lock cmpxchgl   %r10d, (%rdx)
> >>         je      .L36
> >>         movl    %eax, 4(%rsp)
> >> .L36:
> >>         movl    4(%rsp), %eax
> >>
> >> There is no difference between the code with __ATOMIC_RELAXED and
> >> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
> >> there is no memory barrier instruction emitted.
> >
> > My understanding is that x86 is using the TSO memory model, such that
> > it is unnecessary to add extra barriers. Is that right?
>
> I think so.
>
> >
> >>
> >> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
> >> the code looks like:
> >>
> >> #START _litmus_P1
> >>         sync
> >> .L34:
> >>         lwarx 7,0,9
> >>         cmpwi 0,7,0
> >>         bne 0,.L35
> >>         stwcx. 5,0,9
> >>         bne 0,.L34
> >>         isync
> >> .L35:
> >>
> >> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
> >>
> >> #START _litmus_P1
> >>         sync
> >> .L34:
> >>         lwarx 7,0,9
> >>         cmpwi 0,7,0
> >>         bne 0,.L35
> >>         stwcx. 5,0,9
> >>         bne 0,.L34
> >> .L35:
> >>         isync
> >>
> >> See the difference of position of label .L35.
> >> (Note that we are talking about strong version of cmpxchg().)
> >>
> >> Does the above example make sense to you?
> >>
> >
> > Yes. It makes sense.
> >
> > For curiosity, I checked the assembly code of weak atomic_cmpxchg (the
> > fourth argument is set to 1) with __ATOMIC_SEQ_CST. The code is shown
> > below:
> >
> > #START _litmus_P3
> >         sync
> >         lwarx 9,0,8
> >         cmpwi 0,9,1
> >         bne 0,.L34
> >         stwcx. 4,0,8
> > .L34:
> >
> > The code shows that the weak atomic_cmpxchg fails because (1) the
> > content of *ptr is not equal to argument old, or (2) the store
> > operation fails because this thread loses reservation for the memory
> > location referenced by ptr. In contrast, the strong atomic_cmpxchg
> > (its assembly code is shown below) contains a while-loop and fails
> > only if (1) *ptr is not equal to argument old. So the weak
> > atomic_cmpxchg could fail "spuriously" and it is the caller's
> > responsibility to retry it.
> >
> > #START _litmus_P1
> >         sync
> > .L34:
> >         lwarx 7,0,9
> >         cmpwi 0,7,0
> >         bne 0,.L35
> >         stwcx. 5,0,9
> >         bne 0,.L34
> > .L35:
> >         isync
> >
> > So for the performance comparison, my hypothesis is that the weak
> > version may perform slightly better at high levels of contention
> > because if there is a "spuriously" failure, atomic_cmpxchg returns
> > immediately, which gives other threads chances to successfully perform
> > their atomic_cmpxchg instructions. For other cases, the strong> atomic_cmpxchg works pretty well because it avoids rebuilding
> > arguments old and new. Does that make sense?
>
> Well, as atomic_cmpxchg() is supposed to be inlined,
> it is not easy to tell how the code of count_lim_atomic.c will be
> optimized in the end.
> The performance can vary depending on compiler version.

Got it. Thanks a lot :-).

--Junchang

>
> The point of this sample code is that it scales much better than
> count_atomic.c. In the end, we need to minimize the contention
> of atomic accesses, don't we?
>
>         Thanks, Akira
>
> >
> > Thanks,
> > --Junchang
> >
> >>         Thanks, Akira
> >>
> >>>
> >>>
> >>> --Junchang
> >>>
> >>>
> >>>
> >>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>
> >>
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-13 15:33     ` Akira Yokosawa
  2018-12-14 14:32       ` Junchang Wang
@ 2018-12-15 15:10       ` Akira Yokosawa
  2018-12-15 19:37         ` Paul E. McKenney
  1 sibling, 1 reply; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-15 15:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa

Hi Paul,

There is something I want to make sure.
Please see inline comment below.

On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote:
> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
>>> From: Akira Yokosawa <akiyks@gmail.com>
>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>>>
>>> Do the same change as CodeSamples/formal/litmus/api.h.
>>>
>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>> ---
>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
>>> index 3afe340..b66f4b9 100644
>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>>>  ({ \
>>>  	typeof(*ptr) _____actual = (o); \
>>>  	\
>>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
>>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
>>> +	_____actual; \
>>>  })
>>>  
>>
>> Hi Akira,
>>
>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
>>
>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
>>
>> ./count_lim_atomic 64 uperf
>> ns/update: 290
>>
>> ./count_lim_atomic_weak 64 uperf
>> ns/update: 301
>>
>>
>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
>>
>> ./count_lim_atomic 64 uperf
>> ns/update: 316
>>
>> ./count_lim_atomic_weak 64 uperf
>> ns/update: 302
>>
>> ./count_lim_atomic 120 uperf
>> ns/update: 630
>>
>> ./count_lim_atomic_weak 120 uperf
>> ns/update: 568
>>
>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
>>
>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
> 
> Hi Junchang,
> 
> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
> 
> --------------------------------------------------------------------------
> atomic_xchg must provide explicit memory barriers around the operation. ::
> 
> 	int atomic_cmpxchg(atomic_t *v, int old, int new);
> 
> This performs an atomic compare exchange operation on the atomic value v,
> with the given old and new values. Like all atomic_xxx operations,
> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> other accesses of \*v are performed through atomic_xxx operations.
> 
> atomic_cmpxchg must provide explicit memory barriers around the operation,
> although if the comparison fails then no memory ordering guarantees are
> required.
> 
> [snip]
> 
> The routines xchg() and cmpxchg() must provide the same exact
> memory-barrier semantics as the atomic and bit operations returning
> values.
> --------------------------------------------------------------------------
> 
> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
> lack of requirement.
> 
> On x86_64, __atomic_compare_exchange_n() is translated to the same code
> in both cases  (with the help of litmus7's cross compiling):
> 
> #START _litmus_P1
> 	xorl	%eax, %eax
> 	movl	$0, 4(%rsp)
> 	lock cmpxchgl	%r10d, (%rdx)
> 	je	.L36
> 	movl	%eax, 4(%rsp)
> .L36:
> 	movl	4(%rsp), %eax
> 
> There is no difference between the code with __ATOMIC_RELAXED and
> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
> there is no memory barrier instruction emitted.
> 
> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
> the code looks like:
> 
> #START _litmus_P1
>         sync
> .L34:
>         lwarx 7,0,9
>         cmpwi 0,7,0
>         bne 0,.L35
>         stwcx. 5,0,9
>         bne 0,.L34
>         isync
> .L35:
> 
> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
> 
> #START _litmus_P1
>         sync
> .L34:
>         lwarx 7,0,9
>         cmpwi 0,7,0
>         bne 0,.L35
>         stwcx. 5,0,9
>         bne 0,.L34
> .L35:
>         isync
> 

In this asm code snippets, the barrier instruction at the end of
cmpxchg() is "isync".

In arch/powerpc/include/asm/synch.h of Linux kernel source tree,
both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are
defined as '"\n" stringify_in_c(sync) "\n"', which will result
in "sync".

IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST,
is it?

        Thanks, Akira

> See the difference of position of label .L35.
> (Note that we are talking about strong version of cmpxchg().)
> 
> Does the above example make sense to you?
> 
>         Thanks, Akira
> 
>>
>>
>> --Junchang
>>
>>
>>
>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-15 15:10       ` Akira Yokosawa
@ 2018-12-15 19:37         ` Paul E. McKenney
  2018-12-15 23:42           ` Akira Yokosawa
  0 siblings, 1 reply; 16+ messages in thread
From: Paul E. McKenney @ 2018-12-15 19:37 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Junchang Wang, perfbook

On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote:
> Hi Paul,
> 
> There is something I want to make sure.
> Please see inline comment below.
> 
> On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote:
> > On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> >> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> >>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> >>> From: Akira Yokosawa <akiyks@gmail.com>
> >>> Date: Tue, 11 Dec 2018 21:37:11 +0900
> >>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> >>>
> >>> Do the same change as CodeSamples/formal/litmus/api.h.
> >>>
> >>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>> ---
> >>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
> >>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> >>> index 3afe340..b66f4b9 100644
> >>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
> >>>  ({ \
> >>>  	typeof(*ptr) _____actual = (o); \
> >>>  	\
> >>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> >>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> >>> +	_____actual; \
> >>>  })
> >>>  
> >>
> >> Hi Akira,
> >>
> >> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
> >>
> >> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> >>
> >> ./count_lim_atomic 64 uperf
> >> ns/update: 290
> >>
> >> ./count_lim_atomic_weak 64 uperf
> >> ns/update: 301
> >>
> >>
> >> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> >>
> >> ./count_lim_atomic 64 uperf
> >> ns/update: 316
> >>
> >> ./count_lim_atomic_weak 64 uperf
> >> ns/update: 302
> >>
> >> ./count_lim_atomic 120 uperf
> >> ns/update: 630
> >>
> >> ./count_lim_atomic_weak 120 uperf
> >> ns/update: 568
> >>
> >> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
> >>
> >> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
> > 
> > Hi Junchang,
> > 
> > In Linux-kernel speak, Documentation/core-api/atomic.rst says:
> > 
> > --------------------------------------------------------------------------
> > atomic_xchg must provide explicit memory barriers around the operation. ::
> > 
> > 	int atomic_cmpxchg(atomic_t *v, int old, int new);
> > 
> > This performs an atomic compare exchange operation on the atomic value v,
> > with the given old and new values. Like all atomic_xxx operations,
> > atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> > other accesses of \*v are performed through atomic_xxx operations.
> > 
> > atomic_cmpxchg must provide explicit memory barriers around the operation,
> > although if the comparison fails then no memory ordering guarantees are
> > required.
> > 
> > [snip]
> > 
> > The routines xchg() and cmpxchg() must provide the same exact
> > memory-barrier semantics as the atomic and bit operations returning
> > values.
> > --------------------------------------------------------------------------
> > 
> > The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
> > lack of requirement.
> > 
> > On x86_64, __atomic_compare_exchange_n() is translated to the same code
> > in both cases  (with the help of litmus7's cross compiling):
> > 
> > #START _litmus_P1
> > 	xorl	%eax, %eax
> > 	movl	$0, 4(%rsp)
> > 	lock cmpxchgl	%r10d, (%rdx)
> > 	je	.L36
> > 	movl	%eax, 4(%rsp)
> > .L36:
> > 	movl	4(%rsp), %eax
> > 
> > There is no difference between the code with __ATOMIC_RELAXED and
> > the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
> > there is no memory barrier instruction emitted.
> > 
> > On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
> > the code looks like:
> > 
> > #START _litmus_P1
> >         sync
> > .L34:
> >         lwarx 7,0,9
> >         cmpwi 0,7,0
> >         bne 0,.L35
> >         stwcx. 5,0,9
> >         bne 0,.L34
> >         isync
> > .L35:
> > 
> > , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
> > 
> > #START _litmus_P1
> >         sync
> > .L34:
> >         lwarx 7,0,9
> >         cmpwi 0,7,0
> >         bne 0,.L35
> >         stwcx. 5,0,9
> >         bne 0,.L34
> > .L35:
> >         isync
> > 
> 
> In this asm code snippets, the barrier instruction at the end of
> cmpxchg() is "isync".
> 
> In arch/powerpc/include/asm/synch.h of Linux kernel source tree,
> both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are
> defined as '"\n" stringify_in_c(sync) "\n"', which will result
> in "sync".
> 
> IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST,
> is it?

By itself, no, but in combination with the "sync" instruction at
the beginning, combined with the ordering of other __ATOMIC_SEQ_CST
operations, it actually is sufficient.  For more information, please
see:

http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html

And this is an update that is mostly irrelevant on modern hardware:

http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html

Note also that an lwsync instruction can be substituted for each isync,
which can sometimes provide better performance.

							Thanx, Paul

>         Thanks, Akira
> 
> > See the difference of position of label .L35.
> > (Note that we are talking about strong version of cmpxchg().)
> > 
> > Does the above example make sense to you?
> > 
> >         Thanks, Akira
> > 
> >>
> >>
> >> --Junchang
> >>
> >>
> >>
> >>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>
> > 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-15 19:37         ` Paul E. McKenney
@ 2018-12-15 23:42           ` Akira Yokosawa
  2018-12-16  0:54             ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-15 23:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Junchang Wang, perfbook, Akira Yokosawa

On 2018/12/15 11:37:44 -0800, Paul E. McKenney wrote:
> On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote:
>> Hi Paul,
>>
>> There is something I want to make sure.
>> Please see inline comment below.
>>
>> On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote:
>>> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
>>>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
>>>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
>>>>> From: Akira Yokosawa <akiyks@gmail.com>
>>>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
>>>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>>>>>
>>>>> Do the same change as CodeSamples/formal/litmus/api.h.
>>>>>
>>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
>>>>> ---
>>>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
>>>>> index 3afe340..b66f4b9 100644
>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>>>>>  ({ \
>>>>>  	typeof(*ptr) _____actual = (o); \
>>>>>  	\
>>>>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>>>>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
>>>>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
>>>>> +	_____actual; \
>>>>>  })
>>>>>  
>>>>
>>>> Hi Akira,
>>>>
>>>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
>>>>
>>>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
>>>>
>>>> ./count_lim_atomic 64 uperf
>>>> ns/update: 290
>>>>
>>>> ./count_lim_atomic_weak 64 uperf
>>>> ns/update: 301
>>>>
>>>>
>>>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
>>>>
>>>> ./count_lim_atomic 64 uperf
>>>> ns/update: 316
>>>>
>>>> ./count_lim_atomic_weak 64 uperf
>>>> ns/update: 302
>>>>
>>>> ./count_lim_atomic 120 uperf
>>>> ns/update: 630
>>>>
>>>> ./count_lim_atomic_weak 120 uperf
>>>> ns/update: 568
>>>>
>>>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
>>>>
>>>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
>>>
>>> Hi Junchang,
>>>
>>> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
>>>
>>> --------------------------------------------------------------------------
>>> atomic_xchg must provide explicit memory barriers around the operation. ::
>>>
>>> 	int atomic_cmpxchg(atomic_t *v, int old, int new);
>>>
>>> This performs an atomic compare exchange operation on the atomic value v,
>>> with the given old and new values. Like all atomic_xxx operations,
>>> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
>>> other accesses of \*v are performed through atomic_xxx operations.
>>>
>>> atomic_cmpxchg must provide explicit memory barriers around the operation,
>>> although if the comparison fails then no memory ordering guarantees are
>>> required.
>>>
>>> [snip]
>>>
>>> The routines xchg() and cmpxchg() must provide the same exact
>>> memory-barrier semantics as the atomic and bit operations returning
>>> values.
>>> --------------------------------------------------------------------------
>>>
>>> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
>>> lack of requirement.
>>>
>>> On x86_64, __atomic_compare_exchange_n() is translated to the same code
>>> in both cases  (with the help of litmus7's cross compiling):
>>>
>>> #START _litmus_P1
>>> 	xorl	%eax, %eax
>>> 	movl	$0, 4(%rsp)
>>> 	lock cmpxchgl	%r10d, (%rdx)
>>> 	je	.L36
>>> 	movl	%eax, 4(%rsp)
>>> .L36:
>>> 	movl	4(%rsp), %eax
>>>
>>> There is no difference between the code with __ATOMIC_RELAXED and
>>> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
>>> there is no memory barrier instruction emitted.
>>>
>>> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
>>> the code looks like:
>>>
>>> #START _litmus_P1
>>>         sync
>>> .L34:
>>>         lwarx 7,0,9
>>>         cmpwi 0,7,0
>>>         bne 0,.L35
>>>         stwcx. 5,0,9
>>>         bne 0,.L34
>>>         isync
>>> .L35:
>>>
>>> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
>>>
>>> #START _litmus_P1
>>>         sync
>>> .L34:
>>>         lwarx 7,0,9
>>>         cmpwi 0,7,0
>>>         bne 0,.L35
>>>         stwcx. 5,0,9
>>>         bne 0,.L34
>>> .L35:
>>>         isync
>>>
>>
>> In this asm code snippets, the barrier instruction at the end of
>> cmpxchg() is "isync".
>>
>> In arch/powerpc/include/asm/synch.h of Linux kernel source tree,
>> both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are
>> defined as '"\n" stringify_in_c(sync) "\n"', which will result
>> in "sync".
>>
>> IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST,
>> is it?
> 
> By itself, no, but in combination with the "sync" instruction at
> the beginning, combined with the ordering of other __ATOMIC_SEQ_CST
> operations, it actually is sufficient.  For more information, please
> see:
> 
> http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
> 
> And this is an update that is mostly irrelevant on modern hardware:
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> 
> Note also that an lwsync instruction can be substituted for each isync,
> which can sometimes provide better performance.

Now I'm wondering why PPC_ATOMIC_EXIT_BARRIER is defined as
'"\n" stringify_in_c(sync) "\n"' rather than
'"\n" stringify_in_c(isync) "\n"' (or '"\n" stringify_in_c(LWSYNC) "\n"')
in arch/powerpc/include/asm/synch.h.

???

        Thanks, Akira

> 
> 							Thanx, Paul
> 
>>         Thanks, Akira
>>
>>> See the difference of position of label .L35.
>>> (Note that we are talking about strong version of cmpxchg().)
>>>
>>> Does the above example make sense to you?
>>>
>>>         Thanks, Akira
>>>
>>>>
>>>>
>>>> --Junchang
>>>>
>>>>
>>>>
>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
  2018-12-15 23:42           ` Akira Yokosawa
@ 2018-12-16  0:54             ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2018-12-16  0:54 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Junchang Wang, perfbook

On Sun, Dec 16, 2018 at 08:42:56AM +0900, Akira Yokosawa wrote:
> On 2018/12/15 11:37:44 -0800, Paul E. McKenney wrote:
> > On Sun, Dec 16, 2018 at 12:10:18AM +0900, Akira Yokosawa wrote:
> >> Hi Paul,
> >>
> >> There is something I want to make sure.
> >> Please see inline comment below.
> >>
> >> On 2018/12/14 00:33:15 +0900, Akira Yokosawa wrote:
> >>> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> >>>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> >>>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> >>>>> From: Akira Yokosawa <akiyks@gmail.com>
> >>>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
> >>>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> >>>>>
> >>>>> Do the same change as CodeSamples/formal/litmus/api.h.
> >>>>>
> >>>>> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> >>>>> ---
> >>>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
> >>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> >>>>> index 3afe340..b66f4b9 100644
> >>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
> >>>>>  ({ \
> >>>>>  	typeof(*ptr) _____actual = (o); \
> >>>>>  	\
> >>>>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >>>>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> >>>>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> >>>>> +	_____actual; \
> >>>>>  })
> >>>>>  
> >>>>
> >>>> Hi Akira,
> >>>>
> >>>> Another reason that the performance of cmpxchg is catching up with cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED in this patch. The use of __ATOMIC_RELAXED means if the CAS primitive fails, the relaxed semantic is used, rather than sequential consistent. Following are some experiment results:
> >>>>
> >>>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> >>>>
> >>>> ./count_lim_atomic 64 uperf
> >>>> ns/update: 290
> >>>>
> >>>> ./count_lim_atomic_weak 64 uperf
> >>>> ns/update: 301
> >>>>
> >>>>
> >>>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> >>>>
> >>>> ./count_lim_atomic 64 uperf
> >>>> ns/update: 316
> >>>>
> >>>> ./count_lim_atomic_weak 64 uperf
> >>>> ns/update: 302
> >>>>
> >>>> ./count_lim_atomic 120 uperf
> >>>> ns/update: 630
> >>>>
> >>>> ./count_lim_atomic_weak 120 uperf
> >>>> ns/update: 568
> >>>>
> >>>> The results show that if we want to ensure sequential consistency when the CAS primitive fails, cmpxchg_weak performs better than cmpxchg. It seems that the (type of variation, failure_memorder) pair affects performance. I know that PPC uses LL/SC to simulate CAS. But what's the relationship between a simulated CAS and the memory order. This is interesting because as far as I know, PPC and ARM are using LL/SC to simulate atomic primitives such as CAS and FAA. So FAA might have the same behavior.
> >>>>
> >>>> In actually, I'm not very clear about the meaning of different types of failure memory orders. For example, when should we use __ATOMIC_RELAXED, rather than __ATOMIC_SEQ_CST, if a CAS fails? What happen if __ATOMIC_RELAXED is used for x86? The one I'm look at is https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Do you know some resources about this? I can look into this tomorrow. Thanks.
> >>>
> >>> Hi Junchang,
> >>>
> >>> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
> >>>
> >>> --------------------------------------------------------------------------
> >>> atomic_xchg must provide explicit memory barriers around the operation. ::
> >>>
> >>> 	int atomic_cmpxchg(atomic_t *v, int old, int new);
> >>>
> >>> This performs an atomic compare exchange operation on the atomic value v,
> >>> with the given old and new values. Like all atomic_xxx operations,
> >>> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> >>> other accesses of \*v are performed through atomic_xxx operations.
> >>>
> >>> atomic_cmpxchg must provide explicit memory barriers around the operation,
> >>> although if the comparison fails then no memory ordering guarantees are
> >>> required.
> >>>
> >>> [snip]
> >>>
> >>> The routines xchg() and cmpxchg() must provide the same exact
> >>> memory-barrier semantics as the atomic and bit operations returning
> >>> values.
> >>> --------------------------------------------------------------------------
> >>>
> >>> The 2nd __ATOMIC_RELAXED to __atomic_compare_exchange_n() matches this
> >>> lack of requirement.
> >>>
> >>> On x86_64, __atomic_compare_exchange_n() is translated to the same code
> >>> in both cases  (with the help of litmus7's cross compiling):
> >>>
> >>> #START _litmus_P1
> >>> 	xorl	%eax, %eax
> >>> 	movl	$0, 4(%rsp)
> >>> 	lock cmpxchgl	%r10d, (%rdx)
> >>> 	je	.L36
> >>> 	movl	%eax, 4(%rsp)
> >>> .L36:
> >>> 	movl	4(%rsp), %eax
> >>>
> >>> There is no difference between the code with __ATOMIC_RELAXED and
> >>> the code with __ATOMIC_SEQ_CST as the 2nd parameter. As you can see,
> >>> there is no memory barrier instruction emitted.
> >>>
> >>> On PPC, there is a difference. With __ATOMIC_RELAXED as 2nd parameter,
> >>> the code looks like:
> >>>
> >>> #START _litmus_P1
> >>>         sync
> >>> .L34:
> >>>         lwarx 7,0,9
> >>>         cmpwi 0,7,0
> >>>         bne 0,.L35
> >>>         stwcx. 5,0,9
> >>>         bne 0,.L34
> >>>         isync
> >>> .L35:
> >>>
> >>> , OTOH with __ATOMIC_SEQ_CST as 2nd argument:
> >>>
> >>> #START _litmus_P1
> >>>         sync
> >>> .L34:
> >>>         lwarx 7,0,9
> >>>         cmpwi 0,7,0
> >>>         bne 0,.L35
> >>>         stwcx. 5,0,9
> >>>         bne 0,.L34
> >>> .L35:
> >>>         isync
> >>>
> >>
> >> In this asm code snippets, the barrier instruction at the end of
> >> cmpxchg() is "isync".
> >>
> >> In arch/powerpc/include/asm/synch.h of Linux kernel source tree,
> >> both PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER are
> >> defined as '"\n" stringify_in_c(sync) "\n"', which will result
> >> in "sync".
> >>
> >> IIUC, "isync" of PowerPC is not good enough for __ATOMIC_SEQ_CST,
> >> is it?
> > 
> > By itself, no, but in combination with the "sync" instruction at
> > the beginning, combined with the ordering of other __ATOMIC_SEQ_CST
> > operations, it actually is sufficient.  For more information, please
> > see:
> > 
> > http://open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2745.html
> > 
> > And this is an update that is mostly irrelevant on modern hardware:
> > 
> > http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> > 
> > Note also that an lwsync instruction can be substituted for each isync,
> > which can sometimes provide better performance.
> 
> Now I'm wondering why PPC_ATOMIC_EXIT_BARRIER is defined as
> '"\n" stringify_in_c(sync) "\n"' rather than
> '"\n" stringify_in_c(isync) "\n"' (or '"\n" stringify_in_c(LWSYNC) "\n"')
> in arch/powerpc/include/asm/synch.h.
> 
> ???

Interactions with spin_is_locked(), if I remember correctly.  It is not
clear to me that this was the right way to go, though.

							Thanx, Paul

>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >>         Thanks, Akira
> >>
> >>> See the difference of position of label .L35.
> >>> (Note that we are talking about strong version of cmpxchg().)
> >>>
> >>> Does the above example make sense to you?
> >>>
> >>>         Thanks, Akira
> >>>
> >>>>
> >>>>
> >>>> --Junchang
> >>>>
> >>>>
> >>>>
> >>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>>
> >>>
> >>
> > 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()
  2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
                   ` (2 preceding siblings ...)
  2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa
@ 2018-12-11 15:44 ` Akira Yokosawa
  2018-12-12 15:48   ` Junchang Wang
  2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney
  4 siblings, 1 reply; 16+ messages in thread
From: Akira Yokosawa @ 2018-12-11 15:44 UTC (permalink / raw)
  To: Paul E. McKenney, Junchang Wang; +Cc: perfbook

From 61a010b7c34c1b6c05b3883637ccdf2438409408 Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@gmail.com>
Date: Tue, 11 Dec 2018 21:14:31 +0900
Subject: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()

For comparison, add weak (spurious-failure permitting) ones in the
names of cmpxchg_weak() and atomic_cmpxchg_weak().
count_lim_atomic_weak.c is a copy of count_lim_atomic.c but uses
atomic_cmpxchg_weak().

Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
---
 CodeSamples/api-pthreads/api-gcc.h        |  14 ++
 CodeSamples/count/Makefile                |   4 +
 CodeSamples/count/count_lim_atomic_weak.c | 226 ++++++++++++++++++++++++++++++
 3 files changed, 244 insertions(+)
 create mode 100644 CodeSamples/count/count_lim_atomic_weak.c

diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
index b66f4b9..b27facd 100644
--- a/CodeSamples/api-pthreads/api-gcc.h
+++ b/CodeSamples/api-pthreads/api-gcc.h
@@ -178,6 +178,20 @@ static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
 	return cmpxchg(&v->counter, old, new);
 }
 
+#define cmpxchg_weak(ptr, o, n) \
+({ \
+	typeof(*ptr) _____actual = (o); \
+	typeof(*ptr) _____o = _____actual; \
+	\
+	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \
+			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? _____o : _____o+1 ; \
+})
+
+static __inline__ int atomic_cmpxchg_weak(atomic_t *v, int old, int new)
+{
+	return cmpxchg_weak(&v->counter, old, new);
+}
+
 #define xchg(ptr, v) __atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST)
 #define atomic_xchg(ptr, v) \
 	__atomic_exchange_n(&(ptr)->counter, (v), __ATOMIC_SEQ_CST)
diff --git a/CodeSamples/count/Makefile b/CodeSamples/count/Makefile
index 481eb3f..52eb466 100644
--- a/CodeSamples/count/Makefile
+++ b/CodeSamples/count/Makefile
@@ -22,6 +22,7 @@ PROGS =	count_atomic \
 	count_lim \
 	count_lim_app \
 	count_lim_atomic \
+	count_lim_atomic_weak \
 	count_lim_sig \
 	count_limd \
 	count_nonatomic \
@@ -65,6 +66,9 @@ count_lim_app: count_lim_app.c ../api.h limtorture.h
 count_lim_atomic: count_lim_atomic.c ../api.h limtorture.h
 	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic count_lim_atomic.c -lpthread
 
+count_lim_atomic_weak: count_lim_atomic_weak.c ../api.h limtorture.h
+	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic_weak count_lim_atomic_weak.c -lpthread
+
 count_lim_sig: count_lim_sig.c ../api.h limtorture.h
 	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_sig count_lim_sig.c -lpthread
 
diff --git a/CodeSamples/count/count_lim_atomic_weak.c b/CodeSamples/count/count_lim_atomic_weak.c
new file mode 100644
index 0000000..42953f2
--- /dev/null
+++ b/CodeSamples/count/count_lim_atomic_weak.c
@@ -0,0 +1,226 @@
+/*
+ * count_lim_atomic_weak.c: simple limit counter that uses weak variant
+ *	of atomic operations to steal from other threads.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
+ */
+
+#include "../api.h"
+
+static void balance_count(void);
+
+atomic_t __thread counterandmax = ATOMIC_INIT(0);	//\lnlbl{var:candmax}
+unsigned long globalcountmax = 1 << 25;			//\lnlbl{var:def:b}
+unsigned long globalcount = 0;
+unsigned long globalreserve = 0;
+atomic_t *counterp[NR_THREADS] = { NULL };
+DEFINE_SPINLOCK(gblcnt_mutex);				//\lnlbl{var:def:e}
+#define CM_BITS (sizeof(atomic_t) * 4)			//\lnlbl{var:CM_BITS}
+#define MAX_COUNTERMAX ((1 << CM_BITS) - 1)		//\lnlbl{var:MAX_CMAX}
+
+static __inline__ void					//\lnlbl{split_int:b}
+split_counterandmax_int(int cami, int *c, int *cm)
+{
+	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;	//\lnlbl{split_int:msh}
+	*cm = cami & MAX_COUNTERMAX;			//\lnlbl{split_int:lsh}
+}							//\lnlbl{split_int:e}
+
+static __inline__ void					//\lnlbl{split:b}
+split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)//\lnlbl{split:func}
+{
+	unsigned int cami = atomic_read(cam);		//\lnlbl{split:int}
+
+	*old = cami;					//\lnlbl{split:old}
+	split_counterandmax_int(cami, c, cm);		//\lnlbl{split:split_int}
+}							//\lnlbl{split:e}
+
+static __inline__ int merge_counterandmax(int c, int cm)//\lnlbl{merge:b}
+{
+	unsigned int cami;
+
+	cami = (c << CM_BITS) | cm;			//\lnlbl{merge:merge}
+	return ((int)cami);
+}							//\lnlbl{merge:e}
+
+static void globalize_count(void)			//\lnlbl{globalize:b}
+{
+	int c;
+	int cm;
+	int old;
+
+	split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{globalize:split}
+	globalcount += c;
+	globalreserve -= cm;
+	old = merge_counterandmax(0, 0);
+	atomic_set(&counterandmax, old);
+}							//\lnlbl{globalize:e}
+
+static void flush_local_count(void)			//\lnlbl{flush:b}
+{
+	int c;
+	int cm;
+	int old;
+	int t;
+	int zero;
+
+	if (globalreserve == 0)				//\lnlbl{flush:checkrsv}
+		return;					//\lnlbl{flush:return:n}
+	zero = merge_counterandmax(0, 0);		//\lnlbl{flush:initzero}
+	for_each_thread(t)				//\lnlbl{flush:loop:b}
+		if (counterp[t] != NULL) {		//\lnlbl{flush:checkp}
+			old = atomic_xchg(counterp[t], zero);//\lnlbl{flush:atmxchg}
+			split_counterandmax_int(old, &c, &cm);//\lnlbl{flush:split}
+			globalcount += c;		//\lnlbl{flush:glbcnt}
+			globalreserve -= cm;		//\lnlbl{flush:glbrsv}
+		}					//\lnlbl{flush:loop:e}
+}							//\lnlbl{flush:e}
+
+int add_count(unsigned long delta)			//\lnlbl{add:b}
+{
+	int c;
+	int cm;
+	int old;
+	int new;
+
+	do {						//\lnlbl{add:fast:b}
+		split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{add:split}
+		if (delta > MAX_COUNTERMAX || c + delta > cm)//\lnlbl{add:check}
+			goto slowpath;			//\lnlbl{add:goto}
+		new = merge_counterandmax(c + delta, cm);//\lnlbl{add:merge}
+	} while (atomic_cmpxchg_weak(&counterandmax,		//\lnlbl{add:atmcmpex}
+	                        old, new) != old);	//\lnlbl{add:loop:e}
+	return 1;					//\lnlbl{add:return:fs}
+slowpath:						//\lnlbl{add:slow:b}
+	spin_lock(&gblcnt_mutex);			//\lnlbl{add:acquire}
+	globalize_count();				//\lnlbl{add:globalize}
+	if (globalcountmax - globalcount -		//\lnlbl{add:checkglb:b}
+	    globalreserve < delta) {			//\lnlbl{add:checkglb:e}
+		flush_local_count();			//\lnlbl{add:flush}
+		if (globalcountmax - globalcount -	//\lnlbl{add:checkglb:nb}
+		    globalreserve < delta) {		//\lnlbl{add:checkglb:ne}
+			spin_unlock(&gblcnt_mutex);	//\lnlbl{add:release:f}
+			return 0;			//\lnlbl{add:return:sf}
+		}
+	}
+	globalcount += delta;				//\lnlbl{add:addglb}
+	balance_count();				//\lnlbl{add:balance}
+	spin_unlock(&gblcnt_mutex);			//\lnlbl{add:release:s}
+	return 1;					//\lnlbl{add:return:ss}
+}							//\lnlbl{add:e}
+
+int sub_count(unsigned long delta)			//\lnlbl{sub:b}
+{
+	int c;
+	int cm;
+	int old;
+	int new;
+
+	do {						//\lnlbl{sub:fast:b}
+		split_counterandmax(&counterandmax, &old, &c, &cm);
+		if (delta > c)
+			goto slowpath;
+		new = merge_counterandmax(c - delta, cm);
+	} while (atomic_cmpxchg_weak(&counterandmax,
+	                        old, new) != old);
+	return 1;					//\lnlbl{sub:fast:e}
+ slowpath:						//\lnlbl{sub:slow:b}
+	spin_lock(&gblcnt_mutex);
+	globalize_count();
+	if (globalcount < delta) {
+		flush_local_count();
+		if (globalcount < delta) {
+			spin_unlock(&gblcnt_mutex);
+			return 0;
+		}
+	}
+	globalcount -= delta;
+	balance_count();
+	spin_unlock(&gblcnt_mutex);
+	return 1;					//\lnlbl{sub:slow:e}
+}							//\lnlbl{sub:e}
+
+unsigned long read_count(void)				//\lnlbl{b}
+{
+	int c;
+	int cm;
+	int old;
+	int t;
+	unsigned long sum;
+
+	spin_lock(&gblcnt_mutex);			//\lnlbl{acquire}
+	sum = globalcount;				//\lnlbl{initsum}
+	for_each_thread(t)				//\lnlbl{loop:b}
+		if (counterp[t] != NULL) {
+			split_counterandmax(counterp[t], &old, &c, &cm);//\lnlbl{split}
+			sum += c;
+		}					//\lnlbl{loop:e}
+	spin_unlock(&gblcnt_mutex);			//\lnlbl{release}
+	return sum;					//\lnlbl{return}
+}							//\lnlbl{e}
+
+void count_init(void)
+{
+}
+
+static void balance_count(void)				//\lnlbl{balance:b}
+{
+	int c;
+	int cm;
+	int old;
+	unsigned long limit;
+
+	limit = globalcountmax - globalcount -
+	        globalreserve;
+	limit /= num_online_threads();
+	if (limit > MAX_COUNTERMAX)
+		cm = MAX_COUNTERMAX;
+	else
+		cm = limit;
+	globalreserve += cm;
+	c = cm / 2;
+	if (c > globalcount)
+		c = globalcount;
+	globalcount -= c;
+	old = merge_counterandmax(c, cm);
+	atomic_set(&counterandmax, old);		//\lnlbl{balance:atmcset}
+}							//\lnlbl{balance:e}
+
+void count_register_thread(void)			//\lnlbl{register:b}
+{
+	int idx = smp_thread_id();
+
+	spin_lock(&gblcnt_mutex);
+	counterp[idx] = &counterandmax;
+	spin_unlock(&gblcnt_mutex);
+}
+
+void count_unregister_thread(int nthreadsexpected)	//\lnlbl{unregister:b}
+{
+	int idx = smp_thread_id();
+
+	spin_lock(&gblcnt_mutex);
+	globalize_count();
+	counterp[idx] = NULL;
+	spin_unlock(&gblcnt_mutex);
+}
+
+void count_cleanup(void)
+{
+}
+
+#define NEED_REGISTER_THREAD
+#include "limtorture.h"
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()
  2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa
@ 2018-12-12 15:48   ` Junchang Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Junchang Wang @ 2018-12-12 15:48 UTC (permalink / raw)
  To: Akira Yokosawa, Paul E. McKenney; +Cc: perfbook



On 12/11/18 11:44 PM, Akira Yokosawa wrote:
> From 61a010b7c34c1b6c05b3883637ccdf2438409408 Mon Sep 17 00:00:00 2001
> From: Akira Yokosawa <akiyks@gmail.com>
> Date: Tue, 11 Dec 2018 21:14:31 +0900
> Subject: [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()
> 
> For comparison, add weak (spurious-failure permitting) ones in the
> names of cmpxchg_weak() and atomic_cmpxchg_weak().
> count_lim_atomic_weak.c is a copy of count_lim_atomic.c but uses
> atomic_cmpxchg_weak().
> 
> Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
> ---
>  CodeSamples/api-pthreads/api-gcc.h        |  14 ++
>  CodeSamples/count/Makefile                |   4 +
>  CodeSamples/count/count_lim_atomic_weak.c | 226 ++++++++++++++++++++++++++++++
>  3 files changed, 244 insertions(+)
>  create mode 100644 CodeSamples/count/count_lim_atomic_weak.c
> 
> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> index b66f4b9..b27facd 100644
> --- a/CodeSamples/api-pthreads/api-gcc.h
> +++ b/CodeSamples/api-pthreads/api-gcc.h
> @@ -178,6 +178,20 @@ static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>  	return cmpxchg(&v->counter, old, new);
>  }
>  
> +#define cmpxchg_weak(ptr, o, n) \
> +({ \
> +	typeof(*ptr) _____actual = (o); \
> +	typeof(*ptr) _____o = _____actual; \
> +	\
> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \
> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? _____o : _____o+1 ; \
> +})
> +

Hi Akira,

Quit an interesting topic. I did some experiments and please see below.

We changed the implementation of cmpxchg() heavily. So let's use the latest two versions---count_lim_atomic based on your latest patch [3/4] and count_lim_atomic_weak based on patch [4/4]---to compare apple to apple. I ran these two programs on an 8-core PPC server. The duration of each run is set to 2.4 seconds.

The first observation is that the result reported in your last mail (patch [0/4]) is correct; for each update operation, count_lim_atomic is about 10 nanoseconds faster than count_lim_atomic_weak, no matter how many concurrent updaters are running. For example, when I created 64 updaters, I got the follow results:

./count_lim_atomic 64 uperf
ns/update: 290

./count_lim_atomic_weak 64 uperf
ns/update: 301

Following is my analysis. I found that the first reason is because cmpxchg_weak is using a temporal variable _____o to avoid side-effects of (o). If we switch cmpxchg_weak to the older version (shown below. I know it has flaws, but let's forget about it temporarily :-)), count_lim_atomic_weak runs as fast as count_lim_atomic.

#define cmpxchg_weak(ptr, o, n) \
({ \
        typeof(*ptr) _____actual = (o); \
        \
        __atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 1, \
                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED) ? (o) : (n) ; \
})

The second possible reason could be more interesting. I discussed that in the reply to your mail [Patch 3/4].

Regards,
--Junchang



> +static __inline__ int atomic_cmpxchg_weak(atomic_t *v, int old, int new)
> +{
> +	return cmpxchg_weak(&v->counter, old, new);
> +}
> +
>  #define xchg(ptr, v) __atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST)
>  #define atomic_xchg(ptr, v) \
>  	__atomic_exchange_n(&(ptr)->counter, (v), __ATOMIC_SEQ_CST)
> diff --git a/CodeSamples/count/Makefile b/CodeSamples/count/Makefile
> index 481eb3f..52eb466 100644
> --- a/CodeSamples/count/Makefile
> +++ b/CodeSamples/count/Makefile
> @@ -22,6 +22,7 @@ PROGS =	count_atomic \
>  	count_lim \
>  	count_lim_app \
>  	count_lim_atomic \
> +	count_lim_atomic_weak \
>  	count_lim_sig \
>  	count_limd \
>  	count_nonatomic \
> @@ -65,6 +66,9 @@ count_lim_app: count_lim_app.c ../api.h limtorture.h
>  count_lim_atomic: count_lim_atomic.c ../api.h limtorture.h
>  	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic count_lim_atomic.c -lpthread
>  
> +count_lim_atomic_weak: count_lim_atomic_weak.c ../api.h limtorture.h
> +	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_atomic_weak count_lim_atomic_weak.c -lpthread
> +
>  count_lim_sig: count_lim_sig.c ../api.h limtorture.h
>  	$(CC) $(GCC_ARGS) $(CFLAGS) -o count_lim_sig count_lim_sig.c -lpthread
>  
> diff --git a/CodeSamples/count/count_lim_atomic_weak.c b/CodeSamples/count/count_lim_atomic_weak.c
> new file mode 100644
> index 0000000..42953f2
> --- /dev/null
> +++ b/CodeSamples/count/count_lim_atomic_weak.c
> @@ -0,0 +1,226 @@
> +/*
> + * count_lim_atomic_weak.c: simple limit counter that uses weak variant
> + *	of atomic operations to steal from other threads.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, you can access it online at
> + * http://www.gnu.org/licenses/gpl-2.0.html.
> + *
> + * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> + */
> +
> +#include "../api.h"
> +
> +static void balance_count(void);
> +
> +atomic_t __thread counterandmax = ATOMIC_INIT(0);	//\lnlbl{var:candmax}
> +unsigned long globalcountmax = 1 << 25;			//\lnlbl{var:def:b}
> +unsigned long globalcount = 0;
> +unsigned long globalreserve = 0;
> +atomic_t *counterp[NR_THREADS] = { NULL };
> +DEFINE_SPINLOCK(gblcnt_mutex);				//\lnlbl{var:def:e}
> +#define CM_BITS (sizeof(atomic_t) * 4)			//\lnlbl{var:CM_BITS}
> +#define MAX_COUNTERMAX ((1 << CM_BITS) - 1)		//\lnlbl{var:MAX_CMAX}
> +
> +static __inline__ void					//\lnlbl{split_int:b}
> +split_counterandmax_int(int cami, int *c, int *cm)
> +{
> +	*c = (cami >> CM_BITS) & MAX_COUNTERMAX;	//\lnlbl{split_int:msh}
> +	*cm = cami & MAX_COUNTERMAX;			//\lnlbl{split_int:lsh}
> +}							//\lnlbl{split_int:e}
> +
> +static __inline__ void					//\lnlbl{split:b}
> +split_counterandmax(atomic_t *cam, int *old, int *c, int *cm)//\lnlbl{split:func}
> +{
> +	unsigned int cami = atomic_read(cam);		//\lnlbl{split:int}
> +
> +	*old = cami;					//\lnlbl{split:old}
> +	split_counterandmax_int(cami, c, cm);		//\lnlbl{split:split_int}
> +}							//\lnlbl{split:e}
> +
> +static __inline__ int merge_counterandmax(int c, int cm)//\lnlbl{merge:b}
> +{
> +	unsigned int cami;
> +
> +	cami = (c << CM_BITS) | cm;			//\lnlbl{merge:merge}
> +	return ((int)cami);
> +}							//\lnlbl{merge:e}
> +
> +static void globalize_count(void)			//\lnlbl{globalize:b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +
> +	split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{globalize:split}
> +	globalcount += c;
> +	globalreserve -= cm;
> +	old = merge_counterandmax(0, 0);
> +	atomic_set(&counterandmax, old);
> +}							//\lnlbl{globalize:e}
> +
> +static void flush_local_count(void)			//\lnlbl{flush:b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +	int t;
> +	int zero;
> +
> +	if (globalreserve == 0)				//\lnlbl{flush:checkrsv}
> +		return;					//\lnlbl{flush:return:n}
> +	zero = merge_counterandmax(0, 0);		//\lnlbl{flush:initzero}
> +	for_each_thread(t)				//\lnlbl{flush:loop:b}
> +		if (counterp[t] != NULL) {		//\lnlbl{flush:checkp}
> +			old = atomic_xchg(counterp[t], zero);//\lnlbl{flush:atmxchg}
> +			split_counterandmax_int(old, &c, &cm);//\lnlbl{flush:split}
> +			globalcount += c;		//\lnlbl{flush:glbcnt}
> +			globalreserve -= cm;		//\lnlbl{flush:glbrsv}
> +		}					//\lnlbl{flush:loop:e}
> +}							//\lnlbl{flush:e}
> +
> +int add_count(unsigned long delta)			//\lnlbl{add:b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +	int new;
> +
> +	do {						//\lnlbl{add:fast:b}
> +		split_counterandmax(&counterandmax, &old, &c, &cm);//\lnlbl{add:split}
> +		if (delta > MAX_COUNTERMAX || c + delta > cm)//\lnlbl{add:check}
> +			goto slowpath;			//\lnlbl{add:goto}
> +		new = merge_counterandmax(c + delta, cm);//\lnlbl{add:merge}
> +	} while (atomic_cmpxchg_weak(&counterandmax,		//\lnlbl{add:atmcmpex}
> +	                        old, new) != old);	//\lnlbl{add:loop:e}
> +	return 1;					//\lnlbl{add:return:fs}
> +slowpath:						//\lnlbl{add:slow:b}
> +	spin_lock(&gblcnt_mutex);			//\lnlbl{add:acquire}
> +	globalize_count();				//\lnlbl{add:globalize}
> +	if (globalcountmax - globalcount -		//\lnlbl{add:checkglb:b}
> +	    globalreserve < delta) {			//\lnlbl{add:checkglb:e}
> +		flush_local_count();			//\lnlbl{add:flush}
> +		if (globalcountmax - globalcount -	//\lnlbl{add:checkglb:nb}
> +		    globalreserve < delta) {		//\lnlbl{add:checkglb:ne}
> +			spin_unlock(&gblcnt_mutex);	//\lnlbl{add:release:f}
> +			return 0;			//\lnlbl{add:return:sf}
> +		}
> +	}
> +	globalcount += delta;				//\lnlbl{add:addglb}
> +	balance_count();				//\lnlbl{add:balance}
> +	spin_unlock(&gblcnt_mutex);			//\lnlbl{add:release:s}
> +	return 1;					//\lnlbl{add:return:ss}
> +}							//\lnlbl{add:e}
> +
> +int sub_count(unsigned long delta)			//\lnlbl{sub:b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +	int new;
> +
> +	do {						//\lnlbl{sub:fast:b}
> +		split_counterandmax(&counterandmax, &old, &c, &cm);
> +		if (delta > c)
> +			goto slowpath;
> +		new = merge_counterandmax(c - delta, cm);
> +	} while (atomic_cmpxchg_weak(&counterandmax,
> +	                        old, new) != old);
> +	return 1;					//\lnlbl{sub:fast:e}
> + slowpath:						//\lnlbl{sub:slow:b}
> +	spin_lock(&gblcnt_mutex);
> +	globalize_count();
> +	if (globalcount < delta) {
> +		flush_local_count();
> +		if (globalcount < delta) {
> +			spin_unlock(&gblcnt_mutex);
> +			return 0;
> +		}
> +	}
> +	globalcount -= delta;
> +	balance_count();
> +	spin_unlock(&gblcnt_mutex);
> +	return 1;					//\lnlbl{sub:slow:e}
> +}							//\lnlbl{sub:e}
> +
> +unsigned long read_count(void)				//\lnlbl{b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +	int t;
> +	unsigned long sum;
> +
> +	spin_lock(&gblcnt_mutex);			//\lnlbl{acquire}
> +	sum = globalcount;				//\lnlbl{initsum}
> +	for_each_thread(t)				//\lnlbl{loop:b}
> +		if (counterp[t] != NULL) {
> +			split_counterandmax(counterp[t], &old, &c, &cm);//\lnlbl{split}
> +			sum += c;
> +		}					//\lnlbl{loop:e}
> +	spin_unlock(&gblcnt_mutex);			//\lnlbl{release}
> +	return sum;					//\lnlbl{return}
> +}							//\lnlbl{e}
> +
> +void count_init(void)
> +{
> +}
> +
> +static void balance_count(void)				//\lnlbl{balance:b}
> +{
> +	int c;
> +	int cm;
> +	int old;
> +	unsigned long limit;
> +
> +	limit = globalcountmax - globalcount -
> +	        globalreserve;
> +	limit /= num_online_threads();
> +	if (limit > MAX_COUNTERMAX)
> +		cm = MAX_COUNTERMAX;
> +	else
> +		cm = limit;
> +	globalreserve += cm;
> +	c = cm / 2;
> +	if (c > globalcount)
> +		c = globalcount;
> +	globalcount -= c;
> +	old = merge_counterandmax(c, cm);
> +	atomic_set(&counterandmax, old);		//\lnlbl{balance:atmcset}
> +}							//\lnlbl{balance:e}
> +
> +void count_register_thread(void)			//\lnlbl{register:b}
> +{
> +	int idx = smp_thread_id();
> +
> +	spin_lock(&gblcnt_mutex);
> +	counterp[idx] = &counterandmax;
> +	spin_unlock(&gblcnt_mutex);
> +}
> +
> +void count_unregister_thread(int nthreadsexpected)	//\lnlbl{unregister:b}
> +{
> +	int idx = smp_thread_id();
> +
> +	spin_lock(&gblcnt_mutex);
> +	globalize_count();
> +	counterp[idx] = NULL;
> +	spin_unlock(&gblcnt_mutex);
> +}
> +
> +void count_cleanup(void)
> +{
> +}
> +
> +#define NEED_REGISTER_THREAD
> +#include "limtorture.h"
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples
  2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
                   ` (3 preceding siblings ...)
  2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa
@ 2018-12-11 17:23 ` Paul E. McKenney
  4 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2018-12-11 17:23 UTC (permalink / raw)
  To: Akira Yokosawa; +Cc: Junchang Wang, perfbook

On Wed, Dec 12, 2018 at 12:39:38AM +0900, Akira Yokosawa wrote:
> Subject: [PATCH 0/4] Update definition of cmpxchg() under CodeSamples
> 
> Hi Paul and Junchang,
> 
> Based on the earlier correspondence, I prepared a patch set to update
> the definition of cmpxchg().
> 
> Patch #1 adds a simple litmus test of cmpxchg().
> Patch #2 fix cmpxchg() in .../litmus/api.h by specifying strong
> __atomic_compare_exchange_n)\(). Results of litmus tests are presented
> in the commit log.
> Patch #3 fix cmpxchg() in .../api-gcc.h in the same manner as patch #2.

I applied and pushed #1-3, thank you!

							Thanx, Paul

> Patch #4 is not intended to be applied to master, but provides a weak
> variant of cmpxchg() and count_lim_atomic_weak.c.
> 
> I did some experiments on PPC (10 runs each).
> 
> count_lim_atomic 1 uperf, ns/update (as of v2018.12.08a):
> 56.6572
> 56.4972
> 56.5504
> 56.4972
> 56.6305
> 56.5238
> 56.6038
> 56.5238
> 56.5504
> 56.5238
> 
> count_lim_atomic 1 uperf, ns/update (as of Patch #4):
> 38.3264
> 38.1073
> 38.2287
> 38.2287
> 38.2653
> 38.3142
> 38.1194
> 38.1801
> 38.4986
> 38.2531
> 
> count_lim_atomic_weak 1 uperf, ns/update (as of Patch #4):
> 47.3186
> 47.3747
> 47.4121
> 47.5248
> 47.3934
> 47.3373
> 47.3186
> 47.4121
> 47.6002
> 47.3186
> 
> So count_lim_atomic (strong semantics) is around 9ns/update
> faster than count_lim_atomic_weak, and is around 18ns/update
> faster than count_lim_atomic of v2018.12.08a.
> 
> This contradicts Junchang's view. The difference looks like
> due to removal of ternary operation in cmpxchg().
> 
> Or I might be missing something.
> 
> Junchang, can you please experiment on PPC and ARM?
> 
>         Thanks, Akira
> --
> Akira Yokosawa (4):
>   CodeSamples: Add C-cmpxchg.litmus
>   CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg()
>   CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>   [EXP] CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak()
> 
>  CodeSamples/api-pthreads/api-gcc.h         |  19 ++-
>  CodeSamples/count/Makefile                 |   4 +
>  CodeSamples/count/count_lim_atomic_weak.c  | 226 +++++++++++++++++++++++++++++
>  CodeSamples/formal/litmus/C-cmpxchg.litmus |  24 +++
>  CodeSamples/formal/litmus/api.h            |   2 +-
>  5 files changed, 272 insertions(+), 3 deletions(-)
>  create mode 100644 CodeSamples/count/count_lim_atomic_weak.c
>  create mode 100644 CodeSamples/formal/litmus/C-cmpxchg.litmus
> 
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-12-16  0:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-12-11 15:39 [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Akira Yokosawa
2018-12-11 15:40 ` [PATCH 1/4] CodeSamples: Add C-cmpxchg.litmus Akira Yokosawa
2018-12-11 15:42 ` [PATCH 2/4] CodeSamples/formal/litmus/api.h: Fix definition of cmpxchg() Akira Yokosawa
2018-12-11 15:42 ` [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h Akira Yokosawa
2018-12-12 16:01   ` Junchang Wang
2018-12-13 15:33     ` Akira Yokosawa
2018-12-14 14:32       ` Junchang Wang
2018-12-15 14:58         ` Akira Yokosawa
2018-12-16  0:55           ` Junchang Wang
2018-12-15 15:10       ` Akira Yokosawa
2018-12-15 19:37         ` Paul E. McKenney
2018-12-15 23:42           ` Akira Yokosawa
2018-12-16  0:54             ` Paul E. McKenney
2018-12-11 15:44 ` [PATCH 4/4] EXP CodeSamples: Add weak variant of cmpxchg() as cmpxchg_weak() Akira Yokosawa
2018-12-12 15:48   ` Junchang Wang
2018-12-11 17:23 ` [PATCH 0/4] Update definition of cmpxchg() under CodeSamples Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox