GCC built-in atomic operations and memory barriers

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* GCC built-in atomic operations and memory barriers
@ 2009-11-04 18:09 Toby Douglass
  2009-11-04 19:05 ` Russell King - ARM Linux
  0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-04 18:09 UTC (permalink / raw)
  To: linux-arm-kernel

Hi -

My first post.

I have written a (currently small) library of lock-free data structures
(www.liblfds.org).

I am porting to ARM.

The port is complete except for CAS.

I initially tried to use the GCC built-in CAS; in my test set, the third
test fails with a double free, because the head element of the freelist
being tested has come to point to itself.

So I figured I'd need to write my own CAS.  I learned enough ARM
assembly and did enough Googling to put something together.  I've done
the same already for x86 (I needed cmpxchg16b, so I couldn't use the GCC
built-ins) so I have a tiny bit of experience in it; and my attempt
fails in exactly the same way as the GCC built-in.

So I go to bed.  Next day I think, ah, memory barriers!  you don't need
to specify them on x86 for atomics, but I bet you do on ARM; and indeed,
you do.

I then discover in this mailing list an interesting thread, dated May
2009, which states that the GCC built-ins for ARM do not have memory
barriers.  The kernel I'm using is 2.6.28, which was released Christmas
Day 2008.

So I think the lack of memory barriers may well be why the code is failing.

This leads me to want to use smp_mb().  However, from what I can see,
this macro is only available via the linux kernel headers; it's not
available in user-mode.  Is this correct?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass
@ 2009-11-04 19:05 ` Russell King - ARM Linux
  2009-11-04 20:12   ` Toby Douglass
  2009-11-04 22:09   ` Gilles Chanteperdrix
  0 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-04 19:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote:
> This leads me to want to use smp_mb().  However, from what I can see,
> this macro is only available via the linux kernel headers; it's not
> available in user-mode.  Is this correct?

Correct.

And please understand that just because something exists in kernel land
does not give userland a right to make use of it.  (There are people
who believe it does.)

The kernel uses header files as a way to efficiently provide itself with
architecture specific optimized helpers, and these helpers may well only
work in kernel land.  Eg, traditionally, atomic operations on ARM
required interrupts to be disabled across them.  Userspace can not
disable interrupts, and so use of those functions in userland leads to
them becoming inherently non-atomic.

The kernel API is the syscall interface and associated data structures.
It does not include everything you can find in kernel headers.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 19:05 ` Russell King - ARM Linux
@ 2009-11-04 20:12   ` Toby Douglass
  2009-11-04 21:03     ` Russell King - ARM Linux
  2009-11-04 22:09   ` Gilles Chanteperdrix
  1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-04 20:12 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote:
>> This leads me to want to use smp_mb().  However, from what I can see,
>> this macro is only available via the linux kernel headers; it's not
>> available in user-mode.  Is this correct?
> 
> Correct.

Thanks.  It's often hard on the net to track down a negative answer.

[snip]

While we're talking about the GCC atomics...

This appears to be the current code for CAS;

  349        static inline unsigned long __cmpxchg(volatile void *ptr,
unsigned long old, unsigned long new, int size)

  352        unsigned long oldval, res;

[snip]

  382                do {
  383                        asm volatile("@ __cmpxchg4\n"
  384                        "       ldrex   %1, [%2]\n"
  385                        "       mov     %0, #0\n"
  386                        "       teq     %1, %3\n"
  387                        "       strexeq %0, %4, [%2]\n"
  388                                : "=&r" (res), "=&r" (oldval)
  389                                : "r" (ptr), "Ir" (old), "r" (new)
  390                                : "memory", "cc");
  391                } while (res);

The "mov %0, #0" - why is this inbetween the ldrex and strexeq?  it
seems to me it could just as well happen before the ldrex, and doing so
would reduce the time between the ldrex and strexeq and so reduce the
chance of someone else modifying our target.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 20:12   ` Toby Douglass
@ 2009-11-04 21:03     ` Russell King - ARM Linux
  2009-11-06 19:10       ` Toby Douglass
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-04 21:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote:
>  382                do {
>  383                        asm volatile("@ __cmpxchg4\n"
>  384                        "       ldrex   %1, [%2]\n"
>  385                        "       mov     %0, #0\n"
>  386                        "       teq     %1, %3\n"
>  387                        "       strexeq %0, %4, [%2]\n"
>  388                                : "=&r" (res), "=&r" (oldval)
>  389                                : "r" (ptr), "Ir" (old), "r" (new)
>  390                                : "memory", "cc");
>  391                } while (res);
>
> The "mov %0, #0" - why is this inbetween the ldrex and strexeq?  it
> seems to me it could just as well happen before the ldrex, and doing so
> would reduce the time between the ldrex and strexeq and so reduce the
> chance of someone else modifying our target.

Because it probably doesn't.  Loads normally have a 'result delay' which
cause a pipeline stall when the result is used in the next instruction.
It makes sense to fill those stall cycles with useful work, rather than
stalling the execution pipeline.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 21:03     ` Russell King - ARM Linux
@ 2009-11-06 19:10       ` Toby Douglass
  0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-06 19:10 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote:

[snip]

>> The "mov %0, #0" - why is this inbetween the ldrex and strexeq?  it
>> seems to me it could just as well happen before the ldrex, and doing so
>> would reduce the time between the ldrex and strexeq and so reduce the
>> chance of someone else modifying our target.
> 
> Because it probably doesn't.  Loads normally have a 'result delay' which
> cause a pipeline stall when the result is used in the next instruction.
> It makes sense to fill those stall cycles with useful work, rather than
> stalling the execution pipeline.

Thanks, Russell.  I was concerned there was something functionally 
important I'd missed with regard to the atomic op.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 19:05 ` Russell King - ARM Linux
  2009-11-04 20:12   ` Toby Douglass
@ 2009-11-04 22:09   ` Gilles Chanteperdrix
  2009-11-06 19:17     ` Toby Douglass
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
  1 sibling, 2 replies; 37+ messages in thread
From: Gilles Chanteperdrix @ 2009-11-04 22:09 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> The kernel API is the syscall interface and associated data structures.
> It does not include everything you can find in kernel headers.

Are the functions found in the vectors page part of the kernel API?
Because if that is so, then they provide a way to implement cmpxchg and
barriers in user-space.

-- 
					    Gilles.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* GCC built-in atomic operations and memory barriers
  2009-11-04 22:09   ` Gilles Chanteperdrix
@ 2009-11-06 19:17     ` Toby Douglass
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
  1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-06 19:17 UTC (permalink / raw)
  To: linux-arm-kernel

Gilles Chanteperdrix wrote:
> Russell King - ARM Linux wrote:
>> The kernel API is the syscall interface and associated data structures.
>> It does not include everything you can find in kernel headers.
> 
> Are the functions found in the vectors page part of the kernel API?
> Because if that is so, then they provide a way to implement cmpxchg and
> barriers in user-space.

GCC (I had overlooked this) has a function, __sync_synchronize(), which 
provides a full memory barrier.  I do not know how it is implemented on 
ARM, but the CAS I wrote certainly would not work (and did not work) 
without a memory barrier and with this function inserted, does work.

The GCC atomic CAS for ARM appears to call the OS's compare-exchange 
function, which certainly is broken, by lacking memory barriers, on the 
Linux I am developing on.

Either way, I now have my own platform independent ARM __asm__ for LL/CS 
CAS, with memory barriers, so I'm happy.  Also, I got my OpenGL wrapper 
DLL working last night :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-04 22:09   ` Gilles Chanteperdrix
  2009-11-06 19:17     ` Toby Douglass
@ 2009-11-21 15:21     ` Toby Douglass
  2009-11-23 15:08       ` Russell King - ARM Linux
                         ` (3 more replies)
  1 sibling, 4 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-21 15:21 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all -

I may be *utterly* wrong, and I'm expecting that someone here is simply 
going to look at what I'm about to say and explain to me what I've 
misunderstood, but I suspect it may be that atomic compare-and-swap is 
incorrectly implemented in the Linux kernel on ARM v6 and above (e.g. 
using ldrex and strex).

I'm looking at this code, for 2.6.31, which I believe is the latest 
released version;

http://lxr.linux.no/#linux+v2.6.31/arch/arm/include/asm/system.h

  382                do {
  383                        asm volatile("@ __cmpxchg4\n"
  384                        "       ldrex   %1, [%2]\n"
  385                        "       mov     %0, #0\n"
  386                        "       teq     %1, %3\n"
  387                        "       strexeq %0, %4, [%2]\n"
  388                                : "=&r" (res), "=&r" (oldval)
  389                                : "r" (ptr), "Ir" (old), "r" (new)
  390                                : "memory", "cc");
  391                } while (res);

This code is for a four byte CAS; the issue I'm about to describe exists 
similarly for the two byte and one byte CASs.

The issue is the ABA problem.

Imagine you have a populated stack.  Two threads come to pop.  Both read 
the head pointer and the next pointer of the top element and both 
simultaneously come to the point where the are about to CAS, comparing 
the head pointer with the top element pointer and if they match, 
replacing the head pointer with the next pointer of the top element - so 
by this, I mean both threads have just got into the do-while loop above, 
but have not yet executed ldrex.

Now, one thread pauses for a long time.

The other thread pops.  Then a bunch of the other threads pop and push 
and the stack is *totally* different.  Then our initial thread which 
popped *pushes* that original element back.

Finally, our second thread kicks into life.  Remember that his exchange 
pointer is now utterly incorrect, because the stack has completely 
changed.  Now, he's just inside the do-while loop.  He checks the top 
pointer - and lo and behold, it matches the head element.  However, 
because someone else has messed with the top pointer in the meantime, 
the CAS fails - strex refuses to swap, it can see from the shared TLB 
attribute that someone else has touched destination (the top pointer).

So we dodge ABA *there*.

The problem is *we then come round in the do-while loop again*.  We have 
*not* updated our exchange value.  So THIS second time around, we 
*repeat* our strex and we DO swap - and we just swapped in completely 
the wrong next pointer, from way back before the stack was totally 
changed by all the other threads popping and pushing.

So that's the problem.

A common way to solve ABA on contiguous double-word compare-and-swap 
architectures (e.g. x86, x64 and Itanium) is to use a pointer-counter 
pair.  This permits the caller to ensure swaps fail, even if the 
pointers all match, because the counters don't.

Load-linked/conditional-store architectures solve ABA by having the 
store fail if the destination has been touched since the load was performed.

Currently, the code appears to violate this, by repeating the CAS 
*anyway*.  In fact, the appropriate behaviour would seem to be *not* to 
loop, but rather, to issue the ldrex/strex *once*, and indicate to the 
user if the store succeed or failed.

This requires a prototype change, because the return value is the 
original value of the destination and so is unable to indicate, when 
returning that value, if it is returned from a successful or 
unsuccessful swap.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
@ 2009-11-23 15:08       ` Russell King - ARM Linux
  2009-11-23 19:10         ` Toby Douglass
  2009-11-23 15:13       ` Catalin Marinas
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 15:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Nov 21, 2009 at 04:21:00PM +0100, Toby Douglass wrote:
>  382                do {
>  383                        asm volatile("@ __cmpxchg4\n"
>  384                        "       ldrex   %1, [%2]\n"
>  385                        "       mov     %0, #0\n"
>  386                        "       teq     %1, %3\n"
>  387                        "       strexeq %0, %4, [%2]\n"
>  388                                : "=&r" (res), "=&r" (oldval)
>  389                                : "r" (ptr), "Ir" (old), "r" (new)
>  390                                : "memory", "cc");
>  391                } while (res);
>
> The problem is *we then come round in the do-while loop again*.  We have  
> *not* updated our exchange value.  So THIS second time around, we  
> *repeat* our strex and we DO swap - and we just swapped in completely  
> the wrong next pointer, from way back before the stack was totally  
> changed by all the other threads popping and pushing.

First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.

	ldrex	%1, [%2]
				%1 = *(u32 *)%2 (= 1)
	mov	%0, #0
				%0 = 0
	teq	%1, %3
				%3 == %1? (yes)
	strexeq	%0, %4, [%2]
				executed but because of the other access,
				exclusivity fails. *(u32 *)%2 not written
				and %0 = 1

So, res = 1, and we go around the loop again.  Lets say that *(u32 *)%2 = 2
now.

	ldrex	%1, [%2]
				%1 = *(u32 *)%2 (= 2)
	mov	%0, #0
				%0 = 0
	teq	%1, %3
				%3 == %1? (no)
	strexeq	%0, %4, [%2]
				not executed at all, %0 and *(u32 *)%2 untouched

So, res = 0 and we do _not_ repeat the loop and return "cmpxchg" failure.

I haven't had time to read all your email properly (due to the need to
get on a conference call), but please tell me where the problem is above
(using a similar worked example).

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 15:08       ` Russell King - ARM Linux
@ 2009-11-23 19:10         ` Toby Douglass
  2009-11-23 20:06           ` Russell King - ARM Linux
  0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 19:10 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.
> 
> 	ldrex	%1, [%2]
> 				%1 = *(u32 *)%2 (= 1)
> 	mov	%0, #0
> 				%0 = 0
> 	teq	%1, %3
> 				%3 == %1? (yes)
> 	strexeq	%0, %4, [%2]
> 				executed but because of the other access,
> 				exclusivity fails. *(u32 *)%2 not written
> 				and %0 = 1
> 
> So, res = 1, and we go around the loop again.  Lets say that *(u32 *)%2 = 2
> now.

No - we're dealing with the ABA problem.  We're assuming here that this 
thread gets to retry with the same values.

> I haven't had time to read all your email properly (due to the need to
> get on a conference call), but please tell me where the problem is above
> (using a similar worked example).

So; we go around again, load %2, do the teq, which succeeds, then the 
strexeq, which now succeeds since no-one else has touched %2.

This was the thrust of the original post; however, Catalin has raised 
arguments against it which I have not yet digested, so what I'm writing 
here, where it is simply an enlargement on the OP, has the same flaws; 
it's only in response to your specific point.  I'm not trying to assert 
this *is* what happens, in spite of what Catalin has written.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 19:10         ` Toby Douglass
@ 2009-11-23 20:06           ` Russell King - ARM Linux
  2009-11-23 20:34             ` Toby Douglass
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 20:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote:
> Russell King - ARM Linux wrote:
>> First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1.
>>
>> 	ldrex	%1, [%2]
>> 				%1 = *(u32 *)%2 (= 1)
>> 	mov	%0, #0
>> 				%0 = 0
>> 	teq	%1, %3
>> 				%3 == %1? (yes)
>> 	strexeq	%0, %4, [%2]
>> 				executed but because of the other access,
>> 				exclusivity fails. *(u32 *)%2 not written
>> 				and %0 = 1
>>
>> So, res = 1, and we go around the loop again.  Lets say that *(u32 *)%2 = 2
>> now.
>
> No - we're dealing with the ABA problem.  We're assuming here that this  
> thread gets to retry with the same values.
>
>> I haven't had time to read all your email properly (due to the need to
>> get on a conference call), but please tell me where the problem is above
>> (using a similar worked example).
>
> So; we go around again, load %2, do the teq, which succeeds, then the  
> strexeq, which now succeeds since no-one else has touched %2.
>
> This was the thrust of the original post; however, Catalin has raised  
> arguments against it which I have not yet digested, so what I'm writing  
> here, where it is simply an enlargement on the OP, has the same flaws;  
> it's only in response to your specific point.  I'm not trying to assert  
> this *is* what happens, in spite of what Catalin has written.

Well, I've thought it through quite a bit now, and I have an expansive
reply to your email.  In summary, there is nothing wrong with the
existing code; your use of it is the problem.

I can post the expansive reply if you need the details.

In short, consider what happens if you consider a slightly different order
of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg
but you haven't executed the first ldrex.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 20:06           ` Russell King - ARM Linux
@ 2009-11-23 20:34             ` Toby Douglass
  0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 20:34 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote:
>> This was the thrust of the original post; however, Catalin has raised  
>> arguments against it which I have not yet digested, so what I'm writing  
>> here, where it is simply an enlargement on the OP, has the same flaws;  
>> it's only in response to your specific point.  I'm not trying to assert  
>> this *is* what happens, in spite of what Catalin has written.
> 
> Well, I've thought it through quite a bit now, and I have an expansive
> reply to your email.  In summary, there is nothing wrong with the
> existing code; your use of it is the problem.
> 
> I can post the expansive reply if you need the details.

I would be very grateful if you could - although if it is extensive and 
well-understood, you might want to spare the list and email me directly.

I've been working every hour there is for the last three weekends trying 
to get CAS working properly on ARM under Linux; any fresh insight or 
knowledge from others may give me what I need to work out what I've 
mis-understood (before I go insane :-)

> In short, consider what happens if you consider a slightly different order
> of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg
> but you haven't executed the first ldrex.

This is, I think, what Catalin said - I'm grappling with this at the 
moment; I don't have a reply to it yet, positive or negative.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
  2009-11-23 15:08       ` Russell King - ARM Linux
@ 2009-11-23 15:13       ` Catalin Marinas
  2009-11-24 15:15         ` Toby Douglass
  2009-11-24 15:33         ` Toby Douglass
  2009-11-23 15:34       ` Catalin Marinas
  2009-11-23 22:28       ` Jamie Lokier
  3 siblings, 2 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-23 15:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
> I may be *utterly* wrong, and I'm expecting that someone here is simply
> going to look at what I'm about to say and explain to me what I've
> misunderstood, but I suspect it may be that atomic compare-and-swap is
> incorrectly implemented in the Linux kernel on ARM v6 and above (e.g.
> using ldrex and strex).

Well, you have to define "correctly" or you may just need a different
function for what you want. In this implementation, CAS is expected to
succeed if the the old value matches the memory one. The loop is present
to always guarantee that it will succeed if the condition matches.

Note that the exclusive monitor state is cleared (and hence STREX fails)
a every context switch or a return from an exception, i.e. interrupt).
> 
> The issue is the ABA problem.
[...]
> The problem is *we then come round in the do-while loop again*.  We have
> *not* updated our exchange value.  So THIS second time around, we
> *repeat* our strex and we DO swap - and we just swapped in completely
> the wrong next pointer, from way back before the stack was totally
> changed by all the other threads popping and pushing.

But the first thread (paused) one may may actually wait before the LDREX
(the old value was loaded with an LDR), so the LDREX/STREX pair would
succeed anyway. That's not really a solution unless you also use LDREX
for loading the old value to be passed to CAS. IOW, you need your own
implementation of what you are trying to achieve (and not modifying
CAS).

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 15:13       ` Catalin Marinas
@ 2009-11-24 15:15         ` Toby Douglass
  2009-11-24 15:36           ` Russell King - ARM Linux
  2009-11-25  1:24           ` Jamie Lokier
  2009-11-24 15:33         ` Toby Douglass
  1 sibling, 2 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 15:15 UTC (permalink / raw)
  To: linux-arm-kernel

Catalin Marinas wrote:
> On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
>> I may be *utterly* wrong, and I'm expecting that someone here is simply
>> going to look at what I'm about to say and explain to me what I've
>> misunderstood, but I suspect it may be that atomic compare-and-swap is
>> incorrectly implemented in the Linux kernel on ARM v6 and above (e.g.
>> using ldrex and strex).
> 
> Well, you have to define "correctly" or you may just need a different
> function for what you want.

Yeeessss-s-s-s-s-s...

But you can have a situation where you can choose between two 
implementations and although both could be 'correct', one offers more 
functionality than the other - and so is best choice.

I'm coming at this from the POV of implementing lock-free data 
structures.  Essentially, "plain CAS" (as on SPARC, which has neither 
LL/SC nor contiguous double word length CAS) makes implementation 
problematic because you have no straightforward solution to the ABA problem.

Contiguous double word length CAS permits a pointer-counter pair, which 
solves ABA.  LL/SC solves ABA by failing the store if the target has 
been modified since load (I didn't properly realise how this could be 
used to solve ABA until later in this post; there is, I think, a way - I 
describe it later).

Now, it is possible to implement CAS using LL/SC such that it behaves 
exactly like plain CAS - a la SPARC.  I believe this is what exists now.

However, I think it may also possible to implement using LL/SC such that 
ABA is avoided.  If a platform implements CAS using LL/SC *only* a plain 
CAS, then it makes life exceedingly difficult for lock-free data 
structures; and it does so by *not* implementing functionality which 
exists in the CPU which solves ABA!

> In this implementation, CAS is expected to
> succeed if the the old value matches the memory one. The loop is present
> to always guarantee that it will succeed if the condition matches.

Yes.  This is a "plain CAS" implementation.

Interestingly, I saw a day or two ago that there is a double-word 
version of LDREX.  The atomic-ptr project, which forms the basis of the 
GCC garbage collector, relies on this; it does not in fact use LL/SC on 
ARM, but rather uses double-word CAS using a pointer-counter pair.

> Note that the exclusive monitor state is cleared (and hence STREX fails)
> a every context switch or a return from an exception, i.e. interrupt).

Good.  I expected this to be so and it is important to me that it is; 
for by being so, it permits the use of these lock-free data structures 
in interrupt handlers.

>> The issue is the ABA problem.

>> The problem is *we then come round in the do-while loop again*.  We have
>> *not* updated our exchange value.  So THIS second time around, we
>> *repeat* our strex and we DO swap - and we just swapped in completely
>> the wrong next pointer, from way back before the stack was totally
>> changed by all the other threads popping and pushing.
> 
> But the first thread (paused) one may may actually wait before the LDREX
> (the old value was loaded with an LDR), so the LDREX/STREX pair would
> succeed anyway.

I can see my original description of the problem contains an error!  I 
described it first as "both threads pause before LDREX" and then later I 
wrote as if both had in fact executed LDREX.

In fact, what you describe here in the paragraph above is what I 
described in the OP as occurring in the second iteration of the while 
loop (when in fact it can just as well occur in the first iteration, as 
you have here described); namely, you're experiencing ABA - your 
exchange pointer is wrong, but you don't realise it, because the 
destination matches your compare.

So, now I'm puzzled.  I was under the impression it was possible to use 
LL/SC to solve ABA.

Ah - wait - perhaps I see how.  The problem is that the exchange pointer 
is loaded before we become able to see that the destination has changed.

In fact, what's needed here (in this stack example) is that you *load 
the exchange pointer inside the ldrex/strex pair*.

> That's not really a solution unless you also use LDREX
> for loading the old value to be passed to CAS.

Man, you're just way ahead of me :-)  the coffee at ARM must be *really* 
good :-)

But I think you don't need to use LDREX; it's enough just to LDR inside 
the LDREX/STREX pair used on destination.  (This would mean in the CAS 
prototype taking a pointer to the exchange value).

This would mean that if the destination was changed by someone else, 
you'd never use the exchange value; but if it was not, you'd be using 
the current value.

However, this assumes the caller has a linkage between the exchange 
value and the destination; e.g. that if the destination is modified by 
someone else, the exchange must now be wrong.

This I would expect is not always the case and such an implementation 
would no longer be lowest common denominator.

But I wonder...

Imagine we had a CAS where the compare and exchange are passed in as 
pointers and we load what they point to after the LDREX on destination.

What do we have now?

We have a CAS where the exchange and compare are guaranteed to be only 
those which exist continuously and without modification between the 
loading, compare and potential set of the destination.

It is perhaps offering stronger guarantees than necessary in all cases - 
but it actually sounds like a *real* CAS; it says that when we enter the 
CAS, we WILL have the values for compare and exchange which are correct 
at that moment, rather than using historical values.

Is this wrong?  or is it better than x86 style cmpxchg?  can we do 
everything with this that we could do with a CAS which swaps historical 
values?

> IOW, you need your own
> implementation of what you are trying to achieve (and not modifying
> CAS).

On one hand, plain CAS is the basic functionality.  It is all you need 
for some things and indeed, I can imagine for some things, it is 
orthogonal; it is *what you must have* and so it would need to exist.

However, for many things you want an ABA-safe CAS.  I would go as so far 
as to say is a *second* basic functionality, since there is I think more 
use for this than for a non-ABA safe CAS.

Windows offers this simply by offering double-word CAS (and requiring 
developers to use it as a pointer-counter pairing, which is reasonable, 
since you might want to use double-word really as double-word, so it has 
other uses).

ARM offers that as well, but I believe there is no double-word CAS 
support in the kernel.  That would be one route (and indeed it seems odd 
that the current code supports in a switch one, two and four byte CAS, 
but not eight - perhaps is it not available on all >= v6 CPUs?).

The other would be to use the LL/SC functionality as described above. 
The user of the current CAS cannot implement this behaviour on top of 
the CAS that currently exists; he would have to write his own assembly.

My feeling in this, for what it's worth, is that this seems an excessive 
requirement for something which is necessary for lock-free data 
structures; they are becoming important as the number of cores in SMP 
systems continues to double.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 15:15         ` Toby Douglass
@ 2009-11-24 15:36           ` Russell King - ARM Linux
  2009-11-24 16:20             ` Toby Douglass
                               ` (2 more replies)
  2009-11-25  1:24           ` Jamie Lokier
  1 sibling, 3 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-24 15:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
> So, now I'm puzzled.  I was under the impression it was possible to use  
> LL/SC to solve ABA.

Well, what you can do is:

	LL
	compute new value
	SC

The problem for LL/SC architectures is that this creates a big region
where we hope that no one else performs another 'load locked' operation.

LL/SC based architectures doing this traditionally fall fowl of a problem.
Consider the following:

do {
	LL
	complex computation to new value
	SC
} while (store failed)

and have that running on two CPUs trying to modify the same location.  It
can hang both CPUs up in that loop for a _very_ long time.  This is
primerily why we've ended up with a CAS implementation, and everything
is implemented as:

	do {
		load current
		compute new value
		do {
			LL old
			old == current?
				SC new
		} while SC failed
	} while old != current

(I argued initially for a LL / SC interface but the arguments I was making
were shot down by Linus et.al. as completely unreasonable and technically
idiotic.)

Actually, I'm rather surprised that our existing LDREX/STREX implementations
don't suffer from this problem.  Consider two CPUs operating in lock-step:

	CPU0		CPU1
	ldrex
	add		ldrex
	strex (fails)	add
	repeat		strex (fails)
	ldrex		repeat
	add		ldrex
	strex (fails)	add
	repeat		strex (fails)
	...		repeat
			...

In this scenario, the loops _never_ terminate.  The usual solution is to
add delays into the loop which depend on the CPU number, thus ensuring
that two CPUs execute a different number of instructions each time around
the loop.

> My feeling in this, for what it's worth, is that this seems an excessive  
> requirement for something which is necessary for lock-free data  
> structures; they are becoming important as the number of cores in SMP  
> systems continues to double.

Hasn't the lock-free issue been solved by things like rcu?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 15:36           ` Russell King - ARM Linux
@ 2009-11-24 16:20             ` Toby Douglass
  2009-11-24 16:27             ` Catalin Marinas
  2009-11-24 17:14             ` Toby Douglass
  2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 16:20 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
>> So, now I'm puzzled.  I was under the impression it was possible to use  
>> LL/SC to solve ABA.
> 
> Well, what you can do is:
> 
> 	LL
> 	compute new value
> 	SC
> 
> The problem for LL/SC architectures is that this creates a big region
> where we hope that no one else performs another 'load locked' operation.

Yes; live-lock.

> LL/SC based architectures doing this traditionally fall fowl of a problem.
> Consider the following:
> 
> do {
> 	LL
> 	complex computation to new value
> 	SC
> } while (store failed)
> 
> and have that running on two CPUs trying to modify the same location.  It
> can hang both CPUs up in that loop for a _very_ long time.

Yes, but this complex computation; in all of the data structures which 
can be implemented using a lock-free algorithm, the only work that is 
done is loading an exchange value and a compare value.

> (I argued initially for a LL / SC interface but the arguments I was making
> were shot down by Linus et.al. as completely unreasonable and technically
> idiotic.)

Were they?

> Actually, I'm rather surprised that our existing LDREX/STREX implementations
> don't suffer from this problem.  Consider two CPUs operating in lock-step:
> 
> 	CPU0		CPU1
> 	ldrex
> 	add		ldrex
> 	strex (fails)	add
> 	repeat		strex (fails)
> 	ldrex		repeat
> 	add		ldrex
> 	strex (fails)	add
> 	repeat		strex (fails)
> 	...		repeat
> 			...
> 
> In this scenario, the loops _never_ terminate.  The usual solution is to
> add delays into the loop which depend on the CPU number, thus ensuring
> that two CPUs execute a different number of instructions each time around
> the loop.

Yes.  Algorithmic back-off is the standard solution.  For light use, 
it's excessive.  TBH, I doubt many are using this code, since the kernel 
version of it didn't have memory barriers until recently and GCC's 
__sync_synchronize() does nothing on ARM.

>> My feeling in this, for what it's worth, is that this seems an excessive  
>> requirement for something which is necessary for lock-free data  
>> structures; they are becoming important as the number of cores in SMP  
>> systems continues to double.
> 
> Hasn't the lock-free issue been solved by things like rcu?

I would say yes and no.  Memory handling adds significant complexity to 
a lock-free algorithm.  For some algorithms, it's not necessary, for 
others you can get by without it at a cost (using a freelist, for 
example).  It's a trade-off, rather than an outright solution.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 15:36           ` Russell King - ARM Linux
  2009-11-24 16:20             ` Toby Douglass
@ 2009-11-24 16:27             ` Catalin Marinas
  2009-11-24 17:14             ` Toby Douglass
  2 siblings, 0 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 16:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2009-11-24 at 15:36 +0000, Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote:
> > So, now I'm puzzled.  I was under the impression it was possible to use 
> > LL/SC to solve ABA.
> 
> Well, what you can do is:
> 
>         LL
>         compute new value
>         SC
> 
> The problem for LL/SC architectures is that this creates a big region
> where we hope that no one else performs another 'load locked' operation.

As I replied to Jamie, the long computation above may introduce some
live-lock situation on some hardware implementations (I think in general
it needs to avoid cache line evictions).

> (I argued initially for a LL / SC interface but the arguments I was making
> were shot down by Linus et.al. as completely unreasonable and technically
> idiotic.)

Maybe because other architectures don't allow for many instructions
between LL and SC.

> Actually, I'm rather surprised that our existing LDREX/STREX implementations
> don't suffer from this problem.  Consider two CPUs operating in lock-step:
> 
>         CPU0            CPU1
>         ldrex
>         add             ldrex
>         strex (fails)   add
>         repeat          strex (fails)
>         ldrex           repeat
>         add             ldrex
>         strex (fails)   add
>         repeat          strex (fails)
>         ...             repeat
>                         ...
> 
> In this scenario, the loops _never_ terminate.  The usual solution is to
> add delays into the loop which depend on the CPU number, thus ensuring
> that two CPUs execute a different number of instructions each time around
> the loop.

I thought we already have a hardware solution or this. There was an
erratum for ARM11MPCore r0p0 where we had to add the delay in software
but it's no longer needed.

> > My feeling in this, for what it's worth, is that this seems an excessive 
> > requirement for something which is necessary for lock-free data 
> > structures; they are becoming important as the number of cores in SMP 
> > systems continues to double.
> 
> Hasn't the lock-free issue been solved by things like rcu?

RCU allows only one writer (and multiple readers). The completely
lock-free algorithm would allow multiple writers (multiple threads
popping from a stack in Toby's original problem).

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 15:36           ` Russell King - ARM Linux
  2009-11-24 16:20             ` Toby Douglass
  2009-11-24 16:27             ` Catalin Marinas
@ 2009-11-24 17:14             ` Toby Douglass
  2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 17:14 UTC (permalink / raw)
  To: linux-arm-kernel

I wrote:
 > Russell King - ARM Linux wrote:

 >> (I argued initially for a LL / SC interface but the arguments
 >> I was making were shot down by Linus et.al. as completely
 >> unreasonable and technically idiotic.)

 > Were they?

My question is I think in hindsight misinterpretable.

What I mean is; you're a reasonable and intelligent bloke.  You have 
your views in this matter and Linus et al have theirs; they thought what 
you said was crazy.  Having heard them out, what did you think of what 
they said?  were there valid arguments in there?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 15:15         ` Toby Douglass
  2009-11-24 15:36           ` Russell King - ARM Linux
@ 2009-11-25  1:24           ` Jamie Lokier
  2009-11-26 16:14             ` Toby Douglass
  1 sibling, 1 reply; 37+ messages in thread
From: Jamie Lokier @ 2009-11-25  1:24 UTC (permalink / raw)
  To: linux-arm-kernel

Toby Douglass wrote:
> Catalin Marinas wrote:
> Interestingly, I saw a day or two ago that there is a double-word 
> version of LDREX.  The atomic-ptr project, which forms the basis of the 
> GCC garbage collector, relies on this; it does not in fact use LL/SC on 
> ARM, but rather uses double-word CAS using a pointer-counter pair.

Note that pointer-counter is not really reliable.

If a thread is suspended for the time taken for exactly 2^32 operations
by other threads, it fails.  (Assuming 32-bit pointer + 32-bit counter).

That's unlikely, but not impossible.

Consider a program with 100 threads and one CPU, each thread in a
tight loop doing atomic_op(&p) and some other things.  It's quite
reasonable for 1 thread to be descheduled for the amount of time taken
for 2^32 iterations from all the other threads, if the pointer-counter
CAS operation is fast, which it increasingly is.

Remember, this is one CPU so it's in L1 cache, and modern CPUs execute
2^32 operations in a single second.  CAS operations are not
single-cycle _yet_, but they continue to get faster.

With 100 threads running without sleeping, the kernel may well
deschedule one of them for several seconds as it cycles through them
all and gives them "non-interactive" timeslices.

It's also quite likely to wrap if someone used kill -STOP, and later
kill -CONT on a thread, or if they're debugging one of the threads.

Of course even if the counter wraps, you still have to be 1/2^32
unlucky to see the same value.  But that's enough to make it unreliable.

-- Jamie

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-25  1:24           ` Jamie Lokier
@ 2009-11-26 16:14             ` Toby Douglass
  2009-11-27  1:37               ` Jamie Lokier
  0 siblings, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-26 16:14 UTC (permalink / raw)
  To: linux-arm-kernel

Jamie Lokier wrote:
> Toby Douglass wrote:

>> Interestingly, I saw a day or two ago that there is a double-word 
>> version of LDREX.  The atomic-ptr project, which forms the basis of the 
>> GCC garbage collector, relies on this; it does not in fact use LL/SC on 
>> ARM, but rather uses double-word CAS using a pointer-counter pair.

> Note that pointer-counter is not really reliable.

[snip case]

> Of course even if the counter wraps, you still have to be 1/2^32
> unlucky to see the same value.  But that's enough to make it unreliable.

I concur, but it's not a problem on 64-bit.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-26 16:14             ` Toby Douglass
@ 2009-11-27  1:37               ` Jamie Lokier
  0 siblings, 0 replies; 37+ messages in thread
From: Jamie Lokier @ 2009-11-27  1:37 UTC (permalink / raw)
  To: linux-arm-kernel

Toby Douglass wrote:
> Jamie Lokier wrote:
> >Note that pointer-counter is not really reliable.
> 
> [snip case]
> 
> >Of course even if the counter wraps, you still have to be 1/2^32
> >unlucky to see the same value.  But that's enough to make it unreliable.
> 
> I concur, but it's not a problem on 64-bit.

Agree, about 64-bit.  (Until we have much faster CPUs :-)

It's occurred to me that it's possible to make pointer-counter safe.

Simply, when about to wrap, send a signal to every thread.  The signal
handler should check if the thread it has interrupted is inside a
pointer-counter critical section, and if yes, force the interrupted
code to synchronise with all the other threads in another way, such as
using a mutex.

-- Jamie

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 15:13       ` Catalin Marinas
  2009-11-24 15:15         ` Toby Douglass
@ 2009-11-24 15:33         ` Toby Douglass
  1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 15:33 UTC (permalink / raw)
  To: linux-arm-kernel

I wrote:
> We have a CAS where the exchange and compare are guaranteed to be only those which exist continuously and without modification between the loading, compare and potential set of the destination. 

Ah, this isn't true.  The compare and exchange values could be modified 
after their LDR but before the STREX.  Apologies - basic mistake :-/

So in fact what you have is a guarantee you will only use values for 
compare and exchange which exist *after* you've loaded your destination.

I don't think this is useful for anything; it's no different to having 
them loaded before the LDREX.  What's more, I think I can do this now 
anyway, in GCC; pass pointers in and in the assembly, use "*exchange" 
rather than "exchange".

I understand now what Catalin means about using LDREX to load them as 
well (which you can't do, I believe; you can't nest LDREX on one core).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
  2009-11-23 15:08       ` Russell King - ARM Linux
  2009-11-23 15:13       ` Catalin Marinas
@ 2009-11-23 15:34       ` Catalin Marinas
  2009-11-23 16:40         ` Toby Douglass
  2009-11-23 22:28       ` Jamie Lokier
  3 siblings, 1 reply; 37+ messages in thread
From: Catalin Marinas @ 2009-11-23 15:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote:
> Load-linked/conditional-store architectures solve ABA by having the
> store fail if the destination has been touched since the load was performed.

So, for your problem, CAS wouldn't solve it on any architecture, not
just ARM if it is implemented with LL/SC.

Something like below may work (untested and haven't put much thought
into it). Basically you need atomicity between the moment you first get
the head pointer and the moment you update it. CAS doesn't ensure this
since the first load of the [head] pointer would be standard LDR (even
if it is LDREX, you get another LDREX which resets the link with the
STREX):

1:	LDREX	orig_head, [head]
	LDR	orig_next, [orig_head, #next]
	STREX	res, orig_next, [head]
	cmp	res, #0
	bne	1b

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 15:34       ` Catalin Marinas
@ 2009-11-23 16:40         ` Toby Douglass
  0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-23 16:40 UTC (permalink / raw)
  To: linux-arm-kernel

Catalin, Russell - thankyou for your replies.  I'm looking at them now; 
some thought is required :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
                         ` (2 preceding siblings ...)
  2009-11-23 15:34       ` Catalin Marinas
@ 2009-11-23 22:28       ` Jamie Lokier
  2009-11-23 23:13         ` Russell King - ARM Linux
                           ` (2 more replies)
  3 siblings, 3 replies; 37+ messages in thread
From: Jamie Lokier @ 2009-11-23 22:28 UTC (permalink / raw)
  To: linux-arm-kernel

Toby Douglass wrote:
> Load-linked/conditional-store architectures solve ABA by having the 
> store fail if the destination has been touched since the load was performed.
> 
> Currently, the code appears to violate this, by repeating the CAS 
> *anyway*.  In fact, the appropriate behaviour would seem to be *not* to 
> loop, but rather, to issue the ldrex/strex *once*, and indicate to the 
> user if the store succeed or failed.

I believe Catalin's explained why it does not work even doing
LDREX/STREX once, because the thread can pause before the LDREX.  So
you must begin fetching pointers after the LDREX.

(At least I think so.  I'm prepared to be shown to be wrong :-)

If you're writing code intended for other LL/SC architectures too, and
following Catalin's suggestion to put LDR between LDREX and STREX,
then you might have to check if the other architectures permit loads
between the LL and SC.

> This requires a prototype change, because the return value is the 
> original value of the destination and so is unable to indicate, when 
> returning that value, if it is returned from a successful or 
> unsuccessful swap.

Nonetheless, such a prototype change might be an improvement anyway.

Some platforms provide compare_and_swap_bool() operations, which do as
you describe: Compare, conditionally store, and return bool to indicate
success.  No loop.

That could be an improvement for some algorithms, because often if the
store doesn't happen, the *inputs* to compare_and_swap() need to be
recalculated anyway before another try is likely to succeed.  What's
the point in having code which expands to two loops:

    do {
       old = get_something;
       new = calc_something;

       /* oldval = compare_and_swap(ptr, old, new); */
       do {
           __asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new));
       } while (failed && oldval == old);

    } while (oldval != old);

When it can often be a smaller loop, which probably executes a little
faster too in various cases:

    do {
       old = get_something;
       new = calc_something;

       /* oldval = compare_and_swap_bool(ptr, old, new); */
       __asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new));

    } while (failed);

-- Jamie

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 22:28       ` Jamie Lokier
@ 2009-11-23 23:13         ` Russell King - ARM Linux
  2009-11-24  1:32           ` Jamie Lokier
  2009-11-24  9:38           ` Toby Douglass
  2009-11-24 15:59         ` Catalin Marinas
  2009-11-24 16:34         ` Toby Douglass
  2 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-23 23:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> That could be an improvement for some algorithms, because often if the
> store doesn't happen, the *inputs* to compare_and_swap() need to be
> recalculated anyway before another try is likely to succeed.  What's
> the point in having code which expands to two loops:

The point is that often its cheaper to retry the LL/SC than it is to
do the recalculation.

However, I don't think you've understood the original problem at all.
The issue is that for a particular 32-bit word, the behaviour required
is:

 - if no change occurs between the _original_ read and the atomic update,
   then go ahead with the update
 - if any change has occured, do not update, but go back and recalculate
   from the beginning.

This is because, even if the value at the pointer is restored back to
the numerically identical value it had at the start, the requirement is
for the store to fail.

No amount of loops (or lack of loops) solves that, even if you have a
proper single atomic 32-bit CAS instruction.

You need to be far more clever, and there's some hints on wikipedia
about possible solutions.  They assume that you can CAS more bits than
your intended value though, which we don't have.

I believe there is a solution to the ABA problem using atomic operations
with one 32-bit word and a separate counter to identify updates, but it
requires a little more time than I have available this evening to fully
put together.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 23:13         ` Russell King - ARM Linux
@ 2009-11-24  1:32           ` Jamie Lokier
  2009-11-24 11:19             ` Catalin Marinas
  2009-11-24  9:38           ` Toby Douglass
  1 sibling, 1 reply; 37+ messages in thread
From: Jamie Lokier @ 2009-11-24  1:32 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> > That could be an improvement for some algorithms, because often if the
> > store doesn't happen, the *inputs* to compare_and_swap() need to be
> > recalculated anyway before another try is likely to succeed.  What's
> > the point in having code which expands to two loops:
> 
> The point is that often its cheaper to retry the LL/SC than it is to
> do the recalculation.
> 
> However, I don't think you've understood the original problem at all.

I think I have - I agreed with you and Catalin already that LL/SC does
not suffice.  But do you mean that Catalin's suggestion to put the
LDREX before the LDR does not work either?  (Maybe it needs a barrier
too).

It probably looks like I didn't understand because I've mixed two
quite different issues in the same mail (same in this one), because I
think the CAS_BOOL has merit.  Having implemented both variants in
userspace, it's actually quite annoying to use CAS and have to
remember the input and compare it with the output sometimes.

For some algorithms, you're right that it can be cheaper to retry the
LL/SC.  But for some algorithms you know the retry will never succeed
and is a waste of time and code space, e.g. ones which do CAS on a
counter which is always incremented.  It makes the caller a bit more
long-winded too.

The final argument is: you can implement CAS in terms of CAS_BOOL, but
you can't do it the other way.  Because everyone copied x86's CAS at
the begining, that's the one we got.

-- Jamie

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24  1:32           ` Jamie Lokier
@ 2009-11-24 11:19             ` Catalin Marinas
  2009-11-24 22:24               ` Toby Douglass
  2009-11-24 22:34               ` Toby Douglass
  0 siblings, 2 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 11:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote:
> > > That could be an improvement for some algorithms, because often if the
> > > store doesn't happen, the *inputs* to compare_and_swap() need to be
> > > recalculated anyway before another try is likely to succeed.  What's
> > > the point in having code which expands to two loops:
> >
> > The point is that often its cheaper to retry the LL/SC than it is to
> > do the recalculation.
> >
> > However, I don't think you've understood the original problem at all.
> 
> I think I have - I agreed with you and Catalin already that LL/SC does
> not suffice.  But do you mean that Catalin's suggestion to put the
> LDREX before the LDR does not work either?  (Maybe it needs a barrier
> too).

It definitely needs a barrier after the LDREX and maybe one after STREX
but that depends on the semantics of such operation.

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 11:19             ` Catalin Marinas
@ 2009-11-24 22:24               ` Toby Douglass
  2009-11-25 11:11                 ` Catalin Marinas
  2009-11-24 22:34               ` Toby Douglass
  1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 22:24 UTC (permalink / raw)
  To: linux-arm-kernel

Catalin Marinas wrote:
> On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
>> Russell King - ARM Linux wrote:

>>> However, I don't think you've understood the original problem at all.

>> I think I have - I agreed with you and Catalin already that LL/SC does
>> not suffice.  But do you mean that Catalin's suggestion to put the
>> LDREX before the LDR does not work either?  (Maybe it needs a barrier
>> too).

> It definitely needs a barrier after the LDREX and maybe one after STREX
> but that depends on the semantics of such operation.

In the latest kernel, there is a memory barrier (the DMB instruction on 
v7 and above) immediately before and immediately after the call to the 
CAS function.

I thought about this a little.  If the memory barrier is immediately 
before and given the next instruction is the LDREX, *all* threads coming 
to the LDREX *must* have preceeding them a DMB and so be up to date on 
memory, regardless of pauses in thread execution.

This however is in conflict with Catalin's comments.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 22:24               ` Toby Douglass
@ 2009-11-25 11:11                 ` Catalin Marinas
  2009-11-25 18:57                   ` Toby Douglass
  0 siblings, 1 reply; 37+ messages in thread
From: Catalin Marinas @ 2009-11-25 11:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2009-11-24 at 22:24 +0000, Toby Douglass wrote:
> Catalin Marinas wrote:
> > On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote:
> >> Russell King - ARM Linux wrote:
> 
> >>> However, I don't think you've understood the original problem at all.
> 
> >> I think I have - I agreed with you and Catalin already that LL/SC does
> >> not suffice.  But do you mean that Catalin's suggestion to put the
> >> LDREX before the LDR does not work either?  (Maybe it needs a barrier
> >> too).
> 
> > It definitely needs a barrier after the LDREX and maybe one after STREX
> > but that depends on the semantics of such operation.
> 
> In the latest kernel, there is a memory barrier (the DMB instruction on
> v7 and above) immediately before and immediately after the call to the
> CAS function.
> 
> I thought about this a little.  If the memory barrier is immediately
> before and given the next instruction is the LDREX, *all* threads coming
> to the LDREX *must* have preceeding them a DMB and so be up to date on
> memory, regardless of pauses in thread execution.
> 
> This however is in conflict with Catalin's comments.

My comment was to a new implementation for the ABA issue using LDR for
the next pointer after LDREX and these would need to be separated by a
barrier. Looking at this a bit more, it isn't actually needed since the
subsequent LDR uses the pointer read by LDREX and thus creating an
address dependency ("the value returned by a read access is used to
compute the virtual address of a subsequent read or write access") which
implies strict ordering between LDREX and LDR (A3.8.2 in the ARM ARM).
But as I said, LDR between LDREX/STREX may no always work.

You may need barriers before and after you routine as we do in the CAS
case but it depends how you define your routine, you may ask the caller
of such routine to add the barriers explicitly.

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-25 11:11                 ` Catalin Marinas
@ 2009-11-25 18:57                   ` Toby Douglass
  0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-25 18:57 UTC (permalink / raw)
  To: linux-arm-kernel

So, I'm thinking now; where does this leave us?

We see that LDREX/SDREX cannot be used on their own in a single word CAS 
to prevent ABA since we cannot rely on using LDR inside an LDREX/SDREX 
pair.  (You guys knew this; I'm just stating it here to lead in).

This means you must use a pointer-counter pair (with its potential for 
failure) or memory handling; the same situation as we find on x86 and 
x64, for example.  That's okay - not perfect, but we're not deficient 
compared to other common platforms.

I think however it leaves open one question; do we wish for the 
LDREX/SDREX CAS to fail if destination has been modified, or to retry, 
as it does now.

 From a lock-free data structure point of view, failure on modification 
is more efficient, since we will always need to recompute our exchange 
value; going to the extra trouble of forcing the CAS (which will then 
fail anyway, because the counter part of the pointer-counter pair will 
not match) is a waste of time.

The behaviour we have now can be easily implemented (a loop) on top of 
the non-retry CAS; but the non-retry CAS behaviour cannot be implemented 
on top of what we have now.

However, the world and his dog expects x86/x64 style behaviour - we can 
do better, but it may not be what users expect.

Ah - there is one other issue - it seems reasonable that the CAS that 
exists now should be extended to support 8 byte swaps, since it already 
supports 1, 2 and 4 byte swaps.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 11:19             ` Catalin Marinas
  2009-11-24 22:24               ` Toby Douglass
@ 2009-11-24 22:34               ` Toby Douglass
  2009-11-24 22:56                 ` Russell King - ARM Linux
  1 sibling, 1 reply; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 22:34 UTC (permalink / raw)
  To: linux-arm-kernel

I wrote:
> I thought about this a little.  If the memory barrier is immediately 
> before and given the next instruction is the LDREX, *all* threads coming 
> to the LDREX *must* have preceeding them a DMB and so be up to date on 
> memory, regardless of pauses in thread execution.

Additionally; the DMB afterwards seemed to have no value.  You could 
perform the STREX and then your thread could pause indefinitely and were 
you in a situation where you store was not immediately visible or 
correctly ordered to another thread, that thread would then read the old 
value.

A more common example would be that another thread is reading the 
destination value for some reason, and reads after the STREX and before 
the trailing DMB.

This implies you need a DMB as the instruction immediately preceding 
every read and every write of destination.  On x86/x64, all the atomic 
ops (e.g. the LOCK prefix) act as full memory barriers.

Note that I'm not yet fully conversant with the issues in multi-core 
threading, so I may be writing rubbish :-)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 22:34               ` Toby Douglass
@ 2009-11-24 22:56                 ` Russell King - ARM Linux
  2009-11-25  0:34                   ` Toby Douglass
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2009-11-24 22:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote:
> I wrote:
>> I thought about this a little.  If the memory barrier is immediately  
>> before and given the next instruction is the LDREX, *all* threads 
>> coming to the LDREX *must* have preceeding them a DMB and so be up to 
>> date on memory, regardless of pauses in thread execution.
>
> Additionally; the DMB afterwards seemed to have no value.  You could  
> perform the STREX and then your thread could pause indefinitely and were  
> you in a situation where you store was not immediately visible or  
> correctly ordered to another thread, that thread would then read the old  
> value.

The two DMBs are there to prevent other loads and stores on the _local_
CPU being speculated inside the compare-and-swap operation, either by
the compiler or the CPU.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-24 22:56                 ` Russell King - ARM Linux
@ 2009-11-25  0:34                   ` Toby Douglass
  0 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-25  0:34 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:
> On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote:

>> Additionally; the DMB afterwards seemed to have no value.  You could  
>> perform the STREX and then your thread could pause indefinitely and were  
>> you in a situation where you store was not immediately visible or  
>> correctly ordered to another thread, that thread would then read the old  
>> value.

> The two DMBs are there to prevent other loads and stores on the _local_
> CPU being speculated inside the compare-and-swap operation, either by
> the compiler or the CPU.

Ahhh!

Forest, trees, etc...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 23:13         ` Russell King - ARM Linux
  2009-11-24  1:32           ` Jamie Lokier
@ 2009-11-24  9:38           ` Toby Douglass
  1 sibling, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24  9:38 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux wrote:

[snip]

No comments on everything else and the other posts; still digesting them.

> I believe there is a solution to the ABA problem using atomic operations
> with one 32-bit word and a separate counter to identify updates, but it
> requires a little more time than I have available this evening to fully
> put together.

I've implemented the pointer-counter solution on x86/x64.  Indeed, it 
turns out you can use them on ARM, since there is a double-word ldrex. 
However, and this is part of what I'm currently thinking about, I was 
under the impression LL/SC meant *single word CAS* could work without a 
counter.

I think you'd never implement pointer-counter in the kernel, as your 
basic CAS, because plain CAS is all you need for certain things and the 
user can easily implement pointer-counter on top of plain CAS.

OTOH, perhaps it is nice to offer a plain CAS API and an ABA-safe CAS API.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 22:28       ` Jamie Lokier
  2009-11-23 23:13         ` Russell King - ARM Linux
@ 2009-11-24 15:59         ` Catalin Marinas
  2009-11-24 16:34         ` Toby Douglass
  2 siblings, 0 replies; 37+ messages in thread
From: Catalin Marinas @ 2009-11-24 15:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2009-11-23 at 22:28 +0000, Jamie Lokier wrote:
> Toby Douglass wrote:
> > Load-linked/conditional-store architectures solve ABA by having the
> > store fail if the destination has been touched since the load was performed.
> >
> > Currently, the code appears to violate this, by repeating the CAS
> > *anyway*.  In fact, the appropriate behaviour would seem to be *not* to
> > loop, but rather, to issue the ldrex/strex *once*, and indicate to the
> > user if the store succeed or failed.
> 
> I believe Catalin's explained why it does not work even doing
> LDREX/STREX once, because the thread can pause before the LDREX.  So
> you must begin fetching pointers after the LDREX.
> 
> (At least I think so.  I'm prepared to be shown to be wrong :-)
> 
> If you're writing code intended for other LL/SC architectures too, and
> following Catalin's suggestion to put LDR between LDREX and STREX,
> then you might have to check if the other architectures permit loads
> between the LL and SC.

That's a good point. ARM doesn't allow this either, though probably
current implementations don't have any problem with it. From the ARM ARM
A3.4.5:

        An implementation might clear an exclusive monitor between the
        LDREX and the STREX, without any application-related cause. For
        example, this might happen because of cache evictions. Code
        written for such an implementation must avoid having any
        *explicit memory accesses* or cache maintenance operations
        between the LDREX and STREX instructions

So on some future implementations it could live-lock.

-- 
Catalin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* CAS implementation may be broken
  2009-11-23 22:28       ` Jamie Lokier
  2009-11-23 23:13         ` Russell King - ARM Linux
  2009-11-24 15:59         ` Catalin Marinas
@ 2009-11-24 16:34         ` Toby Douglass
  2 siblings, 0 replies; 37+ messages in thread
From: Toby Douglass @ 2009-11-24 16:34 UTC (permalink / raw)
  To: linux-arm-kernel

Jamie Lokier wrote:
> Toby Douglass wrote:
>> Load-linked/conditional-store architectures solve ABA by having the 
>> store fail if the destination has been touched since the load was performed.
>>
>> Currently, the code appears to violate this, by repeating the CAS 
>> *anyway*.  In fact, the appropriate behaviour would seem to be *not* to 
>> loop, but rather, to issue the ldrex/strex *once*, and indicate to the 
>> user if the store succeed or failed.
> 
> I believe Catalin's explained why it does not work even doing
> LDREX/STREX once, because the thread can pause before the LDREX.  So
> you must begin fetching pointers after the LDREX.

Yes.  I'm now getting around to digesting the posts after Catalin's and 
I see here people have already pointed out the things I worked out =-)

> (At least I think so.  I'm prepared to be shown to be wrong :-)

Well - it doesn't save you - you could LDREX, then LDR - and *then* 
pause.  Then the freelist completely changes, but the original element 
is moved back to the top.  Then you continue.  You fail to swap, since 
the destination has changed...ah, wait!  it *does* save you.  Of course, 
you're dead if you loop again, so you need to exit at this point.

Now I'm wondering what I was thinking before, in my earlier email to the 
list...!

> If you're writing code intended for other LL/SC architectures too, and
> following Catalin's suggestion to put LDR between LDREX and STREX,
> then you might have to check if the other architectures permit loads
> between the LL and SC.

As Catalin has pointed out, this in fact is not allowed on ARM.  So it 
doesn't save us (well, me :-) after all and in fact I look stuffed, as 
far as LL/SC is concerned; I have to use double-word LDREX and a 
pointer-counter pair.

So LL/SC on ARM is not an ABA solution.  It merely is a RISC-style 
implementation of what on CISC is a single instruction.

>> This requires a prototype change, because the return value is the 
>> original value of the destination and so is unable to indicate, when 
>> returning that value, if it is returned from a successful or 
>> unsuccessful swap.
> 
> Nonetheless, such a prototype change might be an improvement anyway.
> 
> Some platforms provide compare_and_swap_bool() operations, which do as
> you describe: Compare, conditionally store, and return bool to indicate
> success.  No loop.

Ah, well, I also like to receive the original value of the destination, 
if the CAS had to load it, since I'm going to have to load it again 
myself in most cases.

In the case where the CAS fails because the compare fails, the freshly 
loaded destination value is useful.

In the case where the CAS fails because the destination is touched by 
another thread, the freshly loaded destination is almost never useful 
(unless it was modified by the other thread to be the same value).

> That could be an improvement for some algorithms, because often if the
> store doesn't happen, the *inputs* to compare_and_swap() need to be
> recalculated anyway before another try is likely to succeed.

Exactly!

In fact, this is what happens in *every* algorithm I'm implemented (a 
grand total of two, mind you, stack and queue, but I know it also 
happens for the singly linked list).

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2009-11-27  1:37 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass
2009-11-04 19:05 ` Russell King - ARM Linux
2009-11-04 20:12   ` Toby Douglass
2009-11-04 21:03     ` Russell King - ARM Linux
2009-11-06 19:10       ` Toby Douglass
2009-11-04 22:09   ` Gilles Chanteperdrix
2009-11-06 19:17     ` Toby Douglass
2009-11-21 15:21     ` CAS implementation may be broken Toby Douglass
2009-11-23 15:08       ` Russell King - ARM Linux
2009-11-23 19:10         ` Toby Douglass
2009-11-23 20:06           ` Russell King - ARM Linux
2009-11-23 20:34             ` Toby Douglass
2009-11-23 15:13       ` Catalin Marinas
2009-11-24 15:15         ` Toby Douglass
2009-11-24 15:36           ` Russell King - ARM Linux
2009-11-24 16:20             ` Toby Douglass
2009-11-24 16:27             ` Catalin Marinas
2009-11-24 17:14             ` Toby Douglass
2009-11-25  1:24           ` Jamie Lokier
2009-11-26 16:14             ` Toby Douglass
2009-11-27  1:37               ` Jamie Lokier
2009-11-24 15:33         ` Toby Douglass
2009-11-23 15:34       ` Catalin Marinas
2009-11-23 16:40         ` Toby Douglass
2009-11-23 22:28       ` Jamie Lokier
2009-11-23 23:13         ` Russell King - ARM Linux
2009-11-24  1:32           ` Jamie Lokier
2009-11-24 11:19             ` Catalin Marinas
2009-11-24 22:24               ` Toby Douglass
2009-11-25 11:11                 ` Catalin Marinas
2009-11-25 18:57                   ` Toby Douglass
2009-11-24 22:34               ` Toby Douglass
2009-11-24 22:56                 ` Russell King - ARM Linux
2009-11-25  0:34                   ` Toby Douglass
2009-11-24  9:38           ` Toby Douglass
2009-11-24 15:59         ` Catalin Marinas
2009-11-24 16:34         ` Toby Douglass

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).