* GCC built-in atomic operations and memory barriers @ 2009-11-04 18:09 Toby Douglass 2009-11-04 19:05 ` Russell King - ARM Linux 0 siblings, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-04 18:09 UTC (permalink / raw) To: linux-arm-kernel Hi - My first post. I have written a (currently small) library of lock-free data structures (www.liblfds.org). I am porting to ARM. The port is complete except for CAS. I initially tried to use the GCC built-in CAS; in my test set, the third test fails with a double free, because the head element of the freelist being tested has come to point to itself. So I figured I'd need to write my own CAS. I learned enough ARM assembly and did enough Googling to put something together. I've done the same already for x86 (I needed cmpxchg16b, so I couldn't use the GCC built-ins) so I have a tiny bit of experience in it; and my attempt fails in exactly the same way as the GCC built-in. So I go to bed. Next day I think, ah, memory barriers! you don't need to specify them on x86 for atomics, but I bet you do on ARM; and indeed, you do. I then discover in this mailing list an interesting thread, dated May 2009, which states that the GCC built-ins for ARM do not have memory barriers. The kernel I'm using is 2.6.28, which was released Christmas Day 2008. So I think the lack of memory barriers may well be why the code is failing. This leads me to want to use smp_mb(). However, from what I can see, this macro is only available via the linux kernel headers; it's not available in user-mode. Is this correct? ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass @ 2009-11-04 19:05 ` Russell King - ARM Linux 2009-11-04 20:12 ` Toby Douglass 2009-11-04 22:09 ` Gilles Chanteperdrix 0 siblings, 2 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-04 19:05 UTC (permalink / raw) To: linux-arm-kernel On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote: > This leads me to want to use smp_mb(). However, from what I can see, > this macro is only available via the linux kernel headers; it's not > available in user-mode. Is this correct? Correct. And please understand that just because something exists in kernel land does not give userland a right to make use of it. (There are people who believe it does.) The kernel uses header files as a way to efficiently provide itself with architecture specific optimized helpers, and these helpers may well only work in kernel land. Eg, traditionally, atomic operations on ARM required interrupts to be disabled across them. Userspace can not disable interrupts, and so use of those functions in userland leads to them becoming inherently non-atomic. The kernel API is the syscall interface and associated data structures. It does not include everything you can find in kernel headers. ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 19:05 ` Russell King - ARM Linux @ 2009-11-04 20:12 ` Toby Douglass 2009-11-04 21:03 ` Russell King - ARM Linux 2009-11-04 22:09 ` Gilles Chanteperdrix 1 sibling, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-04 20:12 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Nov 04, 2009 at 07:09:37PM +0100, Toby Douglass wrote: >> This leads me to want to use smp_mb(). However, from what I can see, >> this macro is only available via the linux kernel headers; it's not >> available in user-mode. Is this correct? > > Correct. Thanks. It's often hard on the net to track down a negative answer. [snip] While we're talking about the GCC atomics... This appears to be the current code for CAS; 349 static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new, int size) 352 unsigned long oldval, res; [snip] 382 do { 383 asm volatile("@ __cmpxchg4\n" 384 " ldrex %1, [%2]\n" 385 " mov %0, #0\n" 386 " teq %1, %3\n" 387 " strexeq %0, %4, [%2]\n" 388 : "=&r" (res), "=&r" (oldval) 389 : "r" (ptr), "Ir" (old), "r" (new) 390 : "memory", "cc"); 391 } while (res); The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it seems to me it could just as well happen before the ldrex, and doing so would reduce the time between the ldrex and strexeq and so reduce the chance of someone else modifying our target. ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 20:12 ` Toby Douglass @ 2009-11-04 21:03 ` Russell King - ARM Linux 2009-11-06 19:10 ` Toby Douglass 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-04 21:03 UTC (permalink / raw) To: linux-arm-kernel On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote: > 382 do { > 383 asm volatile("@ __cmpxchg4\n" > 384 " ldrex %1, [%2]\n" > 385 " mov %0, #0\n" > 386 " teq %1, %3\n" > 387 " strexeq %0, %4, [%2]\n" > 388 : "=&r" (res), "=&r" (oldval) > 389 : "r" (ptr), "Ir" (old), "r" (new) > 390 : "memory", "cc"); > 391 } while (res); > > The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it > seems to me it could just as well happen before the ldrex, and doing so > would reduce the time between the ldrex and strexeq and so reduce the > chance of someone else modifying our target. Because it probably doesn't. Loads normally have a 'result delay' which cause a pipeline stall when the result is used in the next instruction. It makes sense to fill those stall cycles with useful work, rather than stalling the execution pipeline. ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 21:03 ` Russell King - ARM Linux @ 2009-11-06 19:10 ` Toby Douglass 0 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-06 19:10 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Wed, Nov 04, 2009 at 09:12:10PM +0100, Toby Douglass wrote: [snip] >> The "mov %0, #0" - why is this inbetween the ldrex and strexeq? it >> seems to me it could just as well happen before the ldrex, and doing so >> would reduce the time between the ldrex and strexeq and so reduce the >> chance of someone else modifying our target. > > Because it probably doesn't. Loads normally have a 'result delay' which > cause a pipeline stall when the result is used in the next instruction. > It makes sense to fill those stall cycles with useful work, rather than > stalling the execution pipeline. Thanks, Russell. I was concerned there was something functionally important I'd missed with regard to the atomic op. ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 19:05 ` Russell King - ARM Linux 2009-11-04 20:12 ` Toby Douglass @ 2009-11-04 22:09 ` Gilles Chanteperdrix 2009-11-06 19:17 ` Toby Douglass 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass 1 sibling, 2 replies; 37+ messages in thread From: Gilles Chanteperdrix @ 2009-11-04 22:09 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > The kernel API is the syscall interface and associated data structures. > It does not include everything you can find in kernel headers. Are the functions found in the vectors page part of the kernel API? Because if that is so, then they provide a way to implement cmpxchg and barriers in user-space. -- Gilles. ^ permalink raw reply [flat|nested] 37+ messages in thread
* GCC built-in atomic operations and memory barriers 2009-11-04 22:09 ` Gilles Chanteperdrix @ 2009-11-06 19:17 ` Toby Douglass 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass 1 sibling, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-06 19:17 UTC (permalink / raw) To: linux-arm-kernel Gilles Chanteperdrix wrote: > Russell King - ARM Linux wrote: >> The kernel API is the syscall interface and associated data structures. >> It does not include everything you can find in kernel headers. > > Are the functions found in the vectors page part of the kernel API? > Because if that is so, then they provide a way to implement cmpxchg and > barriers in user-space. GCC (I had overlooked this) has a function, __sync_synchronize(), which provides a full memory barrier. I do not know how it is implemented on ARM, but the CAS I wrote certainly would not work (and did not work) without a memory barrier and with this function inserted, does work. The GCC atomic CAS for ARM appears to call the OS's compare-exchange function, which certainly is broken, by lacking memory barriers, on the Linux I am developing on. Either way, I now have my own platform independent ARM __asm__ for LL/CS CAS, with memory barriers, so I'm happy. Also, I got my OpenGL wrapper DLL working last night :-) ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-04 22:09 ` Gilles Chanteperdrix 2009-11-06 19:17 ` Toby Douglass @ 2009-11-21 15:21 ` Toby Douglass 2009-11-23 15:08 ` Russell King - ARM Linux ` (3 more replies) 1 sibling, 4 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-21 15:21 UTC (permalink / raw) To: linux-arm-kernel Hi all - I may be *utterly* wrong, and I'm expecting that someone here is simply going to look at what I'm about to say and explain to me what I've misunderstood, but I suspect it may be that atomic compare-and-swap is incorrectly implemented in the Linux kernel on ARM v6 and above (e.g. using ldrex and strex). I'm looking at this code, for 2.6.31, which I believe is the latest released version; http://lxr.linux.no/#linux+v2.6.31/arch/arm/include/asm/system.h 382 do { 383 asm volatile("@ __cmpxchg4\n" 384 " ldrex %1, [%2]\n" 385 " mov %0, #0\n" 386 " teq %1, %3\n" 387 " strexeq %0, %4, [%2]\n" 388 : "=&r" (res), "=&r" (oldval) 389 : "r" (ptr), "Ir" (old), "r" (new) 390 : "memory", "cc"); 391 } while (res); This code is for a four byte CAS; the issue I'm about to describe exists similarly for the two byte and one byte CASs. The issue is the ABA problem. Imagine you have a populated stack. Two threads come to pop. Both read the head pointer and the next pointer of the top element and both simultaneously come to the point where the are about to CAS, comparing the head pointer with the top element pointer and if they match, replacing the head pointer with the next pointer of the top element - so by this, I mean both threads have just got into the do-while loop above, but have not yet executed ldrex. Now, one thread pauses for a long time. The other thread pops. Then a bunch of the other threads pop and push and the stack is *totally* different. Then our initial thread which popped *pushes* that original element back. Finally, our second thread kicks into life. Remember that his exchange pointer is now utterly incorrect, because the stack has completely changed. Now, he's just inside the do-while loop. He checks the top pointer - and lo and behold, it matches the head element. However, because someone else has messed with the top pointer in the meantime, the CAS fails - strex refuses to swap, it can see from the shared TLB attribute that someone else has touched destination (the top pointer). So we dodge ABA *there*. The problem is *we then come round in the do-while loop again*. We have *not* updated our exchange value. So THIS second time around, we *repeat* our strex and we DO swap - and we just swapped in completely the wrong next pointer, from way back before the stack was totally changed by all the other threads popping and pushing. So that's the problem. A common way to solve ABA on contiguous double-word compare-and-swap architectures (e.g. x86, x64 and Itanium) is to use a pointer-counter pair. This permits the caller to ensure swaps fail, even if the pointers all match, because the counters don't. Load-linked/conditional-store architectures solve ABA by having the store fail if the destination has been touched since the load was performed. Currently, the code appears to violate this, by repeating the CAS *anyway*. In fact, the appropriate behaviour would seem to be *not* to loop, but rather, to issue the ldrex/strex *once*, and indicate to the user if the store succeed or failed. This requires a prototype change, because the return value is the original value of the destination and so is unable to indicate, when returning that value, if it is returned from a successful or unsuccessful swap. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass @ 2009-11-23 15:08 ` Russell King - ARM Linux 2009-11-23 19:10 ` Toby Douglass 2009-11-23 15:13 ` Catalin Marinas ` (2 subsequent siblings) 3 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-23 15:08 UTC (permalink / raw) To: linux-arm-kernel On Sat, Nov 21, 2009 at 04:21:00PM +0100, Toby Douglass wrote: > 382 do { > 383 asm volatile("@ __cmpxchg4\n" > 384 " ldrex %1, [%2]\n" > 385 " mov %0, #0\n" > 386 " teq %1, %3\n" > 387 " strexeq %0, %4, [%2]\n" > 388 : "=&r" (res), "=&r" (oldval) > 389 : "r" (ptr), "Ir" (old), "r" (new) > 390 : "memory", "cc"); > 391 } while (res); > > The problem is *we then come round in the do-while loop again*. We have > *not* updated our exchange value. So THIS second time around, we > *repeat* our strex and we DO swap - and we just swapped in completely > the wrong next pointer, from way back before the stack was totally > changed by all the other threads popping and pushing. First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1. ldrex %1, [%2] %1 = *(u32 *)%2 (= 1) mov %0, #0 %0 = 0 teq %1, %3 %3 == %1? (yes) strexeq %0, %4, [%2] executed but because of the other access, exclusivity fails. *(u32 *)%2 not written and %0 = 1 So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2 now. ldrex %1, [%2] %1 = *(u32 *)%2 (= 2) mov %0, #0 %0 = 0 teq %1, %3 %3 == %1? (no) strexeq %0, %4, [%2] not executed at all, %0 and *(u32 *)%2 untouched So, res = 0 and we do _not_ repeat the loop and return "cmpxchg" failure. I haven't had time to read all your email properly (due to the need to get on a conference call), but please tell me where the problem is above (using a similar worked example). Thanks. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 15:08 ` Russell King - ARM Linux @ 2009-11-23 19:10 ` Toby Douglass 2009-11-23 20:06 ` Russell King - ARM Linux 0 siblings, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-23 19:10 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1. > > ldrex %1, [%2] > %1 = *(u32 *)%2 (= 1) > mov %0, #0 > %0 = 0 > teq %1, %3 > %3 == %1? (yes) > strexeq %0, %4, [%2] > executed but because of the other access, > exclusivity fails. *(u32 *)%2 not written > and %0 = 1 > > So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2 > now. No - we're dealing with the ABA problem. We're assuming here that this thread gets to retry with the same values. > I haven't had time to read all your email properly (due to the need to > get on a conference call), but please tell me where the problem is above > (using a similar worked example). So; we go around again, load %2, do the teq, which succeeds, then the strexeq, which now succeeds since no-one else has touched %2. This was the thrust of the original post; however, Catalin has raised arguments against it which I have not yet digested, so what I'm writing here, where it is simply an enlargement on the OP, has the same flaws; it's only in response to your specific point. I'm not trying to assert this *is* what happens, in spite of what Catalin has written. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 19:10 ` Toby Douglass @ 2009-11-23 20:06 ` Russell King - ARM Linux 2009-11-23 20:34 ` Toby Douglass 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-23 20:06 UTC (permalink / raw) To: linux-arm-kernel On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote: > Russell King - ARM Linux wrote: >> First time around the loop, lets say %3 = 1 *(u32 *)%2 = 1. >> >> ldrex %1, [%2] >> %1 = *(u32 *)%2 (= 1) >> mov %0, #0 >> %0 = 0 >> teq %1, %3 >> %3 == %1? (yes) >> strexeq %0, %4, [%2] >> executed but because of the other access, >> exclusivity fails. *(u32 *)%2 not written >> and %0 = 1 >> >> So, res = 1, and we go around the loop again. Lets say that *(u32 *)%2 = 2 >> now. > > No - we're dealing with the ABA problem. We're assuming here that this > thread gets to retry with the same values. > >> I haven't had time to read all your email properly (due to the need to >> get on a conference call), but please tell me where the problem is above >> (using a similar worked example). > > So; we go around again, load %2, do the teq, which succeeds, then the > strexeq, which now succeeds since no-one else has touched %2. > > This was the thrust of the original post; however, Catalin has raised > arguments against it which I have not yet digested, so what I'm writing > here, where it is simply an enlargement on the OP, has the same flaws; > it's only in response to your specific point. I'm not trying to assert > this *is* what happens, in spite of what Catalin has written. Well, I've thought it through quite a bit now, and I have an expansive reply to your email. In summary, there is nothing wrong with the existing code; your use of it is the problem. I can post the expansive reply if you need the details. In short, consider what happens if you consider a slightly different order of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg but you haven't executed the first ldrex. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 20:06 ` Russell King - ARM Linux @ 2009-11-23 20:34 ` Toby Douglass 0 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-23 20:34 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Mon, Nov 23, 2009 at 08:10:51PM +0100, Toby Douglass wrote: >> This was the thrust of the original post; however, Catalin has raised >> arguments against it which I have not yet digested, so what I'm writing >> here, where it is simply an enlargement on the OP, has the same flaws; >> it's only in response to your specific point. I'm not trying to assert >> this *is* what happens, in spite of what Catalin has written. > > Well, I've thought it through quite a bit now, and I have an expansive > reply to your email. In summary, there is nothing wrong with the > existing code; your use of it is the problem. > > I can post the expansive reply if you need the details. I would be very grateful if you could - although if it is extensive and well-understood, you might want to spare the list and email me directly. I've been working every hour there is for the last three weekends trying to get CAS working properly on ARM under Linux; any fresh insight or knowledge from others may give me what I need to work out what I've mis-understood (before I go insane :-) > In short, consider what happens if you consider a slightly different order > of operations, where you have calculated 'ptr', 'old' and 'new' for cmpxchg > but you haven't executed the first ldrex. This is, I think, what Catalin said - I'm grappling with this at the moment; I don't have a reply to it yet, positive or negative. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass 2009-11-23 15:08 ` Russell King - ARM Linux @ 2009-11-23 15:13 ` Catalin Marinas 2009-11-24 15:15 ` Toby Douglass 2009-11-24 15:33 ` Toby Douglass 2009-11-23 15:34 ` Catalin Marinas 2009-11-23 22:28 ` Jamie Lokier 3 siblings, 2 replies; 37+ messages in thread From: Catalin Marinas @ 2009-11-23 15:13 UTC (permalink / raw) To: linux-arm-kernel On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote: > I may be *utterly* wrong, and I'm expecting that someone here is simply > going to look at what I'm about to say and explain to me what I've > misunderstood, but I suspect it may be that atomic compare-and-swap is > incorrectly implemented in the Linux kernel on ARM v6 and above (e.g. > using ldrex and strex). Well, you have to define "correctly" or you may just need a different function for what you want. In this implementation, CAS is expected to succeed if the the old value matches the memory one. The loop is present to always guarantee that it will succeed if the condition matches. Note that the exclusive monitor state is cleared (and hence STREX fails) a every context switch or a return from an exception, i.e. interrupt). > > The issue is the ABA problem. [...] > The problem is *we then come round in the do-while loop again*. We have > *not* updated our exchange value. So THIS second time around, we > *repeat* our strex and we DO swap - and we just swapped in completely > the wrong next pointer, from way back before the stack was totally > changed by all the other threads popping and pushing. But the first thread (paused) one may may actually wait before the LDREX (the old value was loaded with an LDR), so the LDREX/STREX pair would succeed anyway. That's not really a solution unless you also use LDREX for loading the old value to be passed to CAS. IOW, you need your own implementation of what you are trying to achieve (and not modifying CAS). -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 15:13 ` Catalin Marinas @ 2009-11-24 15:15 ` Toby Douglass 2009-11-24 15:36 ` Russell King - ARM Linux 2009-11-25 1:24 ` Jamie Lokier 2009-11-24 15:33 ` Toby Douglass 1 sibling, 2 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 15:15 UTC (permalink / raw) To: linux-arm-kernel Catalin Marinas wrote: > On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote: >> I may be *utterly* wrong, and I'm expecting that someone here is simply >> going to look at what I'm about to say and explain to me what I've >> misunderstood, but I suspect it may be that atomic compare-and-swap is >> incorrectly implemented in the Linux kernel on ARM v6 and above (e.g. >> using ldrex and strex). > > Well, you have to define "correctly" or you may just need a different > function for what you want. Yeeessss-s-s-s-s-s... But you can have a situation where you can choose between two implementations and although both could be 'correct', one offers more functionality than the other - and so is best choice. I'm coming at this from the POV of implementing lock-free data structures. Essentially, "plain CAS" (as on SPARC, which has neither LL/SC nor contiguous double word length CAS) makes implementation problematic because you have no straightforward solution to the ABA problem. Contiguous double word length CAS permits a pointer-counter pair, which solves ABA. LL/SC solves ABA by failing the store if the target has been modified since load (I didn't properly realise how this could be used to solve ABA until later in this post; there is, I think, a way - I describe it later). Now, it is possible to implement CAS using LL/SC such that it behaves exactly like plain CAS - a la SPARC. I believe this is what exists now. However, I think it may also possible to implement using LL/SC such that ABA is avoided. If a platform implements CAS using LL/SC *only* a plain CAS, then it makes life exceedingly difficult for lock-free data structures; and it does so by *not* implementing functionality which exists in the CPU which solves ABA! > In this implementation, CAS is expected to > succeed if the the old value matches the memory one. The loop is present > to always guarantee that it will succeed if the condition matches. Yes. This is a "plain CAS" implementation. Interestingly, I saw a day or two ago that there is a double-word version of LDREX. The atomic-ptr project, which forms the basis of the GCC garbage collector, relies on this; it does not in fact use LL/SC on ARM, but rather uses double-word CAS using a pointer-counter pair. > Note that the exclusive monitor state is cleared (and hence STREX fails) > a every context switch or a return from an exception, i.e. interrupt). Good. I expected this to be so and it is important to me that it is; for by being so, it permits the use of these lock-free data structures in interrupt handlers. >> The issue is the ABA problem. >> The problem is *we then come round in the do-while loop again*. We have >> *not* updated our exchange value. So THIS second time around, we >> *repeat* our strex and we DO swap - and we just swapped in completely >> the wrong next pointer, from way back before the stack was totally >> changed by all the other threads popping and pushing. > > But the first thread (paused) one may may actually wait before the LDREX > (the old value was loaded with an LDR), so the LDREX/STREX pair would > succeed anyway. I can see my original description of the problem contains an error! I described it first as "both threads pause before LDREX" and then later I wrote as if both had in fact executed LDREX. In fact, what you describe here in the paragraph above is what I described in the OP as occurring in the second iteration of the while loop (when in fact it can just as well occur in the first iteration, as you have here described); namely, you're experiencing ABA - your exchange pointer is wrong, but you don't realise it, because the destination matches your compare. So, now I'm puzzled. I was under the impression it was possible to use LL/SC to solve ABA. Ah - wait - perhaps I see how. The problem is that the exchange pointer is loaded before we become able to see that the destination has changed. In fact, what's needed here (in this stack example) is that you *load the exchange pointer inside the ldrex/strex pair*. > That's not really a solution unless you also use LDREX > for loading the old value to be passed to CAS. Man, you're just way ahead of me :-) the coffee at ARM must be *really* good :-) But I think you don't need to use LDREX; it's enough just to LDR inside the LDREX/STREX pair used on destination. (This would mean in the CAS prototype taking a pointer to the exchange value). This would mean that if the destination was changed by someone else, you'd never use the exchange value; but if it was not, you'd be using the current value. However, this assumes the caller has a linkage between the exchange value and the destination; e.g. that if the destination is modified by someone else, the exchange must now be wrong. This I would expect is not always the case and such an implementation would no longer be lowest common denominator. But I wonder... Imagine we had a CAS where the compare and exchange are passed in as pointers and we load what they point to after the LDREX on destination. What do we have now? We have a CAS where the exchange and compare are guaranteed to be only those which exist continuously and without modification between the loading, compare and potential set of the destination. It is perhaps offering stronger guarantees than necessary in all cases - but it actually sounds like a *real* CAS; it says that when we enter the CAS, we WILL have the values for compare and exchange which are correct at that moment, rather than using historical values. Is this wrong? or is it better than x86 style cmpxchg? can we do everything with this that we could do with a CAS which swaps historical values? > IOW, you need your own > implementation of what you are trying to achieve (and not modifying > CAS). On one hand, plain CAS is the basic functionality. It is all you need for some things and indeed, I can imagine for some things, it is orthogonal; it is *what you must have* and so it would need to exist. However, for many things you want an ABA-safe CAS. I would go as so far as to say is a *second* basic functionality, since there is I think more use for this than for a non-ABA safe CAS. Windows offers this simply by offering double-word CAS (and requiring developers to use it as a pointer-counter pairing, which is reasonable, since you might want to use double-word really as double-word, so it has other uses). ARM offers that as well, but I believe there is no double-word CAS support in the kernel. That would be one route (and indeed it seems odd that the current code supports in a switch one, two and four byte CAS, but not eight - perhaps is it not available on all >= v6 CPUs?). The other would be to use the LL/SC functionality as described above. The user of the current CAS cannot implement this behaviour on top of the CAS that currently exists; he would have to write his own assembly. My feeling in this, for what it's worth, is that this seems an excessive requirement for something which is necessary for lock-free data structures; they are becoming important as the number of cores in SMP systems continues to double. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 15:15 ` Toby Douglass @ 2009-11-24 15:36 ` Russell King - ARM Linux 2009-11-24 16:20 ` Toby Douglass ` (2 more replies) 2009-11-25 1:24 ` Jamie Lokier 1 sibling, 3 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-24 15:36 UTC (permalink / raw) To: linux-arm-kernel On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote: > So, now I'm puzzled. I was under the impression it was possible to use > LL/SC to solve ABA. Well, what you can do is: LL compute new value SC The problem for LL/SC architectures is that this creates a big region where we hope that no one else performs another 'load locked' operation. LL/SC based architectures doing this traditionally fall fowl of a problem. Consider the following: do { LL complex computation to new value SC } while (store failed) and have that running on two CPUs trying to modify the same location. It can hang both CPUs up in that loop for a _very_ long time. This is primerily why we've ended up with a CAS implementation, and everything is implemented as: do { load current compute new value do { LL old old == current? SC new } while SC failed } while old != current (I argued initially for a LL / SC interface but the arguments I was making were shot down by Linus et.al. as completely unreasonable and technically idiotic.) Actually, I'm rather surprised that our existing LDREX/STREX implementations don't suffer from this problem. Consider two CPUs operating in lock-step: CPU0 CPU1 ldrex add ldrex strex (fails) add repeat strex (fails) ldrex repeat add ldrex strex (fails) add repeat strex (fails) ... repeat ... In this scenario, the loops _never_ terminate. The usual solution is to add delays into the loop which depend on the CPU number, thus ensuring that two CPUs execute a different number of instructions each time around the loop. > My feeling in this, for what it's worth, is that this seems an excessive > requirement for something which is necessary for lock-free data > structures; they are becoming important as the number of cores in SMP > systems continues to double. Hasn't the lock-free issue been solved by things like rcu? ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 15:36 ` Russell King - ARM Linux @ 2009-11-24 16:20 ` Toby Douglass 2009-11-24 16:27 ` Catalin Marinas 2009-11-24 17:14 ` Toby Douglass 2 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 16:20 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote: >> So, now I'm puzzled. I was under the impression it was possible to use >> LL/SC to solve ABA. > > Well, what you can do is: > > LL > compute new value > SC > > The problem for LL/SC architectures is that this creates a big region > where we hope that no one else performs another 'load locked' operation. Yes; live-lock. > LL/SC based architectures doing this traditionally fall fowl of a problem. > Consider the following: > > do { > LL > complex computation to new value > SC > } while (store failed) > > and have that running on two CPUs trying to modify the same location. It > can hang both CPUs up in that loop for a _very_ long time. Yes, but this complex computation; in all of the data structures which can be implemented using a lock-free algorithm, the only work that is done is loading an exchange value and a compare value. > (I argued initially for a LL / SC interface but the arguments I was making > were shot down by Linus et.al. as completely unreasonable and technically > idiotic.) Were they? > Actually, I'm rather surprised that our existing LDREX/STREX implementations > don't suffer from this problem. Consider two CPUs operating in lock-step: > > CPU0 CPU1 > ldrex > add ldrex > strex (fails) add > repeat strex (fails) > ldrex repeat > add ldrex > strex (fails) add > repeat strex (fails) > ... repeat > ... > > In this scenario, the loops _never_ terminate. The usual solution is to > add delays into the loop which depend on the CPU number, thus ensuring > that two CPUs execute a different number of instructions each time around > the loop. Yes. Algorithmic back-off is the standard solution. For light use, it's excessive. TBH, I doubt many are using this code, since the kernel version of it didn't have memory barriers until recently and GCC's __sync_synchronize() does nothing on ARM. >> My feeling in this, for what it's worth, is that this seems an excessive >> requirement for something which is necessary for lock-free data >> structures; they are becoming important as the number of cores in SMP >> systems continues to double. > > Hasn't the lock-free issue been solved by things like rcu? I would say yes and no. Memory handling adds significant complexity to a lock-free algorithm. For some algorithms, it's not necessary, for others you can get by without it at a cost (using a freelist, for example). It's a trade-off, rather than an outright solution. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 15:36 ` Russell King - ARM Linux 2009-11-24 16:20 ` Toby Douglass @ 2009-11-24 16:27 ` Catalin Marinas 2009-11-24 17:14 ` Toby Douglass 2 siblings, 0 replies; 37+ messages in thread From: Catalin Marinas @ 2009-11-24 16:27 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2009-11-24 at 15:36 +0000, Russell King - ARM Linux wrote: > On Tue, Nov 24, 2009 at 04:15:37PM +0100, Toby Douglass wrote: > > So, now I'm puzzled. I was under the impression it was possible to use > > LL/SC to solve ABA. > > Well, what you can do is: > > LL > compute new value > SC > > The problem for LL/SC architectures is that this creates a big region > where we hope that no one else performs another 'load locked' operation. As I replied to Jamie, the long computation above may introduce some live-lock situation on some hardware implementations (I think in general it needs to avoid cache line evictions). > (I argued initially for a LL / SC interface but the arguments I was making > were shot down by Linus et.al. as completely unreasonable and technically > idiotic.) Maybe because other architectures don't allow for many instructions between LL and SC. > Actually, I'm rather surprised that our existing LDREX/STREX implementations > don't suffer from this problem. Consider two CPUs operating in lock-step: > > CPU0 CPU1 > ldrex > add ldrex > strex (fails) add > repeat strex (fails) > ldrex repeat > add ldrex > strex (fails) add > repeat strex (fails) > ... repeat > ... > > In this scenario, the loops _never_ terminate. The usual solution is to > add delays into the loop which depend on the CPU number, thus ensuring > that two CPUs execute a different number of instructions each time around > the loop. I thought we already have a hardware solution or this. There was an erratum for ARM11MPCore r0p0 where we had to add the delay in software but it's no longer needed. > > My feeling in this, for what it's worth, is that this seems an excessive > > requirement for something which is necessary for lock-free data > > structures; they are becoming important as the number of cores in SMP > > systems continues to double. > > Hasn't the lock-free issue been solved by things like rcu? RCU allows only one writer (and multiple readers). The completely lock-free algorithm would allow multiple writers (multiple threads popping from a stack in Toby's original problem). -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 15:36 ` Russell King - ARM Linux 2009-11-24 16:20 ` Toby Douglass 2009-11-24 16:27 ` Catalin Marinas @ 2009-11-24 17:14 ` Toby Douglass 2 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 17:14 UTC (permalink / raw) To: linux-arm-kernel I wrote: > Russell King - ARM Linux wrote: >> (I argued initially for a LL / SC interface but the arguments >> I was making were shot down by Linus et.al. as completely >> unreasonable and technically idiotic.) > Were they? My question is I think in hindsight misinterpretable. What I mean is; you're a reasonable and intelligent bloke. You have your views in this matter and Linus et al have theirs; they thought what you said was crazy. Having heard them out, what did you think of what they said? were there valid arguments in there? ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 15:15 ` Toby Douglass 2009-11-24 15:36 ` Russell King - ARM Linux @ 2009-11-25 1:24 ` Jamie Lokier 2009-11-26 16:14 ` Toby Douglass 1 sibling, 1 reply; 37+ messages in thread From: Jamie Lokier @ 2009-11-25 1:24 UTC (permalink / raw) To: linux-arm-kernel Toby Douglass wrote: > Catalin Marinas wrote: > Interestingly, I saw a day or two ago that there is a double-word > version of LDREX. The atomic-ptr project, which forms the basis of the > GCC garbage collector, relies on this; it does not in fact use LL/SC on > ARM, but rather uses double-word CAS using a pointer-counter pair. Note that pointer-counter is not really reliable. If a thread is suspended for the time taken for exactly 2^32 operations by other threads, it fails. (Assuming 32-bit pointer + 32-bit counter). That's unlikely, but not impossible. Consider a program with 100 threads and one CPU, each thread in a tight loop doing atomic_op(&p) and some other things. It's quite reasonable for 1 thread to be descheduled for the amount of time taken for 2^32 iterations from all the other threads, if the pointer-counter CAS operation is fast, which it increasingly is. Remember, this is one CPU so it's in L1 cache, and modern CPUs execute 2^32 operations in a single second. CAS operations are not single-cycle _yet_, but they continue to get faster. With 100 threads running without sleeping, the kernel may well deschedule one of them for several seconds as it cycles through them all and gives them "non-interactive" timeslices. It's also quite likely to wrap if someone used kill -STOP, and later kill -CONT on a thread, or if they're debugging one of the threads. Of course even if the counter wraps, you still have to be 1/2^32 unlucky to see the same value. But that's enough to make it unreliable. -- Jamie ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-25 1:24 ` Jamie Lokier @ 2009-11-26 16:14 ` Toby Douglass 2009-11-27 1:37 ` Jamie Lokier 0 siblings, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-26 16:14 UTC (permalink / raw) To: linux-arm-kernel Jamie Lokier wrote: > Toby Douglass wrote: >> Interestingly, I saw a day or two ago that there is a double-word >> version of LDREX. The atomic-ptr project, which forms the basis of the >> GCC garbage collector, relies on this; it does not in fact use LL/SC on >> ARM, but rather uses double-word CAS using a pointer-counter pair. > Note that pointer-counter is not really reliable. [snip case] > Of course even if the counter wraps, you still have to be 1/2^32 > unlucky to see the same value. But that's enough to make it unreliable. I concur, but it's not a problem on 64-bit. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-26 16:14 ` Toby Douglass @ 2009-11-27 1:37 ` Jamie Lokier 0 siblings, 0 replies; 37+ messages in thread From: Jamie Lokier @ 2009-11-27 1:37 UTC (permalink / raw) To: linux-arm-kernel Toby Douglass wrote: > Jamie Lokier wrote: > >Note that pointer-counter is not really reliable. > > [snip case] > > >Of course even if the counter wraps, you still have to be 1/2^32 > >unlucky to see the same value. But that's enough to make it unreliable. > > I concur, but it's not a problem on 64-bit. Agree, about 64-bit. (Until we have much faster CPUs :-) It's occurred to me that it's possible to make pointer-counter safe. Simply, when about to wrap, send a signal to every thread. The signal handler should check if the thread it has interrupted is inside a pointer-counter critical section, and if yes, force the interrupted code to synchronise with all the other threads in another way, such as using a mutex. -- Jamie ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 15:13 ` Catalin Marinas 2009-11-24 15:15 ` Toby Douglass @ 2009-11-24 15:33 ` Toby Douglass 1 sibling, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 15:33 UTC (permalink / raw) To: linux-arm-kernel I wrote: > We have a CAS where the exchange and compare are guaranteed to be only those which exist continuously and without modification between the loading, compare and potential set of the destination. Ah, this isn't true. The compare and exchange values could be modified after their LDR but before the STREX. Apologies - basic mistake :-/ So in fact what you have is a guarantee you will only use values for compare and exchange which exist *after* you've loaded your destination. I don't think this is useful for anything; it's no different to having them loaded before the LDREX. What's more, I think I can do this now anyway, in GCC; pass pointers in and in the assembly, use "*exchange" rather than "exchange". I understand now what Catalin means about using LDREX to load them as well (which you can't do, I believe; you can't nest LDREX on one core). ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass 2009-11-23 15:08 ` Russell King - ARM Linux 2009-11-23 15:13 ` Catalin Marinas @ 2009-11-23 15:34 ` Catalin Marinas 2009-11-23 16:40 ` Toby Douglass 2009-11-23 22:28 ` Jamie Lokier 3 siblings, 1 reply; 37+ messages in thread From: Catalin Marinas @ 2009-11-23 15:34 UTC (permalink / raw) To: linux-arm-kernel On Sat, 2009-11-21 at 15:21 +0000, Toby Douglass wrote: > Load-linked/conditional-store architectures solve ABA by having the > store fail if the destination has been touched since the load was performed. So, for your problem, CAS wouldn't solve it on any architecture, not just ARM if it is implemented with LL/SC. Something like below may work (untested and haven't put much thought into it). Basically you need atomicity between the moment you first get the head pointer and the moment you update it. CAS doesn't ensure this since the first load of the [head] pointer would be standard LDR (even if it is LDREX, you get another LDREX which resets the link with the STREX): 1: LDREX orig_head, [head] LDR orig_next, [orig_head, #next] STREX res, orig_next, [head] cmp res, #0 bne 1b -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 15:34 ` Catalin Marinas @ 2009-11-23 16:40 ` Toby Douglass 0 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-23 16:40 UTC (permalink / raw) To: linux-arm-kernel Catalin, Russell - thankyou for your replies. I'm looking at them now; some thought is required :-) ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass ` (2 preceding siblings ...) 2009-11-23 15:34 ` Catalin Marinas @ 2009-11-23 22:28 ` Jamie Lokier 2009-11-23 23:13 ` Russell King - ARM Linux ` (2 more replies) 3 siblings, 3 replies; 37+ messages in thread From: Jamie Lokier @ 2009-11-23 22:28 UTC (permalink / raw) To: linux-arm-kernel Toby Douglass wrote: > Load-linked/conditional-store architectures solve ABA by having the > store fail if the destination has been touched since the load was performed. > > Currently, the code appears to violate this, by repeating the CAS > *anyway*. In fact, the appropriate behaviour would seem to be *not* to > loop, but rather, to issue the ldrex/strex *once*, and indicate to the > user if the store succeed or failed. I believe Catalin's explained why it does not work even doing LDREX/STREX once, because the thread can pause before the LDREX. So you must begin fetching pointers after the LDREX. (At least I think so. I'm prepared to be shown to be wrong :-) If you're writing code intended for other LL/SC architectures too, and following Catalin's suggestion to put LDR between LDREX and STREX, then you might have to check if the other architectures permit loads between the LL and SC. > This requires a prototype change, because the return value is the > original value of the destination and so is unable to indicate, when > returning that value, if it is returned from a successful or > unsuccessful swap. Nonetheless, such a prototype change might be an improvement anyway. Some platforms provide compare_and_swap_bool() operations, which do as you describe: Compare, conditionally store, and return bool to indicate success. No loop. That could be an improvement for some algorithms, because often if the store doesn't happen, the *inputs* to compare_and_swap() need to be recalculated anyway before another try is likely to succeed. What's the point in having code which expands to two loops: do { old = get_something; new = calc_something; /* oldval = compare_and_swap(ptr, old, new); */ do { __asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new)); } while (failed && oldval == old); } while (oldval != old); When it can often be a smaller loop, which probably executes a little faster too in various cases: do { old = get_something; new = calc_something; /* oldval = compare_and_swap_bool(ptr, old, new); */ __asm__("LL/SC" : (failed), (oldval) : (ptr), (old), (new)); } while (failed); -- Jamie ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 22:28 ` Jamie Lokier @ 2009-11-23 23:13 ` Russell King - ARM Linux 2009-11-24 1:32 ` Jamie Lokier 2009-11-24 9:38 ` Toby Douglass 2009-11-24 15:59 ` Catalin Marinas 2009-11-24 16:34 ` Toby Douglass 2 siblings, 2 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-23 23:13 UTC (permalink / raw) To: linux-arm-kernel On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote: > That could be an improvement for some algorithms, because often if the > store doesn't happen, the *inputs* to compare_and_swap() need to be > recalculated anyway before another try is likely to succeed. What's > the point in having code which expands to two loops: The point is that often its cheaper to retry the LL/SC than it is to do the recalculation. However, I don't think you've understood the original problem at all. The issue is that for a particular 32-bit word, the behaviour required is: - if no change occurs between the _original_ read and the atomic update, then go ahead with the update - if any change has occured, do not update, but go back and recalculate from the beginning. This is because, even if the value at the pointer is restored back to the numerically identical value it had at the start, the requirement is for the store to fail. No amount of loops (or lack of loops) solves that, even if you have a proper single atomic 32-bit CAS instruction. You need to be far more clever, and there's some hints on wikipedia about possible solutions. They assume that you can CAS more bits than your intended value though, which we don't have. I believe there is a solution to the ABA problem using atomic operations with one 32-bit word and a separate counter to identify updates, but it requires a little more time than I have available this evening to fully put together. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 23:13 ` Russell King - ARM Linux @ 2009-11-24 1:32 ` Jamie Lokier 2009-11-24 11:19 ` Catalin Marinas 2009-11-24 9:38 ` Toby Douglass 1 sibling, 1 reply; 37+ messages in thread From: Jamie Lokier @ 2009-11-24 1:32 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote: > > That could be an improvement for some algorithms, because often if the > > store doesn't happen, the *inputs* to compare_and_swap() need to be > > recalculated anyway before another try is likely to succeed. What's > > the point in having code which expands to two loops: > > The point is that often its cheaper to retry the LL/SC than it is to > do the recalculation. > > However, I don't think you've understood the original problem at all. I think I have - I agreed with you and Catalin already that LL/SC does not suffice. But do you mean that Catalin's suggestion to put the LDREX before the LDR does not work either? (Maybe it needs a barrier too). It probably looks like I didn't understand because I've mixed two quite different issues in the same mail (same in this one), because I think the CAS_BOOL has merit. Having implemented both variants in userspace, it's actually quite annoying to use CAS and have to remember the input and compare it with the output sometimes. For some algorithms, you're right that it can be cheaper to retry the LL/SC. But for some algorithms you know the retry will never succeed and is a waste of time and code space, e.g. ones which do CAS on a counter which is always incremented. It makes the caller a bit more long-winded too. The final argument is: you can implement CAS in terms of CAS_BOOL, but you can't do it the other way. Because everyone copied x86's CAS at the begining, that's the one we got. -- Jamie ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 1:32 ` Jamie Lokier @ 2009-11-24 11:19 ` Catalin Marinas 2009-11-24 22:24 ` Toby Douglass 2009-11-24 22:34 ` Toby Douglass 0 siblings, 2 replies; 37+ messages in thread From: Catalin Marinas @ 2009-11-24 11:19 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote: > Russell King - ARM Linux wrote: > > On Mon, Nov 23, 2009 at 10:28:30PM +0000, Jamie Lokier wrote: > > > That could be an improvement for some algorithms, because often if the > > > store doesn't happen, the *inputs* to compare_and_swap() need to be > > > recalculated anyway before another try is likely to succeed. What's > > > the point in having code which expands to two loops: > > > > The point is that often its cheaper to retry the LL/SC than it is to > > do the recalculation. > > > > However, I don't think you've understood the original problem at all. > > I think I have - I agreed with you and Catalin already that LL/SC does > not suffice. But do you mean that Catalin's suggestion to put the > LDREX before the LDR does not work either? (Maybe it needs a barrier > too). It definitely needs a barrier after the LDREX and maybe one after STREX but that depends on the semantics of such operation. -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 11:19 ` Catalin Marinas @ 2009-11-24 22:24 ` Toby Douglass 2009-11-25 11:11 ` Catalin Marinas 2009-11-24 22:34 ` Toby Douglass 1 sibling, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-24 22:24 UTC (permalink / raw) To: linux-arm-kernel Catalin Marinas wrote: > On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote: >> Russell King - ARM Linux wrote: >>> However, I don't think you've understood the original problem at all. >> I think I have - I agreed with you and Catalin already that LL/SC does >> not suffice. But do you mean that Catalin's suggestion to put the >> LDREX before the LDR does not work either? (Maybe it needs a barrier >> too). > It definitely needs a barrier after the LDREX and maybe one after STREX > but that depends on the semantics of such operation. In the latest kernel, there is a memory barrier (the DMB instruction on v7 and above) immediately before and immediately after the call to the CAS function. I thought about this a little. If the memory barrier is immediately before and given the next instruction is the LDREX, *all* threads coming to the LDREX *must* have preceeding them a DMB and so be up to date on memory, regardless of pauses in thread execution. This however is in conflict with Catalin's comments. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 22:24 ` Toby Douglass @ 2009-11-25 11:11 ` Catalin Marinas 2009-11-25 18:57 ` Toby Douglass 0 siblings, 1 reply; 37+ messages in thread From: Catalin Marinas @ 2009-11-25 11:11 UTC (permalink / raw) To: linux-arm-kernel On Tue, 2009-11-24 at 22:24 +0000, Toby Douglass wrote: > Catalin Marinas wrote: > > On Tue, 2009-11-24 at 01:32 +0000, Jamie Lokier wrote: > >> Russell King - ARM Linux wrote: > > >>> However, I don't think you've understood the original problem at all. > > >> I think I have - I agreed with you and Catalin already that LL/SC does > >> not suffice. But do you mean that Catalin's suggestion to put the > >> LDREX before the LDR does not work either? (Maybe it needs a barrier > >> too). > > > It definitely needs a barrier after the LDREX and maybe one after STREX > > but that depends on the semantics of such operation. > > In the latest kernel, there is a memory barrier (the DMB instruction on > v7 and above) immediately before and immediately after the call to the > CAS function. > > I thought about this a little. If the memory barrier is immediately > before and given the next instruction is the LDREX, *all* threads coming > to the LDREX *must* have preceeding them a DMB and so be up to date on > memory, regardless of pauses in thread execution. > > This however is in conflict with Catalin's comments. My comment was to a new implementation for the ABA issue using LDR for the next pointer after LDREX and these would need to be separated by a barrier. Looking at this a bit more, it isn't actually needed since the subsequent LDR uses the pointer read by LDREX and thus creating an address dependency ("the value returned by a read access is used to compute the virtual address of a subsequent read or write access") which implies strict ordering between LDREX and LDR (A3.8.2 in the ARM ARM). But as I said, LDR between LDREX/STREX may no always work. You may need barriers before and after you routine as we do in the CAS case but it depends how you define your routine, you may ask the caller of such routine to add the barriers explicitly. -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-25 11:11 ` Catalin Marinas @ 2009-11-25 18:57 ` Toby Douglass 0 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-25 18:57 UTC (permalink / raw) To: linux-arm-kernel So, I'm thinking now; where does this leave us? We see that LDREX/SDREX cannot be used on their own in a single word CAS to prevent ABA since we cannot rely on using LDR inside an LDREX/SDREX pair. (You guys knew this; I'm just stating it here to lead in). This means you must use a pointer-counter pair (with its potential for failure) or memory handling; the same situation as we find on x86 and x64, for example. That's okay - not perfect, but we're not deficient compared to other common platforms. I think however it leaves open one question; do we wish for the LDREX/SDREX CAS to fail if destination has been modified, or to retry, as it does now. From a lock-free data structure point of view, failure on modification is more efficient, since we will always need to recompute our exchange value; going to the extra trouble of forcing the CAS (which will then fail anyway, because the counter part of the pointer-counter pair will not match) is a waste of time. The behaviour we have now can be easily implemented (a loop) on top of the non-retry CAS; but the non-retry CAS behaviour cannot be implemented on top of what we have now. However, the world and his dog expects x86/x64 style behaviour - we can do better, but it may not be what users expect. Ah - there is one other issue - it seems reasonable that the CAS that exists now should be extended to support 8 byte swaps, since it already supports 1, 2 and 4 byte swaps. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 11:19 ` Catalin Marinas 2009-11-24 22:24 ` Toby Douglass @ 2009-11-24 22:34 ` Toby Douglass 2009-11-24 22:56 ` Russell King - ARM Linux 1 sibling, 1 reply; 37+ messages in thread From: Toby Douglass @ 2009-11-24 22:34 UTC (permalink / raw) To: linux-arm-kernel I wrote: > I thought about this a little. If the memory barrier is immediately > before and given the next instruction is the LDREX, *all* threads coming > to the LDREX *must* have preceeding them a DMB and so be up to date on > memory, regardless of pauses in thread execution. Additionally; the DMB afterwards seemed to have no value. You could perform the STREX and then your thread could pause indefinitely and were you in a situation where you store was not immediately visible or correctly ordered to another thread, that thread would then read the old value. A more common example would be that another thread is reading the destination value for some reason, and reads after the STREX and before the trailing DMB. This implies you need a DMB as the instruction immediately preceding every read and every write of destination. On x86/x64, all the atomic ops (e.g. the LOCK prefix) act as full memory barriers. Note that I'm not yet fully conversant with the issues in multi-core threading, so I may be writing rubbish :-) ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 22:34 ` Toby Douglass @ 2009-11-24 22:56 ` Russell King - ARM Linux 2009-11-25 0:34 ` Toby Douglass 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2009-11-24 22:56 UTC (permalink / raw) To: linux-arm-kernel On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote: > I wrote: >> I thought about this a little. If the memory barrier is immediately >> before and given the next instruction is the LDREX, *all* threads >> coming to the LDREX *must* have preceeding them a DMB and so be up to >> date on memory, regardless of pauses in thread execution. > > Additionally; the DMB afterwards seemed to have no value. You could > perform the STREX and then your thread could pause indefinitely and were > you in a situation where you store was not immediately visible or > correctly ordered to another thread, that thread would then read the old > value. The two DMBs are there to prevent other loads and stores on the _local_ CPU being speculated inside the compare-and-swap operation, either by the compiler or the CPU. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-24 22:56 ` Russell King - ARM Linux @ 2009-11-25 0:34 ` Toby Douglass 0 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-25 0:34 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: > On Tue, Nov 24, 2009 at 11:34:56PM +0100, Toby Douglass wrote: >> Additionally; the DMB afterwards seemed to have no value. You could >> perform the STREX and then your thread could pause indefinitely and were >> you in a situation where you store was not immediately visible or >> correctly ordered to another thread, that thread would then read the old >> value. > The two DMBs are there to prevent other loads and stores on the _local_ > CPU being speculated inside the compare-and-swap operation, either by > the compiler or the CPU. Ahhh! Forest, trees, etc... ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 23:13 ` Russell King - ARM Linux 2009-11-24 1:32 ` Jamie Lokier @ 2009-11-24 9:38 ` Toby Douglass 1 sibling, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 9:38 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux wrote: [snip] No comments on everything else and the other posts; still digesting them. > I believe there is a solution to the ABA problem using atomic operations > with one 32-bit word and a separate counter to identify updates, but it > requires a little more time than I have available this evening to fully > put together. I've implemented the pointer-counter solution on x86/x64. Indeed, it turns out you can use them on ARM, since there is a double-word ldrex. However, and this is part of what I'm currently thinking about, I was under the impression LL/SC meant *single word CAS* could work without a counter. I think you'd never implement pointer-counter in the kernel, as your basic CAS, because plain CAS is all you need for certain things and the user can easily implement pointer-counter on top of plain CAS. OTOH, perhaps it is nice to offer a plain CAS API and an ABA-safe CAS API. ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 22:28 ` Jamie Lokier 2009-11-23 23:13 ` Russell King - ARM Linux @ 2009-11-24 15:59 ` Catalin Marinas 2009-11-24 16:34 ` Toby Douglass 2 siblings, 0 replies; 37+ messages in thread From: Catalin Marinas @ 2009-11-24 15:59 UTC (permalink / raw) To: linux-arm-kernel On Mon, 2009-11-23 at 22:28 +0000, Jamie Lokier wrote: > Toby Douglass wrote: > > Load-linked/conditional-store architectures solve ABA by having the > > store fail if the destination has been touched since the load was performed. > > > > Currently, the code appears to violate this, by repeating the CAS > > *anyway*. In fact, the appropriate behaviour would seem to be *not* to > > loop, but rather, to issue the ldrex/strex *once*, and indicate to the > > user if the store succeed or failed. > > I believe Catalin's explained why it does not work even doing > LDREX/STREX once, because the thread can pause before the LDREX. So > you must begin fetching pointers after the LDREX. > > (At least I think so. I'm prepared to be shown to be wrong :-) > > If you're writing code intended for other LL/SC architectures too, and > following Catalin's suggestion to put LDR between LDREX and STREX, > then you might have to check if the other architectures permit loads > between the LL and SC. That's a good point. ARM doesn't allow this either, though probably current implementations don't have any problem with it. From the ARM ARM A3.4.5: An implementation might clear an exclusive monitor between the LDREX and the STREX, without any application-related cause. For example, this might happen because of cache evictions. Code written for such an implementation must avoid having any *explicit memory accesses* or cache maintenance operations between the LDREX and STREX instructions So on some future implementations it could live-lock. -- Catalin ^ permalink raw reply [flat|nested] 37+ messages in thread
* CAS implementation may be broken 2009-11-23 22:28 ` Jamie Lokier 2009-11-23 23:13 ` Russell King - ARM Linux 2009-11-24 15:59 ` Catalin Marinas @ 2009-11-24 16:34 ` Toby Douglass 2 siblings, 0 replies; 37+ messages in thread From: Toby Douglass @ 2009-11-24 16:34 UTC (permalink / raw) To: linux-arm-kernel Jamie Lokier wrote: > Toby Douglass wrote: >> Load-linked/conditional-store architectures solve ABA by having the >> store fail if the destination has been touched since the load was performed. >> >> Currently, the code appears to violate this, by repeating the CAS >> *anyway*. In fact, the appropriate behaviour would seem to be *not* to >> loop, but rather, to issue the ldrex/strex *once*, and indicate to the >> user if the store succeed or failed. > > I believe Catalin's explained why it does not work even doing > LDREX/STREX once, because the thread can pause before the LDREX. So > you must begin fetching pointers after the LDREX. Yes. I'm now getting around to digesting the posts after Catalin's and I see here people have already pointed out the things I worked out =-) > (At least I think so. I'm prepared to be shown to be wrong :-) Well - it doesn't save you - you could LDREX, then LDR - and *then* pause. Then the freelist completely changes, but the original element is moved back to the top. Then you continue. You fail to swap, since the destination has changed...ah, wait! it *does* save you. Of course, you're dead if you loop again, so you need to exit at this point. Now I'm wondering what I was thinking before, in my earlier email to the list...! > If you're writing code intended for other LL/SC architectures too, and > following Catalin's suggestion to put LDR between LDREX and STREX, > then you might have to check if the other architectures permit loads > between the LL and SC. As Catalin has pointed out, this in fact is not allowed on ARM. So it doesn't save us (well, me :-) after all and in fact I look stuffed, as far as LL/SC is concerned; I have to use double-word LDREX and a pointer-counter pair. So LL/SC on ARM is not an ABA solution. It merely is a RISC-style implementation of what on CISC is a single instruction. >> This requires a prototype change, because the return value is the >> original value of the destination and so is unable to indicate, when >> returning that value, if it is returned from a successful or >> unsuccessful swap. > > Nonetheless, such a prototype change might be an improvement anyway. > > Some platforms provide compare_and_swap_bool() operations, which do as > you describe: Compare, conditionally store, and return bool to indicate > success. No loop. Ah, well, I also like to receive the original value of the destination, if the CAS had to load it, since I'm going to have to load it again myself in most cases. In the case where the CAS fails because the compare fails, the freshly loaded destination value is useful. In the case where the CAS fails because the destination is touched by another thread, the freshly loaded destination is almost never useful (unless it was modified by the other thread to be the same value). > That could be an improvement for some algorithms, because often if the > store doesn't happen, the *inputs* to compare_and_swap() need to be > recalculated anyway before another try is likely to succeed. Exactly! In fact, this is what happens in *every* algorithm I'm implemented (a grand total of two, mind you, stack and queue, but I know it also happens for the singly linked list). ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2009-11-27 1:37 UTC | newest] Thread overview: 37+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-11-04 18:09 GCC built-in atomic operations and memory barriers Toby Douglass 2009-11-04 19:05 ` Russell King - ARM Linux 2009-11-04 20:12 ` Toby Douglass 2009-11-04 21:03 ` Russell King - ARM Linux 2009-11-06 19:10 ` Toby Douglass 2009-11-04 22:09 ` Gilles Chanteperdrix 2009-11-06 19:17 ` Toby Douglass 2009-11-21 15:21 ` CAS implementation may be broken Toby Douglass 2009-11-23 15:08 ` Russell King - ARM Linux 2009-11-23 19:10 ` Toby Douglass 2009-11-23 20:06 ` Russell King - ARM Linux 2009-11-23 20:34 ` Toby Douglass 2009-11-23 15:13 ` Catalin Marinas 2009-11-24 15:15 ` Toby Douglass 2009-11-24 15:36 ` Russell King - ARM Linux 2009-11-24 16:20 ` Toby Douglass 2009-11-24 16:27 ` Catalin Marinas 2009-11-24 17:14 ` Toby Douglass 2009-11-25 1:24 ` Jamie Lokier 2009-11-26 16:14 ` Toby Douglass 2009-11-27 1:37 ` Jamie Lokier 2009-11-24 15:33 ` Toby Douglass 2009-11-23 15:34 ` Catalin Marinas 2009-11-23 16:40 ` Toby Douglass 2009-11-23 22:28 ` Jamie Lokier 2009-11-23 23:13 ` Russell King - ARM Linux 2009-11-24 1:32 ` Jamie Lokier 2009-11-24 11:19 ` Catalin Marinas 2009-11-24 22:24 ` Toby Douglass 2009-11-25 11:11 ` Catalin Marinas 2009-11-25 18:57 ` Toby Douglass 2009-11-24 22:34 ` Toby Douglass 2009-11-24 22:56 ` Russell King - ARM Linux 2009-11-25 0:34 ` Toby Douglass 2009-11-24 9:38 ` Toby Douglass 2009-11-24 15:59 ` Catalin Marinas 2009-11-24 16:34 ` Toby Douglass
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).