* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
@ 2006-12-11 2:30 linux
2006-12-11 4:30 ` Nick Piggin
2006-12-11 4:49 ` Linus Torvalds
0 siblings, 2 replies; 7+ messages in thread
From: linux @ 2006-12-11 2:30 UTC (permalink / raw)
To: nickpiggin, torvalds; +Cc: linux, linux-arch, linux-arm-kernel, linux-kernel
> Even if ARM is able to handle any arbitrary C code between the
> "load locked" and store conditional API, other architectures can not
> by definition.
Maybe so, but I think you and Linus are missing the middle ground.
While I agree that LL/SC can't be part of the kernel API for people to
get arbitrarily clever with in the device driver du jour, they are *very*
nice abstractions for shrinking the arch-specific code size.
The semantics are widely enough shared that it's quite possible in
practice to write a good set of atomic primitives in terms of LL/SC
and then let most architectures define LL/SC and simply #include the
generic atomic op implementations.
If there's a restriction that would pessimize the generic implementation,
that function can be implemented specially for that arch.
Then implementing things like backoff on contention can involve writing
a whole lot less duplicated code.
Just like you can write a set of helpers for, say, CPUs with physically
addressed caches, even though the "real" API has to be able to handle the
virtually addressed ones, you can write a nice set of helpers for machines
with sane LL/SC.
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-11 2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux @ 2006-12-11 4:30 ` Nick Piggin 2006-12-11 6:17 ` linux 2006-12-11 4:49 ` Linus Torvalds 1 sibling, 1 reply; 7+ messages in thread From: Nick Piggin @ 2006-12-11 4:30 UTC (permalink / raw) To: linux; +Cc: torvalds, linux-arch, linux-arm-kernel, linux-kernel Hi "Linux", linux@horizon.com wrote: >>Even if ARM is able to handle any arbitrary C code between the >>"load locked" and store conditional API, other architectures can not >>by definition. > > > Maybe so, but I think you and Linus are missing the middle ground. Nobdy argued against adding nice arch specific helpers to do higher level operations (for example, atomic_add_unless I added was able to reduce the use of cmpxchg in the kernel and can be optimally implemented with ll/sc). I implemented it specifically because I didn't want to use atomic_cmpxchg directly for lockless pagecache, exactly because it is suboptimal on RISCs in that performance critical path. The point is that if somebody wants to implement some fancy lockless code, atomic_cmpxchg is a good tool to use that does not require writing the assembly for two dozen architectures. If it is performance critical then it can absolutely be rewritten in an optimal manner. > While I agree that LL/SC can't be part of the kernel API for people to > get arbitrarily clever with in the device driver du jour, they are *very* > nice abstractions for shrinking the arch-specific code size. > > The semantics are widely enough shared that it's quite possible in > practice to write a good set of atomic primitives in terms of LL/SC > and then let most architectures define LL/SC and simply #include the > generic atomic op implementations. > > If there's a restriction that would pessimize the generic implementation, > that function can be implemented specially for that arch. > > Then implementing things like backoff on contention can involve writing > a whole lot less duplicated code. > > > Just like you can write a set of helpers for, say, CPUs with physically > addressed caches, even though the "real" API has to be able to handle the > virtually addressed ones, you can write a nice set of helpers for machines > with sane LL/SC. So, what would your ll/sc abstraction look like? Let's hear it. The one I'm thinking of goes something like this: atomic_ll() / atomic_sc() with the restriction that they cannot be nested, you cannot write any C code between them, and may only call into some specific set of atomic_llsc_xxx primitives, operating on the address given to ll, and must not have more than a given number of instructions between them. Also, the atomic_sc won't always fail if there were interleaving stores. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-11 4:30 ` Nick Piggin @ 2006-12-11 6:17 ` linux 2006-12-11 7:36 ` Nick Piggin 0 siblings, 1 reply; 7+ messages in thread From: linux @ 2006-12-11 6:17 UTC (permalink / raw) To: linux, nickpiggin; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 4983 bytes --] > atomic_ll() / atomic_sc() with the restriction that they cannot be > nested, you cannot write any C code between them, and may only call > into some specific set of atomic_llsc_xxx primitives, operating on > the address given to ll, and must not have more than a given number > of instructions between them. Also, the atomic_sc won't always fail > if there were interleaving stores. I'm not entirely sure what you're talking about. You generally want to keep the amount of code between ll and sc to an absolute minimum to avoid interference which causes livelock. Processor timeouts are generally much longer than any reasonable code sequence. I hear you and Linus say that loads and stores are not allowed. I'm trying to find that documented somewhere. Let's see... Wikipedia says that LL/SC is implemented on Alpha, MIPS, PowerPC, and ARM. Generally, LL loads the line into cache, makes a note of what it is, and sets an "atomic bit". In particular, the atomic bit is cleared if that cache line is evicted, or an interrupt happens. MIPS clears the LLbit if: - There's an external (another processor) update or invalidate of the affected cache line. (There are three cases of this, but they are all implementation-specific details.) - Or an ERET (return from exception) is executed. I see no restriction in internal accesses, even to the same cache line. Wikipedia says that the PowerPC implementation is particularly strong, and the IBM compiler has C builtins to perform the operations (http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.xlcpp8a.doc/compiler/ref/bif_misc.htm), so I don't think it breaks on memory access. Double-checking with the "PowerPC Microprocessor Family: Programming Environments Manual for 64 and 32-bit Microprocessors" (4.2.6 Memory Synchronization Instructions--UISA, pp. 173-174) confirms that only a modification by ANOTHER processor, or a return from exception, breaks the reservation. The ARM implementation is hard to find documentation on. The best I managed is U.S. patent 6892257, which describes the implementation. In particular, the "external access monitor" that rmk has been describing appears to only apply to uncacheable accesses, where it behaves as described. # At step 28, a check is made as to whether or not the semaphore value # being sought is cacheable. If the semaphore value is not cacheable, as # indicated within an associated MMU (not illustrated), then processing # proceeds to step 30. If the semaphore value is cacheable, then # processing proceeds to step 32. # At step 30, the semaphore value corresponding to the address indicated # by the Rm register value is stored into a register Rd specified within # the LDREX instruction. Furthermore, the physical address of the # semaphore value together with a processor identifying number for the # processor executing the LDREX instruction is stored within the main # memory monitor circuit 22. The bus arbiter 20 is triggered to provide # this processor identifying number to the main memory monitor circuit 22 # by the decoding of an LDREX instruction by one of the processors 4, 6. # At step 30, the physical address of the semaphore data is also stored # within the local semaphore identifying data store 24, 26 of the # processor executing the LDREX instruction. # If processing from step 28 proceeds to step 32, instead of step 30, then # no data is written into the main memory monitor circuit 22, but the # physical address of the semaphore value being returned from address Rm # to register Rd is written into the appropriate local semaphore # identifying data store 24, 26. In the cacheable case (which is the onle of interest), then it works the same as the others, being triggered primarily by the eviction of the cache line. Wading through the patentese is painful, but I assume there's a similar exception return case. It appears that the problem is on Alpha. From the architecture handbook: "· If any other memory access (ECB, LDx, LDQ_U, STx, STQ_U, WH64) is executed on the given processor between the LDx_L and the STx_C, the sequence above may always fail on some implementations; hence, no useful program should do this. · If a branch is taken between the LDx_L and the STx_C, the sequence above may always fail on some implementations; hence, no useful program should do this. (CMOVxx may be used to avoid branching.)" Okay, so Alpha is the special case; write those primitives directly in asm. It's still possible to get three out of four. In truth, however, realizing that we're only talking about three architectures (wo of which have 32 & 64-bit versions) it's probably not worth it. If there were five, it would probably be a savings, but 3x code duplication of some small, well-defined primitives is a fair price to pay for avoiding another layer of abstraction (a.k.a. obfuscation). And it lets you optimize them better. I apologize for not having counted them before. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-11 6:17 ` linux @ 2006-12-11 7:36 ` Nick Piggin 2006-12-12 3:24 ` linux 0 siblings, 1 reply; 7+ messages in thread From: Nick Piggin @ 2006-12-11 7:36 UTC (permalink / raw) To: linux; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds linux@horizon.com wrote: >> atomic_ll() / atomic_sc() with the restriction that they cannot be >> nested, you cannot write any C code between them, and may only call >> into some specific set of atomic_llsc_xxx primitives, operating on >> the address given to ll, and must not have more than a given number >> of instructions between them. Also, the atomic_sc won't always fail >> if there were interleaving stores. > > > I'm not entirely sure what you're talking about. You generally want I'm talking about what a generic API (which you were proposing) would look like. > to keep the amount of code between ll and sc to an absolute minimum > to avoid interference which causes livelock. Processor timeouts > are generally much longer than any reasonable code sequence. "Generally" does not mean you can just ignore it and hope the C compiler does the right thing. Nor is it enough for just SOME of the architectures to have the properties you require. > I hear you and Linus say that loads and stores are not allowed. > I'm trying to find that documented somewhere. Let's see... Wikipedia > says that LL/SC is implemented on Alpha, MIPS, PowerPC, and ARM. Ralf tells us that MIPS cannot execute any loads, stores, or sync instructions on MIPS. Ivan says no loads, stores, taken branches etc on Alpha. MIPS also has a limit of 2048 bytes between the ll and sc. So you almost definitely cannot have gcc generated assembly between. I think we agree on that much. So what remains is for you to propose your API. > Okay, so Alpha is the special case; write those primitives directly > in asm. It's still possible to get three out of four. Yes, but the code *between* the time you call atomic_ll and atomic_sc has to be performed in arch calls as well. Ie. you cannot write: do { val = atomic_ll(&A); if (val == blah) { ... } while (!atomic_sc(&A, newval)); AFAIKS, you cannot even write: do { val = atomic_ll(&A); val++; } while(!atomic_sc(&A, val)); Is the do {} while loop itself even safe? I'm not sure, I don't think we could guarantee it. On the other hand, atomic_cmpxchg can be used in arbitrary C code, and has no restrictions on use whatsoever. > In truth, however, realizing that we're only talking about three > architectures (wo of which have 32 & 64-bit versions) it's probably not > worth it. If there were five, it would probably be a savings, but 3x > code duplication of some small, well-defined primitives is a fair price > to pay for avoiding another layer of abstraction (a.k.a. obfuscation). > > And it lets you optimize them better. > > I apologize for not having counted them before. I disagree. It wouldn't be another layer of abstraction, it would be an abstraction on pretty much the same level as the other atomic primitives. What it would be is a bad abstraction if it leaked all these architecture specific details into generic C code. But if you can get around that... I also disagree that the architectures don't matter. ARM and PPC are pretty important, and I believe Linux on MIPS is growing too. I also think it is really nice to be able to come up with an optimal implementation over a wider range of architectures even if they are relatively rare. It often implies that you have come up with a better API[*]. One proposal that I could buy is an atomic_ll/sc API, which mapped to a cmpxchg emulation even on those llsc architectures which had any sort of restriction whatsoever. This could be used in regular C code (eg. you indicate powerpc might be able to do this). But it may also help cmpxchg architectures optimise their code, because the load really wants to be a "load with intent to store" -- and is IMO the biggest suboptimal aspect of current atomic_cmpxchg. [*] Of course you can take this in the wrong direction. For example, the 2 which implement SMP atomic operations with spinlocks would often prefer to use a spinlock in the data structure being modified than hashed off the address of the atomic (much better locality of reference). But doing this all over the kernel is idiotic. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-11 7:36 ` Nick Piggin @ 2006-12-12 3:24 ` linux 2006-12-12 10:37 ` David Howells 0 siblings, 1 reply; 7+ messages in thread From: linux @ 2006-12-12 3:24 UTC (permalink / raw) To: linux, nickpiggin; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 4680 bytes --] >> to keep the amount of code between ll and sc to an absolute minimum >> to avoid interference which causes livelock. Processor timeouts >> are generally much longer than any reasonable code sequence. > "Generally" does not mean you can just ignore it and hope the C compiler > does the right thing. Nor is it enough for just SOME of the architectures > to have the properties you require. If it's an order of magnitude larger than the common case, then yes you can. Do we worry about writing functions so big that they exceed branch displacement limits? That's detected at compile time, but LL/SC pair distance is in principle straightforward to measure, too. > Ralf tells us that MIPS cannot execute any loads, stores, or sync > instructions on MIPS. Ivan says no loads, stores, taken branches etc > on Alpha. > > MIPS also has a limit of 2048 bytes between the ll and sc. I agree with you about the Alpha, and that will have to be directly coded. But on MIPS, the R4000 manual (2nd ed, covering the R4400 as well) says > The link is broken in the following circumstances: > · if any external request (invalidate, snoop, or intervention) > changes the state of the line containing the lock variable to > invalid > · upon completion of an ERET (return from exception) > instruction > · an external update to the cache line containing the lock > variable Are you absolutely sure of what you are reporting about MIPS? Have you got a source? I've been checking the most authoritative references I can find and can't find mention of such a restriction. (The R8000 User's Manual doesn't appear to mention LL/SC at all, sigh.) One thing I DID find is the "R4000MC Errata, Processor Revision 2.2 and 3.0", which documents several LL/SC bugs (Numbers 10, 12, 13) and #12 in particular requires extremely careful coding in the workaround. That may completely scuttle the idea of using generic LL/SC functions. > So you almost definitely cannot have gcc generated assembly between. I > think we agree on that much. We don't. I think that if that restriction applies, it's worthless, because you can't achieve a net reduction in arch-dependent code. GCC specifically says that if you want a 100% guarantee of no reloads between asm instructions, place them in a single asm() statement. > In truth, however, realizing that we're only talking about three > architectures (wo of which have 32 & 64-bit versions) it's probably not > worth it. If there were five, it would probably be a savings, but 3x > code duplication of some small, well-defined primitives is a fair price > to pay for avoiding another layer of abstraction (a.k.a. obfuscation). > > And it lets you optimize them better. > > I apologize for not having counted them before. > I also disagree that the architectures don't matter. ARM and PPC are > pretty important, and I believe Linux on MIPS is growing too. Er... I definitely don't see where I said, and I don't even see where I implied - or even hinted - that MIPS, ARM and PPC "don't matter." I use Linux on ARM daily. I just thought that writing a nearly-optimal generic primitive is about 3x harder than writing a single-architecture one, so even for primitives yet to be written, its just as easy to do it fully arch-specific. Plus you have corner cases like the R5900 that don't have LL/SC at all. (Can it be used multiprocessor?) > One proposal that I could buy is an atomic_ll/sc API, which mapped > to a cmpxchg emulation even on those llsc architectures which had > any sort of restriction whatsoever. This could be used in regular C > code (eg. you indicate powerpc might be able to do this). But it may > also help cmpxchg architectures optimise their code, because the > load really wants to be a "load with intent to store" -- and is > IMO the biggest suboptimal aspect of current atomic_cmpxchg. Or, possibly, an interface like do { oldvalue = ll(addr) newvalue = ... oldvalue ... } while (!sc(addr, oldvalue, newvalue)) Where sc() could be a cmpxchg. But, more importantly, if the architecture did implement LL/SC, it could be a "try plain SC; if that fails try CMPXCHG built out of LL/SC; if that fails, loop" Actually, I'd want something a bit more integrated, that could have the option of fetching the new oldvalue as part of the sc() implementation if that failed. Something like DO_ATOMIC(addr, oldvalue) { ... code ... } UNTIL_ATOMIC(addr, oldvalue, newvalue); or perhaps, to encourage short code sections, DO_ATOMIC(addr, oldvalue, code, newvalue); The problem is, that's already not optimal for spinlocks, where you want to use a non-linked load while spinning. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-12 3:24 ` linux @ 2006-12-12 10:37 ` David Howells 0 siblings, 0 replies; 7+ messages in thread From: David Howells @ 2006-12-12 10:37 UTC (permalink / raw) To: linux; +Cc: nickpiggin, linux-arch, linux-arm-kernel, linux-kernel, torvalds linux@horizon.com wrote: > do { > oldvalue = ll(addr) > newvalue = ... oldvalue ... > } while (!sc(addr, oldvalue, newvalue)) > > Where sc() could be a cmpxchg. But, more importantly, if the > architecture did implement LL/SC, it could be a "try plain SC; if > that fails try CMPXCHG built out of LL/SC; if that fails, loop" If sc() is implemented with cmpxchg(), then this is a very silly piece of code. cmpxchg() returns the current value if it fails, rendering a repetition of ll() pointless. In some circumstances, you should really do it by putting the cmpxchg() up front with what you expect the most likely substitution to be, and that then doesn't require the initial load. David ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an 2006-12-11 2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux 2006-12-11 4:30 ` Nick Piggin @ 2006-12-11 4:49 ` Linus Torvalds 1 sibling, 0 replies; 7+ messages in thread From: Linus Torvalds @ 2006-12-11 4:49 UTC (permalink / raw) To: linux; +Cc: nickpiggin, linux-arch, linux-arm-kernel, linux-kernel On Sun, 10 Dec 2006, linux@horizon.com wrote: > > While I agree that LL/SC can't be part of the kernel API for people to > get arbitrarily clever with in the device driver du jour, they are *very* > nice abstractions for shrinking the arch-specific code size. I'm not sure. The thing is, it's _really_ hard to tell the compiler to not reload values from memory in between two inline asm statements. So what you easily end up with is (a) yes, you can actually get the compiler to generate the "obvious" code sequence 99% of the time, and it will all work fine. (b) but it's really hard to actually guarantee it, and some subtle things can really mess you up. An example of (b) is how we actually put some of these atomic data structures on the stack ("struct completion" comes to mind), and it can get really interesting it something works in all the tests, but then subtly breaks on some microarchitectures when the data structures happen to be on the stackjust because the compiler happened to do a register reload to the stack at the wrong point. Now, if you don't inline any of these things, you can control things a lot better, since then you end up having a much smaller set of circumstances, and you never have code "around" the actual operation that changes things like register reload. And yes, I do think that it might be possible to have some kind of generic "ll/sc template" setup for that case. You can often make gcc generate the code you want, especially if there is no real register pressure and you can keep the code simple. > The semantics are widely enough shared that it's quite possible in > practice to write a good set of atomic primitives in terms of LL/SC > and then let most architectures define LL/SC and simply #include the > generic atomic op implementations. Well, you do have to also realize that the architectures that dont' do ll/sc do end up limiting the number of useful primitives, especially considering that we know that some architectures simply cannot do a lot between them (which _also_ limits it). I think we've ended up implementing most of the common ones. We have a fairly big set of ops like "atomic_add_return()"-like operations, and those are the obvious ones that can be done for _any_ ll/sc architecture too. So I don't think there's all that much more to be had there. Linus ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2006-12-12 10:37 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-12-11 2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux 2006-12-11 4:30 ` Nick Piggin 2006-12-11 6:17 ` linux 2006-12-11 7:36 ` Nick Piggin 2006-12-12 3:24 ` linux 2006-12-12 10:37 ` David Howells 2006-12-11 4:49 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox