Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an

public inbox for linux-arch@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
@ 2006-12-11  2:30 linux
  2006-12-11  4:30 ` Nick Piggin
  2006-12-11  4:49 ` Linus Torvalds
  0 siblings, 2 replies; 7+ messages in thread
From: linux @ 2006-12-11  2:30 UTC (permalink / raw)
  To: nickpiggin, torvalds; +Cc: linux, linux-arch, linux-arm-kernel, linux-kernel

> Even if ARM is able to handle any arbitrary C code between the
> "load locked" and store conditional API, other architectures can not
> by definition.

Maybe so, but I think you and Linus are missing the middle ground.

While I agree that LL/SC can't be part of the kernel API for people to
get arbitrarily clever with in the device driver du jour, they are *very*
nice abstractions for shrinking the arch-specific code size.

The semantics are widely enough shared that it's quite possible in
practice to write a good set of atomic primitives in terms of LL/SC
and then let most architectures define LL/SC and simply #include the
generic atomic op implementations.

If there's a restriction that would pessimize the generic implementation,
that function can be implemented specially for that arch.

Then implementing things like backoff on contention can involve writing
a whole lot less duplicated code.

Just like you can write a set of helpers for, say, CPUs with physically
addressed caches, even though the "real" API has to be able to handle the
virtually addressed ones, you can write a nice set of helpers for machines
with sane LL/SC.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-11  2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux
@ 2006-12-11  4:30 ` Nick Piggin
  2006-12-11  6:17   ` linux
  2006-12-11  4:49 ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2006-12-11  4:30 UTC (permalink / raw)
  To: linux; +Cc: torvalds, linux-arch, linux-arm-kernel, linux-kernel

Hi "Linux",

linux@horizon.com wrote:
>>Even if ARM is able to handle any arbitrary C code between the
>>"load locked" and store conditional API, other architectures can not
>>by definition.
> 
> 
> Maybe so, but I think you and Linus are missing the middle ground.

Nobdy argued against adding nice arch specific helpers to do higher
level operations (for example, atomic_add_unless I added was able to
reduce the use of cmpxchg in the kernel and can be optimally
implemented with ll/sc). I implemented it specifically because I
didn't want to use atomic_cmpxchg directly for lockless pagecache,
exactly because it is suboptimal on RISCs in that performance
critical path.

The point is that if somebody wants to implement some fancy lockless
code, atomic_cmpxchg is a good tool to use that does not require
writing the assembly for two dozen architectures. If it is performance
critical then it can absolutely be rewritten in an optimal manner.

> While I agree that LL/SC can't be part of the kernel API for people to
> get arbitrarily clever with in the device driver du jour, they are *very*
> nice abstractions for shrinking the arch-specific code size.
> 
> The semantics are widely enough shared that it's quite possible in
> practice to write a good set of atomic primitives in terms of LL/SC
> and then let most architectures define LL/SC and simply #include the
> generic atomic op implementations.
> 
> If there's a restriction that would pessimize the generic implementation,
> that function can be implemented specially for that arch.
> 
> Then implementing things like backoff on contention can involve writing
> a whole lot less duplicated code.
> 
> 
> Just like you can write a set of helpers for, say, CPUs with physically
> addressed caches, even though the "real" API has to be able to handle the
> virtually addressed ones, you can write a nice set of helpers for machines
> with sane LL/SC.

So, what would your ll/sc abstraction look like? Let's hear it.

The one I'm thinking of goes something like this:

   atomic_ll() / atomic_sc() with the restriction that they cannot be
   nested, you cannot write any C code between them, and may only call
   into some specific set of atomic_llsc_xxx primitives, operating on
   the address given to ll, and must not have more than a given number
   of instructions between them. Also, the atomic_sc won't always fail
   if there were interleaving stores.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-11  4:30 ` Nick Piggin
@ 2006-12-11  6:17   ` linux
  2006-12-11  7:36     ` Nick Piggin
  0 siblings, 1 reply; 7+ messages in thread
From: linux @ 2006-12-11  6:17 UTC (permalink / raw)
  To: linux, nickpiggin; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 4983 bytes --]

>  atomic_ll() / atomic_sc() with the restriction that they cannot be
>  nested, you cannot write any C code between them, and may only call
>  into some specific set of atomic_llsc_xxx primitives, operating on
>  the address given to ll, and must not have more than a given number
>  of instructions between them. Also, the atomic_sc won't always fail
>  if there were interleaving stores.

I'm not entirely sure what you're talking about.  You generally want
to keep the amount of code between ll and sc to an absolute minimum
to avoid interference which causes livelock.  Processor timeouts
are generally much longer than any reasonable code sequence.

I hear you and Linus say that loads and stores are not allowed.
I'm trying to find that documented somewhere.  Let's see... Wikipedia
says that LL/SC is implemented on Alpha, MIPS, PowerPC, and ARM.

Generally, LL loads the line into cache, makes a note of what it is,
and sets an "atomic bit".  In particular, the atomic bit is cleared if
that cache line is evicted, or an interrupt happens.

MIPS clears the LLbit if:
- There's an external (another processor) update or invalidate of the
  affected cache line.  (There are three cases of this, but they are
  all implementation-specific details.)
- Or an ERET (return from exception) is executed.

I see no restriction in internal accesses, even to the same cache line.

Wikipedia says that the PowerPC implementation is particularly
strong, and the IBM compiler has C builtins to perform the operations
(http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.xlcpp8a.doc/compiler/ref/bif_misc.htm),
so I don't think it breaks on memory access.  Double-checking with the
"PowerPC Microprocessor Family: Programming Environments Manual for 64 and
32-bit Microprocessors" (4.2.6 Memory Synchronization Instructions--UISA,
pp. 173-174) confirms that only a modification by ANOTHER processor,
or a return from exception, breaks the reservation.

The ARM implementation is hard to find documentation on.  The best
I managed is U.S. patent 6892257, which describes the implementation.
In particular, the "external access monitor" that rmk has been describing
appears to only apply to uncacheable accesses, where it behaves as
described.

# At step 28, a check is made as to whether or not the semaphore value
# being sought is cacheable. If the semaphore value is not cacheable, as
# indicated within an associated MMU (not illustrated), then processing
# proceeds to step 30. If the semaphore value is cacheable, then
# processing proceeds to step 32.

# At step 30, the semaphore value corresponding to the address indicated
# by the Rm register value is stored into a register Rd specified within
# the LDREX instruction. Furthermore, the physical address of the
# semaphore value together with a processor identifying number for the
# processor executing the LDREX instruction is stored within the main
# memory monitor circuit 22. The bus arbiter 20 is triggered to provide
# this processor identifying number to the main memory monitor circuit 22
# by the decoding of an LDREX instruction by one of the processors 4, 6.
# At step 30, the physical address of the semaphore data is also stored
# within the local semaphore identifying data store 24, 26 of the
# processor executing the LDREX instruction.

# If processing from step 28 proceeds to step 32, instead of step 30, then
# no data is written into the main memory monitor circuit 22, but the
# physical address of the semaphore value being returned from address Rm
# to register Rd is written into the appropriate local semaphore
# identifying data store 24, 26.

In the cacheable case (which is the onle of interest), then it works
the same as the others, being triggered primarily by the eviction of
the cache line.  Wading through the patentese is painful, but I
assume there's a similar exception return case.

It appears that the problem is on Alpha.  From the architecture handbook:
"· If any other memory access (ECB, LDx, LDQ_U, STx, STQ_U, WH64) is executed on
   the given processor between the LDx_L and the STx_C, the sequence above may
   always fail on some implementations; hence, no useful program should do this.
 · If a branch is taken between the LDx_L and the STx_C, the sequence above may
   always fail on some implementations; hence, no useful program should do this.
   (CMOVxx may be used to avoid branching.)"

Okay, so Alpha is the special case; write those primitives directly
in asm.  It's still possible to get three out of four.

In truth, however, realizing that we're only talking about three
architectures (wo of which have 32 & 64-bit versions) it's probably not
worth it.  If there were five, it would probably be a savings, but 3x
code duplication of some small, well-defined primitives is a fair price
to pay for avoiding another layer of abstraction (a.k.a. obfuscation).

And it lets you optimize them better.

I apologize for not having counted them before.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-11  6:17   ` linux
@ 2006-12-11  7:36     ` Nick Piggin
  2006-12-12  3:24       ` linux
  0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2006-12-11  7:36 UTC (permalink / raw)
  To: linux; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds

linux@horizon.com wrote:
>> atomic_ll() / atomic_sc() with the restriction that they cannot be
>> nested, you cannot write any C code between them, and may only call
>> into some specific set of atomic_llsc_xxx primitives, operating on
>> the address given to ll, and must not have more than a given number
>> of instructions between them. Also, the atomic_sc won't always fail
>> if there were interleaving stores.
> 
> 
> I'm not entirely sure what you're talking about.  You generally want

I'm talking about what a generic API (which you were proposing) would
look like.

> to keep the amount of code between ll and sc to an absolute minimum
> to avoid interference which causes livelock.  Processor timeouts
> are generally much longer than any reasonable code sequence.

"Generally" does not mean you can just ignore it and hope the C compiler
does the right thing. Nor is it enough for just SOME of the architectures
to have the properties you require.

> I hear you and Linus say that loads and stores are not allowed.
> I'm trying to find that documented somewhere.  Let's see... Wikipedia
> says that LL/SC is implemented on Alpha, MIPS, PowerPC, and ARM.

Ralf tells us that MIPS cannot execute any loads, stores, or sync
instructions on MIPS. Ivan says no loads, stores, taken branches etc
on Alpha.

MIPS also has a limit of 2048 bytes between the ll and sc.

So you almost definitely cannot have gcc generated assembly between. I
think we agree on that much. So what remains is for you to propose your
API.

> Okay, so Alpha is the special case; write those primitives directly
> in asm.  It's still possible to get three out of four.

Yes, but the code *between* the time you call atomic_ll and atomic_sc
has to be performed in arch calls as well. Ie. you cannot write:

    do {
      val = atomic_ll(&A);
      if (val == blah) {
        ...
    } while (!atomic_sc(&A, newval));

AFAIKS, you cannot even write:

   do {
     val = atomic_ll(&A);
     val++;
   } while(!atomic_sc(&A, val));

Is the do {} while loop itself even safe? I'm not sure, I don't think
we could guarantee it.

On the other hand, atomic_cmpxchg can be used in arbitrary C code, and
has no restrictions on use whatsoever.

> In truth, however, realizing that we're only talking about three
> architectures (wo of which have 32 & 64-bit versions) it's probably not
> worth it.  If there were five, it would probably be a savings, but 3x
> code duplication of some small, well-defined primitives is a fair price
> to pay for avoiding another layer of abstraction (a.k.a. obfuscation).
> 
> And it lets you optimize them better.
> 
> I apologize for not having counted them before.

I disagree. It wouldn't be another layer of abstraction, it would be
an abstraction on pretty much the same level as the other atomic
primitives. What it would be is a bad abstraction if it leaked all
these architecture specific details into generic C code. But if you
can get around that...

I also disagree that the architectures don't matter. ARM and PPC are
pretty important, and I believe Linux on MIPS is growing too.

I also think it is really nice to be able to come up with an optimal
implementation over a wider range of architectures even if they are
relatively rare. It often implies that you have come up with a better
API[*].

One proposal that I could buy is an atomic_ll/sc API, which mapped
to a cmpxchg emulation even on those llsc architectures which had
any sort of restriction whatsoever. This could be used in regular C
code (eg. you indicate powerpc might be able to do this). But it may
also help cmpxchg architectures optimise their code, because the
load really wants to be a "load with intent to store" -- and is
IMO the biggest suboptimal aspect of current atomic_cmpxchg.

[*] Of course you can take this in the wrong direction. For example, the
2 which implement SMP atomic operations with spinlocks would often prefer
to use a spinlock in the data structure being modified than hashed off
the address of the atomic (much better locality of reference). But
doing this all over the kernel is idiotic.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-11  7:36     ` Nick Piggin
@ 2006-12-12  3:24       ` linux
  2006-12-12 10:37         ` David Howells
  0 siblings, 1 reply; 7+ messages in thread
From: linux @ 2006-12-12  3:24 UTC (permalink / raw)
  To: linux, nickpiggin; +Cc: linux-arch, linux-arm-kernel, linux-kernel, torvalds

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 4680 bytes --]

>> to keep the amount of code between ll and sc to an absolute minimum
>> to avoid interference which causes livelock.  Processor timeouts
>> are generally much longer than any reasonable code sequence.

> "Generally" does not mean you can just ignore it and hope the C compiler
> does the right thing. Nor is it enough for just SOME of the architectures
> to have the properties you require.

If it's an order of magnitude larger than the common case, then yes
you can.  Do we worry about writing functions so big that they
exceed branch displacement limits?

That's detected at compile time, but LL/SC pair distance is
in principle straightforward to measure, too.

> Ralf tells us that MIPS cannot execute any loads, stores, or sync
> instructions on MIPS. Ivan says no loads, stores, taken branches etc
> on Alpha.
>
> MIPS also has a limit of 2048 bytes between the ll and sc.

I agree with you about the Alpha, and that will have to be directly
coded.  But on MIPS, the R4000 manual (2nd ed, covering the R4400
as well) says

> The link is broken in the following circumstances:
>    ·   if any external request (invalidate, snoop, or intervention)
>        changes the state of the line containing the lock variable to
>        invalid
>    ·   upon completion of an ERET (return from exception)
>        instruction
>    ·   an external update to the cache line containing the lock
>        variable

Are you absolutely sure of what you are reporting about MIPS?
Have you got a source?  I've been checking the most authoritative
references I can find and can't find mention of such a restriction.
(The R8000 User's Manual doesn't appear to mention LL/SC at all, sigh.)

One thing I DID find is the "R4000MC Errata, Processor Revision 2.2 and
3.0", which documents several LL/SC bugs (Numbers 10, 12, 13) and #12
in particular requires extremely careful coding in the workaround.

That may completely scuttle the idea of using generic LL/SC functions.

> So you almost definitely cannot have gcc generated assembly between. I
> think we agree on that much.

We don't.  I think that if that restriction applies, it's worthless,
because you can't achieve a net reduction in arch-dependent code.

GCC specifically says that if you want a 100% guarantee of no reloads
between asm instructions, place them in a single asm() statement.

> In truth, however, realizing that we're only talking about three
> architectures (wo of which have 32 & 64-bit versions) it's probably not
> worth it.  If there were five, it would probably be a savings, but 3x
> code duplication of some small, well-defined primitives is a fair price
> to pay for avoiding another layer of abstraction (a.k.a. obfuscation).
> 
> And it lets you optimize them better.
> 
> I apologize for not having counted them before.

> I also disagree that the architectures don't matter. ARM and PPC are
> pretty important, and I believe Linux on MIPS is growing too.

Er... I definitely don't see where I said, and I don't even see where
I implied - or even hinted - that MIPS, ARM and PPC "don't matter."
I use Linux on ARM daily.

I just thought that writing a nearly-optimal generic primitive is about
3x harder than writing a single-architecture one, so even for primitives
yet to be written, its just as easy to do it fully arch-specific.

Plus you have corner cases like the R5900 that don't have LL/SC at all.
(Can it be used multiprocessor?)

> One proposal that I could buy is an atomic_ll/sc API, which mapped
> to a cmpxchg emulation even on those llsc architectures which had
> any sort of restriction whatsoever. This could be used in regular C
> code (eg. you indicate powerpc might be able to do this). But it may
> also help cmpxchg architectures optimise their code, because the
> load really wants to be a "load with intent to store" -- and is
> IMO the biggest suboptimal aspect of current atomic_cmpxchg.

Or, possibly, an interface like

do {
	oldvalue = ll(addr)
	newvalue = ... oldvalue ...
} while (!sc(addr, oldvalue, newvalue))

Where sc() could be a cmpxchg.  But, more importantly, if the
architecture did implement LL/SC, it could be a "try plain SC; if
that fails try CMPXCHG built out of LL/SC; if that fails, loop"

Actually, I'd want something a bit more integrated, that could
have the option of fetching the new oldvalue as part of the sc()
implementation if that failed.  Something like

	DO_ATOMIC(addr, oldvalue) {
		... code ...
	} UNTIL_ATOMIC(addr, oldvalue, newvalue);

or perhaps, to encourage short code sections,
	DO_ATOMIC(addr, oldvalue, code, newvalue);

The problem is, that's already not optimal for spinlocks, where
you want to use a non-linked load while spinning.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-12  3:24       ` linux
@ 2006-12-12 10:37         ` David Howells
  0 siblings, 0 replies; 7+ messages in thread
From: David Howells @ 2006-12-12 10:37 UTC (permalink / raw)
  To: linux; +Cc: nickpiggin, linux-arch, linux-arm-kernel, linux-kernel, torvalds

linux@horizon.com wrote:

> do {
> 	oldvalue = ll(addr)
> 	newvalue = ... oldvalue ...
> } while (!sc(addr, oldvalue, newvalue))
> 
> Where sc() could be a cmpxchg.  But, more importantly, if the
> architecture did implement LL/SC, it could be a "try plain SC; if
> that fails try CMPXCHG built out of LL/SC; if that fails, loop"

If sc() is implemented with cmpxchg(), then this is a very silly piece of
code.  cmpxchg() returns the current value if it fails, rendering a repetition
of ll() pointless.  In some circumstances, you should really do it by putting
the cmpxchg() up front with what you expect the most likely substitution to
be, and that then doesn't require the initial load.

David

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
  2006-12-11  2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux
  2006-12-11  4:30 ` Nick Piggin
@ 2006-12-11  4:49 ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2006-12-11  4:49 UTC (permalink / raw)
  To: linux; +Cc: nickpiggin, linux-arch, linux-arm-kernel, linux-kernel

On Sun, 10 Dec 2006, linux@horizon.com wrote:
> 
> While I agree that LL/SC can't be part of the kernel API for people to
> get arbitrarily clever with in the device driver du jour, they are *very*
> nice abstractions for shrinking the arch-specific code size.

I'm not sure. 

The thing is, it's _really_ hard to tell the compiler to not reload values 
from memory in between two inline asm statements.

So what you easily end up with is 
 (a) yes, you can actually get the compiler to generate the "obvious" code 
     sequence 99% of the time, and it will all work fine.
 (b) but it's really hard to actually guarantee it, and some subtle things 
     can really mess you up.

An example of (b) is how we actually put some of these atomic data 
structures on the stack ("struct completion" comes to mind), and it can 
get really interesting it something works in all the tests, but then 
subtly breaks on some microarchitectures when the data structures happen 
to be on the stackjust because the compiler happened to do a register 
reload to the stack at the wrong point.

Now, if you don't inline any of these things, you can control things a lot 
better, since then you end up having a much smaller set of circumstances, 
and you never have code "around" the actual operation that changes things 
like register reload. And yes, I do think that it might be possible to 
have some kind of generic "ll/sc template" setup for that case. You can 
often make gcc generate the code you want, especially if there is no real 
register pressure and you can keep the code simple.

> The semantics are widely enough shared that it's quite possible in
> practice to write a good set of atomic primitives in terms of LL/SC
> and then let most architectures define LL/SC and simply #include the
> generic atomic op implementations.

Well, you do have to also realize that the architectures that dont' do 
ll/sc do end up limiting the number of useful primitives, especially 
considering that we know that some architectures simply cannot do a lot 
between them (which _also_ limits it).

I think we've ended up implementing most of the common ones. We have a 
fairly big set of ops like "atomic_add_return()"-like operations, and 
those are the obvious ones that can be done for _any_ ll/sc architecture 
too. So I don't think there's all that much more to be had there.

		Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2006-12-12 10:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-12-11  2:30 [PATCH] WorkStruct: Implement generic UP cmpxchg() where an linux
2006-12-11  4:30 ` Nick Piggin
2006-12-11  6:17   ` linux
2006-12-11  7:36     ` Nick Piggin
2006-12-12  3:24       ` linux
2006-12-12 10:37         ` David Howells
2006-12-11  4:49 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox