Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-17  5:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Nick Piggin, Segher Boessenkool, heiko.carstens,
	horms, linux-kernel, rpjday, ak, netdev, cfriesen, akpm,
	jesper.juhl, linux-arch, zlynx, satyam, clameter, schwidefsky,
	Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708162033400.30176@woody.linux-foundation.org>

On Thu, Aug 16, 2007 at 08:42:23PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > 
> > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
> 
> One of the things that "volatile" generally screws up is a simple
> 
> 	volatile int i;
> 
> 	i++;
> 
> which a compiler will generally get horribly, horribly wrong.
> 
> In a reasonable world, gcc should just make that be (on x86)
> 
> 	addl $1,i(%rip)
> 
> on x86-64, which is indeed what it does without the volatile. But with the 
> volatile, the compiler gets really nervous, and doesn't dare do it in one 
> instruction, and thus generates crap like
> 
>         movl    i(%rip), %eax
>         addl    $1, %eax
>         movl    %eax, i(%rip)

Blech.  Sounds like a chat with some gcc people is in order.  Will
see what I can do.

						Thanx, Paul

> instead. For no good reason, except that "volatile" just doesn't have any 
> good/clear semantics for the compiler, so most compilers will just make it 
> be "I will not touch this access in any way, shape, or form". Including 
> even trivially correct instruction optimization/combination.
> 
> This is one of the reasons why we should never use "volatile". It 
> pessimises code generation for no good reason - just because compilers 
> don't know what the heck it even means! 
> 
> Now, people don't do "i++" on atomics (you'd use "atomic_inc()" for that), 
> but people *do* do things like
> 
> 	if (atomic_read(..) <= 1)
> 		..
> 
> On ppc, things like that probably don't much matter. But on x86, it makes 
> a *huge* difference whether you do
> 
> 	movl i(%rip),%eax
> 	cmpl $1,%eax
> 
> or if you can just use the value directly for the operation, like this:
> 
> 	cmpl $1,i(%rip)
> 
> which is again a totally obvious and totally safe optimization, but is 
> (again) something that gcc doesn't dare do, since "i" is volatile.
> 
> In other words: "volatile" is a horribly horribly bad way of doing things, 
> because it generates *worse*code*, for no good reason. You just don't see 
> it on powerpc, because it's already a load-store architecture, so there is 
> no "good code" for doing direct-to-memory operations.
> 
> 		Linus
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  5:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Richter, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <20070816081049.GA1431@gondor.apana.org.au>

Herbert Xu writes:

> Can you find an actual atomic_read code snippet there that is
> broken without the volatile modifier?

There are some in arch-specific code, for example line 1073 of
arch/mips/kernel/smtc.c.  On mips, cpu_relax() is just barrier(), so
the empty loop body is ok provided that atomic_read actually does the
load each time around the loop.

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-17  5:07 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Christoph Lameter, Chris Snook, Ilpo Järvinen,
	Paul Mackerras, Satyam Sharma, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Linus Torvalds, Netdev,
	Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
	wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
	segher
In-Reply-To: <20070817012800.GA12621@gondor.apana.org.au>

On Fri, Aug 17, 2007 at 09:28:00AM +0800, Herbert Xu wrote:
> On Thu, Aug 16, 2007 at 06:02:32PM -0700, Paul E. McKenney wrote:
> > 
> > Yep.  Or you can use atomic_dec_return() instead of using a barrier.
> 
> Or you could use smp_mb__{before,after}_atomic_dec.

Yep.  That would be an example of a barrier, either in the
atomic_dec() itself or in the smp_mb...().

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  5:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Richter, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <20070816104250.GB2927@gondor.apana.org.au>

Herbert Xu writes:

> So the point here is that if you don't mind getting a stale
> value from the CPU cache when doing an atomic_read, then
> surely you won't mind getting a stale value from the compiler
> "cache".

No, that particular argument is bogus, because there is a cache
coherency protocol operating to keep the CPU cache coherent with
stores from other CPUs, but there isn't any such protocol (nor should
there be) for a register used as a "cache".

(Linux requires SMP systems to keep any CPU caches coherent as far as
accesses by other CPUs are concerned.  It doesn't support any SMP
systems that are not cache-coherent as far as CPU accesses are
concerned.  It does support systems with non-cache-coherent DMA.)

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  4:39 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: paulmck, Herbert Xu, Stefan Richter, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <18117.7611.660070.879392@cargo.ozlabs.ibm.com>

Paul Mackerras wrote:
> Nick Piggin writes:
> 
> 
>>Why are people making these undocumented and just plain false
>>assumptions about atomic_t?
> 
> 
> Well, it has only been false since December 2006.  Prior to that
> atomics *were* volatile on all platforms.

Hmm, although I don't think it has ever been guaranteed by the
API documentation (concede documentation is often not treated
as the authoritative source here, but for atomic it is actually
very good and obviously indispensable as the memory ordering
reference).

>>If they're using lockless code (ie.
>>which they must be if using atomics), then they actually need to be
>>thinking much harder about memory ordering issues.
> 
> 
> Indeed.  I believe that most uses of atomic_read other than in polling
> loops or debug printk statements are actually racy.  In some cases the
> race doesn't seem to matter, but I'm sure there are cases where it
> does.
> 
> 
>>If that is too
>>much for them, then they can just use locks.
> 
> 
> Why use locks when you can just sprinkle magic fix-the-races dust (aka
> atomic_t) over your code? :) :)

I agree with your skepticism of a lot of lockless code. But I think
a lot of the more subtle race problems will not be fixed with volatile.
The big, dumb infinite loop bugs would be fixed, but they're pretty
trivial to debug and even audit for.

>>>Precisely.  And volatility is a key property of "atomic".  Let's please
>>>not throw it away.
>>
>>It isn't, though (at least not since i386 and x86-64 don't have it).
> 
> 
> Conceptually it is, because atomic_t is specifically for variables
> which are liable to be modified by other CPUs, and volatile _means_
> "liable to be changed by mechanisms outside the knowledge of the
> compiler".

Usually that is the case, yes. But also most of the time we don't
care that it has been changed and don't mind it being reordered or
eliminated.

One of the only places we really care about that at all is for
variables that are modified by the *same* CPU.

>>_Adding_ it is trivial, and can be done any time. Throwing it away
>>(ie. making the API weaker) is _hard_. So let's not add it without
> 
> 
> Well, in one sense it's not that hard - Linus did it just 8 months ago
> in commit f9e9dcb3. :)

Well it would have been harder if the documentation also guaranteed
that atomic_read/atomic_set was ordered. Or it would have been harder
for _me_ to make such a change, anyway ;)

>>really good reasons. It most definitely results in worse code
>>generation in practice.
> 
> 
> 0.0008% increase in kernel text size on powerpc according to my
> measurement. :)

I don't think you're making a bad choice by keeping it volatile on
powerpc and waiting for others to shake out more of the bugs. You
get to fix everybody else's memory ordering bugs :)

>>I don't know why people would assume volatile of atomics. AFAIK, most
> 
> 
> By making something an atomic_t you're saying "other CPUs are going to
> be modifying this, so treat it specially".  It's reasonable to assume
> that special treatment extends to reading and setting it.

But I don't actually know what that "special treatment" is. Well
actually, I do know that operations will never result in a partial
modification being exposed. I also know that the operators that
do not modify and return are not guaranteed to have any sort of
ordering constraints.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  4:32 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Christoph Lameter, heiko.carstens, horms, Stefan Richter,
	Bill Fink, Linux Kernel Mailing List, Paul E. McKenney, netdev,
	ak, cfriesen, rpjday, jesper.juhl, linux-arch, Andrew Morton,
	zlynx, schwidefsky, Chris Snook, Herbert Xu, davem,
	Linus Torvalds, wensong, wjiang
In-Reply-To: <43eccf425e36943ce0411c3b504b1de2@kernel.crashing.org>



On Thu, 16 Aug 2007, Segher Boessenkool wrote:

> > Here, I should obviously admit that the semantics of *(volatile int *)&
> > aren't any neater or well-defined in the _language standard_ at all. The
> > standard does say (verbatim) "precisely what constitutes as access to
> > object of volatile-qualified type is implementation-defined", but GCC
> > does help us out here by doing the right thing.
> 
> Where do you get that idea?

Try a testcase (experimentally verify).

> GCC manual, section 6.1, "When
> is a Volatile Object Accessed?" doesn't say anything of the
> kind.

True, "implementation-defined" as per the C standard _is_ supposed to mean
"unspecified behaviour where each implementation documents how the choice
is made". So ok, probably GCC isn't "documenting" this
implementation-defined behaviour which it is supposed to, but can't really
fault them much for this, probably.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  4:24 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Christoph Lameter, heiko.carstens, horms, Stefan Richter,
	Bill Fink, Linux Kernel Mailing List, Paul E. McKenney, netdev,
	ak, cfriesen, rpjday, jesper.juhl, linux-arch, Andrew Morton,
	zlynx, schwidefsky, Chris Snook, Herbert Xu, davem,
	Linus Torvalds, wensong, wjiang, davids
In-Reply-To: <4952d3f536060c186f40c277dfc74194@kernel.crashing.org>



On Thu, 16 Aug 2007, Segher Boessenkool wrote:

> > Note that "volatile"
> > is a type-qualifier, not a type itself, so a cast of the _object_ itself
> > to a qualified-type i.e. (volatile int) would not make the access itself
> > volatile-qualified.
> 
> There is no such thing as "volatile-qualified access" defined
> anywhere; there only is the concept of a "volatile-qualified
> *object*".

Sure, "volatile-qualified access" was not some standard term I used
there. Just something to mean "an access that would make the compiler
treat the object at that memory as if it were an object with a
volatile-qualified type".

Now the second wording *IS* technically correct, but come on, it's
24 words long whereas the original one was 3 -- and hopefully anybody
reading the shorter phrase *would* have known anyway what was meant,
without having to be pedantic about it :-)


Satyam

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  4:02 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Herbert Xu, Stefan Richter, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C512EB.7020603@yahoo.com.au>

Nick Piggin writes:

> Why are people making these undocumented and just plain false
> assumptions about atomic_t?

Well, it has only been false since December 2006.  Prior to that
atomics *were* volatile on all platforms.

> If they're using lockless code (ie.
> which they must be if using atomics), then they actually need to be
> thinking much harder about memory ordering issues.

Indeed.  I believe that most uses of atomic_read other than in polling
loops or debug printk statements are actually racy.  In some cases the
race doesn't seem to matter, but I'm sure there are cases where it
does.

> If that is too
> much for them, then they can just use locks.

Why use locks when you can just sprinkle magic fix-the-races dust (aka
atomic_t) over your code? :) :)

> > Precisely.  And volatility is a key property of "atomic".  Let's please
> > not throw it away.
> 
> It isn't, though (at least not since i386 and x86-64 don't have it).

Conceptually it is, because atomic_t is specifically for variables
which are liable to be modified by other CPUs, and volatile _means_
"liable to be changed by mechanisms outside the knowledge of the
compiler".

> _Adding_ it is trivial, and can be done any time. Throwing it away
> (ie. making the API weaker) is _hard_. So let's not add it without

Well, in one sense it's not that hard - Linus did it just 8 months ago
in commit f9e9dcb3. :)

> really good reasons. It most definitely results in worse code
> generation in practice.

0.0008% increase in kernel text size on powerpc according to my
measurement. :)

> I don't know why people would assume volatile of atomics. AFAIK, most

By making something an atomic_t you're saying "other CPUs are going to
be modifying this, so treat it specially".  It's reasonable to assume
that special treatment extends to reading and setting it.

> of the documentation is pretty clear that all the atomic stuff can be
> reordered etc. except for those that modify and return a value.

Volatility isn't primarily about reordering (though as Linus says it
does restrict reordering to some extent).

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-17  3:53 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Linus Torvalds, Christoph Lameter, Chris Snook, Ilpo J?rvinen,
	Satyam Sharma, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <18117.6495.397597.582736@cargo.ozlabs.ibm.com>

On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:
>
> The cost of doing so seems to me to be well down in the noise - 44
> bytes of extra kernel text on a ppc64 G5 config, and I don't believe
> the extra few cycles for the occasional extra load would be measurable
> (they should all hit in the L1 dcache).  I don't mind if x86[-64] have
> atomic_read/set be nonvolatile and find all the missing barriers, but
> for now on powerpc, I think that not having to find those missing
> barriers is worth the 0.00076% increase in kernel text size.

BTW, the sort of missing barriers that triggered this thread
aren't that subtle.  It'll result in a simple lock-up if the
loop condition holds upon entry.  At which point it's fairly
straightforward to find the culprit.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Linus Torvalds @ 2007-08-17  3:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Mackerras, Segher Boessenkool, heiko.carstens, horms,
	linux-kernel, rpjday, ak, netdev, cfriesen, akpm, jesper.juhl,
	linux-arch, zlynx, satyam, clameter, schwidefsky, Chris Snook,
	Herbert Xu, davem, wensong, wjiang
In-Reply-To: <46C516BA.60700@yahoo.com.au>



On Fri, 17 Aug 2007, Nick Piggin wrote:
> 
> I'm surprised too. Numbers were from the "...use asm() like the other
> atomic operations already do" thread. According to them,
> 
>   text    data     bss     dec     hex filename
> 3434150  249176  176128 3859454  3ae3fe atomic_normal/vmlinux
> 3436203  249176  176128 3861507  3aec03 atomic_volatile/vmlinux
> 
> The first one is a stock kenel, the second is with atomic_read/set
> cast to volatile. gcc-4.1 -- maybe if you have an earlier gcc it
> won't optimise as much?

No, see my earlier reply. "volatile" really *is* an incredible piece of 
crap.

Just try it yourself:

	volatile int i;
	int j;

	int testme(void)
	{
	        return i <= 1;
	}

	int testme2(void)
	{
	        return j <= 1;
	}

and compile with all the optimizations you can.

I get:

	testme:
	        movl    i(%rip), %eax
	        subl    $1, %eax
	        setle   %al
	        movzbl  %al, %eax
	        ret

vs

	testme2:
	        xorl    %eax, %eax
	        cmpl    $1, j(%rip)
	        setle   %al
	        ret

(now, whether that "xorl + setle" is better than "setle + movzbl", I don't 
really know - maybe it is. But that's not thepoint. The point is the 
difference between

                movl    i(%rip), %eax
                subl    $1, %eax

and

                cmpl    $1, j(%rip)

and imagine this being done for *every* single volatile access.

Just do a 

	git grep atomic_read

to see how atomics are actually used. A lot of them are exactly the above 
kind of "compare against a value".

			Linus

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Chris Snook, Ilpo J?rvinen, Herbert Xu,
	Satyam Sharma, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708161953310.30176@woody.linux-foundation.org>

Linus Torvalds writes:

> In general, I'd *much* rather we used barriers. Anything that "depends" on 
> volatile is pretty much set up to be buggy. But I'm certainly also willing 
> to have that volatile inside "atomic_read/atomic_set()" if it avoids code 
> that would otherwise break - ie if it hides a bug.

The cost of doing so seems to me to be well down in the noise - 44
bytes of extra kernel text on a ppc64 G5 config, and I don't believe
the extra few cycles for the occasional extra load would be measurable
(they should all hit in the L1 dcache).  I don't mind if x86[-64] have
atomic_read/set be nonvolatile and find all the missing barriers, but
for now on powerpc, I think that not having to find those missing
barriers is worth the 0.00076% increase in kernel text size.

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Linus Torvalds @ 2007-08-17  3:42 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Nick Piggin, Segher Boessenkool, heiko.carstens, horms,
	linux-kernel, rpjday, ak, netdev, cfriesen, akpm, jesper.juhl,
	linux-arch, zlynx, satyam, clameter, schwidefsky, Chris Snook,
	Herbert Xu, davem, wensong, wjiang
In-Reply-To: <18117.4848.695269.72976@cargo.ozlabs.ibm.com>

On Fri, 17 Aug 2007, Paul Mackerras wrote:
> 
> I'm really surprised it's as much as a few K.  I tried it on powerpc
> and it only saved 40 bytes (10 instructions) for a G5 config.

One of the things that "volatile" generally screws up is a simple

	volatile int i;

	i++;

which a compiler will generally get horribly, horribly wrong.

In a reasonable world, gcc should just make that be (on x86)

	addl $1,i(%rip)

on x86-64, which is indeed what it does without the volatile. But with the 
volatile, the compiler gets really nervous, and doesn't dare do it in one 
instruction, and thus generates crap like

        movl    i(%rip), %eax
        addl    $1, %eax
        movl    %eax, i(%rip)

instead. For no good reason, except that "volatile" just doesn't have any 
good/clear semantics for the compiler, so most compilers will just make it 
be "I will not touch this access in any way, shape, or form". Including 
even trivially correct instruction optimization/combination.

This is one of the reasons why we should never use "volatile". It 
pessimises code generation for no good reason - just because compilers 
don't know what the heck it even means! 

Now, people don't do "i++" on atomics (you'd use "atomic_inc()" for that), 
but people *do* do things like

	if (atomic_read(..) <= 1)
		..

On ppc, things like that probably don't much matter. But on x86, it makes 
a *huge* difference whether you do

	movl i(%rip),%eax
	cmpl $1,%eax

or if you can just use the value directly for the operation, like this:

	cmpl $1,i(%rip)

which is again a totally obvious and totally safe optimization, but is 
(again) something that gcc doesn't dare do, since "i" is volatile.

In other words: "volatile" is a horribly horribly bad way of doing things, 
because it generates *worse*code*, for no good reason. You just don't see 
it on powerpc, because it's already a load-store architecture, so there is 
no "good code" for doing direct-to-memory operations.

		Linus

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  3:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Segher Boessenkool, heiko.carstens, horms, linux-kernel, rpjday,
	ak, netdev, cfriesen, akpm, torvalds, jesper.juhl, linux-arch,
	zlynx, satyam, clameter, schwidefsky, Chris Snook, Herbert Xu,
	davem, wensong, wjiang
In-Reply-To: <18117.4848.695269.72976@cargo.ozlabs.ibm.com>

Paul Mackerras wrote:
> Nick Piggin writes:
> 
> 
>>So i386 and x86-64 don't have volatiles there, and it saves them a
>>few K of kernel text. What you need to justify is why it is a good
> 
> 
> I'm really surprised it's as much as a few K.  I tried it on powerpc
> and it only saved 40 bytes (10 instructions) for a G5 config.
> 
> Paul.
> 

I'm surprised too. Numbers were from the "...use asm() like the other
atomic operations already do" thread. According to them,

   text    data     bss     dec     hex filename
3434150  249176  176128 3859454  3ae3fe atomic_normal/vmlinux
3436203  249176  176128 3861507  3aec03 atomic_volatile/vmlinux

The first one is a stock kenel, the second is with atomic_read/set
cast to volatile. gcc-4.1 -- maybe if you have an earlier gcc it
won't optimise as much?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  3:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Segher Boessenkool, heiko.carstens, horms, linux-kernel, rpjday,
	ak, netdev, cfriesen, akpm, torvalds, jesper.juhl, linux-arch,
	zlynx, satyam, clameter, schwidefsky, Chris Snook, Herbert Xu,
	davem, wensong, wjiang
In-Reply-To: <46C505B2.6030704@yahoo.com.au>

Nick Piggin writes:

> So i386 and x86-64 don't have volatiles there, and it saves them a
> few K of kernel text. What you need to justify is why it is a good

I'm really surprised it's as much as a few K.  I tried it on powerpc
and it only saved 40 bytes (10 instructions) for a G5 config.

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  3:15 UTC (permalink / raw)
  To: paulmck
  Cc: Herbert Xu, Stefan Richter, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <20070816163441.GB16957@linux.vnet.ibm.com>

Paul E. McKenney wrote:
> On Thu, Aug 16, 2007 at 06:42:50PM +0800, Herbert Xu wrote:

>>In fact, volatile doesn't guarantee that the memory gets
>>read anyway.  You might be reading some stale value out
>>of the cache.  Granted this doesn't happen on x86 but
>>when you're coding for the kernel you can't make such
>>assumptions.
>>
>>So the point here is that if you don't mind getting a stale
>>value from the CPU cache when doing an atomic_read, then
>>surely you won't mind getting a stale value from the compiler
>>"cache".
> 
> 
> Absolutely disagree.  An interrupt/NMI/SMI handler running on the CPU
> will see the same value (whether in cache or in store buffer) that
> the mainline code will see.  In this case, we don't care about CPU
> misordering, only about compiler misordering.  It is easy to see
> other uses that combine communication with handlers on the current
> CPU with communication among CPUs -- again, see prior messages in
> this thread.

I still don't agree with the underlying assumption that everybody
(or lots of kernel code) treats atomic accesses as volatile.

Nobody that does has managed to explain my logic problem either:
loads and stores to long and ptr have always been considered to be
atomic, test_bit is atomic; so why are another special subclass of
atomic loads and stores? (and yes, it is perfectly legitimate to
want a non-volatile read for a data type that you also want to do
atomic RMW operations on)

Why are people making these undocumented and just plain false
assumptions about atomic_t? If they're using lockless code (ie.
which they must be if using atomics), then they actually need to be
thinking much harder about memory ordering issues. If that is too
much for them, then they can just use locks.

>>>So, the architecture guys can implement atomic_read however they want
>>>--- as long as it cannot be optimized away.*
>>
>>They can implement it however they want as long as it stays
>>atomic.
> 
> 
> Precisely.  And volatility is a key property of "atomic".  Let's please
> not throw it away.

It isn't, though (at least not since i386 and x86-64 don't have it).
_Adding_ it is trivial, and can be done any time. Throwing it away
(ie. making the API weaker) is _hard_. So let's not add it without
really good reasons. It most definitely results in worse code
generation in practice.

I don't know why people would assume volatile of atomics. AFAIK, most
of the documentation is pretty clear that all the atomic stuff can be
reordered etc. except for those that modify and return a value.

It isn't even intuitive: `*lp = value` is like the most fundamental
atomic operation in Linux.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Linus Torvalds @ 2007-08-17  3:03 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Christoph Lameter, Chris Snook, Ilpo J?rvinen, Herbert Xu,
	Satyam Sharma, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <18117.1287.779351.836552@cargo.ozlabs.ibm.com>

On Fri, 17 Aug 2007, Paul Mackerras wrote:
>
> Volatile doesn't mean it can't be reordered; volatile means the
> accesses can't be eliminated.

It also does limit re-ordering. 

Of course, since *normal* accesses aren't necessarily limited wrt 
re-ordering, the question then becomes one of "with regard to *what* does 
it limit re-ordering?".

A C compiler that re-orders two different volatile accesses that have a 
sequence point in between them is pretty clearly a buggy compiler. So at a 
minimum, it limits re-ordering wrt other volatiles (assuming sequence 
points exists). It also means that the compiler cannot move it 
speculatively across conditionals, but other than that it's starting to 
get fuzzy.

In general, I'd *much* rather we used barriers. Anything that "depends" on 
volatile is pretty much set up to be buggy. But I'm certainly also willing 
to have that volatile inside "atomic_read/atomic_set()" if it avoids code 
that would otherwise break - ie if it hides a bug.

		Linus

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  2:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Snook, Ilpo Järvinen, Herbert Xu, Satyam Sharma,
	Paul E. McKenney, Stefan Richter, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <Pine.LNX.4.64.0708161319000.17777@schroedinger.engr.sgi.com>

Christoph Lameter writes:

> No it does not have any volatile semantics. atomic_dec() can be reordered 
> at will by the compiler within the current basic unit if you do not add a 
> barrier.

Volatile doesn't mean it can't be reordered; volatile means the
accesses can't be eliminated.

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  2:31 UTC (permalink / raw)
  To: Chris Snook
  Cc: Herbert Xu, Stefan Richter, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Paul E. McKenney, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C50228.4000004@redhat.com>

Chris Snook wrote:
> Herbert Xu wrote:
> 
>> On Thu, Aug 16, 2007 at 03:48:54PM -0400, Chris Snook wrote:
>>
>>>> Can you find an actual atomic_read code snippet there that is
>>>> broken without the volatile modifier?
>>>
>>> A whole bunch of atomic_read uses will be broken without the volatile 
>>> modifier once we start removing barriers that aren't needed if 
>>> volatile behavior is guaranteed.
>>
>>
>> Could you please cite the file/function names so we can
>> see whether removing the barrier makes sense?
>>
>> Thanks,
> 
> 
> At a glance, several architectures' implementations of 
> smp_call_function() have one or more legitimate atomic_read() busy-waits 
> that shouldn't be using CPU-relax.  Some of them do work in the loop.

sh looks like the only one there that would be broken (and that's only
because they don't have a cpu_relax there, but it should be added anyway).
Sure, if we removed volatile from other architectures, it would be wise
to audit arch code because arch maintainers do sometimes make assumptions
about their implementation details... however we can assume most generic
code is safe without volatile.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  2:19 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: heiko.carstens, horms, linux-kernel, rpjday, ak, netdev, cfriesen,
	akpm, torvalds, jesper.juhl, linux-arch, zlynx, satyam, clameter,
	schwidefsky, Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <194369f4c96ea0e24decf8f9197d5bad@kernel.crashing.org>

Segher Boessenkool wrote:
>>>>> Part of the motivation here is to fix heisenbugs.  If I knew where 
>>>>> they
>>>>
>>>>
>>>>
>>>> By the same token we should probably disable optimisations
>>>> altogether since that too can create heisenbugs.
>>>
>>> Almost everything is a tradeoff; and so is this.  I don't
>>> believe most people would find disabling all compiler
>>> optimisations an acceptable price to pay for some peace
>>> of mind.
>>
>>
>> So why is this a good tradeoff?
> 
> 
> It certainly is better than disabling all compiler optimisations!

It's easy to be better than something really stupid :)

So i386 and x86-64 don't have volatiles there, and it saves them a
few K of kernel text. What you need to justify is why it is a good
tradeoff to make them volatile (which btw, is much harder to go
the other way after we let people make those assumptions).

>> I also think that just adding things to APIs in the hope it might fix
>> up some bugs isn't really a good road to go down. Where do you stop?
> 
> 
> I look at it the other way: keeping the "volatile" semantics in
> atomic_XXX() (or adding them to it, whatever) helps _prevent_ bugs;

Yeah, but we could add lots of things to help prevent bugs and
would never be included. I would also contend that it helps _hide_
bugs and encourages people to be lazy when thinking about these
things.

Also, you dismiss the fact that we'd actually be *adding* volatile
semantics back to the 2 most widely tested architectures (in terms
of test time, number of testers, variety of configurations, and
coverage of driver code). This is a very important different from
just keeping volatile semantics because it is basically a one-way
API change.

> certainly most people expect that behaviour, and also that behaviour
> is *needed* in some places and no other interface provides that
> functionality.

I don't know that most people would expect that behaviour. Is there
any documentation anywhere that would suggest this?

Also, barrier() most definitely provides the required functionality.
It is overkill in some situations, but volatile is overkill in _most_
situations. If that's what you're worried about, we should add a new
ordering primitive.

> [some confusion about barriers wrt atomics snipped]

What were you confused about?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-17  2:13 UTC (permalink / raw)
  To: Chris Snook
  Cc: Stefan Richter, Paul Mackerras, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <46C50228.4000004@redhat.com>

On Thu, Aug 16, 2007 at 10:04:24PM -0400, Chris Snook wrote:
>
> >Could you please cite the file/function names so we can
> >see whether removing the barrier makes sense?
> 
> At a glance, several architectures' implementations of smp_call_function() 
> have one or more legitimate atomic_read() busy-waits that shouldn't be 
> using CPU-relax.  Some of them do work in the loop.

Care to name one so we can discuss it?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Chris Snook @ 2007-08-17  2:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Richter, Paul Mackerras, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <20070817000209.GC11594@gondor.apana.org.au>

Herbert Xu wrote:
> On Thu, Aug 16, 2007 at 03:48:54PM -0400, Chris Snook wrote:
>>> Can you find an actual atomic_read code snippet there that is
>>> broken without the volatile modifier?
>> A whole bunch of atomic_read uses will be broken without the volatile 
>> modifier once we start removing barriers that aren't needed if volatile 
>> behavior is guaranteed.
> 
> Could you please cite the file/function names so we can
> see whether removing the barrier makes sense?
> 
> Thanks,

At a glance, several architectures' implementations of smp_call_function() have 
one or more legitimate atomic_read() busy-waits that shouldn't be using 
CPU-relax.  Some of them do work in the loop.

I'm sure there are plenty more examples that various maintainers could find in 
their own code.

	-- Chris

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-17  1:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, Chris Snook, Ilpo Järvinen,
	Paul Mackerras, Satyam Sharma, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Linus Torvalds, Netdev,
	Andrew Morton, ak, heiko.carstens, David Miller, schwidefsky,
	wensong, horms, wjiang, cfriesen, zlynx, rpjday, jesper.juhl,
	segher
In-Reply-To: <20070817010232.GK16957@linux.vnet.ibm.com>

On Thu, Aug 16, 2007 at 06:02:32PM -0700, Paul E. McKenney wrote:
> 
> Yep.  Or you can use atomic_dec_return() instead of using a barrier.

Or you could use smp_mb__{before,after}_atomic_dec.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-17  1:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Chris Snook, Ilpo Järvinen, Herbert Xu, Paul Mackerras,
	Satyam Sharma, Stefan Richter, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <Pine.LNX.4.64.0708161319000.17777@schroedinger.engr.sgi.com>

On Thu, Aug 16, 2007 at 01:20:26PM -0700, Christoph Lameter wrote:
> On Thu, 16 Aug 2007, Chris Snook wrote:
> 
> > atomic_dec() already has volatile behavior everywhere, so this is semantically
> > okay, but this code (and any like it) should be calling cpu_relax() each
> > iteration through the loop, unless there's a compelling reason not to.  I'll
> > allow that for some hardware drivers (possibly this one) such a compelling
> > reason may exist, but hardware-independent core subsystems probably have no
> > excuse.
> 
> No it does not have any volatile semantics. atomic_dec() can be reordered 
> at will by the compiler within the current basic unit if you do not add a 
> barrier.

Yep.  Or you can use atomic_dec_return() instead of using a barrier.

						Thanx, Paul

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-17  1:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Richter, Paul Mackerras, Satyam Sharma, Christoph Lameter,
	Chris Snook, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <20070816235902.GB11594@gondor.apana.org.au>

On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
> On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
> >
> > The compiler can also reorder non-volatile accesses.  For an example
> > patch that cares about this, please see:
> > 
> > 	http://lkml.org/lkml/2007/8/7/280
> > 
> > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
> > rcu_read_unlock() to ensure that accesses aren't reordered with respect
> > to interrupt handlers and NMIs/SMIs running on that same CPU.
> 
> Good, finally we have some code to discuss (even though it's
> not actually in the kernel yet).

There was some earlier in this thread as well.

> First of all, I think this illustrates that what you want
> here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
> macro occurs a lot more times in your patch than atomic
> reads/sets.  So *assuming* that it was necessary at all,
> then having an ordered variant of the atomic_read/atomic_set
> ops could do just as well.

Indeed.  If I could trust atomic_read()/atomic_set() to cause the compiler
to maintain ordering, then I could just use them instead of having to
create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
different patch.)

> However, I still don't know which atomic_read/atomic_set in
> your patch would be broken if there were no volatile.  Could
> you please point them out?

Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
atomic_read() and atomic_set().  Starting with __rcu_read_lock():

o	If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++"
	was ordered by the compiler after
	"ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then
	suppose an NMI/SMI happened after the rcu_read_lock_nesting but
	before the rcu_flipctr.

	Then if there was an rcu_read_lock() in the SMI/NMI
	handler (which is perfectly legal), the nested rcu_read_lock()
	would believe that it could take the then-clause of the
	enclosing "if" statement.  But because the rcu_flipctr per-CPU
	variable had not yet been incremented, an RCU updater would
	be within its rights to assume that there were no RCU reads
	in progress, thus possibly yanking a data structure out from
	under the reader in the SMI/NMI function.

	Fatal outcome.  Note that only one CPU is involved here
	because these are all either per-CPU or per-task variables.

o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1"
	was ordered by the compiler to follow the
	"ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI
	happened between the two, then an __rcu_read_lock() in the NMI/SMI
	would incorrectly take the "else" clause of the enclosing "if"
	statement.  If some other CPU flipped the rcu_ctrlblk.completed
	in the meantime, then the __rcu_read_lock() would (correctly)
	write the new value into rcu_flipctr_idx.

	Well and good so far.  But the problem arises in
	__rcu_read_unlock(), which then decrements the wrong counter.
	Depending on exactly how subsequent events played out, this could
	result in either prematurely ending grace periods or never-ending
	grace periods, both of which are fatal outcomes.

And the following are not needed in the current version of the
patch, but will be in a future version that either avoids disabling
irqs or that dispenses with the smp_read_barrier_depends() that I
have 99% convinced myself is unneeded:

o	nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);

o	idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;

Furthermore, in that future version, irq handlers can cause the same
mischief that SMI/NMI handlers can in this version.

Next, looking at __rcu_read_unlock():

o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1"
	was reordered by the compiler to follow the
	"ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--",
	then if an NMI/SMI containing an rcu_read_lock() occurs between
	the two, this nested rcu_read_lock() would incorrectly believe
	that it was protected by an enclosing RCU read-side critical
	section as described in the first reversal discussed for
	__rcu_read_lock() above.  Again, fatal outcome.

This is what we have now.  It is not hard to imagine situations that
interact with -both- interrupt handlers -and- other CPUs, as described
earlier.

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-17  0:02 UTC (permalink / raw)
  To: Chris Snook
  Cc: Stefan Richter, Paul Mackerras, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <46C4AA26.4060707@redhat.com>

On Thu, Aug 16, 2007 at 03:48:54PM -0400, Chris Snook wrote:
>
> >Can you find an actual atomic_read code snippet there that is
> >broken without the volatile modifier?
> 
> A whole bunch of atomic_read uses will be broken without the volatile 
> modifier once we start removing barriers that aren't needed if volatile 
> behavior is guaranteed.

Could you please cite the file/function names so we can
see whether removing the barrier makes sense?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox