Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul E. McKenney @ 2007-08-17  5:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Nick Piggin, Segher Boessenkool, heiko.carstens,
	horms, linux-kernel, rpjday, ak, netdev, cfriesen, akpm,
	jesper.juhl, linux-arch, zlynx, satyam, clameter, schwidefsky,
	Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708162033400.30176@woody.linux-foundation.org>

On Thu, Aug 16, 2007 at 08:42:23PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > 
> > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
> 
> One of the things that "volatile" generally screws up is a simple
> 
> 	volatile int i;
> 
> 	i++;
> 
> which a compiler will generally get horribly, horribly wrong.
> 
> In a reasonable world, gcc should just make that be (on x86)
> 
> 	addl $1,i(%rip)
> 
> on x86-64, which is indeed what it does without the volatile. But with the 
> volatile, the compiler gets really nervous, and doesn't dare do it in one 
> instruction, and thus generates crap like
> 
>         movl    i(%rip), %eax
>         addl    $1, %eax
>         movl    %eax, i(%rip)

Blech.  Sounds like a chat with some gcc people is in order.  Will
see what I can do.

						Thanx, Paul

> instead. For no good reason, except that "volatile" just doesn't have any 
> good/clear semantics for the compiler, so most compilers will just make it 
> be "I will not touch this access in any way, shape, or form". Including 
> even trivially correct instruction optimization/combination.
> 
> This is one of the reasons why we should never use "volatile". It 
> pessimises code generation for no good reason - just because compilers 
> don't know what the heck it even means! 
> 
> Now, people don't do "i++" on atomics (you'd use "atomic_inc()" for that), 
> but people *do* do things like
> 
> 	if (atomic_read(..) <= 1)
> 		..
> 
> On ppc, things like that probably don't much matter. But on x86, it makes 
> a *huge* difference whether you do
> 
> 	movl i(%rip),%eax
> 	cmpl $1,%eax
> 
> or if you can just use the value directly for the operation, like this:
> 
> 	cmpl $1,i(%rip)
> 
> which is again a totally obvious and totally safe optimization, but is 
> (again) something that gcc doesn't dare do, since "i" is volatile.
> 
> In other words: "volatile" is a horribly horribly bad way of doing things, 
> because it generates *worse*code*, for no good reason. You just don't see 
> it on powerpc, because it's already a load-store architecture, so there is 
> no "good code" for doing direct-to-memory operations.
> 
> 		Linus
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Herbert Xu @ 2007-08-17  5:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Stefan Richter, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <18117.11685.431347.996767@cargo.ozlabs.ibm.com>

On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote:
> Herbert Xu writes:
> 
> > Can you find an actual atomic_read code snippet there that is
> > broken without the volatile modifier?
> 
> There are some in arch-specific code, for example line 1073 of
> arch/mips/kernel/smtc.c.  On mips, cpu_relax() is just barrier(), so
> the empty loop body is ok provided that atomic_read actually does the
> load each time around the loop.

A barrier() is all you need to force the compiler to reread
the value.

The people advocating volatile in this thread are talking
about code that doesn't use barrier()/cpu_relax().

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  5:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Stefan Richter, Satyam Sharma, Christoph Lameter,
	Paul E. McKenney, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <20070817053200.GA15457@gondor.apana.org.au>

Herbert Xu writes:

> On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote:
> > Herbert Xu writes:
> > 
> > > Can you find an actual atomic_read code snippet there that is
> > > broken without the volatile modifier?
> > 
> > There are some in arch-specific code, for example line 1073 of
> > arch/mips/kernel/smtc.c.  On mips, cpu_relax() is just barrier(), so
> > the empty loop body is ok provided that atomic_read actually does the
> > load each time around the loop.
> 
> A barrier() is all you need to force the compiler to reread
> the value.
> 
> The people advocating volatile in this thread are talking
> about code that doesn't use barrier()/cpu_relax().

Did you look at it?  Here it is:

	/* Someone else is initializing in parallel - let 'em finish */
	while (atomic_read(&idle_hook_initialized) < 1000)
		;

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Nick Piggin, Segher Boessenkool, heiko.carstens,
	horms, Linux Kernel Mailing List, rpjday, ak, netdev, cfriesen,
	Andrew Morton, jesper.juhl, linux-arch, zlynx, clameter,
	schwidefsky, Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708162033400.30176@woody.linux-foundation.org>

Hi Linus,

[ and others; I think there's a communication gap in a lot of this
  thread, and a little summary would be useful. Hence this posting. ]

On Thu, 16 Aug 2007, Linus Torvalds wrote:

> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > 
> > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
> 
> One of the things that "volatile" generally screws up is a simple
> 
> 	volatile int i;
> 
> 	i++;
> 
> which a compiler will generally get horribly, horribly wrong.
> 
> [...] For no good reason, except that "volatile" just doesn't have any 
> good/clear semantics for the compiler, so most compilers will just make it 
> be "I will not touch this access in any way, shape, or form". Including 
> even trivially correct instruction optimization/combination.
> 
> This is one of the reasons why we should never use "volatile". It 
> pessimises code generation for no good reason - just because compilers 
> don't know what the heck it even means! 
> [...]
> In other words: "volatile" is a horribly horribly bad way of doing things, 
> because it generates *worse*code*, for no good reason. You just don't see 
> it on powerpc, because it's already a load-store architecture, so there is 
> no "good code" for doing direct-to-memory operations.

True, and I bet *everybody* on this list has already agreed for a very
long time that using "volatile" to type-qualify the _declaration_ of an
object itself as being horribly bad (taste-wise, code-generation-wise,
often even buggy for sitations where real CPU barriers should've been
used instead).

However, the discussion on this thread (IIRC) began with only "giving
volatility semantics" to atomic ops. Now that is different, and may not
require the use the "volatile" keyword (at least not in the declaration
of the object) itself.

Sadly, most arch's *still* do type-qualify the declaration of the
"counter" member of atomic_t as "volatile". This is probably a historic
hangover, and I suspect not yet rectified because of lethargy.

Anyway, some of the variants I can think of are:

[1]

#define atomic_read_volatile(v)				\
	({						\
		forget((v)->counter);			\
		((v)->counter);				\
	})

where:

#define forget(a)	__asm__ __volatile__ ("" :"=m" (a) :"m" (a))

[ This is exactly equivalent to using "+m" in the constraints, as recently
  explained on a GCC list somewhere, in response to the patch in my bitops
  series a few weeks back where I thought "+m" was bogus. ]

[2]

#define atomic_read_volatile(v)		(*(volatile int *)&(v)->counter)

This is something that does work. It has reasonably good semantics
guaranteed by the C standard in conjunction with how GCC currently
behaves (and how it has behaved for all supported versions). I haven't
checked if generates much different code than the first variant above,
(it probably will generate similar code to just declaring the object
as volatile, but would still be better in terms of code-clarity and
taste, IMHO), but in any case, we should pick whichever of these variants
works for us and generates good code.

[3]

static inline int atomic_read_volatile(atomic_t *v)
{
	... arch-dependent __asm__ __volatile__ stuff ...
}

I can reasonably bet this variant would often generate worse code than
at least the variant "[1]" above.

Now, why do we even require these "volatility" semantics variants?

Note, "volatility" semantics *know* / assume that it can have a meaning
_only_ as far as the compiler, so atomic_read_volatile() doesn't really
care reading stale values from the cache for certain non-x86 archs, etc.

The first argument is "safety":

Use of atomic_read() (possibly in conjunction with other atomic ops) in
a lot of code out there in the kernel *assumes* the compiler will not
optimize away those ops. (which is possible given current definitions
of atomic_set and atomic_read on archs such as x86 in present code).
An additional argument that builds on this one says that by ensuring
the compiler will not elid or coalesce these ops, we could even avoid
potential heisenbugs in the future.

However, there is a counter-argument:

As Herbert Xu has often been making the point, there is *no* bug out
there involving "atomic_read" in busy-while-loops that should not have
a compiler barrier (or cpu_relax() in fact) anyway. As for non-busy-loops,
they would invariable call schedule() at some point (possibly directly)
and thus have an "implicit" compiler barrier by virtue of calling out
a function that is not in scope of the current compilation unit (although
users in sched.c itself would probably require an explicit compiler
barrier).

The second pro-volatility-in-atomic-ops argument is performance:
(surprise!)

Using a full memory clobber compiler barrier in busy loops will disqualify
optimizations for loop invariants so it probably makes sense to *only*
make the compiler forget *that* particular address of the atomic counter
object, and none other. All 3 variants above would work nicely here.

So the final idea may be to have a cpu_relax_no_barrier() variant as a
rep;nop (pause) *without* an included full memory clobber, and replace
a lot of kernel busy-while-loops out there with:

-	cpu_relax();
+	cpu_relax_no_barrier();
+	forget(x);

or may be just:

-	cpu_relax();
+	cpu_relax_no_barrier();

because the "forget" / "volatility" / specific-variable-compiler-barrier
could be made implicit inside the atomic ops themselves.

This could especially make a difference for register-rich CPUs (probably
not x86) where using a full memory clobber will disqualify a hell lot of
compiler optimizations for loop-invariants.

On x86 itself, cpu_relax_no_barrier() could be:

#define cpu_relax_no_barrier()	__asm__ __volatile__ ("rep;nop":::);

and still continue to do its job as it is doing presently.

However, there is still a counter-argument:

As Herbert Xu and Christoph Lameter have often been saying, giving
"volatility" semantics to the atomic ops will disqualify compiler
optimizations such as eliding / coalescing of atomic ops, etc, and
probably some sections of code in the kernel (Christoph mentioned code
in SLUB, and I saw such code in sched) benefit from such optimizations.

Paul Mackerras has, otoh, mentioned that a lot of such places probably
don't need (or shouldn't use) atomic ops in the first place.
Alternatively, such callsites should probably just cache the atomic_read
in a local variable (which compiler could just as well make a register)
explicitly, and repeating atomic_read() isn't really necessary.

There could still be legitimate uses of atomic ops that don't care about
them being elided / coalesced, but given the loop-invariant-optimization
benefit, personally, I do see some benefit in the use of defining atomic
ops variants with "volatility" semantics (for only the atomic counter
object) but also having a non-volatile atomic ops API side-by-side for
performance critical users (probably sched, slub) that may require that.

Possibly, one of the two APIs above could turn out to be redundant, but
that's still very much the issue of debate presently.

Satyam

[ Sorry if I missed anything important, but this thread has been long
  and noisy, although I've tried to keep up ... ]

^ permalink raw reply

* Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
From: Krishna Kumar2 @ 2007-08-17  6:06 UTC (permalink / raw)
  To: David Miller
  Cc: gaagaan, general, hadi, herbert, jagana, jeff, johnpol, kaber,
	kumarkr, mcarlson, mchan, netdev, peter.p.waskiewicz.jr, rdreier,
	rick.jones2, Robert.Olsson, shemminger, sri, tgraf, xma


Hi Dave,

> I ran 3 iterations of 45 sec tests (total 1 hour 16 min, but I will
> run a longer one tonight). The results are (results in KB/s, and %):

I ran a 8.5 hours run with no batching + another 8.5 hours run with
batching (Buffer sizes: "32 128 512 4096 16384", Threads: "1 8 32",
Each test run time: 3 minutes, Iterations to average: 5). TCP seems
to get a small improvement.

Thanks,

- KK

-------------------------------------------------------------------------------
                           TCP
                        -----------
Size:32 Procs:1             3415       3321       -2.75
Size:128 Procs:1           13094      13388      2.24
Size:512 Procs:1            49037      50683      3.35
Size:4096 Procs:1          114646     114619     -.02
Size:16384 Procs:1        114626     114644     .01

Size:32 Procs:8             22675      22633      -.18
Size:128 Procs:8            77994      77297      -.89
Size:512 Procs:8            114716     114711     0
Size:4096 Procs:8          114637     114636     0
Size:16384 Procs:8         95814      114638     19.64

Size:32 Procs:32            23240      23349      .46
Size:128 Procs:32          82284      82247      -.04
Size:512 Procs:32          114885     114769     -.10
Size:4096 Procs:32         95735      114634     19.74
Size:16384 Procs:32       114736     114641     -.08

Average:                        1151534    1190210    3.36%
-------------------------------------------------------------------------------
                        No Delay:
                        ---------
Size:32 Procs:1             3002       2873       -4.29
Size:128 Procs:1           11853      11801      -.43
Size:512 Procs:1           45565      45837      .59
Size:4096 Procs:1         114511     114485     -.02
Size:16384 Procs:1        114521     114555     .02

Size:32 Procs:8             8026       8029       .03
Size:128 Procs:8           31589      31573      -.05
Size:512 Procs:8           111506     105766     -5.14
Size:4096 Procs:8         114455     114454     0
Size:16384 Procs:8        95833      114491     19.46

Size:32 Procs:32            8005       8027       .27
Size:128 Procs:32          31475      31505      .09
Size:512 Procs:32          114558     113687     -.76
Size:4096 Procs:32        114784     114447     -.29
Size:16384 Procs:32       114719     114496     -.19

Average:                         1046026    1034402    -1.11%
-------------------------------------------------------------------------------


^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  6:26 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul Mackerras, Linus Torvalds, Christoph Lameter, Chris Snook,
	Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <20070817035342.GA14744@gondor.apana.org.au>



On Fri, 17 Aug 2007, Herbert Xu wrote:

> On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:
> >
> > The cost of doing so seems to me to be well down in the noise - 44
> > bytes of extra kernel text on a ppc64 G5 config, and I don't believe
> > the extra few cycles for the occasional extra load would be measurable
> > (they should all hit in the L1 dcache).  I don't mind if x86[-64] have
> > atomic_read/set be nonvolatile and find all the missing barriers, but
> > for now on powerpc, I think that not having to find those missing
> > barriers is worth the 0.00076% increase in kernel text size.
> 
> BTW, the sort of missing barriers that triggered this thread
> aren't that subtle.  It'll result in a simple lock-up if the
> loop condition holds upon entry.  At which point it's fairly
> straightforward to find the culprit.

Not necessarily. A barrier-less buggy code such as below:

	atomic_set(&v, 0);

	... /* some initial code */

	while (atomic_read(&v))
		;

	... /* code that MUST NOT be executed unless v becomes non-zero */

(where v->counter is has no volatile access semantics)

could be generated by the compiler to simply *elid* or *do away* with
the loop itself, thereby making the:

"/* code that MUST NOT be executed unless v becomes non-zero */"

to be executed even when v is zero! That is subtle indeed, and causes
no hard lockups.

Granted, the above IS buggy code. But, the stated objective is to avoid
heisenbugs. And we have driver / subsystem maintainers such as Stefan
coming up and admitting that often a lot of code that's written to use
atomic_read() does assume the read will not be elided by the compiler.

See, I agree, "volatility" semantics != what we often want. However, if
what we want is compiler barrier, for only the object under consideration,
"volatility" semantics aren't really "nonsensical" or anything.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Geert Uytterhoeven @ 2007-08-17  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Nick Piggin, Segher Boessenkool, heiko.carstens,
	horms, linux-kernel, rpjday, ak, netdev, cfriesen, akpm,
	jesper.juhl, linux-arch, zlynx, satyam, clameter, schwidefsky,
	Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708162033400.30176@woody.linux-foundation.org>

On Thu, 16 Aug 2007, Linus Torvalds wrote:
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
> 
> One of the things that "volatile" generally screws up is a simple
> 
> 	volatile int i;
> 
> 	i++;
> 
> which a compiler will generally get horribly, horribly wrong.
> 
> In a reasonable world, gcc should just make that be (on x86)
> 
> 	addl $1,i(%rip)
> 
> on x86-64, which is indeed what it does without the volatile. But with the 
> volatile, the compiler gets really nervous, and doesn't dare do it in one 
> instruction, and thus generates crap like
> 
>         movl    i(%rip), %eax
>         addl    $1, %eax
>         movl    %eax, i(%rip)
> 
> instead. For no good reason, except that "volatile" just doesn't have any 
> good/clear semantics for the compiler, so most compilers will just make it 
> be "I will not touch this access in any way, shape, or form". Including 
> even trivially correct instruction optimization/combination.

Apart from having to fetch more bytes for the instructions (which does
matter), execution time is probably the same on modern processors, as they
convert the single instruction to RISC-style load, modify, store anyway.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* [PATCH] phy layer: fix genphy_setup_forced (don't reset)
From: Domen Puncer @ 2007-08-17  6:54 UTC (permalink / raw)
  To: jeff; +Cc: netdev

Writing BMCR_RESET bit will reset MII_BMCR to default values. This is
clearly not what we want.


Signed-off-by: Domen Puncer <domen.puncer@telargo.com>

---
 drivers/net/phy/phy_device.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: work-powerpc.git/drivers/net/phy/phy_device.c
===================================================================
--- work-powerpc.git.orig/drivers/net/phy/phy_device.c
+++ work-powerpc.git/drivers/net/phy/phy_device.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(genphy_config_advert);
  */
 int genphy_setup_forced(struct phy_device *phydev)
 {
-	int ctl = BMCR_RESET;
+	int ctl = 0;
 
 	phydev->pause = phydev->asym_pause = 0;
 

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  7:26 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <alpine.LFD.0.999.0708171016160.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:

> #define atomic_read_volatile(v)				\
> 	({						\
> 		forget((v)->counter);			\
> 		((v)->counter);				\
> 	})
> 
> where:

*vomit* :)

Not only do I hate the keyword volatile, but the barrier is only a
one-sided affair so its probable this is going to have slightly
different allowed reorderings than a real volatile access.

Also, why would you want to make these insane accessors for atomic_t
types? Just make sure everybody knows the basics of barriers, and they
can apply that knowledge to atomic_t and all other lockless memory
accesses as well.

> #define forget(a)	__asm__ __volatile__ ("" :"=m" (a) :"m" (a))

I like order(x) better, but it's not the most perfect name either.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  7:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Stefan Richter, Paul Mackerras, Christoph Lameter,
	Chris Snook, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <20070817010108.GJ16957@linux.vnet.ibm.com>



On Thu, 16 Aug 2007, Paul E. McKenney wrote:

> On Fri, Aug 17, 2007 at 07:59:02AM +0800, Herbert Xu wrote:
> > On Thu, Aug 16, 2007 at 09:34:41AM -0700, Paul E. McKenney wrote:
> > >
> > > The compiler can also reorder non-volatile accesses.  For an example
> > > patch that cares about this, please see:
> > > 
> > > 	http://lkml.org/lkml/2007/8/7/280
> > > 
> > > This patch uses an ORDERED_WRT_IRQ() in rcu_read_lock() and
> > > rcu_read_unlock() to ensure that accesses aren't reordered with respect
> > > to interrupt handlers and NMIs/SMIs running on that same CPU.
> > 
> > Good, finally we have some code to discuss (even though it's
> > not actually in the kernel yet).
> 
> There was some earlier in this thread as well.

Hmm, I never quite got what all this interrupt/NMI/SMI handling and
RCU business you mentioned earlier was all about, but now that you've
pointed to the actual code and issues with it ...


> > First of all, I think this illustrates that what you want
> > here has nothing to do with atomic ops.  The ORDERED_WRT_IRQ
> > macro occurs a lot more times in your patch than atomic
> > reads/sets.  So *assuming* that it was necessary at all,
> > then having an ordered variant of the atomic_read/atomic_set
> > ops could do just as well.
> 
> Indeed.  If I could trust atomic_read()/atomic_set() to cause the compiler
> to maintain ordering, then I could just use them instead of having to
> create an  ORDERED_WRT_IRQ().  (Or ACCESS_ONCE(), as it is called in a
> different patch.)

+#define WHATEVER(x)	(*(volatile typeof(x) *)&(x))

I suppose one could want volatile access semantics for stuff that's
a bit-field too, no?

Also, this gives *zero* "re-ordering" guarantees that your code wants
as you've explained it below) -- neither w.r.t. CPU re-ordering (which
probably you don't care about) *nor* w.r.t. compiler re-ordering
(which you definitely _do_ care about).


> > However, I still don't know which atomic_read/atomic_set in
> > your patch would be broken if there were no volatile.  Could
> > you please point them out?
> 
> Suppose I tried replacing the ORDERED_WRT_IRQ() calls with
> atomic_read() and atomic_set().  Starting with __rcu_read_lock():
> 
> o	If "ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++"
> 	was ordered by the compiler after
> 	"ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1", then
> 	suppose an NMI/SMI happened after the rcu_read_lock_nesting but
> 	before the rcu_flipctr.
> 
> 	Then if there was an rcu_read_lock() in the SMI/NMI
> 	handler (which is perfectly legal), the nested rcu_read_lock()
> 	would believe that it could take the then-clause of the
> 	enclosing "if" statement.  But because the rcu_flipctr per-CPU
> 	variable had not yet been incremented, an RCU updater would
> 	be within its rights to assume that there were no RCU reads
> 	in progress, thus possibly yanking a data structure out from
> 	under the reader in the SMI/NMI function.
> 
> 	Fatal outcome.  Note that only one CPU is involved here
> 	because these are all either per-CPU or per-task variables.

Ok, so you don't care about CPU re-ordering. Still, I should let you know
that your ORDERED_WRT_IRQ() -- bad name, btw -- is still buggy. What you
want is a full compiler optimization barrier().

[ Your code probably works now, and emits correct code, but that's
  just because of gcc did what it did. Nothing in any standard,
  or in any documented behaviour of gcc, or anything about the real
  (or expected) semantics of "volatile" is protecting the code here. ]


> o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1"
> 	was ordered by the compiler to follow the
> 	"ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx", and an NMI/SMI
> 	happened between the two, then an __rcu_read_lock() in the NMI/SMI
> 	would incorrectly take the "else" clause of the enclosing "if"
> 	statement.  If some other CPU flipped the rcu_ctrlblk.completed
> 	in the meantime, then the __rcu_read_lock() would (correctly)
> 	write the new value into rcu_flipctr_idx.
> 
> 	Well and good so far.  But the problem arises in
> 	__rcu_read_unlock(), which then decrements the wrong counter.
> 	Depending on exactly how subsequent events played out, this could
> 	result in either prematurely ending grace periods or never-ending
> 	grace periods, both of which are fatal outcomes.
> 
> And the following are not needed in the current version of the
> patch, but will be in a future version that either avoids disabling
> irqs or that dispenses with the smp_read_barrier_depends() that I
> have 99% convinced myself is unneeded:
> 
> o	nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting);
> 
> o	idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1;
> 
> Furthermore, in that future version, irq handlers can cause the same
> mischief that SMI/NMI handlers can in this version.
> 
> Next, looking at __rcu_read_unlock():
> 
> o	If "ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1"
> 	was reordered by the compiler to follow the
> 	"ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--",
> 	then if an NMI/SMI containing an rcu_read_lock() occurs between
> 	the two, this nested rcu_read_lock() would incorrectly believe
> 	that it was protected by an enclosing RCU read-side critical
> 	section as described in the first reversal discussed for
> 	__rcu_read_lock() above.  Again, fatal outcome.
> 
> This is what we have now.  It is not hard to imagine situations that
> interact with -both- interrupt handlers -and- other CPUs, as described
> earlier.

It's not about interrupt/SMI/NMI handlers at all! What you clearly want,
simply put, is that a certain stream of C statements must be emitted
by the compiler _as they are_ with no re-ordering optimizations! You must
*definitely* use barrier(), IMHO.


Satyam

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Stefan Richter @ 2007-08-17  7:25 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Herbert Xu, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C512EB.7020603@yahoo.com.au>

Nick Piggin wrote:
> I don't know why people would assume volatile of atomics. AFAIK, most
> of the documentation is pretty clear that all the atomic stuff can be
> reordered etc. except for those that modify and return a value.

Which documentation is there?

For driver authors, there is LDD3.  It doesn't specifically cover
effects of optimization on accesses to atomic_t.

For architecture port authors, there is Documentation/atomic_ops.txt.
Driver authors also can learn something from that document, as it
indirectly documents the atomic_t and bitops APIs.

Prompted by this thread, I reread this document, and indeed, the
sentence "Unlike the above routines, it is required that explicit memory
barriers are performed before and after [atomic_{inc,dec}_return]"
indicates that atomic_read (one of the "above routines") is very
different from all other atomic_t accessors that return values.

This is strange.  Why is it that atomic_read stands out that way?  IMO
this API imbalance is quite unexpected by many people.  Wouldn't it be
beneficial to change the atomic_read API to behave the same like all
other atomic_t accessors that return values?

OK, it is also different from the other accessors that return data in so
far as it doesn't modify the data.  But as driver "author", i.e. user of
the API, I can't see much use of an atomic_read that can be reordered
and, more importantly, can be optimized away by the compiler.  Sure, now
that I learned of these properties I can start to audit code and insert
barriers where I believe they are needed, but this simply means that
almost all occurrences of atomic_read will get barriers (unless there
already are implicit but more or less obvious barriers like msleep).
-- 
Stefan Richter
-=====-=-=== =--- =---=
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH] lockdep: annotate rcu_read_{,un}lock()
From: Peter Zijlstra @ 2007-08-17  7:56 UTC (permalink / raw)
  To: paulmck
  Cc: Ingo Molnar, herbert, 123.oleg, netdev, linux-kernel,
	David Miller, Daniel Walker, josht
In-Reply-To: <20070816160145.GA16957@linux.vnet.ibm.com>

On Thu, 2007-08-16 at 09:01 -0700, Paul E. McKenney wrote:
> On Thu, Aug 16, 2007 at 04:25:07PM +0200, Peter Zijlstra wrote:
> > 
> > There seem to be some unbalanced rcu_read_{,un}lock() issues of late,
> > how about doing something like this:
> 
> This will break when rcu_read_lock() and rcu_read_unlock() are invoked
> from NMI/SMI handlers -- the raw_local_irq_save() in lock_acquire() will
> not mask NMIs or SMIs.
> 
> One approach would be to check for being in an NMI/SMI handler, and
> to avoid calling lock_acquire() and lock_release() in those cases.

It seems:

#define nmi_enter()		do { lockdep_off(); __irq_enter(); } while (0)
#define nmi_exit()		do { __irq_exit(); lockdep_on(); } while (0)

Should make it all work out just fine. (for NMIs at least, /me fully
ignorant of the workings of SMIs)

> Another approach would be to use sparse, which has checks for
> rcu_read_lock()/rcu_read_unlock() nesting.

Yeah, but one more method can never hurt, no? :-)



^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  8:06 UTC (permalink / raw)
  To: Stefan Richter
  Cc: paulmck, Herbert Xu, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C54D74.60101@s5r6.in-berlin.de>

Stefan Richter wrote:
> Nick Piggin wrote:
> 
>>I don't know why people would assume volatile of atomics. AFAIK, most
>>of the documentation is pretty clear that all the atomic stuff can be
>>reordered etc. except for those that modify and return a value.
> 
> 
> Which documentation is there?

Documentation/atomic_ops.txt

> For driver authors, there is LDD3.  It doesn't specifically cover
> effects of optimization on accesses to atomic_t.
> 
> For architecture port authors, there is Documentation/atomic_ops.txt.
> Driver authors also can learn something from that document, as it
> indirectly documents the atomic_t and bitops APIs.
>

"Semantics and Behavior of Atomic and Bitmask Operations" is
pretty direct :)

Sure, it says that it's for arch maintainers, but there is no
reason why users can't make use of it.

> Prompted by this thread, I reread this document, and indeed, the
> sentence "Unlike the above routines, it is required that explicit memory
> barriers are performed before and after [atomic_{inc,dec}_return]"
> indicates that atomic_read (one of the "above routines") is very
> different from all other atomic_t accessors that return values.
> 
> This is strange.  Why is it that atomic_read stands out that way?  IMO

It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set.

> this API imbalance is quite unexpected by many people.  Wouldn't it be
> beneficial to change the atomic_read API to behave the same like all
> other atomic_t accessors that return values?

It is very consistent and well defined. Operations which both modify
the data _and_ return something are defined to have full barriers
before and after.

What do you want to add to the other atomic accessors? Full memory
barriers? Only compiler barriers? It's quite likely that if you think
some barriers will fix bugs, then there are other bugs lurking there
anyway.

Just use spinlocks if you're not absolutely clear about potential
races and memory ordering issues -- they're pretty cheap and simple.

> OK, it is also different from the other accessors that return data in so
> far as it doesn't modify the data.  But as driver "author", i.e. user of
> the API, I can't see much use of an atomic_read that can be reordered
> and, more importantly, can be optimized away by the compiler.

It will return to you an atomic snapshot of the data (loaded from
memory at some point since the last compiler barrier). All you have
to be aware of compiler barriers and the Linux SMP memory ordering
model, which should be a given if you are writing lockless code.

> Sure, now
> that I learned of these properties I can start to audit code and insert
> barriers where I believe they are needed, but this simply means that
> almost all occurrences of atomic_read will get barriers (unless there
> already are implicit but more or less obvious barriers like msleep).

You might find that these places that appear to need barriers are
buggy for other reasons anyway. Can you point to some in-tree code
we can have a look at?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
From: Eric Dumazet @ 2007-08-17  8:10 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20070816233339.GA11594@gondor.apana.org.au>

On Fri, 17 Aug 2007 07:33:39 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Thu, Aug 16, 2007 at 05:40:44PM +0200, Eric Dumazet wrote:
> > 
> > So do you think this patch is enough or should we convert dst_run_gc processing from softirq to workqueue too ?
> 
> I think a workqueue would be the best solution since with
> that you wouldn't have to worry about processing things in
> chunks.

Will a workqueue react the same in case of a DDOS situation, 
where softirq could use all CPU cycles to handle incoming
packets and feed the GC list, and GC would never
have a chance to scan and free some items ?

About chunk processing, I did it on purpose, to not throw away
all CPU cache. Goal is to process entries, but not all of them
in a row, especially if we find many yet referenced entries
(and thus not candidates to freeing)

Thanks

^ permalink raw reply

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
From: Herbert Xu @ 2007-08-17  8:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <20070817101030.cc64467a.dada1@cosmosbay.com>

On Fri, Aug 17, 2007 at 10:10:30AM +0200, Eric Dumazet wrote:
>
> Will a workqueue react the same in case of a DDOS situation, 
> where softirq could use all CPU cycles to handle incoming
> packets and feed the GC list, and GC would never
> have a chance to scan and free some items ?

Well when that happens the softirqs will be deferred to
ksoftirqd which should share the CPU fairly with the
workqueue.

> About chunk processing, I did it on purpose, to not throw away
> all CPU cache. Goal is to process entries, but not all of them
> in a row, especially if we find many yet referenced entries
> (and thus not candidates to freeing)

I agree that chunks are desirable for a timer since you'd
be hogging the CPU otherwise.  However, if you went to a
workqueue then it's less of a concern and would simplify
things.  In particular, you won't have to pick a good
chunk size :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  8:28 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Herbert Xu, Stefan Richter, Christoph Lameter, Paul E. McKenney,
	Chris Snook, Linux Kernel Mailing List, linux-arch,
	Linus Torvalds, netdev, Andrew Morton, ak, heiko.carstens, davem,
	schwidefsky, wensong, horms, wjiang, cfriesen, zlynx, rpjday,
	jesper.juhl, segher
In-Reply-To: <18117.13576.93583.222760@cargo.ozlabs.ibm.com>

On Fri, 17 Aug 2007, Paul Mackerras wrote:

> Herbert Xu writes:
> 
> > On Fri, Aug 17, 2007 at 03:09:57PM +1000, Paul Mackerras wrote:
> > > Herbert Xu writes:
> > > 
> > > > Can you find an actual atomic_read code snippet there that is
> > > > broken without the volatile modifier?
> > > 
> > > There are some in arch-specific code, for example line 1073 of
> > > arch/mips/kernel/smtc.c.  On mips, cpu_relax() is just barrier(), so
> > > the empty loop body is ok provided that atomic_read actually does the
> > > load each time around the loop.
> > 
> > A barrier() is all you need to force the compiler to reread
> > the value.
> > 
> > The people advocating volatile in this thread are talking
> > about code that doesn't use barrier()/cpu_relax().
> 
> Did you look at it?  Here it is:
> 
> 	/* Someone else is initializing in parallel - let 'em finish */
> 	while (atomic_read(&idle_hook_initialized) < 1000)
> 		;

Honestly, this thread is suffering from HUGE communication gaps.

What Herbert (obviously) meant there was that "this loop could've
been okay _without_ using volatile-semantics-atomic_read() also, if
only it used cpu_relax()".

That does work, because cpu_relax() is _at least_ barrier() on all
archs (on some it also emits some arch-dependent "pause" kind of
instruction).

Now, saying that "MIPS does not have such an instruction so I won't
use cpu_relax() for arch-dependent-busy-while-loops in arch/mips/"
sounds like a wrong argument, because: tomorrow, such arch's _may_
introduce such an instruction, so naturally, at that time we'd
change cpu_relax() appropriately (in reality, we would actually
*re-define* cpu_relax() and ensure that the correct version gets
pulled in depending on whether the callsite code is legacy or only
for the newer such CPUs of said arch, whatever), but loops such as
this would remain un-changed, because they never used cpu_relax()!

OTOH an argument that said the following would've made a stronger case:

"I don't want to use cpu_relax() because that's a full memory
clobber barrier() and I have loop-invariants / other variables
around in that code that I *don't* want the compiler to forget
just because it used cpu_relax(), and hence I will not use
cpu_relax() but instead make my atomic_read() itself have
"volatility" semantics. Not just that, but I will introduce a
cpu_relax_no_barrier() on MIPS, that would be a no-op #define
for now, but which may not be so forever, and continue to use
that in such busy loops."

In general, please read the thread-summary I've tried to do at:
http://lkml.org/lkml/2007/8/17/25
Feel free to continue / comment / correct stuff from there, there's
too much confusion and circular-arguments happening on this thread
otherwise.

[ I might've made an incorrect statement there about
  "volatile" w.r.t. cache on non-x86 archs, I think. ]

Satyam

^ permalink raw reply

* Re: [RFC] net/core/dst.c : Should'nt dst_run_gc() be more scalable and friendly ?
From: Eric Dumazet @ 2007-08-17  8:31 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20070817081522.GA16720@gondor.apana.org.au>

On Fri, 17 Aug 2007 16:15:22 +0800
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Aug 17, 2007 at 10:10:30AM +0200, Eric Dumazet wrote:
> >
> > Will a workqueue react the same in case of a DDOS situation, 
> > where softirq could use all CPU cycles to handle incoming
> > packets and feed the GC list, and GC would never
> > have a chance to scan and free some items ?
> 
> Well when that happens the softirqs will be deferred to
> ksoftirqd which should share the CPU fairly with the
> workqueue.

Thats nice :)

I'll code a workqueue based thing in about 10 days after my hollidays,
and perform DOS tests as well.

Thanks for the feedback.

> 
> > About chunk processing, I did it on purpose, to not throw away
> > all CPU cache. Goal is to process entries, but not all of them
> > in a row, especially if we find many yet referenced entries
> > (and thus not candidates to freeing)
> 
> I agree that chunks are desirable for a timer since you'd
> be hogging the CPU otherwise.  However, if you went to a
> workqueue then it's less of a concern and would simplify
> things.  In particular, you won't have to pick a good
> chunk size :)

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  8:47 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <46C54D94.5080803@yahoo.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> 
> > #define atomic_read_volatile(v)				\
> > 	({						\
> > 		forget((v)->counter);			\
> > 		((v)->counter);				\
> > 	})
> > 
> > where:
> 
> *vomit* :)

I wonder if this'll generate smaller and better code than _both_ the
other atomic_read_volatile() variants. Would need to build allyesconfig
on lots of diff arch's etc to test the theory though.

> Not only do I hate the keyword volatile, but the barrier is only a
> one-sided affair so its probable this is going to have slightly
> different allowed reorderings than a real volatile access.

True ...

> Also, why would you want to make these insane accessors for atomic_t
> types? Just make sure everybody knows the basics of barriers, and they
> can apply that knowledge to atomic_t and all other lockless memory
> accesses as well.

Code that looks like:

	while (!atomic_read(&v)) {
		...
		cpu_relax_no_barrier();
		forget(v.counter);
		        ^^^^^^^^
	}

would be uglier. Also think about code such as:

	a = atomic_read();
	if (!a)
		do_something();

	forget();
	a = atomic_read();
	... /* some code that depends on value of a, obviously */

	forget();
	a = atomic_read();
	...

So much explicit sprinkling of "forget()" looks ugly.

	atomic_read_volatile()

on the other hand, looks neater. The "_volatile()" suffix makes it also
no less explicit than an explicit barrier-like macro that this primitive
is something "special", for code clarity purposes.

> > #define forget(a)	__asm__ __volatile__ ("" :"=m" (a) :"m" (a))
> 
> I like order(x) better, but it's not the most perfect name either.

forget(x) is just a stupid-placeholder-for-a-better-name. order(x) sounds
good but we could leave quibbling about function or macro names for later,
this thread is noisy as it is :-)

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  8:38 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171142550.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:
> 
> On Fri, 17 Aug 2007, Herbert Xu wrote:
> 
> 
>>On Fri, Aug 17, 2007 at 01:43:27PM +1000, Paul Mackerras wrote:
>>
>>BTW, the sort of missing barriers that triggered this thread
>>aren't that subtle.  It'll result in a simple lock-up if the
>>loop condition holds upon entry.  At which point it's fairly
>>straightforward to find the culprit.
> 
> 
> Not necessarily. A barrier-less buggy code such as below:
> 
> 	atomic_set(&v, 0);
> 
> 	... /* some initial code */
> 
> 	while (atomic_read(&v))
> 		;
> 
> 	... /* code that MUST NOT be executed unless v becomes non-zero */
> 
> (where v->counter is has no volatile access semantics)
> 
> could be generated by the compiler to simply *elid* or *do away* with
> the loop itself, thereby making the:
> 
> "/* code that MUST NOT be executed unless v becomes non-zero */"
> 
> to be executed even when v is zero! That is subtle indeed, and causes
> no hard lockups.

Then I presume you mean

while (!atomic_read(&v))
     ;

Which is just the same old infinite loop bug solved with cpu_relax().
These are pretty trivial to audit and fix, and also to debug, I would
think.

> Granted, the above IS buggy code. But, the stated objective is to avoid
> heisenbugs.

Anyway, why are you making up code snippets that are buggy in other
ways in order to support this assertion being made that lots of kernel
code supposedly depends on volatile semantics. Just reference the
actual code.

> And we have driver / subsystem maintainers such as Stefan
> coming up and admitting that often a lot of code that's written to use
> atomic_read() does assume the read will not be elided by the compiler.

So these are broken on i386 and x86-64?

Are they definitely safe on SMP and weakly ordered machines with
just a simple compiler barrier there? Because I would not be
surprised if there are a lot of developers who don't really know
what to assume when it comes to memory ordering issues.

This is not a dig at driver writers: we still have memory ordering
problems in the VM too (and probably most of the subtle bugs in
lockless VM code are memory ordering ones). Let's not make up a
false sense of security and hope that sprinkling volatile around
will allow people to write bug-free lockless code. If a writer
can't be bothered reading API documentation and learning the Linux
memory model, they can still be productive writing safely locked
code.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [GENETLINK]: Question: global lock (genl_mutex) possible refinement?
From: Richard MUSIL @ 2007-08-17  8:38 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Patrick McHardy, netdev
In-Reply-To: <20070816155819.GE32236@postel.suug.ch>

Thomas Graf wrote:
>> @@ -150,9 +176,9 @@ int genl_register_ops(struct genl_family *family, struct genl_ops *ops)
>>  	if (ops->policy)
>>  		ops->flags |= GENL_CMD_CAP_HASPOL;
>>  
>> -	genl_lock();
>> +	genl_fam_lock(family);
>>  	list_add_tail(&ops->ops_list, &family->ops_list);
>> -	genl_unlock();
>> +	genl_fam_unlock(family);
> 
> For registering operations, it is sufficient to just acquire the
> family lock, the family itself can't disappear while holding it.

I agree.

>> @@ -303,38 +332,57 @@ static int genl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
>>  	struct genlmsghdr *hdr = nlmsg_data(nlh);
>>  	int hdrlen, err;
>>  
>> +	genl_fam_lock(NULL);
>>  	family = genl_family_find_byid(nlh->nlmsg_type);
>> -	if (family == NULL)
>> +	if (family == NULL) {
>> +		genl_fam_unlock(NULL);
>>  		return -ENOENT;
>> +	}
>> +
>> +	/* get particular family lock, but release global family lock
>> +	 * so registering operations for other families are possible */
>> +	genl_onefam_lock(family);
>> +	genl_fam_unlock(NULL);
> 
> I don't like having two locks for something as trivial as this.
> Basically the only reason the global lock is required here is to
> protect from family removal which can be avoided otherwise by
> using RCU list operations.
> 
> Therefore, I'd propose the following lock semantics:
> Use own global mutex to protect writing to the family list, make
> reading side lockless using rcu for use when looking up family
> upon mesage processing. Use a family lock to protect writing to
> operations list and serialize messae processing with unregister
> operations.

I was not aware of RCU lists, but after looking at them, I consider your
proposal to be better. I guess, you would rather write the patch
yourself, so I will wait.

Thanks for help,
Richard

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Andi Kleen @ 2007-08-17  8:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Mackerras, Nick Piggin, Segher Boessenkool, heiko.carstens,
	horms, linux-kernel, rpjday, netdev, cfriesen, akpm, jesper.juhl,
	linux-arch, zlynx, satyam, clameter, schwidefsky, Chris Snook,
	Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708162033400.30176@woody.linux-foundation.org>

On Friday 17 August 2007 05:42, Linus Torvalds wrote:
> On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > and it only saved 40 bytes (10 instructions) for a G5 config.
>
> One of the things that "volatile" generally screws up is a simple
>
> 	volatile int i;
>
> 	i++;

But for atomic_t people use atomic_inc() anyways which does this correctly.
It shouldn't really matter for atomic_t.

I'm worrying a bit that the volatile atomic_t change caused subtle code 
breakage like these delay read loops people here pointed out.
Wouldn't it be safer to just re-add the volatile to atomic_read() 
for 2.6.23? Or alternatively make it asm(), but volatile seems more
proven.

-Andi

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  8:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C556F1.8000407@yahoo.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Stefan Richter wrote:
> [...]
> Just use spinlocks if you're not absolutely clear about potential
> races and memory ordering issues -- they're pretty cheap and simple.

I fully agree with this. As Paul Mackerras mentioned elsewhere,
a lot of authors sprinkle atomic_t in code thinking they're somehow
done with *locking*. This is sad, and I wonder if it's time for a
Documentation/atomic-considered-dodgy.txt kind of document :-)

> > Sure, now
> > that I learned of these properties I can start to audit code and insert
> > barriers where I believe they are needed, but this simply means that
> > almost all occurrences of atomic_read will get barriers (unless there
> > already are implicit but more or less obvious barriers like msleep).
> 
> You might find that these places that appear to need barriers are
> buggy for other reasons anyway. Can you point to some in-tree code
> we can have a look at?

Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
that managed to escape the requirement for a barrier only because of
some completely un-obvious compilation-unit-scope thing. But I find such
an non-explicit barrier quite bad taste. Stefan, do consider plunking an
explicit call to barrier() there.

Satyam

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17  9:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C55E90.7010407@yahoo.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> [...]
> > Granted, the above IS buggy code. But, the stated objective is to avoid
> > heisenbugs.
    ^^^^^^^^^^

> Anyway, why are you making up code snippets that are buggy in other
> ways in order to support this assertion being made that lots of kernel
> code supposedly depends on volatile semantics. Just reference the
> actual code.

Because the point is *not* about existing bugs in kernel code. At some
point Chris Snook (who started this thread) did write that "If I knew
of the existing bugs in the kernel, I would be sending patches for them,
not this series" or something to that effect.

The point is about *author expecations*. If people do expect atomic_read()
(or a variant thereof) to have volatile semantics, why not give them such
a variant?

And by the way, the point is *also* about the fact that cpu_relax(), as
of today, implies a full memory clobber, which is not what a lot of such
loops want. (due to stuff mentioned elsewhere, summarized in that summary)

> > And we have driver / subsystem maintainers such as Stefan
> > coming up and admitting that often a lot of code that's written to use
> > atomic_read() does assume the read will not be elided by the compiler.
                                                             ^^^^^^^^^^^^^

(so it's about compiler barrier expectations only, though I fully agree
that those who're using atomic_t as if it were some magic thing that lets
them write lockless code are sorrily mistaken.)

> So these are broken on i386 and x86-64?

Possibly, but the point is not about existing bugs, as mentioned above.

Some such bugs have been found nonetheless -- reminds me, can somebody
please apply http://www.gossamer-threads.com/lists/linux/kernel/810674 ?

Satyam

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  9:15 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171420060.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:
> 
> On Fri, 17 Aug 2007, Nick Piggin wrote:

>>>Sure, now
>>>that I learned of these properties I can start to audit code and insert
>>>barriers where I believe they are needed, but this simply means that
>>>almost all occurrences of atomic_read will get barriers (unless there
>>>already are implicit but more or less obvious barriers like msleep).
>>
>>You might find that these places that appear to need barriers are
>>buggy for other reasons anyway. Can you point to some in-tree code
>>we can have a look at?
> 
> 
> Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
> that managed to escape the requirement for a barrier only because of
> some completely un-obvious compilation-unit-scope thing. But I find such
> an non-explicit barrier quite bad taste. Stefan, do consider plunking an
> explicit call to barrier() there.

It is very obvious. msleep calls schedule() (ie. sleeps), which is
always a barrier.

The "unobvious" thing is that you wanted to know how the compiler knows
a function is a barrier -- answer is that if it does not *know* it is not
a barrier, it must assume it is a barrier. If the whole msleep call chain
including the scheduler were defined static in the current compilation
unit, then it would still be a barrier because it would actually be able
to see the barriers in schedule(void), if nothing else.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  9:15 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <alpine.LFD.0.999.0708171358520.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:
> 
> On Fri, 17 Aug 2007, Nick Piggin wrote:

>>Also, why would you want to make these insane accessors for atomic_t
>>types? Just make sure everybody knows the basics of barriers, and they
>>can apply that knowledge to atomic_t and all other lockless memory
>>accesses as well.
> 
> 
> Code that looks like:
> 
> 	while (!atomic_read(&v)) {
> 		...
> 		cpu_relax_no_barrier();
> 		forget(v.counter);
> 		        ^^^^^^^^
> 	}
> 
> would be uglier. Also think about code such as:

I think they would both be equally ugly, but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.

And it would be more ugly than introducing an order(x) statement for
all memory operations, and adding an order_atomic() wrapper for it
for atomic types.

> 	a = atomic_read();
> 	if (!a)
> 		do_something();
> 
> 	forget();
> 	a = atomic_read();
> 	... /* some code that depends on value of a, obviously */
> 
> 	forget();
> 	a = atomic_read();
> 	...
> 
> So much explicit sprinkling of "forget()" looks ugly.

Firstly, why is it ugly? It's nice because of those nice explicit
statements there that give us a good heads up and would have some
comments attached to them (also, lack of the word "volatile" is
always a plus).

Secondly, what sort of code would do such a thing? In most cases,
it is probably riddled with bugs anyway (unless it is doing a
really specific sequence of interrupts or something, but in that
case it is very likely to either require locking or busy waits
anyway -> ie. barriers).

> on the other hand, looks neater. The "_volatile()" suffix makes it also
> no less explicit than an explicit barrier-like macro that this primitive
> is something "special", for code clarity purposes.

Just don't use the word volatile, and have barriers both before
and after the memory operation, and I'm OK with it. I don't see
the point though, when you could just have a single barrier(x)
barrier function defined for all memory locations, rather than
this odd thing that only works for atomics (and would have to
be duplicated for atomic_set.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox