Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  9:15 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <alpine.LFD.0.999.0708171358520.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:
> 
> On Fri, 17 Aug 2007, Nick Piggin wrote:

>>Also, why would you want to make these insane accessors for atomic_t
>>types? Just make sure everybody knows the basics of barriers, and they
>>can apply that knowledge to atomic_t and all other lockless memory
>>accesses as well.
> 
> 
> Code that looks like:
> 
> 	while (!atomic_read(&v)) {
> 		...
> 		cpu_relax_no_barrier();
> 		forget(v.counter);
> 		        ^^^^^^^^
> 	}
> 
> would be uglier. Also think about code such as:

I think they would both be equally ugly, but the atomic_read_volatile
variant would be more prone to subtle bugs because of the weird
implementation.

And it would be more ugly than introducing an order(x) statement for
all memory operations, and adding an order_atomic() wrapper for it
for atomic types.

> 	a = atomic_read();
> 	if (!a)
> 		do_something();
> 
> 	forget();
> 	a = atomic_read();
> 	... /* some code that depends on value of a, obviously */
> 
> 	forget();
> 	a = atomic_read();
> 	...
> 
> So much explicit sprinkling of "forget()" looks ugly.

Firstly, why is it ugly? It's nice because of those nice explicit
statements there that give us a good heads up and would have some
comments attached to them (also, lack of the word "volatile" is
always a plus).

Secondly, what sort of code would do such a thing? In most cases,
it is probably riddled with bugs anyway (unless it is doing a
really specific sequence of interrupts or something, but in that
case it is very likely to either require locking or busy waits
anyway -> ie. barriers).

> on the other hand, looks neater. The "_volatile()" suffix makes it also
> no less explicit than an explicit barrier-like macro that this primitive
> is something "special", for code clarity purposes.

Just don't use the word volatile, and have barriers both before
and after the memory operation, and I'm OK with it. I don't see
the point though, when you could just have a single barrier(x)
barrier function defined for all memory locations, rather than
this odd thing that only works for atomics (and would have to
be duplicated for atomic_set.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17  9:31 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171431150.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:
> 
> On Fri, 17 Aug 2007, Nick Piggin wrote:
> 
> 
>>Satyam Sharma wrote:
>>[...]
>>
>>>Granted, the above IS buggy code. But, the stated objective is to avoid
>>>heisenbugs.
> 
>     ^^^^^^^^^^
> 
> 
>>Anyway, why are you making up code snippets that are buggy in other
>>ways in order to support this assertion being made that lots of kernel
>>code supposedly depends on volatile semantics. Just reference the
>>actual code.
> 
> 
> Because the point is *not* about existing bugs in kernel code. At some
> point Chris Snook (who started this thread) did write that "If I knew
> of the existing bugs in the kernel, I would be sending patches for them,
> not this series" or something to that effect.
> 
> The point is about *author expecations*. If people do expect atomic_read()
> (or a variant thereof) to have volatile semantics, why not give them such
> a variant?

Because they should be thinking about them in terms of barriers, over
which the compiler / CPU is not to reorder accesses or cache memory
operations, rather than "special" "volatile" accesses. Linux's whole
memory ordering and locking model is completely geared around the
former.


> And by the way, the point is *also* about the fact that cpu_relax(), as
> of today, implies a full memory clobber, which is not what a lot of such
> loops want. (due to stuff mentioned elsewhere, summarized in that summary)

That's not the point, because as I also mentioned, the logical extention
to Linux's barrier API to handle this is the order(x) macro. Again, not
special volatile accessors.

^ permalink raw reply

* [PATCH 2.6.22.3] ppp: fix output buffer size in ppp_decompress_frame
From: Konstantin Sharlaimov @ 2007-08-17  9:45 UTC (permalink / raw)
  To: Suresh Mahalingam; +Cc: David Miller, paulus, netdev, linux-kernel

This patch addresses the issue with "osize too small" errors in mppe encryption.
The patch fixes the issue with wrong output buffer size being passed to ppp
decompression routine.

Signed-off-by: Konstantin Sharlaimov <konstantin.sharlaimov@gmail.com>
---
As pointed out by Suresh Mahalingam, the issue addressed by 
ppp-fix-osize-too-small-errors-when-decoding patch is not fully resolved yet.
The size of allocated output buffer is correct, however it size passed to
ppp->rcomp->decompress in ppp_generic.c if wrong. The patch fixes that.

--- linux-2.6.21.3/drivers/net/ppp_generic.c.orig	2007-08-17 20:35:03.000000000 +1100
+++ linux-2.6.21.3/drivers/net/ppp_generic.c	2007-08-17 20:35:45.000000000 +1100
@@ -1726,7 +1726,7 @@ ppp_decompress_frame(struct ppp *ppp, st
 		}
 		/* the decompressor still expects the A/C bytes in the hdr */
 		len = ppp->rcomp->decompress(ppp->rc_state, skb->data - 2,
-				skb->len + 2, ns->data, ppp->mru + PPP_HDRLEN);
+				skb->len + 2, ns->data, obuff_size);
 		if (len < 0) {
 			/* Pass the compressed frame to pppd as an
 			   error indication. */


^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Paul Mackerras @ 2007-08-17  9:48 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Nick Piggin, Linus Torvalds, Segher Boessenkool, heiko.carstens,
	horms, Linux Kernel Mailing List, rpjday, ak, netdev, cfriesen,
	Andrew Morton, jesper.juhl, linux-arch, zlynx, clameter,
	schwidefsky, Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <alpine.LFD.0.999.0708171358520.3666@enigma.security.iitk.ac.in>

Satyam Sharma writes:

> I wonder if this'll generate smaller and better code than _both_ the
> other atomic_read_volatile() variants. Would need to build allyesconfig
> on lots of diff arch's etc to test the theory though.

I'm sure it would be a tiny effect.

This whole thread is arguing about effects that are quite
insignificant.  On the one hand we have the non-volatile proponents,
who want to let the compiler do extra optimizations - which amounts to
letting it elide maybe a dozen loads in the whole kernel, loads which
would almost always be L1 cache hits.

On the other hand we have the volatile proponents, who are concerned
that some code somewhere in the kernel might be buggy without the
volatile behaviour, and who also want to be able to remove some
barriers and thus save a few bytes of code and a few loads here and
there (and possibly some stores too).

Either way the effect on code size and execution time is miniscule.

In the end the strongest argument is actually that gcc generates
unnecessarily verbose code on x86[-64] for volatile accesses.  Even
then we're only talking about ~2000 bytes, or less than 1 byte per
instance of atomic_read on average, about 0.06% of the kernel text
size.

The x86[-64] developers seem to be willing to bear the debugging cost
involved in having the non-volatile behaviour for atomic_read, in
order to save the 2kB.  That's fine with me.  Either way I think
somebody should audit all the uses of atomic_read, not just for
missing barriers, but also to find the places where it's used in a
racy manner.  Then we can work out where the races matter and fix them
if they do.

Paul.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 10:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C5672E.4060003@cyberone.com.au>



On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > 
> > On Fri, 17 Aug 2007, Nick Piggin wrote:
> 
> > > > Sure, now
> > > > that I learned of these properties I can start to audit code and insert
> > > > barriers where I believe they are needed, but this simply means that
> > > > almost all occurrences of atomic_read will get barriers (unless there
> > > > already are implicit but more or less obvious barriers like msleep).
> > > 
> > > You might find that these places that appear to need barriers are
> > > buggy for other reasons anyway. Can you point to some in-tree code
> > > we can have a look at?
> > 
> > 
> > Such code was mentioned elsewhere (query nodemgr_host_thread in cscope)
> > that managed to escape the requirement for a barrier only because of
> > some completely un-obvious compilation-unit-scope thing. But I find such
> > an non-explicit barrier quite bad taste. Stefan, do consider plunking an
> > explicit call to barrier() there.
> 
> It is very obvious. msleep calls schedule() (ie. sleeps), which is
> always a barrier.

Probably you didn't mean that, but no, schedule() is not barrier because
it sleeps. It's a barrier because it's invisible.

> The "unobvious" thing is that you wanted to know how the compiler knows
> a function is a barrier -- answer is that if it does not *know* it is not
> a barrier, it must assume it is a barrier.

True, that's clearly what happens here. But are you're definitely joking
that this is "obvious" in terms of code-clarity, right?

Just 5 minutes back you mentioned elsewhere you like seeing lots of
explicit calls to barrier() (with comments, no less, hmm? :-)

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 10:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Paul Mackerras, Nick Piggin, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, netdev,
	cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx, clameter,
	schwidefsky, Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <200708171052.39477.ak@suse.de>

On Fri, 17 Aug 2007, Andi Kleen wrote:

> On Friday 17 August 2007 05:42, Linus Torvalds wrote:
> > On Fri, 17 Aug 2007, Paul Mackerras wrote:
> > > I'm really surprised it's as much as a few K.  I tried it on powerpc
> > > and it only saved 40 bytes (10 instructions) for a G5 config.
> >
> > One of the things that "volatile" generally screws up is a simple
> >
> > 	volatile int i;
> >
> > 	i++;
> 
> But for atomic_t people use atomic_inc() anyways which does this correctly.
> It shouldn't really matter for atomic_t.
> 
> I'm worrying a bit that the volatile atomic_t change caused subtle code 
> breakage like these delay read loops people here pointed out.

Umm, I followed most of the thread, but which breakage is this?

> Wouldn't it be safer to just re-add the volatile to atomic_read() 
> for 2.6.23? Or alternatively make it asm(), but volatile seems more
> proven.

The problem with volatile is not just trashy code generation (which also
definitely is a major problem), but definition holes, and implementation
inconsistencies. Making it asm() is not the only other alternative to
volatile either (read another reply to this mail), but considering most
of the thread has been about people not wanting even a
atomic_read_volatile() variant, making atomic_read() itself have volatile
semantics sounds ... strange :-)

PS: http://lkml.org/lkml/2007/8/15/407 was submitted a couple days back,
any word if you saw that?

I have another one for you:

[PATCH] i386, x86_64: __const_udelay() should not be marked inline

Because it can never get inlined in any callsite (each translation unit
is compiled separately for the kernel and so the implementation of
__const_udelay() would be invisible to all other callsites). In fact it
turns out, the correctness of callsites at arch/x86_64/kernel/crash.c:97
and arch/i386/kernel/crash.c:101 explicitly _depends_ upon it not being
inlined, and also it's an exported symbol (modules may want to call
mdelay() and udelay() that often becomes __const_udelay() after some
macro-ing in various headers). So let's not mark it as "inline" either.

Signed-off-by: Satyam Sharma <satyam@infradead.org>

---

 arch/i386/lib/delay.c   |    2 +-
 arch/x86_64/lib/delay.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/i386/lib/delay.c b/arch/i386/lib/delay.c
index f6edb11..0082c99 100644
--- a/arch/i386/lib/delay.c
+++ b/arch/i386/lib/delay.c
@@ -74,7 +74,7 @@ void __delay(unsigned long loops)
 	delay_fn(loops);
 }

-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
 	int d0;

diff --git a/arch/x86_64/lib/delay.c b/arch/x86_64/lib/delay.c
index 2dbebd3..d0cd9cd 100644
--- a/arch/x86_64/lib/delay.c
+++ b/arch/x86_64/lib/delay.c
@@ -38,7 +38,7 @@ void __delay(unsigned long loops)
 }
 EXPORT_SYMBOL(__delay);

-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
 {
 	__delay(((xloops * HZ * cpu_data[raw_smp_processor_id()].loops_per_jiffy) >> 32) + 1);
 }

^ permalink raw reply related

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 10:12 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <46C56725.6000207@cyberone.com.au>



On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > 
> > On Fri, 17 Aug 2007, Nick Piggin wrote:
> 
> > > Also, why would you want to make these insane accessors for atomic_t
> > > types? Just make sure everybody knows the basics of barriers, and they
> > > can apply that knowledge to atomic_t and all other lockless memory
> > > accesses as well.
> > 
> > 
> > Code that looks like:
> > 
> > 	while (!atomic_read(&v)) {
> > 		...
> > 		cpu_relax_no_barrier();
> > 		forget(v.counter);
> > 		        ^^^^^^^^
> > 	}
> > 
> > would be uglier. Also think about code such as:
> 
> I think they would both be equally ugly,

You think both these are equivalent in terms of "looks":

					|
while (!atomic_read(&v)) {		|	while (!atomic_read_xxx(&v)) {
	...				|		...
	cpu_relax_no_barrier();		|		cpu_relax_no_barrier();
	order_atomic(&v);		|	}
}					|

(where order_atomic() is an atomic_t
specific wrapper as you mentioned below)

?

Well, taste varies, but ...

> but the atomic_read_volatile
> variant would be more prone to subtle bugs because of the weird
> implementation.

What bugs?

> And it would be more ugly than introducing an order(x) statement for
> all memory operations, and adding an order_atomic() wrapper for it
> for atomic types.

Oh, that order() / forget() macro [forget() was named such by Chuck Ebbert
earlier in this thread where he first mentioned it, btw] could definitely
be generically introduced for any memory operations.

> > 	a = atomic_read();
> > 	if (!a)
> > 		do_something();
> > 
> > 	forget();
> > 	a = atomic_read();
> > 	... /* some code that depends on value of a, obviously */
> > 
> > 	forget();
> > 	a = atomic_read();
> > 	...
> > 
> > So much explicit sprinkling of "forget()" looks ugly.
> 
> Firstly, why is it ugly? It's nice because of those nice explicit
> statements there that give us a good heads up and would have some
> comments attached to them

atomic_read_xxx (where xxx = whatever naming sounds nice to you) would
obviously also give a heads up, and could also have some comments
attached to it.

> (also, lack of the word "volatile" is always a plus).

Ok, xxx != volatile.

> Secondly, what sort of code would do such a thing?

See the nodemgr_host_thread() that does something similar, though not
exactly same.

> > on the other hand, looks neater. The "_volatile()" suffix makes it also
> > no less explicit than an explicit barrier-like macro that this primitive
> > is something "special", for code clarity purposes.
> 
> Just don't use the word volatile,

That sounds amazingly frivolous, but hey, why not. As I said, ok,
xxx != volatile.

> and have barriers both before and after the memory operation,

How could that lead to bugs? (if you can point to existing code,
but just some testcase / sample code would be fine as well).

> [...] I don't see
> the point though, when you could just have a single barrier(x)
> barrier function defined for all memory locations,

As I said, barrier() is too heavy-handed.

> rather than
> this odd thing that only works for atomics

Why would it work only for atomics? You could use that generic macro
for anything you well damn please.

> (and would have to
> be duplicated for atomic_set.

#define atomic_set_xxx for something similar. Big deal ... NOT.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 10:23 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Nick Piggin, Linus Torvalds, Segher Boessenkool, heiko.carstens,
	horms, Linux Kernel Mailing List, rpjday, ak, netdev, cfriesen,
	Andrew Morton, jesper.juhl, linux-arch, zlynx, clameter,
	schwidefsky, Chris Snook, Herbert Xu, davem, wensong, wjiang
In-Reply-To: <18117.28409.943948.355586@cargo.ozlabs.ibm.com>



On Fri, 17 Aug 2007, Paul Mackerras wrote:

> Satyam Sharma writes:
> 
> > I wonder if this'll generate smaller and better code than _both_ the
> > other atomic_read_volatile() variants. Would need to build allyesconfig
> > on lots of diff arch's etc to test the theory though.
> 
> I'm sure it would be a tiny effect.
> 
> This whole thread is arguing about effects that are quite
> insignificant.

Hmm, the fact that this thread became what it did, probably means that
most developers on this list do not mind thinking/arguing about effects
or optimizations that are otherwise "tiny". But yeah, they are tiny
nonetheless.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 10:55 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C56ADF.8010501@cyberone.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > [...]
> > The point is about *author expecations*. If people do expect atomic_read()
> > (or a variant thereof) to have volatile semantics, why not give them such
> > a variant?
> 
> Because they should be thinking about them in terms of barriers, over
> which the compiler / CPU is not to reorder accesses or cache memory
> operations, rather than "special" "volatile" accesses.

This is obviously just a taste thing. Whether to have that forget(x)
barrier as something author should explicitly sprinkle appropriately
in appropriate places in the code by himself or use a primitive that
includes it itself.

I'm not saying "taste matters aren't important" (they are), but I'm really
skeptical if most folks would find the former tasteful.

> > And by the way, the point is *also* about the fact that cpu_relax(), as
> > of today, implies a full memory clobber, which is not what a lot of such
> > loops want. (due to stuff mentioned elsewhere, summarized in that summary)
> 
> That's not the point,

That's definitely the point, why not. This is why "barrier()", being
heavy-handed, is not the best option.

> because as I also mentioned, the logical extention
> to Linux's barrier API to handle this is the order(x) macro. Again, not
> special volatile accessors.

Sure, that forget(x) macro _is_ proposed to be made part of the generic
API. Doesn't explain why not to define/use primitives that has volatility
semantics in itself, though (taste matters apart).

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Stefan Richter @ 2007-08-17 10:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Herbert Xu, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C556F1.8000407@yahoo.com.au>

Nick Piggin wrote:
> Stefan Richter wrote:
>> For architecture port authors, there is Documentation/atomic_ops.txt.
>> Driver authors also can learn something from that document, as it
>> indirectly documents the atomic_t and bitops APIs.
> 
> "Semantics and Behavior of Atomic and Bitmask Operations" is
> pretty direct :)

"Indirect", "pretty direct"... It's subjective.

(It is not an API documentation; it is an implementation specification.)

> Sure, it says that it's for arch maintainers, but there is no
> reason why users can't make use of it.
> 
> 
>> Prompted by this thread, I reread this document, and indeed, the
>> sentence "Unlike the above routines, it is required that explicit memory
>> barriers are performed before and after [atomic_{inc,dec}_return]"
>> indicates that atomic_read (one of the "above routines") is very
>> different from all other atomic_t accessors that return values.
>> 
>> This is strange.  Why is it that atomic_read stands out that way?  IMO
> 
> It is not just atomic_read of course. It is atomic_add,sub,inc,dec,set.

Yes, but unlike these, atomic_read returns a value.

Without me (the API user) providing extra barriers, that value may
become something else whenever someone touches code in the vicinity of
the atomic_read.

>> this API imbalance is quite unexpected by many people.  Wouldn't it be
>> beneficial to change the atomic_read API to behave the same like all
>> other atomic_t accessors that return values?
> 
> It is very consistent and well defined. Operations which both modify
> the data _and_ return something are defined to have full barriers
> before and after.

You are right, atomic_read is not only different from accessors that
don't retunr values, it is also different from all other accessors that
return values (because they all also modify the value).  There is just
no actual API documentation, which contributes to the issue that some
people (or at least one: me) learn a little bit late how special
atomic_read is.

> What do you want to add to the other atomic accessors? Full memory
> barriers? Only compiler barriers? It's quite likely that if you think
> some barriers will fix bugs, then there are other bugs lurking there
> anyway.

A lot of different though related issues are discussed in this thread,
but I personally am only occupied by one particular thing:  What kind of
return values do I get from atomic_read.

> Just use spinlocks if you're not absolutely clear about potential
> races and memory ordering issues -- they're pretty cheap and simple.

Probably good advice, like generally if driver guys consider lockless
algorithms.

>> OK, it is also different from the other accessors that return data in so
>> far as it doesn't modify the data.  But as driver "author", i.e. user of
>> the API, I can't see much use of an atomic_read that can be reordered
>> and, more importantly, can be optimized away by the compiler.
> 
> It will return to you an atomic snapshot of the data (loaded from
> memory at some point since the last compiler barrier). All you have
> to be aware of compiler barriers and the Linux SMP memory ordering
> model, which should be a given if you are writing lockless code.

OK, that's what I slowly realized during this discussion, and I
appreciate the explanations that were given here.

>> Sure, now
>> that I learned of these properties I can start to audit code and insert
>> barriers where I believe they are needed, but this simply means that
>> almost all occurrences of atomic_read will get barriers (unless there
>> already are implicit but more or less obvious barriers like msleep).
> 
> You might find that these places that appear to need barriers are
> buggy for other reasons anyway. Can you point to some in-tree code
> we can have a look at?

I could, or could not, if I were through with auditing the code.  I
remembered one case and posted it (nodemgr_host_thread) which was safe
because msleep_interruptible provided the necessary barrier there, and
this implicit barrier is not in danger to be removed by future patches.
-- 
Stefan Richter
-=====-=-=== =--- =---=
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Stefan Richter @ 2007-08-17 10:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Herbert Xu, Paul Mackerras, Satyam Sharma,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C57D07.80604@s5r6.in-berlin.de>

I wrote:
> Nick Piggin wrote:
>> You might find that these places that appear to need barriers are
>> buggy for other reasons anyway. Can you point to some in-tree code
>> we can have a look at?
> 
> I could, or could not, if I were through with auditing the code.  I
> remembered one case and posted it (nodemgr_host_thread) which was safe
> because msleep_interruptible provided the necessary barrier there, and
> this implicit barrier is not in danger to be removed by future patches.

PS, just in case anybody holds his breath for more example code from me,
I don't plan to continue with an actual audit of the drivers I maintain.
It's an important issue, but my current time budget will restrict me to
look at it ad hoc, per case.  (Open bugs have higher priority than
potential bugs.)
-- 
Stefan Richter
-=====-=-=== =--- =---=
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Stefan Richter @ 2007-08-17 11:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Satyam Sharma, Herbert Xu, Paul Mackerras, Linus Torvalds,
	Christoph Lameter, Chris Snook, Ilpo Jarvinen, Paul E. McKenney,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C55E90.7010407@yahoo.com.au>

Nick Piggin wrote:
> Satyam Sharma wrote:
>> And we have driver / subsystem maintainers such as Stefan
>> coming up and admitting that often a lot of code that's written to use
>> atomic_read() does assume the read will not be elided by the compiler.
> 
> So these are broken on i386 and x86-64?

The ieee1394 and firewire subsystems have open, undiagnosed bugs, also
on i386 and x86-64.  But whether there is any bug because of wrong
assumptions about atomic_read among them, I don't know.  I don't know
which assumptions the authors made, I only know that I wasn't aware of
all the properties of atomic_read until now.

> Are they definitely safe on SMP and weakly ordered machines with
> just a simple compiler barrier there? Because I would not be
> surprised if there are a lot of developers who don't really know
> what to assume when it comes to memory ordering issues.
> 
> This is not a dig at driver writers: we still have memory ordering
> problems in the VM too (and probably most of the subtle bugs in
> lockless VM code are memory ordering ones). Let's not make up a
> false sense of security and hope that sprinkling volatile around
> will allow people to write bug-free lockless code. If a writer
> can't be bothered reading API documentation

...or, if there is none, the implementation specification (as in case of
the atomic ops), or, if there is none, the implementation (as in case of
a some infrastructure code here and there)...

> and learning the Linux memory model, they can still be productive
> writing safely locked code.

Provided they are aware that they might not have the full picture of the
lockless primitives.  :-)
-- 
Stefan Richter
-=====-=-=== =--- =---=
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17 11:50 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171521230.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:

>
>On Fri, 17 Aug 2007, Nick Piggin wrote:
>
>
>>Satyam Sharma wrote:
>>
>>It is very obvious. msleep calls schedule() (ie. sleeps), which is
>>always a barrier.
>>
>
>Probably you didn't mean that, but no, schedule() is not barrier because
>it sleeps. It's a barrier because it's invisible.
>

Where did I say it is a barrier because it sleeps?

It is always a barrier because, at the lowest level, schedule() (and thus
anything that sleeps) is defined to always be a barrier. Regardless of
whatever obscure means the compiler might need to infer the barrier.

In other words, you can ignore those obscure details because schedule() is
always going to have an explicit barrier in it.

>>The "unobvious" thing is that you wanted to know how the compiler knows
>>a function is a barrier -- answer is that if it does not *know* it is not
>>a barrier, it must assume it is a barrier.
>>
>
>True, that's clearly what happens here. But are you're definitely joking
>that this is "obvious" in terms of code-clarity, right?
>

No. If you accept that barrier() is implemented correctly, and you know
that sleeping is defined to be a barrier, then its perfectly clear. You
don't have to know how the compiler "knows" that some function contains
a barrier.

>Just 5 minutes back you mentioned elsewhere you like seeing lots of
>explicit calls to barrier() (with comments, no less, hmm? :-)
>

Sure, but there are well known primitives which contain barriers, and
trivial recognisable code sequences for which you don't need comments.
waiting-loops using sleeps or cpu_relax() are prime examples.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17 12:14 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <alpine.LFD.0.999.0708171505200.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:

>
>On Fri, 17 Aug 2007, Nick Piggin wrote:
>
>>I think they would both be equally ugly,
>>
>
>You think both these are equivalent in terms of "looks":
>
>					|
>while (!atomic_read(&v)) {		|	while (!atomic_read_xxx(&v)) {
>	...				|		...
>	cpu_relax_no_barrier();		|		cpu_relax_no_barrier();
>	order_atomic(&v);		|	}
>}					|
>
>(where order_atomic() is an atomic_t
>specific wrapper as you mentioned below)
>
>?
>

I think the LHS is better if your atomic_read_xxx primitive is using the
crazy one-sided barrier, because the LHS code you immediately know what
barriers are happening, and with the RHS you have to look at the 
atomic_read_xxx
definition.

If your atomic_read_xxx implementation was more intuitive, then both are
pretty well equal. More lines != ugly code.

>>but the atomic_read_volatile
>>variant would be more prone to subtle bugs because of the weird
>>implementation.
>>
>
>What bugs?
>

You can't think for yourself? Your atomic_read_volatile contains a compiler
barrier to the atomic variable before the load. 2 such reads from different
locations look like this:

asm volatile("" : "+m" (v1));
atomic_read(&v1);
asm volatile("" : "+m" (v2));
atomic_read(&v2);

Which implies that the load of v1 can be reordered to occur after the load
of v2. Bet you didn't expect that?

>>Secondly, what sort of code would do such a thing?
>>
>
>See the nodemgr_host_thread() that does something similar, though not
>exactly same.
>

I'm sorry, all this waffling about made up code which might do this and
that is just a waste of time. Seriously, the thread is bloated enough
and never going to get anywhere with all this handwaving. If someone is
saving up all the really real and actually good arguments for why we
must have a volatile here, now is the time to use them.

>>and have barriers both before and after the memory operation,
>>
>
>How could that lead to bugs? (if you can point to existing code,
>but just some testcase / sample code would be fine as well).
>

See above.

>As I said, barrier() is too heavy-handed.
>

Typo. I meant: defined for a single memory location (ie. order(x)).

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17 12:39 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171612440.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:

>
>On Fri, 17 Aug 2007, Nick Piggin wrote:
>
>
>>Because they should be thinking about them in terms of barriers, over
>>which the compiler / CPU is not to reorder accesses or cache memory
>>operations, rather than "special" "volatile" accesses.
>>
>
>This is obviously just a taste thing. Whether to have that forget(x)
>barrier as something author should explicitly sprinkle appropriately
>in appropriate places in the code by himself or use a primitive that
>includes it itself.
>

That's not obviously just taste to me. Not when the primitive has many
(perhaps, the majority) of uses that do not require said barriers. And
this is not solely about the code generation (which, as Paul says, is
relatively minor even on x86). I prefer people to think explicitly
about barriers in their lockless code.

>I'm not saying "taste matters aren't important" (they are), but I'm really
>skeptical if most folks would find the former tasteful.
>

So I /do/ have better taste than most folks? Thanks! :-)

>>>And by the way, the point is *also* about the fact that cpu_relax(), as
>>>of today, implies a full memory clobber, which is not what a lot of such
>>>loops want. (due to stuff mentioned elsewhere, summarized in that summary)
>>>
>>That's not the point,
>>
>
>That's definitely the point, why not. This is why "barrier()", being
>heavy-handed, is not the best option.
>

That is _not_ the point (of why a volatile atomic_read is good) because 
there
has already been an alternative posted that better conforms with Linux 
barrier
API and is much more widely useful and more usable. If you are so 
worried about
barrier() being too heavyweight, then you're off to a poor start by 
wanting to
add a few K of kernel text by making atomic_read volatile.

>>because as I also mentioned, the logical extention
>>to Linux's barrier API to handle this is the order(x) macro. Again, not
>>special volatile accessors.
>>
>
>Sure, that forget(x) macro _is_ proposed to be made part of the generic
>API. Doesn't explain why not to define/use primitives that has volatility
>semantics in itself, though (taste matters apart).
>

If you follow the discussion.... You were thinking of a reason why the
semantics *should* be changed or added, and I was rebutting your argument
that it must be used when a full barrier() is too heavy (ie. by pointing
out that order() has superior semantics anyway).

Why do I keep repeating the same things? I'll not continue bloating this
thread until a new valid point comes up...

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 12:50 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C58B93.5000408@cyberone.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > On Fri, 17 Aug 2007, Nick Piggin wrote:
> > > Satyam Sharma wrote:
> > > 
> > > It is very obvious. msleep calls schedule() (ie. sleeps), which is
> > > always a barrier.
> > 
> > Probably you didn't mean that, but no, schedule() is not barrier because
> > it sleeps. It's a barrier because it's invisible.
> 
> Where did I say it is a barrier because it sleeps?

Just below. What you wrote:

> It is always a barrier because, at the lowest level, schedule() (and thus
> anything that sleeps) is defined to always be a barrier.

"It is always a barrier because, at the lowest level, anything that sleeps
is defined to always be a barrier".

> Regardless of
> whatever obscure means the compiler might need to infer the barrier.
> 
> In other words, you can ignore those obscure details because schedule() is
> always going to have an explicit barrier in it.

I didn't quite understand what you said here, so I'll tell what I think:

* foo() is a compiler barrier if the definition of foo() is invisible to
  the compiler at a callsite.

* foo() is also a compiler barrier if the definition of foo() includes
  a barrier, and it is inlined at the callsite.

If the above is wrong, or if there's something else at play as well,
do let me know.

> > > The "unobvious" thing is that you wanted to know how the compiler knows
> > > a function is a barrier -- answer is that if it does not *know* it is not
> > > a barrier, it must assume it is a barrier.
> > 
> > True, that's clearly what happens here. But are you're definitely joking
> > that this is "obvious" in terms of code-clarity, right?
> 
> No. If you accept that barrier() is implemented correctly, and you know
> that sleeping is defined to be a barrier,

Curiously, that's the second time you've said "sleeping is defined to
be a (compiler) barrier". How does the compiler even know if foo() is
a function that "sleeps"? Do compilers have some notion of "sleeping"
to ensure they automatically assume a compiler barrier whenever such
a function is called? Or are you saying that the compiler can see the
barrier() inside said function ... nopes, you're saying quite the
opposite below.

> then its perfectly clear. You
> don't have to know how the compiler "knows" that some function contains
> a barrier.

I think I do, why not? Would appreciate if you could elaborate on this.

Satyam

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 13:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Paul Mackerras, Segher Boessenkool,
	heiko.carstens, horms, Linux Kernel Mailing List, rpjday, ak,
	netdev, cfriesen, Andrew Morton, jesper.juhl, linux-arch, zlynx,
	clameter, schwidefsky, Chris Snook, Herbert Xu, davem, wensong,
	wjiang
In-Reply-To: <46C5910A.6020904@cyberone.com.au>



On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> [...]
> > You think both these are equivalent in terms of "looks":
> > 
> > 					|
> > while (!atomic_read(&v)) {		|	while (!atomic_read_xxx(&v)) {
> > 	...				|		...
> > 	cpu_relax_no_barrier();		|
> > cpu_relax_no_barrier();
> > 	order_atomic(&v);		|	}
> > }					|
> > 
> > (where order_atomic() is an atomic_t
> > specific wrapper as you mentioned below)
> > 
> > ?
> 
> I think the LHS is better if your atomic_read_xxx primitive is using the
> crazy one-sided barrier,
  ^^^^^

I'd say it's purposefully one-sided.

> because the LHS code you immediately know what
> barriers are happening, and with the RHS you have to look at the
> atomic_read_xxx
> definition.

No. As I said, the _xxx (whatever the heck you want to name it as) should
give the same heads-up that your "order_atomic" thing is supposed to give.


> If your atomic_read_xxx implementation was more intuitive, then both are
> pretty well equal. More lines != ugly code.
> 
> > [...]
> > What bugs?
> 
> You can't think for yourself? Your atomic_read_volatile contains a compiler
> barrier to the atomic variable before the load. 2 such reads from different
> locations look like this:
> 
> asm volatile("" : "+m" (v1));
> atomic_read(&v1);
> asm volatile("" : "+m" (v2));
> atomic_read(&v2);
> 
> Which implies that the load of v1 can be reordered to occur after the load
> of v2.

And how would that be a bug? (sorry, I really can't think for myself)


> > > Secondly, what sort of code would do such a thing?
> > 
> > See the nodemgr_host_thread() that does something similar, though not
> > exactly same.
> 
> I'm sorry, all this waffling about made up code which might do this and
> that is just a waste of time.

First, you could try looking at the code.

And by the way, as I've already said (why do *require* people to have to
repeat things to you?) this isn't even about only existing code.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Nick Piggin @ 2007-08-17 12:56 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Stefan Richter, paulmck, Herbert Xu, Paul Mackerras,
	Christoph Lameter, Chris Snook, Linux Kernel Mailing List,
	linux-arch, Linus Torvalds, netdev, Andrew Morton, ak,
	heiko.carstens, davem, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <alpine.LFD.0.999.0708171737060.3666@enigma.security.iitk.ac.in>

Satyam Sharma wrote:

>
>On Fri, 17 Aug 2007, Nick Piggin wrote:
>
>
>>Satyam Sharma wrote:
>>
>>>On Fri, 17 Aug 2007, Nick Piggin wrote:
>>>
>>>>Satyam Sharma wrote:
>>>>
>>>>It is very obvious. msleep calls schedule() (ie. sleeps), which is
>>>>always a barrier.
>>>>
>>>Probably you didn't mean that, but no, schedule() is not barrier because
>>>it sleeps. It's a barrier because it's invisible.
>>>
>>Where did I say it is a barrier because it sleeps?
>>
>
>Just below. What you wrote:
>
>
>>It is always a barrier because, at the lowest level, schedule() (and thus
>>anything that sleeps) is defined to always be a barrier.
>>
>
>"It is always a barrier because, at the lowest level, anything that sleeps
>is defined to always be a barrier".
>

... because it must call schedule and schedule is a barrier.


>>Regardless of
>>whatever obscure means the compiler might need to infer the barrier.
>>
>>In other words, you can ignore those obscure details because schedule() is
>>always going to have an explicit barrier in it.
>>
>
>I didn't quite understand what you said here, so I'll tell what I think:
>
>* foo() is a compiler barrier if the definition of foo() is invisible to
>  the compiler at a callsite.
>
>* foo() is also a compiler barrier if the definition of foo() includes
>  a barrier, and it is inlined at the callsite.
>
>If the above is wrong, or if there's something else at play as well,
>do let me know.
>

Right.


>>>>The "unobvious" thing is that you wanted to know how the compiler knows
>>>>a function is a barrier -- answer is that if it does not *know* it is not
>>>>a barrier, it must assume it is a barrier.
>>>>
>>>True, that's clearly what happens here. But are you're definitely joking
>>>that this is "obvious" in terms of code-clarity, right?
>>>
>>No. If you accept that barrier() is implemented correctly, and you know
>>that sleeping is defined to be a barrier,
>>
>
>Curiously, that's the second time you've said "sleeping is defined to
>be a (compiler) barrier".
>

_In Linux,_ sleeping is defined to be a compiler barrier.

>How does the compiler even know if foo() is
>a function that "sleeps"? Do compilers have some notion of "sleeping"
>to ensure they automatically assume a compiler barrier whenever such
>a function is called? Or are you saying that the compiler can see the
>barrier() inside said function ... nopes, you're saying quite the
>opposite below.
>

You're getting too worried about the compiler implementation. Start
by assuming that it does work ;)


>>then its perfectly clear. You
>>don't have to know how the compiler "knows" that some function contains
>>a barrier.
>>
>
>I think I do, why not? Would appreciate if you could elaborate on this.
>

If a function is not completely visible to the compiler (so it can't
determine whether a barrier could be in it or not), then it must always
assume it will contain a barrier so it always does the right thing.

^ permalink raw reply

* Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures
From: Satyam Sharma @ 2007-08-17 13:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Herbert Xu, Paul Mackerras, Linus Torvalds, Christoph Lameter,
	Chris Snook, Ilpo Jarvinen, Paul E. McKenney, Stefan Richter,
	Linux Kernel Mailing List, linux-arch, Netdev, Andrew Morton, ak,
	heiko.carstens, David Miller, schwidefsky, wensong, horms, wjiang,
	cfriesen, zlynx, rpjday, jesper.juhl, segher
In-Reply-To: <46C59717.4020108@cyberone.com.au>

On Fri, 17 Aug 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> 
> > On Fri, 17 Aug 2007, Nick Piggin wrote:
> > 
> > > Because they should be thinking about them in terms of barriers, over
> > > which the compiler / CPU is not to reorder accesses or cache memory
> > > operations, rather than "special" "volatile" accesses.
> > 
> > This is obviously just a taste thing. Whether to have that forget(x)
> > barrier as something author should explicitly sprinkle appropriately
> > in appropriate places in the code by himself or use a primitive that
> > includes it itself.
> 
> That's not obviously just taste to me. Not when the primitive has many
> (perhaps, the majority) of uses that do not require said barriers. And
> this is not solely about the code generation (which, as Paul says, is
> relatively minor even on x86).

See, you do *require* people to have to repeat the same things to you!

As has been written about enough times already, and if you followed the
discussion on this thread, I am *not* proposing that atomic_read()'s
semantics be changed to have any extra barriers. What is proposed is a
different atomic_read_xxx() variant thereof, that those can use who do
want that.

Now whether to have a kind of barrier ("volatile", whatever) in the
atomic_read_xxx() itself, or whether to make the code writer himself to
explicitly write the order(x) appropriately in appropriate places in the
code _is_ a matter of taste.

> > That's definitely the point, why not. This is why "barrier()", being
> > heavy-handed, is not the best option.
> 
> That is _not_ the point [...]

Again, you're requiring me to repeat things that were already made evident
on this thread (if you follow it).

This _is_ the point, because a lot of loops out there (too many of them,
I WILL NOT bother citing file_name:line_number) end up having to use a
barrier just because they're using a loop-exit-condition that depends
on a value returned by atomic_read(). It would be good for them if they
used an atomic_read_xxx() primitive that gave these "volatility" semantics
without junking compiler optimizations for other memory references.

> because there has already been an alternative posted

Whether that alternative (explicitly using forget(x), or wrappers thereof,
such as the "order_atomic" you proposed) is better than other alternatives
(such as atomic_read_xxx() which includes the volatility behaviour in
itself) is still open, and precisely what we started discussing just one
mail back.

(The above was also mostly stuff I had to repeated for you, sadly.)

> that better conforms with Linux barrier
> API and is much more widely useful and more usable.

I don't think so.

(Now *this* _is_ the "taste-dependent matter" that I mentioned earlier.)

> If you are so worried
> about
> barrier() being too heavyweight, then you're off to a poor start by wanting to
> add a few K of kernel text by making atomic_read volatile.

Repeating myself, for the N'th time, NO, I DON'T want to make atomic_read
have "volatile" semantics.

> > > because as I also mentioned, the logical extention
> > > to Linux's barrier API to handle this is the order(x) macro. Again, not
> > > special volatile accessors.
> > 
> > Sure, that forget(x) macro _is_ proposed to be made part of the generic
> > API. Doesn't explain why not to define/use primitives that has volatility
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > semantics in itself, though (taste matters apart).
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> If you follow the discussion.... You were thinking of a reason why the
> semantics *should* be changed or added, and I was rebutting your argument
> that it must be used when a full barrier() is too heavy (ie. by pointing
> out that order() has superior semantics anyway).

Amazing. Either you have reading comprehension problems, or else, please
try reading this thread (or at least this sub-thread) again. I don't want
_you_ blaming _me_ for having to repeat things to you all over again.

^ permalink raw reply

* [PATCH 1/6] ibmveth: Enable TCP checksum offload
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: linuxppc-dev, rcjenn, brking, santil, netdev


This patchset enables TCP checksum offload support for IPV4
on ibmveth. This completely eliminates the generation and checking of
the checksum for packets that are completely virtual and never
touch a physical network. A simple TCP_STREAM netperf run on
a virtual network with maximum mtu set yielded a ~30% increase
in throughput. This feature is enabled by default on systems that
support it, but can be disabled with a module option.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   53 ++++++++++++++++++++++++++++++++
 linux-2.6-bjking1/drivers/net/ibmveth.h |   41 +++++++++++++++++++++++-
 2 files changed, 92 insertions(+), 2 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_csum_offload drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_csum_offload	2007-08-09 13:08:21.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-09 13:08:21.000000000 -0500
@@ -47,6 +47,8 @@
 #include <linux/mm.h>
 #include <linux/ethtool.h>
 #include <linux/proc_fs.h>
+#include <linux/in.h>
+#include <linux/ip.h>
 #include <asm/semaphore.h>
 #include <asm/hvcall.h>
 #include <asm/atomic.h>
@@ -131,6 +133,11 @@ static inline int ibmveth_rxq_frame_leng
 	return (adapter->rx_queue.queue_addr[adapter->rx_queue.index].length);
 }
 
+static inline int ibmveth_rxq_csum_good(struct ibmveth_adapter *adapter)
+{
+	return (adapter->rx_queue.queue_addr[adapter->rx_queue.index].csum_good);
+}
+
 /* setup the initial settings for a buffer pool */
 static void ibmveth_init_buffer_pool(struct ibmveth_buff_pool *pool, u32 pool_index, u32 pool_size, u32 buff_size, u32 pool_active)
 {
@@ -684,6 +691,24 @@ static int ibmveth_start_xmit(struct sk_
 					desc[0].fields.length, DMA_TO_DEVICE);
 	desc[0].fields.valid   = 1;
 
+	if (skb->ip_summed == CHECKSUM_PARTIAL &&
+	    ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
+		ibmveth_error_printk("tx: failed to checksum packet\n");
+		tx_dropped++;
+		goto out;
+	}
+
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		unsigned char *buf = skb_transport_header(skb) + skb->csum_offset;
+
+		desc[0].fields.no_csum = 1;
+		desc[0].fields.csum_good = 1;
+
+		/* Need to zero out the checksum */
+		buf[0] = 0;
+		buf[1] = 0;
+	}
+
 	if(dma_mapping_error(desc[0].fields.address)) {
 		ibmveth_error_printk("tx: unable to map initial fragment\n");
 		tx_map_failed++;
@@ -702,6 +727,10 @@ static int ibmveth_start_xmit(struct sk_
 				frag->size, DMA_TO_DEVICE);
 		desc[curfrag+1].fields.length = frag->size;
 		desc[curfrag+1].fields.valid  = 1;
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			desc[curfrag+1].fields.no_csum = 1;
+			desc[curfrag+1].fields.csum_good = 1;
+		}
 
 		if(dma_mapping_error(desc[curfrag+1].fields.address)) {
 			ibmveth_error_printk("tx: unable to map fragment %d\n", curfrag);
@@ -792,7 +821,11 @@ static int ibmveth_poll(struct net_devic
 			} else {
 				int length = ibmveth_rxq_frame_length(adapter);
 				int offset = ibmveth_rxq_frame_offset(adapter);
+				int csum_good = ibmveth_rxq_csum_good(adapter);
+
 				skb = ibmveth_rxq_get_buffer(adapter);
+				if (csum_good)
+					skb->ip_summed = CHECKSUM_UNNECESSARY;
 
 				ibmveth_rxq_harvest_buffer(adapter);
 
@@ -962,8 +995,10 @@ static void ibmveth_poll_controller(stru
 static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 {
 	int rc, i;
+	long ret;
 	struct net_device *netdev;
 	struct ibmveth_adapter *adapter;
+	union ibmveth_illan_attributes set_attr, ret_attr;
 
 	unsigned char *mac_addr_p;
 	unsigned int *mcastFilterSize_p;
@@ -1057,6 +1092,24 @@ static int __devinit ibmveth_probe(struc
 
 	ibmveth_debug_printk("registering netdev...\n");
 
+	ret = h_illan_attributes(dev->unit_address, 0, 0, &ret_attr.desc);
+
+	if (ret == H_SUCCESS && !ret_attr.fields.active_trunk &&
+	    !ret_attr.fields.trunk_priority &&
+	    ret_attr.fields.csum_offload_padded_pkt_support) {
+		set_attr.desc = 0;
+		set_attr.fields.tcp_csum_offload_ipv4 = 1;
+
+		ret = h_illan_attributes(dev->unit_address, 0, set_attr.desc,
+					 &ret_attr.desc);
+
+		if (ret == H_SUCCESS)
+			netdev->features |= NETIF_F_IP_CSUM;
+		else
+			ret = h_illan_attributes(dev->unit_address, set_attr.desc,
+						 0, &ret_attr.desc);
+	}
+
 	rc = register_netdev(netdev);
 
 	if(rc) {
diff -puN drivers/net/ibmveth.h~ibmveth_csum_offload drivers/net/ibmveth.h
--- linux-2.6/drivers/net/ibmveth.h~ibmveth_csum_offload	2007-08-09 13:08:21.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.h	2007-08-09 13:08:21.000000000 -0500
@@ -67,6 +67,21 @@ static inline long h_send_logical_lan(un
 	return rc;
 }
 
+static inline long h_illan_attributes(unsigned long unit_address,
+				      unsigned long reset_mask, unsigned long set_mask,
+				      unsigned long *ret_attributes)
+{
+	long rc;
+	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+	rc = plpar_hcall(H_ILLAN_ATTRIBUTES, retbuf, unit_address,
+			 reset_mask, set_mask);
+
+	*ret_attributes = retbuf[0];
+
+	return rc;
+}
+
 #define h_multicast_ctrl(ua, cmd, mac) \
   plpar_hcall_norets(H_MULTICAST_CTRL, ua, cmd, mac)
 
@@ -141,7 +156,9 @@ struct ibmveth_adapter {
 struct ibmveth_buf_desc_fields {
     u32 valid : 1;
     u32 toggle : 1;
-    u32 reserved : 6;
+    u32 reserved : 4;
+    u32 no_csum : 1;
+    u32 csum_good : 1;
     u32 length : 24;
     u32 address;
 };
@@ -151,10 +168,30 @@ union ibmveth_buf_desc {
     struct ibmveth_buf_desc_fields fields;
 };
 
+struct ibmveth_illan_attributes_fields {
+	u32 reserved;
+	u32 reserved2 : 18;
+	u32 csum_offload_padded_pkt_support : 1;
+	u32 reserved3 : 1;
+	u32 trunk_priority : 4;
+	u32 reserved4 : 5;
+	u32 tcp_csum_offload_ipv6 : 1;
+	u32 tcp_csum_offload_ipv4 : 1;
+	u32 active_trunk : 1;
+};
+
+union ibmveth_illan_attributes {
+	u64 desc;
+	struct ibmveth_illan_attributes_fields fields;
+};
+
 struct ibmveth_rx_q_entry {
     u16 toggle : 1;
     u16 valid : 1;
-    u16 reserved : 14;
+    u16 reserved : 4;
+    u16 no_csum : 1;
+    u16 csum_good : 1;
+    u16 reserved2 : 8;
     u16 offset;
     u32 length;
     u64 correlator;
_

^ permalink raw reply

* [PATCH 1/6] ibmveth: Enable TCP checksum offload
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: santil, rcjenn, netdev, linuxppc-dev, brking


This patchset enables TCP checksum offload support for IPV4
on ibmveth. This completely eliminates the generation and checking of
the checksum for packets that are completely virtual and never
touch a physical network. A simple TCP_STREAM netperf run on
a virtual network with maximum mtu set yielded a ~30% increase
in throughput. This feature is enabled by default on systems that
support it, but can be disabled with a module option.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   53 ++++++++++++++++++++++++++++++++
 linux-2.6-bjking1/drivers/net/ibmveth.h |   41 +++++++++++++++++++++++-
 2 files changed, 92 insertions(+), 2 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_csum_offload drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_csum_offload	2007-08-09 13:08:21.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-09 13:08:21.000000000 -0500
@@ -47,6 +47,8 @@
 #include <linux/mm.h>
 #include <linux/ethtool.h>
 #include <linux/proc_fs.h>
+#include <linux/in.h>
+#include <linux/ip.h>
 #include <asm/semaphore.h>
 #include <asm/hvcall.h>
 #include <asm/atomic.h>
@@ -131,6 +133,11 @@ static inline int ibmveth_rxq_frame_leng
 	return (adapter->rx_queue.queue_addr[adapter->rx_queue.index].length);
 }
 
+static inline int ibmveth_rxq_csum_good(struct ibmveth_adapter *adapter)
+{
+	return (adapter->rx_queue.queue_addr[adapter->rx_queue.index].csum_good);
+}
+
 /* setup the initial settings for a buffer pool */
 static void ibmveth_init_buffer_pool(struct ibmveth_buff_pool *pool, u32 pool_index, u32 pool_size, u32 buff_size, u32 pool_active)
 {
@@ -684,6 +691,24 @@ static int ibmveth_start_xmit(struct sk_
 					desc[0].fields.length, DMA_TO_DEVICE);
 	desc[0].fields.valid   = 1;
 
+	if (skb->ip_summed == CHECKSUM_PARTIAL &&
+	    ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
+		ibmveth_error_printk("tx: failed to checksum packet\n");
+		tx_dropped++;
+		goto out;
+	}
+
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		unsigned char *buf = skb_transport_header(skb) + skb->csum_offset;
+
+		desc[0].fields.no_csum = 1;
+		desc[0].fields.csum_good = 1;
+
+		/* Need to zero out the checksum */
+		buf[0] = 0;
+		buf[1] = 0;
+	}
+
 	if(dma_mapping_error(desc[0].fields.address)) {
 		ibmveth_error_printk("tx: unable to map initial fragment\n");
 		tx_map_failed++;
@@ -702,6 +727,10 @@ static int ibmveth_start_xmit(struct sk_
 				frag->size, DMA_TO_DEVICE);
 		desc[curfrag+1].fields.length = frag->size;
 		desc[curfrag+1].fields.valid  = 1;
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			desc[curfrag+1].fields.no_csum = 1;
+			desc[curfrag+1].fields.csum_good = 1;
+		}
 
 		if(dma_mapping_error(desc[curfrag+1].fields.address)) {
 			ibmveth_error_printk("tx: unable to map fragment %d\n", curfrag);
@@ -792,7 +821,11 @@ static int ibmveth_poll(struct net_devic
 			} else {
 				int length = ibmveth_rxq_frame_length(adapter);
 				int offset = ibmveth_rxq_frame_offset(adapter);
+				int csum_good = ibmveth_rxq_csum_good(adapter);
+
 				skb = ibmveth_rxq_get_buffer(adapter);
+				if (csum_good)
+					skb->ip_summed = CHECKSUM_UNNECESSARY;
 
 				ibmveth_rxq_harvest_buffer(adapter);
 
@@ -962,8 +995,10 @@ static void ibmveth_poll_controller(stru
 static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 {
 	int rc, i;
+	long ret;
 	struct net_device *netdev;
 	struct ibmveth_adapter *adapter;
+	union ibmveth_illan_attributes set_attr, ret_attr;
 
 	unsigned char *mac_addr_p;
 	unsigned int *mcastFilterSize_p;
@@ -1057,6 +1092,24 @@ static int __devinit ibmveth_probe(struc
 
 	ibmveth_debug_printk("registering netdev...\n");
 
+	ret = h_illan_attributes(dev->unit_address, 0, 0, &ret_attr.desc);
+
+	if (ret == H_SUCCESS && !ret_attr.fields.active_trunk &&
+	    !ret_attr.fields.trunk_priority &&
+	    ret_attr.fields.csum_offload_padded_pkt_support) {
+		set_attr.desc = 0;
+		set_attr.fields.tcp_csum_offload_ipv4 = 1;
+
+		ret = h_illan_attributes(dev->unit_address, 0, set_attr.desc,
+					 &ret_attr.desc);
+
+		if (ret == H_SUCCESS)
+			netdev->features |= NETIF_F_IP_CSUM;
+		else
+			ret = h_illan_attributes(dev->unit_address, set_attr.desc,
+						 0, &ret_attr.desc);
+	}
+
 	rc = register_netdev(netdev);
 
 	if(rc) {
diff -puN drivers/net/ibmveth.h~ibmveth_csum_offload drivers/net/ibmveth.h
--- linux-2.6/drivers/net/ibmveth.h~ibmveth_csum_offload	2007-08-09 13:08:21.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.h	2007-08-09 13:08:21.000000000 -0500
@@ -67,6 +67,21 @@ static inline long h_send_logical_lan(un
 	return rc;
 }
 
+static inline long h_illan_attributes(unsigned long unit_address,
+				      unsigned long reset_mask, unsigned long set_mask,
+				      unsigned long *ret_attributes)
+{
+	long rc;
+	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+	rc = plpar_hcall(H_ILLAN_ATTRIBUTES, retbuf, unit_address,
+			 reset_mask, set_mask);
+
+	*ret_attributes = retbuf[0];
+
+	return rc;
+}
+
 #define h_multicast_ctrl(ua, cmd, mac) \
   plpar_hcall_norets(H_MULTICAST_CTRL, ua, cmd, mac)
 
@@ -141,7 +156,9 @@ struct ibmveth_adapter {
 struct ibmveth_buf_desc_fields {
     u32 valid : 1;
     u32 toggle : 1;
-    u32 reserved : 6;
+    u32 reserved : 4;
+    u32 no_csum : 1;
+    u32 csum_good : 1;
     u32 length : 24;
     u32 address;
 };
@@ -151,10 +168,30 @@ union ibmveth_buf_desc {
     struct ibmveth_buf_desc_fields fields;
 };
 
+struct ibmveth_illan_attributes_fields {
+	u32 reserved;
+	u32 reserved2 : 18;
+	u32 csum_offload_padded_pkt_support : 1;
+	u32 reserved3 : 1;
+	u32 trunk_priority : 4;
+	u32 reserved4 : 5;
+	u32 tcp_csum_offload_ipv6 : 1;
+	u32 tcp_csum_offload_ipv4 : 1;
+	u32 active_trunk : 1;
+};
+
+union ibmveth_illan_attributes {
+	u64 desc;
+	struct ibmveth_illan_attributes_fields fields;
+};
+
 struct ibmveth_rx_q_entry {
     u16 toggle : 1;
     u16 valid : 1;
-    u16 reserved : 14;
+    u16 reserved : 4;
+    u16 no_csum : 1;
+    u16 csum_good : 1;
+    u16 reserved2 : 8;
     u16 offset;
     u32 length;
     u64 correlator;
_

^ permalink raw reply

* [PATCH 2/6] ibmveth: Implement ethtool hooks to enable/disable checksum offload
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: santil, rcjenn, netdev, linuxppc-dev, brking
In-Reply-To: <11873601831813-patch-mail.ibm.com>


This patch adds the appropriate ethtool hooks to allow for enabling/disabling
of hypervisor assisted checksum offload for TCP.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |  125 +++++++++++++++++++++++++++++++-
 linux-2.6-bjking1/drivers/net/ibmveth.h |    1 
 2 files changed, 124 insertions(+), 2 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_csum_offload_ethtool drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_csum_offload_ethtool	2007-08-09 15:15:07.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-09 15:15:07.000000000 -0500
@@ -641,12 +641,132 @@ static u32 netdev_get_link(struct net_de
 	return 1;
 }
 
+static void ibmveth_set_rx_csum_flags(struct net_device *dev, u32 data)
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+
+	if (data)
+		adapter->rx_csum = 1;
+	else {
+		/*
+		 * Since the ibmveth firmware interface does not have the concept of
+		 * separate tx/rx checksum offload enable, if rx checksum is disabled
+		 * we also have to disable tx checksum offload. Once we disable rx
+		 * checksum offload, we are no longer allowed to send tx buffers that
+		 * are not properly checksummed.
+		 */
+		adapter->rx_csum = 0;
+		dev->features &= ~NETIF_F_IP_CSUM;
+	}
+}
+
+static void ibmveth_set_tx_csum_flags(struct net_device *dev, u32 data)
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+
+	if (data) {
+		dev->features |= NETIF_F_IP_CSUM;
+		adapter->rx_csum = 1;
+	} else
+		dev->features &= ~NETIF_F_IP_CSUM;
+}
+
+static int ibmveth_set_csum_offload(struct net_device *dev, u32 data,
+				    void (*done) (struct net_device *, u32))
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+	union ibmveth_illan_attributes set_attr, clr_attr, ret_attr;
+	long ret;
+	int rc1 = 0, rc2 = 0;
+	int restart = 0;
+
+	if (netif_running(dev)) {
+		restart = 1;
+		adapter->pool_config = 1;
+		ibmveth_close(dev);
+		adapter->pool_config = 0;
+	}
+
+	set_attr.desc = 0;
+	clr_attr.desc = 0;
+
+	if (data)
+		set_attr.fields.tcp_csum_offload_ipv4 = 1;
+	else
+		clr_attr.fields.tcp_csum_offload_ipv4 = 1;
+
+	ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr.desc);
+
+	if (ret == H_SUCCESS && !ret_attr.fields.active_trunk &&
+	    !ret_attr.fields.trunk_priority &&
+	    ret_attr.fields.csum_offload_padded_pkt_support) {
+		ret = h_illan_attributes(adapter->vdev->unit_address, clr_attr.desc,
+					 set_attr.desc, &ret_attr.desc);
+
+		if (ret != H_SUCCESS) {
+			rc1 = -EIO;
+			ibmveth_error_printk("unable to change checksum offload settings."
+					     " %d rc=%ld\n", data, ret);
+
+			ret = h_illan_attributes(adapter->vdev->unit_address,
+						 set_attr.desc, clr_attr.desc, &ret_attr.desc);
+		} else
+			done(dev, data);
+	} else {
+		rc1 = -EIO;
+		ibmveth_error_printk("unable to change checksum offload settings."
+				     " %d rc=%ld ret_attr=%lx\n", data, ret, ret_attr.desc);
+	}
+
+	if (restart)
+		rc2 = ibmveth_open(dev);
+
+	return rc1 ? rc1 : rc2;
+}
+
+static int ibmveth_set_rx_csum(struct net_device *dev, u32 data)
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+
+	if ((data && adapter->rx_csum) || (!data && !adapter->rx_csum))
+		return 0;
+
+	return ibmveth_set_csum_offload(dev, data, ibmveth_set_rx_csum_flags);
+}
+
+static int ibmveth_set_tx_csum(struct net_device *dev, u32 data)
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+	int rc = 0;
+
+	if (data && (dev->features & NETIF_F_IP_CSUM))
+		return 0;
+	if (!data && !(dev->features & NETIF_F_IP_CSUM))
+		return 0;
+
+	if (data && !adapter->rx_csum)
+		rc = ibmveth_set_csum_offload(dev, data, ibmveth_set_tx_csum_flags);
+	else
+		ibmveth_set_tx_csum_flags(dev, data);
+
+	return rc;
+}
+
+static u32 ibmveth_get_rx_csum(struct net_device *dev)
+{
+	struct ibmveth_adapter *adapter = dev->priv;
+	return adapter->rx_csum;
+}
+
 static const struct ethtool_ops netdev_ethtool_ops = {
 	.get_drvinfo		= netdev_get_drvinfo,
 	.get_settings		= netdev_get_settings,
 	.get_link		= netdev_get_link,
 	.get_sg			= ethtool_op_get_sg,
 	.get_tx_csum		= ethtool_op_get_tx_csum,
+	.set_tx_csum		= ibmveth_set_tx_csum,
+	.get_rx_csum		= ibmveth_get_rx_csum,
+	.set_rx_csum		= ibmveth_set_rx_csum,
 };
 
 static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
@@ -1103,9 +1223,10 @@ static int __devinit ibmveth_probe(struc
 		ret = h_illan_attributes(dev->unit_address, 0, set_attr.desc,
 					 &ret_attr.desc);
 
-		if (ret == H_SUCCESS)
+		if (ret == H_SUCCESS) {
+			adapter->rx_csum = 1;
 			netdev->features |= NETIF_F_IP_CSUM;
-		else
+		} else
 			ret = h_illan_attributes(dev->unit_address, set_attr.desc,
 						 0, &ret_attr.desc);
 	}
diff -puN drivers/net/ibmveth.h~ibmveth_csum_offload_ethtool drivers/net/ibmveth.h
--- linux-2.6/drivers/net/ibmveth.h~ibmveth_csum_offload_ethtool	2007-08-09 15:15:07.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.h	2007-08-09 15:15:07.000000000 -0500
@@ -137,6 +137,7 @@ struct ibmveth_adapter {
     struct ibmveth_buff_pool rx_buff_pool[IbmVethNumBufferPools];
     struct ibmveth_rx_q rx_queue;
     int pool_config;
+    int rx_csum;
 
     /* adapter specific stats */
     u64 replenish_task_cycles;
_

^ permalink raw reply

* [PATCH 3/6] ibmveth: Add ethtool TSO handlers
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: santil, rcjenn, netdev, linuxppc-dev, brking
In-Reply-To: <11873601831813-patch-mail.ibm.com>


Add handlers for get_tso and get_ufo to prevent errors being printed
by ethtool.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN drivers/net/ibmveth.c~ibmveth_ethtool_get_tso drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_ethtool_get_tso	2007-08-08 10:46:28.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-08 10:46:28.000000000 -0500
@@ -767,6 +767,8 @@ static const struct ethtool_ops netdev_e
 	.set_tx_csum		= ibmveth_set_tx_csum,
 	.get_rx_csum		= ibmveth_get_rx_csum,
 	.set_rx_csum		= ibmveth_set_rx_csum,
+	.get_tso			= ethtool_op_get_tso,
+	.get_ufo			= ethtool_op_get_ufo,
 };
 
 static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
_

^ permalink raw reply

* [PATCH 4/6] ibmveth: Add ethtool driver stats hooks
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: santil, rcjenn, netdev, linuxppc-dev, brking
In-Reply-To: <11873601831813-patch-mail.ibm.com>


Add ethtool hooks to ibmveth to retrieve driver statistics.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |   51 ++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff -puN drivers/net/ibmveth.c~ibmveth_ethtool_driver_stats drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_ethtool_driver_stats	2007-08-08 10:46:30.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-08 10:48:08.000000000 -0500
@@ -112,6 +112,28 @@ MODULE_DESCRIPTION("IBM i/pSeries Virtua
 MODULE_LICENSE("GPL");
 MODULE_VERSION(ibmveth_driver_version);
 
+struct ibmveth_stat {
+	char name[ETH_GSTRING_LEN];
+	int offset;
+};
+
+#define IBMVETH_STAT_OFF(stat) offsetof(struct ibmveth_adapter, stat)
+#define IBMVETH_GET_STAT(a, off) *((u64 *)(((unsigned long)(a)) + off))
+
+struct ibmveth_stat ibmveth_stats[] = {
+	{ "replenish_task_cycles", IBMVETH_STAT_OFF(replenish_task_cycles) },
+	{ "replenish_no_mem", IBMVETH_STAT_OFF(replenish_no_mem) },
+	{ "replenish_add_buff_failure", IBMVETH_STAT_OFF(replenish_add_buff_failure) },
+	{ "replenish_add_buff_success", IBMVETH_STAT_OFF(replenish_add_buff_success) },
+	{ "rx_invalid_buffer", IBMVETH_STAT_OFF(rx_invalid_buffer) },
+	{ "rx_no_buffer", IBMVETH_STAT_OFF(rx_no_buffer) },
+	{ "tx_multidesc_send", IBMVETH_STAT_OFF(tx_multidesc_send) },
+	{ "tx_linearized", IBMVETH_STAT_OFF(tx_linearized) },
+	{ "tx_linearize_failed", IBMVETH_STAT_OFF(tx_linearize_failed) },
+	{ "tx_map_failed", IBMVETH_STAT_OFF(tx_map_failed) },
+	{ "tx_send_failed", IBMVETH_STAT_OFF(tx_send_failed) },
+};
+
 /* simple methods of getting data from the current rxq entry */
 static inline int ibmveth_rxq_pending_buffer(struct ibmveth_adapter *adapter)
 {
@@ -758,6 +780,32 @@ static u32 ibmveth_get_rx_csum(struct ne
 	return adapter->rx_csum;
 }
 
+static void ibmveth_get_strings(struct net_device *dev, u32 stringset, u8 *data)
+{
+	int i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(ibmveth_stats); i++, data += ETH_GSTRING_LEN)
+		memcpy(data, ibmveth_stats[i].name, ETH_GSTRING_LEN);
+}
+
+static int ibmveth_get_stats_count(struct net_device *dev)
+{
+	return ARRAY_SIZE(ibmveth_stats);
+}
+
+static void ibmveth_get_ethtool_stats(struct net_device *dev,
+				      struct ethtool_stats *stats, u64 *data)
+{
+	int i;
+	struct ibmveth_adapter *adapter = dev->priv;
+
+	for (i = 0; i < ARRAY_SIZE(ibmveth_stats); i++)
+		data[i] = IBMVETH_GET_STAT(adapter, ibmveth_stats[i].offset);
+}
+
 static const struct ethtool_ops netdev_ethtool_ops = {
 	.get_drvinfo		= netdev_get_drvinfo,
 	.get_settings		= netdev_get_settings,
@@ -769,6 +817,9 @@ static const struct ethtool_ops netdev_e
 	.set_rx_csum		= ibmveth_set_rx_csum,
 	.get_tso			= ethtool_op_get_tso,
 	.get_ufo			= ethtool_op_get_ufo,
+	.get_strings		= ibmveth_get_strings,
+	.get_stats_count		= ibmveth_get_stats_count,
+	.get_ethtool_stats	= ibmveth_get_ethtool_stats,
 };
 
 static int ibmveth_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
diff -puN drivers/net/ibmveth.h~ibmveth_ethtool_driver_stats drivers/net/ibmveth.h
_

^ permalink raw reply

* [PATCH 5/6] ibmveth: Remove dead frag processing code
From: Brian King @ 2007-08-17 14:16 UTC (permalink / raw)
  To: jeff; +Cc: santil, rcjenn, netdev, linuxppc-dev, brking
In-Reply-To: <11873601831813-patch-mail.ibm.com>


Removes dead frag processing code from ibmveth. Since NETIF_F_SG was
not set, this code was never executed. Also, since the ibmveth
interface can only handle 6 fragments, core networking code would need
to be modified in order to efficiently enable this support.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
---

 linux-2.6-bjking1/drivers/net/ibmveth.c |  100 +++++---------------------------
 linux-2.6-bjking1/drivers/net/ibmveth.h |    5 -
 2 files changed, 17 insertions(+), 88 deletions(-)

diff -puN drivers/net/ibmveth.c~ibmveth_remove_frag drivers/net/ibmveth.c
--- linux-2.6/drivers/net/ibmveth.c~ibmveth_remove_frag	2007-08-09 15:15:18.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.c	2007-08-09 15:15:18.000000000 -0500
@@ -28,7 +28,6 @@
 /**************************************************************************/
 /*
   TODO:
-  - remove frag processing code - no longer needed
   - add support for sysfs
   - possibly remove procfs support
 */
@@ -127,9 +126,6 @@ struct ibmveth_stat ibmveth_stats[] = {
 	{ "replenish_add_buff_success", IBMVETH_STAT_OFF(replenish_add_buff_success) },
 	{ "rx_invalid_buffer", IBMVETH_STAT_OFF(rx_invalid_buffer) },
 	{ "rx_no_buffer", IBMVETH_STAT_OFF(rx_no_buffer) },
-	{ "tx_multidesc_send", IBMVETH_STAT_OFF(tx_multidesc_send) },
-	{ "tx_linearized", IBMVETH_STAT_OFF(tx_linearized) },
-	{ "tx_linearize_failed", IBMVETH_STAT_OFF(tx_linearize_failed) },
 	{ "tx_map_failed", IBMVETH_STAT_OFF(tx_map_failed) },
 	{ "tx_send_failed", IBMVETH_STAT_OFF(tx_send_failed) },
 };
@@ -832,9 +828,8 @@ static int ibmveth_ioctl(struct net_devi
 static int ibmveth_start_xmit(struct sk_buff *skb, struct net_device *netdev)
 {
 	struct ibmveth_adapter *adapter = netdev->priv;
-	union ibmveth_buf_desc desc[IbmVethMaxSendFrags];
+	union ibmveth_buf_desc desc;
 	unsigned long lpar_rc;
-	int nfrags = 0, curfrag;
 	unsigned long correlator;
 	unsigned long flags;
 	unsigned int retry_count;
@@ -844,25 +839,11 @@ static int ibmveth_start_xmit(struct sk_
 	unsigned int tx_send_failed = 0;
 	unsigned int tx_map_failed = 0;
 
-
-	if ((skb_shinfo(skb)->nr_frags + 1) > IbmVethMaxSendFrags) {
-		tx_dropped++;
-		goto out;
-	}
-
-	memset(&desc, 0, sizeof(desc));
-
-	/* nfrags = number of frags after the initial fragment */
-	nfrags = skb_shinfo(skb)->nr_frags;
-
-	if(nfrags)
-		adapter->tx_multidesc_send++;
-
-	/* map the initial fragment */
-	desc[0].fields.length  = nfrags ? skb->len - skb->data_len : skb->len;
-	desc[0].fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
-					desc[0].fields.length, DMA_TO_DEVICE);
-	desc[0].fields.valid   = 1;
+	desc.desc = 0;
+	desc.fields.length  = skb->len;
+	desc.fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
+					     desc.fields.length, DMA_TO_DEVICE);
+	desc.fields.valid   = 1;
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
@@ -874,75 +855,34 @@ static int ibmveth_start_xmit(struct sk_
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		unsigned char *buf = skb_transport_header(skb) + skb->csum_offset;
 
-		desc[0].fields.no_csum = 1;
-		desc[0].fields.csum_good = 1;
+		desc.fields.no_csum = 1;
+		desc.fields.csum_good = 1;
 
 		/* Need to zero out the checksum */
 		buf[0] = 0;
 		buf[1] = 0;
 	}
 
-	if(dma_mapping_error(desc[0].fields.address)) {
-		ibmveth_error_printk("tx: unable to map initial fragment\n");
+	if (dma_mapping_error(desc.fields.address)) {
+		ibmveth_error_printk("tx: unable to map xmit buffer\n");
 		tx_map_failed++;
 		tx_dropped++;
 		goto out;
 	}
 
-	curfrag = nfrags;
-
-	/* map fragments past the initial portion if there are any */
-	while(curfrag--) {
-		skb_frag_t *frag = &skb_shinfo(skb)->frags[curfrag];
-		desc[curfrag+1].fields.address
-			= dma_map_single(&adapter->vdev->dev,
-				page_address(frag->page) + frag->page_offset,
-				frag->size, DMA_TO_DEVICE);
-		desc[curfrag+1].fields.length = frag->size;
-		desc[curfrag+1].fields.valid  = 1;
-		if (skb->ip_summed == CHECKSUM_PARTIAL) {
-			desc[curfrag+1].fields.no_csum = 1;
-			desc[curfrag+1].fields.csum_good = 1;
-		}
-
-		if(dma_mapping_error(desc[curfrag+1].fields.address)) {
-			ibmveth_error_printk("tx: unable to map fragment %d\n", curfrag);
-			tx_map_failed++;
-			tx_dropped++;
-			/* Free all the mappings we just created */
-			while(curfrag < nfrags) {
-				dma_unmap_single(&adapter->vdev->dev,
-						 desc[curfrag+1].fields.address,
-						 desc[curfrag+1].fields.length,
-						 DMA_TO_DEVICE);
-				curfrag++;
-			}
-			goto out;
-		}
-	}
-
 	/* send the frame. Arbitrarily set retrycount to 1024 */
 	correlator = 0;
 	retry_count = 1024;
 	do {
 		lpar_rc = h_send_logical_lan(adapter->vdev->unit_address,
-					     desc[0].desc,
-					     desc[1].desc,
-					     desc[2].desc,
-					     desc[3].desc,
-					     desc[4].desc,
-					     desc[5].desc,
-					     correlator,
-					     &correlator);
+					     desc.desc, 0, 0, 0, 0, 0,
+					     correlator, &correlator);
 	} while ((lpar_rc == H_BUSY) && (retry_count--));
 
 	if(lpar_rc != H_SUCCESS && lpar_rc != H_DROPPED) {
-		int i;
 		ibmveth_error_printk("tx: h_send_logical_lan failed with rc=%ld\n", lpar_rc);
-		for(i = 0; i < 6; i++) {
-			ibmveth_error_printk("tx: desc[%i] valid=%d, len=%d, address=0x%d\n", i,
-					     desc[i].fields.valid, desc[i].fields.length, desc[i].fields.address);
-		}
+		ibmveth_error_printk("tx: valid=%d, len=%d, address=0x%08x\n",
+				     desc.fields.valid, desc.fields.length, desc.fields.address);
 		tx_send_failed++;
 		tx_dropped++;
 	} else {
@@ -951,11 +891,8 @@ static int ibmveth_start_xmit(struct sk_
 		netdev->trans_start = jiffies;
 	}
 
-	do {
-		dma_unmap_single(&adapter->vdev->dev,
-				desc[nfrags].fields.address,
-				desc[nfrags].fields.length, DMA_TO_DEVICE);
-	} while(--nfrags >= 0);
+	dma_unmap_single(&adapter->vdev->dev, desc.fields.address,
+			 desc.fields.length, DMA_TO_DEVICE);
 
 out:	spin_lock_irqsave(&adapter->stats_lock, flags);
 	adapter->stats.tx_dropped += tx_dropped;
@@ -1366,10 +1303,7 @@ static int ibmveth_seq_show(struct seq_f
 		   firmware_mac[3], firmware_mac[4], firmware_mac[5]);
 
 	seq_printf(seq, "\nAdapter Statistics:\n");
-	seq_printf(seq, "  TX:  skbuffs linearized:          %ld\n", adapter->tx_linearized);
-	seq_printf(seq, "       multi-descriptor sends:      %ld\n", adapter->tx_multidesc_send);
-	seq_printf(seq, "       skb_linearize failures:      %ld\n", adapter->tx_linearize_failed);
-	seq_printf(seq, "       vio_map_single failres:      %ld\n", adapter->tx_map_failed);
+	seq_printf(seq, "  TX:  vio_map_single failres:      %ld\n", adapter->tx_map_failed);
 	seq_printf(seq, "       send failures:               %ld\n", adapter->tx_send_failed);
 	seq_printf(seq, "  RX:  replenish task cycles:       %ld\n", adapter->replenish_task_cycles);
 	seq_printf(seq, "       alloc_skb_failures:          %ld\n", adapter->replenish_no_mem);
diff -puN drivers/net/ibmveth.h~ibmveth_remove_frag drivers/net/ibmveth.h
--- linux-2.6/drivers/net/ibmveth.h~ibmveth_remove_frag	2007-08-09 15:15:18.000000000 -0500
+++ linux-2.6-bjking1/drivers/net/ibmveth.h	2007-08-09 15:15:18.000000000 -0500
@@ -25,8 +25,6 @@
 #ifndef _IBMVETH_H
 #define _IBMVETH_H
 
-#define IbmVethMaxSendFrags 6
-
 /* constants for H_MULTICAST_CTRL */
 #define IbmVethMcastReceptionModifyBit     0x80000UL
 #define IbmVethMcastReceptionEnableBit     0x20000UL
@@ -146,9 +144,6 @@ struct ibmveth_adapter {
     u64 replenish_add_buff_success;
     u64 rx_invalid_buffer;
     u64 rx_no_buffer;
-    u64 tx_multidesc_send;
-    u64 tx_linearized;
-    u64 tx_linearize_failed;
     u64 tx_map_failed;
     u64 tx_send_failed;
     spinlock_t stats_lock;
_

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox