LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-10  9:01 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev, Yevgeny Petrilin, Eli Cohen, David Laight,
	Thadeu Lima de Souza Cascardo, linuxppc-dev
In-Reply-To: <20111010084726.GM2681@mtldesk30>

On Mon, 2011-10-10 at 10:47 +0200, Eli Cohen wrote:
> On Mon, Oct 10, 2011 at 09:40:17AM +0100, David Laight wrote:
> > 
> > Actually memory barriers shouldn't really be added to
> > any of these 'accessor' functions.
> > (Or, at least, ones without barriers should be provided.)
> > 
> > The driver may want to to a series of writes, then a
> > single barrier, before a final write of a command (etc).
> > 
> > in_le32() from io.h is specially horrid!
> > 
> > 	David
> > 
> The driver would like to control if and when we want to put a memory
> barrier. We really don't want it to be done under the hood. In this
> respect we prefer raw functions which are still available to all
> platforms.

 ... but not necessarily the corresponding barriers.

That's why on powerpc we had to make all rmb,wmb and mb the same, aka a
full sync, because our weaker barriers don't order cachable vs.
non-cachable.

In any case, the raw functions are a bit nasty to use because they both
don't have barriers -and- don't handle endianness. So you have to be
extra careful.

In 90% of the cases, the barriers are what you want anyway. For example
in the else case of the driver, the doorbell MMIO typically wants it, so
using writel() is fine (or iowrite32be) and will have the necessary
barriers.

The case where things get a bit more nasty is when you try to use MMIO
for low latency small-data type transfers instead of DMA, in which case
you do want the ability for the chipset to write-combine and control the
barriers more precisely.

However, this is hard and Linux doesn't provide very good accessors to
do so, thus you need to be extra careful (see my example about wmb()

In the case of the iomap "copy" operations, my problem is that they
don't properly advertise their lack of ordering since normal iomap does
have full ordering.

I believe they should provide ordering with a barrier before & a barrier
after, eventually with _relaxed variants or _raw variants for those who
"know what they are doing".

Maybe it's time for us to revive those discussions about providing a
good set of relaxed MMIO accessors with explicit barriers :-)

Cheers,
Ben.
 

^ permalink raw reply

* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-10  8:53 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, linuxppc-dev, Yevgeny Petrilin, Eli Cohen,
	Thadeu Lima de Souza Cascardo, Eli Cohen
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AE78@saturn3.aculab.com>

On Mon, 2011-10-10 at 09:40 +0100, David Laight wrote:
> > What is this __iowrite64_copy... oh I see
> > 
> > Nice, somebody _AGAIN_ added a bunch of "generic" IO 
> > accessors that are utterly wrong on all archs except
> > x86 (ok, -almost-).
> > There isn't a single bloody memory barrier in there !
> 
> Actually memory barriers shouldn't really be added to
> any of these 'accessor' functions.
> (Or, at least, ones without barriers should be provided.)

As long as they are documented to provide no guarantee of ordering
between the stores... And x86 driver writers have any clue that they
will not be ordered vs. surrounding accesses.

> The driver may want to to a series of writes, then a
> single barrier, before a final write of a command (etc).
> 
> in_le32() from io.h is specially horrid!

The reason for that is that drivers expect fully ordered writel() vs
everything (including DMA).

Unfortunately, this is how Linux defines those semantics. I would much
prefer to require barriers explicitely but the decision was made back
then simply because the vast majority of driver writers do not
understand weakly ordered memory models and "everything should be made
to look like x86".

It would be great to come up with a set of more relaxed accessors along
with the appropriate barrier to use for drivers who "know better" but so
far all attempts at doing so have failed due to the inability to agree
on their precise semantics. Tho that was a while ago, we should probably
give it a new shot.

Cheers,
Ben. 

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-10  8:47 UTC (permalink / raw)
  To: David Laight
  Cc: Yevgeny Petrilin, Eli Cohen, Thadeu Lima de Souza Cascardo,
	netdev, linuxppc-dev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AE78@saturn3.aculab.com>

On Mon, Oct 10, 2011 at 09:40:17AM +0100, David Laight wrote:
> 
> Actually memory barriers shouldn't really be added to
> any of these 'accessor' functions.
> (Or, at least, ones without barriers should be provided.)
> 
> The driver may want to to a series of writes, then a
> single barrier, before a final write of a command (etc).
> 
> in_le32() from io.h is specially horrid!
> 
> 	David
> 
The driver would like to control if and when we want to put a memory
barrier. We really don't want it to be done under the hood. In this
respect we prefer raw functions which are still available to all
platforms.

^ permalink raw reply

* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: David Laight @ 2011-10-10  8:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Eli Cohen
  Cc: netdev, Yevgeny Petrilin, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Eli Cohen
In-Reply-To: <1318145118.29415.371.camel@pasglop>

=20
> What is this __iowrite64_copy... oh I see
>=20
> Nice, somebody _AGAIN_ added a bunch of "generic" IO=20
> accessors that are utterly wrong on all archs except
> x86 (ok, -almost-).
> There isn't a single bloody memory barrier in there !

Actually memory barriers shouldn't really be added to
any of these 'accessor' functions.
(Or, at least, ones without barriers should be provided.)

The driver may want to to a series of writes, then a
single barrier, before a final write of a command (etc).

in_le32() from io.h is specially horrid!

	David

^ permalink raw reply

* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-10  8:29 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, linuxppc-dev, Yevgeny Petrilin, Eli Cohen,
	Thadeu Lima de Souza Cascardo, Eli Cohen
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AE77@saturn3.aculab.com>

On Mon, 2011-10-10 at 09:20 +0100, David Laight wrote:
> 
> For the above I'd actually suggest making 'doorbell_qpn' have the
> correct endianness in order to avoid the (potential) swap every
> time it is set.

Well, the problem is that either you'll end up swapping on x86 or you'll
end up swapping on ppc, there is no "native" MMIO accessor that allow
you to do a no-swap access whatever the arch you are on. Or rather,
there is the __raw_ one but you shouldn't use it for most things :-)
(Because it also doesn't have the right memory barriers).

So I'd rather they do it right using the simpler method, the cost of
swap is going to be negligible, probably not even measurable, and if and
only if they think they can improve on that in a second step, then
consider doing otherwise with appropriate measurements showing a
significant difference.

> You also need to treble-check the required endianness for the
> 'vlan_tag' in the tx descriptor. What would be needed is the
> MAC PCI slave were on an x86 (LE) system.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: David Laight @ 2011-10-10  8:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Eli Cohen
  Cc: netdev, Yevgeny Petrilin, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Eli Cohen
In-Reply-To: <1318153939.29415.401.camel@pasglop>

=20
> Then, this statement:
>=20
> *(u32 *) (&tx_desc->ctrl.vlan_tag) |=3D ring->doorbell_qpn;

...
> instead do ... :

> 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |=3D
cpu_to_be32(ring->doorbell_qpn);
>=20
> (Also get rid of that cast and define vlan_tag as a __be32 to start
> with).

Agreed, casts that change the type of memory - *(foo *)&xxx - are
generally bad news unless you are casting a generic 'buffer' to
a specific structure.
I've seen far to much code that ends up being depending on the
endianness and system word size.

For the above I'd actually suggest making 'doorbell_qpn' have the
correct endianness in order to avoid the (potential) swap every
time it is set.

You also need to treble-check the required endianness for the
'vlan_tag' in the tx descriptor. What would be needed is the
MAC PCI slave were on an x86 (LE) system.

	David

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-10  7:32 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111009103020.GL2681@mtldesk30>

On Sun, 2011-10-09 at 12:30 +0200, Eli Cohen wrote:

> > Ideally you want to avoid that swapping altogether and use the right
> > accessor that indicates that your register is BE to start with. IE.
> > remove the swab32 completely and then use something like 
> > iowrite32be() instead of writel().
> I agree, this looks better but does it work on memory mapped io or
> only on io pci space? All our registers are memory mapped...

The iomap functions work on both.

> > Basically, the problem you have is that writel() has an implicit "write
> > to LE register" semantic. Your register is BE. the "iomap" variants
> > provide you with more fine grained "be" variants to use in that case.
> > There's also writel_be() but that one doesn't exist on every
> > architecture afaik.
> So writel_be is the function I should use for memory mapped io? If it
> does not exist for all platforms it's a pitty :-(

Just use the iomap variant. Usually you also use pci_iomap() instead of
ioremap() but afaik, for straight MMIO, it works with normal ioremap as
well.

> > Now, once the mmio problem is out of the way, let's look back at how you
> > then use that qpn.
> > 
> > With the current code, you've generated something in memory which is
> > byte reversed, so essentially "LE" on ppc and "BE" on x86.
> > 
> > Then, this statement:
> > 
> > *(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
> > 
> > Will essentially write it out as-is in memory for use by the chip. The chip,
> > from what you say, expects BE, so this will be broken on PPC.
> I see. So this field is layed in le for ppc and the rest of the
> descriptor is be. so I assum that __iowrite64_copy() does not swap
> anything but we still have tx_desc->ctrl.vlan_tag in the wrong
> endianess.

Yes because you had swapped it initially. IE your original swab32 is
what busts it for you on ppc.

> > Here too, the right solution is to instead not do that swab32 to begin
> > with (ring->doorbell_qpn remains a native endian value) and instead do,
> > in addition to the above mentioned change to the writel:
> > 
> > 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= cpu_to_be32(ring->doorbell_qpn);
> > 
> > (Also get rid of that cast and define vlan_tag as a __be32 to start
> > with).
> > 
> > Cheers,
> > Ben.
> 
> Thanks for your review. I will send another patch which should fix the
> deficiencies.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-09 10:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <1318153939.29415.401.camel@pasglop>

On Sun, Oct 09, 2011 at 11:52:19AM +0200, Benjamin Herrenschmidt wrote:
> 
> > > To go back to the driver code, the statements that ring a "bell" are:
> > > 
> > > 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
> > > 
> > > This doesn't look right unless "doorbell_qpn" itself is already somewhat
> > > in the appropriate byte order.
> 
> > This is something that supports my claim that the chipset swaps
> > endianess in ppc.
> 
> No the chipset doesn't swap
> 
> > Look at mlx4_en_activate_tx_ring():
> > ring->doorbell_qpn = swab32(ring->qp.qpn << 8);
> 
> That looks gross, I think somebody writing this driver doesn't
> understand endianness.
> 
> > so in LE machines it layed as big endian in memory while in BE machines
> > it is layed as little endian in memory.
> 
> Yes which is very odd, it should be layed out the same in memory
> regardless of the machine. However in this case, this isn't accessed
> directly via DMA (this field at least), so what you appear to be doing
> here is to artificially "undo" what writel does later on (see below).
> 
> > Then we write this value to the device registers which must get it in
> > big endian otherwise it won't work - and we know this works in both
> > ppc and x86. You can ignore the case of blue flame:
> 
> Well it works because your device is odd and has BE registers :-) It's
> however not the right way to do and it's broken in your blue flame case
> (for what is now obvious reasons, see below).
> 
> What's happening basically here is that you are swapping once in
> swab32 , store that swapped value, then writel will re-swap on power and
> not swap on x86 (because the writel accessor performs swapping). So on
> x86 you basically do LE -> BE and on ppc you do BE -> LE -> BE :-)
> Pretty inefficient.
> 
> None of this has anything to do with the chipset which doesn't swap
> anything behind your back.
> 
> Ideally you want to avoid that swapping altogether and use the right
> accessor that indicates that your register is BE to start with. IE.
> remove the swab32 completely and then use something like 
> iowrite32be() instead of writel().
I agree, this looks better but does it work on memory mapped io or
only on io pci space? All our registers are memory mapped...

> 
> Basically, the problem you have is that writel() has an implicit "write
> to LE register" semantic. Your register is BE. the "iomap" variants
> provide you with more fine grained "be" variants to use in that case.
> There's also writel_be() but that one doesn't exist on every
> architecture afaik.
So writel_be is the function I should use for memory mapped io? If it
does not exist for all platforms it's a pitty :-(
> 
> Now, once the mmio problem is out of the way, let's look back at how you
> then use that qpn.
> 
> With the current code, you've generated something in memory which is
> byte reversed, so essentially "LE" on ppc and "BE" on x86.
> 
> Then, this statement:
> 
> *(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
> 
> Will essentially write it out as-is in memory for use by the chip. The chip,
> from what you say, expects BE, so this will be broken on PPC.
I see. So this field is layed in le for ppc and the rest of the
descriptor is be. so I assum that __iowrite64_copy() does not swap
anything but we still have tx_desc->ctrl.vlan_tag in the wrong
endianess.

> 
> Here too, the right solution is to instead not do that swab32 to begin
> with (ring->doorbell_qpn remains a native endian value) and instead do,
> in addition to the above mentioned change to the writel:
> 
> 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= cpu_to_be32(ring->doorbell_qpn);
> 
> (Also get rid of that cast and define vlan_tag as a __be32 to start
> with).
> 
> Cheers,
> Ben.

Thanks for your review. I will send another patch which should fix the
deficiencies.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  9:52 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111009092102.GK2681@mtldesk30>


> > To go back to the driver code, the statements that ring a "bell" are:
> > 
> > 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
> > 
> > This doesn't look right unless "doorbell_qpn" itself is already somewhat
> > in the appropriate byte order.

> This is something that supports my claim that the chipset swaps
> endianess in ppc.

No the chipset doesn't swap

> Look at mlx4_en_activate_tx_ring():
> ring->doorbell_qpn = swab32(ring->qp.qpn << 8);

That looks gross, I think somebody writing this driver doesn't
understand endianness.

> so in LE machines it layed as big endian in memory while in BE machines
> it is layed as little endian in memory.

Yes which is very odd, it should be layed out the same in memory
regardless of the machine. However in this case, this isn't accessed
directly via DMA (this field at least), so what you appear to be doing
here is to artificially "undo" what writel does later on (see below).

> Then we write this value to the device registers which must get it in
> big endian otherwise it won't work - and we know this works in both
> ppc and x86. You can ignore the case of blue flame:

Well it works because your device is odd and has BE registers :-) It's
however not the right way to do and it's broken in your blue flame case
(for what is now obvious reasons, see below).

What's happening basically here is that you are swapping once in
swab32 , store that swapped value, then writel will re-swap on power and
not swap on x86 (because the writel accessor performs swapping). So on
x86 you basically do LE -> BE and on ppc you do BE -> LE -> BE :-)
Pretty inefficient.

None of this has anything to do with the chipset which doesn't swap
anything behind your back.

Ideally you want to avoid that swapping altogether and use the right
accessor that indicates that your register is BE to start with. IE.
remove the swab32 completely and then use something like 
iowrite32be() instead of writel().

Basically, the problem you have is that writel() has an implicit "write
to LE register" semantic. Your register is BE. the "iomap" variants
provide you with more fine grained "be" variants to use in that case.
There's also writel_be() but that one doesn't exist on every
architecture afaik.

Now, once the mmio problem is out of the way, let's look back at how you
then use that qpn.

With the current code, you've generated something in memory which is
byte reversed, so essentially "LE" on ppc and "BE" on x86.

Then, this statement:

*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;

Will essentially write it out as-is in memory for use by the chip. The chip,
from what you say, expects BE, so this will be broken on PPC.

Here too, the right solution is to instead not do that swab32 to begin
with (ring->doorbell_qpn remains a native endian value) and instead do,
in addition to the above mentioned change to the writel:

	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= cpu_to_be32(ring->doorbell_qpn);

(Also get rid of that cast and define vlan_tag as a __be32 to start
with).

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-09  9:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <1318149536.29415.384.camel@pasglop>

On Sun, Oct 09, 2011 at 10:38:56AM +0200, Benjamin Herrenschmidt wrote:
> On Sun, 2011-10-09 at 10:07 +0200, Eli Cohen wrote:
> 
> > > Well, first, what do you mean by "swapped" ? :-) But no, it won't for
> > > all intend and purpose, this is a copy routine, copy routines never
> > > swap, neither do fifo accesses for example.
> > When I say swapped, I mean not necessairliy by software. I think that
> > the chipset will swap the the data. The reason I think so is that the
> > CPU arch is big endian, while PCI bus is defined as little endian.
> > That's why I think a swap will occur in ppc and not in x86.
> 
> No it won't "swap the data". The wiring between PCI and the CPU bus is
> done in a way called "byte address invariant", and there is some kind of
> flip of byte lanes related essentially to ensure that, but for all
> intend and purpose it's transparent.
> 
> > It's a special descriptor that resides both in memory and also written
> > to the device's register. An it contains both data and control
> > informartion.
> 
> Data should not be swapped then. Only individual bits of control
> information. In any case, the buffer should be generated in the right
> format in memory to start with. The copy routine doesn't need to swap.
> 
> To go back to the driver code, the statements that ring a "bell" are:
> 
> 	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
> 
> This doesn't look right unless "doorbell_qpn" itself is already somewhat
> in the appropriate byte order.
This is something that supports my claim that the chipset swaps
endianess in ppc.
Look at mlx4_en_activate_tx_ring():
ring->doorbell_qpn = swab32(ring->qp.qpn << 8);
so in LE machines it layed as big endian in memory while in BE machines
it is layed as little endian in memory.
Then we write this value to the device registers which must get it in
big endian otherwise it won't work - and we know this works in both
ppc and x86. You can ignore the case of blue flame:

        } else if (nreq) {
                qp->sq.head += nreq;

                /*
                 * Make sure that descriptors are written before
                 * doorbell record.
                 */
                wmb();

                writel(qp->doorbell_qpn, qp->bf.uar->map +
MLX4_SEND_DOORBELL); <==  remember that it is layed in little endian
but the device must get it in big endian.

                /*
                 * Make sure doorbells don't leak out of SQ spinlock
                 * and reach the HCA out of order.
                 */
                mmiowb();

        }




> 
> Is that vlan_tag a big or little endian quantity ? Either way, this
> looks broken in either x86 or ppc unless doorbell_qpn itself is already
> in the right endian.
> 
> But since later I see
> 
> 	writel(ring->doorbell_qpn, ring->bf.uar->map + MLX4_SEND_DOORBELL);
> 
> That rings a nasty bell, it looks like doorbell_pqn is in CPU (native)
> endian, so it should have to be swapped before being OR'ed into the
> descriptor, either that or the HW does some black magic I don't
> understand.
> 
> Cheers,
> Ben.
> 

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  8:38 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111009080747.GJ2681@mtldesk30>

On Sun, 2011-10-09 at 10:07 +0200, Eli Cohen wrote:

> > Well, first, what do you mean by "swapped" ? :-) But no, it won't for
> > all intend and purpose, this is a copy routine, copy routines never
> > swap, neither do fifo accesses for example.
> When I say swapped, I mean not necessairliy by software. I think that
> the chipset will swap the the data. The reason I think so is that the
> CPU arch is big endian, while PCI bus is defined as little endian.
> That's why I think a swap will occur in ppc and not in x86.

No it won't "swap the data". The wiring between PCI and the CPU bus is
done in a way called "byte address invariant", and there is some kind of
flip of byte lanes related essentially to ensure that, but for all
intend and purpose it's transparent.

> It's a special descriptor that resides both in memory and also written
> to the device's register. An it contains both data and control
> informartion.

Data should not be swapped then. Only individual bits of control
information. In any case, the buffer should be generated in the right
format in memory to start with. The copy routine doesn't need to swap.

To go back to the driver code, the statements that ring a "bell" are:

	*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;

This doesn't look right unless "doorbell_qpn" itself is already somewhat
in the appropriate byte order.

Is that vlan_tag a big or little endian quantity ? Either way, this
looks broken in either x86 or ppc unless doorbell_qpn itself is already
in the right endian.

But since later I see

	writel(ring->doorbell_qpn, ring->bf.uar->map + MLX4_SEND_DOORBELL);

That rings a nasty bell, it looks like doorbell_pqn is in CPU (native)
endian, so it should have to be swapped before being OR'ed into the
descriptor, either that or the HW does some black magic I don't
understand.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-09  8:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <1318147254.29415.377.camel@pasglop>

On Sun, Oct 09, 2011 at 10:00:54AM +0200, Benjamin Herrenschmidt wrote:
> On Sun, 2011-10-09 at 09:35 +0200, Eli Cohen wrote:
> > On Sun, Oct 09, 2011 at 09:25:18AM +0200, Benjamin Herrenschmidt wrote:
> > > On Thu, 2011-10-06 at 15:57 +0200, Eli Cohen wrote:
> > > > On Wed, Oct 05, 2011 at 10:15:02AM +0200, Eli Cohen wrote:
> > > > 
> > > > How about this patch - can you give it a try?
> > > > 
> > > > 
> > > > >From dee60547aa9e35a02835451d9e694cd80dd3072f Mon Sep 17 00:00:00 2001
> > > > From: Eli Cohen <eli@mellanox.co.il>
> > > > Date: Thu, 6 Oct 2011 15:50:02 +0200
> > > > Subject: [PATCH] mlx4_en: Fix blue flame on powerpc
> > > > 
> > > > The source buffer used for copying into the blue flame register is already in
> > > > big endian. However, when copying to device on powerpc, the endianess is
> > > > swapped so the data reaches th device in little endian which is wrong. On x86
> > > > based platform no swapping occurs so it reaches the device with the correct
> > > > endianess. Fix this by calling le32_to_cpu() on the buffer. On LE systems there
> > > > is no change; on BE there will be a swap.
> > > 
> > > That looks wrong.
> > Not sure I understand: are you saying that on ppc, when you call
> > __iowrite64_copy, it will not reach the device swapped?
> 
> Well, first, what do you mean by "swapped" ? :-) But no, it won't for
> all intend and purpose, this is a copy routine, copy routines never
> swap, neither do fifo accesses for example.
When I say swapped, I mean not necessairliy by software. I think that
the chipset will swap the the data. The reason I think so is that the
CPU arch is big endian, while PCI bus is defined as little endian.
That's why I think a swap will occur in ppc and not in x86.
> 
> > The point is that we must always have the buffer ready in big endian
> > in memory. In the case of blue flame, we must also copy it to the
> > device registers in pci memory space. So if we use the buffer we
> > already prepared, we must have another swap. I can think of a nicer
> > way to implement this functionality but the question is do you think
> > my observation above is wrong and why.
> 
> No. If it's in memory BE then the copy routine will keep it BE. A copy
> routine doesn't swap and doesn't affect endianness.
> 
> Additionally, a swapping phase like you proposed doing 32-bit swaps
> means that you know for sure that the buffer is made of 32-bit
> quantities, is that the case ?
Yes

> Even if you had needed that swap, if your
> buffer had contained 16-bit or 64-bit quantities, you're toast.
> 
> What is this buffer anyway ? A descriptor or a network packet ?
It's a special descriptor that resides both in memory and also written
to the device's register. An it contains both data and control
informartion.
> 
> If it's a packet, then it's data, endianness has no meaning (or rather
> it has for individual fields of the packets but they are already in the
> right format and a 32-bit swap will never be right).
>  
> It's almost never right to perform swapping when copying data (or
> reading/writing a FIFO).
> 
> Cheers,
> Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  8:00 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111009073546.GI2681@mtldesk30>

On Sun, 2011-10-09 at 09:35 +0200, Eli Cohen wrote:
> On Sun, Oct 09, 2011 at 09:25:18AM +0200, Benjamin Herrenschmidt wrote:
> > On Thu, 2011-10-06 at 15:57 +0200, Eli Cohen wrote:
> > > On Wed, Oct 05, 2011 at 10:15:02AM +0200, Eli Cohen wrote:
> > > 
> > > How about this patch - can you give it a try?
> > > 
> > > 
> > > >From dee60547aa9e35a02835451d9e694cd80dd3072f Mon Sep 17 00:00:00 2001
> > > From: Eli Cohen <eli@mellanox.co.il>
> > > Date: Thu, 6 Oct 2011 15:50:02 +0200
> > > Subject: [PATCH] mlx4_en: Fix blue flame on powerpc
> > > 
> > > The source buffer used for copying into the blue flame register is already in
> > > big endian. However, when copying to device on powerpc, the endianess is
> > > swapped so the data reaches th device in little endian which is wrong. On x86
> > > based platform no swapping occurs so it reaches the device with the correct
> > > endianess. Fix this by calling le32_to_cpu() on the buffer. On LE systems there
> > > is no change; on BE there will be a swap.
> > 
> > That looks wrong.
> Not sure I understand: are you saying that on ppc, when you call
> __iowrite64_copy, it will not reach the device swapped?

Well, first, what do you mean by "swapped" ? :-) But no, it won't for
all intend and purpose, this is a copy routine, copy routines never
swap, neither do fifo accesses for example.

> The point is that we must always have the buffer ready in big endian
> in memory. In the case of blue flame, we must also copy it to the
> device registers in pci memory space. So if we use the buffer we
> already prepared, we must have another swap. I can think of a nicer
> way to implement this functionality but the question is do you think
> my observation above is wrong and why.

No. If it's in memory BE then the copy routine will keep it BE. A copy
routine doesn't swap and doesn't affect endianness.

Additionally, a swapping phase like you proposed doing 32-bit swaps
means that you know for sure that the buffer is made of 32-bit
quantities, is that the case ? Even if you had needed that swap, if your
buffer had contained 16-bit or 64-bit quantities, you're toast.

What is this buffer anyway ? A descriptor or a network packet ?

If it's a packet, then it's data, endianness has no meaning (or rather
it has for individual fields of the packets but they are already in the
right format and a 32-bit swap will never be right).
 
It's almost never right to perform swapping when copying data (or
reading/writing a FIFO).

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Eli Cohen @ 2011-10-09  7:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <1318145118.29415.371.camel@pasglop>

On Sun, Oct 09, 2011 at 09:25:18AM +0200, Benjamin Herrenschmidt wrote:
> On Thu, 2011-10-06 at 15:57 +0200, Eli Cohen wrote:
> > On Wed, Oct 05, 2011 at 10:15:02AM +0200, Eli Cohen wrote:
> > 
> > How about this patch - can you give it a try?
> > 
> > 
> > >From dee60547aa9e35a02835451d9e694cd80dd3072f Mon Sep 17 00:00:00 2001
> > From: Eli Cohen <eli@mellanox.co.il>
> > Date: Thu, 6 Oct 2011 15:50:02 +0200
> > Subject: [PATCH] mlx4_en: Fix blue flame on powerpc
> > 
> > The source buffer used for copying into the blue flame register is already in
> > big endian. However, when copying to device on powerpc, the endianess is
> > swapped so the data reaches th device in little endian which is wrong. On x86
> > based platform no swapping occurs so it reaches the device with the correct
> > endianess. Fix this by calling le32_to_cpu() on the buffer. On LE systems there
> > is no change; on BE there will be a swap.
> 
> That looks wrong.
Not sure I understand: are you saying that on ppc, when you call
__iowrite64_copy, it will not reach the device swapped?

The point is that we must always have the buffer ready in big endian
in memory. In the case of blue flame, we must also copy it to the
device registers in pci memory space. So if we use the buffer we
already prepared, we must have another swap. I can think of a nicer
way to implement this functionality but the question is do you think
my observation above is wrong and why.

> 
> What is this __iowrite64_copy... oh I see
> 
> Nice, somebody _AGAIN_ added a bunch of "generic" IO accessors that are
> utterly wrong on all archs except x86 (ok, -almost-). There isn't a
> single bloody memory barrier in there !
> 
> So, __iowrite64_copy is doing raw_writel which will -not- swap, so your
> buffer is going to have the same endianness in the destination than it
> has in the source. This is _NOT_ the right place to do a swap.
> 
> It's the original construction of the descriptor that needs change. The
> data itself should never need to be affected accross a copy operation
> (unless your HW is terminally busted).
> 
> Cheers,
> Ben.
> 
> > Signed-off-by: Eli Cohen <eli@mellanox.co.il>
> > ---
> >  drivers/net/mlx4/en_tx.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> > index 16337fb..3743acc 100644
> > --- a/drivers/net/mlx4/en_tx.c
> > +++ b/drivers/net/mlx4/en_tx.c
> > @@ -601,6 +601,16 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
> >  
> >  static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt)
> >  {
> > +	int i;
> > +	__le32 *psrc = (__le32 *)src;
> > +
> > +	/*
> > +	 * the buffer is already in big endian. For little endian machines that's
> > +	 * fine. For big endain machines we must swap since the chipset swaps again
> > +	 */
> > +	for (i = 0; i < bytecnt / 4; ++i)
> > +		psrc[i] = le32_to_cpu(psrc[i]);
> > +
> >  	__iowrite64_copy(dst, src, bytecnt / 8);
> >  }
> >  
> > -- 
> > 1.7.7.rc0.70.g82660
> > 
> > 
> > 
> > > On Tue, Oct 04, 2011 at 05:26:20PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > > 
> > > I believe we have an endianess problem here. The source buffer is in
> > > big endian - in x86 archs, it will rich the pci device unswapped since
> > > both x86 and pci are little endian. In ppc, it wil be swapped by the
> > > chipset so it will reach the device in little endian which is wrong.
> > > So, in mlx4_bf_copy, you could loop over the buffer and swap32 the all
> > > the dwords before the call to __iowrite64_copy. Of course which should
> > > fix this in an arch independent manner. Let me know this works for
> > > you.
> > > 
> > > > On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote:
> > > > > On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote:
> > > > > 
> > > > >  .../...
> > > > > 
> > > > > > > Can you also send me the output of ethtool -i?
> > > > > > > It seems that there is a problem with write combining on Power processors, we will check this issue.
> > > > > > > 
> > > > > > > Yevgeny
> > > > > > 
> > > > > > Hello, Yevgeny.
> > > > > > 
> > > > > > You will find the output of ethtool -i below.
> > > > > > 
> > > > > > I am copying Ben and powerpc list, in case this is an issue with Power
> > > > > > processors. They can provide us some more insight into this.
> > > > > 
> > > > > May I get some background please ? :-)
> > > > > 
> > > > > I'm not aware of a specific issue with write combining but I'd need to
> > > > > know more about what you are doing and the code to do it to comment on
> > > > > whether it should work or not.
> > > > > 
> > > > > Cheers,
> > > > > Ben.
> > > > > 
> > > > > 
> > > > 
> > > > Hello, Ben.
> > > > 
> > > > Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has
> > > > added blue frame support, that does not require writing to the device
> > > > memory to indicate a new packet (the doorbell register as it is called).
> > > > 
> > > > Well, the ring is getting full with no interrupt or packets transmitted.
> > > > I simply added a write to the doorbell register and it works for me.
> > > > Yevgeny says this is not the right fix, claiming there is a problem with
> > > > write combining on POWER. The code uses memory barriers, so I don't know
> > > > why there is any problem.
> > > > 
> > > > I am posting the code here to show better what the situation is.
> > > > Yevgeny can tell more about the device and the driver.
> > > > 
> > > > The code below is the driver as of now, including a diff with what I
> > > > changed and had resulted OK for me. Before the blue frame support, the
> > > > only code executed was the else part. I can't tell much what the device
> > > > should be seeing and doing after the blue frame part of the code is
> > > > executed. But it does send the packet if I write to the doorbell
> > > > register.
> > > > 
> > > > Yevgeny, can you tell us what the device should be doing and why you
> > > > think this is a problem on POWER? Is it possible that this is simply a
> > > > problem with the firmware version?
> > > > 
> > > > Thanks,
> > > > Cascardo.
> > > > 
> > > > ---
> > > >         if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
> > > > !vlan_tag) {
> > > >                 *(u32 *) (&tx_desc->ctrl.vlan_tag) |=
> > > > ring->doorbell_qpn;
> > > >                 op_own |= htonl((bf_index & 0xffff) << 8);
> > > >                 /* Ensure new descirptor hits memory
> > > >                 * before setting ownership of this descriptor to HW */
> > > >                 wmb();
> > > >                 tx_desc->ctrl.owner_opcode = op_own;
> > > > 
> > > >                 wmb();
> > > > 
> > > >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned
> > > > long *) &tx_desc->ctrl,
> > > >                      desc_size);
> > > > 
> > > >                 wmb();
> > > > 
> > > >                 ring->bf.offset ^= ring->bf.buf_size;
> > > >         } else {
> > > >                 /* Ensure new descirptor hits memory
> > > >                 * before setting ownership of this descriptor to HW */
> > > >                 wmb();
> > > >                 tx_desc->ctrl.owner_opcode = op_own;
> > > > -               wmb();
> > > > -               writel(ring->doorbell_qpn, ring->bf.uar->map +
> > > > MLX4_SEND_DOORBELL);
> > > >         }
> > > > 
> > > > +       wmb();
> > > > +       writel(ring->doorbell_qpn, ring->bf.uar->map +
> > > > MLX4_SEND_DOORBELL);
> > > > +
> > > > ---
> 

^ permalink raw reply

* RE: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled
From: Benjamin Herrenschmidt @ 2011-10-09  7:28 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, Yevgeny Petrilin, Eli Cohen, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AE76@saturn3.aculab.com>

On Thu, 2011-10-06 at 15:10 +0100, David Laight wrote:
> horrid...
> 1) I'm not sure the caller expects the buffer to be corrupted.
> 2) It contains a lot of memory cycles.
> 3) It looked from the calls that this code is copying descriptors,
>    so the transfer length is probably 1 or 2 words - so the loop
>    is inefficient.
> 4) ppc doesn't have a fast byteswap instruction (very new gcc might
>    use the byteswapping memery access for the le32_to_cpu() though),
>    so it would be better getting the byteswap done inside
>    __iowrite64_copy() - since that is probably requesting a byteswap
>    anyway.
> OTOH I'm not at all clear about the 64bit xfers....

And it's just plain wrong anyway. You should never have to byteswap a
copy.

Ben.

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  7:25 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111006135759.GH2681@mtldesk30>

On Thu, 2011-10-06 at 15:57 +0200, Eli Cohen wrote:
> On Wed, Oct 05, 2011 at 10:15:02AM +0200, Eli Cohen wrote:
> 
> How about this patch - can you give it a try?
> 
> 
> >From dee60547aa9e35a02835451d9e694cd80dd3072f Mon Sep 17 00:00:00 2001
> From: Eli Cohen <eli@mellanox.co.il>
> Date: Thu, 6 Oct 2011 15:50:02 +0200
> Subject: [PATCH] mlx4_en: Fix blue flame on powerpc
> 
> The source buffer used for copying into the blue flame register is already in
> big endian. However, when copying to device on powerpc, the endianess is
> swapped so the data reaches th device in little endian which is wrong. On x86
> based platform no swapping occurs so it reaches the device with the correct
> endianess. Fix this by calling le32_to_cpu() on the buffer. On LE systems there
> is no change; on BE there will be a swap.

That looks wrong.

What is this __iowrite64_copy... oh I see

Nice, somebody _AGAIN_ added a bunch of "generic" IO accessors that are
utterly wrong on all archs except x86 (ok, -almost-). There isn't a
single bloody memory barrier in there !

So, __iowrite64_copy is doing raw_writel which will -not- swap, so your
buffer is going to have the same endianness in the destination than it
has in the source. This is _NOT_ the right place to do a swap.

It's the original construction of the descriptor that needs change. The
data itself should never need to be affected accross a copy operation
(unless your HW is terminally busted).

Cheers,
Ben.

> Signed-off-by: Eli Cohen <eli@mellanox.co.il>
> ---
>  drivers/net/mlx4/en_tx.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> index 16337fb..3743acc 100644
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -601,6 +601,16 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>  
>  static void mlx4_bf_copy(unsigned long *dst, unsigned long *src, unsigned bytecnt)
>  {
> +	int i;
> +	__le32 *psrc = (__le32 *)src;
> +
> +	/*
> +	 * the buffer is already in big endian. For little endian machines that's
> +	 * fine. For big endain machines we must swap since the chipset swaps again
> +	 */
> +	for (i = 0; i < bytecnt / 4; ++i)
> +		psrc[i] = le32_to_cpu(psrc[i]);
> +
>  	__iowrite64_copy(dst, src, bytecnt / 8);
>  }
>  
> -- 
> 1.7.7.rc0.70.g82660
> 
> 
> 
> > On Tue, Oct 04, 2011 at 05:26:20PM -0300, Thadeu Lima de Souza Cascardo wrote:
> > 
> > I believe we have an endianess problem here. The source buffer is in
> > big endian - in x86 archs, it will rich the pci device unswapped since
> > both x86 and pci are little endian. In ppc, it wil be swapped by the
> > chipset so it will reach the device in little endian which is wrong.
> > So, in mlx4_bf_copy, you could loop over the buffer and swap32 the all
> > the dwords before the call to __iowrite64_copy. Of course which should
> > fix this in an arch independent manner. Let me know this works for
> > you.
> > 
> > > On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote:
> > > > On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote:
> > > > 
> > > >  .../...
> > > > 
> > > > > > Can you also send me the output of ethtool -i?
> > > > > > It seems that there is a problem with write combining on Power processors, we will check this issue.
> > > > > > 
> > > > > > Yevgeny
> > > > > 
> > > > > Hello, Yevgeny.
> > > > > 
> > > > > You will find the output of ethtool -i below.
> > > > > 
> > > > > I am copying Ben and powerpc list, in case this is an issue with Power
> > > > > processors. They can provide us some more insight into this.
> > > > 
> > > > May I get some background please ? :-)
> > > > 
> > > > I'm not aware of a specific issue with write combining but I'd need to
> > > > know more about what you are doing and the code to do it to comment on
> > > > whether it should work or not.
> > > > 
> > > > Cheers,
> > > > Ben.
> > > > 
> > > > 
> > > 
> > > Hello, Ben.
> > > 
> > > Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has
> > > added blue frame support, that does not require writing to the device
> > > memory to indicate a new packet (the doorbell register as it is called).
> > > 
> > > Well, the ring is getting full with no interrupt or packets transmitted.
> > > I simply added a write to the doorbell register and it works for me.
> > > Yevgeny says this is not the right fix, claiming there is a problem with
> > > write combining on POWER. The code uses memory barriers, so I don't know
> > > why there is any problem.
> > > 
> > > I am posting the code here to show better what the situation is.
> > > Yevgeny can tell more about the device and the driver.
> > > 
> > > The code below is the driver as of now, including a diff with what I
> > > changed and had resulted OK for me. Before the blue frame support, the
> > > only code executed was the else part. I can't tell much what the device
> > > should be seeing and doing after the blue frame part of the code is
> > > executed. But it does send the packet if I write to the doorbell
> > > register.
> > > 
> > > Yevgeny, can you tell us what the device should be doing and why you
> > > think this is a problem on POWER? Is it possible that this is simply a
> > > problem with the firmware version?
> > > 
> > > Thanks,
> > > Cascardo.
> > > 
> > > ---
> > >         if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
> > > !vlan_tag) {
> > >                 *(u32 *) (&tx_desc->ctrl.vlan_tag) |=
> > > ring->doorbell_qpn;
> > >                 op_own |= htonl((bf_index & 0xffff) << 8);
> > >                 /* Ensure new descirptor hits memory
> > >                 * before setting ownership of this descriptor to HW */
> > >                 wmb();
> > >                 tx_desc->ctrl.owner_opcode = op_own;
> > > 
> > >                 wmb();
> > > 
> > >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned
> > > long *) &tx_desc->ctrl,
> > >                      desc_size);
> > > 
> > >                 wmb();
> > > 
> > >                 ring->bf.offset ^= ring->bf.buf_size;
> > >         } else {
> > >                 /* Ensure new descirptor hits memory
> > >                 * before setting ownership of this descriptor to HW */
> > >                 wmb();
> > >                 tx_desc->ctrl.owner_opcode = op_own;
> > > -               wmb();
> > > -               writel(ring->doorbell_qpn, ring->bf.uar->map +
> > > MLX4_SEND_DOORBELL);
> > >         }
> > > 
> > > +       wmb();
> > > +       writel(ring->doorbell_qpn, ring->bf.uar->map +
> > > MLX4_SEND_DOORBELL);
> > > +
> > > ---

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  7:21 UTC (permalink / raw)
  To: Eli Cohen
  Cc: netdev@vger.kernel.org, Eli Cohen, linuxppc-dev,
	Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
In-Reply-To: <20111005081502.GB2681@mtldesk30>

On Wed, 2011-10-05 at 10:15 +0200, Eli Cohen wrote:
> On Tue, Oct 04, 2011 at 05:26:20PM -0300, Thadeu Lima de Souza Cascardo wrote:
> 
> I believe we have an endianess problem here. The source buffer is in
> big endian - in x86 archs, it will rich the pci device unswapped since
> both x86 and pci are little endian. In ppc, it wil be swapped by the
> chipset 
  ^^^^^^^ ugh ?

> so it will reach the device in little endian which is wrong.
> So, in mlx4_bf_copy, you could loop over the buffer and swap32 the all
> the dwords before the call to __iowrite64_copy. Of course which should
> fix this in an arch independent manner. Let me know this works for
> you.

That's generally the wrong way to handle endian (to swap32 a whole
buffer).

But I need to know more about the structure of those descriptors etc...
to judge properly.

Cheers,
Ben.

> > On Tue, Oct 04, 2011 at 08:02:12AM +0200, Benjamin Herrenschmidt wrote:
> > > On Mon, 2011-10-03 at 17:53 -0300, Thadeu Lima de Souza Cascardo wrote:
> > > 
> > >  .../...
> > > 
> > > > > Can you also send me the output of ethtool -i?
> > > > > It seems that there is a problem with write combining on Power processors, we will check this issue.
> > > > > 
> > > > > Yevgeny
> > > > 
> > > > Hello, Yevgeny.
> > > > 
> > > > You will find the output of ethtool -i below.
> > > > 
> > > > I am copying Ben and powerpc list, in case this is an issue with Power
> > > > processors. They can provide us some more insight into this.
> > > 
> > > May I get some background please ? :-)
> > > 
> > > I'm not aware of a specific issue with write combining but I'd need to
> > > know more about what you are doing and the code to do it to comment on
> > > whether it should work or not.
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > > 
> > 
> > Hello, Ben.
> > 
> > Sorry for that. I am testing mlx4_en driver on a POWER. Yevgeny has
> > added blue frame support, that does not require writing to the device
> > memory to indicate a new packet (the doorbell register as it is called).
> > 
> > Well, the ring is getting full with no interrupt or packets transmitted.
> > I simply added a write to the doorbell register and it works for me.
> > Yevgeny says this is not the right fix, claiming there is a problem with
> > write combining on POWER. The code uses memory barriers, so I don't know
> > why there is any problem.
> > 
> > I am posting the code here to show better what the situation is.
> > Yevgeny can tell more about the device and the driver.
> > 
> > The code below is the driver as of now, including a diff with what I
> > changed and had resulted OK for me. Before the blue frame support, the
> > only code executed was the else part. I can't tell much what the device
> > should be seeing and doing after the blue frame part of the code is
> > executed. But it does send the packet if I write to the doorbell
> > register.
> > 
> > Yevgeny, can you tell us what the device should be doing and why you
> > think this is a problem on POWER? Is it possible that this is simply a
> > problem with the firmware version?
> > 
> > Thanks,
> > Cascardo.
> > 
> > ---
> >         if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
> > !vlan_tag) {
> >                 *(u32 *) (&tx_desc->ctrl.vlan_tag) |=
> > ring->doorbell_qpn;
> >                 op_own |= htonl((bf_index & 0xffff) << 8);
> >                 /* Ensure new descirptor hits memory
> >                 * before setting ownership of this descriptor to HW */
> >                 wmb();
> >                 tx_desc->ctrl.owner_opcode = op_own;
> > 
> >                 wmb();
> > 
> >                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned
> > long *) &tx_desc->ctrl,
> >                      desc_size);
> > 
> >                 wmb();
> > 
> >                 ring->bf.offset ^= ring->bf.buf_size;
> >         } else {
> >                 /* Ensure new descirptor hits memory
> >                 * before setting ownership of this descriptor to HW */
> >                 wmb();
> >                 tx_desc->ctrl.owner_opcode = op_own;
> > -               wmb();
> > -               writel(ring->doorbell_qpn, ring->bf.uar->map +
> > MLX4_SEND_DOORBELL);
> >         }
> > 
> > +       wmb();
> > +       writel(ring->doorbell_qpn, ring->bf.uar->map +
> > MLX4_SEND_DOORBELL);
> > +
> > ---

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame is enabled
From: Benjamin Herrenschmidt @ 2011-10-09  7:19 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: netdev@vger.kernel.org, Eli Cohen, eli@dev.mellanox.co.il,
	linuxppc-dev, Yevgeny Petrilin
In-Reply-To: <20111004202620.GA3455@oc1711230544.ibm.com>

On Tue, 2011-10-04 at 17:26 -0300, Thadeu Lima de Souza Cascardo wrote:

>         if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
> !vlan_tag) {
>                 *(u32 *) (&tx_desc->ctrl.vlan_tag) |=
> ring->doorbell_qpn;

Could this have endianness problems ?

>                 op_own |= htonl((bf_index & 0xffff) << 8);

Probably better to use cpu_to_be/lexx() here. Also make sure your
descriptors are defined with the appropriate __beXX or __leXX types
so sparse can tell you when you get this wrong

>                 /* Ensure new descirptor hits memory
>                 * before setting ownership of this descriptor to HW */
>                 wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;

>                 wmb();
> 
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned
> long *) &tx_desc->ctrl,
>                      desc_size);

What does the above do ?

>                 wmb();
> 
>                 ring->bf.offset ^= ring->bf.buf_size;

Is this HW visible ? It's a bit odd, are you doing double bufferring and
trying to flip a bit ? If it's HW visible, is endian correct ?
 
        } else {
>                 /* Ensure new descirptor hits memory
>                 * before setting ownership of this descriptor to HW */
>                 wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> -               wmb();
> -               writel(ring->doorbell_qpn, ring->bf.uar->map +
> MLX4_SEND_DOORBELL);
>         }
> 
> +       wmb();
> +       writel(ring->doorbell_qpn, ring->bf.uar->map +
> MLX4_SEND_DOORBELL);
> +
> ---

Cheers,
Ben.

^ permalink raw reply

* RE: Request_irq fails for IRQ2
From: Benjamin Herrenschmidt @ 2011-10-09  7:11 UTC (permalink / raw)
  To: smitha.vanga; +Cc: scottwood, linuxppc-dev
In-Reply-To: <40631E9A2581F14BA60888C87A76A1FE0147A1@HYD-MKD-MBX4.wipro.com>

On Tue, 2011-10-04 at 13:55 +0000, smitha.vanga@wipro.com wrote:
> 
> Hi Scoot,
> 
> When I try to use a atomic varaible in my ISR  I see a kernel crash .
> with mesage BUG: scheduling while atomic:
> 
> 
> 
> Below is the code :
> My ISR
> 
> irqreturn_t cpld_irq_handler(int irq, void * dev_id, struct pt_regs
> *regs)
> 
> {
> 
> wake_up(&cpld_intr_wait);
> 
> atomic_inc(&cpld_intr_data); /* incrementing this will indicate the
> poll() that the interrupt is occured */
> 
> return 0;
> 
> }

This is of course completely racy, you should do the increment -before-
you wake up. I suppose you aren't SMP at the moment but even then, if
you ever switch for example to threaded interrupts or use RT it will
potentially break.
> 
> DRIVER_INIT
> static int __init gpio_init(void)
> {
>         int ret = 0;
>         int virq;
> 
>        
>     atomic_set(&cpld_intr_data, 0);                     /* initialize
> the Interrupt indicator */
>     init_waitqueue_head(&cpld_intr_wait);               /* Initialize
> the wait queue */
> 
>     virq = irq_create_mapping(NULL, CPLD1_INTERRUPT);

See comments earlier about using the device-tree here.
>               
>    if ((ret = request_irq(virq,cpld_irq_handler, 0, GPIO_CHAR_PATH,
> NULL))!=0)
>    {
>       printk(KERN_ERR "gpio_init: Could not grab IRQ line for CPLD ret
> = %d\n",ret);
>           goto err1;
>    }
>   
>  
>         if((s_nGPIOMajor = register_chrdev(MPC8247_DEVICE_MAJOR_NUM,
> GPIO_CHAR_PATH, &gpio_fops))<0)
>         {
>                 GPIO_DBG2("GPIO_DRIVER  : unable to get major %d\n",
> s_nGPIOMajor);
>                 return s_nGPIOMajor;
>                
>         }else{
>                 GPIO_DBG2("GPIO_DRIVER  : major = %x\n", s_nGPIOMajor
> );
>         }

coding style FAIL

>         return 0;
>     
> } 

I don't see anything that does your scheduling while atomic here,
probably a bug in your poll implementation but it's not here.

Oh and stop sending that crap:

> Regards,
> Smitha
> 
> Please do not print this email unless it is absolutely necessary. 
> 
> The information contained in this electronic message and any
> attachments to this message are intended for the exclusive use of the
> addressee(s) and may contain proprietary, confidential or privileged
> information. If you are not the intended recipient, you should not
> disseminate, distribute or copy this e-mail. Please notify the sender
> immediately and destroy all copies of this message and any
> attachments. 
> 
> WARNING: Computer viruses can be transmitted via email. The recipient
> should check this email and any attachments for the presence of
> viruses. The company accepts no liability for any damage caused by any
> virus transmitted by this email. 

It's a complete nonsense on a public mailing list

Ben.

^ permalink raw reply

* RE: Request_irq fails for IRQ2
From: Benjamin Herrenschmidt @ 2011-10-09  7:07 UTC (permalink / raw)
  To: smitha.vanga; +Cc: scottwood, linuxppc-dev
In-Reply-To: <40631E9A2581F14BA60888C87A76A1FE014773@HYD-MKD-MBX4.wipro.com>

On Tue, 2011-10-04 at 11:21 +0000, smitha.vanga@wipro.com wrote:
> Hi Scott,
> 
> I am able to register the IRQ now once I add the call irq_set_default_host in cpm2_pic.

This is a band-aid at best and will probably not be accepted upstream.

You should -really- describe your interrupt in the device-tree instead.

Ben.


> But when I call enable_irq the code throws the below warning and gives exception in enable irq at 
> spin_unlock_irqrestore(&desc->lock, flags); in enable_irq.
> 
> 
> printk(KERN_WARNING "Unbalanced enable for IRQ %d\n", irq);
> 
> Regards,
> Smitha
> Please do not print this email unless it is absolutely necessary. 
> 
> The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. 
> 
> WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. 
> 
> www.wipro.com
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* RE: BUG: scheduling while atomic:
From: smitha.vanga @ 2011-10-08 12:51 UTC (permalink / raw)
  To: scottwood; +Cc: linuxppc-dev
In-Reply-To: <4E8C9741.9010401@freescale.com>

 
Hi Scott,

Below are the logs when I enable the kallsyms.
Any suggestion.


sh-3.1# I am in cpld_irq_handler
end cpld_irq_handler
BUG: scheduling while atomic: IRQ-20/0x0fff0000/108
Call Trace:
[C3B13EC0] [C0007DD0] show_stack+0x3c/0x194 (unreliable)
[C3B13EF0] [C0017F70] __schedule_bug+0x34/0x44
[C3B13F00] [C0269E38] schedule+0x31c/0x35c
[C3B13F50] [C0017FA4] __cond_resched+0x24/0x50
[C3B13F60] [C00180A4] cond_resched_hardirq_context+0x8c/0xfc
[C3B13F70] [C0047FBC] thread_simple_irq+0x78/0xc4
[C3B13F90] [C00482BC] do_irqd+0x2b4/0x32c
[C3B13FC0] [C0032BAC] kthread+0xc0/0xfc
[C3B13FF0] [C000F60C] original_kernel_thread+0x44/0x60



Regards,
Smitha
Please do not print this email unless it is absolutely necessary. =0A=
=0A=
The information contained in this electronic message and any attachments to=
 this message are intended for the exclusive use of the addressee(s) and may=
 contain proprietary, confidential or privileged information. If you are not=
 the intended recipient, you should not disseminate, distribute or copy this=
 e-mail. Please notify the sender immediately and destroy all copies of this=
 message and any attachments. =0A=
=0A=
WARNING: Computer viruses can be transmitted via email. The recipient should=
 check this email and any attachments for the presence of viruses. The compa=
ny accepts no liability for any damage caused by any virus transmitted by th=
is email. =0A=
=0A=
www.wipro.com

^ permalink raw reply

* Re: [PATCH] mlx4_en: fix transmit of packages when blue frame isenabled
From: Thadeu Lima de Souza Cascardo @ 2011-10-07 22:29 UTC (permalink / raw)
  To: David Laight; +Cc: Eli Cohen, Yevgeny Petrilin, Eli Cohen, linuxppc-dev, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AE76@saturn3.aculab.com>

On Thu, Oct 06, 2011 at 03:10:28PM +0100, David Laight wrote:
>  
> >  static void mlx4_bf_copy(unsigned long *dst, unsigned long *src,
> unsigned bytecnt) {
> > +	int i;
> > +	__le32 *psrc = (__le32 *)src;
> > +
> > +	/*
> > +	 * the buffer is already in big endian. For little endian
> machines that's
> > +	 * fine. For big endain machines we must swap since the chipset
> swaps again
> > +	 */
> > +	for (i = 0; i < bytecnt / 4; ++i)
> > +		psrc[i] = le32_to_cpu(psrc[i]);
> > +
> >  	__iowrite64_copy(dst, src, bytecnt / 8);
> >  }
> 
> That code looks horrid...
> 1) I'm not sure the caller expects the buffer to be corrupted.
> 2) It contains a lot of memory cycles.
> 3) It looked from the calls that this code is copying descriptors,
>    so the transfer length is probably 1 or 2 words - so the loop
>    is inefficient.
> 4) ppc doesn't have a fast byteswap instruction (very new gcc might
>    use the byteswapping memery access for the le32_to_cpu() though),
>    so it would be better getting the byteswap done inside
>    __iowrite64_copy() - since that is probably requesting a byteswap
>    anyway.
> OTOH I'm not at all clear about the 64bit xfers....
> 
> 

Plus, it doesn't work.  :-\

After doing some more investigation, I decided to add the ARP entries
manually and test for different packet sizes. Packets larger than 256
should use the non-BF path and work. Smaller packets should fail. To my
surprise, the smaller packets were fine. Tried many different sizes and
they all succeeded. Then, tried using broadcast and it worked too (after
setting the proper sysctl). Broadcast with 0, 8 and 16 ping size, all
OK. Removed the ARP entries, the ARP responses did not get to the other
side. So, I did a pretty nasty hack to not use BF for ARP packets and
after cleaning up the ARP table, everything seems to be working fine.
What follows is my patch, just for reference.

Regards,
Cascardo.

---
--- drivers/net/mlx4/en_tx.c.orig	2011-10-03 15:42:15.736104623 -0500
+++ drivers/net/mlx4/en_tx.c	2011-10-07 17:12:38.023792483 -0500
@@ -627,6 +627,7 @@
 	int lso_header_size;
 	void *fragptr;
 	bool bounce = false;
+	bool is_arp = false;
 
 	if (!priv->port_up)
 		goto tx_drop;
@@ -698,10 +699,11 @@
 		priv->port_stats.tx_chksum_offload++;
 	}
 
+	skb_reset_mac_header(skb);
+	ethh = eth_hdr(skb);
+	is_arp = ethh && ethh->h_proto == ETH_P_ARP;
 	if (unlikely(priv->validate_loopback)) {
 		/* Copy dst mac address to wqe */
-		skb_reset_mac_header(skb);
-		ethh = eth_hdr(skb);
 		if (ethh && ethh->h_dest) {
 			mac = mlx4_en_mac_to_u64(ethh->h_dest);
 			mac_h = (u32) ((mac & 0xffff00000000ULL) >> 16);
@@ -790,7 +792,7 @@
 	if (likely(!skb_shared(skb)))
 		skb_orphan(skb);
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce && !vlan_tag) {
+	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce && !vlan_tag && !is_arp) {
 		*(u32 *) (&tx_desc->ctrl.vlan_tag) |= ring->doorbell_qpn;
 		op_own |= htonl((bf_index & 0xffff) << 8);
 		/* Ensure new descirptor hits memory
---

^ permalink raw reply

* RE: [RFC PATCH 1/2] RapidIO: Add DMA Engine support for RIO data transfers
From: Bounine, Alexandre @ 2011-10-07 19:08 UTC (permalink / raw)
  To: Vinod Koul; +Cc: akpm, linux-kernel, Dan, linuxppc-dev
In-Reply-To: <1317965252.1573.2238.camel@vkoul-udesk3>

Vmlub2QgS291bCB3cm90ZToNCj4gDQo+IE9uIE1vbiwgMjAxMS0xMC0wMyBhdCAwOTo1MiAtMDcw
MCwgQm91bmluZSwgQWxleGFuZHJlIHdyb3RlOg0KPiA+DQo+ID4gTXkgY29uY2VybiBoZXJlIGlz
IHRoYXQgb3RoZXIgc3Vic3lzdGVtcyBtYXkgdXNlL3JlcXVlc3QgRE1BX1NMQVZFIGNoYW5uZWwo
cykgYXMgd2VsbA0KPiA+IGFuZCB3cm9uZ2Z1bGx5IGFjcXVpcmUgb25lIHRoYXQgYmVsb25ncyB0
byBSYXBpZElPLiBJbiB0aGlzIGNhc2Ugc2VwYXJhdGlvbiB3aXRoIGFub3RoZXINCj4gPiBmbGFn
IG1heSBoYXZlIGEgc2Vuc2UgLSBpdCBpcyBwb3NzaWJsZSB0byBoYXZlIGEgc3lzdGVtIHRoYXQg
dXNlcyBSYXBpZElPDQo+ID4gYW5kIG90aGVyICJ0cmFkaXRpb25hbCIgRE1BIHNsYXZlIGNoYW5u
ZWwuDQo+IE5vcGUgdGhhdCB3aWxsIG5ldmVyIGhhcHBlbiBpbiBjdXJyZW50IGZvcm0uDQo+IEV2
ZXJ5IGNvbnRyb2xsZXIgZHJpdmVyIHRvZGF5ICJtYWdpY2FsbHkiIGVuc3VyZXMgdGhhdCBpdCBk
b2Vzbid0IGdldA0KPiBhbnkgb3RoZXIgZG1hIGNvbnRyb2xsZXJzIGNoYW5uZWwuIFdlIHVzZSBm
aWx0ZXIgZnVuY3Rpb24gZm9yIHRoYXQuDQo+IEFsdGhvdWdoIGl0IGlzIG5vdCBjbGVhbiB5ZXQg
YW5kIHdlIGFyZSB3b3JraW5nIHRvIGZpeCB0aGF0IGJ1dCB0aGF0J3MNCj4gYW5vdGhlciBkaXNj
dXNzaW9uLg0KPiBFdmVuIHNwZWNpZnlpbmcgcGxhaW4gRE1BX1NMQVZFIHNob3VsZCB3b3JrIGlm
IHlvdSBjb2RlIHlvdXIgZmlsdGVyDQo+IGZ1bmN0aW9uIHByb3Blcmx5IDopDQoNClJJTyBmaWx0
ZXIgY2hlY2tzIGZvciBETUEgZGV2aWNlIGFzc29jaWF0ZWQgd2l0aCBSYXBpZElPIG1wb3J0IG9i
amVjdC4NClRoaXMgc2hvdWxkIHdvcmsgcmVsaWFibGUgZnJvbSB0aGUgUmFwaWRJTyBzaWRlLiBJ
dCBhbHNvIHZlcmlmaWVzDQp0aGF0IHJldHVybmVkIERNQSBjaGFubmVsIGlzIGNhcGFibGUgdG8g
c2VydmljZSBjb3JyZXNwb25kaW5nIFJJTyBkZXZpY2UNCihpbiBzeXN0ZW0gdGhhdCBoYXMgbW9y
ZSB0aGFuIG9uZSBSSU8gY29udHJvbGxlcikuIA0KDQouLi4gc2tpcCAuLi4NCj4gPg0KPiA+IFNl
Y29uZCwgaGF2aW5nIGFiaWxpdHkgdG8gcGFzcyBwcml2YXRlIHRhcmdldCBpbmZvcm1hdGlvbiBh
bGxvd3MgbWUgdG8gcGFzcw0KPiA+IGluZm9ybWF0aW9uIGFib3V0IHJlbW90ZSB0YXJnZXQgZGV2
aWNlIG9uIHBlci10cmFuc2ZlciBiYXNpcy4NCj4gT2theSwgdGhlbiB3aHkgbm90IHBhc3MgdGhl
IGRtYSBhZGRyZXNzIGFuZCBtYWtlIHlvdXIgZG1hIGRyaXZlcg0KPiB0cmFuc3BhcmVudCAoaSBz
YXcgeW91IHBhc3NlZCBSSU8gYWRkcmVzcywgSUlSQyA2NCsyIGJpdHMpDQo+IEN1cnJlbnRseSB1
c2luZyBkbWFfc2xhdmVfY29uZmlnIHdlIHBhc3MgY2hhbm5lbCBzcGVjaWZpYyBpbmZvcm1hdGlv
biwNCj4gdGhpbmdzIGxpa2UgcGVyaXBoZXJhbCBhZGRyZXNzIGFuZCBjb25maWcgZG9uJ3QgY2hh
bmdlIHR5cGljYWxseQ0KPiBiZXR3ZWVuDQo+IHRyYW5zZmVycyBhbmQgaWYgeW91IGhhdmUgc29t
ZSBjb250cm9sbGVyIHNwZWNpZmljIHByb3BlcnRpZXMgeW91IGNhbg0KPiBwYXNzIHRoZW0gYnkg
ZW1iZWRkaW5nIGRtYV9zbGF2ZV9jb25maWcgaW4geW91ciBzcGVjaWZpYyBzdHJ1Y3R1cmUuDQo+
IFdvcnN0IGNhc2UsIHlvdSBjYW4gY29uZmlndXJlIHNsYXZlIGJlZm9yZSBldmVyeSBwcmVwYXJl
DQoNCkluIGFkZGl0aW9uIHRvIGFkZHJlc3Mgb24gdGFyZ2V0IFJJTyBkZXZpY2UgSSBuZWVkIHRv
IHBhc3MgY29ycmVzcG9uZGluZw0KZGV2aWNlIGRlc3RpbmF0aW9uIElELiBXaXRoIHNpbmdsZSBj
aGFubmVsIGNhcGFibGUgdG8gdHJhbnNmZXIgZGF0YSBiZXR3ZWVuDQpsb2NhbCBtZW1vcnkgYW5k
IGRpZmZlcmVudCBSYXBpZElPIGRldmljZXMgSSBoYXZlIHRvIHBhc3MgZGV2aWNlIHNwZWNpZmlj
DQppbmZvcm1hdGlvbiBvbiBwZXIgdHJhbnNmZXIgYmFzaXMgKGRlc3RJRCArIDY2LWJpdCBhZGRy
ZXNzICsgdHlwZSBvZiB3cml0ZSBvcHMsIGV0Yy4pLg0KDQpFdmVuIGhhdmluZyA4IGNoYW5uZWxz
IChlYWNoIHNldCBmb3Igc3BlY2lmaWMgdGFyZ2V0KSB3aWxsIG5vdCBoZWxwIG1lIHdpdGgNCmZ1
bGwgc3VwcG9ydCBvZiBSYXBpZElPIG5ldHdvcmsgd2hlcmUgSSBoYXZlIDgtIG9yIDE2LWJpdCBk
ZXN0SUQgKDI1NiBvciA2NEsNCmRldmljZXMgcmVzcGVjdGl2ZWx5KS4NCg0KUmFwaWRJTyBjb250
cm9sbGVyIGRldmljZSAoYW5kIGl0cyBETUEgY29tcG9uZW50KSBtYXkgcHJvdmlkZSBzZXJ2aWNl
cyB0bw0KbXVsdGlwbGUgZGV2aWNlIGRyaXZlcnMgd2hpY2ggc2VydmljZSBpbmRpdmlkdWFsIGRl
dmljZXMgb24gUmFwaWRJTyBuZXR3b3JrDQooc2ltaWxhciB0byBQQ0llIGhhdmluZyBtdWx0aXBs
ZSBwZXJpcGhlcmFscywgYnV0IG5vdCB1c2luZyBtZW1vcnkgbWFwcGVkDQptb2RlbCAtIGRlc3RJ
RCBkZWZpbmVzIHJvdXRlIHRvIGEgZGV2aWNlKS4gDQoNCkdlbmVyaWMgUmFwaWRJTyBjb250cm9s
bGVyIG1heSBoYXZlIG9ubHkgb25lIERNQSBjaGFubmVsIHdoaWNoIHNlcnZpY2VzIGFsbA0KdGFy
Z2V0IGRldmljZXMgZm9ybWluZyB0aGUgbmV0d29yay4gV2UgbWF5IGhhdmUgbXVsdGlwbGUgY29u
Y3VycmVudCBkYXRhDQp0cmFuc2ZlciByZXF1ZXN0cyBmb3IgZGlmZmVyZW50IGRldmljZXMuDQoN
ClBhcmFsbGVsIGRpc2N1c3Npb24gd2l0aCBEYW4gdG91Y2hlcyB0aGUgc2FtZSBwb3N0LWNvbmZp
ZyBhcHByb2FjaCBhbmQNCmFub3RoZXIgb3B0aW9uLiBJIGxpa2UgRGFuJ3MgaWRlYSBvZiBoYXZp
bmcgUklPLXNwZWNpZmljIHZlcnNpb24gb2YgcHJlcF9zZygpLg0KQXQgdGhlIHNhbWUgdGltZSBt
eSBjdXJyZW50IGltcGxlbWVudGF0aW9uIG9mIHJpb19kbWFfcHJlcF9zbGF2ZV9zZygpIHdpdGgN
CmFkZGVkIGFwcHJvcHJpYXRlIGxvY2tpbmcgbWF5IGRvIGpvYiBhcyB3ZWxsIGFuZCBrZWVwcyBE
TUEgcGFydCB3aXRoaW4NCmV4aXN0aW5nIEFQSSAoRE1BX1JBUElESU8gcmVtb3ZlZCkuICAgICAg
DQoNCkFsZXguDQoNCg==

^ permalink raw reply

* RE: [RFC PATCH 1/2] RapidIO: Add DMA Engine support for RIO data transfers
From: Bounine, Alexandre @ 2011-10-07 16:12 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: Dave Jiang, Vinod Koul, linux-kernel, akpm, linuxppc-dev
In-Reply-To: <CABE8wwuB2HxpjWiOjASizL5L6LU2_AHhLZnAJyey__hh1jfX8Q@mail.gmail.com>

Dan J Williams wrote:
>=20
> On Mon, Oct 3, 2011 at 9:52 AM, Bounine, Alexandre
> <Alexandre.Bounine@idt.com> wrote:
> >
> > My concern here is that other subsystems may use/request DMA_SLAVE
channel(s) as well
> > and wrongfully acquire one that belongs to RapidIO. In this case
separation with another
> > flag may have a sense - it is possible to have a system that uses
RapidIO
> > and other "traditional" DMA slave channel.
> >
> > This is why I put that proposed interface for discussion instead of
keeping everything
> > inside of RapidIO.
> > If you think that situation above will not happen I will be happy to
remove
> > that subsystem knowledge from dmaengine files.
>=20
> I don't think that situation will happen, even on the same arch I
> don't think DMA_SLAVE is ready to be enabled generically.  So you're
> probably safe here.

Thank you for confirmation. I will rely on DMA_SLAVE only as in my most
recent
version.=20
=20
> >
> > I agree, this is not a case of "pure" slave transfers but existing
DMA_SLAVE
> > interface fits well into the RapidIO operations.
> >
> > First, we have only one memory mapped location on the host side. We
transfer
> > data to/from location that is not mapped into memory on the same
side.
> >
> > Second, having ability to pass private target information allows me
to pass
> > information about remote target device on per-transfer basis.
>=20
> ...but there is no expectation that these engines will be generically
> useful to other subsytems.  To be clear you are just using dmaengine
> as a match making service for your dma providers to clients, right?

Not only that. As an example I am offering other defined DMA slave mode
callbacks
in tsi721_dma driver. I think that RapidIO specific DMA channels should
follow
unified DMA engine interface. I am expecting that other DMA defined
functions
to be used directly without RapidIO specifics. E.g. tx_submit, tx_status
and
issue_pending are implemented in tsi721_dma driver. Similar approach may
be
applied to fsl_dma driver which is capable to RapidIO data transfers on
platforms
with RapidIO interface.=20

> > ... skip ...
> > RapidIO network usually has more than one device attached to it and
> > single DMA channel may service data transfers to/from several
devices.
> > In this case device information should be passed on per-transfer
basis.
> >
>=20
> You could maybe do what async_tx does and just apply the extra context
> after the ->prep(), but before ->submit(), but it looks like that
> context is critical to setting up the operation.

Yes, it is possible to do but does not look as quite safe and effective
as passing all related parameters during single call. This would require
RIO DMA drivers to hold a "half cooked" descriptor chain until next
portion
of information arrives. This requires to have prep and post-configure
calls
coupled together by locks. In this situation prep with private data
looks
safer and more effective.   =20

> This looks pretty dangerous without knowing the other details.  What
> prevents another thread from changing dchan->private before the the
> prep routine reads it?

Yes, locking is needed in rio_dma_prep_slave_sg() around a call for prep
routine. After all internal descriptors are set dchan can be submitted
with new private content. =20

>=20
> DMA_SLAVE assumes a static relationship between dma device and
> slave-device, instead this rapid-io case is a per-operation slave
> context.  It sounds like you really do want a new dma operation type
> that is just an extra-parameter version of the current
> ->device_prep_slave_sg.  But now we're getting into to
> dma_transaction_type proliferation again.  This is probably more fuel
> for the fire of creating a structure transfer template that defines
> multiple possible operation types and clients just fill in the fields
> that they need, rather than adding new operation types for every
> possible permutation of copy operation (DMA_SLAVE, DMA_MEMCPY,
> DMA_CYCLIC, DMA_SG, DMA_INTERLEAVE, DMA_RAPIDIO), it's getting to be a
> bit much.

Exactly. I need an ability of passing private parameter to prep
function.
I chose to adopt DMA engine interface instead of adding one completely
internal to RapidIO because having one common API has much more sense.
Passing user defined data structure when building individual request
would
make not only my life easier but probably will help some other drivers
as well.  =20
=20
> As a starting point, since this the first driver proposal to have
> per-operation slave context and there are other rapid-io specific
> considerations, maybe it's ok to have a rio_dma_prep_slave_sg() that
> does something like:
>=20
> struct tsi721_bdma_chan *bdma_chan =3D to_tsi721_chan(dchan);
>=20
> bdma_chan->prep_sg(rext, sgl, sg_len, direction, flags);
>=20
> Thoughts?

This brings HW dependence into generic RapidIO layer.
I would like to see it as a generic interface that is available
to other RIO-capable devices (e.g Freescale's 85xx and QorIQ).

I would probably make  ->prep_sg callback part of rio_mport structure
(which holds dma_device). In this case HW will be well abstracted.
The only problem will be with registering with DMA engine but this can
be solved by providing a pointer to a stub routine.

Second option may be keeping rio_dma_prep_slave_sg() "as is" but with an
appropriate
locking as I mentioned above (maybe remove "slave" from its name to
reduce confusion).

Thank you,

Alex.

^ permalink raw reply

* Re: [linuxppc-dev] powerpc/64e: External Proxy interrupt support
From: Tudor Laurentiu @ 2011-10-07 14:27 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <20110802194449.GA26318@schlenkerla.am.freescale.net>

Sorry about this. Please disregard.

---
Best Regards, Laurentiu

On 10/7/2011 5:15 PM, Laurentiu Tudor wrote:
> From: Scott Wood<scottwood@freescale.com>
>
> Adds support for External Proxy (a.k.a. CoreInt) interrupts on 64-bit
> kernels.  External Proxy combines interrupt delivery and
> acknowledgement, so simply returning from the interrupt without EOI
> or other action will not result in the interrupt being reasserted.
>
> When an external interrupt is deferred in this manner (whether
> external proxy is used or not), we set a flag in the PACA.  When we
> re-enable interrupts, either explicitly or as part of an exception
> return, we check the flag and branch to the interrupt exception
> vector as if hardware had delivered the interrupt.
>
> Another approach I considered was to use doorbells to replay the
> interrupt.  There are some problems with this:
>   - The timing of the actual delivery of the doorbell is undefined.
>     This means we can't be sure in an architected way that the
>     doorbell will happen before interrupts are again soft-disabled, at
>     which point (barring interrupt-controller specific actions such as
>     raising CTPR) we could take a higher priority interrupt and
>     overwrite the saved EPR.
>   - Doorbells have a lower priority than true external interrupt. This
>     means you could have a lower priority interrupt appear to preempt
>     a higher prio interrupt, once the higher priority interrupt
>     enables EE and the doorbell comes in.
>
> Signed-off-by: Scott Wood<scottwood@freescale.com>
>
> ---
> We need this patch to have reliable interrupts in hypervisor
> scenarios.
>
> arch/powerpc/include/asm/irq.h         |    2 +
>   arch/powerpc/include/asm/paca.h        |    4 ++
>   arch/powerpc/kernel/asm-offsets.c      |    3 +
>   arch/powerpc/kernel/entry_64.S         |    4 ++
>   arch/powerpc/kernel/exceptions-64e.S   |   94 +++++++++++++++++++++++++++-----
>   arch/powerpc/kernel/irq.c              |   11 ++++
>   arch/powerpc/platforms/85xx/p5020_ds.c |    5 --
>   7 files changed, 103 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
> index c57a28e..c0a45e7 100644
> --- a/arch/powerpc/include/asm/irq.h
> +++ b/arch/powerpc/include/asm/irq.h
> @@ -332,5 +332,7 @@ extern void do_IRQ(struct pt_regs *regs);
>
>   int irq_choose_cpu(const struct cpumask *mask);
>
> +void deliver_pending_irq(void);
> +
>   #endif /* _ASM_IRQ_H */
>   #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index c1f65f5..e5af3e3 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -134,6 +134,10 @@ struct paca_struct {
>   	u8 hard_enabled;		/* set if irqs are enabled in MSR */
>   	u8 io_sync;			/* writel() needs spin_unlock sync */
>   	u8 irq_work_pending;		/* IRQ_WORK interrupt while soft-disable */
> +#ifdef CONFIG_PPC_BOOK3E
> +	/* an irq is pending while soft-disabled */
> +	u8 irq_pending;
> +#endif
>
>   	/* Stuff for accurate time accounting */
>   	u64 user_time;			/* accumulated usermode TB ticks */
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index c98144f..5082ee7 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -206,6 +206,9 @@ int main(void)
>   	DEFINE(SVCPU_SLB, offsetof(struct kvmppc_book3s_shadow_vcpu, slb));
>   	DEFINE(SVCPU_SLB_MAX, offsetof(struct kvmppc_book3s_shadow_vcpu, slb_max));
>   #endif
> +#ifdef CONFIG_PPC_BOOK3E
> +	DEFINE(PACA_IRQ_PENDING, offsetof(struct paca_struct, irq_pending));
> +#endif
>   #endif /* CONFIG_PPC64 */
>
>   	/* RTAS */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index d834425..c1d8eea 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -596,6 +596,9 @@ _GLOBAL(ret_from_except_lite)
>   restore:
>   BEGIN_FW_FTR_SECTION
>   	ld	r5,SOFTE(r1)
> +#ifdef CONFIG_PPC_BOOK3E
> +	lbz	r6,PACA_IRQ_PENDING(r13)
> +#endif
>   FW_FTR_SECTION_ELSE
>   	b	.Liseries_check_pending_irqs
>   ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
> @@ -608,6 +611,7 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
>   	stb	r4,PACAHARDIRQEN(r13)
>
>   #ifdef CONFIG_PPC_BOOK3E
> +	/* consumes r3-r6 */
>   	b	.exception_return_book3e
>   #else
>   	ld	r4,_CTR(r1)
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index 429983c..9886be9 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -2,6 +2,7 @@
>    *  Boot code and exception vectors for Book3E processors
>    *
>    *  Copyright (C) 2007 Ben. Herrenschmidt (benh@kernel.crashing.org), IBM Corp.
> + *  Copyright 2011 Freescale Semiconductor, Inc.
>    *
>    *  This program is free software; you can redistribute it and/or
>    *  modify it under the terms of the GNU General Public License
> @@ -125,6 +126,10 @@
>   	cmpwi	cr0,r11,0;		/* yes ->  go out of line */	    \
>   	beq	masked_doorbell_book3e
>
> +#define PROLOG_ADDITION_EXTIRQ_GEN					    \
> +	lbz	r11,PACASOFTIRQEN(r13); /* are irqs soft-disabled ? */	    \
> +	cmpwi	cr0,r11,0;		/* yes ->  go out of line */	    \
> +	beq	masked_extirq_book3e
>
>   /* Core exception code for all exceptions except TLB misses.
>    * XXX: Needs to make SPRN_SPRG_GEN depend on exception type
> @@ -325,7 +330,13 @@ interrupt_end_book3e:
>   	b	storage_fault_common
>
>   /* External Input Interrupt */
> -	MASKABLE_EXCEPTION(0x500, external_input, .do_IRQ, ACK_NONE)
> +	START_EXCEPTION(external_input)
> +	NORMAL_EXCEPTION_PROLOG(0x500, PROLOG_ADDITION_EXTIRQ)
> +	EXCEPTION_COMMON(0x500, PACA_EXGEN, INTS_DISABLE_ALL)
> +	CHECK_NAPPING()
> +	addi	r3,r1,STACK_FRAME_OVERHEAD
> +	bl	.do_IRQ
> +	b	.ret_from_except_lite
>
>   /* Alignment */
>   	START_EXCEPTION(alignment);
> @@ -557,6 +568,12 @@ kernel_dbg_exc:
>    * An interrupt came in while soft-disabled; clear EE in SRR1,
>    * clear paca->hard_enabled and return.
>    */
> +masked_extirq_book3e:
> +	mtcr	r10
> +	li	r10,1
> +	stb	r10,PACA_IRQ_PENDING(r13)
> +	b	masked_interrupt_book3e_common
> +
>   masked_doorbell_book3e:
>   	mtcr	r10
>   	/* Resend the doorbell to fire again when ints enabled */
> @@ -618,20 +635,8 @@ alignment_more:
>   	bl	.alignment_exception
>   	b	.ret_from_except
>
> -/*
> - * We branch here from entry_64.S for the last stage of the exception
> - * return code path. MSR:EE is expected to be off at that point
> - */
> -_GLOBAL(exception_return_book3e)
> -	b	1f
> -
> -/* This is the return from load_up_fpu fast path which could do with
> - * less GPR restores in fact, but for now we have a single return path
> - */
> -	.globl fast_exception_return
> -fast_exception_return:
> -	wrteei	0
> -1:	mr	r0,r13
> +.macro exception_restore
> +	mr	r0,r13
>   	ld	r10,_MSR(r1)
>   	REST_4GPRS(2, r1)
>   	andi.	r6,r10,MSR_PR
> @@ -667,8 +672,67 @@ fast_exception_return:
>   	ld	r10,PACA_EXGEN+EX_R10(r13)
>   	ld	r11,PACA_EXGEN+EX_R11(r13)
>   	mfspr	r13,SPRN_SPRG_GEN_SCRATCH
> +.endm
> +
> +/*
> + * We branch here from entry_64.S for the last stage of the exception
> + * return code path. MSR:EE is expected to be off at that point
> + * r3 = MSR for return context
> + * r4 = hard irq-enable status for return context
> + * r5 = soft irq-enable status for return context
> + * r6 = irq pending flag
> + */
> +_GLOBAL(exception_return_book3e)
> +	cmpwi	r6,0
> +	beq	common_exception_return
> +
> +/*
> + * There's an interrupt pending.  If we're returning to a context that
> + * is soft-irq-enabled, we need to deliver the interrupt now.
> + *
> + * We should never get here with soft IRQs enabled but hard IRQs disabled,
> + * but just to be sure, check that too.
> + */
> +	cmpwi	r5,0
> +	beq	common_exception_return
> +	cmpwi	r4,0
> +	beq	common_exception_return
> +
> +	lis	r5,(MSR_CE | MSR_ME | MSR_DE)@h
> +	li	r4,0
> +	ori	r5,r5,(MSR_CE | MSR_ME | MSR_DE)@l
> +	stb	r4,PACA_IRQ_PENDING(r13)
> +	and	r5,r5,r3
> +	oris	r5,r5,MSR_CM@h
> +	mtmsr	r5
> +
> +	exception_restore
> +	b	exc_external_input_book3e
> +
> +/* This is the return from load_up_fpu fast path which could do with
> + * less GPR restores in fact, but for now we have a single return path
> + */
> +	.globl fast_exception_return
> +fast_exception_return:
> +	wrteei	0
> +common_exception_return:
> +	exception_restore
>   	rfi
>
> +/* Called from arch_local_irq_restore() prior to hard-enabling interrupts */
> +_GLOBAL(deliver_pending_irq)
> +	mflr	r3
> +	mfmsr	r4
> +	lis	r5,(MSR_CM | MSR_CE | MSR_ME | MSR_DE)@h
> +	ori	r5,r5,(MSR_CM | MSR_CE | MSR_ME | MSR_DE)@l
> +	and	r5,r5,r4
> +	ori	r4,r4,MSR_EE
> +
> +	mtspr	SPRN_SRR0,r3
> +	mtspr	SPRN_SRR1,r4
> +	mtmsr	r5
> +	b	exc_external_input_book3e
> +
>   /*
>    * Trampolines used when spotting a bad kernel stack pointer in
>    * the exception entry code.
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index d281fb6..44a23d0 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -184,6 +184,17 @@ notrace void arch_local_irq_restore(unsigned long en)
>   		lv1_get_version_info(&tmp);
>   	}
>
> +#ifdef CONFIG_PPC_BOOK3E
> +	/*
> +	 * If there's a pending IRQ, deliver it now.  Interrupts
> +	 * will be hard-enabled on return.
> +	 */
> +	if (get_paca()->irq_pending) {
> +		get_paca()->irq_pending = 0;
> +		deliver_pending_irq();
> +	}
> +#endif
> +
>   	__hard_irq_enable();
>   }
>   EXPORT_SYMBOL(arch_local_irq_restore);
> diff --git a/arch/powerpc/platforms/85xx/p5020_ds.c b/arch/powerpc/platforms/85xx/p5020_ds.c
> index e8cba50..87e7d29 100644
> --- a/arch/powerpc/platforms/85xx/p5020_ds.c
> +++ b/arch/powerpc/platforms/85xx/p5020_ds.c
> @@ -76,12 +76,7 @@ define_machine(p5020_ds) {
>   #ifdef CONFIG_PCI
>   	.pcibios_fixup_bus	= fsl_pcibios_fixup_bus,
>   #endif
> -/* coreint doesn't play nice with lazy EE, use legacy mpic for now */
> -#ifdef CONFIG_PPC64
> -	.get_irq		= mpic_get_irq,
> -#else
>   	.get_irq		= mpic_get_coreint_irq,
> -#endif
>   	.restart		= fsl_rstcr_restart,
>   	.calibrate_decr		= generic_calibrate_decr,
>   	.progress		= udbg_progress,
> _______________________________________________
> linuxppc-dev mailing list
> linuxppc-dev@linux.freescale.net
> http://linux.freescale.net/mailman/listinfo/linuxppc-dev

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox