linuxppc-dev.lists.ozlabs.org archive mirror
* PCIe Access - achieve bursts without DMA
@ 2014-01-30 12:20 Moese, Michael
  2014-01-30 14:19 ` David Laight
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Moese, Michael @ 2014-01-30 12:20 UTC (permalink / raw)
  To: linuxppc-dev@lists.ozlabs.org

Hello PPC-developers,
I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
located inside our FPGA. On x86-based systems I was able to achieve bursts for
both read and write access. On PPC32, using an e500v2, I had no success at all
so far.
I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my
writes just being single requests, one after another.
For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap()
here.
I used several ways to read from the device, from simple readl(), memcpy_fromio(),
memcpy() to cacheable_memcpy() - with no improvements. Even just issuing
a batch of prefetch() calls for all the memory to read did not result in read bursts.

I only get really poor results: writing is possible at around 40 MiByte/s, whereas I
can read at only about 3 MiByte/s.
After hours of studying the reference manual from Freescale, looking into other code
and searching the web, I'm close to giving up.

Maybe someone of you has some more directions for me, I'd appreciate every hint
that leads me to my problem's solution - maybe I just missed something or lack
knowledge about this architecture in general.

Thanks for your reading.


Michael


* RE: PCIe Access - achieve bursts without DMA
  2014-01-30 12:20 PCIe Access - achieve bursts without DMA Moese, Michael
@ 2014-01-30 14:19 ` David Laight
  2014-01-31 12:31 ` Gabriel Paubert
  2014-01-31 22:53 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2014-01-30 14:19 UTC (permalink / raw)
  To: 'Moese, Michael', linuxppc-dev@lists.ozlabs.org

From: Moese, Michael
> Hello PPC-developers,
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
> located inside our FPGA. On x86-based systems I was able to achieve bursts for
> both read and write access. On PPC32, using an e500v2, I had no success at all
> so far.

I'm not sure that you can.
I had to write a simple driver for the PCIe CSB bridge DMA on an 83xx PPC.
I think that might be the one in the e500v2.

I don't know how fast 'normal' PCIe slaves are, but we were accessing
an Altera FPGA and the latency was pedestrian at best.
I think an ISA bus can run faster!
With moderate-length transfers, the throughput was more than adequate.

	David


* Re: PCIe Access - achieve bursts without DMA
  2014-01-30 12:20 PCIe Access - achieve bursts without DMA Moese, Michael
  2014-01-30 14:19 ` David Laight
@ 2014-01-31 12:31 ` Gabriel Paubert
  2014-01-31 22:53 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 10+ messages in thread
From: Gabriel Paubert @ 2014-01-31 12:31 UTC (permalink / raw)
  To: Moese, Michael; +Cc: linuxppc-dev@lists.ozlabs.org

On Thu, Jan 30, 2014 at 12:20:21PM +0000, Moese, Michael wrote:
> Hello PPC-developers,
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
> located inside our FPGA. On x86-based systems I was able to achieve bursts for
> both read and write access. On PPC32, using an e500v2, I had no success at all 
> so far. 
> I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my
> writes just being single requests, one after another.

I believe that on PPC, write-combine is directly mapped to nocache. I can't remember
if there is a writethrough option for ioremap (but adding it would probably be
relatively easy).

> For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap()
> here. 

You might be able to use ioremap_cache() together with direct cache control instructions
(dcbf/dcbi) to achieve your goals. This becomes similar to handling machines with
no hardware cache coherency. You have to know the hardware cache line size to make
this work.
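
A minimal sketch of the cache-line walk this approach needs, written as
plain C so it can be tried in userspace. The 32-byte line size, the
function names and the callback are assumptions for illustration; in a
real driver the callback would issue dcbi or dcbf on each line address,
and the line size must come from the core manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache line size; check the core manual for the real value. */
#define CACHE_LINE_SIZE 32u

/* Hypothetical hook: in a kernel driver this would issue dcbi (or dcbf)
 * on the given address; here it is a callback so the walk can be tested. */
typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/* Apply op once to every cache line overlapping [start, start + len).
 * Returns the number of lines touched. */
size_t for_each_cache_line(uintptr_t start, size_t len,
                           line_op_fn op, void *ctx)
{
    size_t count = 0;
    uintptr_t line, end;

    /* The mask arithmetic below needs a power-of-two line size. */
    assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0);
    if (len == 0)
        return 0;

    line = start & ~(uintptr_t)(CACHE_LINE_SIZE - 1); /* round down */
    end = start + len;                                /* one past last byte */
    for (; line < end; line += CACHE_LINE_SIZE) {
        op(line, ctx);
        count++;
    }
    return count;
}

/* Example op for userspace experiments: just counts how often it is called. */
void count_line(uintptr_t line_addr, void *ctx)
{
    (void)line_addr;
    (*(size_t *)ctx)++;
}
```

A read of a buffer that is not line-aligned touches one more line than
its length suggests, which is why the start address is rounded down
before the loop.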

That said, it might be better to mark the memory as unguarded and non-coherent
(WIMG=0000). I don't know what ioremap_cache does for the MG bits and don't
have the time to look it up right now.
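
For reference, the four WIMG bits can be decoded as in the sketch below.
The struct and function names are hypothetical; the bit order simply
follows the way the value is written here, W first and G last. With
that reading, 0000 is a write-back cacheable, non-coherent mapping with
the guard bit clear, and a write-through mapping would set W.

```c
#include <assert.h>
#include <stdbool.h>

/* The four storage attributes, in the order the name WIMG spells them. */
struct wimg_attrs {
    bool write_through;   /* W */
    bool cache_inhibited; /* I */
    bool coherent;        /* M: memory coherence required */
    bool guarded;         /* G */
};

/* Decode a 4-bit value written W first and G last, so binary 0101
 * means I and G are set. (Hypothetical helper, not a kernel API.) */
struct wimg_attrs wimg_decode(unsigned wimg)
{
    struct wimg_attrs a;

    assert(wimg <= 0xFu);
    a.write_through   = (wimg & 0x8u) != 0;
    a.cache_inhibited = (wimg & 0x4u) != 0;
    a.coherent        = (wimg & 0x2u) != 0;
    a.guarded         = (wimg & 0x1u) != 0;
    return a;
}
```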

> I used several ways to read from the device, from simple readl(),memcpy_from_io(), 
> memcpy()  to cacheable_memcpy() - with no improvements.  Even when just issuing
> a batch of prefetch()-calls for all the memory to read did not result in read bursts.

If the device data you want to read is supposed to be cacheable (which means basically
that the data does not change unexpectedly under you, i.e., is not as volatile as
a typical device I/O register), you don't want to use readl() which adds some
synchronization to the read.

Prefetch only works on writeback memory, maybe writethrough; expecting it to work on
cache-inhibited memory is contradictory.

	Regards,
	Gabriel


* Re: PCIe Access - achieve bursts without DMA
  2014-01-30 12:20 PCIe Access - achieve bursts without DMA Moese, Michael
  2014-01-30 14:19 ` David Laight
  2014-01-31 12:31 ` Gabriel Paubert
@ 2014-01-31 22:53 ` Benjamin Herrenschmidt
  2014-01-31 23:18   ` David Hawkins
  2 siblings, 1 reply; 10+ messages in thread
From: Benjamin Herrenschmidt @ 2014-01-31 22:53 UTC (permalink / raw)
  To: Moese, Michael; +Cc: linuxppc-dev@lists.ozlabs.org

On Thu, 2014-01-30 at 12:20 +0000, Moese, Michael wrote:
> Hello PPC-developers,
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
> located inside our FPGA. On x86-based systems I was able to achieve bursts for
> both read and write access. On PPC32, using an e500v2, I had no success at all 
> so far. 
> I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my
> writes just being single requests, one after another.

Hrm, ioremap_wc will give you a mapping without the G (guard) bit.
Whether that results in some store gathering or not on IOs depends on the
specific HW implementation; you'll have to check with the FSL folks on
that one. There could also be a chicken switch (HID bit or similar)
needed to enable that (there was on some earlier ppc32 chips).

Another thing you can try is to use FP register load/stores.

> For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap()
> here. 
> I used several ways to read from the device, from simple readl(),memcpy_from_io(), 
> memcpy()  to cacheable_memcpy() - with no improvements.  Even when just issuing
> a batch of prefetch()-calls for all the memory to read did not result in read bursts.
> 
> I only get really poor results, writing is possible with around 40 MiByte/s, whereas I  
> can read at about only 3 MiByte/s.
> After hours of studying the reference manual from freescale, looking into other code
> and searching the web, I'm close to resignation.
> 
> Maybe someone of you has some more directions for me, I'd appreciate every hint
> that leads me to my problem's solution - maybe I just missed something or lack 
> knowledge about this architecture in general.
> 
> Thanks for your reading.
> 
> 
> Michael


* Re: PCIe Access - achieve bursts without DMA
  2014-01-31 22:53 ` Benjamin Herrenschmidt
@ 2014-01-31 23:18   ` David Hawkins
  2014-02-03  8:20     ` Michael Moese
  0 siblings, 1 reply; 10+ messages in thread
From: David Hawkins @ 2014-01-31 23:18 UTC (permalink / raw)
  To: Moese, Michael; +Cc: linuxppc-dev@lists.ozlabs.org

Hi Michael,

>> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
>> located inside our FPGA. On x86-based systems I was able to achieve bursts for
>> both read and write access. On PPC32, using an e500v2, I had no success at all
>> so far.

Whenever I want to benchmark PCI/PCIe performance I do the
following tests:

1. Peripheral board DMA (board-to-board)

    Use two of your FPGA boards in a chassis and DMA between them.

    In a PCI system, you can put the cards on the same bus segment and
    then between a bridge and see how that affects things. In your case,
    the PCIe traffic will all be via the root-complex/switch, so
    you should get the same performance regardless of which PCIe slot
    you use.

    This is likely the "best you can do" as far as bursts go.

2. Peripheral board DMA to host memory.

    In this case I typically insmod a simple driver on the host that
    gives me a page of memory, and then DMA into and out of that
    memory, using the DMA controller on the peripheral.

3. Host (root complex) DMA.

    If your host has a DMA controller, then program it per (2).

As far as "verification" of your custom peripheral board FPGA IP is
concerned, if I were a customer and you had data for (1) and (2),
I'd be pretty happy (and couldn't care less about (2), since it's so
system dependent).

Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation
with Bus Functional Models showing what the optimal performance of
your IP was, and then how it nicely matches with the measurements
in (1). If you do not have a PCIe logic analyzer, both Xilinx and
Altera have Chipscope/SignalTap logic analyzers that can be used
for tracing traffic at the TLP layer inside the FPGA.

Just some thoughts ...

Cheers,
Dave


* Re: PCIe Access - achieve bursts without DMA
  2014-01-31 23:18   ` David Hawkins
@ 2014-02-03  8:20     ` Michael Moese
  2014-02-03 10:17       ` David Laight
  2014-02-03 17:08       ` David Hawkins
  0 siblings, 2 replies; 10+ messages in thread
From: Michael Moese @ 2014-02-03  8:20 UTC (permalink / raw)
  To: David Hawkins; +Cc: linuxppc-dev, Moese, Michael

On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
> 1. Peripheral board DMA (board-to-board)
> 2. Peripheral board DMA to host memory.
> 3. Host (root complex) DMA.
> 
> As far as "verification" of your custom peripheral board FPGA IP is
> concerned, if I was a customer, and you had data for (1) and (2),
> I'd be pretty happy (and could care less about (2), since its so
> system dependent).

Usually I would totally agree with you and try to implement the benchmark
using DMA transfers. Unfortunately, we have some boards and IP cores that
do not support DMA transfers, or the target system must not use DMA by
requirement, and as I have no influence on these, I had to investigate
how to improve my throughput.
I submitted an RFC patch earlier today, which allowed me to perform
PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
I got when using non-cached reads. However, I had to ioremap() my
memory, like Gabriel said, using a write-through configuration.

> Since its an FPGA-based IP. I'd also expect to see a PCIe simulation
> with Bus Functional Models showing what the optimal performance of
> your IP was, and then how it nicely matches with the measurements
> in (1). If you do not have a PCIe logic analyzer, both Xilinx and
> Altera have Chipscope/SignalTap logic analyzers that can be used
> for tracing traffic at the TLP layer inside the FPGA.

Of course our IP developers do simulation and analysis; we have PCI
and PCIe analyzers and all the other equipment one might need. However,
we've seen that not only on PowerPC but also on x86, performing real
bursts is not intuitive.


Thank you for your help - we might be satisfied with the achieved 
18 MB/s.


Michael


* RE: PCIe Access - achieve bursts without DMA
  2014-02-03  8:20     ` Michael Moese
@ 2014-02-03 10:17       ` David Laight
  2014-02-03 10:39         ` Michael Moese
  2014-02-03 17:08       ` David Hawkins
  1 sibling, 1 reply; 10+ messages in thread
From: David Laight @ 2014-02-03 10:17 UTC (permalink / raw)
  To: 'Michael Moese', David Hawkins; +Cc: linuxppc-dev@lists.ozlabs.org

From: Michael Moese
> Thank you for your help - we might be satisfied with the achieved
> 18 MB/s.

We achieved about twice that using the PEX dma controller.
I found the following comment I wrote:

/* Long transfer requests are cut into smaller DMA requests.
 * Each PCIe request can contain a maximum of 128 bytes, but the
 * dma engine can have multiple PCIe requests outstanding and this
 * speeds things up somewhat (50ns/byte with 128, 24ns/byte with 1024).
 * 1k is somewhere near the point of diminishing returns. */

Those times would include a system call.
The transfers were done through a simple driver that converted pread()
and pwrite() requests into accesses to the board's memory.
The non-DMA versions are just copy_to/from_user() directly between
the PCIe and user buffers.
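
To put the per-byte figures from the comment into the same units as the
rest of the thread: 50 ns/byte is 20 MB/s and 24 ns/byte is roughly
41.7 MB/s (taking 1 MB as 10^6 bytes). The sketch below does that
conversion and counts how many DMA requests a transfer is cut into; both
function names are hypothetical helpers, not part of the driver
described above.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a per-byte cost in nanoseconds into MB/s (1 MB = 10^6 bytes). */
double ns_per_byte_to_mb_per_s(double ns_per_byte)
{
    assert(ns_per_byte > 0.0);
    /* 1e9 ns/s divided by ns/byte gives bytes/s; divide by 1e6 for MB/s. */
    return 1e9 / ns_per_byte / 1e6;
}

/* Number of DMA requests needed when a transfer of len bytes is cut
 * into pieces of at most chunk bytes (rounding up). */
size_t dma_request_count(size_t len, size_t chunk)
{
    assert(chunk > 0);
    return (len + chunk - 1) / chunk;
}
```

With 1 KiB chunks, for example, a 4 KiB transfer becomes four DMA
requests and a transfer of 4 KiB plus one byte becomes five.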

Your 3MB/s for single word transfers is similar to what we saw.
Cycle times that make an ISA bus look fast.
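
As a rough check on what that cycle time is: assuming each single word
transfer moves 4 bytes (an assumption, the word size is not stated
here), 3 MB/s works out to about 1.3 microseconds per read. A small
sketch of the arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Time per access in nanoseconds, given throughput in MB/s
 * (1 MB = 10^6 bytes) and the number of bytes moved by each access. */
double ns_per_access(double mb_per_s, unsigned bytes_per_access)
{
    assert(mb_per_s > 0.0 && bytes_per_access > 0);
    /* bytes / (MB/s * 1e6 bytes/MB) is seconds; times 1e9 gives ns. */
    return (double)bytes_per_access * 1e3 / mb_per_s;
}
```

Calling ns_per_access(3.0, 4) gives roughly 1333 ns.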

	David


* Re: PCIe Access - achieve bursts without DMA
  2014-02-03 10:17       ` David Laight
@ 2014-02-03 10:39         ` Michael Moese
  2014-02-03 10:51           ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Michael Moese @ 2014-02-03 10:39 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins,
	'Michael Moese'

On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:

> We achieved about twice that using the PEX dma controller.

> Your 3MB/s for single word transfers is similar to what we saw.
> Cycle times that make an ISA bus look fast.

Indeed, this is really poor performance. I know we could achieve much
better performance using DMA, but we have several products where we simply
don't have DMA available - this requires searching for other paths.

My ioremap_wt() could help in these situations, at least increasing
performance for non-DMA operation to a not-that-bad level.

I don't know if other devices could benefit from this, but we surely
have several IPs that would; those are not yet upstreamed, we're
still working on this.

Michael


* RE: PCIe Access - achieve bursts without DMA
  2014-02-03 10:39         ` Michael Moese
@ 2014-02-03 10:51           ` David Laight
  0 siblings, 0 replies; 10+ messages in thread
From: David Laight @ 2014-02-03 10:51 UTC (permalink / raw)
  To: 'Michael Moese'; +Cc: linuxppc-dev@lists.ozlabs.org, David Hawkins

From: Michael Moese
> On Mon, Feb 03, 2014 at 10:17:43AM +0000, David Laight wrote:
>
> > We achieved about twice that using the PEX dma controller.
>
> > Your 3MB/s for single word transfers is similar to what we saw.
> > Cycle times that make an ISA bus look fast.
>
> Indeed, this is a really poor performance. I know we could achieve much
> more performance using DMA, we have several products where we simply
> don't have DMA available - this requires searching for other paths.

I got the host (ppc) to do a DMA, not the card. (This does need a
DMA controller that is adequately integrated with the PCIe logic.)
So it doesn't require any hardware changes.
I did have to design the software to minimise the number of single
memory transfers.

> My ioremap_wt() could help in these situations, at least increasing
> performance for non-DMA operation to a not-that-bad level.

I needed to do writes as well as reads - so I think I would have
needed to map PCIe space fully cached (rather than write-through).
The speed of back to back writes is better than reads (even if they don't
get combined) because the requests get 'posted' and overlap on the
PCIe bus.

Managing cached accesses does get tricky - you need to make sure that
both sides never have to write to the same cache line.
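
One way to check that constraint when laying out a shared region is to
test whether two byte ranges touch a common cache line. This is only a
sketch: the 32-byte line size and the function name are assumptions, and
the addresses are treated as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; check the core manual for the real value. */
#define CACHE_LINE_SIZE 32u

/* True if byte ranges [a, a + a_len) and [b, b + b_len) touch at least
 * one common cache line. Both lengths must be non-zero. */
bool regions_share_cache_line(uintptr_t a, size_t a_len,
                              uintptr_t b, size_t b_len)
{
    uintptr_t a_first, a_last, b_first, b_last;

    assert(a_len > 0 && b_len > 0);
    /* Index of the first and last line each range touches. */
    a_first = a / CACHE_LINE_SIZE;
    a_last  = (a + a_len - 1) / CACHE_LINE_SIZE;
    b_first = b / CACHE_LINE_SIZE;
    b_last  = (b + b_len - 1) / CACHE_LINE_SIZE;
    /* Two closed intervals of line indices overlap iff each one starts
     * no later than the other ends. */
    return a_first <= b_last && b_first <= a_last;
}
```

If the host-written and device-written areas each start on a line
boundary and are padded to a multiple of the line size, this check
returns false for them.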

> I don't know if other devices could benefit from this, but surely we
> got several IPs that would, but those were not yet upstreamed, we're
> still working on this.
>
> Michael
>


* Re: PCIe Access - achieve bursts without DMA
  2014-02-03  8:20     ` Michael Moese
  2014-02-03 10:17       ` David Laight
@ 2014-02-03 17:08       ` David Hawkins
  1 sibling, 0 replies; 10+ messages in thread
From: David Hawkins @ 2014-02-03 17:08 UTC (permalink / raw)
  To: Michael Moese; +Cc: linuxppc-dev

Hi Michael,

> On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
>> 1. Peripheral board DMA (board-to-board)
>> 2. Peripheral board DMA to host memory.
>> 3. Host (root complex) DMA.
>>
>> As far as "verification" of your custom peripheral board FPGA IP is
>> concerned, if I was a customer, and you had data for (1) and (2),
>> I'd be pretty happy (and could care less about (2), since its so
>> system dependent).
>
> Usually I would totally agree with you and try to implement the benchmark
> using DMA transfers Unfortunately, we have some boards and IP cores that
> do not support DMA transfers, or the target system must not do by a
> requirement, and as I have no influence on these, I had to investigate
> on how to improve my throughput.

Ah, I see, that does make your life difficult then.

> I've submitted a RFC Patch earlier today, which allowed me to perform
> PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
> I got when using non-cached reads. However, I had to ioremap() my
> memory, like Gabriel said, using write-thru configuration.

That sounds like a reasonable compromise.

Cheers,
Dave
