linux-pci.vger.kernel.org archive mirror
* Performance issues writing to PCIe in a Zynq
@ 2017-11-03  7:10 Ruben Guerra Marin
  0 siblings, 0 replies; 7+ messages in thread
From: Ruben Guerra Marin @ 2017-11-03  7:10 UTC (permalink / raw)
  To: linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org

Hi,

I have a Zynq board running PetaLinux, connected through PCIe to a
Virtex UltraScale board. I configured the UltraScale for Tandem PCIe,
where the second-stage bitstream is programmed from the Zynq board
(I cross-compiled the mcap application that Xilinx provides).

This works perfectly, but it takes around ~12 seconds to program the
second-stage bitstream (~12 MB compressed), which is quite slow. We
also tried debugging the mcap application and pciutils, and found the
operation that takes long to execute: in pciutils, the call that
actually writes to the driver (pwrite) takes approximately 6 us, so
if you add that up for 12 MB you can see why it takes so long. Why is
this so slow? Is this maybe a problem with the driver?

For testing, I added an ILA to the AXI bus between the Zynq GP1 port
and the PCIe IP control registers port, and triggered it halfway
through programming the bitstream with the mcap program provided by
Xilinx. I can see that it is writing to address 0x358, which
according to the *datasheet*
(https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
is the Write Data Register, which is correct (and again, I know the
whole bitstream gets programmed correctly).

But what I also see is that from one "awvalid" assertion to the next
it takes 245 cycles, and I can imagine this is why it takes 12
seconds to program a 12 MB bitstream.

Thanks a lot,

Ruben Guerra Marin
ruben.guerra.marin@axon.tv
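
For context, the per-word flow described above, as a minimal libpci
sketch (an illustration, not the actual Xilinx mcap source; MCAP_DATA
is a hypothetical placeholder, since the real register offset comes
from the device's vendor-specific extended capability):

    #include <pci/pci.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical offset of the MCAP Write Data register relative to
     * the MCAP extended capability; the real value must be located in
     * the device's capability list. */
    #define MCAP_DATA 0x10

    /* One libpci call per 32-bit word; each call ends in one pwrite()
     * on the config-space file, i.e. one syscall plus one serialized
     * config access per dword. */
    static void program_bitstream(struct pci_dev *dev, int cap_off,
                                  const uint32_t *words, size_t nwords)
    {
        size_t i;

        for (i = 0; i < nwords; i++)
            pci_write_long(dev, cap_off + MCAP_DATA, words[i]);
    }

At ~6 us per call, the ~3M calls needed for a 12 MB image add up to a
run time on exactly this order.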


* Re: Performance issues writing to PCIe in a Zynq
       [not found] <1509636637116.27702@axon.tv>
@ 2017-11-03  8:12 ` Michal Simek
  2017-11-03 13:54   ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Simek @ 2017-11-03  8:12 UTC (permalink / raw)
  To: Ruben Guerra Marin, bhelgaas@google.com, michal.simek@xilinx.com,
	soren.brinkmann@xilinx.com, bharat.kumar.gogada@xilinx.com,
	linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org

Hi,

On 2.11.2017 16:30, Ruben Guerra Marin wrote:
> Hi,
> 
> I have a Zynq board running PetaLinux, connected through PCIe to a
> Virtex UltraScale board. I configured the UltraScale for Tandem PCIe,
> where the second-stage bitstream is programmed from the Zynq board
> (I cross-compiled the mcap application that Xilinx provides).
> 
> This works perfectly, but it takes around ~12 seconds to program the
> second-stage bitstream (~12 MB compressed), which is quite slow. We
> also tried debugging the mcap application and pciutils, and found the
> operation that takes long to execute: in pciutils, the call that
> actually writes to the driver (pwrite) takes approximately 6 us, so
> if you add that up for 12 MB you can see why it takes so long. Why is
> this so slow? Is this maybe a problem with the driver?
> 
> For testing, I added an ILA to the AXI bus between the Zynq GP1 port
> and the PCIe IP control registers port, and triggered it halfway
> through programming the bitstream with the mcap program provided by
> Xilinx. I can see that it is writing to address 0x358, which
> according to the *datasheet*
> (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
> is the Write Data Register, which is correct (and again, I know the
> whole bitstream gets programmed correctly).
> 
> But what I also see is that from one "awvalid" assertion to the next
> it takes 245 cycles, and I can imagine this is why it takes 12
> seconds to program a 12 MB bitstream.

Bharat: Can you please take a look at this?

Thanks,
Michal




* Re: Performance issues writing to PCIe in a Zynq
  2017-11-03  8:12 ` Performance issues writing to PCIe in a Zynq Michal Simek
@ 2017-11-03 13:54   ` Bjorn Helgaas
  2017-11-06  8:51     ` Ruben Guerra Marin
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2017-11-03 13:54 UTC (permalink / raw)
  To: Michal Simek
  Cc: Ruben Guerra Marin, bhelgaas@google.com,
	soren.brinkmann@xilinx.com, bharat.kumar.gogada@xilinx.com,
	linux-pci@vger.kernel.org, linux-arm-kernel@lists.infradead.org

On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:
> On 2.11.2017 16:30, Ruben Guerra Marin wrote:
> > 
> > I have a Zynq board running PetaLinux, connected through PCIe
> > to a Virtex UltraScale board. I configured the UltraScale for
> > Tandem PCIe, where the second-stage bitstream is programmed
> > from the Zynq board (I cross-compiled the mcap application
> > that Xilinx provides).
> > 
> > This works perfectly, but it takes around ~12 seconds to
> > program the second-stage bitstream (~12 MB compressed), which
> > is quite slow. We also tried debugging the mcap application
> > and pciutils, and found the operation that takes long to
> > execute: in pciutils, the call that actually writes to the
> > driver (pwrite) takes approximately 6 us, so if you add that
> > up for 12 MB you can see why it takes so long. Why is this so
> > slow? Is this maybe a problem with the driver?
> > 
> > For testing, I added an ILA to the AXI bus between the Zynq
> > GP1 port and the PCIe IP control registers port, and triggered
> > it halfway through programming the bitstream with the mcap
> > program provided by Xilinx. I can see that it is writing to
> > address 0x358, which according to the *datasheet*
> > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
> > is the Write Data Register, which is correct (and again, I
> > know the whole bitstream gets programmed correctly).
> > 
> > But what I also see is that from one "awvalid" assertion to
> > the next it takes 245 cycles, and I can imagine this is why it
> > takes 12 seconds to program a 12 MB bitstream.

How long do you expect this to take?  What are the corresponding times
on other hardware or other OSes?

It sounds like this programming is done via config accesses, which are
definitely not a fast path, so I don't know if 12s is unreasonably
slow or not.  The largest config write is 4 bytes, which means 12MB
requires 3M writes, and if each takes 6us, that's 18s total.

Most platforms serialize config accesses with a lock, which also slows
things down.  It looks like your hardware might support ECAM, which
means you might be able to remove the locking overhead by using
lockless config (see CONFIG_PCI_LOCKLESS_CONFIG).  This is a new
feature currently only used by x86, and it's currently a system-wide
compile-time switch, so it would require some work for you to use it.

The high bandwidth way to do this would be use a BAR and do PCI memory
writes instead of PCI config writes.  Obviously the adapter determines
whether this is possible.

Bjorn
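
If the endpoint did expose such a write port in a BAR, the user-space
side could look roughly like the sketch below (illustrative only; the
device path and register offset are placeholders, and error handling
is omitted):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Stream the image through a memory-mapped BAR instead of config
     * space: no syscall and no config-access lock per dword. */
    static void program_via_bar(const uint32_t *words, size_t nwords)
    {
        /* Placeholder device address; resource0 corresponds to BAR0. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        size_t i;

        for (i = 0; i < nwords; i++)
            bar[0] = words[i];  /* hypothetical write-data port at offset 0 */

        munmap((void *)bar, 4096);
        close(fd);
    }

As noted, the adapter determines whether this is possible: the MCAP
port discussed in this thread lives in config space, so the core would
have to expose an equivalent port in a BAR.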


* Re: Performance issues writing to PCIe in a Zynq
  2017-11-03 13:54   ` Bjorn Helgaas
@ 2017-11-06  8:51     ` Ruben Guerra Marin
  2017-11-06 17:35       ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Ruben Guerra Marin @ 2017-11-06  8:51 UTC (permalink / raw)
  To: Bjorn Helgaas, Michal Simek
  Cc: bhelgaas@google.com, soren.brinkmann@xilinx.com,
	bharat.kumar.gogada@xilinx.com, linux-pci@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org

Hi,

According to Xilinx, from a host PC this takes about a second, while
for us on the Zynq (ARM) it takes far longer, as explained before. And
indeed the programming is done via config accesses, and can't happen
any other way, since this is how Xilinx designed its FPGA IP
(Intellectual Property) cores. Still, if I do a bare-metal test (so no
Linux) and write from the ARM cores to the FPGA via those registers,
it takes only 17 cycles, versus 250 cycles for the Linux
implementation.

Ruben Guerra Marin
ruben.guerra.marin@axon.tv

________________________________________
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Friday, November 3, 2017 2:54 PM
To: Michal Simek
Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org
Subject: Re: Performance issues writing to PCIe in a Zynq

On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:
> On 2.11.2017 16:30, Ruben Guerra Marin wrote:
> >
> > I have a Zynq board running PetaLinux, connected through PCIe
> > to a Virtex UltraScale board. I configured the UltraScale for
> > Tandem PCIe, where the second-stage bitstream is programmed
> > from the Zynq board (I cross-compiled the mcap application
> > that Xilinx provides).
> > 
> > This works perfectly, but it takes around ~12 seconds to
> > program the second-stage bitstream (~12 MB compressed), which
> > is quite slow. We also tried debugging the mcap application
> > and pciutils, and found the operation that takes long to
> > execute: in pciutils, the call that actually writes to the
> > driver (pwrite) takes approximately 6 us, so if you add that
> > up for 12 MB you can see why it takes so long. Why is this so
> > slow? Is this maybe a problem with the driver?
> > 
> > For testing, I added an ILA to the AXI bus between the Zynq
> > GP1 port and the PCIe IP control registers port, and triggered
> > it halfway through programming the bitstream with the mcap
> > program provided by Xilinx. I can see that it is writing to
> > address 0x358, which according to the *datasheet*
> > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
> > is the Write Data Register, which is correct (and again, I
> > know the whole bitstream gets programmed correctly).
> > 
> > But what I also see is that from one "awvalid" assertion to
> > the next it takes 245 cycles, and I can imagine this is why it
> > takes 12 seconds to program a 12 MB bitstream.

How long do you expect this to take?  What are the corresponding times
on other hardware or other OSes?

It sounds like this programming is done via config accesses, which are
definitely not a fast path, so I don't know if 12s is unreasonably
slow or not.  The largest config write is 4 bytes, which means 12MB
requires 3M writes, and if each takes 6us, that's 18s total.

Most platforms serialize config accesses with a lock, which also slows
things down.  It looks like your hardware might support ECAM, which
means you might be able to remove the locking overhead by using
lockless config (see CONFIG_PCI_LOCKLESS_CONFIG).  This is a new
feature currently only used by x86, and it's currently a system-wide
compile-time switch, so it would require some work for you to use it.

The high bandwidth way to do this would be use a BAR and do PCI memory
writes instead of PCI config writes.  Obviously the adapter determines
whether this is possible.

Bjorn


* Re: Performance issues writing to PCIe in a Zynq
  2017-11-06  8:51     ` Ruben Guerra Marin
@ 2017-11-06 17:35       ` Bjorn Helgaas
  2017-11-07  9:04         ` Ruben Guerra Marin
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2017-11-06 17:35 UTC (permalink / raw)
  To: Ruben Guerra Marin
  Cc: bharat.kumar.gogada@xilinx.com, linux-pci@vger.kernel.org,
	Michal Simek, soren.brinkmann@xilinx.com, bhelgaas@google.com,
	linux-arm-kernel@lists.infradead.org

On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:
> Hi,
>
> According to Xilinx, from a host PC this takes about a second,
> while for us on the Zynq (ARM) it takes far longer, as explained
> before. And indeed the programming is done via config accesses,
> and can't happen any other way, since this is how Xilinx designed
> its FPGA IP (Intellectual Property) cores. Still, if I do a
> bare-metal test (so no Linux) and write from the ARM cores to the
> FPGA via those registers, it takes only 17 cycles, versus 250
> cycles for the Linux implementation.

Which PCI host bridge driver are you using?

If you're using drivers/pci/host/pcie-xilinx.c, it uses
pci_generic_config_write(), which doesn't contain anything that looks
obviously expensive (except for the __iowmb() inside writeq(), and
it's hard to do much about that).

There is the pci_lock that's acquired in pci_bus_read_config_dword()
(see drivers/pci/access.c).  As I mentioned before, it looks like
xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to
get rid of that lock.  Not sure how much that would buy you since
there's probably no contention, but you could experiment with it.
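
Roughly, that serialization looks like this (a paraphrase of the
PCI_OP_WRITE() pattern in drivers/pci/access.c, not a verbatim copy):

    /* Every config write on a bus takes one global raw spinlock,
     * pci_lock, around the host bridge's accessor. */
    int pci_bus_write_config_dword(struct pci_bus *bus, unsigned int devfn,
                                   int pos, u32 value)
    {
        unsigned long flags;
        int res;

        raw_spin_lock_irqsave(&pci_lock, flags);
        res = bus->ops->write(bus, devfn, pos, 4, value);
        raw_spin_unlock_irqrestore(&pci_lock, flags);
        return res;
    }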

It sounds like you have some user-level code involved, too; I don't
know anything about what that code might be doing.

Can you do any profiling to figure out where the time is going?

> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Friday, November 3, 2017 2:54 PM
> To: Michal Simek
> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> Subject: Re: Performance issues writing to PCIe in a Zynq
> 
> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:
> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:
> > >
> > > I have a Zynq board running PetaLinux, connected through PCIe
> > > to a Virtex UltraScale board. I configured the UltraScale for
> > > Tandem PCIe, where the second-stage bitstream is programmed
> > > from the Zynq board (I cross-compiled the mcap application
> > > that Xilinx provides).
> > >
> > > This works perfectly, but it takes around ~12 seconds to
> > > program the second-stage bitstream (~12 MB compressed), which
> > > is quite slow. We also tried debugging the mcap application
> > > and pciutils, and found the operation that takes long to
> > > execute: in pciutils, the call that actually writes to the
> > > driver (pwrite) takes approximately 6 us, so if you add that
> > > up for 12 MB you can see why it takes so long. Why is this so
> > > slow? Is this maybe a problem with the driver?
> > >
> > > For testing, I added an ILA to the AXI bus between the Zynq
> > > GP1 port and the PCIe IP control registers port, and triggered
> > > it halfway through programming the bitstream with the mcap
> > > program provided by Xilinx. I can see that it is writing to
> > > address 0x358, which according to the *datasheet*
> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
> > > is the Write Data Register, which is correct (and again, I
> > > know the whole bitstream gets programmed correctly).
> > >
> > > But what I also see is that from one "awvalid" assertion to
> > > the next it takes 245 cycles, and I can imagine this is why it
> > > takes 12 seconds to program a 12 MB bitstream.
> 
> How long do you expect this to take?  What are the corresponding times
> on other hardware or other OSes?
> 
> It sounds like this programming is done via config accesses, which are
> definitely not a fast path, so I don't know if 12s is unreasonably
> slow or not.  The largest config write is 4 bytes, which means 12MB
> requires 3M writes, and if each takes 6us, that's 18s total.
> 
> Most platforms serialize config accesses with a lock, which also slows
> things down.  It looks like your hardware might support ECAM, which
> means you might be able to remove the locking overhead by using
> lockless config (see CONFIG_PCI_LOCKLESS_CONFIG).  This is a new
> feature currently only used by x86, and it's currently a system-wide
> compile-time switch, so it would require some work for you to use it.
> 
> The high bandwidth way to do this would be use a BAR and do PCI memory
> writes instead of PCI config writes.  Obviously the adapter determines
> whether this is possible.
> 
> Bjorn



* Re: Performance issues writing to PCIe in a Zynq
  2017-11-06 17:35       ` Bjorn Helgaas
@ 2017-11-07  9:04         ` Ruben Guerra Marin
  2017-11-07 12:19           ` David Laight
  0 siblings, 1 reply; 7+ messages in thread
From: Ruben Guerra Marin @ 2017-11-07  9:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michal Simek, bhelgaas@google.com, soren.brinkmann@xilinx.com,
	bharat.kumar.gogada@xilinx.com, linux-pci@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org

> Which PCI host bridge driver are you using?
Indeed I am using drivers/pci/host/pcie-xilinx.c.

> Can you do any profiling to figure out where the time is going?
So I did some profiling on the mcap application and traced the
bottleneck to a write call into pciutils. I added some debugging to
pciutils and traced the bottleneck to a function called pwrite, which
pretty much does this:

syscall(SYS_pwrite, fd, buf, size, where);

So far this is as deep as I have been able to trace it.

Ruben Guerra Marin
ruben.guerra.marin@axon.tv
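
A minimal way to confirm that per-call cost from user space (a sketch,
not part of the mcap sources; fd is assumed to be an already-open PCI
config-space file):

    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    /* Time one 4-byte write through pwrite(), the same path the
     * pciutils call above ends up in.  Returns nanoseconds. */
    static long time_one_write(int fd, uint32_t word, off_t off)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pwrite(fd, &word, sizeof(word), off);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) * 1000000000L +
               (t1.tv_nsec - t0.tv_nsec);
    }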

________________________________________
From: Bjorn Helgaas <helgaas@kernel.org>
Sent: Monday, November 6, 2017 6:35 PM
To: Ruben Guerra Marin
Cc: Michal Simek; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org
Subject: Re: Performance issues writing to PCIe in a Zynq

On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:
> Hi,
>
> According to Xilinx, from a host PC this takes about a second,
> while for us on the Zynq (ARM) it takes far longer, as explained
> before. And indeed the programming is done via config accesses,
> and can't happen any other way, since this is how Xilinx designed
> its FPGA IP (Intellectual Property) cores. Still, if I do a
> bare-metal test (so no Linux) and write from the ARM cores to the
> FPGA via those registers, it takes only 17 cycles, versus 250
> cycles for the Linux implementation.

Which PCI host bridge driver are you using?

If you're using drivers/pci/host/pcie-xilinx.c, it uses
pci_generic_config_write(), which doesn't contain anything that looks
obviously expensive (except for the __iowmb() inside writeq(), and
it's hard to do much about that).

There is the pci_lock that's acquired in pci_bus_read_config_dword()
(see drivers/pci/access.c).  As I mentioned before, it looks like
xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to
get rid of that lock.  Not sure how much that would buy you since
there's probably no contention, but you could experiment with it.

It sounds like you have some user-level code involved, too; I don't
know anything about what that code might be doing.

Can you do any profiling to figure out where the time is going?

> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Friday, November 3, 2017 2:54 PM
> To: Michal Simek
> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.infradead.org
> Subject: Re: Performance issues writing to PCIe in a Zynq
>
> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:
> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:
> > >
> > > I have a Zynq board running PetaLinux, connected through PCIe
> > > to a Virtex UltraScale board. I configured the UltraScale for
> > > Tandem PCIe, where the second-stage bitstream is programmed
> > > from the Zynq board (I cross-compiled the mcap application
> > > that Xilinx provides).
> > >
> > > This works perfectly, but it takes around ~12 seconds to
> > > program the second-stage bitstream (~12 MB compressed), which
> > > is quite slow. We also tried debugging the mcap application
> > > and pciutils, and found the operation that takes long to
> > > execute: in pciutils, the call that actually writes to the
> > > driver (pwrite) takes approximately 6 us, so if you add that
> > > up for 12 MB you can see why it takes so long. Why is this so
> > > slow? Is this maybe a problem with the driver?
> > >
> > > For testing, I added an ILA to the AXI bus between the Zynq
> > > GP1 port and the PCIe IP control registers port, and triggered
> > > it halfway through programming the bitstream with the mcap
> > > program provided by Xilinx. I can see that it is writing to
> > > address 0x358, which according to the *datasheet*
> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
> > > is the Write Data Register, which is correct (and again, I
> > > know the whole bitstream gets programmed correctly).
> > >
> > > But what I also see is that from one "awvalid" assertion to
> > > the next it takes 245 cycles, and I can imagine this is why it
> > > takes 12 seconds to program a 12 MB bitstream.
>
> How long do you expect this to take?  What are the corresponding times
> on other hardware or other OSes?
>
> It sounds like this programming is done via config accesses, which are
> definitely not a fast path, so I don't know if 12s is unreasonably
> slow or not.  The largest config write is 4 bytes, which means 12MB
> requires 3M writes, and if each takes 6us, that's 18s total.
>
> Most platforms serialize config accesses with a lock, which also slows
> things down.  It looks like your hardware might support ECAM, which
> means you might be able to remove the locking overhead by using
> lockless config (see CONFIG_PCI_LOCKLESS_CONFIG).  This is a new
> feature currently only used by x86, and it's currently a system-wide
> compile-time switch, so it would require some work for you to use it.
>
> The high bandwidth way to do this would be use a BAR and do PCI memory
> writes instead of PCI config writes.  Obviously the adapter determines
> whether this is possible.
>
> Bjorn


* RE: Performance issues writing to PCIe in a Zynq
  2017-11-07  9:04         ` Ruben Guerra Marin
@ 2017-11-07 12:19           ` David Laight
  0 siblings, 0 replies; 7+ messages in thread
From: David Laight @ 2017-11-07 12:19 UTC (permalink / raw)
  To: 'Ruben Guerra Marin', Bjorn Helgaas
  Cc: Michal Simek, bhelgaas@google.com, soren.brinkmann@xilinx.com,
	bharat.kumar.gogada@xilinx.com, linux-pci@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org

From: Ruben Guerra Marin
> Sent: 07 November 2017 09:04
> > Which PCI host bridge driver are you using?
> Indeed I am using drivers/pci/host/pcie-xilinx.c
>
> > Can you do any profiling to figure out where the time is going?
> So I did some profiling on the mcap application and traced the
> bottleneck to a write call into pciutils. I added some debugging to
> pciutils and traced the bottleneck to a function called pwrite,
> which pretty much does this:
>
> syscall(SYS_pwrite, fd, buf, size, where);
>
> So far this is as deep as I have been able to trace it.

Are you doing one pwrite() for each 32-bit word?
If you are trying to write a large fpga image you'll need to use
a single system call.
I'd suspect that needs a special driver....

I'd also check that the driver side isn't doing single-word
copy_from_user().  That can be completely horrid in recent kernels
because of userspace copy hardening.

	David
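
The contrast David describes, sketched under the assumption that fd is
the device's config-space file (hypothetical helper names; as he
notes, making the bulk variant behave correctly for a streaming data
register would need a dedicated driver):

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Per-word programming: one syscall, plus the kernel's config
     * locking, for every 32-bit word -- ~3M syscalls for 12 MB. */
    static void write_per_word(int fd, const uint32_t *w, size_t n, off_t off)
    {
        size_t i;

        for (i = 0; i < n; i++)
            pwrite(fd, &w[i], sizeof(w[i]), off);
    }

    /* Bulk variant: one syscall for the whole buffer, leaving the
     * per-word register writes to the kernel side.  The generic
     * config file advances the offset as it writes, so repeatedly
     * hitting one data register this way requires driver support. */
    static ssize_t write_bulk(int fd, const uint32_t *w, size_t n, off_t off)
    {
        return pwrite(fd, w, n * sizeof(*w), off);
    }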


Thread overview: 7+ messages
     [not found] <1509636637116.27702@axon.tv>
2017-11-03  8:12 ` Performance issues writing to PCIe in a Zynq Michal Simek
2017-11-03 13:54   ` Bjorn Helgaas
2017-11-06  8:51     ` Ruben Guerra Marin
2017-11-06 17:35       ` Bjorn Helgaas
2017-11-07  9:04         ` Ruben Guerra Marin
2017-11-07 12:19           ` David Laight
2017-11-03  7:10 Ruben Guerra Marin
