diff for duplicates of <1510045454475.74232@axon.tv>

diff --git a/a/1.txt b/N1/1.txt
index 5861c28..a9fcab0 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -1,114 +1,105 @@
-> Which PCI host bridge driver are you using?=0A=
-Indeed I am using drivers/pci/host/pcie-xilinx.c=0A=
-=0A=
-> Can you do any profiling to figure out where the time is going?=0A=
-So I did some profiling on the mcap application, and traced the bottleneck =
-in a write call to pciutils. I added some debug to pciutils and could trace=
- it that the bottleneck is in a function called pwrite, which pretty much d=
-oes this:=0A=
-=0A=
-syscall(SYS_pwrite, fd, buf, size, where);=0A=
-=0A=
-until now this is the last part I could trace it to.=0A=
-=0A=
-Ruben Guerra Marin=0A=
-ruben.guerra.marin@axon.tv=0A=
-=0A=
-________________________________________=0A=
-From: Bjorn Helgaas <helgaas@kernel.org>=0A=
-Sent: Monday, November 6, 2017 6:35 PM=0A=
-To: Ruben Guerra Marin=0A=
-Cc: Michal Simek; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.k=
-umar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.i=
-nfradead.org=0A=
-Subject: Re: Performance issues writing to PCIe in a Zynq=0A=
-=0A=
-On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:=0A=
-> Hi, according to Xilinx, from a computer host it happens in a=0A=
-> second, while for us in the Zynq (ARM) takes way more than that as=0A=
-> explained before. And indeed the programming is done via config=0A=
-> accesses, and can't happen otherwise as this is the way Xilinx=0A=
-> created its FPGA IP (Intellectual Property) cores. Still if I do a=0A=
-> baremetal test (so no Linux) and write from the ARMs to the FPGA via=0A=
-> those registers, it takes only 17 cycles instead of the Linux=0A=
-> implementation which takes 250 cycles.=0A=
-=0A=
-Which PCI host bridge driver are you using?=0A=
-=0A=
-If you're using drivers/pci/host/pcie-xilinx.c, it uses=0A=
-pci_generic_config_write(), which doesn't contain anything that looks=0A=
-obviously expensive (except for the __iowmb() inside writeq(), and=0A=
-it's hard to do much about that).=0A=
-=0A=
-There is the pci_lock that's acquired in pci_bus_read_config_dword()=0A=
-(see drivers/pci/access.c).  As I mentioned before, it looks like=0A=
-xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to=0A=
-get rid of that lock.  Not sure how much that would buy you since=0A=
-there's probably no contention, but you could experiment with it.=0A=
-=0A=
-It sounds like you have some user-level code involved, too; I don't=0A=
-know anything about what that code might be doing.=0A=
-=0A=
-Can you do any profiling to figure out where the time is going?=0A=
-=0A=
-> From: Bjorn Helgaas <helgaas@kernel.org>=0A=
-> Sent: Friday, November 3, 2017 2:54 PM=0A=
-> To: Michal Simek=0A=
-> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; =
-bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel=
-@lists.infradead.org=0A=
-> Subject: Re: Performance issues writing to PCIe in a Zynq=0A=
->=0A=
-> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:=0A=
-> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:=0A=
-> > >=0A=
-> > > I have the a Zynq board running petalinux, and it is connected=0A=
-> > > through PCIe to a Virtex Ultrascale board. I configured the=0A=
-> > > Ultrascale for Tandem PCIe, which the second stage bitstream is=0A=
-> > > being programmed from the Zynq board (I crossed compiled the mcap=0A=
-> > > application that Xilinx provides).=0A=
-> > >=0A=
-> > > This works perfectly, but takes around ~12 seconds to program the=0A=
-> > > second stage bitstream (compressed is ~12 MB), which is quite=0A=
-> > > slow. We also tried debugging the mcap application and pciutils.=0A=
-> > > We found out the operation that takes long to execute: In=0A=
-> > > pciutils, the instruction to actually call the write to the driver=0A=
-> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB=0A=
-> > > then you can see why it takes so long. Why is this so slow? Is=0A=
-> > > this maybe a problem with the driver?=0A=
-> > >=0A=
-> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1=0A=
-> > > and the PCIe IP control registers port. I triggered halfway the=0A=
-> > > programming of the bitstream using the mcap program provided by=0A=
-> > > Xilinx. I can see that it is writing to address x358, which=0A=
-> > > according to the *datasheet*=0A=
-> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_De=
-vices.pdf)=0A=
-> > > is the Write Data Register, which is correct (and again, I know=0A=
-> > > the whole bitstream gets programmed correctly).=0A=
-> > >=0A=
-> > > But what I also see is that a "awvalid" being asserted to the next=0A=
-> > > one it takes 245 cycles, and I can imagine this is why it takes 12=0A=
-> > > seconds to program a 12MB bitstream.=0A=
->=0A=
-> How long do you expect this to take?  What are the corresponding times=0A=
-> on other hardware or other OSes?=0A=
->=0A=
-> It sounds like this programming is done via config accesses, which are=0A=
-> definitely not a fast path, so I don't know if 12s is unreasonably=0A=
-> slow or not.  The largest config write is 4 bytes, which means 12MB=0A=
-> requires 3M writes, and if each takes 6us, that's 18s total.=0A=
->=0A=
-> Most platforms serialize config accesses with a lock, which also slows=0A=
-> things down.  It looks like your hardware might support ECAM, which=0A=
-> means you might be able to remove the locking overhead by using=0A=
-> lockless config (see     CONFIG_    PCI_LOCKLESS_CONFIG).  This is a new=
-=0A=
-> feature currently only used by x86, and it's currently a system-wide=0A=
-> compile-time switch, so it would require some work for you to use it.=0A=
->=0A=
-> The high bandwidth way to do this would be use a BAR and do PCI memory=0A=
-> writes instead of PCI config writes.  Obviously the adapter determines=0A=
-> whether this is possible.=0A=
->=0A=
-> Bjorn=0A=
+> Which PCI host bridge driver are you using?
+Indeed I am using drivers/pci/host/pcie-xilinx.c
+
+> Can you do any profiling to figure out where the time is going?
+So I did some profiling on the mcap application, and traced the bottleneck in a write call to pciutils. I added some debug to pciutils and could trace it that the bottleneck is in a function called pwrite, which pretty much does this:
+
+syscall(SYS_pwrite, fd, buf, size, where);
+
+until now this is the last part I could trace it to.
+
+Ruben Guerra Marin
+ruben.guerra.marin at axon.tv
+
+________________________________________
+From: Bjorn Helgaas <helgaas@kernel.org>
+Sent: Monday, November 6, 2017 6:35 PM
+To: Ruben Guerra Marin
+Cc: Michal Simek; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org
+Subject: Re: Performance issues writing to PCIe in a Zynq
+
+On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:
+> Hi, according to Xilinx, from a computer host it happens in a
+> second, while for us in the Zynq (ARM) takes way more than that as
+> explained before. And indeed the programming is done via config
+> accesses, and can't happen otherwise as this is the way Xilinx
+> created its FPGA IP (Intellectual Property) cores. Still if I do a
+> baremetal test (so no Linux) and write from the ARMs to the FPGA via
+> those registers, it takes only 17 cycles instead of the Linux
+> implementation which takes 250 cycles.
+
+Which PCI host bridge driver are you using?
+
+If you're using drivers/pci/host/pcie-xilinx.c, it uses
+pci_generic_config_write(), which doesn't contain anything that looks
+obviously expensive (except for the __iowmb() inside writeq(), and
+it's hard to do much about that).
+
+There is the pci_lock that's acquired in pci_bus_read_config_dword()
+(see drivers/pci/access.c).  As I mentioned before, it looks like
+xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to
+get rid of that lock.  Not sure how much that would buy you since
+there's probably no contention, but you could experiment with it.
+
+It sounds like you have some user-level code involved, too; I don't
+know anything about what that code might be doing.
+
+Can you do any profiling to figure out where the time is going?
+
+> From: Bjorn Helgaas <helgaas@kernel.org>
+> Sent: Friday, November 3, 2017 2:54 PM
+> To: Michal Simek
+> Cc: Ruben Guerra Marin; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org
+> Subject: Re: Performance issues writing to PCIe in a Zynq
+>
+> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:
+> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:
+> > >
+> > > I have the a Zynq board running petalinux, and it is connected
+> > > through PCIe to a Virtex Ultrascale board. I configured the
+> > > Ultrascale for Tandem PCIe, which the second stage bitstream is
+> > > being programmed from the Zynq board (I crossed compiled the mcap
+> > > application that Xilinx provides).
+> > >
+> > > This works perfectly, but takes around ~12 seconds to program the
+> > > second stage bitstream (compressed is ~12 MB), which is quite
+> > > slow. We also tried debugging the mcap application and pciutils.
+> > > We found out the operation that takes long to execute: In
+> > > pciutils, the instruction to actually call the write to the driver
+> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB
+> > > then you can see why it takes so long. Why is this so slow? Is
+> > > this maybe a problem with the driver?
+> > >
+> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1
+> > > and the PCIe IP control registers port. I triggered halfway the
+> > > programming of the bitstream using the mcap program provided by
+> > > Xilinx. I can see that it is writing to address x358, which
+> > > according to the *datasheet*
+> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)
+> > > is the Write Data Register, which is correct (and again, I know
+> > > the whole bitstream gets programmed correctly).
+> > >
+> > > But what I also see is that a "awvalid" being asserted to the next
+> > > one it takes 245 cycles, and I can imagine this is why it takes 12
+> > > seconds to program a 12MB bitstream.
+>
+> How long do you expect this to take?  What are the corresponding times
+> on other hardware or other OSes?
+>
+> It sounds like this programming is done via config accesses, which are
+> definitely not a fast path, so I don't know if 12s is unreasonably
+> slow or not.  The largest config write is 4 bytes, which means 12MB
+> requires 3M writes, and if each takes 6us, that's 18s total.
+>
+> Most platforms serialize config accesses with a lock, which also slows
+> things down.  It looks like your hardware might support ECAM, which
+> means you might be able to remove the locking overhead by using
+> lockless config (see     CONFIG_    PCI_LOCKLESS_CONFIG).  This is a new
+> feature currently only used by x86, and it's currently a system-wide
+> compile-time switch, so it would require some work for you to use it.
+>
+> The high bandwidth way to do this would be use a BAR and do PCI memory
+> writes instead of PCI config writes.  Obviously the adapter determines
+> whether this is possible.
+>
+> Bjorn
diff --git a/a/content_digest b/N1/content_digest
index 4597991..e6543ce 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -3,131 +3,116 @@
  "ref\020171103135457.GA8457@bhelgaas-glaptop.roam.corp.google.com\0"
  "ref\01509958288489.45871@axon.tv\0"
  "ref\020171106173549.GB31930@bhelgaas-glaptop.roam.corp.google.com\0"
- "From\0Ruben Guerra Marin <ruben.guerra.marin@axon.tv>\0"
- "Subject\0Re: Performance issues writing to PCIe in a Zynq\0"
+ "From\0ruben.guerra.marin@axon.tv (Ruben Guerra Marin)\0"
+ "Subject\0Performance issues writing to PCIe in a Zynq\0"
  "Date\0Tue, 7 Nov 2017 09:04:14 +0000\0"
- "To\0Bjorn Helgaas <helgaas@kernel.org>\0"
- "Cc\0Michal Simek <michal.simek@xilinx.com>"
-  bhelgaas@google.com <bhelgaas@google.com>
-  soren.brinkmann@xilinx.com <soren.brinkmann@xilinx.com>
-  bharat.kumar.gogada@xilinx.com <bharat.kumar.gogada@xilinx.com>
-  linux-pci@vger.kernel.org <linux-pci@vger.kernel.org>
- " linux-arm-kernel@lists.infradead.org <linux-arm-kernel@lists.infradead.org>\0"
+ "To\0linux-arm-kernel@lists.infradead.org\0"
  "\00:1\0"
  "b\0"
- "> Which PCI host bridge driver are you using?=0A=\n"
- "Indeed I am using drivers/pci/host/pcie-xilinx.c=0A=\n"
- "=0A=\n"
- "> Can you do any profiling to figure out where the time is going?=0A=\n"
- "So I did some profiling on the mcap application, and traced the bottleneck =\n"
- "in a write call to pciutils. I added some debug to pciutils and could trace=\n"
- " it that the bottleneck is in a function called pwrite, which pretty much d=\n"
- "oes this:=0A=\n"
- "=0A=\n"
- "syscall(SYS_pwrite, fd, buf, size, where);=0A=\n"
- "=0A=\n"
- "until now this is the last part I could trace it to.=0A=\n"
- "=0A=\n"
- "Ruben Guerra Marin=0A=\n"
- "ruben.guerra.marin@axon.tv=0A=\n"
- "=0A=\n"
- "________________________________________=0A=\n"
- "From: Bjorn Helgaas <helgaas@kernel.org>=0A=\n"
- "Sent: Monday, November 6, 2017 6:35 PM=0A=\n"
- "To: Ruben Guerra Marin=0A=\n"
- "Cc: Michal Simek; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.k=\n"
- "umar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.i=\n"
- "nfradead.org=0A=\n"
- "Subject: Re: Performance issues writing to PCIe in a Zynq=0A=\n"
- "=0A=\n"
- "On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:=0A=\n"
- "> Hi, according to Xilinx, from a computer host it happens in a=0A=\n"
- "> second, while for us in the Zynq (ARM) takes way more than that as=0A=\n"
- "> explained before. And indeed the programming is done via config=0A=\n"
- "> accesses, and can't happen otherwise as this is the way Xilinx=0A=\n"
- "> created its FPGA IP (Intellectual Property) cores. Still if I do a=0A=\n"
- "> baremetal test (so no Linux) and write from the ARMs to the FPGA via=0A=\n"
- "> those registers, it takes only 17 cycles instead of the Linux=0A=\n"
- "> implementation which takes 250 cycles.=0A=\n"
- "=0A=\n"
- "Which PCI host bridge driver are you using?=0A=\n"
- "=0A=\n"
- "If you're using drivers/pci/host/pcie-xilinx.c, it uses=0A=\n"
- "pci_generic_config_write(), which doesn't contain anything that looks=0A=\n"
- "obviously expensive (except for the __iowmb() inside writeq(), and=0A=\n"
- "it's hard to do much about that).=0A=\n"
- "=0A=\n"
- "There is the pci_lock that's acquired in pci_bus_read_config_dword()=0A=\n"
- "(see drivers/pci/access.c).  As I mentioned before, it looks like=0A=\n"
- "xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to=0A=\n"
- "get rid of that lock.  Not sure how much that would buy you since=0A=\n"
- "there's probably no contention, but you could experiment with it.=0A=\n"
- "=0A=\n"
- "It sounds like you have some user-level code involved, too; I don't=0A=\n"
- "know anything about what that code might be doing.=0A=\n"
- "=0A=\n"
- "Can you do any profiling to figure out where the time is going?=0A=\n"
- "=0A=\n"
- "> From: Bjorn Helgaas <helgaas@kernel.org>=0A=\n"
- "> Sent: Friday, November 3, 2017 2:54 PM=0A=\n"
- "> To: Michal Simek=0A=\n"
- "> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; =\n"
- "bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel=\n"
- "@lists.infradead.org=0A=\n"
- "> Subject: Re: Performance issues writing to PCIe in a Zynq=0A=\n"
- ">=0A=\n"
- "> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:=0A=\n"
- "> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:=0A=\n"
- "> > >=0A=\n"
- "> > > I have the a Zynq board running petalinux, and it is connected=0A=\n"
- "> > > through PCIe to a Virtex Ultrascale board. I configured the=0A=\n"
- "> > > Ultrascale for Tandem PCIe, which the second stage bitstream is=0A=\n"
- "> > > being programmed from the Zynq board (I crossed compiled the mcap=0A=\n"
- "> > > application that Xilinx provides).=0A=\n"
- "> > >=0A=\n"
- "> > > This works perfectly, but takes around ~12 seconds to program the=0A=\n"
- "> > > second stage bitstream (compressed is ~12 MB), which is quite=0A=\n"
- "> > > slow. We also tried debugging the mcap application and pciutils.=0A=\n"
- "> > > We found out the operation that takes long to execute: In=0A=\n"
- "> > > pciutils, the instruction to actually call the write to the driver=0A=\n"
- "> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB=0A=\n"
- "> > > then you can see why it takes so long. Why is this so slow? Is=0A=\n"
- "> > > this maybe a problem with the driver?=0A=\n"
- "> > >=0A=\n"
- "> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1=0A=\n"
- "> > > and the PCIe IP control registers port. I triggered halfway the=0A=\n"
- "> > > programming of the bitstream using the mcap program provided by=0A=\n"
- "> > > Xilinx. I can see that it is writing to address x358, which=0A=\n"
- "> > > according to the *datasheet*=0A=\n"
- "> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_De=\n"
- "vices.pdf)=0A=\n"
- "> > > is the Write Data Register, which is correct (and again, I know=0A=\n"
- "> > > the whole bitstream gets programmed correctly).=0A=\n"
- "> > >=0A=\n"
- "> > > But what I also see is that a \"awvalid\" being asserted to the next=0A=\n"
- "> > > one it takes 245 cycles, and I can imagine this is why it takes 12=0A=\n"
- "> > > seconds to program a 12MB bitstream.=0A=\n"
- ">=0A=\n"
- "> How long do you expect this to take?  What are the corresponding times=0A=\n"
- "> on other hardware or other OSes?=0A=\n"
- ">=0A=\n"
- "> It sounds like this programming is done via config accesses, which are=0A=\n"
- "> definitely not a fast path, so I don't know if 12s is unreasonably=0A=\n"
- "> slow or not.  The largest config write is 4 bytes, which means 12MB=0A=\n"
- "> requires 3M writes, and if each takes 6us, that's 18s total.=0A=\n"
- ">=0A=\n"
- "> Most platforms serialize config accesses with a lock, which also slows=0A=\n"
- "> things down.  It looks like your hardware might support ECAM, which=0A=\n"
- "> means you might be able to remove the locking overhead by using=0A=\n"
- "> lockless config (see     CONFIG_    PCI_LOCKLESS_CONFIG).  This is a new=\n"
- "=0A=\n"
- "> feature currently only used by x86, and it's currently a system-wide=0A=\n"
- "> compile-time switch, so it would require some work for you to use it.=0A=\n"
- ">=0A=\n"
- "> The high bandwidth way to do this would be use a BAR and do PCI memory=0A=\n"
- "> writes instead of PCI config writes.  Obviously the adapter determines=0A=\n"
- "> whether this is possible.=0A=\n"
- ">=0A=\n"
- > Bjorn=0A=
+ "> Which PCI host bridge driver are you using?\n"
+ "Indeed I am using drivers/pci/host/pcie-xilinx.c\n"
+ "\n"
+ "> Can you do any profiling to figure out where the time is going?\n"
+ "So I did some profiling on the mcap application, and traced the bottleneck in a write call to pciutils. I added some debug to pciutils and could trace it that the bottleneck is in a function called pwrite, which pretty much does this:\n"
+ "\n"
+ "syscall(SYS_pwrite, fd, buf, size, where);\n"
+ "\n"
+ "until now this is the last part I could trace it to.\n"
+ "\n"
+ "Ruben Guerra Marin\n"
+ "ruben.guerra.marin at axon.tv\n"
+ "\n"
+ "________________________________________\n"
+ "From: Bjorn Helgaas <helgaas@kernel.org>\n"
+ "Sent: Monday, November 6, 2017 6:35 PM\n"
+ "To: Ruben Guerra Marin\n"
+ "Cc: Michal Simek; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org\n"
+ "Subject: Re: Performance issues writing to PCIe in a Zynq\n"
+ "\n"
+ "On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:\n"
+ "> Hi, according to Xilinx, from a computer host it happens in a\n"
+ "> second, while for us in the Zynq (ARM) takes way more than that as\n"
+ "> explained before. And indeed the programming is done via config\n"
+ "> accesses, and can't happen otherwise as this is the way Xilinx\n"
+ "> created its FPGA IP (Intellectual Property) cores. Still if I do a\n"
+ "> baremetal test (so no Linux) and write from the ARMs to the FPGA via\n"
+ "> those registers, it takes only 17 cycles instead of the Linux\n"
+ "> implementation which takes 250 cycles.\n"
+ "\n"
+ "Which PCI host bridge driver are you using?\n"
+ "\n"
+ "If you're using drivers/pci/host/pcie-xilinx.c, it uses\n"
+ "pci_generic_config_write(), which doesn't contain anything that looks\n"
+ "obviously expensive (except for the __iowmb() inside writeq(), and\n"
+ "it's hard to do much about that).\n"
+ "\n"
+ "There is the pci_lock that's acquired in pci_bus_read_config_dword()\n"
+ "(see drivers/pci/access.c).  As I mentioned before, it looks like\n"
+ "xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to\n"
+ "get rid of that lock.  Not sure how much that would buy you since\n"
+ "there's probably no contention, but you could experiment with it.\n"
+ "\n"
+ "It sounds like you have some user-level code involved, too; I don't\n"
+ "know anything about what that code might be doing.\n"
+ "\n"
+ "Can you do any profiling to figure out where the time is going?\n"
+ "\n"
+ "> From: Bjorn Helgaas <helgaas@kernel.org>\n"
+ "> Sent: Friday, November 3, 2017 2:54 PM\n"
+ "> To: Michal Simek\n"
+ "> Cc: Ruben Guerra Marin; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org\n"
+ "> Subject: Re: Performance issues writing to PCIe in a Zynq\n"
+ ">\n"
+ "> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:\n"
+ "> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:\n"
+ "> > >\n"
+ "> > > I have the a Zynq board running petalinux, and it is connected\n"
+ "> > > through PCIe to a Virtex Ultrascale board. I configured the\n"
+ "> > > Ultrascale for Tandem PCIe, which the second stage bitstream is\n"
+ "> > > being programmed from the Zynq board (I crossed compiled the mcap\n"
+ "> > > application that Xilinx provides).\n"
+ "> > >\n"
+ "> > > This works perfectly, but takes around ~12 seconds to program the\n"
+ "> > > second stage bitstream (compressed is ~12 MB), which is quite\n"
+ "> > > slow. We also tried debugging the mcap application and pciutils.\n"
+ "> > > We found out the operation that takes long to execute: In\n"
+ "> > > pciutils, the instruction to actually call the write to the driver\n"
+ "> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB\n"
+ "> > > then you can see why it takes so long. Why is this so slow? Is\n"
+ "> > > this maybe a problem with the driver?\n"
+ "> > >\n"
+ "> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1\n"
+ "> > > and the PCIe IP control registers port. I triggered halfway the\n"
+ "> > > programming of the bitstream using the mcap program provided by\n"
+ "> > > Xilinx. I can see that it is writing to address x358, which\n"
+ "> > > according to the *datasheet*\n"
+ "> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)\n"
+ "> > > is the Write Data Register, which is correct (and again, I know\n"
+ "> > > the whole bitstream gets programmed correctly).\n"
+ "> > >\n"
+ "> > > But what I also see is that a \"awvalid\" being asserted to the next\n"
+ "> > > one it takes 245 cycles, and I can imagine this is why it takes 12\n"
+ "> > > seconds to program a 12MB bitstream.\n"
+ ">\n"
+ "> How long do you expect this to take?  What are the corresponding times\n"
+ "> on other hardware or other OSes?\n"
+ ">\n"
+ "> It sounds like this programming is done via config accesses, which are\n"
+ "> definitely not a fast path, so I don't know if 12s is unreasonably\n"
+ "> slow or not.  The largest config write is 4 bytes, which means 12MB\n"
+ "> requires 3M writes, and if each takes 6us, that's 18s total.\n"
+ ">\n"
+ "> Most platforms serialize config accesses with a lock, which also slows\n"
+ "> things down.  It looks like your hardware might support ECAM, which\n"
+ "> means you might be able to remove the locking overhead by using\n"
+ "> lockless config (see     CONFIG_    PCI_LOCKLESS_CONFIG).  This is a new\n"
+ "> feature currently only used by x86, and it's currently a system-wide\n"
+ "> compile-time switch, so it would require some work for you to use it.\n"
+ ">\n"
+ "> The high bandwidth way to do this would be use a BAR and do PCI memory\n"
+ "> writes instead of PCI config writes.  Obviously the adapter determines\n"
+ "> whether this is possible.\n"
+ ">\n"
+ > Bjorn
 
-4e908eb66ba0800a05ba741533419b89772d6297baf10f4d9450aa308f3eabcf
+e1970768cd0ba75ac11bf575a5bd5952f9033a51f55a1d2d1fcbc6fb671970f3
