diff for duplicates of <1510045454475.74232@axon.tv> diff --git a/a/1.txt b/N1/1.txt index 5861c28..a9fcab0 100644 --- a/a/1.txt +++ b/N1/1.txt @@ -1,114 +1,105 @@ -> Which PCI host bridge driver are you using?=0A= -Indeed I am using drivers/pci/host/pcie-xilinx.c=0A= -=0A= -> Can you do any profiling to figure out where the time is going?=0A= -So I did some profiling on the mcap application, and traced the bottleneck = -in a write call to pciutils. I added some debug to pciutils and could trace= - it that the bottleneck is in a function called pwrite, which pretty much d= -oes this:=0A= -=0A= -syscall(SYS_pwrite, fd, buf, size, where);=0A= -=0A= -until now this is the last part I could trace it to.=0A= -=0A= -Ruben Guerra Marin=0A= -ruben.guerra.marin@axon.tv=0A= -=0A= -________________________________________=0A= -From: Bjorn Helgaas <helgaas@kernel.org>=0A= -Sent: Monday, November 6, 2017 6:35 PM=0A= -To: Ruben Guerra Marin=0A= -Cc: Michal Simek; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.k= -umar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.i= -nfradead.org=0A= -Subject: Re: Performance issues writing to PCIe in a Zynq=0A= -=0A= -On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:=0A= -> Hi, according to Xilinx, from a computer host it happens in a=0A= -> second, while for us in the Zynq (ARM) takes way more than that as=0A= -> explained before. And indeed the programming is done via config=0A= -> accesses, and can't happen otherwise as this is the way Xilinx=0A= -> created its FPGA IP (Intellectual Property) cores. 
Still if I do a=0A= -> baremetal test (so no Linux) and write from the ARMs to the FPGA via=0A= -> those registers, it takes only 17 cycles instead of the Linux=0A= -> implementation which takes 250 cycles.=0A= -=0A= -Which PCI host bridge driver are you using?=0A= -=0A= -If you're using drivers/pci/host/pcie-xilinx.c, it uses=0A= -pci_generic_config_write(), which doesn't contain anything that looks=0A= -obviously expensive (except for the __iowmb() inside writeq(), and=0A= -it's hard to do much about that).=0A= -=0A= -There is the pci_lock that's acquired in pci_bus_read_config_dword()=0A= -(see drivers/pci/access.c). As I mentioned before, it looks like=0A= -xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to=0A= -get rid of that lock. Not sure how much that would buy you since=0A= -there's probably no contention, but you could experiment with it.=0A= -=0A= -It sounds like you have some user-level code involved, too; I don't=0A= -know anything about what that code might be doing.=0A= -=0A= -Can you do any profiling to figure out where the time is going?=0A= -=0A= -> From: Bjorn Helgaas <helgaas@kernel.org>=0A= -> Sent: Friday, November 3, 2017 2:54 PM=0A= -> To: Michal Simek=0A= -> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; = -bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel= -@lists.infradead.org=0A= -> Subject: Re: Performance issues writing to PCIe in a Zynq=0A= ->=0A= -> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:=0A= -> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:=0A= -> > >=0A= -> > > I have the a Zynq board running petalinux, and it is connected=0A= -> > > through PCIe to a Virtex Ultrascale board. 
I configured the=0A= -> > > Ultrascale for Tandem PCIe, which the second stage bitstream is=0A= -> > > being programmed from the Zynq board (I crossed compiled the mcap=0A= -> > > application that Xilinx provides).=0A= -> > >=0A= -> > > This works perfectly, but takes around ~12 seconds to program the=0A= -> > > second stage bitstream (compressed is ~12 MB), which is quite=0A= -> > > slow. We also tried debugging the mcap application and pciutils.=0A= -> > > We found out the operation that takes long to execute: In=0A= -> > > pciutils, the instruction to actually call the write to the driver=0A= -> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB=0A= -> > > then you can see why it takes so long. Why is this so slow? Is=0A= -> > > this maybe a problem with the driver?=0A= -> > >=0A= -> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1=0A= -> > > and the PCIe IP control registers port. I triggered halfway the=0A= -> > > programming of the bitstream using the mcap program provided by=0A= -> > > Xilinx. I can see that it is writing to address x358, which=0A= -> > > according to the *datasheet*=0A= -> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_De= -vices.pdf)=0A= -> > > is the Write Data Register, which is correct (and again, I know=0A= -> > > the whole bitstream gets programmed correctly).=0A= -> > >=0A= -> > > But what I also see is that a "awvalid" being asserted to the next=0A= -> > > one it takes 245 cycles, and I can imagine this is why it takes 12=0A= -> > > seconds to program a 12MB bitstream.=0A= ->=0A= -> How long do you expect this to take? What are the corresponding times=0A= -> on other hardware or other OSes?=0A= ->=0A= -> It sounds like this programming is done via config accesses, which are=0A= -> definitely not a fast path, so I don't know if 12s is unreasonably=0A= -> slow or not. 
The largest config write is 4 bytes, which means 12MB=0A= -> requires 3M writes, and if each takes 6us, that's 18s total.=0A= ->=0A= -> Most platforms serialize config accesses with a lock, which also slows=0A= -> things down. It looks like your hardware might support ECAM, which=0A= -> means you might be able to remove the locking overhead by using=0A= -> lockless config (see CONFIG_ PCI_LOCKLESS_CONFIG). This is a new= -=0A= -> feature currently only used by x86, and it's currently a system-wide=0A= -> compile-time switch, so it would require some work for you to use it.=0A= ->=0A= -> The high bandwidth way to do this would be use a BAR and do PCI memory=0A= -> writes instead of PCI config writes. Obviously the adapter determines=0A= -> whether this is possible.=0A= ->=0A= -> Bjorn=0A= +> Which PCI host bridge driver are you using? +Indeed I am using drivers/pci/host/pcie-xilinx.c + +> Can you do any profiling to figure out where the time is going? +So I did some profiling on the mcap application, and traced the bottleneck in a write call to pciutils. I added some debug to pciutils and could trace it that the bottleneck is in a function called pwrite, which pretty much does this: + +syscall(SYS_pwrite, fd, buf, size, where); + +until now this is the last part I could trace it to. + +Ruben Guerra Marin +ruben.guerra.marin at axon.tv + +________________________________________ +From: Bjorn Helgaas <helgaas@kernel.org> +Sent: Monday, November 6, 2017 6:35 PM +To: Ruben Guerra Marin +Cc: Michal Simek; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org +Subject: Re: Performance issues writing to PCIe in a Zynq + +On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote: +> Hi, according to Xilinx, from a computer host it happens in a +> second, while for us in the Zynq (ARM) takes way more than that as +> explained before. 
And indeed the programming is done via config +> accesses, and can't happen otherwise as this is the way Xilinx +> created its FPGA IP (Intellectual Property) cores. Still if I do a +> baremetal test (so no Linux) and write from the ARMs to the FPGA via +> those registers, it takes only 17 cycles instead of the Linux +> implementation which takes 250 cycles. + +Which PCI host bridge driver are you using? + +If you're using drivers/pci/host/pcie-xilinx.c, it uses +pci_generic_config_write(), which doesn't contain anything that looks +obviously expensive (except for the __iowmb() inside writeq(), and +it's hard to do much about that). + +There is the pci_lock that's acquired in pci_bus_read_config_dword() +(see drivers/pci/access.c). As I mentioned before, it looks like +xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to +get rid of that lock. Not sure how much that would buy you since +there's probably no contention, but you could experiment with it. + +It sounds like you have some user-level code involved, too; I don't +know anything about what that code might be doing. + +Can you do any profiling to figure out where the time is going? + +> From: Bjorn Helgaas <helgaas@kernel.org> +> Sent: Friday, November 3, 2017 2:54 PM +> To: Michal Simek +> Cc: Ruben Guerra Marin; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org +> Subject: Re: Performance issues writing to PCIe in a Zynq +> +> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote: +> > On 2.11.2017 16:30, Ruben Guerra Marin wrote: +> > > +> > > I have the a Zynq board running petalinux, and it is connected +> > > through PCIe to a Virtex Ultrascale board. I configured the +> > > Ultrascale for Tandem PCIe, which the second stage bitstream is +> > > being programmed from the Zynq board (I crossed compiled the mcap +> > > application that Xilinx provides). 
+> > > +> > > This works perfectly, but takes around ~12 seconds to program the +> > > second stage bitstream (compressed is ~12 MB), which is quite +> > > slow. We also tried debugging the mcap application and pciutils. +> > > We found out the operation that takes long to execute: In +> > > pciutils, the instruction to actually call the write to the driver +> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB +> > > then you can see why it takes so long. Why is this so slow? Is +> > > this maybe a problem with the driver? +> > > +> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1 +> > > and the PCIe IP control registers port. I triggered halfway the +> > > programming of the bitstream using the mcap program provided by +> > > Xilinx. I can see that it is writing to address x358, which +> > > according to the *datasheet* +> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf) +> > > is the Write Data Register, which is correct (and again, I know +> > > the whole bitstream gets programmed correctly). +> > > +> > > But what I also see is that a "awvalid" being asserted to the next +> > > one it takes 245 cycles, and I can imagine this is why it takes 12 +> > > seconds to program a 12MB bitstream. +> +> How long do you expect this to take? What are the corresponding times +> on other hardware or other OSes? +> +> It sounds like this programming is done via config accesses, which are +> definitely not a fast path, so I don't know if 12s is unreasonably +> slow or not. The largest config write is 4 bytes, which means 12MB +> requires 3M writes, and if each takes 6us, that's 18s total. +> +> Most platforms serialize config accesses with a lock, which also slows +> things down. It looks like your hardware might support ECAM, which +> means you might be able to remove the locking overhead by using +> lockless config (see CONFIG_ PCI_LOCKLESS_CONFIG). 
This is a new +> feature currently only used by x86, and it's currently a system-wide +> compile-time switch, so it would require some work for you to use it. +> +> The high bandwidth way to do this would be use a BAR and do PCI memory +> writes instead of PCI config writes. Obviously the adapter determines +> whether this is possible. +> +> Bjorn diff --git a/a/content_digest b/N1/content_digest index 4597991..e6543ce 100644 --- a/a/content_digest +++ b/N1/content_digest @@ -3,131 +3,116 @@ "ref\020171103135457.GA8457@bhelgaas-glaptop.roam.corp.google.com\0" "ref\01509958288489.45871@axon.tv\0" "ref\020171106173549.GB31930@bhelgaas-glaptop.roam.corp.google.com\0" - "From\0Ruben Guerra Marin <ruben.guerra.marin@axon.tv>\0" - "Subject\0Re: Performance issues writing to PCIe in a Zynq\0" + "From\0ruben.guerra.marin@axon.tv (Ruben Guerra Marin)\0" + "Subject\0Performance issues writing to PCIe in a Zynq\0" "Date\0Tue, 7 Nov 2017 09:04:14 +0000\0" - "To\0Bjorn Helgaas <helgaas@kernel.org>\0" - "Cc\0Michal Simek <michal.simek@xilinx.com>" - bhelgaas@google.com <bhelgaas@google.com> - soren.brinkmann@xilinx.com <soren.brinkmann@xilinx.com> - bharat.kumar.gogada@xilinx.com <bharat.kumar.gogada@xilinx.com> - linux-pci@vger.kernel.org <linux-pci@vger.kernel.org> - " linux-arm-kernel@lists.infradead.org <linux-arm-kernel@lists.infradead.org>\0" + "To\0linux-arm-kernel@lists.infradead.org\0" "\00:1\0" "b\0" - "> Which PCI host bridge driver are you using?=0A=\n" - "Indeed I am using drivers/pci/host/pcie-xilinx.c=0A=\n" - "=0A=\n" - "> Can you do any profiling to figure out where the time is going?=0A=\n" - "So I did some profiling on the mcap application, and traced the bottleneck =\n" - "in a write call to pciutils. 
I added some debug to pciutils and could trace=\n" - " it that the bottleneck is in a function called pwrite, which pretty much d=\n" - "oes this:=0A=\n" - "=0A=\n" - "syscall(SYS_pwrite, fd, buf, size, where);=0A=\n" - "=0A=\n" - "until now this is the last part I could trace it to.=0A=\n" - "=0A=\n" - "Ruben Guerra Marin=0A=\n" - "ruben.guerra.marin@axon.tv=0A=\n" - "=0A=\n" - "________________________________________=0A=\n" - "From: Bjorn Helgaas <helgaas@kernel.org>=0A=\n" - "Sent: Monday, November 6, 2017 6:35 PM=0A=\n" - "To: Ruben Guerra Marin=0A=\n" - "Cc: Michal Simek; bhelgaas@google.com; soren.brinkmann@xilinx.com; bharat.k=\n" - "umar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel@lists.i=\n" - "nfradead.org=0A=\n" - "Subject: Re: Performance issues writing to PCIe in a Zynq=0A=\n" - "=0A=\n" - "On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:=0A=\n" - "> Hi, according to Xilinx, from a computer host it happens in a=0A=\n" - "> second, while for us in the Zynq (ARM) takes way more than that as=0A=\n" - "> explained before. And indeed the programming is done via config=0A=\n" - "> accesses, and can't happen otherwise as this is the way Xilinx=0A=\n" - "> created its FPGA IP (Intellectual Property) cores. Still if I do a=0A=\n" - "> baremetal test (so no Linux) and write from the ARMs to the FPGA via=0A=\n" - "> those registers, it takes only 17 cycles instead of the Linux=0A=\n" - "> implementation which takes 250 cycles.=0A=\n" - "=0A=\n" - "Which PCI host bridge driver are you using?=0A=\n" - "=0A=\n" - "If you're using drivers/pci/host/pcie-xilinx.c, it uses=0A=\n" - "pci_generic_config_write(), which doesn't contain anything that looks=0A=\n" - "obviously expensive (except for the __iowmb() inside writeq(), and=0A=\n" - "it's hard to do much about that).=0A=\n" - "=0A=\n" - "There is the pci_lock that's acquired in pci_bus_read_config_dword()=0A=\n" - "(see drivers/pci/access.c). 
As I mentioned before, it looks like=0A=\n" - "xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to=0A=\n" - "get rid of that lock. Not sure how much that would buy you since=0A=\n" - "there's probably no contention, but you could experiment with it.=0A=\n" - "=0A=\n" - "It sounds like you have some user-level code involved, too; I don't=0A=\n" - "know anything about what that code might be doing.=0A=\n" - "=0A=\n" - "Can you do any profiling to figure out where the time is going?=0A=\n" - "=0A=\n" - "> From: Bjorn Helgaas <helgaas@kernel.org>=0A=\n" - "> Sent: Friday, November 3, 2017 2:54 PM=0A=\n" - "> To: Michal Simek=0A=\n" - "> Cc: Ruben Guerra Marin; bhelgaas@google.com; soren.brinkmann@xilinx.com; =\n" - "bharat.kumar.gogada@xilinx.com; linux-pci@vger.kernel.org; linux-arm-kernel=\n" - "@lists.infradead.org=0A=\n" - "> Subject: Re: Performance issues writing to PCIe in a Zynq=0A=\n" - ">=0A=\n" - "> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:=0A=\n" - "> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:=0A=\n" - "> > >=0A=\n" - "> > > I have the a Zynq board running petalinux, and it is connected=0A=\n" - "> > > through PCIe to a Virtex Ultrascale board. I configured the=0A=\n" - "> > > Ultrascale for Tandem PCIe, which the second stage bitstream is=0A=\n" - "> > > being programmed from the Zynq board (I crossed compiled the mcap=0A=\n" - "> > > application that Xilinx provides).=0A=\n" - "> > >=0A=\n" - "> > > This works perfectly, but takes around ~12 seconds to program the=0A=\n" - "> > > second stage bitstream (compressed is ~12 MB), which is quite=0A=\n" - "> > > slow. We also tried debugging the mcap application and pciutils.=0A=\n" - "> > > We found out the operation that takes long to execute: In=0A=\n" - "> > > pciutils, the instruction to actually call the write to the driver=0A=\n" - "> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB=0A=\n" - "> > > then you can see why it takes so long. 
Why is this so slow? Is=0A=\n" - "> > > this maybe a problem with the driver?=0A=\n" - "> > >=0A=\n" - "> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1=0A=\n" - "> > > and the PCIe IP control registers port. I triggered halfway the=0A=\n" - "> > > programming of the bitstream using the mcap program provided by=0A=\n" - "> > > Xilinx. I can see that it is writing to address x358, which=0A=\n" - "> > > according to the *datasheet*=0A=\n" - "> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_De=\n" - "vices.pdf)=0A=\n" - "> > > is the Write Data Register, which is correct (and again, I know=0A=\n" - "> > > the whole bitstream gets programmed correctly).=0A=\n" - "> > >=0A=\n" - "> > > But what I also see is that a \"awvalid\" being asserted to the next=0A=\n" - "> > > one it takes 245 cycles, and I can imagine this is why it takes 12=0A=\n" - "> > > seconds to program a 12MB bitstream.=0A=\n" - ">=0A=\n" - "> How long do you expect this to take? What are the corresponding times=0A=\n" - "> on other hardware or other OSes?=0A=\n" - ">=0A=\n" - "> It sounds like this programming is done via config accesses, which are=0A=\n" - "> definitely not a fast path, so I don't know if 12s is unreasonably=0A=\n" - "> slow or not. The largest config write is 4 bytes, which means 12MB=0A=\n" - "> requires 3M writes, and if each takes 6us, that's 18s total.=0A=\n" - ">=0A=\n" - "> Most platforms serialize config accesses with a lock, which also slows=0A=\n" - "> things down. It looks like your hardware might support ECAM, which=0A=\n" - "> means you might be able to remove the locking overhead by using=0A=\n" - "> lockless config (see CONFIG_ PCI_LOCKLESS_CONFIG). 
This is a new=\n" - "=0A=\n" - "> feature currently only used by x86, and it's currently a system-wide=0A=\n" - "> compile-time switch, so it would require some work for you to use it.=0A=\n" - ">=0A=\n" - "> The high bandwidth way to do this would be use a BAR and do PCI memory=0A=\n" - "> writes instead of PCI config writes. Obviously the adapter determines=0A=\n" - "> whether this is possible.=0A=\n" - ">=0A=\n" - > Bjorn=0A= + "> Which PCI host bridge driver are you using?\n" + "Indeed I am using drivers/pci/host/pcie-xilinx.c\n" + "\n" + "> Can you do any profiling to figure out where the time is going?\n" + "So I did some profiling on the mcap application, and traced the bottleneck in a write call to pciutils. I added some debug to pciutils and could trace it that the bottleneck is in a function called pwrite, which pretty much does this:\n" + "\n" + "syscall(SYS_pwrite, fd, buf, size, where);\n" + "\n" + "until now this is the last part I could trace it to.\n" + "\n" + "Ruben Guerra Marin\n" + "ruben.guerra.marin at axon.tv\n" + "\n" + "________________________________________\n" + "From: Bjorn Helgaas <helgaas@kernel.org>\n" + "Sent: Monday, November 6, 2017 6:35 PM\n" + "To: Ruben Guerra Marin\n" + "Cc: Michal Simek; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org\n" + "Subject: Re: Performance issues writing to PCIe in a Zynq\n" + "\n" + "On Mon, Nov 06, 2017 at 08:51:28AM +0000, Ruben Guerra Marin wrote:\n" + "> Hi, according to Xilinx, from a computer host it happens in a\n" + "> second, while for us in the Zynq (ARM) takes way more than that as\n" + "> explained before. And indeed the programming is done via config\n" + "> accesses, and can't happen otherwise as this is the way Xilinx\n" + "> created its FPGA IP (Intellectual Property) cores. 
Still if I do a\n" + "> baremetal test (so no Linux) and write from the ARMs to the FPGA via\n" + "> those registers, it takes only 17 cycles instead of the Linux\n" + "> implementation which takes 250 cycles.\n" + "\n" + "Which PCI host bridge driver are you using?\n" + "\n" + "If you're using drivers/pci/host/pcie-xilinx.c, it uses\n" + "pci_generic_config_write(), which doesn't contain anything that looks\n" + "obviously expensive (except for the __iowmb() inside writeq(), and\n" + "it's hard to do much about that).\n" + "\n" + "There is the pci_lock that's acquired in pci_bus_read_config_dword()\n" + "(see drivers/pci/access.c). As I mentioned before, it looks like\n" + "xilinx could probably use the CONFIG_PCI_LOCKLESS_CONFIG strategy to\n" + "get rid of that lock. Not sure how much that would buy you since\n" + "there's probably no contention, but you could experiment with it.\n" + "\n" + "It sounds like you have some user-level code involved, too; I don't\n" + "know anything about what that code might be doing.\n" + "\n" + "Can you do any profiling to figure out where the time is going?\n" + "\n" + "> From: Bjorn Helgaas <helgaas@kernel.org>\n" + "> Sent: Friday, November 3, 2017 2:54 PM\n" + "> To: Michal Simek\n" + "> Cc: Ruben Guerra Marin; bhelgaas at google.com; soren.brinkmann at xilinx.com; bharat.kumar.gogada at xilinx.com; linux-pci at vger.kernel.org; linux-arm-kernel at lists.infradead.org\n" + "> Subject: Re: Performance issues writing to PCIe in a Zynq\n" + ">\n" + "> On Fri, Nov 03, 2017 at 09:12:04AM +0100, Michal Simek wrote:\n" + "> > On 2.11.2017 16:30, Ruben Guerra Marin wrote:\n" + "> > >\n" + "> > > I have the a Zynq board running petalinux, and it is connected\n" + "> > > through PCIe to a Virtex Ultrascale board. 
I configured the\n" + "> > > Ultrascale for Tandem PCIe, which the second stage bitstream is\n" + "> > > being programmed from the Zynq board (I crossed compiled the mcap\n" + "> > > application that Xilinx provides).\n" + "> > >\n" + "> > > This works perfectly, but takes around ~12 seconds to program the\n" + "> > > second stage bitstream (compressed is ~12 MB), which is quite\n" + "> > > slow. We also tried debugging the mcap application and pciutils.\n" + "> > > We found out the operation that takes long to execute: In\n" + "> > > pciutils, the instruction to actually call the write to the driver\n" + "> > > (pwrite) takes approximately 6uS, so if you add up this for 12 MB\n" + "> > > then you can see why it takes so long. Why is this so slow? Is\n" + "> > > this maybe a problem with the driver?\n" + "> > >\n" + "> > > For testing, I added an ILA to the AXI bus in between the Zynq GP1\n" + "> > > and the PCIe IP control registers port. I triggered halfway the\n" + "> > > programming of the bitstream using the mcap program provided by\n" + "> > > Xilinx. I can see that it is writing to address x358, which\n" + "> > > according to the *datasheet*\n" + "> > > (https://www.xilinx.com/Attachment/Xilinx_Answer_64761__UltraScale_Devices.pdf)\n" + "> > > is the Write Data Register, which is correct (and again, I know\n" + "> > > the whole bitstream gets programmed correctly).\n" + "> > >\n" + "> > > But what I also see is that a \"awvalid\" being asserted to the next\n" + "> > > one it takes 245 cycles, and I can imagine this is why it takes 12\n" + "> > > seconds to program a 12MB bitstream.\n" + ">\n" + "> How long do you expect this to take? What are the corresponding times\n" + "> on other hardware or other OSes?\n" + ">\n" + "> It sounds like this programming is done via config accesses, which are\n" + "> definitely not a fast path, so I don't know if 12s is unreasonably\n" + "> slow or not. 
The largest config write is 4 bytes, which means 12MB\n" + "> requires 3M writes, and if each takes 6us, that's 18s total.\n" + ">\n" + "> Most platforms serialize config accesses with a lock, which also slows\n" + "> things down. It looks like your hardware might support ECAM, which\n" + "> means you might be able to remove the locking overhead by using\n" + "> lockless config (see CONFIG_ PCI_LOCKLESS_CONFIG). This is a new\n" + "> feature currently only used by x86, and it's currently a system-wide\n" + "> compile-time switch, so it would require some work for you to use it.\n" + ">\n" + "> The high bandwidth way to do this would be use a BAR and do PCI memory\n" + "> writes instead of PCI config writes. Obviously the adapter determines\n" + "> whether this is possible.\n" + ">\n" + > Bjorn -4e908eb66ba0800a05ba741533419b89772d6297baf10f4d9450aa308f3eabcf +e1970768cd0ba75ac11bf575a5bd5952f9033a51f55a1d2d1fcbc6fb671970f3
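
The estimate quoted in the thread (12 MB pushed through config writes of at most 4 bytes, at roughly 6 us per pwrite) can be reproduced with a few lines of arithmetic. This is only a sketch: the function name and its parameters are hypothetical, "12 MB" is assumed to mean 12 * 10^6 bytes, and the 6 us figure is the per-call time reported in the original message, not something measured here.

```python
def config_write_estimate(size_bytes, bytes_per_write=4, seconds_per_write=6e-6):
    """Return (number of writes, total seconds) for a transfer done purely
    through fixed-size config writes.

    Hypothetical helper for illustration; the defaults come from the figures
    quoted in the thread (4-byte config writes, about 6 us per pwrite call).
    """
    # Round up so that a trailing partial word still costs one full write.
    writes = -(-size_bytes // bytes_per_write)
    return writes, writes * seconds_per_write


# Assumption: "12 MB" is taken as 12 * 10**6 bytes.
writes, seconds = config_write_estimate(12 * 10**6)
print(f"{writes} writes, about {seconds:.1f} s")
```

With these inputs the write count comes out at three million, so the per-write cost dominates the total: shaving the time of a single config write, or moving to memory writes through a BAR as suggested in the thread, is what changes the result.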