* i.MX 6 and PCIe DMA issues
From: Moese, Michael @ 2017-07-11 13:40 UTC (permalink / raw)
To: linux-arm-kernel

Hello ARM folks,
I turn to you in the hope that you have any hints or directions on the
problem I describe below.

I am currently investigating strange behavior of our i.MX 6 (Quad) board
with an FPGA connected to PCI Express. This FPGA contains, among others,
an Ethernet (10/100 Mbps) IP core. The Ethernet relies completely on DMA
transfers. There is one buffer descriptor table containing pointers to
64 RX and TX buffers. Buffers are allocated using dma_alloc_coherent()
and mapped using dma_map_single(). Prior to access,
dma_sync_single_for_cpu() is called on the memory regions, and
dma_sync_single_for_device() afterwards.

If a new frame is received, the driver reads the RX buffer and passes
the frame up using skb_put(). When the issue was reported for an old
(say 3.18.19) kernel, the buffer descriptor was read correctly
(including a correct length), but the buffer contained all zeroes. When
I map the physical address in userspace and dump the contents, I can see
the correct buffer descriptor contents and, inside the buffers, valid
Ethernet frames. So the DMA transfer itself is obviously working. To
avoid chasing after already-fixed bugs, I switched to a 4.12.0 kernel
and observed almost the same behavior, but this time no length was read
either. On 3.18.19 I was able to read the buffers when I allocated them
using kmalloc() instead of dma_alloc_coherent(); on 4.12 this did not
have any impact.

I suspected the caches to be the root of my issue, but I was not able to
resolve it with calls to flush_cache_all(), which I supposed should have
invalidated the entire cache.

Unfortunately, our driver is legacy out-of-tree code, and I started
working on it to get it ready for submission. If it is of any help, I
could send the code of the driver as well as our board's device tree.

I would highly appreciate any hint or direction that may help my
troubleshooting.

Best Regards,
Michael
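For reference, a minimal sketch of the setup described above, with
hypothetical names and sizes - not the actual driver code. The replies
below identify this mixing of the coherent and streaming DMA APIs on the
same buffers as the suspect part:

#include <linux/device.h>
#include <linux/dma-mapping.h>

#define NUM_BUFS 64
#define BUF_SIZE 2048	/* assumed; enough for one Ethernet frame */

struct fpga_eth {
	struct device *dev;
	void *buf_cpu[NUM_BUFS];
	dma_addr_t buf_dma[NUM_BUFS];
};

static int fpga_eth_alloc_bufs(struct fpga_eth *priv)
{
	int i;

	for (i = 0; i < NUM_BUFS; i++) {
		/* Coherent allocation: already coherent for CPU and
		 * device, needs no further maintenance... */
		priv->buf_cpu[i] = dma_alloc_coherent(priv->dev, BUF_SIZE,
						      &priv->buf_dma[i],
						      GFP_KERNEL);
		if (!priv->buf_cpu[i])
			return -ENOMEM;

		/* ...yet additionally run through dma_map_single(),
		 * which is only valid for kmalloc()ed memory. */
		priv->buf_dma[i] = dma_map_single(priv->dev,
						  priv->buf_cpu[i],
						  BUF_SIZE,
						  DMA_FROM_DEVICE);
	}
	return 0;
}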
* i.MX 6 and PCIe DMA issues
From: Andrew Lunn @ 2017-07-11 14:07 UTC (permalink / raw)
To: linux-arm-kernel

> I suspected the caches to be the root of my issue, but I was not able
> to resolve it with calls to flush_cache_all(), which I supposed should
> have invalidated the entire cache.

Flush and invalidate are different operations. flush_cache_all() will
not invalidate.

How are your RX buffers aligned? It is good to ensure that your buffers
don't share cache lines. What could be happening is that reading the
tail end of one buffer brings the head of the next buffer into the
cache, if they share cache lines.

Also, ARMv7 processors can do speculative reads, so you must get your
dma_sync_single_for_cpu() and dma_sync_single_for_device() calls
correct.

	Andrew
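A hedged sketch of the alignment point, assuming kmalloc()ed
streaming-DMA buffers: pad each buffer's size up to a whole number of
cache lines so that adjacent buffers never share a line (on ARM,
kmalloc() already returns suitably aligned memory, so this mainly makes
the intent explicit):

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/slab.h>

/* dma_get_cache_alignment() returns the platform's minimum safe
 * alignment for DMA; rounding the length up to a multiple of it
 * guarantees no two buffers share a cache line. */
static void *alloc_rx_buf(size_t len)
{
	return kmalloc(ALIGN(len, dma_get_cache_alignment()), GFP_KERNEL);
}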
* i.MX 6 and PCIe DMA issues
From: Robin Murphy @ 2017-07-11 14:50 UTC (permalink / raw)
To: linux-arm-kernel

On 11/07/17 14:40, Moese, Michael wrote:
> I am currently investigating strange behavior of our i.MX 6 (Quad)
> board with an FPGA connected to PCI Express. This FPGA contains, among
> others, an Ethernet (10/100 Mbps) IP core. The Ethernet relies
> completely on DMA transfers. There is one buffer descriptor table
> containing pointers to 64 RX and TX buffers. Buffers are allocated
> using dma_alloc_coherent() and mapped using dma_map_single(). Prior

I don't much like the sound of that "and" there - coherent DMA
allocations are, as the name implies, already coherent for CPU and DMA
accesses, and require no maintenance; the streaming DMA API
(dma_{map,unmap,sync}_*) on the other hand is *only* for use on
kmalloced memory.

I'm far more of a DMA guy than a networking guy, so I don't know much
about the details of ethernet drivers, but typically, long-lived things
like descriptor rings would usually use coherent allocations, whilst
streaming DMA is used for the transient mapping/unmapping of individual
skbs.

> to access, dma_sync_single_for_cpu() is called on the memory regions,
> and dma_sync_single_for_device() afterwards.
>
> [...]
>
> I suspected the caches to be the root of my issue, but I was not able
> to resolve it with calls to flush_cache_all(), which I supposed should
> have invalidated the entire cache.

i.MX6Q has a PL310 L2 outer cache, which brings with it a whole load of
extra fun, but the primary effect is that it'll be extremely hard to
bodge things if the DMA API usage is incorrect in the first place.
AFAICS flush_cache_all() on a v7 CPU performs set/way operations on the
CPU caches, which means a) on that system it will only affect L1, and
b) it shouldn't really be used from an SMP context with the MMU on
anyway.

The PL310 does have more than its fair share of wackiness, but unless
you also see DMA going wrong for the on-chip peripherals, the problem
is almost certainly down to the driver itself rather than the cache
configuration.

Robin.
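To make that split concrete, a minimal sketch under assumed names - the
descriptor layout is invented for illustration: the long-lived ring
comes from dma_alloc_coherent() and needs no syncs, while each skb data
buffer is streaming-mapped around its one DMA transfer:

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define RING_SIZE 64

struct ring_desc {		/* hypothetical descriptor layout */
	__le32 addr;
	__le32 len_flags;
};

/* Long-lived descriptor ring: one coherent allocation. */
static struct ring_desc *alloc_ring(struct device *dev,
				    dma_addr_t *ring_dma)
{
	return dma_alloc_coherent(dev,
				  RING_SIZE * sizeof(struct ring_desc),
				  ring_dma, GFP_KERNEL);
}

/* Transient per-packet buffer: streaming-mapped for one transfer. */
static int map_rx_skb(struct device *dev, struct sk_buff *skb,
		      struct ring_desc *desc)
{
	dma_addr_t dma = dma_map_single(dev, skb->data,
					skb_tailroom(skb),
					DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	desc->addr = cpu_to_le32(dma);
	return 0;
}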
* i.MX 6 and PCIe DMA issues
From: michael.moese at men.de @ 2017-07-13 7:07 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Jul 11, 2017 at 03:50:19PM +0100, Robin Murphy wrote:
> I don't much like the sound of that "and" there - coherent DMA
> allocations are, as the name implies, already coherent for CPU and DMA
> accesses, and require no maintenance; the streaming DMA API
> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
> kmalloced memory.

OK, I hope I have this right: I allocate my memory using
dma_alloc_coherent() once, the dma_handle is passed to the device, and
no other dma_map_*() or dma_sync_*() calls are needed?

Yesterday I observed some strange behavior when I added some debug
prints to the driver, printing the result of phys_to_virt() for my
virtual address and the dma_handle. I know these may be different, but
when I dump the memory (from userspace using mmap), I can see the data
at the address of my dma_handle, while the memory the driver holds the
pointer to has different contents. I don't understand this.

> The PL310 does have more than its fair share of wackiness, but unless
> you also see DMA going wrong for the on-chip peripherals, the problem
> is almost certainly down to the driver itself rather than the cache
> configuration.

Well, I think I need to dive into this as well; my former co-worker
disabled DMA for SPI, for example, in the device tree. That may be
another hint. I think I will need to find out what is set up here.

Thank you for your advice, I will try to find out more.
Michael
* i.MX 6 and PCIe DMA issues
From: Russell King - ARM Linux @ 2017-07-13 9:04 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Jul 13, 2017 at 09:07:19AM +0200, michael.moese at men.de wrote:
> OK, I hope I have this right: I allocate my memory using
> dma_alloc_coherent() once, the dma_handle is passed to the device, and
> no other dma_map_*() or dma_sync_*() calls are needed?

Correct.

> Yesterday I observed some strange behavior when I added some debug
> prints to the driver, printing the result of phys_to_virt() for my
> virtual address and the dma_handle. I know these may be different, but
> when I dump the memory (from userspace using mmap), I can see the data
> at the address of my dma_handle, while the memory the driver holds the
> pointer to has different contents. I don't understand this.

phys_to_virt() and virt_to_phys() are only defined for the linear map
region. DMA coherent allocations may or may not return addresses in
the linear region, which means virt_to_phys() on a virtual address
returned from dma_alloc_coherent() is not defined.

DMA addresses are not physical addresses, they are device addresses -
this is an important distinction when IOMMUs or address translators are
in the path between the device and memory. phys_to_virt() must not be
used with dma_addr_t data types.

You must always use the two addresses returned from dma_alloc_coherent()
and never try to translate them - the DMA address is for programming
into the hardware, and the virtual address for CPU accesses. If you do
try to pass these to the various translators (like virt_to_phys()) the
result is undefined.

Moreover, passing the dma_addr_t returned from dma_alloc_coherent() to
the dma_sync_*() functions also results in undefined behaviour, and
potential data corruption.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
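A minimal sketch of that rule, with hypothetical names: keep both return
values of dma_alloc_coherent() side by side, program the device with the
dma_addr_t, access the memory through the CPU pointer, and never
translate between them:

#include <linux/device.h>
#include <linux/dma-mapping.h>

struct bd_table {
	void *cpu;	/* CPU accesses go through this pointer only */
	dma_addr_t dma;	/* this value is only ever given to the device */
	size_t size;
};

static int bd_table_alloc(struct device *dev, struct bd_table *t,
			  size_t size)
{
	t->size = size;
	/* No dma_map_*()/dma_sync_*() calls on this memory - it is
	 * already coherent for both CPU and device. */
	t->cpu = dma_alloc_coherent(dev, size, &t->dma, GFP_KERNEL);
	return t->cpu ? 0 : -ENOMEM;
}

static void bd_table_free(struct device *dev, struct bd_table *t)
{
	dma_free_coherent(dev, t->size, t->cpu, t->dma);
}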
* i.MX 6 and PCIe DMA issues
From: michael.moese at men.de @ 2017-07-13 10:00 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Jul 13, 2017 at 10:04:23AM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 13, 2017 at 09:07:19AM +0200, michael.moese at men.de wrote:
> > OK, I hope I have this right: I allocate my memory using
> > dma_alloc_coherent() once, the dma_handle is passed to the device,
> > and no other dma_map_*() or dma_sync_*() calls are needed?
>
> Correct.

Great. Thanks for the clarification. I will make sure the usage is done
the right way.

> You must always use the two addresses returned from dma_alloc_coherent()
> and never try to translate them - the DMA address is for programming
> into the hardware, and the virtual address for CPU accesses. If you do
> try to pass these to the various translators (like virt_to_phys()) the
> result is undefined.

I was trying to just print some addresses for debugging; I removed this
again. I'm just confused, because the DMA device address seems to be
the physical one. At least that memory contains my data, while the
virtual memory contains different data.

Could it be possible that this mapping on my board is going wrong?

Michael
* i.MX 6 and PCIe DMA issues
From: Russell King - ARM Linux @ 2017-07-13 10:18 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Jul 13, 2017 at 12:00:31PM +0200, michael.moese at men.de wrote:
> On Thu, Jul 13, 2017 at 10:04:23AM +0100, Russell King - ARM Linux wrote:
> > You must always use the two addresses returned from dma_alloc_coherent()
> > and never try to translate them - the DMA address is for programming
> > into the hardware, and the virtual address for CPU accesses. If you do
> > try to pass these to the various translators (like virt_to_phys()) the
> > result is undefined.
>
> I was trying to just print some addresses for debugging; I removed this
> again. I'm just confused, because the DMA device address seems to be
> the physical one. At least that memory contains my data, while the
> virtual memory contains different data.

The DMA address might be a physical address if that's what the device
requires, but equally, it might not be.

If it is a physical address, phys_to_virt() will return the virtual
address in the lowmem mapping provided that the physical address
corresponds with a lowmem mapped page. If it's a highmem page, then
phys_to_virt() will still return something that looks like a virtual
address, but it could be anything. It's undefined.

Accessing memory through that pointer could end up hitting a cacheable
mapping, which would destroy the semantics of the DMA coherent mapping,
thereby leading to unpredictable behaviour.

As I've already said, use _both_ pointers for a DMA coherent allocation.
If you want to dump the data in the DMA coherent memory, use the virtual
address returned from dma_alloc_coherent() - this will have the least
side effects. If you access it through phys_to_virt(dma_addr) then you
could be dragging the data into the caches, and the first time you print
it, it would look correct, but when the DMA device updates it, it
becomes stale, and even if you subsequently access it through the
correct pointer, you could continue receiving the stale data.

Just don't do this - using phys_to_virt() may seem like a short-cut,
but it really isn't.

> Could it be possible that this mapping on my board is going wrong?

Unlikely.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
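For debugging along those lines, a hedged sketch that dumps a coherent
buffer via the kernel virtual address dma_alloc_coherent() returned
(reusing the hypothetical bd_table from the earlier sketch), never via
phys_to_virt():

#include <linux/kernel.h>
#include <linux/printk.h>

static void bd_table_dump(const struct bd_table *t)
{
	/* t->cpu is the coherent mapping; going through
	 * phys_to_virt(t->dma) instead could drag stale data into
	 * the caches. Dump at most the first 64 bytes. */
	print_hex_dump(KERN_DEBUG, "bd: ", DUMP_PREFIX_OFFSET, 16, 4,
		       t->cpu, min_t(size_t, t->size, 64), false);
}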
* i.MX 6 and PCIe DMA issues
From: Robin Murphy @ 2017-07-13 14:57 UTC (permalink / raw)
To: linux-arm-kernel

On 13/07/17 08:07, michael.moese at men.de wrote:
> On Tue, Jul 11, 2017 at 03:50:19PM +0100, Robin Murphy wrote:
>> I don't much like the sound of that "and" there - coherent DMA
>> allocations are, as the name implies, already coherent for CPU and DMA
>> accesses, and require no maintenance; the streaming DMA API
>> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
>> kmalloced memory.
>
> OK, I hope I have this right: I allocate my memory using
> dma_alloc_coherent() once, the dma_handle is passed to the device, and
> no other dma_map_*() or dma_sync_*() calls are needed?
>
> [...]
>
>> The PL310 does have more than its fair share of wackiness, but unless
>> you also see DMA going wrong for the on-chip peripherals, the problem
>> is almost certainly down to the driver itself rather than the cache
>> configuration.
>
> Well, I think I need to dive into this as well; my former co-worker
> disabled DMA for SPI, for example, in the device tree. That may be
> another hint. I think I will need to find out what is set up here.

OK, the first thing I'd check is that you have the "arm,shared-override"
property on the DT node for the PL310 and that it's being applied
correctly (e.g. you're not entering the kernel with L2 already enabled).
Otherwise I think it might be possible to end up in a situation where
you get stale data into L2 that even reads from the non-cacheable remap
of the buffer can hit, which would probably look rather like what you're
describing.

Robin.
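A hedged device-tree sketch of that suggestion, assuming the common
imx6qdl.dtsi label "L2" for the arm,pl310-cache node - verify the label
and binding against your own tree:

/* In the board .dts: allow DMA masters' normal-memory transactions to
 * override the shareable attribute, per the PL310 cache binding. */
&L2 {
	arm,shared-override;
};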
* i.MX 6 and PCIe DMA issues
From: michael.moese at men.de @ 2017-07-14 11:07 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Jul 13, 2017 at 03:57:19PM +0100, Robin Murphy wrote:
> >> I don't much like the sound of that "and" there - coherent DMA
> >> allocations are, as the name implies, already coherent for CPU and DMA
> >> accesses, and require no maintenance; the streaming DMA API
> >> (dma_{map,unmap,sync}_*) on the other hand is *only* for use on
> >> kmalloced memory.

I removed all the calls to dma_{map,unmap,sync}_* and took special care
with the handling of the dma_addr_t's. I now dma_alloc_coherent() my
memory and do nothing else with the dma_addr_t's. It seems to work, at
least partly, now.

I had to grab a newer board, because we killed the old one while
soldering wires to trace PCI Express traffic. Maybe I had some weird
old CPU or some strange U-Boot, or I simply missed removing all
dma_sync_* calls or similar before.

Now I have other issues, but at least the DMA seems to be working.
Thanks to you, I think I understand this topic now.

Now I can try to figure out why a read over PCI Express may stop all
traffic - but that's a different story.

Thanks,
Michael
* i.MX 6 and PCIe DMA issues
From: Andrew Lunn @ 2017-07-11 17:56 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, Jul 11, 2017 at 01:40:57PM +0000, Moese, Michael wrote:
> Hello ARM folks,
> I turn to you in the hope that you have any hints or directions on the
> problem I describe below.
>
> I am currently investigating strange behavior of our i.MX 6 (Quad)
> board with an FPGA connected to PCI Express. This FPGA contains, among
> others, an Ethernet (10/100 Mbps) IP core.

FYI: I'm using an i.MX6 Quad with an Intel i210 PCIe Ethernet
controller. It works fine. So you might want to look at how the igb
driver does its buffer handling.

	Andrew
End of thread (newest: 2017-07-14 11:07 UTC)

Thread overview: 10+ messages
2017-07-11 13:40 i.MX 6 and PCIe DMA issues Moese, Michael
2017-07-11 14:07 ` Andrew Lunn
2017-07-11 14:50 ` Robin Murphy
2017-07-13  7:07 ` michael.moese at men.de
2017-07-13  9:04 ` Russell King - ARM Linux
2017-07-13 10:00 ` michael.moese at men.de
2017-07-13 10:18 ` Russell King - ARM Linux
2017-07-13 14:57 ` Robin Murphy
2017-07-14 11:07 ` michael.moese at men.de
2017-07-11 17:56 ` Andrew Lunn