* using DMA-API on ARM
@ 2014-12-05 9:22 Arend van Spriel
2014-12-05 9:45 ` Russell King - ARM Linux
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Arend van Spriel @ 2014-12-05 9:22 UTC (permalink / raw)
To: linux-arm-kernel
Hi Russell,
For our brcm80211 development we are working on getting brcmfmac driver
up and running on a Broadcom ARM-based platform. The wireless device is
a PCIe device, which is hooked up to the system behind a PCIe host
bridge, and we transfer information between host and device using a
descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
tested on x86 and seen no issue. However, on this ARM platform
(single-core A9) we detect occasionally that the descriptor content is
invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
retried a number of times if the problem persists. Actually, found out
that someone made a mistake by using virt_to_dma(va) to get the
dma_handle parameter. So probably we only provided a delay in the retry
loop. After fixing that a single call to dma_sync_single_for_cpu() is
sufficient. The DMA-API-HOWTO clearly states that:
"""
the hardware should guarantee that the device and the CPU can access the
data in parallel and will see updates made by each other without any
explicit software flushing.
"""
So it seems incorrect that we would need to do a dma_sync for this
memory. That we do need it seems like this memory can end up in
cache(?), or whatever happens, in some rare condition. Is there anyway
to investigate this situation either through DMA-API or some low-level
ARM specific functions.
Regards,
Arend
^ permalink raw reply [flat|nested] 41+ messages in thread* using DMA-API on ARM 2014-12-05 9:22 using DMA-API on ARM Arend van Spriel @ 2014-12-05 9:45 ` Russell King - ARM Linux 2014-12-05 12:24 ` Will Deacon 2014-12-05 12:43 ` Arend van Spriel 2014-12-05 9:52 ` Arnd Bergmann 2014-12-05 18:39 ` Catalin Marinas 2 siblings, 2 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 9:45 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote: > For our brcm80211 development we are working on getting brcmfmac driver > up and running on a Broadcom ARM-based platform. The wireless device is > a PCIe device, which is hooked up to the system behind a PCIe host > bridge, and we transfer information between host and device using a > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > tested on x86 and seen no issue. However, on this ARM platform > (single-core A9) we detect occasionally that the descriptor content is > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > retried a number of times if the problem persists. Actually, found out > that someone made a mistake by using virt_to_dma(va) to get the > dma_handle parameter. So probably we only provided a delay in the retry > loop. After fixing that a single call to dma_sync_single_for_cpu() is > sufficient. The DMA-API-HOWTO clearly states that: > > """ > the hardware should guarantee that the device and the CPU can access the > data in parallel and will see updates made by each other without any > explicit software flushing. > """ > > So it seems incorrect that we would need to do a dma_sync for this > memory. That we do need it seems like this memory can end up in > cache(?), or whatever happens, in some rare condition. Is there anyway > to investigate this situation either through DMA-API or some low-level > ARM specific functions. It's been a long while since I looked at the code, and the code for dma_alloc_coherent() has completely changed since then with the addition of CMA. I'm afraid that anything I would say about it would not be accurate without research into the possible paths through that code - it's no longer just a simple allocator. What you say is correct however: the memory should not have any cache lines associated with it, if it does, there's a bug somewhere. Also, the memory will be weakly ordered, which means that writes to such memory can be reordered. If ordering matters, barriers should be used. rmb() and wmb() can be used for this. (Added Marek for comment on dma_alloc_coherent(), Will for comment on barrier stuff.) -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 9:45 ` Russell King - ARM Linux @ 2014-12-05 12:24 ` Will Deacon 2014-12-05 12:56 ` Hante Meuleman 2014-12-05 12:43 ` Arend van Spriel 1 sibling, 1 reply; 41+ messages in thread From: Will Deacon @ 2014-12-05 12:24 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 09:45:08AM +0000, Russell King - ARM Linux wrote: > On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote: > > For our brcm80211 development we are working on getting brcmfmac driver > > up and running on a Broadcom ARM-based platform. The wireless device is > > a PCIe device, which is hooked up to the system behind a PCIe host > > bridge, and we transfer information between host and device using a > > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > > tested on x86 and seen no issue. However, on this ARM platform > > (single-core A9) we detect occasionally that the descriptor content is > > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > > retried a number of times if the problem persists. Actually, found out > > that someone made a mistake by using virt_to_dma(va) to get the > > dma_handle parameter. So probably we only provided a delay in the retry > > loop. After fixing that a single call to dma_sync_single_for_cpu() is > > sufficient. The DMA-API-HOWTO clearly states that: > > > > """ > > the hardware should guarantee that the device and the CPU can access the > > data in parallel and will see updates made by each other without any > > explicit software flushing. > > """ > > > > So it seems incorrect that we would need to do a dma_sync for this > > memory. That we do need it seems like this memory can end up in > > cache(?), or whatever happens, in some rare condition. Is there anyway > > to investigate this situation either through DMA-API or some low-level > > ARM specific functions. > > It's been a long while since I looked at the code, and the code for > dma_alloc_coherent() has completely changed since then with the > addition of CMA. I'm afraid that anything I would say about it would > not be accurate without research into the possible paths through that > code - it's no longer just a simple allocator. > > What you say is correct however: the memory should not have any cache > lines associated with it, if it does, there's a bug somewhere. > > Also, the memory will be weakly ordered, which means that writes to such > memory can be reordered. If ordering matters, barriers should be used. > rmb() and wmb() can be used for this. > > (Added Marek for comment on dma_alloc_coherent(), Will for comment on > barrier stuff.) I'm not quite clear on the issue being seen here: is this on write from the CPU to the descriptor ring, or the other way around (or both?). Either way, you need barriers on the CPU side to ensure ordering of accesses to the buffer. rmb/wmb will work, but are heavier than what you need (relaxed versions have been proposed on LKML recently). Will ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 12:24 ` Will Deacon @ 2014-12-05 12:56 ` Hante Meuleman 2014-12-05 13:23 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Hante Meuleman @ 2014-12-05 12:56 UTC (permalink / raw) To: linux-arm-kernel The problem is with data coming from device, so DMA from device to host. The DMA takes place from device local memory to host memory, where the host memory is allocated with dma_alloc_coherent, which we thought should not be cached. The host is an ARM (as is the device). The data being DMA'ed ends up in a ring buffer. This ring is only being read by host when it is a d2h ring (device to host). Each entry in the ring is 32 bytes, and contains a sequence number. The sequence number is a modulo 253 and the ring has 256 entries. At some point we read a sequence number which was "old". Then we loop to see if the sequence number changes. The loop is 1024 times and uses an rmb() call. This does not help. After looping 1024 times it is still reading the same value for sequence number. Now it can happen that 256 entries further we are still reading this old sequence (so iso reading a seqnum which is off by 3, it is off by 6). This was an indication that it was cached. So instead of using rmb() we used dma_sync_single_for_cpu. When using that call the problem was fixed. Whenever an old sequence number was read a single call to dma_sync_single_for_cpu would flush the cache and the next read would be correct. However: this indicates that dma_alloc_coherent on an ARM target may result in a memory buffer which can be cached which conflicts with the API of this function. This problem has sofar not been observed on x86 hosts. Regards, Hante -----Original Message----- From: Will Deacon [mailto:will.deacon at arm.com] Sent: vrijdag 5 december 2014 13:24 To: Russell King - ARM Linux Cc: Arend Van Spriel; Marek Szyprowski; linux-arm-kernel at lists.infradead.org; David Miller; linux-kernel at vger.kernel.org; brcm80211-dev-list; linux-wireless Subject: Re: using DMA-API on ARM On Fri, Dec 05, 2014 at 09:45:08AM +0000, Russell King - ARM Linux wrote: > On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote: > > For our brcm80211 development we are working on getting brcmfmac driver > > up and running on a Broadcom ARM-based platform. The wireless device is > > a PCIe device, which is hooked up to the system behind a PCIe host > > bridge, and we transfer information between host and device using a > > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > > tested on x86 and seen no issue. However, on this ARM platform > > (single-core A9) we detect occasionally that the descriptor content is > > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > > retried a number of times if the problem persists. Actually, found out > > that someone made a mistake by using virt_to_dma(va) to get the > > dma_handle parameter. So probably we only provided a delay in the retry > > loop. After fixing that a single call to dma_sync_single_for_cpu() is > > sufficient. The DMA-API-HOWTO clearly states that: > > > > """ > > the hardware should guarantee that the device and the CPU can access the > > data in parallel and will see updates made by each other without any > > explicit software flushing. > > """ > > > > So it seems incorrect that we would need to do a dma_sync for this > > memory. That we do need it seems like this memory can end up in > > cache(?), or whatever happens, in some rare condition. Is there anyway > > to investigate this situation either through DMA-API or some low-level > > ARM specific functions. > > It's been a long while since I looked at the code, and the code for > dma_alloc_coherent() has completely changed since then with the > addition of CMA. I'm afraid that anything I would say about it would > not be accurate without research into the possible paths through that > code - it's no longer just a simple allocator. > > What you say is correct however: the memory should not have any cache > lines associated with it, if it does, there's a bug somewhere. > > Also, the memory will be weakly ordered, which means that writes to such > memory can be reordered. If ordering matters, barriers should be used. > rmb() and wmb() can be used for this. > > (Added Marek for comment on dma_alloc_coherent(), Will for comment on > barrier stuff.) I'm not quite clear on the issue being seen here: is this on write from the CPU to the descriptor ring, or the other way around (or both?). Either way, you need barriers on the CPU side to ensure ordering of accesses to the buffer. rmb/wmb will work, but are heavier than what you need (relaxed versions have been proposed on LKML recently). Will ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 12:56 ` Hante Meuleman @ 2014-12-05 13:23 ` Russell King - ARM Linux 2014-12-05 14:20 ` Hante Meuleman 2014-12-05 15:06 ` Russell King - ARM Linux 0 siblings, 2 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 13:23 UTC (permalink / raw) To: linux-arm-kernel Please wrap your message - replying to a message which looks like this in my editor is far from easy, and gives me much more work to /manually/ reformat it before I can reply to it: On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > The problem is with data coming from device, so DMA from device to host. The $ > > However: this indicates that dma_alloc_coherent on an ARM target may result i$ > > Regards, > Hante Thanks. On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > However: this indicates that dma_alloc_coherent on an ARM target may > result in a memory buffer which can be cached which conflicts with > the API of this function. If the memory has an alias which is cacheable, it is possible for cache lines to get allocated via that alias, even if the alias has no explicit accesses to it. This is something which I've been going on for quite literally /years/ - mismatched cache attributes can cause unpredictable behaviour. I've had a lot of push back from people who are of the opinion that "if it works for me, then there isn't a problem" and I eventually gave up fighting the battle, especially as the ARM architecture people weakened my reasoning behind it by publishing a relaxation of the "no differing attributes" issue. This was particularly true of those who wanted to use ioremap() on system memory - and cases such as dma_init_coherent_memory(). So, I never fixed this problem in the original DMA allocator code; I basically gave up with it. It's a latent bug which did need to be fixed, and is still present today in the non-CMA case. The symptoms which you are reporting sound very much like this kind of problem - the virtual address for the memory returned by dma_alloc_coherent() will not be cacheable memory - it will have been remapped using map_vm_area(). However, there could very well be a fully cacheable lowmem mapping of that memory, which if a read (speculative or otherwise) will bring a cache line in, and because the caches are VIPT or PIPT, that cache line can be hit via the non-cacheable mapping too. What I /really/ need is more evidence of this to tell those disbelievers where to stick their flawed arguments. :) -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 13:23 ` Russell King - ARM Linux @ 2014-12-05 14:20 ` Hante Meuleman 2014-12-05 14:47 ` Arend van Spriel 2014-12-08 13:47 ` Hante Meuleman 2014-12-05 15:06 ` Russell King - ARM Linux 1 sibling, 2 replies; 41+ messages in thread From: Hante Meuleman @ 2014-12-05 14:20 UTC (permalink / raw) To: linux-arm-kernel Ok, I'll add the necessary debug to get all the information out, but it will take some time to get it done, so I won't have anything before Monday. -----Original Message----- From: Russell King - ARM Linux [mailto:linux at arm.linux.org.uk] Sent: vrijdag 5 december 2014 14:24 To: Hante Meuleman Cc: Will Deacon; Arend Van Spriel; Marek Szyprowski; linux-arm-kernel at lists.infradead.org; David Miller; linux-kernel at vger.kernel.org; brcm80211-dev-list; linux-wireless Subject: Re: using DMA-API on ARM Please wrap your message - replying to a message which looks like this in my editor is far from easy, and gives me much more work to /manually/ reformat it before I can reply to it: On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > The problem is with data coming from device, so DMA from device to host. The $ > > However: this indicates that dma_alloc_coherent on an ARM target may result i$ > > Regards, > Hante Thanks. On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > However: this indicates that dma_alloc_coherent on an ARM target may > result in a memory buffer which can be cached which conflicts with > the API of this function. If the memory has an alias which is cacheable, it is possible for cache lines to get allocated via that alias, even if the alias has no explicit accesses to it. This is something which I've been going on for quite literally /years/ - mismatched cache attributes can cause unpredictable behaviour. I've had a lot of push back from people who are of the opinion that "if it works for me, then there isn't a problem" and I eventually gave up fighting the battle, especially as the ARM architecture people weakened my reasoning behind it by publishing a relaxation of the "no differing attributes" issue. This was particularly true of those who wanted to use ioremap() on system memory - and cases such as dma_init_coherent_memory(). So, I never fixed this problem in the original DMA allocator code; I basically gave up with it. It's a latent bug which did need to be fixed, and is still present today in the non-CMA case. The symptoms which you are reporting sound very much like this kind of problem - the virtual address for the memory returned by dma_alloc_coherent() will not be cacheable memory - it will have been remapped using map_vm_area(). However, there could very well be a fully cacheable lowmem mapping of that memory, which if a read (speculative or otherwise) will bring a cache line in, and because the caches are VIPT or PIPT, that cache line can be hit via the non-cacheable mapping too. What I /really/ need is more evidence of this to tell those disbelievers where to stick their flawed arguments. :) -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 14:20 ` Hante Meuleman @ 2014-12-05 14:47 ` Arend van Spriel 2014-12-08 13:47 ` Hante Meuleman 1 sibling, 0 replies; 41+ messages in thread From: Arend van Spriel @ 2014-12-05 14:47 UTC (permalink / raw) To: linux-arm-kernel On 12/05/14 15:20, Hante Meuleman wrote: > Ok, I'll add the necessary debug to get all the information out, > but it will take some time to get it done, so I won't have anything > before Monday. > > -----Original Message----- > From: Russell King - ARM Linux [mailto:linux at arm.linux.org.uk] > Sent: vrijdag 5 december 2014 14:24 > To: Hante Meuleman > Cc: Will Deacon; Arend Van Spriel; Marek Szyprowski; linux-arm-kernel at lists.infradead.org; David Miller; linux-kernel at vger.kernel.org; brcm80211-dev-list; linux-wireless > Subject: Re: using DMA-API on ARM > > Please wrap your message - replying to a message which looks like this in > my editor is far from easy, and gives me much more work to /manually/ > reformat it before I can reply to it: That's what happens with corporate IT forcing to use Outlook. We can workaround that using Thunderbird on Citrix. I will enlighten Hante about that option :-) Regards, Arend > On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: >> The problem is with data coming from device, so DMA from device to host. The $ >> >> However: this indicates that dma_alloc_coherent on an ARM target may result i$ >> >> Regards, >> Hante > > Thanks. > > On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: >> However: this indicates that dma_alloc_coherent on an ARM target may >> result in a memory buffer which can be cached which conflicts with >> the API of this function. > > If the memory has an alias which is cacheable, it is possible for cache > lines to get allocated via that alias, even if the alias has no explicit > accesses to it. > > This is something which I've been going on for quite literally /years/ - > mismatched cache attributes can cause unpredictable behaviour. I've had > a lot of push back from people who are of the opinion that "if it works > for me, then there isn't a problem" and I eventually gave up fighting > the battle, especially as the ARM architecture people weakened my > reasoning behind it by publishing a relaxation of the "no differing > attributes" issue. This was particularly true of those who wanted to > use ioremap() on system memory - and cases such as > dma_init_coherent_memory(). > > So, I never fixed this problem in the original DMA allocator code; I > basically gave up with it. It's a latent bug which did need to be fixed, > and is still present today in the non-CMA case. > > The symptoms which you are reporting sound very much like this kind of > problem - the virtual address for the memory returned by > dma_alloc_coherent() will not be cacheable memory - it will have been > remapped using map_vm_area(). However, there could very well be a fully > cacheable lowmem mapping of that memory, which if a read (speculative or > otherwise) will bring a cache line in, and because the caches are VIPT > or PIPT, that cache line can be hit via the non-cacheable mapping too. > > What I /really/ need is more evidence of this to tell those disbelievers > where to stick their flawed arguments. :) > ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 14:20 ` Hante Meuleman 2014-12-05 14:47 ` Arend van Spriel @ 2014-12-08 13:47 ` Hante Meuleman 2014-12-08 15:01 ` Arnd Bergmann 1 sibling, 1 reply; 41+ messages in thread From: Hante Meuleman @ 2014-12-08 13:47 UTC (permalink / raw) To: linux-arm-kernel Still using outlook, but will limit the line length, I hope that works for the moment. Attached is a log with the requested information, it is a little bit non-standard though. The dump code from the mm was copied in the driver and called from there, mapping the prints back to our local printf, but it should produce the same. I did this because I didn't realize the table is static. Some background on the test setup: I'm using a Broadcom reference design AP platform with an BRCM 4708 host SOC. For the AP router platform the opensource packet OpenWRT was used. Some small modifications were made to get it to work on our HW. Only one core is enabled for the moment (no time to figure out how to enable the other one). Openwrt was configured to use kernel 3.18-rc2 and the brcmfmac of the compat-wireless code was updated with our latest code (minor patches, which have been submitted already). The device used is 43602 pcie device. Some modifications to the build system were made to enable PCIE. The test is to connect with a client to the AP and run iperf (TCP). The test can run for many hours without a problem, but sometimes fails very quickly. The log: first the ring allocation info is printed. Starting at 16.124847, ring 2, 3 and 4 are rings used for device to host. In this log the failure is on a read of ring 3. Ring 3 is 1024 entries of each 16 bytes. The next thing printed is the kernel page tables. Then some OpenWRT info and the logging of part of the connection setup. Then at 1780.130752 the logging of the failure starts. The sequence number is modulo 253 with ring size of 1024 matches an "old" entry (read 40, expected 52). Then the different pointers are printed followed by the kernel page table. The code does then a cache invalidate on the dma_handle and the next read the sequence number is correct. Regards, Hante Please wrap your message - replying to a message which looks like this in my editor is far from easy, and gives me much more work to /manually/ reformat it before I can reply to it: On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > The problem is with data coming from device, so DMA from device to host. The $ > > However: this indicates that dma_alloc_coherent on an ARM target may result i$ > > Regards, > Hante Thanks. On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote: > However: this indicates that dma_alloc_coherent on an ARM target may > result in a memory buffer which can be cached which conflicts with > the API of this function. If the memory has an alias which is cacheable, it is possible for cache lines to get allocated via that alias, even if the alias has no explicit accesses to it. This is something which I've been going on for quite literally /years/ - mismatched cache attributes can cause unpredictable behaviour. I've had a lot of push back from people who are of the opinion that "if it works for me, then there isn't a problem" and I eventually gave up fighting the battle, especially as the ARM architecture people weakened my reasoning behind it by publishing a relaxation of the "no differing attributes" issue. This was particularly true of those who wanted to use ioremap() on system memory - and cases such as dma_init_coherent_memory(). So, I never fixed this problem in the original DMA allocator code; I basically gave up with it. It's a latent bug which did need to be fixed, and is still present today in the non-CMA case. The symptoms which you are reporting sound very much like this kind of problem - the virtual address for the memory returned by dma_alloc_coherent() will not be cacheable memory - it will have been remapped using map_vm_area(). However, there could very well be a fully cacheable lowmem mapping of that memory, which if a read (speculative or otherwise) will bring a cache line in, and because the caches are VIPT or PIPT, that cache line can be hit via the non-cacheable mapping too. What I /really/ need is more evidence of this to tell those disbelievers where to stick their flawed arguments. :) -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: cache_fail_dmesg.txt URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20141208/b16580d5/attachment-0001.txt> ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 13:47 ` Hante Meuleman @ 2014-12-08 15:01 ` Arnd Bergmann 2014-12-08 15:17 ` Russell King - ARM Linux ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Arnd Bergmann @ 2014-12-08 15:01 UTC (permalink / raw) To: linux-arm-kernel On Monday 08 December 2014 13:47:38 Hante Meuleman wrote: > Still using outlook, but will limit the line length, I hope that works for the > moment. Attached is a log with the requested information, it is a little > bit non-standard though. The dump code from the mm was copied in > the driver and called from there, mapping the prints back to our local > printf, but it should produce the same. I did this because I didn't realize > the table is static. > > Some background on the test setup: I'm using a Broadcom reference > design AP platform with an BRCM 4708 host SOC. I think you are using the wrong dtb file, the log says this is a "Buffalo WZR-1750DHP", not the reference design. > For the AP router > platform the opensource packet OpenWRT was used. Some small > modifications were made to get it to work on our HW. Only one core > is enabled for the moment (no time to figure out how to enable the > other one). Openwrt was configured to use kernel 3.18-rc2 and > the brcmfmac of the compat-wireless code was updated with our > latest code (minor patches, which have been submitted already). > The device used is 43602 pcie device. Some modifications to the build > system were made to enable PCIE. The test is to connect with a > client to the AP and run iperf (TCP). The test can run for many hours > without a problem, but sometimes fails very quickly. The bcm4708 platform is maintained by Hauke Mehrtens, adding him to Cc. In your log, I see this message: [ 0.000000] PL310 OF: cache setting yield illegal associativity [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 Evidently the cache controller information in DT is incorrect and the setup may be wrong as a consequence, which may explain cache coherency problems. Can you verify that the AUX_CTRL value is the same one you see in a working kernel? > The log: first the ring allocation info is printed. Starting at > 16.124847, ring 2, 3 and 4 are rings used for device to host. In this > log the failure is on a read of ring 3. Ring 3 is 1024 entries of each > 16 bytes. The next thing printed is the kernel page tables. Then some > OpenWRT info and the logging of part of the connection setup. Then at > 1780.130752 the logging of the failure starts. The sequence number is > modulo 253 with ring size of 1024 matches an "old" entry (read 40, > expected 52). Then the different pointers are printed followed by > the kernel page table. The code does then a cache invalidate on the > dma_handle and the next read the sequence number is correct. How do you invalidate the cache? A dma_handle is of type dma_addr_t and we don't define an operation for that, nor does it make sense on an allocation from dma_alloc_coherent(). What happens if you take out the invalidate? Can you post the patch that you use (both platform and driver) relative to the snapshot of the the mainline kernel you are basing on? Arnd ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 15:01 ` Arnd Bergmann @ 2014-12-08 15:17 ` Russell King - ARM Linux 2014-12-08 15:22 ` Arnd Bergmann 2014-12-08 16:03 ` Catalin Marinas 2014-12-08 16:22 ` Arend van Spriel 2 siblings, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-08 15:17 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 04:01:32PM +0100, Arnd Bergmann wrote: > In your log, I see this message: > > [ 0.000000] PL310 OF: cache setting yield illegal associativity > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 > > Evidently the cache controller information in DT is incorrect and > the setup may be wrong as a consequence, which may explain cache > coherency problems. No. See d0b92845e54 (ARM: 8182/1: l2c: Make l2x0_cache_size_of_parse() return 'int') which was merged in rc3, post-dating this kernel. The original commit was buggy, and produces those harmless "illegal associativity" messages. They occur if DT *doesn't* specify the cache parameters, in which case we use the hardware-set value (which /should/ be correct.) We made these optional in DT as hardware really should set these correctly in the first place. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 15:17 ` Russell King - ARM Linux @ 2014-12-08 15:22 ` Arnd Bergmann 0 siblings, 0 replies; 41+ messages in thread From: Arnd Bergmann @ 2014-12-08 15:22 UTC (permalink / raw) To: linux-arm-kernel On Monday 08 December 2014 15:17:33 Russell King - ARM Linux wrote: > On Mon, Dec 08, 2014 at 04:01:32PM +0100, Arnd Bergmann wrote: > > In your log, I see this message: > > > > [ 0.000000] PL310 OF: cache setting yield illegal associativity > > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal > > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 > > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 > > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled > > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB > > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 > > > > Evidently the cache controller information in DT is incorrect and > > the setup may be wrong as a consequence, which may explain cache > > coherency problems. > > No. See d0b92845e54 (ARM: 8182/1: l2c: Make l2x0_cache_size_of_parse() return 'int') > which was merged in rc3, post-dating this kernel. The original commit > was buggy, and produces those harmless "illegal associativity" messages. > They occur if DT *doesn't* specify the cache parameters, in which case > we use the hardware-set value (which /should/ be correct.) > > We made these optional in DT as hardware really should set these > correctly in the first place. Ok, good. Arnd ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 15:01 ` Arnd Bergmann 2014-12-08 15:17 ` Russell King - ARM Linux @ 2014-12-08 16:03 ` Catalin Marinas 2014-12-08 17:01 ` Arend van Spriel 2014-12-08 16:22 ` Arend van Spriel 2 siblings, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-08 16:03 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote: > [ 0.000000] PL310 OF: cache setting yield illegal associativity > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 If the above value is correct, they should make sure bit 22 is set in AUX_CTRL. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 16:03 ` Catalin Marinas @ 2014-12-08 17:01 ` Arend van Spriel 2014-12-09 10:19 ` Arend van Spriel 0 siblings, 1 reply; 41+ messages in thread From: Arend van Spriel @ 2014-12-08 17:01 UTC (permalink / raw) To: linux-arm-kernel On 12/08/14 17:03, Catalin Marinas wrote: > On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote: >> [ 0.000000] PL310 OF: cache setting yield illegal associativity >> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal >> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 >> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 >> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled >> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB >> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 > > If the above value is correct, they should make sure bit 22 is set in > AUX_CTRL. Hante applied the patch and it now says: [ 0.000000] PL310 OF: cache setting yield illegal associativity [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e530001 He started running a test overnight. So will see if it hits the failure with this L2 cache configuration. Regards, Arend ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 17:01 ` Arend van Spriel @ 2014-12-09 10:19 ` Arend van Spriel 2014-12-09 10:29 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Arend van Spriel @ 2014-12-09 10:19 UTC (permalink / raw) To: linux-arm-kernel On 12/08/14 18:01, Arend van Spriel wrote: > On 12/08/14 17:03, Catalin Marinas wrote: >> On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote: >>> [ 0.000000] PL310 OF: cache setting yield illegal associativity >>> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal >>> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 >>> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 >>> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled >>> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB >>> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 >> >> If the above value is correct, they should make sure bit 22 is set in >> AUX_CTRL. > > Hante applied the patch and it now says: > > [ 0.000000] PL310 OF: cache setting yield illegal associativity > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e530001 > > He started running a test overnight. So will see if it hits the failure > with this L2 cache configuration. The issue did not trigger overnight so it seems setting bit 22 <Shared Attribute _Override_ Enable> solves the issue over here. Now the question is how to move forward with this. As I understood from Catalin this patch was not included as it was not considered responsibility of the linux kernel. Regards, Arend ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-09 10:19 ` Arend van Spriel @ 2014-12-09 10:29 ` Russell King - ARM Linux 2014-12-09 11:07 ` Arend van Spriel ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-09 10:29 UTC (permalink / raw) To: linux-arm-kernel On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote: > The issue did not trigger overnight so it seems setting bit 22 <Shared > Attribute _Override_ Enable> solves the issue over here. Now the question is > how to move forward with this. As I understood from Catalin this patch was > not included as it was not considered responsibility of the linux kernel. It is preferable for firmware to configure the L2 cache appropriately, which includes things like the prefetch offsets as well as feature bits like bit 22. I think what I'll do is queue up a patch which adds a warning if bit 22 is not set, suggesting that firmware is updated to set this bit. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-09 10:29 ` Russell King - ARM Linux @ 2014-12-09 11:07 ` Arend van Spriel 2014-12-09 11:54 ` Catalin Marinas 2015-01-20 15:22 ` Fabio Estevam 2 siblings, 0 replies; 41+ messages in thread From: Arend van Spriel @ 2014-12-09 11:07 UTC (permalink / raw) To: linux-arm-kernel On 12/09/14 11:29, Russell King - ARM Linux wrote: > On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote: >> The issue did not trigger overnight so it seems setting bit 22<Shared >> Attribute _Override_ Enable> solves the issue over here. Now the question is >> how to move forward with this. As I understood from Catalin this patch was >> not included as it was not considered responsibility of the linux kernel. > > It is preferable for firmware to configure the L2 cache appropriately, > which includes things like the prefetch offsets as well as feature bits > like bit 22. > > I think what I'll do is queue up a patch which adds a warning if bit 22 > is not set, suggesting that firmware is updated to set this bit. I was thinking in the same direction. Thanks to you all for looking into this. It did not feel right to use the dma_sync_single_for_cpu() for memory allocated with dma_alloc_coherent() and I am glad this got cleared up. Regards, Arend ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-09 10:29 ` Russell King - ARM Linux 2014-12-09 11:07 ` Arend van Spriel @ 2014-12-09 11:54 ` Catalin Marinas 2015-01-20 15:22 ` Fabio Estevam 2 siblings, 0 replies; 41+ messages in thread From: Catalin Marinas @ 2014-12-09 11:54 UTC (permalink / raw) To: linux-arm-kernel On Tue, Dec 09, 2014 at 10:29:05AM +0000, Russell King - ARM Linux wrote: > On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote: > > The issue did not trigger overnight so it seems setting bit 22 <Shared > > Attribute _Override_ Enable> solves the issue over here. Now the question is > > how to move forward with this. As I understood from Catalin this patch was > > not included as it was not considered responsibility of the linux kernel. > > It is preferable for firmware to configure the L2 cache appropriately, > which includes things like the prefetch offsets as well as feature bits > like bit 22. > > I think what I'll do is queue up a patch which adds a warning if bit 22 > is not set, suggesting that firmware is updated to set this bit. I'm fine with a (big) warning on this bit. But when you boot in secure mode on 32-bit, do we still have a read/modify/write sequence for L2X0_AUX_CTRL? A quick look at l2c310_enable() didn't reveal this (the code has changed since I proposed the bit 22 setting patch). -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-09 10:29 ` Russell King - ARM Linux 2014-12-09 11:07 ` Arend van Spriel 2014-12-09 11:54 ` Catalin Marinas @ 2015-01-20 15:22 ` Fabio Estevam 2 siblings, 0 replies; 41+ messages in thread From: Fabio Estevam @ 2015-01-20 15:22 UTC (permalink / raw) To: linux-arm-kernel Hi Russell, On Tue, Dec 9, 2014 at 8:29 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote: >> The issue did not trigger overnight so it seems setting bit 22 <Shared >> Attribute _Override_ Enable> solves the issue over here. Now the question is >> how to move forward with this. As I understood from Catalin this patch was >> not included as it was not considered responsibility of the linux kernel. > > It is preferable for firmware to configure the L2 cache appropriately, > which includes things like the prefetch offsets as well as feature bits > like bit 22. > > I think what I'll do is queue up a patch which adds a warning if bit 22 > is not set, suggesting that firmware is updated to set this bit. Do you mean something like this? --- a/arch/arm/mm/cache-l2x0.c +++ b/arch/arm/mm/cache-l2x0.c @@ -943,6 +943,10 @@ static int __init __l2c_init(const struct l2c_init_data *data, pr_info("%s: CACHE_ID 0x%08x, AUX_CTRL 0x%08x\n", data->type, cache_id, aux); + if (!(aux & L2C_AUX_CTRL_SHARED_OVERRIDE)) + pr_warn("%s: L2C_AUX_CTRL_SHARED_OVERRIDE needs to be set by the bootloader\n", + data->type); + return 0; } ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 15:01 ` Arnd Bergmann 2014-12-08 15:17 ` Russell King - ARM Linux 2014-12-08 16:03 ` Catalin Marinas @ 2014-12-08 16:22 ` Arend van Spriel 2014-12-08 16:38 ` Arnd Bergmann 2 siblings, 1 reply; 41+ messages in thread From: Arend van Spriel @ 2014-12-08 16:22 UTC (permalink / raw) To: linux-arm-kernel On 12/08/14 16:01, Arnd Bergmann wrote: > On Monday 08 December 2014 13:47:38 Hante Meuleman wrote: >> Still using outlook, but will limit the line length, I hope that works for the >> moment. Attached is a log with the requested information, it is a little >> bit non-standard though. The dump code from the mm was copied in >> the driver and called from there, mapping the prints back to our local >> printf, but it should produce the same. I did this because I didn't realize >> the table is static. >> >> Some background on the test setup: I'm using a Broadcom reference >> design AP platform with an BRCM 4708 host SOC. > > I think you are using the wrong dtb file, the log says this is > a "Buffalo WZR-1750DHP", not the reference design. That router is close enough to the reference design. >> For the AP router >> platform the opensource packet OpenWRT was used. Some small >> modifications were made to get it to work on our HW. Only one core >> is enabled for the moment (no time to figure out how to enable the >> other one). Openwrt was configured to use kernel 3.18-rc2 and >> the brcmfmac of the compat-wireless code was updated with our >> latest code (minor patches, which have been submitted already). >> The device used is 43602 pcie device. Some modifications to the build >> system were made to enable PCIE. The test is to connect with a >> client to the AP and run iperf (TCP). The test can run for many hours >> without a problem, but sometimes fails very quickly. > > The bcm4708 platform is maintained by Hauke Mehrtens, adding him to Cc. Thanks. While going through the DTS files I intended to add him as well ;-) > In your log, I see this message: > > [ 0.000000] PL310 OF: cache setting yield illegal associativity > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9 > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9 > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001 > > Evidently the cache controller information in DT is incorrect and > the setup may be wrong as a consequence, which may explain cache > coherency problems. While staring at the DTS files I suspect there are some parts still missing. I have attached them for reference. Catalin pointed us to a patch in the l2 cache [1]. We have not tried that yet. > Can you verify that the AUX_CTRL value is the same one you see > in a working kernel? > >> The log: first the ring allocation info is printed. Starting at >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each >> 16 bytes. The next thing printed is the kernel page tables. Then some >> OpenWRT info and the logging of part of the connection setup. Then at >> 1780.130752 the logging of the failure starts. The sequence number is >> modulo 253 with ring size of 1024 matches an "old" entry (read 40, >> expected 52). Then the different pointers are printed followed by >> the kernel page table. The code does then a cache invalidate on the >> dma_handle and the next read the sequence number is correct. > > How do you invalidate the cache? A dma_handle is of type dma_addr_t > and we don't define an operation for that, nor does it make sense > on an allocation from dma_alloc_coherent(). What happens if you > take out the invalidate? dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating the cache (or that is our suspicion). > Can you post the patch that you use (both platform and driver) relative > to the snapshot of the the mainline kernel you are basing on? > > Arnd > Regards, Arend [1] http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 -------------- next part -------------- A non-text attachment was scrubbed... Name: bcm-dt-files.tar.bz2 Type: application/x-bzip2 Size: 2139 bytes Desc: not available URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20141208/43560d5e/attachment.bz2> ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 16:22 ` Arend van Spriel @ 2014-12-08 16:38 ` Arnd Bergmann 2014-12-08 16:47 ` Russell King - ARM Linux 2014-12-08 16:50 ` Catalin Marinas 0 siblings, 2 replies; 41+ messages in thread From: Arnd Bergmann @ 2014-12-08 16:38 UTC (permalink / raw) To: linux-arm-kernel On Monday 08 December 2014 17:22:44 Arend van Spriel wrote: > >> The log: first the ring allocation info is printed. Starting at > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each > >> 16 bytes. The next thing printed is the kernel page tables. Then some > >> OpenWRT info and the logging of part of the connection setup. Then at > >> 1780.130752 the logging of the failure starts. The sequence number is > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40, > >> expected 52). Then the different pointers are printed followed by > >> the kernel page table. The code does then a cache invalidate on the > >> dma_handle and the next read the sequence number is correct. > > > > How do you invalidate the cache? A dma_handle is of type dma_addr_t > > and we don't define an operation for that, nor does it make sense > > on an allocation from dma_alloc_coherent(). What happens if you > > take out the invalidate? > > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating > the cache (or that is our suspicion). I'm not sure about that: static void arm_dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle, size_t size, enum dma_data_direction dir) { unsigned int offset = handle & (PAGE_SIZE - 1); struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset)); __dma_page_dev_to_cpu(page, offset, size, dir); } Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping, dma_to_pfn will return the correct pfn here, but pfn_to_page will return a page pointer into the kernel linear mapping, which is not the same as the pointer you get from __alloc_remap_buffer(). The pointer that was returned from dma_alloc_coherent is a) non-cachable, and b) not the same that you flush here. Arnd ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 16:38 ` Arnd Bergmann @ 2014-12-08 16:47 ` Russell King - ARM Linux 2014-12-08 16:50 ` Catalin Marinas 1 sibling, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-08 16:47 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 05:38:57PM +0100, Arnd Bergmann wrote: > On Monday 08 December 2014 17:22:44 Arend van Spriel wrote: > > >> The log: first the ring allocation info is printed. Starting at > > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this > > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each > > >> 16 bytes. The next thing printed is the kernel page tables. Then some > > >> OpenWRT info and the logging of part of the connection setup. Then at > > >> 1780.130752 the logging of the failure starts. The sequence number is > > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40, > > >> expected 52). Then the different pointers are printed followed by > > >> the kernel page table. The code does then a cache invalidate on the > > >> dma_handle and the next read the sequence number is correct. > > > > > > How do you invalidate the cache? A dma_handle is of type dma_addr_t > > > and we don't define an operation for that, nor does it make sense > > > on an allocation from dma_alloc_coherent(). What happens if you > > > take out the invalidate? > > > > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating > > the cache (or that is our suspicion). > > I'm not sure about that: > > static void arm_dma_sync_single_for_cpu(struct device *dev, > dma_addr_t handle, size_t size, enum dma_data_direction dir) > { > unsigned int offset = handle & (PAGE_SIZE - 1); > struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset)); > __dma_page_dev_to_cpu(page, offset, size, dir); > } > > Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping, > dma_to_pfn will return the correct pfn here, but pfn_to_page will return a > page pointer into the kernel linear mapping, which is not the same > as the pointer you get from __alloc_remap_buffer(). The pointer that > was returned from dma_alloc_coherent is a) non-cachable, and b) not the > same that you flush here. Having looked up the details of the Cortex CPU TRMs: 1. The caches are PIPT. 2. A non-cacheable mapping will not hit L1 cache lines which may be allocated against the same physical address. (This is implementation specific.) So, the problem can't be the L1 cache, it has to be the L2 cache. The L2 cache only deals with physical addresses, so it doesn't really matter which mapping gets flushed - the result will be the same as far as the L2 cache is concerned. If bit 22 is not set in the auxcr, then a non-cacheable access can hit a cache line which may be allocated in the L2 cache (which may have been allocated via a speculative prefetch via the cacheable mapping.) In the case which has been supplied, the physical address does indeed have two mappings: it has a lowmem mapping which is cacheable, and it has the DMA mapping which is marked as non-cacheable. Accesses via the non-cacheable mapping will not hit L1 (that's an implementation specific behaviour.) However, they may hit L2 if bit 22 is clear. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 16:38 ` Arnd Bergmann 2014-12-08 16:47 ` Russell King - ARM Linux @ 2014-12-08 16:50 ` Catalin Marinas 2014-12-08 16:54 ` Russell King - ARM Linux 1 sibling, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-08 16:50 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 04:38:57PM +0000, Arnd Bergmann wrote: > On Monday 08 December 2014 17:22:44 Arend van Spriel wrote: > > >> The log: first the ring allocation info is printed. Starting at > > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this > > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each > > >> 16 bytes. The next thing printed is the kernel page tables. Then some > > >> OpenWRT info and the logging of part of the connection setup. Then at > > >> 1780.130752 the logging of the failure starts. The sequence number is > > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40, > > >> expected 52). Then the different pointers are printed followed by > > >> the kernel page table. The code does then a cache invalidate on the > > >> dma_handle and the next read the sequence number is correct. > > > > > > How do you invalidate the cache? A dma_handle is of type dma_addr_t > > > and we don't define an operation for that, nor does it make sense > > > on an allocation from dma_alloc_coherent(). What happens if you > > > take out the invalidate? > > > > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating > > the cache (or that is our suspicion). > > I'm not sure about that: > > static void arm_dma_sync_single_for_cpu(struct device *dev, > dma_addr_t handle, size_t size, enum dma_data_direction dir) > { > unsigned int offset = handle & (PAGE_SIZE - 1); > struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset)); > __dma_page_dev_to_cpu(page, offset, size, dir); > } > > Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping, > dma_to_pfn will return the correct pfn here, but pfn_to_page will return a > page pointer into the kernel linear mapping, Or a highmem page, both should be handled by dma_cache_maint_page(). > which is not the same > as the pointer you get from __alloc_remap_buffer(). The pointer that > was returned from dma_alloc_coherent is a) non-cachable, and b) not the > same that you flush here. Correct. But apart from the fact that you don't need to flush buffers allocated with dma_alloc_coherent(), the above sync_single would work on ARMv7 where the D-cache is PIPT, so the virtual address doesn't matter much as long as it maps the same physical address. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 16:50 ` Catalin Marinas @ 2014-12-08 16:54 ` Russell King - ARM Linux 0 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-08 16:54 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 04:50:43PM +0000, Catalin Marinas wrote: > On Mon, Dec 08, 2014 at 04:38:57PM +0000, Arnd Bergmann wrote: > > On Monday 08 December 2014 17:22:44 Arend van Spriel wrote: > > > >> The log: first the ring allocation info is printed. Starting at > > > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this > > > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each > > > >> 16 bytes. The next thing printed is the kernel page tables. Then some > > > >> OpenWRT info and the logging of part of the connection setup. Then at > > > >> 1780.130752 the logging of the failure starts. The sequence number is > > > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40, > > > >> expected 52). Then the different pointers are printed followed by > > > >> the kernel page table. The code does then a cache invalidate on the > > > >> dma_handle and the next read the sequence number is correct. > > > > > > > > How do you invalidate the cache? A dma_handle is of type dma_addr_t > > > > and we don't define an operation for that, nor does it make sense > > > > on an allocation from dma_alloc_coherent(). What happens if you > > > > take out the invalidate? > > > > > > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating > > > the cache (or that is our suspicion). > > > > I'm not sure about that: > > > > static void arm_dma_sync_single_for_cpu(struct device *dev, > > dma_addr_t handle, size_t size, enum dma_data_direction dir) > > { > > unsigned int offset = handle & (PAGE_SIZE - 1); > > struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset)); > > __dma_page_dev_to_cpu(page, offset, size, dir); > > } > > > > Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping, > > dma_to_pfn will return the correct pfn here, but pfn_to_page will return a > > page pointer into the kernel linear mapping, > > Or a highmem page, both should be handled by dma_cache_maint_page(). A valid point, but one which is irrelevant to this thread, because we're talking about a platform with only 128MB, and a PAGE_OFFSET of 2GB (hence no highmem): Memory: 125936K/131072K available (2682K kernel code, 103K rwdata, 744K rodata, 164K init, 188K bss, 5136K reserved) Can we stay on-point to getting this problem solved, rather than drifting off topic please? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 13:23 ` Russell King - ARM Linux 2014-12-05 14:20 ` Hante Meuleman @ 2014-12-05 15:06 ` Russell King - ARM Linux 2014-12-05 18:28 ` Catalin Marinas 1 sibling, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 15:06 UTC (permalink / raw) To: linux-arm-kernel I've been doing more digging into the current DMA code, and I'm dismayed to see that there's new bugs in it... commit 513510ddba9650fc7da456eefeb0ead7632324f6 Author: Laura Abbott <lauraa@codeaurora.org> Date: Thu Oct 9 15:26:40 2014 -0700 common: dma-mapping: introduce common remapping functions This uses map_vm_area() to achieve the remapping of pages allocated inside dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather round-about way in Documentation/DMA-API.txt: | Part Ia - Using large DMA-coherent buffers | ------------------------------------------ | | void * | dma_alloc_coherent(struct device *dev, size_t size, | dma_addr_t *dma_handle, gfp_t flag) | | void | dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, | dma_addr_t dma_handle) | | Free a region of consistent memory you previously allocated. dev, | size and dma_handle must all be the same as those passed into | dma_alloc_coherent(). cpu_addr must be the virtual address returned by | the dma_alloc_coherent(). | | Note that unlike their sibling allocation calls, these routines | may only be called with IRQs enabled. Note that very last paragraph. What this says is that it is explicitly permitted to call dma_alloc_coherent() with IRQs disabled. Now, the question is: is it safe to call map_vm_area() with IRQs disabled? Well, map_vm_area() calls pud_alloc(), pmd_alloc(), and pte_alloc_kernel(). These functions all call into the kernel memory allocator *without* GFP_ATOMIC - in other words, these allocations are permitted to sleep. Except, IRQs are off, so it's a bug to call these functions from dma_alloc_coherent(). Now, if we look at the previous code, it used ioremap_page_range(). This has the same problem: it needs to allocate page tables, and it can only do it via functions which may sleep. If we go back even further, we find that the use of ioremap_page_range() in dma_alloc_coherent() was introduced by: commit e9da6e9905e639b0f842a244bc770b48ad0523e9 Author: Marek Szyprowski <m.szyprowski@samsung.com> Date: Mon Jul 30 09:11:33 2012 +0200 ARM: dma-mapping: remove custom consistent dma region which is the commit which removed my pre-allocated page tables for the DMA re-mapping region - code which I explicitly had to specifically avoid this issue. Obviously, this isn't a big problem, because people haven't reported that they've hit any of the might_sleep() checks in the memory allocators, which I think is our only saving grace - but it's still wrong to the specified calling conditions of the DMA API. If the problem which you (Broadcom) are suffering from is down to the issue I suspect (that being having mappings with different cache attributes) then I'm not sure that there's anything we can realistically do about that. There's a number of issues which make it hard to see a way forward. One example is that if we allocate memory, we need to be able to change (or remove) the cacheable mappings associated with that memory. We'd need to touch the L1 page table, either to change the attributes of the section mapping, or to convert the section mapping to a L2 page table pointer. We need to change the attributes in a break-flush-make sequence to avoid TLB conflicts. However, those mappings may be shared between other CPUs in a SMP system. So, we would need to flush the TLBs on other CPUs before we could proceed to create replacement mappings. That means something like stop_machine() or sending (and waiting for completion) of an IPI to the other CPUs. That is totally impractical due to dma_alloc_coherent() being allowed to be called with IRQs off. I'll continue to think about it, but I don't see many possibilities to satisfy dma_alloc_coherent()'s documented requirements other than by pre-allocating a chunk of memory at boot time to be served out as DMA-able memory for these horrid cases. I don't see much point in keeping the map_vm_area() approach on ARM even if we did fallback - if we're re-establishing mappings for the surrounding pages in lowmem, we might as well insert appropriately attributed mappings for the DMA memory as well. On the face of it, it would be better to allocate one section at a time, but my unfortunate experience is that 3.x kernels are a /lot/ more trigger happy with the OOM killer, and the chances of being able to allocate 1MB of memory at a go after the system has been running for a while is near-on impossible. So I don't think that's a reality. Even if we did break up section mappings in this way, it would also mean that over time, we'd end up with much of lowmem mapped using 4K page table entries, which would place significant pressure on the MMU TLBs. So, we might just be far better off pre-allocating enough "DMA coherent" memory at boot time and be done with it. Those who want it dynamic can use CMA instead. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 15:06 ` Russell King - ARM Linux @ 2014-12-05 18:28 ` Catalin Marinas 2014-12-05 19:22 ` Arend van Spriel 0 siblings, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-05 18:28 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 03:06:48PM +0000, Russell King - ARM Linux wrote: > I've been doing more digging into the current DMA code, and I'm dismayed > to see that there's new bugs in it... > > commit 513510ddba9650fc7da456eefeb0ead7632324f6 > Author: Laura Abbott <lauraa@codeaurora.org> > Date: Thu Oct 9 15:26:40 2014 -0700 > > common: dma-mapping: introduce common remapping functions > > This uses map_vm_area() to achieve the remapping of pages allocated inside > dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather > round-about way in Documentation/DMA-API.txt: > > | Part Ia - Using large DMA-coherent buffers > | ------------------------------------------ > | > | void * > | dma_alloc_coherent(struct device *dev, size_t size, > | dma_addr_t *dma_handle, gfp_t flag) > | > | void > | dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, > | dma_addr_t dma_handle) > | > | Free a region of consistent memory you previously allocated. dev, > | size and dma_handle must all be the same as those passed into > | dma_alloc_coherent(). cpu_addr must be the virtual address returned by > | the dma_alloc_coherent(). > | > | Note that unlike their sibling allocation calls, these routines > | may only be called with IRQs enabled. > > Note that very last paragraph. What this says is that it is explicitly > permitted to call dma_alloc_coherent() with IRQs disabled. This is solved by using a pre-allocated, pre-mapped atomic_pool which avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when !__GFP_WAIT. This code got pretty complex and we may find bugs. It can be simplified by a pre-allocated non-cacheable region that is safe in atomic context (how big you allocate this is hard to say). > If the problem which you (Broadcom) are suffering from is down to the > issue I suspect (that being having mappings with different cache > attributes) then I'm not sure that there's anything we can realistically > do about that. There's a number of issues which make it hard to see a > way forward. I'm still puzzled by this problem, so I don't have any suggestion yet. I wouldn't blame the mismatched attributes yet as I haven't seen such problem in practice (but you never know). How does the DT describe this device? Could it have some dma-coherent property in there that causes dma_alloc_coherent() to create a cacheable memory? The reverse could also cause problems: the device is coherent but the CPU creates a non-cacheable mapping. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 18:28 ` Catalin Marinas @ 2014-12-05 19:22 ` Arend van Spriel 2014-12-05 19:25 ` Russell King - ARM Linux 0 siblings, 1 reply; 41+ messages in thread From: Arend van Spriel @ 2014-12-05 19:22 UTC (permalink / raw) To: linux-arm-kernel On 12/05/14 19:28, Catalin Marinas wrote: > On Fri, Dec 05, 2014 at 03:06:48PM +0000, Russell King - ARM Linux wrote: >> I've been doing more digging into the current DMA code, and I'm dismayed >> to see that there's new bugs in it... >> >> commit 513510ddba9650fc7da456eefeb0ead7632324f6 >> Author: Laura Abbott<lauraa@codeaurora.org> >> Date: Thu Oct 9 15:26:40 2014 -0700 >> >> common: dma-mapping: introduce common remapping functions >> >> This uses map_vm_area() to achieve the remapping of pages allocated inside >> dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather >> round-about way in Documentation/DMA-API.txt: >> >> | Part Ia - Using large DMA-coherent buffers >> | ------------------------------------------ >> | >> | void * >> | dma_alloc_coherent(struct device *dev, size_t size, >> | dma_addr_t *dma_handle, gfp_t flag) >> | >> | void >> | dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, >> | dma_addr_t dma_handle) >> | >> | Free a region of consistent memory you previously allocated. dev, >> | size and dma_handle must all be the same as those passed into >> | dma_alloc_coherent(). cpu_addr must be the virtual address returned by >> | the dma_alloc_coherent(). >> | >> | Note that unlike their sibling allocation calls, these routines >> | may only be called with IRQs enabled. >> >> Note that very last paragraph. What this says is that it is explicitly >> permitted to call dma_alloc_coherent() with IRQs disabled. > > This is solved by using a pre-allocated, pre-mapped atomic_pool which > avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when > !__GFP_WAIT. So we are actually calling dma_alloc_coherent() with GFP_KERNEL during device probe. That last paragraph Russell pointed out seems to suggest this is not allowed. > This code got pretty complex and we may find bugs. It can be simplified > by a pre-allocated non-cacheable region that is safe in atomic context > (how big you allocate this is hard to say). > >> If the problem which you (Broadcom) are suffering from is down to the >> issue I suspect (that being having mappings with different cache >> attributes) then I'm not sure that there's anything we can realistically >> do about that. There's a number of issues which make it hard to see a >> way forward. > > I'm still puzzled by this problem, so I don't have any suggestion yet. I > wouldn't blame the mismatched attributes yet as I haven't seen such > problem in practice (but you never know). > > How does the DT describe this device? Could it have some dma-coherent > property in there that causes dma_alloc_coherent() to create a cacheable > memory? Ok. Will add it to our todo list: check DTS files for dma-coherent property. Thanks, Arend > The reverse could also cause problems: the device is coherent but the > CPU creates a non-cacheable mapping. > ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 19:22 ` Arend van Spriel @ 2014-12-05 19:25 ` Russell King - ARM Linux 0 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 19:25 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 08:22:05PM +0100, Arend van Spriel wrote: > On 12/05/14 19:28, Catalin Marinas wrote: > >This is solved by using a pre-allocated, pre-mapped atomic_pool which > >avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when > >!__GFP_WAIT. > > So we are actually calling dma_alloc_coherent() with GFP_KERNEL during > device probe. That last paragraph Russell pointed out seems to suggest this > is not allowed. device probe is a schedulable, sleepable context, so dma_alloc_coherent() is fine there. As Catalin points out, and as I realised after sending them ail, it does check for __GFP_WAIT and uses a smaller atomic pool for those allocations. This explains why no one has hit any warnings in map_vm_area. So, it's safe from atomic contexts after all. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 9:45 ` Russell King - ARM Linux 2014-12-05 12:24 ` Will Deacon @ 2014-12-05 12:43 ` Arend van Spriel 2014-12-05 12:59 ` Russell King - ARM Linux 1 sibling, 1 reply; 41+ messages in thread From: Arend van Spriel @ 2014-12-05 12:43 UTC (permalink / raw) To: linux-arm-kernel On 05-12-14 10:45, Russell King - ARM Linux wrote: > On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote: >> For our brcm80211 development we are working on getting brcmfmac driver >> up and running on a Broadcom ARM-based platform. The wireless device is >> a PCIe device, which is hooked up to the system behind a PCIe host >> bridge, and we transfer information between host and device using a >> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly >> tested on x86 and seen no issue. However, on this ARM platform >> (single-core A9) we detect occasionally that the descriptor content is >> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is >> retried a number of times if the problem persists. Actually, found out >> that someone made a mistake by using virt_to_dma(va) to get the >> dma_handle parameter. So probably we only provided a delay in the retry >> loop. After fixing that a single call to dma_sync_single_for_cpu() is >> sufficient. The DMA-API-HOWTO clearly states that: >> >> """ >> the hardware should guarantee that the device and the CPU can access the >> data in parallel and will see updates made by each other without any >> explicit software flushing. >> """ >> >> So it seems incorrect that we would need to do a dma_sync for this >> memory. That we do need it seems like this memory can end up in >> cache(?), or whatever happens, in some rare condition. Is there anyway >> to investigate this situation either through DMA-API or some low-level >> ARM specific functions. > > It's been a long while since I looked at the code, and the code for > dma_alloc_coherent() has completely changed since then with the > addition of CMA. I'm afraid that anything I would say about it would > not be accurate without research into the possible paths through that > code - it's no longer just a simple allocator. I know. On this particular platform we are not using CMA. > What you say is correct however: the memory should not have any cache > lines associated with it, if it does, there's a bug somewhere. > > Also, the memory will be weakly ordered, which means that writes to such > memory can be reordered. If ordering matters, barriers should be used. > rmb() and wmb() can be used for this. Ok. You already had a peek in our code checking the memory barriers, which does not have the dma_sync_single_for_cpu() "workaround" yet. So here some more background. The problem is in DMA_FROM_DEVICE direction. Because of the possible reordering issue we first tried using rmb() in the retry loop but that did not solve it. Another experiment was to ignore the failed ring descriptor entry and proceed. So we get interrupt from device and access the ring descriptor entry. This should contain expected value X, however we get X-1 back. When proceeding everything works find until hitting the same ring descriptor entry again reading X-1 when X+1 would be valid. This lead us to the assumption that somehow this entry ended up in cache lines. The issue goes away using the dma_sync_single_for_cpu() with DMA_FROM_DEVICE in direction parameter. We are not longer using virt_to_dma() so that is no longer an issue. So is there any function interface to verify cache status. Regards, Arend > (Added Marek for comment on dma_alloc_coherent(), Will for comment on > barrier stuff.) > ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 12:43 ` Arend van Spriel @ 2014-12-05 12:59 ` Russell King - ARM Linux 0 siblings, 0 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 12:59 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 01:43:01PM +0100, Arend van Spriel wrote: > Ok. You already had a peek in our code checking the memory barriers, which > does not have the dma_sync_single_for_cpu() "workaround" yet. So here some > more background. The problem is in DMA_FROM_DEVICE direction. Because of the > possible reordering issue we first tried using rmb() in the retry loop but > that did not solve it. Another experiment was to ignore the failed ring > descriptor entry and proceed. So we get interrupt from device and access the > ring descriptor entry. This should contain expected value X, however we get > X-1 back. When proceeding everything works find until hitting the same ring > descriptor entry again reading X-1 when X+1 would be valid. This lead us to > the assumption that somehow this entry ended up in cache lines. The issue > goes away using the dma_sync_single_for_cpu() with DMA_FROM_DEVICE in > direction parameter. Can you give some further detail - I think it would help understanding if you could give: - the initial numerical state of the descriptor (presumably setup by msgbuf.c calling brcmf_commonring_reserve_for_write(), and then writing the contents into the ring buffer, followed by brcmf_commonring_write_complete(). - time passes, the hardware processes the entry - the numerical state of the descriptor (which is in error) which you read back - the expected numerical state of the descriptor > So is there any function interface to verify cache status. There isn't, but if you dump the virtual address, and you have debugfs enabled, along with CONFIG_ARM_PTDUMP, you should be able to find the mapping in /sys/kernel/debug/kernel_page_tables, which will tell you the attributes that it's mapped using. What it won't tell you is whether there's an alias of the mapping with differing attributes. If you use dma_to_pfn() to convert the DMA handle into a PFN, we can use that to see whether there could be another mapping from the kernel page table dump (by checking whether the PFN would be a lowmem PFN, and therefore whether it's already mapped at it's lowmem address.) If you'd like to mail me (in addition to the ring contents above): - the kernel_page_tables dump - virtual address of the ring buffer - dma_to_pfn() converted DMA handle of the ring buffer - PHYS_PFN_OFFSET for your platform then I can see whether there is. -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 9:22 using DMA-API on ARM Arend van Spriel 2014-12-05 9:45 ` Russell King - ARM Linux @ 2014-12-05 9:52 ` Arnd Bergmann 2014-12-05 11:11 ` Russell King - ARM Linux 2014-12-05 18:39 ` Catalin Marinas 2 siblings, 1 reply; 41+ messages in thread From: Arnd Bergmann @ 2014-12-05 9:52 UTC (permalink / raw) To: linux-arm-kernel On Friday 05 December 2014 10:22:22 Arend van Spriel wrote: > Hi Russell, > > For our brcm80211 development we are working on getting brcmfmac driver > up and running on a Broadcom ARM-based platform. The wireless device is > a PCIe device, which is hooked up to the system behind a PCIe host > bridge, and we transfer information between host and device using a > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > tested on x86 and seen no issue. However, on this ARM platform > (single-core A9) we detect occasionally that the descriptor content is > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > retried a number of times if the problem persists. Actually, found out > that someone made a mistake by using virt_to_dma(va) to get the > dma_handle parameter. So probably we only provided a delay in the retry > loop. After fixing that a single call to dma_sync_single_for_cpu() is > sufficient. The DMA-API-HOWTO clearly states that: > > """ > the hardware should guarantee that the device and the CPU can access the > data in parallel and will see updates made by each other without any > explicit software flushing. > """ > > So it seems incorrect that we would need to do a dma_sync for this > memory. That we do need it seems like this memory can end up in > cache(?), or whatever happens, in some rare condition. Is there anyway > to investigate this situation either through DMA-API or some low-level > ARM specific functions. I think the problem comes down to not following the advice from this comment in asm/dma-mapping.h: /* * dma_to_pfn/pfn_to_dma/dma_to_virt/virt_to_dma are architecture private * functions used internally by the DMA-mapping API to provide DMA * addresses. They must not be used by drivers. */ The previous behavior of the driver is clearly wrong and cannot work on any architecture that has noncoherent PCI DMA or uses swiotlb, and that includes some older 64-bit x86 machines (Pentium D and similar). I'm still puzzled why you'd need a single dma_sync_single_for_cpu() after dma_alloc_coherent though, you should not need any. Is it possible that the driver accidentally uses __raw_readl() instead of readl() in some places and you are just lacking an appropriate barrier? Arnd ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 9:52 ` Arnd Bergmann @ 2014-12-05 11:11 ` Russell King - ARM Linux 2014-12-05 11:49 ` Hante Meuleman 2014-12-05 17:38 ` Catalin Marinas 0 siblings, 2 replies; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 11:11 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 10:52:02AM +0100, Arnd Bergmann wrote: > I'm still puzzled why you'd need a single dma_sync_single_for_cpu() > after dma_alloc_coherent though, you should not need any. Is it possible > that the driver accidentally uses __raw_readl() instead of readl() > in some places and you are just lacking an appropriate barrier? Digging into the driver, it looks like individual DMA buffers are allocated (via brcmf_pcie_init_dmabuffer_for_device) and registered into a "commonring" layer. Whenever the buffer is written to, space is first allocated via a call to brcmf_commonring_reserve_for_write() or brcmf_commonring_reserve_for_write_multiple(), data written to the buffer, followed by a call to brcmf_commonring_write_complete(). brcmf_commonring_write_complete() calls two methods at that point: cr_write_wptr() and cr_ring_bell(), which will be brcmf_pcie_ring_mb_write_wptr() and brcmf_pcie_ring_mb_ring_bell(). The first calls brcmf_pcie_write_tcm16(), which uses iowrite16(), which contains the appropriate barrier. The bell ringing functions also use ioread*/iowrite*(). So, on the write side, it looks fine from the barrier perspective. On the read side, brcmf_commonring_get_read_ptr() is used before a read access to the ring - which calls the cr_update_wptr() method, which in turn uses an ioread16() call. After the CPU has read data from the ring, brcmf_commonring_read_complete() is used, which uses iowrite16(). So, I don't see a barrier problem on the read side. However, I did trip over this: static void * brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo, u32 size, u32 tcm_dma_phys_addr, dma_addr_t *dma_handle) { void *ring; long long address; ring = dma_alloc_coherent(&devinfo->pdev->dev, size, dma_handle, GFP_KERNEL); if (!ring) return NULL; address = (long long)(long)*dma_handle; Casting to (long) will truncate the DMA handle to 32-bits on a 32-bit architecture, even if it supports 64-bit DMA addresses. There's a couple of other places where this same truncation occurs: address = (long long)(long)devinfo->shared.scratch_dmahandle; and address = (long long)(long)devinfo->shared.ringupd_dmahandle; In any case, wouldn't using a u64 type for "address" be better - isn't "long long" 128-bit on 64-bit architectures? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 11:11 ` Russell King - ARM Linux @ 2014-12-05 11:49 ` Hante Meuleman 2014-12-05 17:38 ` Catalin Marinas 1 sibling, 0 replies; 41+ messages in thread From: Hante Meuleman @ 2014-12-05 11:49 UTC (permalink / raw) To: linux-arm-kernel Thank you for investigating. Your analysis matches what was intended to be done. Regarding the cast, you're right, I'll improve it. My idea was that long long is always 64bit, but when casting directly to long long I get a compiler (when using C=2) warning: "warning: right shift by bigger than source value". Probably I concluded wrongly that I should cast it to long first. On the targets used the high address was always 0, so I got away with it thusfar, but indeed it should and shall be fixed. Regards, Hante -----Original Message----- From: Russell King - ARM Linux [mailto:linux at arm.linux.org.uk] Sent: vrijdag 5 december 2014 12:11 To: Arnd Bergmann Cc: linux-arm-kernel at lists.infradead.org; Arend Van Spriel; linux-wireless; brcm80211-dev-list; David Miller; linux-kernel at vger.kernel.org Subject: Re: using DMA-API on ARM On Fri, Dec 05, 2014 at 10:52:02AM +0100, Arnd Bergmann wrote: > I'm still puzzled why you'd need a single dma_sync_single_for_cpu() > after dma_alloc_coherent though, you should not need any. Is it possible > that the driver accidentally uses __raw_readl() instead of readl() > in some places and you are just lacking an appropriate barrier? Digging into the driver, it looks like individual DMA buffers are allocated (via brcmf_pcie_init_dmabuffer_for_device) and registered into a "commonring" layer. Whenever the buffer is written to, space is first allocated via a call to brcmf_commonring_reserve_for_write() or brcmf_commonring_reserve_for_write_multiple(), data written to the buffer, followed by a call to brcmf_commonring_write_complete(). brcmf_commonring_write_complete() calls two methods at that point: cr_write_wptr() and cr_ring_bell(), which will be brcmf_pcie_ring_mb_write_wptr() and brcmf_pcie_ring_mb_ring_bell(). The first calls brcmf_pcie_write_tcm16(), which uses iowrite16(), which contains the appropriate barrier. The bell ringing functions also use ioread*/iowrite*(). So, on the write side, it looks fine from the barrier perspective. On the read side, brcmf_commonring_get_read_ptr() is used before a read access to the ring - which calls the cr_update_wptr() method, which in turn uses an ioread16() call. After the CPU has read data from the ring, brcmf_commonring_read_complete() is used, which uses iowrite16(). So, I don't see a barrier problem on the read side. However, I did trip over this: static void * brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo, u32 size, u32 tcm_dma_phys_addr, dma_addr_t *dma_handle) { void *ring; long long address; ring = dma_alloc_coherent(&devinfo->pdev->dev, size, dma_handle, GFP_KERNEL); if (!ring) return NULL; address = (long long)(long)*dma_handle; Casting to (long) will truncate the DMA handle to 32-bits on a 32-bit architecture, even if it supports 64-bit DMA addresses. There's a couple of other places where this same truncation occurs: address = (long long)(long)devinfo->shared.scratch_dmahandle; and address = (long long)(long)devinfo->shared.ringupd_dmahandle; In any case, wouldn't using a u64 type for "address" be better - isn't "long long" 128-bit on 64-bit architectures? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 11:11 ` Russell King - ARM Linux 2014-12-05 11:49 ` Hante Meuleman @ 2014-12-05 17:38 ` Catalin Marinas 2014-12-05 18:24 ` Russell King - ARM Linux 1 sibling, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-05 17:38 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote: > In any case, wouldn't using a u64 type for "address" be better - isn't > "long long" 128-bit on 64-bit architectures? No, it's still 64-bit. There is no 128-bit integer in the C standard. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 17:38 ` Catalin Marinas @ 2014-12-05 18:24 ` Russell King - ARM Linux 2014-12-05 18:31 ` Catalin Marinas 0 siblings, 1 reply; 41+ messages in thread From: Russell King - ARM Linux @ 2014-12-05 18:24 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 05:38:39PM +0000, Catalin Marinas wrote: > On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote: > > In any case, wouldn't using a u64 type for "address" be better - isn't > > "long long" 128-bit on 64-bit architectures? > > No, it's still 64-bit. There is no 128-bit integer in the C standard. Actually, that's a fallicy. The C99 standard (like previous versions) does not define exactly the number of bits in each type. It defines ranks of type, and says that lower ranks are a subrange of integers with higher ranks (for the same signed-ness.) See section 6.2.5. So, it merely states that: range(char) <= range(short) <= range(int) <= range(long) <= range(long long) So, an implementation could have: char: 8 short: 16 int: 16 long: 32 long long: 64 char: 8 short: 16 int: 32 long: 32 long long: 64 char: 8 short: 16 int: 32 long: 64 long long: 64 char: 8 short: 16 int: 64 long: 64 long long: 64 or even: char: 8 short: 16 int: 32 long: 64 long long: 128 and that would still be compliant with C99, since it continues to meet the criteria about the required data types specified in the standard. -- FTTC broadband for 0.8mile line: currently@9.5Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 18:24 ` Russell King - ARM Linux @ 2014-12-05 18:31 ` Catalin Marinas 0 siblings, 0 replies; 41+ messages in thread From: Catalin Marinas @ 2014-12-05 18:31 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 06:24:43PM +0000, Russell King - ARM Linux wrote: > On Fri, Dec 05, 2014 at 05:38:39PM +0000, Catalin Marinas wrote: > > On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote: > > > In any case, wouldn't using a u64 type for "address" be better - isn't > > > "long long" 128-bit on 64-bit architectures? > > > > No, it's still 64-bit. There is no 128-bit integer in the C standard. > > Actually, that's a fallicy. > > The C99 standard (like previous versions) does not define exactly the > number of bits in each type. It defines ranks of type, and says that > lower ranks are a subrange of integers with higher ranks (for the same > signed-ness.) See section 6.2.5. > > So, it merely states that: > > range(char) <= range(short) <= range(int) <= range(long) <= range(long long) You are probably right, I haven't checked. But the ABI we use in Linux for 64-bit, LP64, defines long long as 64-bit. Gcc has a int128_t type but it's specific to this toolchain. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 9:22 using DMA-API on ARM Arend van Spriel 2014-12-05 9:45 ` Russell King - ARM Linux 2014-12-05 9:52 ` Arnd Bergmann @ 2014-12-05 18:39 ` Catalin Marinas 2014-12-05 18:53 ` Catalin Marinas 2 siblings, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-05 18:39 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote: > For our brcm80211 development we are working on getting brcmfmac driver > up and running on a Broadcom ARM-based platform. The wireless device is > a PCIe device, which is hooked up to the system behind a PCIe host > bridge, and we transfer information between host and device using a > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > tested on x86 and seen no issue. However, on this ARM platform > (single-core A9) we detect occasionally that the descriptor content is > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > retried a number of times if the problem persists. Actually, found out > that someone made a mistake by using virt_to_dma(va) to get the > dma_handle parameter. So probably we only provided a delay in the retry > loop. After fixing that a single call to dma_sync_single_for_cpu() is > sufficient. The DMA-API-HOWTO clearly states that: Does your system have an L2 cache? What's the SoC topology, can PCIe see such L2 cache (or snoop the L1 caches)? Also, are you certain that dma_alloc_coherent() ends up creating a non-cacheable mapping in Linux (this call translates to a function pointer call which may or may not create non-cacheable memory, depending on the "dma-coherent" property passed via DT). -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 18:39 ` Catalin Marinas @ 2014-12-05 18:53 ` Catalin Marinas 2014-12-05 19:50 ` Arend van Spriel 2014-12-08 12:55 ` Johannes Stezenbach 0 siblings, 2 replies; 41+ messages in thread From: Catalin Marinas @ 2014-12-05 18:53 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote: > On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote: > > For our brcm80211 development we are working on getting brcmfmac driver > > up and running on a Broadcom ARM-based platform. The wireless device is > > a PCIe device, which is hooked up to the system behind a PCIe host > > bridge, and we transfer information between host and device using a > > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly > > tested on x86 and seen no issue. However, on this ARM platform > > (single-core A9) we detect occasionally that the descriptor content is > > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is > > retried a number of times if the problem persists. Actually, found out > > that someone made a mistake by using virt_to_dma(va) to get the > > dma_handle parameter. So probably we only provided a delay in the retry > > loop. After fixing that a single call to dma_sync_single_for_cpu() is > > sufficient. The DMA-API-HOWTO clearly states that: > > Does your system have an L2 cache? What's the SoC topology, can PCIe see > such L2 cache (or snoop the L1 caches)? BTW, if you really have a PL310-like L2 cache, have a look at some patches (I've seen similar symptoms) and make sure your configuration is correct: http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1 http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 The first one is vexpress specific. The second one was eventually discarded by Russell (I don't remember the reason, I guess it's because SoC code is supposed to set the right bits in there anyway). In your case, such bits may be set up by firmware, so Linux cannot fix anything up. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 18:53 ` Catalin Marinas @ 2014-12-05 19:50 ` Arend van Spriel 2014-12-08 12:55 ` Johannes Stezenbach 1 sibling, 0 replies; 41+ messages in thread From: Arend van Spriel @ 2014-12-05 19:50 UTC (permalink / raw) To: linux-arm-kernel On 12/05/14 19:53, Catalin Marinas wrote: > On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote: >> On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote: >>> For our brcm80211 development we are working on getting brcmfmac driver >>> up and running on a Broadcom ARM-based platform. The wireless device is >>> a PCIe device, which is hooked up to the system behind a PCIe host >>> bridge, and we transfer information between host and device using a >>> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly >>> tested on x86 and seen no issue. However, on this ARM platform >>> (single-core A9) we detect occasionally that the descriptor content is >>> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is >>> retried a number of times if the problem persists. Actually, found out >>> that someone made a mistake by using virt_to_dma(va) to get the >>> dma_handle parameter. So probably we only provided a delay in the retry >>> loop. After fixing that a single call to dma_sync_single_for_cpu() is >>> sufficient. The DMA-API-HOWTO clearly states that: >> >> Does your system have an L2 cache? What's the SoC topology, can PCIe see >> such L2 cache (or snoop the L1 caches)? > > BTW, if you really have a PL310-like L2 cache, have a look at some > patches (I've seen similar symptoms) and make sure your configuration is > correct: > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1 > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 > > The first one is vexpress specific. The second one was eventually > discarded by Russell (I don't remember the reason, I guess it's because > SoC code is supposed to set the right bits in there anyway). In your > case, such bits may be set up by firmware, so Linux cannot fix anything > up. I guess by firmware you mean to bootloader. This one boots with CFE bootloader which Broadcom maintains itself so could look into that. Regards, Arend ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-05 18:53 ` Catalin Marinas 2014-12-05 19:50 ` Arend van Spriel @ 2014-12-08 12:55 ` Johannes Stezenbach 2014-12-08 15:55 ` Catalin Marinas 1 sibling, 1 reply; 41+ messages in thread From: Johannes Stezenbach @ 2014-12-08 12:55 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote: > On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote: > > > > Does your system have an L2 cache? What's the SoC topology, can PCIe see > > such L2 cache (or snoop the L1 caches)? > > BTW, if you really have a PL310-like L2 cache, have a look at some > patches (I've seen similar symptoms) and make sure your configuration is > correct: > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1 > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 > > The first one is vexpress specific. The second one was eventually > discarded by Russell (I don't remember the reason, I guess it's because > SoC code is supposed to set the right bits in there anyway). In your > case, such bits may be set up by firmware, so Linux cannot fix anything > up. How do you avoid the unpredictable behavior mentioned in the PL310 TRM when the Shared Attribute Invalidate Enable bit is set? http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html I think this bit does not do what you seem to think it does, it only changes behaviour for "writes targeting a full cache line, for example 4x64-bit bursts with all strobes active", which then cause the cacheline to be invalidated. "Other cases are identical to the default shared behavior", which is "cacheable no allocate for reads" and "write through no write allocate for writes". If the problem is really speculative reads via the cachable alias mapping, it seems this bit cannot solve the problem, right? Johannes ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 12:55 ` Johannes Stezenbach @ 2014-12-08 15:55 ` Catalin Marinas 2014-12-08 16:50 ` Johannes Stezenbach 0 siblings, 1 reply; 41+ messages in thread From: Catalin Marinas @ 2014-12-08 15:55 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 12:55:38PM +0000, Johannes Stezenbach wrote: > On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote: > > On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote: > > > > > > Does your system have an L2 cache? What's the SoC topology, can PCIe see > > > such L2 cache (or snoop the L1 caches)? > > > > BTW, if you really have a PL310-like L2 cache, have a look at some > > patches (I've seen similar symptoms) and make sure your configuration is > > correct: > > > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1 > > > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 > > > > The first one is vexpress specific. The second one was eventually > > discarded by Russell (I don't remember the reason, I guess it's because > > SoC code is supposed to set the right bits in there anyway). In your > > case, such bits may be set up by firmware, so Linux cannot fix anything > > up. > > How do you avoid the unpredictable behavior mentioned in the > PL310 TRM when the Shared Attribute Invalidate Enable bit is set? > http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html So you talk about "Shared Attribute _Invalidate_ Enable" (bit 13) while I talk about "Shared Attribute _Override_ Enable" (bit 22). In addition, Shared _Invalidate_ behaviour can only be enabled if Shared Attribute _Override_ Enable bit is not set. -- Catalin ^ permalink raw reply [flat|nested] 41+ messages in thread
* using DMA-API on ARM 2014-12-08 15:55 ` Catalin Marinas @ 2014-12-08 16:50 ` Johannes Stezenbach 0 siblings, 0 replies; 41+ messages in thread From: Johannes Stezenbach @ 2014-12-08 16:50 UTC (permalink / raw) To: linux-arm-kernel On Mon, Dec 08, 2014 at 03:55:57PM +0000, Catalin Marinas wrote: > On Mon, Dec 08, 2014 at 12:55:38PM +0000, Johannes Stezenbach wrote: > > On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote: > > > > > > BTW, if you really have a PL310-like L2 cache, have a look at some > > > patches (I've seen similar symptoms) and make sure your configuration is > > > correct: > > > > > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1 > > > > > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1 > > > > > > The first one is vexpress specific. The second one was eventually > > > discarded by Russell (I don't remember the reason, I guess it's because > > > SoC code is supposed to set the right bits in there anyway). In your > > > case, such bits may be set up by firmware, so Linux cannot fix anything > > > up. > > > > How do you avoid the unpredictable behavior mentioned in the > > PL310 TRM when the Shared Attribute Invalidate Enable bit is set? > > http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html > > So you talk about "Shared Attribute _Invalidate_ Enable" (bit 13) while > I talk about "Shared Attribute _Override_ Enable" (bit 22). > > In addition, Shared _Invalidate_ behaviour can only be enabled if Shared > Attribute _Override_ Enable bit is not set. Yeah, I got confused, sorry for the noise. Thanks, Johannes ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2015-01-20 15:22 UTC | newest] Thread overview: 41+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-12-05 9:22 using DMA-API on ARM Arend van Spriel 2014-12-05 9:45 ` Russell King - ARM Linux 2014-12-05 12:24 ` Will Deacon 2014-12-05 12:56 ` Hante Meuleman 2014-12-05 13:23 ` Russell King - ARM Linux 2014-12-05 14:20 ` Hante Meuleman 2014-12-05 14:47 ` Arend van Spriel 2014-12-08 13:47 ` Hante Meuleman 2014-12-08 15:01 ` Arnd Bergmann 2014-12-08 15:17 ` Russell King - ARM Linux 2014-12-08 15:22 ` Arnd Bergmann 2014-12-08 16:03 ` Catalin Marinas 2014-12-08 17:01 ` Arend van Spriel 2014-12-09 10:19 ` Arend van Spriel 2014-12-09 10:29 ` Russell King - ARM Linux 2014-12-09 11:07 ` Arend van Spriel 2014-12-09 11:54 ` Catalin Marinas 2015-01-20 15:22 ` Fabio Estevam 2014-12-08 16:22 ` Arend van Spriel 2014-12-08 16:38 ` Arnd Bergmann 2014-12-08 16:47 ` Russell King - ARM Linux 2014-12-08 16:50 ` Catalin Marinas 2014-12-08 16:54 ` Russell King - ARM Linux 2014-12-05 15:06 ` Russell King - ARM Linux 2014-12-05 18:28 ` Catalin Marinas 2014-12-05 19:22 ` Arend van Spriel 2014-12-05 19:25 ` Russell King - ARM Linux 2014-12-05 12:43 ` Arend van Spriel 2014-12-05 12:59 ` Russell King - ARM Linux 2014-12-05 9:52 ` Arnd Bergmann 2014-12-05 11:11 ` Russell King - ARM Linux 2014-12-05 11:49 ` Hante Meuleman 2014-12-05 17:38 ` Catalin Marinas 2014-12-05 18:24 ` Russell King - ARM Linux 2014-12-05 18:31 ` Catalin Marinas 2014-12-05 18:39 ` Catalin Marinas 2014-12-05 18:53 ` Catalin Marinas 2014-12-05 19:50 ` Arend van Spriel 2014-12-08 12:55 ` Johannes Stezenbach 2014-12-08 15:55 ` Catalin Marinas 2014-12-08 16:50 ` Johannes Stezenbach
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).