* [RFC] arm: DMA-API contiguous cacheable memory
@ 2015-05-18 20:56 Lorenzo Nava
2015-05-19 16:34 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-18 20:56 UTC (permalink / raw)
To: linux-arm-kernel
Hello,
it's been a while since I started working with DMA on ARM processors
for a smart camera project. Typically the requirement is to have a
large memory area which can be accessed by both the DMA engine and
user space. I've already noticed that many people wonder about the
best way to have data received via DMA mapped in user space and, more
importantly, mapped in a cacheable area of memory. Having a
memory-mapped region which is cacheable is very important if the user
must access the data and do some sort of processing on it.
My question is: why don't we introduce a function in the DMA-API
interface for ARM processors which allows allocating a contiguous,
cacheable area of memory (> 4MB)?
This new function could take advantage of the CMA mechanism just as
dma_alloc_coherent() does, but use different PTE attributes for the
allocated pages. Basically, writing a function similar to
arm_dma_alloc() that sets the attributes differently would do the
trick:
pgprot_t prot = __pgprot_modify(prot, L_PTE_MT_MASK,
                                L_PTE_MT_WRITEALLOC | L_PTE_XN);
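To make the idea concrete, here is a rough sketch of what such an
allocator could look like (the name dma_alloc_cacheable() is invented,
and error handling and attribute plumbing are omitted;
dma_alloc_from_contiguous() is the existing CMA entry point):

        static void *dma_alloc_cacheable(struct device *dev, size_t size,
                                         dma_addr_t *handle, gfp_t gfp)
        {
                unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
                struct page *page;

                /* grab physically contiguous pages from the CMA region */
                page = dma_alloc_from_contiguous(dev, count, get_order(size));
                if (!page)
                        return NULL;

                *handle = pfn_to_dma(dev, page_to_pfn(page));
                /* lowmem CMA pages are already mapped cacheable in the
                 * kernel linear mapping, so they can be returned directly */
                return page_address(page);
        }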
Of course this is very important for ARM processors, as the page
attributes must be consistent among the different mappings of the same
physical memory, so this modification should eventually affect only
contiguous cacheable memory areas.
This would also be an improvement for the V4L2 interface which, for
buffers larger than 4MB, is currently forced to use non-cacheable
memory (with vb2_dma_contig_memops). Performance is very poor if users
have to deal with non-cacheable memory while doing image processing.
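For reference, this is roughly the queue setup that puts a capture
driver on that path (a trimmed fragment; the surrounding queue
initialisation is omitted):

        /* videobuf2 queue wired to the DMA-contig allocator; its
         * allocation ends up in dma_alloc_coherent(), hence
         * non-cacheable memory on non-coherent ARM systems */
        q->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        q->io_modes = VB2_MMAP;
        q->mem_ops = &vb2_dma_contig_memops;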
Any comment would be very much appreciated.
Thanks.
Cheers.
^ permalink raw reply [flat|nested] 11+ messages in thread

* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-18 20:56 [RFC] arm: DMA-API contiguous cacheable memory Lorenzo Nava
@ 2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Lorenzo Nava
0 siblings, 2 replies; 11+ messages in thread
From: Catalin Marinas @ 2015-05-19 16:34 UTC (permalink / raw)
To: linux-arm-kernel

On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
> it's been a while since I started working with DMA on ARM processors
> for a smart camera project. Typically the requirement is to have a
> large memory area which can be accessed by both the DMA engine and
> user space. I've already noticed that many people wonder about the
> best way to have data received via DMA mapped in user space and, more
> importantly, mapped in a cacheable area of memory. Having a
> memory-mapped region which is cacheable is very important if the user
> must access the data and do some sort of processing on it.
> My question is: why don't we introduce a function in the DMA-API
> interface for ARM processors which allows allocating a contiguous,
> cacheable area of memory (> 4MB)?
> This new function could take advantage of the CMA mechanism just as
> dma_alloc_coherent() does, but use different PTE attributes for the
> allocated pages. Basically, writing a function similar to
> arm_dma_alloc() that sets the attributes differently would do the
> trick:
>
> pgprot_t prot = __pgprot_modify(prot, L_PTE_MT_MASK,
>                                 L_PTE_MT_WRITEALLOC | L_PTE_XN);

We already have a way to specify whether a device is coherent via the
"dma-coherent" DT property. This allows the correct dma_map_ops to be
set for a device. For cache coherent devices, arm_coherent_dma_alloc()
and __dma_alloc() should return cacheable memory.

However, looking at the code, it seems that __dma_alloc() does not use
the CMA when is_coherent == true, though you would hit a limit on the
number of pages that can be allocated.

As for mmap'ing to user space, there is arm_dma_mmap(). This one sets
the vm_page_prot to what __get_dma_pgprot() returns, which is always
non-cacheable.

I haven't checked the history of cache coherent DMA support on arm but
I think some of the above can be changed. As an example, on arm64
__dma_alloc() allocates from CMA independent of whether the device is
coherent or not. Also __get_dma_pgprot() returns cacheable attributes
for coherent devices, which in turn allows cacheable user mappings of
such buffers. You don't really need to implement additional functions,
just tweaks to the existing ones.

Patches welcome ;)

--
Catalin

^ permalink raw reply [flat|nested] 11+ messages in thread
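(For context, the "dma-coherent" property mentioned above is just a
flag in the device's DT node - the node below is invented for
illustration:)

        /* hypothetical DMA master marked as cache coherent */
        camera@40000000 {
                compatible = "acme,smart-camera";
                reg = <0x40000000 0x1000>;
                dma-coherent;
        };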
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 16:34 ` Catalin Marinas
@ 2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:09 ` Lorenzo Nava
1 sibling, 2 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:05 UTC (permalink / raw)
To: linux-arm-kernel

Thanks for the answer. I do agree with you on that: I'll take a look at
the arm64 code and I'll be glad to contribute patches as soon as
possible.

Anyway, I'd like to focus on a different aspect: I think this solution
can manage cache coherent DMA, i.e. devices which guarantee coherency
using a cache snooping mechanism. However, how can I manage devices
which need contiguous memory and don't guarantee cache coherency? If
the device doesn't implement sg functionality, I can't allocate buffers
greater than 4MB because I can use neither dma_alloc_coherent() nor
direct access to CMA (well, actually I can use dma_alloc_coherent(),
but it sounds a little bit confusing).

Do you think that dma_alloc_coherent() can be used with this type of
device as well? Do you think that a new dma_alloc_contiguous() function
would help in this case?
Maybe my interpretation of dma_alloc_coherent() is not correct, and the
coherency can be managed using the dma_sync_single_for_* functions
without requiring a hardware mechanism.

Thank you.
Cheers

On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
>> [...]
>
> We already have a way to specify whether a device is coherent via the
> "dma-coherent" DT property. This allows the correct dma_map_ops to be
> set for a device. For cache coherent devices, arm_coherent_dma_alloc()
> and __dma_alloc() should return cacheable memory.
>
> However, looking at the code, it seems that __dma_alloc() does not use
> the CMA when is_coherent == true, though you would hit a limit on the
> number of pages that can be allocated.
>
> As for mmap'ing to user space, there is arm_dma_mmap(). This one sets
> the vm_page_prot to what __get_dma_pgprot() returns, which is always
> non-cacheable.
>
> I haven't checked the history of cache coherent DMA support on arm but
> I think some of the above can be changed. As an example, on arm64
> __dma_alloc() allocates from CMA independent of whether the device is
> coherent or not. Also __get_dma_pgprot() returns cacheable attributes
> for coherent devices, which in turn allows cacheable user mappings of
> such buffers. You don't really need to implement additional functions,
> just tweaks to the existing ones.
>
> Patches welcome ;)
>
> --
> Catalin

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
1 sibling, 0 replies; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-19 22:09 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:05:54AM +0200, Lorenzo Nava wrote:
> Maybe my interpretation of dma_alloc_coherent() is not correct, and the
> coherency can be managed using the dma_sync_single_for_* functions
> without requiring a hardware mechanism.

dma_sync_single_for_* are only for use with the streaming DMA API,
where you must have already mapped the buffer using one of the
dma_map_* functions.

Anything other than that is abusing the API.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
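(To make the streaming usage concrete, a minimal sketch - the device
pointer, buffer size and transfer direction are illustrative:)

        static int grab_frame(struct device *dev, size_t len)
        {
                struct page *pg = alloc_pages(GFP_KERNEL, get_order(len));
                dma_addr_t dma;
                void *buf;

                if (!pg)
                        return -ENOMEM;
                buf = page_address(pg);

                /* hand the buffer to the device */
                dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
                if (dma_mapping_error(dev, dma)) {
                        __free_pages(pg, get_order(len));
                        return -ENOMEM;
                }

                /* ... start the DMA transfer and wait for completion ... */

                /* hand the buffer back to the CPU before reading it */
                dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
                /* ... CPU reads/processes buf ... */

                dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
                __free_pages(pg, get_order(len));
                return 0;
        }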
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
1 sibling, 2 replies; 11+ messages in thread
From: Arnd Bergmann @ 2015-05-19 22:14 UTC (permalink / raw)
To: linux-arm-kernel

On Wednesday 20 May 2015 00:05:54 Lorenzo Nava wrote:
> On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
> >> [...]
> >
> > We already have a way to specify whether a device is coherent via the
> > "dma-coherent" DT property. This allows the correct dma_map_ops to be
> > set for a device. For cache coherent devices, arm_coherent_dma_alloc()
> > and __dma_alloc() should return cacheable memory.

That is not what Lorenzo was asking about though.

> > [...]
>
> Thanks for the answer. I do agree with you on that: I'll take a look at
> the arm64 code and I'll be glad to contribute patches as soon as
> possible.
>
> Anyway, I'd like to focus on a different aspect: I think this solution
> can manage cache coherent DMA, i.e. devices which guarantee coherency
> using a cache snooping mechanism. However, how can I manage devices
> which need contiguous memory and don't guarantee cache coherency? If
> the device doesn't implement sg functionality, I can't allocate buffers
> greater than 4MB because I can use neither dma_alloc_coherent() nor
> direct access to CMA (well, actually I can use dma_alloc_coherent(),
> but it sounds a little bit confusing).

So you have a device that is not cache-coherent, and you want to
allocate cacheable memory and manage coherency manually.

This is normally done using alloc_pages() and dma_map_single(), but as
you have realized, that does not use the CMA area.

> Do you think that dma_alloc_coherent() can be used with this type of
> device as well? Do you think that a new dma_alloc_contiguous() function
> would help in this case?
> Maybe my interpretation of dma_alloc_coherent() is not correct, and the
> coherency can be managed using the dma_sync_single_for_* functions
> without requiring a hardware mechanism.

I believe dma_alloc_attrs() is the interface you want, with attributes
DMA_ATTR_FORCE_CONTIGUOUS and DMA_ATTR_NON_CONSISTENT. I don't know if
that is already implemented on arm64, but this is something that can
definitely be done.

With that memory, you should be able to use the normal streaming API
(dma_sync_single_for_*). There is an older interface called
dma_alloc_noncoherent(), but that cannot be easily implemented on ARM.

Arnd

^ permalink raw reply [flat|nested] 11+ messages in thread
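(A sketch of the call sequence Arnd suggests, using the dma_attrs
interface as it existed at the time; the size is invented, and whether
syncing such a buffer is legal is exactly what is disputed below:)

        DEFINE_DMA_ATTRS(attrs);
        dma_addr_t dma;
        void *buf;

        /* ask for physically contiguous memory without the coherency
         * guarantee, i.e. cacheable pages */
        dma_set_attr(DMA_ATTR_FORCE_CONTIGUOUS, &attrs);
        dma_set_attr(DMA_ATTR_NON_CONSISTENT, &attrs);

        buf = dma_alloc_attrs(dev, SZ_8M, &dma, GFP_KERNEL, &attrs);
        if (!buf)
                return -ENOMEM;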
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:14 ` Arnd Bergmann
@ 2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
1 sibling, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:27 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:14 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 20 May 2015 00:05:54 Lorenzo Nava wrote:
>> [...]
>
> So you have a device that is not cache-coherent, and you want to
> allocate cacheable memory and manage coherency manually.
>
> This is normally done using alloc_pages() and dma_map_single(), but as
> you have realized, that does not use the CMA area.
>
>> [...]
>
> I believe dma_alloc_attrs() is the interface you want, with attributes
> DMA_ATTR_FORCE_CONTIGUOUS and DMA_ATTR_NON_CONSISTENT. I don't know if
> that is already implemented on arm64, but this is something that can
> definitely be done.
>
> With that memory, you should be able to use the normal streaming API
> (dma_sync_single_for_*). There is an older interface called
> dma_alloc_noncoherent(), but that cannot be easily implemented on ARM.
>
> Arnd

Yes, this is exactly the point. Currently this interface is only used
through dma_alloc_coherent() (which actually calls dma_alloc_attrs()).
The function, anyway, is not available in the Linux DMA API, but I
think it could be useful for managing some kinds of devices (see my
previous mail). What do you think would be the best way to access
dma_alloc_attrs() from a device driver? Call the function directly?

Thank you.
Lorenzo

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
@ 2015-05-19 22:34 ` Russell King - ARM Linux
2015-05-20 12:57 ` Lorenzo Nava
1 sibling, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-19 22:34 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:14:48AM +0200, Arnd Bergmann wrote:
> With that memory, you should be able to use the normal streaming API
> (dma_sync_single_for_*).

Wrong, as I've pointed out previously. The only memory you're allowed
to sync is memory which has been mapped with a dma_map_*() function.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:34 ` Russell King - ARM Linux
@ 2015-05-20 12:57 ` Lorenzo Nava
2015-05-20 16:20 ` Russell King - ARM Linux
0 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-20 12:57 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:34 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, May 20, 2015 at 12:14:48AM +0200, Arnd Bergmann wrote:
>> With that memory, you should be able to use the normal streaming API
>> (dma_sync_single_for_*).
>
> Wrong, as I've pointed out previously. The only memory you're allowed
> to sync is memory which has been mapped with a dma_map_*() function.
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

Russell,
so it's probably currently impossible to allocate contiguous cacheable
DMA memory. You can't use CMA, and the only functions which allow you
to use it are not compatible with the sync functions.
Do you think the problem is the CMA design, the DMA API design, or that
there is no problem at all and this is not something useful?

Anyway, it's not completely clear to me what the difference is between:
- allocating memory and using the sync functions on memory mapped with
  dma_map_*()
- allocating memory with dma_alloc_*() (with cacheable attributes) and
  using the sync functions on it

It looks like the second just does alloc + map in a single step instead
of splitting the operation into two steps.
I'm sure I'm missing something; can you please help me understand?

Thanks.
Lorenzo

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-20 12:57 ` Lorenzo Nava
@ 2015-05-20 16:20 ` Russell King - ARM Linux
2015-05-20 21:49 ` Lorenzo Nava
0 siblings, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-20 16:20 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 02:57:36PM +0200, Lorenzo Nava wrote:
> so it's probably currently impossible to allocate contiguous cacheable
> DMA memory. You can't use CMA, and the only functions which allow you
> to use it are not compatible with the sync functions.
> Do you think the problem is the CMA design, the DMA API design, or that
> there is no problem at all and this is not something useful?

Well, the whole issue of DMA from userspace is a fraught topic. I
consider what we have at the moment to be more luck than anything
else - there are architecture maintainers who'd like to see dma_mmap_*
be deleted from the kernel.

However, I have a problem with what you're trying to do.

You want to allocate a large chunk of memory for DMA. Large chunks of
memory can _only_ come from CMA - the standard Linux allocators do
_not_ cope well with large allocations. Even 16K allocations can
become difficult after the system has been running for a while. So,
CMA is really the only way to go to obtain large chunks of memory.

You want this large chunk of memory to be cacheable. CMA might be able
to provide that.

You want to DMA to this memory, and then read from it. The problem
there is: how do you ensure that the data you're reading is the data
that the DMA wrote there? If you have caching enabled, the caching
model that we _have_ to assume is that the cache is infinite, and that
it speculates aggressively. This means that we cannot guarantee that
any data read through a cacheable mapping will be coherent with the
DMA'd data.

So, we have to flush the cache. The problem is that with an infinite
cache size model, we have to flush all possible lines associated with
the buffer, because we don't know which might be in the cache and which
are not.

Of course, caches are finite, and we can say that if the size of the
region being flushed is greater than the cache size (or a multiple of
the cache size), we _could_ just flush the entire cache instead. (This
can only work for non-SG stuff, as we don't know beforehand how large
the SG is in bytes.)

However, here's the problem. As I mentioned above, we have the
dma_mmap_* stuff, which works for memory allocated by
dma_alloc_coherent(). The only reason mapping that memory into
userspace works is because (for the non-coherent cache case) we map it
in such a way that the caches are disabled, and this works fine. For
the coherent cache case, it doesn't matter that we map it with the
caches enabled. So both of these work.

When you have a non-coherent cache _and_ you want the mapping to be
cacheable, you have extra problems to worry about. You need to know
the type of the CPU cache. If the CPU cache is physically indexed,
physically tagged, then you can perform cache maintenance on any
mapping of that memory, and you will hit the appropriate cache lines.
For other types of caches, this is not true. Hence, a userspace
mapping of non-coherent cacheable memory with a cache which makes use
of virtual addresses would need to be flushed at the virtual aliases -
this is precisely why kernel arch maintainers don't like DMA from
userspace. It brings with it huge problems.

Thankfully, ARMv7 caches are PIPT - but that doesn't really give us
"permission" to just consider PIPT for this case, especially for
something which is used between arch code and driver code.

What I'm trying to say is that what you're asking for is not a simple
issue - it needs lots of thought and consideration, more than I have
time to spare (or am likely to have time to spare in the future; _most_
of my time is wasted trying to deal with the flood of email from these
mailing lists rather than doing any real work - even non-relevant email
has a non-zero time cost, as it takes a certain amount of time to
decide whether an email is relevant or not.)

> Anyway, it's not completely clear to me what the difference is between:
> - allocating memory and using the sync functions on memory mapped with
>   dma_map_*()
> - allocating memory with dma_alloc_*() (with cacheable attributes) and
>   using the sync functions on it

Let me say _for the third time_: dma_sync_*() on memory returned from
dma_alloc_*() is not permitted. Anyone who tells you different is just
plain wrong, and is telling you to do something which is _not_
supported by the API, and _will_ fail with some implementations,
including the ARM implementation if it uses the atomic pool to satisfy
your allocation.

> It looks like the second just does alloc + map in a single step instead
> of splitting the operation into two steps.
> I'm sure I'm missing something; can you please help me understand?

The problem is that you're hitting two different costs: the cost of
accessing data via an uncacheable mapping, vs the cost of having to do
cache maintenance to ensure that you're reading up-to-date data.

At the end of the day, there's only one truth here: large DMA buffers
on architectures which are not cache-coherent suck and require a
non-zero cost to ensure that you can read the data written to the
buffer by DMA, or that DMA can see the data you have written to the
buffer.

The final thing to mention is that the ARM cache maintenance
instructions are not available in userspace, so you can't have
userspace taking care of flushing the caches where they need to...

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
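(For reference, the dma_mmap_* path mentioned above is typically wired
into a driver's mmap handler roughly like this - the mydrv structure
and its fields are invented, and buf/dma_handle/size would come from a
dma_alloc_coherent() call:)

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                struct mydrv *drv = file->private_data;

                /* maps the coherent buffer into the calling process */
                return dma_mmap_coherent(drv->dev, vma, drv->buf,
                                         drv->dma_handle, drv->size);
        }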
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-20 16:20 ` Russell King - ARM Linux
@ 2015-05-20 21:49 ` Lorenzo Nava
0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-20 21:49 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 6:20 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, May 20, 2015 at 02:57:36PM +0200, Lorenzo Nava wrote:
>> [...]
>
> Well, the whole issue of DMA from userspace is a fraught topic. I
> consider what we have at the moment to be more luck than anything
> else - there are architecture maintainers who'd like to see dma_mmap_*
> be deleted from the kernel.

Well, sometimes mmap can avoid unnecessary memory copies and boost
performance. Of course it must be carefully managed to avoid big
problems.

> However, I have a problem with what you're trying to do.
>
> [...]
>
> When you have a non-coherent cache _and_ you want the mapping to be
> cacheable, you have extra problems to worry about. You need to know
> the type of the CPU cache. If the CPU cache is physically indexed,
> physically tagged, then you can perform cache maintenance on any
> mapping of that memory, and you will hit the appropriate cache lines.
> For other types of caches, this is not true. Hence, a userspace
> mapping of non-coherent cacheable memory with a cache which makes use
> of virtual addresses would need to be flushed at the virtual aliases -
> this is precisely why kernel arch maintainers don't like DMA from
> userspace. It brings with it huge problems.
>
> Thankfully, ARMv7 caches are PIPT - but that doesn't really give us
> "permission" to just consider PIPT for this case, especially for
> something which is used between arch code and driver code.

CPU cache type is an extremely interesting subject which, honestly, I
didn't consider.

> What I'm trying to say is that what you're asking for is not a simple
> issue - it needs lots of thought and consideration, more than I have
> time to spare [...]

And let me thank you for this explanation and for sharing your
knowledge, which is really helping me.

>> [...]
>
> Let me say _for the third time_: dma_sync_*() on memory returned from
> dma_alloc_*() is not permitted. Anyone who tells you different is just
> plain wrong, and is telling you to do something which is _not_
> supported by the API, and _will_ fail with some implementations,
> including the ARM implementation if it uses the atomic pool to satisfy
> your allocation.

Ok, got it. Sync functions on memory from dma_alloc_*() are very
bad :-)

>> [...]
>
> The problem is that you're hitting two different costs: the cost of
> accessing data via an uncacheable mapping, vs the cost of having to do
> cache maintenance to ensure that you're reading up-to-date data.
>
> At the end of the day, there's only one truth here: large DMA buffers
> on architectures which are not cache-coherent suck and require a
> non-zero cost to ensure that you can read the data written to the
> buffer by DMA, or that DMA can see the data you have written to the
> buffer.
>
> The final thing to mention is that the ARM cache maintenance
> instructions are not available in userspace, so you can't have
> userspace taking care of flushing the caches where they need to...

You're right. This is the crucial point: you can't guarantee that the
accessed data is correct at any given time unless you know how things
work at the kernel level. Basically the only way is to have a sort of
synchronisation between user and kernel to be sure that the accessed
data is actually up to date. The solution could be to implement a
mechanism that doesn't make data available to the user until cache
coherence has been correctly managed. To be honest, V4L implements
exactly that mechanism: buffers are queued and made available to the
user with mmap once the grab process is completed, and cache coherence
can then be guaranteed.

I'm a little bit disappointed that using CMA with non-coherent memory
is not currently possible, as this is something that could be useful
when the developer is able to manage cache coherence (and doesn't have
sg available). I hoped that the "bigphysarea" patch would be forever
forgotten and replaced by CMA, but it doesn't look like that is really
possible yet.

Thanks.
Lorenzo

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:09 ` Lorenzo Nava
1 sibling, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:09 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
>> [...]
>
> [...]
>
> Patches welcome ;)
>
> --
> Catalin

Thanks for the answer. I do agree with you on that: I'll take a look at
the arm64 code and I'll be glad to contribute patches as soon as
possible.

Anyway, I'd like to focus on a different aspect: I think this solution
can manage cache coherent DMA, i.e. devices which guarantee coherency
using a cache snooping mechanism. However, how can I manage devices
which need contiguous memory and don't guarantee cache coherency? If
the device doesn't implement sg functionality, I can't allocate buffers
greater than 4MB because I can use neither dma_alloc_coherent() nor
direct access to CMA (well, actually I can use dma_alloc_coherent(),
but it sounds a little bit confusing).

Do you think that dma_alloc_coherent() can be used with this type of
device as well? Do you think that a new dma_alloc_contiguous() function
would help in this case?
Maybe my interpretation of dma_alloc_coherent() is not correct, and the
coherency can be managed using the dma_sync_single_for_* functions
without requiring a hardware mechanism.

Thank you.
Cheers

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-05-20 21:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-18 20:56 [RFC] arm: DMA-API contiguous cacheable memory Lorenzo Nava
2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
2015-05-20 12:57 ` Lorenzo Nava
2015-05-20 16:20 ` Russell King - ARM Linux
2015-05-20 21:49 ` Lorenzo Nava
2015-05-19 22:09 ` Lorenzo Nava