* [RFC] arm: DMA-API contiguous cacheable memory
@ 2015-05-18 20:56 Lorenzo Nava
2015-05-19 16:34 ` Catalin Marinas
0 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-18 20:56 UTC (permalink / raw)
To: linux-arm-kernel
Hello,
it's been a while since I started working with DMA on ARM processors
for a smart camera project. Typically the requirement is to have a
large memory area which can be accessed by both the DMA engine and
user space. I've already noticed that many people wonder about the
best way to have data received via DMA mapped in user space and, more
importantly, mapped in a cacheable area of memory. Having a
memory-mapped region which is cacheable is very important if the user
must access the data and do some sort of processing on it.
My question is: why don't we introduce a function in the DMA-API
interface for ARM processors which allows allocating a contiguous,
cacheable area of memory (> 4MB)?
This new function could take advantage of the CMA mechanism just as
dma_alloc_coherent() does, but use different PTE attributes for the
allocated pages. Basically, writing a function similar to
arm_dma_alloc() that sets the attributes differently would do the
trick:
pgprot_t prot = __pgprot_modify(prot, L_PTE_MT_MASK,
                                L_PTE_MT_WRITEALLOC | L_PTE_XN);
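To make the idea concrete, here is a rough sketch of what such an
allocator could look like (the name dma_alloc_cacheable() is invented,
and error handling and attribute plumbing are omitted;
dma_alloc_from_contiguous() is the existing CMA entry point):

        static void *dma_alloc_cacheable(struct device *dev, size_t size,
                                         dma_addr_t *handle, gfp_t gfp)
        {
                unsigned long count = PAGE_ALIGN(size) >> PAGE_SHIFT;
                struct page *page;

                /* grab physically contiguous pages from the CMA region */
                page = dma_alloc_from_contiguous(dev, count, get_order(size));
                if (!page)
                        return NULL;

                *handle = pfn_to_dma(dev, page_to_pfn(page));
                /* lowmem CMA pages are already mapped cacheable in the
                 * kernel linear mapping, so they can be returned directly */
                return page_address(page);
        }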
Of course this is very important for ARM processors, as the page
attributes must be consistent among the different mappings of the same
physical memory, so this modification should eventually affect only
contiguous cacheable memory areas.
This would also be an improvement for the V4L2 interface which, for
buffers larger than 4MB, is currently forced to use non-cacheable
memory (with vb2_dma_contig_memops). Performance is very poor if users
have to deal with non-cacheable memory while doing image processing.
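For reference, this is roughly the queue setup that puts a capture
driver on that path (a trimmed fragment; the surrounding queue
initialisation is omitted):

        /* videobuf2 queue wired to the DMA-contig allocator; its
         * allocation ends up in dma_alloc_coherent(), hence
         * non-cacheable memory on non-coherent ARM systems */
        q->type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        q->io_modes = VB2_MMAP;
        q->mem_ops = &vb2_dma_contig_memops;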
Any comment would be very much appreciated.
Thanks.
Cheers.
^ permalink raw reply [flat|nested] 11+ messages in thread

* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-18 20:56 [RFC] arm: DMA-API contiguous cacheable memory Lorenzo Nava
@ 2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Lorenzo Nava
0 siblings, 2 replies; 11+ messages in thread
From: Catalin Marinas @ 2015-05-19 16:34 UTC (permalink / raw)
To: linux-arm-kernel

On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
> it's been a while since I started working with DMA on ARM processors
> for a smart camera project. Typically the requirement is to have a
> large memory area which can be accessed by both the DMA engine and
> user space. I've already noticed that many people wonder about the
> best way to have data received via DMA mapped in user space and, more
> importantly, mapped in a cacheable area of memory. Having a
> memory-mapped region which is cacheable is very important if the user
> must access the data and do some sort of processing on it.
> My question is: why don't we introduce a function in the DMA-API
> interface for ARM processors which allows allocating a contiguous,
> cacheable area of memory (> 4MB)?
> This new function could take advantage of the CMA mechanism just as
> dma_alloc_coherent() does, but use different PTE attributes for the
> allocated pages. Basically, writing a function similar to
> arm_dma_alloc() that sets the attributes differently would do the
> trick:
>
> pgprot_t prot = __pgprot_modify(prot, L_PTE_MT_MASK,
>                                 L_PTE_MT_WRITEALLOC | L_PTE_XN);

We already have a way to specify whether a device is coherent via the
"dma-coherent" DT property. This allows the correct dma_map_ops to be
set for a device. For cache coherent devices, arm_coherent_dma_alloc()
and __dma_alloc() should return cacheable memory.

However, looking at the code, it seems that __dma_alloc() does not use
the CMA when is_coherent == true, though you would hit a limit on the
number of pages that can be allocated.

As for mmap'ing to user space, there is arm_dma_mmap(). This one sets
the vm_page_prot to what __get_dma_pgprot() returns, which is always
non-cacheable.

I haven't checked the history of cache coherent DMA support on arm but
I think some of the above can be changed. As an example, on arm64
__dma_alloc() allocates from CMA independent of whether the device is
coherent or not. Also __get_dma_pgprot() returns cacheable attributes
for coherent devices, which in turn allows cacheable user mappings of
such buffers. You don't really need to implement additional functions,
just tweaks to the existing ones.

Patches welcome ;)

--
Catalin

^ permalink raw reply [flat|nested] 11+ messages in thread
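(For context, the "dma-coherent" property mentioned above is just a
flag in the device's DT node - the node below is invented for
illustration:)

        /* hypothetical DMA master marked as cache coherent */
        camera@40000000 {
                compatible = "acme,smart-camera";
                reg = <0x40000000 0x1000>;
                dma-coherent;
        };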
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 16:34 ` Catalin Marinas
@ 2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:09 ` Lorenzo Nava
1 sibling, 2 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:05 UTC (permalink / raw)
To: linux-arm-kernel

Thanks for the answer. I do agree with you on that: I'll take a look at
the arm64 code and I'll be glad to contribute patches as soon as
possible.

Anyway, I'd like to focus on a different aspect: I think this solution
can manage cache coherent DMA, i.e. devices which guarantee coherency
using a cache snooping mechanism. However, how can I manage devices
which need contiguous memory and don't guarantee cache coherency? If
the device doesn't implement sg functionality, I can't allocate buffers
greater than 4MB because I can use neither dma_alloc_coherent() nor
direct access to CMA (well, actually I can use dma_alloc_coherent(),
but it sounds a little bit confusing).

Do you think that dma_alloc_coherent() can be used with this type of
device as well? Do you think that a new dma_alloc_contiguous() function
would help in this case?
Maybe my interpretation of dma_alloc_coherent() is not correct, and the
coherency can be managed using the dma_sync_single_for_* functions
without requiring a hardware mechanism.

Thank you.
Cheers

On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
>> [...]
>
> We already have a way to specify whether a device is coherent via the
> "dma-coherent" DT property. This allows the correct dma_map_ops to be
> set for a device. For cache coherent devices, arm_coherent_dma_alloc()
> and __dma_alloc() should return cacheable memory.
>
> However, looking at the code, it seems that __dma_alloc() does not use
> the CMA when is_coherent == true, though you would hit a limit on the
> number of pages that can be allocated.
>
> As for mmap'ing to user space, there is arm_dma_mmap(). This one sets
> the vm_page_prot to what __get_dma_pgprot() returns, which is always
> non-cacheable.
>
> I haven't checked the history of cache coherent DMA support on arm but
> I think some of the above can be changed. As an example, on arm64
> __dma_alloc() allocates from CMA independent of whether the device is
> coherent or not. Also __get_dma_pgprot() returns cacheable attributes
> for coherent devices, which in turn allows cacheable user mappings of
> such buffers. You don't really need to implement additional functions,
> just tweaks to the existing ones.
>
> Patches welcome ;)
>
> --
> Catalin

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
1 sibling, 0 replies; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-19 22:09 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:05:54AM +0200, Lorenzo Nava wrote:
> Maybe my interpretation of dma_alloc_coherent() is not correct, and the
> coherency can be managed using the dma_sync_single_for_* functions
> without requiring a hardware mechanism.

dma_sync_single_for_* are only for use with the streaming DMA API,
where you must have already mapped the buffer using one of the
dma_map_* functions.

Anything other than that is abusing the API.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
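(To make the streaming usage concrete, a minimal sketch - the device
pointer, buffer size and transfer direction are illustrative:)

        static int grab_frame(struct device *dev, size_t len)
        {
                struct page *pg = alloc_pages(GFP_KERNEL, get_order(len));
                dma_addr_t dma;
                void *buf;

                if (!pg)
                        return -ENOMEM;
                buf = page_address(pg);

                /* hand the buffer to the device */
                dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
                if (dma_mapping_error(dev, dma)) {
                        __free_pages(pg, get_order(len));
                        return -ENOMEM;
                }

                /* ... start the DMA transfer and wait for completion ... */

                /* hand the buffer back to the CPU before reading it */
                dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
                /* ... CPU reads/processes buf ... */

                dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
                __free_pages(pg, get_order(len));
                return 0;
        }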
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
1 sibling, 2 replies; 11+ messages in thread
From: Arnd Bergmann @ 2015-05-19 22:14 UTC (permalink / raw)
To: linux-arm-kernel

On Wednesday 20 May 2015 00:05:54 Lorenzo Nava wrote:
> On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
> >> [...]
> >
> > We already have a way to specify whether a device is coherent via the
> > "dma-coherent" DT property. This allows the correct dma_map_ops to be
> > set for a device. For cache coherent devices, arm_coherent_dma_alloc()
> > and __dma_alloc() should return cacheable memory.

That is not what Lorenzo was asking about though.

> > [...]
>
> Thanks for the answer. I do agree with you on that: I'll take a look at
> the arm64 code and I'll be glad to contribute patches as soon as
> possible.
>
> Anyway, I'd like to focus on a different aspect: I think this solution
> can manage cache coherent DMA, i.e. devices which guarantee coherency
> using a cache snooping mechanism. However, how can I manage devices
> which need contiguous memory and don't guarantee cache coherency? If
> the device doesn't implement sg functionality, I can't allocate buffers
> greater than 4MB because I can use neither dma_alloc_coherent() nor
> direct access to CMA (well, actually I can use dma_alloc_coherent(),
> but it sounds a little bit confusing).

So you have a device that is not cache-coherent, and you want to
allocate cacheable memory and manage coherency manually.

This is normally done using alloc_pages() and dma_map_single(), but as
you have realized, that does not use the CMA area.

> Do you think that dma_alloc_coherent() can be used with this type of
> device as well? Do you think that a new dma_alloc_contiguous() function
> would help in this case?
> Maybe my interpretation of dma_alloc_coherent() is not correct, and the
> coherency can be managed using the dma_sync_single_for_* functions
> without requiring a hardware mechanism.

I believe dma_alloc_attrs() is the interface you want, with attributes
DMA_ATTR_FORCE_CONTIGUOUS and DMA_ATTR_NON_CONSISTENT. I don't know if
that is already implemented on arm64, but this is something that can
definitely be done.

With that memory, you should be able to use the normal streaming API
(dma_sync_single_for_*). There is an older interface called
dma_alloc_noncoherent(), but that cannot be easily implemented on ARM.

Arnd

^ permalink raw reply [flat|nested] 11+ messages in thread
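(A sketch of the call sequence Arnd suggests, using the dma_attrs
interface as it existed at the time; the size is invented, and whether
syncing such a buffer is legal is exactly what is disputed below:)

        DEFINE_DMA_ATTRS(attrs);
        dma_addr_t dma;
        void *buf;

        /* ask for physically contiguous memory without the coherency
         * guarantee, i.e. cacheable pages */
        dma_set_attr(DMA_ATTR_FORCE_CONTIGUOUS, &attrs);
        dma_set_attr(DMA_ATTR_NON_CONSISTENT, &attrs);

        buf = dma_alloc_attrs(dev, SZ_8M, &dma, GFP_KERNEL, &attrs);
        if (!buf)
                return -ENOMEM;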
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:14 ` Arnd Bergmann
@ 2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
1 sibling, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:27 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:14 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Wednesday 20 May 2015 00:05:54 Lorenzo Nava wrote:
>> [...]
>
> So you have a device that is not cache-coherent, and you want to
> allocate cacheable memory and manage coherency manually.
>
> This is normally done using alloc_pages() and dma_map_single(), but as
> you have realized, that does not use the CMA area.
>
>> [...]
>
> I believe dma_alloc_attrs() is the interface you want, with attributes
> DMA_ATTR_FORCE_CONTIGUOUS and DMA_ATTR_NON_CONSISTENT. I don't know if
> that is already implemented on arm64, but this is something that can
> definitely be done.
>
> With that memory, you should be able to use the normal streaming API
> (dma_sync_single_for_*). There is an older interface called
> dma_alloc_noncoherent(), but that cannot be easily implemented on ARM.
>
> Arnd

Yes, this is exactly the point. Currently this interface is only used
through dma_alloc_coherent() (which actually calls dma_alloc_attrs()).
The function, anyway, is not available in the Linux DMA API, but I
think it could be useful for managing some kinds of devices (see my
previous mail). What do you think would be the best way to access
dma_alloc_attrs() from a device driver? Call the function directly?

Thank you.
Lorenzo

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
@ 2015-05-19 22:34 ` Russell King - ARM Linux
2015-05-20 12:57 ` Lorenzo Nava
1 sibling, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-19 22:34 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:14:48AM +0200, Arnd Bergmann wrote:
> With that memory, you should be able to use the normal streaming API
> (dma_sync_single_for_*).

Wrong, as I've pointed out previously. The only memory you're allowed
to sync is memory which has been mapped with a dma_map_*() function.

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 22:34 ` Russell King - ARM Linux
@ 2015-05-20 12:57 ` Lorenzo Nava
2015-05-20 16:20 ` Russell King - ARM Linux
0 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-20 12:57 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 12:34 AM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, May 20, 2015 at 12:14:48AM +0200, Arnd Bergmann wrote:
>> With that memory, you should be able to use the normal streaming API
>> (dma_sync_single_for_*).
>
> Wrong, as I've pointed out previously. The only memory you're allowed
> to sync is memory which has been mapped with a dma_map_*() function.
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

Russell,
so it's probably currently impossible to allocate contiguous cacheable
DMA memory. You can't use CMA, and the only functions which allow you
to use it are not compatible with the sync functions.
Do you think the problem is the CMA design, the DMA API design, or that
there is no problem at all and this is not something useful?

Anyway, it's not completely clear to me what the difference is between:
- allocating memory and using the sync functions on memory mapped with
  dma_map_*()
- allocating memory with dma_alloc_*() (with cacheable attributes) and
  using the sync functions on it

It looks like the second just does alloc + map in a single step instead
of splitting the operation into two steps.
I'm sure I'm missing something; can you please help me understand?

Thanks.
Lorenzo

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-20 12:57 ` Lorenzo Nava
@ 2015-05-20 16:20 ` Russell King - ARM Linux
2015-05-20 21:49 ` Lorenzo Nava
0 siblings, 1 reply; 11+ messages in thread
From: Russell King - ARM Linux @ 2015-05-20 16:20 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 02:57:36PM +0200, Lorenzo Nava wrote:
> so it's probably currently impossible to allocate contiguous cacheable
> DMA memory. You can't use CMA, and the only functions which allow you
> to use it are not compatible with the sync functions.
> Do you think the problem is the CMA design, the DMA API design, or that
> there is no problem at all and this is not something useful?

Well, the whole issue of DMA from userspace is a fraught topic. I
consider what we have at the moment to be more luck than anything
else - there are architecture maintainers who'd like to see dma_mmap_*
be deleted from the kernel.

However, I have a problem with what you're trying to do.

You want to allocate a large chunk of memory for DMA. Large chunks of
memory can _only_ come from CMA - the standard Linux allocators do
_not_ cope well with large allocations. Even 16K allocations can
become difficult after the system has been running for a while. So,
CMA is really the only way to go to obtain large chunks of memory.

You want this large chunk of memory to be cacheable. CMA might be able
to provide that.

You want to DMA to this memory, and then read from it. The problem
there is: how do you ensure that the data you're reading is the data
that the DMA wrote there? If you have caching enabled, the caching
model that we _have_ to assume is that the cache is infinite, and that
it speculates aggressively. This means that we cannot guarantee that
any data read through a cacheable mapping will be coherent with the
DMA'd data.

So, we have to flush the cache. The problem is that with an infinite
cache size model, we have to flush all possible lines associated with
the buffer, because we don't know which might be in the cache and which
are not.

Of course, caches are finite, and we can say that if the size of the
region being flushed is greater than the cache size (or a multiple of
the cache size), we _could_ just flush the entire cache instead. (This
can only work for non-SG stuff, as we don't know beforehand how large
the SG is in bytes.)

However, here's the problem. As I mentioned above, we have the
dma_mmap_* stuff, which works for memory allocated by
dma_alloc_coherent(). The only reason mapping that memory into
userspace works is because (for the non-coherent cache case) we map it
in such a way that the caches are disabled, and this works fine. For
the coherent cache case, it doesn't matter that we map it with the
caches enabled. So both of these work.

When you have a non-coherent cache _and_ you want the mapping to be
cacheable, you have extra problems to worry about. You need to know
the type of the CPU cache. If the CPU cache is physically indexed,
physically tagged, then you can perform cache maintenance on any
mapping of that memory, and you will hit the appropriate cache lines.
For other types of caches, this is not true. Hence, a userspace
mapping of non-coherent cacheable memory with a cache which makes use
of virtual addresses would need to be flushed at the virtual aliases -
this is precisely why kernel arch maintainers don't like DMA from
userspace. It brings with it huge problems.

Thankfully, ARMv7 caches are PIPT - but that doesn't really give us
"permission" to just consider PIPT for this case, especially for
something which is used between arch code and driver code.

What I'm trying to say is that what you're asking for is not a simple
issue - it needs lots of thought and consideration, more than I have
time to spare (or am likely to have time to spare in the future; _most_
of my time is wasted trying to deal with the flood of email from these
mailing lists rather than doing any real work - even non-relevant email
has a non-zero time cost, as it takes a certain amount of time to
decide whether an email is relevant or not.)

> Anyway, it's not completely clear to me what the difference is between:
> - allocating memory and using the sync functions on memory mapped with
>   dma_map_*()
> - allocating memory with dma_alloc_*() (with cacheable attributes) and
>   using the sync functions on it

Let me say _for the third time_: dma_sync_*() on memory returned from
dma_alloc_*() is not permitted. Anyone who tells you different is just
plain wrong, and is telling you to do something which is _not_
supported by the API, and _will_ fail with some implementations,
including the ARM implementation if it uses the atomic pool to satisfy
your allocation.

> It looks like the second just does alloc + map in a single step instead
> of splitting the operation into two steps.
> I'm sure I'm missing something; can you please help me understand?

The problem is that you're hitting two different costs: the cost of
accessing data via an uncacheable mapping, vs the cost of having to do
cache maintenance to ensure that you're reading up-to-date data.

At the end of the day, there's only one truth here: large DMA buffers
on architectures which are not cache-coherent suck and require a
non-zero cost to ensure that you can read the data written to the
buffer by DMA, or that DMA can see the data you have written to the
buffer.

The final thing to mention is that the ARM cache maintenance
instructions are not available in userspace, so you can't have
userspace taking care of flushing the caches where they need to...

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
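(For reference, the dma_mmap_* path mentioned above is typically wired
into a driver's mmap handler roughly like this - the mydrv structure
and its fields are invented, and buf/dma_handle/size would come from a
dma_alloc_coherent() call:)

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                struct mydrv *drv = file->private_data;

                /* maps the coherent buffer into the calling process */
                return dma_mmap_coherent(drv->dev, vma, drv->buf,
                                         drv->dma_handle, drv->size);
        }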
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-20 16:20 ` Russell King - ARM Linux
@ 2015-05-20 21:49 ` Lorenzo Nava
0 siblings, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-20 21:49 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, May 20, 2015 at 6:20 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, May 20, 2015 at 02:57:36PM +0200, Lorenzo Nava wrote:
>> [...]
>
> Well, the whole issue of DMA from userspace is a fraught topic. I
> consider what we have at the moment to be more luck than anything
> else - there are architecture maintainers who'd like to see dma_mmap_*
> be deleted from the kernel.

Well, sometimes mmap can avoid unnecessary memory copies and boost
performance. Of course it must be carefully managed to avoid big
problems.

> However, I have a problem with what you're trying to do.
>
> [...]
>
> When you have a non-coherent cache _and_ you want the mapping to be
> cacheable, you have extra problems to worry about. You need to know
> the type of the CPU cache. If the CPU cache is physically indexed,
> physically tagged, then you can perform cache maintenance on any
> mapping of that memory, and you will hit the appropriate cache lines.
> For other types of caches, this is not true. Hence, a userspace
> mapping of non-coherent cacheable memory with a cache which makes use
> of virtual addresses would need to be flushed at the virtual aliases -
> this is precisely why kernel arch maintainers don't like DMA from
> userspace. It brings with it huge problems.
>
> Thankfully, ARMv7 caches are PIPT - but that doesn't really give us
> "permission" to just consider PIPT for this case, especially for
> something which is used between arch code and driver code.

CPU cache type is an extremely interesting subject which, honestly, I
didn't consider.

> What I'm trying to say is that what you're asking for is not a simple
> issue - it needs lots of thought and consideration, more than I have
> time to spare [...]

And let me thank you for this explanation and for sharing your
knowledge, which is really helping me.

>> [...]
>
> Let me say _for the third time_: dma_sync_*() on memory returned from
> dma_alloc_*() is not permitted. Anyone who tells you different is just
> plain wrong, and is telling you to do something which is _not_
> supported by the API, and _will_ fail with some implementations,
> including the ARM implementation if it uses the atomic pool to satisfy
> your allocation.

Ok, got it. Sync functions on memory from dma_alloc_*() are very
bad :-)

>> [...]
>
> The problem is that you're hitting two different costs: the cost of
> accessing data via an uncacheable mapping, vs the cost of having to do
> cache maintenance to ensure that you're reading up-to-date data.
>
> At the end of the day, there's only one truth here: large DMA buffers
> on architectures which are not cache-coherent suck and require a
> non-zero cost to ensure that you can read the data written to the
> buffer by DMA, or that DMA can see the data you have written to the
> buffer.
>
> The final thing to mention is that the ARM cache maintenance
> instructions are not available in userspace, so you can't have
> userspace taking care of flushing the caches where they need to...

You're right. This is the crucial point: you can't guarantee that the
accessed data is correct at any given time unless you know how things
work at the kernel level. Basically the only way is to have a sort of
synchronisation between user and kernel to be sure that the accessed
data is actually up to date. The solution could be to implement a
mechanism that doesn't make data available to the user until cache
coherence has been correctly managed. To be honest, V4L implements
exactly that mechanism: buffers are queued and made available to the
user with mmap once the grab process is completed, and cache coherence
can then be guaranteed.

I'm a little bit disappointed that using CMA with non-coherent memory
is not currently possible, as this is something that could be useful
when the developer is able to manage cache coherence (and doesn't have
sg available). I hoped that the "bigphysarea" patch would be forever
forgotten and replaced by CMA, but it doesn't look like that is really
possible yet.

Thanks.
Lorenzo

> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.

^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC] arm: DMA-API contiguous cacheable memory
2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
@ 2015-05-19 22:09 ` Lorenzo Nava
1 sibling, 0 replies; 11+ messages in thread
From: Lorenzo Nava @ 2015-05-19 22:09 UTC (permalink / raw)
To: linux-arm-kernel

On Tue, May 19, 2015 at 6:34 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Mon, May 18, 2015 at 10:56:06PM +0200, Lorenzo Nava wrote:
>> [...]
>
> [...]
>
> Patches welcome ;)
>
> --
> Catalin

Thanks for the answer. I do agree with you on that: I'll take a look at
the arm64 code and I'll be glad to contribute patches as soon as
possible.

Anyway, I'd like to focus on a different aspect: I think this solution
can manage cache coherent DMA, i.e. devices which guarantee coherency
using a cache snooping mechanism. However, how can I manage devices
which need contiguous memory and don't guarantee cache coherency? If
the device doesn't implement sg functionality, I can't allocate buffers
greater than 4MB because I can use neither dma_alloc_coherent() nor
direct access to CMA (well, actually I can use dma_alloc_coherent(),
but it sounds a little bit confusing).

Do you think that dma_alloc_coherent() can be used with this type of
device as well? Do you think that a new dma_alloc_contiguous() function
would help in this case?
Maybe my interpretation of dma_alloc_coherent() is not correct, and the
coherency can be managed using the dma_sync_single_for_* functions
without requiring a hardware mechanism.

Thank you.
Cheers

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-05-20 21:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-18 20:56 [RFC] arm: DMA-API contiguous cacheable memory Lorenzo Nava
2015-05-19 16:34 ` Catalin Marinas
2015-05-19 22:05 ` Lorenzo Nava
2015-05-19 22:09 ` Russell King - ARM Linux
2015-05-19 22:14 ` Arnd Bergmann
2015-05-19 22:27 ` Lorenzo Nava
2015-05-19 22:34 ` Russell King - ARM Linux
2015-05-20 12:57 ` Lorenzo Nava
2015-05-20 16:20 ` Russell King - ARM Linux
2015-05-20 21:49 ` Lorenzo Nava
2015-05-19 22:09 ` Lorenzo Nava