* Uncached buffers from CMA DMA heap on some Arm devices?
@ 2024-01-24 18:27 Milan Zamazal
  2024-01-25 11:41 ` Lucas Stach
  0 siblings, 1 reply; 11+ messages in thread
From: Milan Zamazal @ 2024-01-24 18:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue,
	Andrey Konovalov, Pavel Machek, Maxime Ripard, Laurent Pinchart,
	kieran.bingham, Hans de Goede

Hello,

in the libcamera project, we are seeing a major performance problem related to
DMA buffers while processing camera images on the CPU.  This happens only on
some Arm boards; we have observed it on the Debix Model A (NXP i.MX 8M Plus)
and the PinePhone.  We use the /dev/dma_heap/linux,cma (or reserved) DMA buffer
heap on Arm.

Reading V4L2 camera data from the buffers is very slow.  When we memcpy the data
from the buffer to malloc'ed memory before working with it (reading each byte
multiple times, without any big non-sequential jumps across the data), we get
a speedup of more than 10 times.  It looks like the input buffer is uncached.

We also experience a slowdown when writing to output buffers.  It doesn't seem
to matter whether we write to the output byte-by-byte or memcpy larger chunks.

We are having trouble understanding what the problem is with the buffers on some
hardware and what we can realistically do about it.  Could you please help us
clarify this?  Is it possible to force the DMA buffer CMA heap to be cached?
Or is there anything else we can do or try?

Thank you,
Milan



* Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-24 18:27 Uncached buffers from CMA DMA heap on some Arm devices? Milan Zamazal
@ 2024-01-25 11:41 ` Lucas Stach
  2024-01-26 11:22   ` Milan Zamazal
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Lucas Stach @ 2024-01-25 11:41 UTC (permalink / raw)
  To: Milan Zamazal, Christoph Hellwig
  Cc: iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue,
	Andrey Konovalov, Pavel Machek, Maxime Ripard, Laurent Pinchart,
	kieran.bingham, Hans de Goede

Hi Milan,

On Wednesday, 24.01.2024 at 19:27 +0100, Milan Zamazal wrote:
> Hello,
> 
> in the libcamera project, we experience a major performance problem related to
> DMA buffers while working on camera image processing using CPU.  This happens
> only with some Arm boards, we have observed it on Debix Model A (NXP i.MX 8M
> Plus) and PinePhone.  We use /dev/dma_heap/linux,cma (or reserved) DMA buffer
> heap on Arm.
> 
> Reading V4L2 camera data from buffers is very slow.  When we memcpy the data
> from the buffer to a malloc'ed memory before working with it (reading each byte
> multiple times, without any big non-sequential jumps across the data), we get
> more than 10 times speed up.  It looks like the input buffer is uncached.
> 
That's right and a reality you have to deal with on those small ARM
systems. The ARM architecture allows for systems that don't enforce
hardware coherency across the whole SoC and many of the small/cheap SoC
variants make use of this architectural feature.

What this means is that the CPU caches aren't coherent when it comes to
DMA from other masters like the video capture units. There are two ways
to enforce DMA coherency on such systems:
1. map the DMA buffers uncached on the CPU
2. require explicit cache maintenance when touching DMA buffers with
the CPU

Option 1 is what you see happening in your setup, as it is simple,
straightforward and doesn't require any synchronization points.

Option 2 could be implemented by allocating cached DMA buffers in the
V4L2 device and then executing the necessary cache synchronization in
qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
master. However, this isn't guaranteed to be any faster, as the cache
synchronization itself is a pretty heavy-weight operation when you are
dealing with buffers that are potentially multiple megabytes in size.

> We experience slow down also when writing to output buffers.  It doesn't seem to
> matter whether we write to the output byte-by-byte or memcpy larger chunks.
> 
For DMA coherency it's sufficient to map the DMA buffers as write-
combined, which should at least give you okay-ish write performance,
depending on the specific micro-architecture of your system.

> We are having trouble to understand what's the problem with the buffers on some
> hardware and what we can realistically do about it.  Could you please help us
> clarify this?  Is it possible to force the DMA buffer CMA heap to be cached?
> Or is there anything else we can do or try?

See above. You can work with cached buffers, but that is moving the
cost elsewhere and is not guaranteed to yield better performance. There
is no panacea on systems that don't enforce coherency at the hardware
level.

When working on uncached buffers directly, your best option is to try
to access the buffers in as large chunks as possible, using vector
loads or similar facilities. You certainly don't want to access a
single memory location multiple times. If that is what your algorithm
requires then copying the content into a cached buffer might be your
best option, as it might have similar performance to explicit cache
maintenance on cached DMA buffers and doesn't require another
maintenance operation when transitioning the buffer back to DMA master
ownership.

Regards,
Lucas


* Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-25 11:41 ` Lucas Stach
@ 2024-01-26 11:22   ` Milan Zamazal
  2024-01-26 12:19     ` Maxime Ripard
  2024-01-26 12:17   ` Maxime Ripard
  2024-01-29 10:23   ` Pavel Machek
  2 siblings, 1 reply; 11+ messages in thread
From: Milan Zamazal @ 2024-01-26 11:22 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Christoph Hellwig, iommu, Will Deacon, catalin.marinas,
	Bryan O'Donoghue, Andrey Konovalov, Pavel Machek,
	Maxime Ripard, Laurent Pinchart, kieran.bingham, Hans de Goede

Lucas Stach <dev@lynxeye.de> writes:

> Hi Milan,
>
> On Wednesday, 24.01.2024 at 19:27 +0100, Milan Zamazal wrote:
>> Hello,
>> 
>> in the libcamera project, we experience a major performance problem related to
>> DMA buffers while working on camera image processing using CPU.  This happens
>> only with some Arm boards, we have observed it on Debix Model A (NXP i.MX 8M
>> Plus) and PinePhone.  We use /dev/dma_heap/linux,cma (or reserved) DMA buffer
>> heap on Arm.
>> 
>> Reading V4L2 camera data from buffers is very slow.  When we memcpy the data
>> from the buffer to a malloc'ed memory before working with it (reading each byte
>> multiple times, without any big non-sequential jumps across the data), we get
>> more than 10 times speed up.  It looks like the input buffer is uncached.
>> 
> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.

Hi Lucas,

thank you for the explanation.  It mostly confirms that the limitations we
suspected are unavoidable, but in any case it's good to know for sure whether
there is any hope or not. :-)

> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
> the CPU
>
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.
>
> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. 

Do I understand it right that "could be implemented" applies to kernel code and
there are currently no facilities there that would allow experimenting with such
an approach from user space?

> However this isn't guaranteed to be any faster, as the cache synchronization
> itself is a pretty heavy-weight operation when you are dealing with buffer
> that are potentially multi-megabytes in size.

Yes, it would be best to measure it if the mechanism was available.

>> We experience slow down also when writing to output buffers.  It doesn't seem to
>> matter whether we write to the output byte-by-byte or memcpy larger chunks.
>> 
> For DMA coherency it's sufficient to map the DMA buffers as write-
> combined, which should at least give you okay-ish write performance,
> depending on the specific micro-architecture of your system.

OK.

>> We are having trouble to understand what's the problem with the buffers on some
>> hardware and what we can realistically do about it.  Could you please help us
>> clarify this?  Is it possible to force the DMA buffer CMA heap to be cached?
>> Or is there anything else we can do or try?
>
> See above. You can work with cached buffers, but that is moving the
> cost elsewhere and is not guaranteed to yield better performance. There
> is no panacea on systems that don't enforce coherency at the hardware
> level.
>
> When working on uncached buffers directly, your best option is to try
> to access the buffers in as large chunks as possible, using vector
> loads or similar facilities. You certainly don't want to access a
> single memory location multiple times. If that is what your algorithm
> requires then copying the content into a cached buffer might be your
> best option, as it might have similar performance to explicit cache
> maintenance on cached DMA buffers and doesn't require another
> maintenance operation when transitioning the buffer back to DMA master
> ownership.

What works best for us is copying and processing the camera data roughly
line by line; a line is a chunk large enough to copy efficiently while
still being small enough to fit into the CPU caches.
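
A minimal sketch of that line-by-line approach, not libcamera's actual code.
The names process_by_lines, line_fn and sum_line are hypothetical, and it
assumes that a plain memcpy() from the mapped buffer works on the platform in
question. Each source byte is read from the slow mapping exactly once; all
repeated reads hit the cached scratch line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-line callback: works on a cached copy of one line. */
typedef void (*line_fn)(const uint8_t *line, size_t width, void *ctx);

/*
 * Copy each image line out of the (possibly uncached) source exactly once,
 * then let the callback read the cached copy as often as it needs.
 * `scratch` must hold at least `width` bytes; `stride` is the distance in
 * bytes between the starts of two consecutive lines in `src`.
 */
void process_by_lines(const uint8_t *src, size_t stride, size_t width,
                      size_t height, uint8_t *scratch,
                      line_fn fn, void *ctx)
{
    assert(stride >= width);
    for (size_t y = 0; y < height; y++) {
        memcpy(scratch, src + y * stride, width);
        fn(scratch, width, ctx);
    }
}

/* Example callback (illustrative only): sum all bytes of the image. */
void sum_line(const uint8_t *line, size_t width, void *ctx)
{
    uint64_t *sum = ctx;
    for (size_t x = 0; x < width; x++)
        *sum += line[x];
}
```

The scratch buffer is reused for every line, so it stays resident in the CPU
cache for the whole frame.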

Regards,
Milan



* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-25 11:41 ` Lucas Stach
  2024-01-26 11:22   ` Milan Zamazal
@ 2024-01-26 12:17   ` Maxime Ripard
  2024-01-29 12:05     ` Laurent Pinchart
  2024-01-29 10:23   ` Pavel Machek
  2 siblings, 1 reply; 11+ messages in thread
From: Maxime Ripard @ 2024-01-26 12:17 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Milan Zamazal, Christoph Hellwig, iommu, Will Deacon,
	catalin.marinas, Bryan O'Donoghue, Andrey Konovalov,
	Pavel Machek, Laurent Pinchart, kieran.bingham, Hans de Goede


Hi Lucas,

On Thu, Jan 25, 2024 at 12:41:01PM +0100, Lucas Stach wrote:
> On Wednesday, 24.01.2024 at 19:27 +0100, Milan Zamazal wrote:
> > Hello,
> > 
> > in the libcamera project, we experience a major performance problem related to
> > DMA buffers while working on camera image processing using CPU.  This happens
> > only with some Arm boards, we have observed it on Debix Model A (NXP i.MX 8M
> > Plus) and PinePhone.  We use /dev/dma_heap/linux,cma (or reserved) DMA buffer
> > heap on Arm.
> > 
> > Reading V4L2 camera data from buffers is very slow.  When we memcpy the data
> > from the buffer to a malloc'ed memory before working with it (reading each byte
> > multiple times, without any big non-sequential jumps across the data), we get
> > more than 10 times speed up.  It looks like the input buffer is uncached.
> > 
> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.
> 
> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
> the CPU
> 
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.
> 
> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. However this isn't guaranteed to be any faster, as the cache
> synchronization itself is a pretty heavy-weight operation when you are
> dealing with buffer that are potentially multi-megabytes in size.

My understanding was that the CMA DMA heap already allocates
cacheable buffers, with the expectation that you need to call the
dma-buf cache management ioctl. Is that not the case?

Maxime


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-26 11:22   ` Milan Zamazal
@ 2024-01-26 12:19     ` Maxime Ripard
  0 siblings, 0 replies; 11+ messages in thread
From: Maxime Ripard @ 2024-01-26 12:19 UTC (permalink / raw)
  To: Milan Zamazal
  Cc: Lucas Stach, Christoph Hellwig, iommu, Will Deacon,
	catalin.marinas, Bryan O'Donoghue, Andrey Konovalov,
	Pavel Machek, Laurent Pinchart, kieran.bingham, Hans de Goede


On Fri, Jan 26, 2024 at 12:22:30PM +0100, Milan Zamazal wrote:
> > However this isn't guaranteed to be any faster, as the cache synchronization
> > itself is a pretty heavy-weight operation when you are dealing with buffer
> > that are potentially multi-megabytes in size.
> 
> Yes, it would be best to measure it if the mechanism was available.

AFAIK, perf exposes all kinds of metrics related to cache management. It
would be a good idea to measure all our scenarios with perf and see what
comes up.

Maxime


* Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-25 11:41 ` Lucas Stach
  2024-01-26 11:22   ` Milan Zamazal
  2024-01-26 12:17   ` Maxime Ripard
@ 2024-01-29 10:23   ` Pavel Machek
  2024-01-29 10:32     ` Maxime Ripard
  2 siblings, 1 reply; 11+ messages in thread
From: Pavel Machek @ 2024-01-29 10:23 UTC (permalink / raw)
  To: Lucas Stach, kernel list
  Cc: Milan Zamazal, Christoph Hellwig, iommu, Will Deacon,
	catalin.marinas, Bryan O'Donoghue, Andrey Konovalov,
	Maxime Ripard, Laurent Pinchart, kieran.bingham, Hans de Goede


Hi!

> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.
> 
> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
> the CPU
> 
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.

Yeah, and it also does not work :-).

Userspace gets the buffers, and it is not really equipped to work with
them. For example, on the PinePhone, memcpy() crashes on uncached
memory. I'm pretty sure a user could have some kind of kernel-crashing
fun by passing the uncached memory to futex or something similar.

> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. However this isn't guaranteed to be any faster, as the cache
> synchronization itself is a pretty heavy-weight operation when you are
> dealing with buffer that are potentially multi-megabytes in size.

Yes, cache synchronization can be slow, but IIRC it was on the order of
a millisecond in the worst case, and copying megabyte images is still
slower than that.

Note that it is faster to do read/write syscalls than to deal with
uncached memory. And userspace can't simply flush the caches and remap
memory as cached.

v4l2 moved away from read/write "because it is slow" and switched to
an interface that is even slower than that. And libcamera exposes
uncached memory to the user :-(.

Best regards,
								Pavel
-- 
People of Russia, stop Putin before his war on Ukraine escalates.


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-29 10:23   ` Pavel Machek
@ 2024-01-29 10:32     ` Maxime Ripard
  2024-01-29 12:07       ` Laurent Pinchart
  2024-01-29 18:30       ` Pavel Machek
  0 siblings, 2 replies; 11+ messages in thread
From: Maxime Ripard @ 2024-01-29 10:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu,
	Will Deacon, catalin.marinas, Bryan O'Donoghue,
	Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede


On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> Hi!
> 
> > That's right and a reality you have to deal with on those small ARM
> > systems. The ARM architecture allows for systems that don't enforce
> > hardware coherency across the whole SoC and many of the small/cheap SoC
> > variants make use of this architectural feature.
> > 
> > What this means is that the CPU caches aren't coherent when it comes to
> > DMA from other masters like the video capture units. There are two ways
> > to enforce DMA coherency on such systems:
> > 1. map the DMA buffers uncached on the CPU
> > 2. require explicit cache maintenance when touching DMA buffers with
> > the CPU
> > 
> > Option 1 is what you see is happening in your setup, as it is simple,
> > straight-forward and doesn't require any synchronization points.
> 
> Yeah, and it also does not work :-).
> 
> Userspace gets the buffers, and it is not really equipped to work with
> them. For example, on pinephone, memcpy() crashes on uncached
> memory. I'm pretty sure user could have some kind of kernel-crashing
> fun if he passed the uncached memory to futex or something similar.

Uncached buffers are ubiquitous on arm/arm64 so there must be something
else going on. And there's nothing to equip for, it's just a memory
array you can access in any way you want (but very slowly).

How does it not work?

> > Option 2 could be implemented by allocating cached DMA buffers in the
> > V4L2 device and then executing the necessary cache synchronization in
> > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > master. However this isn't guaranteed to be any faster, as the cache
> > synchronization itself is a pretty heavy-weight operation when you are
> > dealing with buffer that are potentially multi-megabytes in size.
> 
> Yes, cache synchronization can be slow, but IIRC it was on the order of
> a millisecond in the worst case, and copying megabyte images is still
> slower than that.
> 
> Note that it is faster to do read/write syscalls than to deal with
> uncached memory. And userspace can't simply flush the caches and remap
> memory as cached.

You can't change the memory mapping, but you can flush the caches with
dma-buf. It's even required by the dma-buf documentation.

> v4l2 moved away from read/write "because it is slow" and switched to
> interface that is even slower than that. And libcamera exposes
> uncached memory to the user :-(.

There's also the number of copies to consider. If you were to use
read/write to display a frame on a framebuffer, you would use 4 copies
vs 2 with dma-buf.

Maxime


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-26 12:17   ` Maxime Ripard
@ 2024-01-29 12:05     ` Laurent Pinchart
  0 siblings, 0 replies; 11+ messages in thread
From: Laurent Pinchart @ 2024-01-29 12:05 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Lucas Stach, Milan Zamazal, Christoph Hellwig, iommu, Will Deacon,
	catalin.marinas, Bryan O'Donoghue, Andrey Konovalov,
	Pavel Machek, kieran.bingham, Hans de Goede

Hi Maxime,

On Fri, Jan 26, 2024 at 01:17:50PM +0100, Maxime Ripard wrote:
> On Thu, Jan 25, 2024 at 12:41:01PM +0100, Lucas Stach wrote:
> > On Wednesday, 24.01.2024 at 19:27 +0100, Milan Zamazal wrote:
> > > Hello,
> > > 
> > > in the libcamera project, we experience a major performance problem related to
> > > DMA buffers while working on camera image processing using CPU.  This happens
> > > only with some Arm boards, we have observed it on Debix Model A (NXP i.MX 8M
> > > Plus) and PinePhone.  We use /dev/dma_heap/linux,cma (or reserved) DMA buffer
> > > heap on Arm.
> > > 
> > > Reading V4L2 camera data from buffers is very slow.  When we memcpy the data
> > > from the buffer to a malloc'ed memory before working with it (reading each byte
> > > multiple times, without any big non-sequential jumps across the data), we get
> > > more than 10 times speed up.  It looks like the input buffer is uncached.
> > > 
> > That's right and a reality you have to deal with on those small ARM
> > systems. The ARM architecture allows for systems that don't enforce
> > hardware coherency across the whole SoC and many of the small/cheap SoC
> > variants make use of this architectural feature.
> > 
> > What this means is that the CPU caches aren't coherent when it comes to
> > DMA from other masters like the video capture units. There are two ways
> > to enforce DMA coherency on such systems:
> > 1. map the DMA buffers uncached on the CPU
> > 2. require explicit cache maintenance when touching DMA buffers with
> > the CPU
> > 
> > Option 1 is what you see is happening in your setup, as it is simple,
> > straight-forward and doesn't require any synchronization points.
> > 
> > Option 2 could be implemented by allocating cached DMA buffers in the
> > V4L2 device and then executing the necessary cache synchronization in
> > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > master. However this isn't guaranteed to be any faster, as the cache
> > synchronization itself is a pretty heavy-weight operation when you are
> > dealing with buffer that are potentially multi-megabytes in size.
> 
> My understanding was that the CMA DMA Heap is already allocating
> cacheable buffers,

I'll be a bit pedantic here. As far as I understand, the CMA heap
doesn't allocate "cacheable" buffers. It allocates pages, and they are
not inherently cached or uncached. Whether a page is mapped to the CPU
as cached or uncached is a decision made at mapping time. Unless I'm
mistaken, the CMA heap maps pages to userspace cached.

> with the expectation that you need to call the dma-buf cache
> management ioctl. Is it not?

Someone has to manage the cache, yes. It can be done explicitly by
userspace through the dmabuf sync ioctl, or implicitly within the
kernel. For instance, when queueing a dmabuf to a V4L2 device that uses
videobuf2-dma-contig, the QBUF ioctl ends up calling
flush_kernel_vmap_range() and dma_sync_sgtable_for_device() (see
vb2_dc_prepare()). videobuf2-vmalloc, on the other hand, has no cache
handling, which is a known issue when sharing buffers with the display.
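
The explicit userspace path can be sketched as follows, using
DMA_BUF_IOCTL_SYNC from <linux/dma-buf.h>. The wrapper names are hypothetical;
the ioctl, the struct and the flags are the ones the kernel header defines.
CPU reads of the mapped buffer are bracketed by a START and an END call so
that the exporter can perform whatever cache maintenance it needs.

```c
#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Announce that the CPU is about to read from the mapped dma-buf. */
int dmabuf_cpu_read_begin(int dmabuf_fd)
{
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
    };
    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}

/* Announce that the CPU has finished reading from the mapped dma-buf. */
int dmabuf_cpu_read_end(int dmabuf_fd)
{
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ,
    };
    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}
```

For CPU writes, DMA_BUF_SYNC_WRITE (or DMA_BUF_SYNC_RW) would be used in place
of DMA_BUF_SYNC_READ. Both wrappers return -1 with errno set if the
descriptor is not a dma-buf.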

On a side note, the cache handling in videobuf2-dma-contig.c seems
problematic to me, as vb2 shouldn't assume much about imported dmabufs.
It should instead use the operations exposed by dmabuf to delegate cache
handling to the exporter.

-- 
Regards,

Laurent Pinchart


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-29 10:32     ` Maxime Ripard
@ 2024-01-29 12:07       ` Laurent Pinchart
  2024-01-29 13:12         ` Lucas Stach
  2024-01-29 18:30       ` Pavel Machek
  1 sibling, 1 reply; 11+ messages in thread
From: Laurent Pinchart @ 2024-01-29 12:07 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Pavel Machek, Lucas Stach, kernel list, Milan Zamazal,
	Christoph Hellwig, iommu, Will Deacon, catalin.marinas,
	Bryan O'Donoghue, Andrey Konovalov, kieran.bingham,
	Hans de Goede

On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > Hi!
> > 
> > > That's right and a reality you have to deal with on those small ARM
> > > systems. The ARM architecture allows for systems that don't enforce
> > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > variants make use of this architectural feature.
> > > 
> > > What this means is that the CPU caches aren't coherent when it comes to
> > > DMA from other masters like the video capture units. There are two ways
> > > to enforce DMA coherency on such systems:
> > > 1. map the DMA buffers uncached on the CPU
> > > 2. require explicit cache maintenance when touching DMA buffers with
> > > the CPU
> > > 
> > > Option 1 is what you see is happening in your setup, as it is simple,
> > > straight-forward and doesn't require any synchronization points.
> > 
> > Yeah, and it also does not work :-).
> > 
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
> 
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).
> 
> How does it not work?

I agree, this should just work (albeit possibly slowly). A crash is a
sign something needs to be fixed.

> > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > V4L2 device and then executing the necessary cache synchronization in
> > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > master. However this isn't guaranteed to be any faster, as the cache
> > > synchronization itself is a pretty heavy-weight operation when you are
> > > dealing with buffer that are potentially multi-megabytes in size.
> > 
> > > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > > a millisecond in the worst case, and copying megabyte images is still
> > > slower than that.

Those numbers are platform-specific, you can't assume this to be true
everywhere.

> > Note that it is faster to do read/write syscalls than to deal with
> > uncached memory. And userspace can't simply flush the caches and remap
> > memory as cached.
> 
> You can't change the memory mapping, but you can flush the caches with
> dma-buf. It's even required by the dma-buf documentation.
> 
> > v4l2 moved away from read/write "because it is slow" and switched to
> > interface that is even slower than that. And libcamera exposes
> > uncached memory to the user :-(.
> 
> There's also the number of copies to consider. If you were to use
> read/write to display a frame on a framebuffer, you would use 4 copies
> vs 2 with dma-buf.

-- 
Regards,

Laurent Pinchart


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-29 12:07       ` Laurent Pinchart
@ 2024-01-29 13:12         ` Lucas Stach
  0 siblings, 0 replies; 11+ messages in thread
From: Lucas Stach @ 2024-01-29 13:12 UTC (permalink / raw)
  To: Laurent Pinchart, Maxime Ripard
  Cc: Pavel Machek, kernel list, Milan Zamazal, Christoph Hellwig,
	iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue,
	Andrey Konovalov, kieran.bingham, Hans de Goede

On Monday, 29.01.2024 at 14:07 +0200, Laurent Pinchart wrote:
> On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> > On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > > Hi!
> > > 
> > > > That's right and a reality you have to deal with on those small ARM
> > > > systems. The ARM architecture allows for systems that don't enforce
> > > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > > variants make use of this architectural feature.
> > > > 
> > > > What this means is that the CPU caches aren't coherent when it comes to
> > > > DMA from other masters like the video capture units. There are two ways
> > > > to enforce DMA coherency on such systems:
> > > > 1. map the DMA buffers uncached on the CPU
> > > > 2. require explicit cache maintenance when touching DMA buffers with
> > > > the CPU
> > > > 
> > > > Option 1 is what you see is happening in your setup, as it is simple,
> > > > straight-forward and doesn't require any synchronization points.
> > > 
> > > Yeah, and it also does not work :-).
> > > 
> > > Userspace gets the buffers, and it is not really equipped to work with
> > > them. For example, on pinephone, memcpy() crashes on uncached
> > > memory. I'm pretty sure user could have some kind of kernel-crashing
> > > fun if he passed the uncached memory to futex or something similar.
> > 
> > Uncached buffers are ubiquitous on arm/arm64 so there must be something
> > else going on. And there's nothing to equip for, it's just a memory
> > array you can access in any way you want (but very slowly).
> > 
> > How does it not work?
> 
> I agree, this should just work (albeit possibly slowly). A crash is a
> sign something needs to be fixed.
> 
Optimized memcpy implementations might use unaligned accesses at the edges
of the copy regions, which will in fact not work with uncached memory,
as hardware unaligned access support on ARM(64) requires the bufferable
memory attribute, so you might see aborts in this case.

write-combined mappings are bufferable and thus don't exhibit this
issue.
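
As an illustration only, a copy routine that avoids the problem by never
issuing an unaligned load from the source could look like the sketch below.
The name copy_from_uncached is hypothetical, the may_alias attribute is a
GCC/Clang extension, and the sketch assumes that naturally aligned 64-bit
loads are acceptable for the mapping in question. It reads single bytes until
the source pointer is 8-byte aligned, then whole aligned words, then single
bytes for the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang extension: a 64-bit type that may alias any other type. */
typedef uint64_t __attribute__((may_alias)) aliased_u64;

/*
 * Copy `len` bytes from `src` to `dst` using only naturally aligned loads
 * from `src`. The volatile qualifier keeps the compiler from merging or
 * widening the source accesses. `dst` is assumed to be ordinary cached
 * memory, so stores to it may be unaligned.
 */
void copy_from_uncached(void *dst, const volatile void *src, size_t len)
{
    uint8_t *d = dst;
    const volatile uint8_t *s = src;

    /* Head: byte loads until the source is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)s & 7) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Body: aligned 64-bit loads from the source. */
    while (len >= 8) {
        uint64_t word = *(const volatile aliased_u64 *)s;
        memcpy(d, &word, sizeof(word));
        d += 8;
        s += 8;
        len -= 8;
    }

    /* Tail: remaining bytes, one at a time. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

Whether this is fast enough in practice depends on the platform; it only
addresses the alignment aborts, not the cost of uncached reads.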

> > > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > > V4L2 device and then executing the necessary cache synchronization in
> > > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > > master. However this isn't guaranteed to be any faster, as the cache
> > > > synchronization itself is a pretty heavy-weight operation when you are
> > > > dealing with buffer that are potentially multi-megabytes in size.
> > > 
> > > > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > > > a millisecond in the worst case, and copying megabyte images is still
> > > > slower than that.
> 
> Those numbers are platform-specific, you can't assume this to be true
> everywhere.
> 
Last time I looked at this was on a pretty old platform (Cortex-A9).
There, the TLB walks caused by cache maintenance by virtual address
were causing severe slowdowns, to the point where actually copying the
data performed similarly to the cache maintenance, within noise margins,
with the significant difference that copying actually leaves the data
cache-hot for the following operations.

Regards,
Lucas


* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
  2024-01-29 10:32     ` Maxime Ripard
  2024-01-29 12:07       ` Laurent Pinchart
@ 2024-01-29 18:30       ` Pavel Machek
  1 sibling, 0 replies; 11+ messages in thread
From: Pavel Machek @ 2024-01-29 18:30 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu,
	Will Deacon, catalin.marinas, Bryan O'Donoghue,
	Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede


Hi!

> > Yeah, and it also does not work :-).
> > 
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
> 
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).

Not really. Not on anything modern.

ll/sc will not work, for example; that's on ARM.
https://en.wikipedia.org/wiki/Load-link/store-conditional
Transactional memory will not work; that was on x86. PowerPC has a
cacheline clearing instruction.

And that's by design; I'm pretty sure there are also numerous CPU errata.

Best regards,
								Pavel
-- 
People of Russia, stop Putin before his war on Ukraine escalates.
