* Re: Uncached buffers from CMA DMA heap on some Arm devices?
[not found] ` <d2ff8df896d8a167e9abf447ae184ce2f5823852.camel@lynxeye.de>
@ 2024-01-29 10:23 ` Pavel Machek
2024-01-29 10:32 ` Maxime Ripard
0 siblings, 1 reply; 5+ messages in thread
From: Pavel Machek @ 2024-01-29 10:23 UTC (permalink / raw)
To: Lucas Stach, kernel list
Cc: Milan Zamazal, Christoph Hellwig, iommu, Will Deacon,
catalin.marinas, Bryan O'Donoghue, Andrey Konovalov,
Maxime Ripard, Laurent Pinchart, kieran.bingham, Hans de Goede
Hi!
> That's right and a reality you have to deal with on those small ARM
> systems. The ARM architecture allows for systems that don't enforce
> hardware coherency across the whole SoC and many of the small/cheap SoC
> variants make use of this architectural feature.
>
> What this means is that the CPU caches aren't coherent when it comes to
> DMA from other masters like the video capture units. There are two ways
> to enforce DMA coherency on such systems:
> 1. map the DMA buffers uncached on the CPU
> 2. require explicit cache maintenance when touching DMA buffers with
> the CPU
>
> Option 1 is what you see is happening in your setup, as it is simple,
> straight-forward and doesn't require any synchronization points.
Yeah, and it also does not work :-).
Userspace gets the buffers, and it is not really equipped to work with
them. For example, on pinephone, memcpy() crashes on uncached
memory. I'm pretty sure user could have some kind of kernel-crashing
fun if he passed the uncached memory to futex or something similar.
> Option 2 could be implemented by allocating cached DMA buffers in the
> V4L2 device and then executing the necessary cache synchronization in
> qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> master. However this isn't guaranteed to be any faster, as the cache
> synchronization itself is a pretty heavy-weight operation when you are
> dealing with buffers that are potentially multiple megabytes in size.
Yes, cache synchronization can be slow, but IIRC it was on the order of
a millisecond in the worst case, and copying megabyte images is still
slower than that.
Note that it is faster to do read/write syscalls than to deal with
uncached memory. And userspace can't simply flush the caches and remap
memory as cached.
v4l2 moved away from read/write "because it is slow" and switched to
an interface that is even slower than that. And libcamera exposes
uncached memory to the user :-(.
Best regards,
Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
2024-01-29 10:23 ` Uncached buffers from CMA DMA heap on some Arm devices? Pavel Machek
@ 2024-01-29 10:32 ` Maxime Ripard
2024-01-29 12:07 ` Laurent Pinchart
2024-01-29 18:30 ` Pavel Machek
0 siblings, 2 replies; 5+ messages in thread
From: Maxime Ripard @ 2024-01-29 10:32 UTC (permalink / raw)
To: Pavel Machek
Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu,
Will Deacon, catalin.marinas, Bryan O'Donoghue,
Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede
On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> Hi!
>
> > That's right and a reality you have to deal with on those small ARM
> > systems. The ARM architecture allows for systems that don't enforce
> > hardware coherency across the whole SoC and many of the small/cheap SoC
> > variants make use of this architectural feature.
> >
> > What this means is that the CPU caches aren't coherent when it comes to
> > DMA from other masters like the video capture units. There are two ways
> > to enforce DMA coherency on such systems:
> > 1. map the DMA buffers uncached on the CPU
> > 2. require explicit cache maintenance when touching DMA buffers with
> > the CPU
> >
> > Option 1 is what you see is happening in your setup, as it is simple,
> > straight-forward and doesn't require any synchronization points.
>
> Yeah, and it also does not work :-).
>
> Userspace gets the buffers, and it is not really equipped to work with
> them. For example, on pinephone, memcpy() crashes on uncached
> memory. I'm pretty sure user could have some kind of kernel-crashing
> fun if he passed the uncached memory to futex or something similar.
Uncached buffers are ubiquitous on arm/arm64 so there must be something
else going on. And there's nothing to equip for, it's just a memory
array you can access in any way you want (but very slowly).
How does it not work?
> > Option 2 could be implemented by allocating cached DMA buffers in the
> > V4L2 device and then executing the necessary cache synchronization in
> > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > master. However this isn't guaranteed to be any faster, as the cache
> > synchronization itself is a pretty heavy-weight operation when you are
> > dealing with buffers that are potentially multiple megabytes in size.
>
> Yes, cache synchronization can be slow, but IIRC it was on the order of
> a millisecond in the worst case, and copying megabyte images is still
> slower than that.
>
> Note that it is faster to do read/write syscalls than to deal with
> uncached memory. And userspace can't simply flush the caches and remap
> memory as cached.
You can't change the memory mapping, but you can flush the caches with
dma-buf. It's even required by the dma-buf documentation.
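For reference, the dma-buf cache synchronization mentioned above can be
sketched roughly as follows. This is a minimal sketch, assuming the
buffer is available as a dma-buf file descriptor that the application
has mmap()ed. The helper name dmabuf_cpu_access() and its parameters
are made up for illustration; struct dma_buf_sync, DMA_BUF_IOCTL_SYNC
and the DMA_BUF_SYNC_* flags come from <linux/dma-buf.h>. Retrying on
EINTR is an assumption about sensible error handling, not something
stated in this thread.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Hypothetical helper: bracket CPU access to an mmap()ed dma-buf.
 * start != 0 begins a CPU access window, start == 0 ends it.
 * rw_flags is DMA_BUF_SYNC_READ, DMA_BUF_SYNC_WRITE or DMA_BUF_SYNC_RW.
 * Returns 0 on success, -1 with errno set on failure.
 */
int dmabuf_cpu_access(int dmabuf_fd, int start, uint64_t rw_flags)
{
    struct dma_buf_sync sync = {
        .flags = (start ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END) | rw_flags,
    };
    int ret;

    /* Assumption: simply retry if a signal interrupts the ioctl. */
    do {
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && errno == EINTR);

    return ret;
}
```

A reader would call dmabuf_cpu_access(fd, 1, DMA_BUF_SYNC_READ) before
touching the mapping and dmabuf_cpu_access(fd, 0, DMA_BUF_SYNC_READ)
afterwards. This only brackets the CPU access; it does not change
whether the mapping itself is cached.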
> v4l2 moved away from read/write "because it is slow" and switched to
> an interface that is even slower than that. And libcamera exposes
> uncached memory to the user :-(.
There's also the number of copies to consider. If you were to use
read/write to display a frame on a framebuffer, you would use 4 copies
vs 2 with dma-buf.
Maxime
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
2024-01-29 10:32 ` Maxime Ripard
@ 2024-01-29 12:07 ` Laurent Pinchart
2024-01-29 13:12 ` Lucas Stach
2024-01-29 18:30 ` Pavel Machek
1 sibling, 1 reply; 5+ messages in thread
From: Laurent Pinchart @ 2024-01-29 12:07 UTC (permalink / raw)
To: Maxime Ripard
Cc: Pavel Machek, Lucas Stach, kernel list, Milan Zamazal,
Christoph Hellwig, iommu, Will Deacon, catalin.marinas,
Bryan O'Donoghue, Andrey Konovalov, kieran.bingham,
Hans de Goede
On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > Hi!
> >
> > > That's right and a reality you have to deal with on those small ARM
> > > systems. The ARM architecture allows for systems that don't enforce
> > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > variants make use of this architectural feature.
> > >
> > > What this means is that the CPU caches aren't coherent when it comes to
> > > DMA from other masters like the video capture units. There are two ways
> > > to enforce DMA coherency on such systems:
> > > 1. map the DMA buffers uncached on the CPU
> > > 2. require explicit cache maintenance when touching DMA buffers with
> > > the CPU
> > >
> > > Option 1 is what you see is happening in your setup, as it is simple,
> > > straight-forward and doesn't require any synchronization points.
> >
> > Yeah, and it also does not work :-).
> >
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
>
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).
>
> How does it not work?
I agree, this should just work (albeit possibly slowly). A crash is a
sign something needs to be fixed.
> > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > V4L2 device and then executing the necessary cache synchronization in
> > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > master. However this isn't guaranteed to be any faster, as the cache
> > > synchronization itself is a pretty heavy-weight operation when you are
> > > dealing with buffers that are potentially multiple megabytes in size.
> >
> > > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > > a millisecond in the worst case, and copying megabyte images is still
> > > slower than that.
Those numbers are platform-specific, you can't assume this to be true
everywhere.
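Since the numbers differ per platform, one way to get a figure for a
given device is to time the copy directly. The following is only a
sketch: time_copy_ms() is a hypothetical helper, and it measures
nothing but a plain memcpy() between two caller-supplied buffers. To
compare cached and uncached sources, the caller would pass the mapped
capture buffer as src; if plain memcpy() faults on that mapping, an
alignment-safe copy loop would have to be substituted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: returns the wall-clock time, in milliseconds,
 * that memcpy() needs to copy size bytes from src to dst.
 * Both buffers are supplied (and pre-faulted) by the caller.
 */
double time_copy_ms(void *dst, const void *src, size_t size)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

A single run is noisy; repeating the measurement and taking the median
gives a more usable number.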
> > Note that it is faster to do read/write syscalls than to deal with
> > uncached memory. And userspace can't simply flush the caches and remap
> > memory as cached.
>
> You can't change the memory mapping, but you can flush the caches with
> dma-buf. It's even required by the dma-buf documentation.
>
> > v4l2 moved away from read/write "because it is slow" and switched to
> > an interface that is even slower than that. And libcamera exposes
> > uncached memory to the user :-(.
>
> There's also the number of copies to consider. If you were to use
> read/write to display a frame on a framebuffer, you would use 4 copies
> vs 2 with dma-buf.
--
Regards,
Laurent Pinchart
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
2024-01-29 12:07 ` Laurent Pinchart
@ 2024-01-29 13:12 ` Lucas Stach
0 siblings, 0 replies; 5+ messages in thread
From: Lucas Stach @ 2024-01-29 13:12 UTC (permalink / raw)
To: Laurent Pinchart, Maxime Ripard
Cc: Pavel Machek, kernel list, Milan Zamazal, Christoph Hellwig,
iommu, Will Deacon, catalin.marinas, Bryan O'Donoghue,
Andrey Konovalov, kieran.bingham, Hans de Goede
On Monday, 29.01.2024 at 14:07 +0200, Laurent Pinchart wrote:
> On Mon, Jan 29, 2024 at 11:32:08AM +0100, Maxime Ripard wrote:
> > On Mon, Jan 29, 2024 at 11:23:16AM +0100, Pavel Machek wrote:
> > > Hi!
> > >
> > > > That's right and a reality you have to deal with on those small ARM
> > > > systems. The ARM architecture allows for systems that don't enforce
> > > > hardware coherency across the whole SoC and many of the small/cheap SoC
> > > > variants make use of this architectural feature.
> > > >
> > > > What this means is that the CPU caches aren't coherent when it comes to
> > > > DMA from other masters like the video capture units. There are two ways
> > > > to enforce DMA coherency on such systems:
> > > > 1. map the DMA buffers uncached on the CPU
> > > > 2. require explicit cache maintenance when touching DMA buffers with
> > > > the CPU
> > > >
> > > > Option 1 is what you see is happening in your setup, as it is simple,
> > > > straight-forward and doesn't require any synchronization points.
> > >
> > > Yeah, and it also does not work :-).
> > >
> > > Userspace gets the buffers, and it is not really equipped to work with
> > > them. For example, on pinephone, memcpy() crashes on uncached
> > > memory. I'm pretty sure user could have some kind of kernel-crashing
> > > fun if he passed the uncached memory to futex or something similar.
> >
> > Uncached buffers are ubiquitous on arm/arm64 so there must be something
> > else going on. And there's nothing to equip for, it's just a memory
> > array you can access in any way you want (but very slowly).
> >
> > How does it not work?
>
> I agree, this should just work (albeit possibly slowly). A crash is a
> sign something needs to be fixed.
>
Optimized memcpy implementations might use unaligned accesses at the
edges of the copy regions, which will in fact not work with uncached
memory, as hardware unaligned access support on ARM(64) requires the
bufferable memory attribute, so you might see aborts in this case.
Write-combined mappings are bufferable and thus don't exhibit this
issue.
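To illustrate the point about unaligned accesses, here is a minimal
sketch of a copy routine that only issues naturally aligned loads from
the source. The name copy_from_uncached() is made up for this example,
and the sketch assumes that the destination is ordinary cached memory
and that the source mapping tolerates aligned loads of up to 8 bytes.
The volatile qualifier is there so the compiler does not merge the
byte accesses at the edges into wider, possibly unaligned ones.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes out of a mapping that faults on
 * unaligned access. Only the source is accessed with aligned loads;
 * the destination is assumed to be normal cached memory.
 */
void copy_from_uncached(void *dst, const volatile void *src, size_t len)
{
    unsigned char *d = dst;
    const volatile unsigned char *s = src;

    /* Head: single bytes until the source pointer is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)s & 7) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Body: naturally aligned 64-bit loads from the source. */
    while (len >= 8) {
        uint64_t word = *(const volatile uint64_t *)s;

        /* The destination may be unaligned; that is fine for cached memory. */
        memcpy(d, &word, sizeof(word));
        d += 8;
        s += 8;
        len -= 8;
    }

    /* Tail: remaining bytes, one at a time. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

This trades speed for safety at the edges; the bulk of a large frame
still goes through the aligned 64-bit loop.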
> > > > Option 2 could be implemented by allocating cached DMA buffers in the
> > > > V4L2 device and then executing the necessary cache synchronization in
> > > > qbuf/dqbuf when ownership of the DMA buffer changes between CPU and DMA
> > > > master. However this isn't guaranteed to be any faster, as the cache
> > > > synchronization itself is a pretty heavy-weight operation when you are
> > > > dealing with buffers that are potentially multiple megabytes in size.
> > >
> > > > Yes, cache synchronization can be slow, but IIRC it was on the order of
> > > > a millisecond in the worst case, and copying megabyte images is still
> > > > slower than that.
>
> Those numbers are platform-specific, you can't assume this to be true
> everywhere.
>
Last time I looked at this was on a pretty old platform (Cortex-A9).
There the TLB walks caused by the cache maintenance by virtual address
were causing severe slowdowns, to the point where actually copying the
data performed similarly to the cache maintenance, within noise margins,
with the significant difference that copying actually causes the data
to be cache hot for the following operations.
Regards,
Lucas
* Re: Re: Uncached buffers from CMA DMA heap on some Arm devices?
2024-01-29 10:32 ` Maxime Ripard
2024-01-29 12:07 ` Laurent Pinchart
@ 2024-01-29 18:30 ` Pavel Machek
1 sibling, 0 replies; 5+ messages in thread
From: Pavel Machek @ 2024-01-29 18:30 UTC (permalink / raw)
To: Maxime Ripard
Cc: Lucas Stach, kernel list, Milan Zamazal, Christoph Hellwig, iommu,
Will Deacon, catalin.marinas, Bryan O'Donoghue,
Andrey Konovalov, Laurent Pinchart, kieran.bingham, Hans de Goede
Hi!
> > Yeah, and it also does not work :-).
> >
> > Userspace gets the buffers, and it is not really equipped to work with
> > them. For example, on pinephone, memcpy() crashes on uncached
> > memory. I'm pretty sure user could have some kind of kernel-crashing
> > fun if he passed the uncached memory to futex or something similar.
>
> Uncached buffers are ubiquitous on arm/arm64 so there must be something
> else going on. And there's nothing to equip for, it's just a memory
> array you can access in any way you want (but very slowly).
Not really. Not on anything modern.
ll/sc will not work, for example; that's on ARM.
https://en.wikipedia.org/wiki/Load-link/store-conditional
Transactional memory will not work, that was on x86. Powerpc has
cacheline clearing instruction.
And that's by design; I'm pretty sure there are also numerous CPU errata.
Best regards,
Pavel
--
People of Russia, stop Putin before his war on Ukraine escalates.