* DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
@ 2025-07-10 8:24 Pavel Machek
2025-07-10 8:42 ` Lucas Stach
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Pavel Machek @ 2025-07-10 8:24 UTC (permalink / raw)
To: kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, laurent.pinchart, l.stach,
linux+etnaviv, christian.gmeiner, etnaviv, phone-devel
Hi!
It seems that DMA-BUFs are always uncached on arm64... which is a
problem.
I'm trying to get useful camera support on Librem 5, and that includes
recording videos (and taking photos).
memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
760p video recording. Plus, copying full-resolution photo buffer takes
more than 200msec!
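To put those numbers in perspective, here is a small back-of-the-envelope sketch in C. The helper names are made up for illustration; only the 2msec/MB and 20msec/MB figures come from the measurements above, and any frame rate or buffer size plugged in is an assumption.

```c
#include <stdio.h>

/* Time in milliseconds needed to copy `megabytes` at a cost of `ms_per_mb`. */
static double copy_ms(double megabytes, double ms_per_mb)
{
    return megabytes * ms_per_mb;
}

/* Largest amount of data (in MB) that can be copied within `budget_ms`. */
static double max_mb_in_budget(double budget_ms, double ms_per_mb)
{
    return budget_ms / ms_per_mb;
}

/* Print both the cached and the uncached cost for one buffer size. */
static void print_copy_cost(double megabytes)
{
    printf("%.1f MB: %.1f msec cached, %.1f msec uncached\n",
           megabytes, copy_ms(megabytes, 2.0), copy_ms(megabytes, 20.0));
}
```

For example, copy_ms(10.0, 20.0) gives 200 msec, which is consistent with the photo case if the buffer is around 10MB (an inferred size, not a measured one). At an assumed 30 frames per second the per-frame budget is about 33 msec, so max_mb_in_budget(33.3, 20.0) leaves roughly 1.7MB per frame for uncached memory, versus roughly 16.7MB at 2msec/MB.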
There's a possibility to do some processing on the GPU, and it's implemented here:
https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
but that hits the same problem in the end -- data is in DMA-BUF,
uncached, and takes way too long to copy out.
And that's ... wrong. DMA ended seconds ago, complete cache flush
would be way cheaper than copying single frame out, and I still have
to deal with uncached frames.
So I have two questions:
1) Is my analysis correct that, no matter how I get frame from v4l and
process it on GPU, I'll have to copy it from uncached memory in the
end?
2) Does anyone have patches / ideas / roadmap how to solve that? It
makes GPU unusable for computing, and camera basically unusable for
video.
Best regards,
Pavel
--
I don't work for Nazis and criminals, and neither should you.
Boycott Putin, Trump, and Musk!
* Re: DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
2025-07-10 8:24 DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5 Pavel Machek
@ 2025-07-10 8:42 ` Lucas Stach
2025-07-10 8:49 ` Pavel Machek
2025-07-10 16:01 ` Nicolas Dufresne
2025-07-13 19:54 ` Mikhail Rudenko
2 siblings, 1 reply; 6+ messages in thread
From: Lucas Stach @ 2025-07-10 8:42 UTC (permalink / raw)
To: Pavel Machek, kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, laurent.pinchart, linux+etnaviv,
christian.gmeiner, etnaviv, phone-devel
Hi Pavel,
On Thursday, 10.07.2025 at 10:24 +0200, Pavel Machek wrote:
> Hi!
>
> It seems that DMA-BUFs are always uncached on arm64... which is a
> problem.
>
> I'm trying to get useful camera support on Librem 5, and that includes
> recording videos (and taking photos).
>
> memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> 760p video recording. Plus, copying full-resolution photo buffer takes
> more than 200msec!
>
> There's a possibility to do some processing on the GPU, and it's implemented here:
>
> https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
>
> but that hits the same problem in the end -- data is in DMA-BUF,
> uncached, and takes way too long to copy out.
>
> And that's ... wrong. DMA ended seconds ago, complete cache flush
> would be way cheaper than copying single frame out, and I still have
> to deal with uncached frames.
>
> So I have two questions:
>
> 1) Is my analysis correct that, no matter how I get frame from v4l and
> process it on GPU, I'll have to copy it from uncached memory in the
> end?
If you need to touch the buffers using the CPU then you are either
stuck with uncached memory or you need to implement bracketed access to
do the necessary cache maintenance. Be aware that completely flushing
the cache is not really an option, as that would impact other
workloads, so you have to flush the cache by walking the virtual
address space of the buffer, which may take a significant amount of CPU
time.
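For reference, bracketed access from userspace is usually done with DMA_BUF_IOCTL_SYNC from <linux/dma-buf.h>. The following is only a minimal sketch: the helper names dmabuf_sync() and dmabuf_copy_out() are made up for illustration, and it assumes the dma-buf has already been mmap()ed and that the exporter actually provides a cacheable CPU mapping. On an exporter that maps the buffer uncached, the ioctl calls are still correct but will not make the copy any faster.

```c
#include <errno.h>
#include <linux/dma-buf.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>

/* Issue one DMA_BUF_IOCTL_SYNC call, retrying if it was interrupted. */
static int dmabuf_sync(int dmabuf_fd, __u64 flags)
{
    struct dma_buf_sync sync = { .flags = flags };
    int ret;

    do {
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    return ret;
}

/*
 * Copy `len` bytes out of a dma-buf that is already mmap()ed at `map`,
 * bracketing the CPU read with SYNC_START and SYNC_END so the exporter
 * can do whatever cache maintenance it needs.
 * Returns 0 on success, -1 on error (errno is set by ioctl()).
 */
static int dmabuf_copy_out(int dmabuf_fd, const void *map, void *dst, size_t len)
{
    if (dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ) < 0)
        return -1;

    memcpy(dst, map, len);

    return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ);
}
```

The cache maintenance cost is paid inside the two ioctl calls, so timing them separately from the memcpy() should show how expensive the walk over the buffer really is on a given platform.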
However, if you are only going to use the buffer with the GPU I see no
reason to touch it from the CPU side. Why would you even need to copy
the content? After all dma-bufs are meant to enable zero-copy between
DMA capable accelerators. You can simply import the V4L2 buffer into a
GL texture using EGL_EXT_image_dma_buf_import. Using this path you
don't need to bother with the cache at all, as the GPU will directly
read the video buffers from RAM.
Regards,
Lucas
>
> 2) Does anyone have patches / ideas / roadmap how to solve that? It
> makes GPU unusable for computing, and camera basically unusable for
> video.
>
> Best regards,
> Pavel
* Re: DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
2025-07-10 8:42 ` Lucas Stach
@ 2025-07-10 8:49 ` Pavel Machek
2025-07-10 21:52 ` Laurent Pinchart
0 siblings, 1 reply; 6+ messages in thread
From: Pavel Machek @ 2025-07-10 8:49 UTC (permalink / raw)
To: Lucas Stach
Cc: kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, laurent.pinchart, linux+etnaviv,
christian.gmeiner, etnaviv, phone-devel
Hi!
> > memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> > DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> > 760p video recording. Plus, copying full-resolution photo buffer takes
> > more than 200msec!
> >
> > There's a possibility to do some processing on the GPU, and it's implemented here:
> >
> > https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
> >
> > but that hits the same problem in the end -- data is in DMA-BUF,
> > uncached, and takes way too long to copy out.
> >
> > And that's ... wrong. DMA ended seconds ago, complete cache flush
> > would be way cheaper than copying single frame out, and I still have
> > to deal with uncached frames.
> >
> > So I have two questions:
> >
> > 1) Is my analysis correct that, no matter how I get frame from v4l and
> > process it on GPU, I'll have to copy it from uncached memory in the
> > end?
>
> If you need to touch the buffers using the CPU then you are either
> stuck with uncached memory or you need to implement bracketed access to
> do the necessary cache maintenance. Be aware that completely flushing
> the cache is not really an option, as that would impact other
> workloads, so you have to flush the cache by walking the virtual
> address space of the buffer, which may take a significant amount of CPU
> time.
What kind of "significant amount of CPU time" are we talking here?
Millisecond?
Bracketed access is fine with me.
Flushing the cache should be an option. I'm root, there's no other
significant workload, and copying out the buffer takes 200msec+. There
are a lot of cache flushes that can be done in a quarter of a second!
> However, if you are only going to use the buffer with the GPU I see no
> reason to touch it from the CPU side. Why would you even need to copy
> the content? After all dma-bufs are meant to enable zero-copy between
> DMA capable accelerators. You can simply import the V4L2 buffer into a
> GL texture using EGL_EXT_image_dma_buf_import. Using this path you
> don't need to bother with the cache at all, as the GPU will directly
> read the video buffers from RAM.
Yes, so the GPU will read the video buffer from RAM, then debayer it, and then
what? Then I need to store the data into a raw file, or use the CPU to turn it
into a JPEG file, or maybe run a video encoder on it. Those are all tasks
that are done on the CPU...
Best regards,
Pavel
--
I don't work for Nazis and criminals, and neither should you.
Boycott Putin, Trump, and Musk!
* Re: DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
2025-07-10 8:24 DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5 Pavel Machek
2025-07-10 8:42 ` Lucas Stach
@ 2025-07-10 16:01 ` Nicolas Dufresne
2025-07-13 19:54 ` Mikhail Rudenko
2 siblings, 0 replies; 6+ messages in thread
From: Nicolas Dufresne @ 2025-07-10 16:01 UTC (permalink / raw)
To: Pavel Machek, kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, laurent.pinchart, l.stach,
linux+etnaviv, christian.gmeiner, etnaviv, phone-devel
Hi Pavel,
On Thursday, 10 July 2025 at 10:24 +0200, Pavel Machek wrote:
> Hi!
>
> It seems that DMA-BUFs are always uncached on arm64... which is a
> problem.
>
> I'm trying to get useful camera support on Librem 5, and that includes
> recording videos (and taking photos).
>
> memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> 760p video recording. Plus, copying full-resolution photo buffer takes
> more than 200msec!
>
> There's a possibility to do some processing on the GPU, and it's implemented here:
>
> https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
>
> but that hits the same problem in the end -- data is in DMA-BUF,
> uncached, and takes way too long to copy out.
>
> And that's ... wrong. DMA ended seconds ago, complete cache flush
> would be way cheaper than copying single frame out, and I still have
> to deal with uncached frames.
>
> So I have two questions:
>
> 1) Is my analysis correct that, no matter how I get frame from v4l and
> process it on GPU, I'll have to copy it from uncached memory in the
> end?
>
> 2) Does anyone have patches / ideas / roadmap how to solve that? It
> makes GPU unusable for computing, and camera basically unusable for
> video.
If CPU access is strictly required for your use case, the way forward is to
implement V4L2_BUF_CAP_SUPPORTS_MMAP_CACHE_HINTS [0] in the capture driver. Very
few drivers enable that.
Once your driver has that capability, you will be able to set
V4L2_MEMORY_FLAG_NON_COHERENT while doing the REQBUFS or CREATE_BUFS ioctl. That
gives you allocations with the CPU cache working, but you'll get the invalidation (or
flush) overhead by default. When captured data has not been read by the CPU, you can
always queue it back with V4L2_BUF_FLAG_NO_CACHE_INVALIDATE. But for your
use case, it seems that you want the invalidation to take place, otherwise your
software will end up reading old cached data instead of the next frame's data.
Please note that the integration in the DMABuf SYNC ioctl was missing for a
while, so make sure you have a recent enough kernel or get ready for backports.
The feature itself was commonly used with CPU only access, notably on ChromeOS
using libyuv. No DMABuf was involved initially.
regards,
Nicolas
[0] https://www.kernel.org/doc/html/latest/userspace-api/media/v4l/vidioc-reqbufs.html#v4l2-buf-cap-supports-mmap-cache-hints
>
> Best regards,
> Pavel
* Re: DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
2025-07-10 8:49 ` Pavel Machek
@ 2025-07-10 21:52 ` Laurent Pinchart
0 siblings, 0 replies; 6+ messages in thread
From: Laurent Pinchart @ 2025-07-10 21:52 UTC (permalink / raw)
To: Pavel Machek
Cc: Lucas Stach, kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, linux+etnaviv, christian.gmeiner,
etnaviv, phone-devel
On Thu, Jul 10, 2025 at 10:49:19AM +0200, Pavel Machek wrote:
> Hi!
>
> > > memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> > > DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> > > 760p video recording. Plus, copying full-resolution photo buffer takes
> > > more than 200msec!
> > >
> > > There's a possibility to do some processing on the GPU, and it's implemented here:
> > >
> > > https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
> > >
> > > but that hits the same problem in the end -- data is in DMA-BUF,
> > > uncached, and takes way too long to copy out.
> > >
> > > And that's ... wrong. DMA ended seconds ago, complete cache flush
> > > would be way cheaper than copying single frame out, and I still have
> > > to deal with uncached frames.
> > >
> > > So I have two questions:
> > >
> > > 1) Is my analysis correct that, no matter how I get frame from v4l and
> > > process it on GPU, I'll have to copy it from uncached memory in the
> > > end?
> >
> > If you need to touch the buffers using the CPU then you are either
> > stuck with uncached memory or you need to implement bracketed access to
> > do the necessary cache maintenance. Be aware that completely flushing
> > the cache is not really an option, as that would impact other
> > workloads, so you have to flush the cache by walking the virtual
> > address space of the buffer, which may take a significant amount of CPU
> > time.
>
> What kind of "significant amount of CPU time" are we talking here?
> Millisecond?
It really depends on the platform, the type of cache, and the size of
the buffer. I remember that back in the N900 days a selective cache clean
of a large buffer for full resolution images took several dozen
milliseconds, possibly close to 100ms. We had to clean the whole D-cache
to make it fast enough, but you can't always do that as Lucas mentioned.
> Bracketed access is fine with me.
>
> Flushing the cache should be an option. I'm root, there's no other
> significant workload, and copying out the buffer takes 200msec+. There
> are a lot of cache flushes that can be done in a quarter of a second!
>
> > However, if you are only going to use the buffer with the GPU I see no
> > reason to touch it from the CPU side. Why would you even need to copy
> > the content? After all dma-bufs are meant to enable zero-copy between
> > DMA capable accelerators. You can simply import the V4L2 buffer into a
> > GL texture using EGL_EXT_image_dma_buf_import. Using this path you
> > don't need to bother with the cache at all, as the GPU will directly
> > read the video buffers from RAM.
>
> Yes, so the GPU will read the video buffer from RAM, then debayer it, and then
> what? Then I need to store the data into a raw file, or use the CPU to turn it
> into a JPEG file, or maybe run a video encoder on it. Those are all tasks
> that are done on the CPU...
--
Regards,
Laurent Pinchart
* Re: DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5
2025-07-10 8:24 DMA-BUFs always uncached on arm64, causing poor camera performance on Librem 5 Pavel Machek
2025-07-10 8:42 ` Lucas Stach
2025-07-10 16:01 ` Nicolas Dufresne
@ 2025-07-13 19:54 ` Mikhail Rudenko
2 siblings, 0 replies; 6+ messages in thread
From: Mikhail Rudenko @ 2025-07-13 19:54 UTC (permalink / raw)
To: Pavel Machek
Cc: kraxel, vivek.kasireddy, dri-devel, sumit.semwal,
benjamin.gaignard, Brian.Starkey, jstultz, tjmercier, linux-media,
linaro-mm-sig, kernel list, laurent.pinchart, l.stach,
linux+etnaviv, christian.gmeiner, etnaviv, phone-devel
Hi, Pavel,
On 2025-07-10 at 10:24 +02, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
> It seems that DMA-BUFs are always uncached on arm64... which is a
> problem.
>
> I'm trying to get useful camera support on Librem 5, and that includes
> recording videos (and taking photos).
Earlier this year I tried to solve a similar issue on rkisp1 (Rockchip
3399) and did some measurements, showing that non-coherent buffers +
cache flushing for buffers is a viable approach [1]. Unfortunately, that
effort stalled, but maybe the patch "[PATCH v4 1/2] media: videobuf2: Fix
dmabuf cache sync/flush in dma-contig" will be useful to you.
[1] https://lore.kernel.org/all/20250303-b4-rkisp-noncoherent-v4-0-e32e843fb6ef@gmail.com/
> memcpy() from normal memory is about 2msec/1MB. Unfortunately, for
> DMA-BUFs it is 20msec/1MB, and that basically means I can't easily do
> 760p video recording. Plus, copying full-resolution photo buffer takes
> more than 200msec!
>
> There's a possibility to do some processing on the GPU, and it's implemented here:
>
> https://gitlab.com/tui/tui/-/tree/master/icam?ref_type=heads
>
> but that hits the same problem in the end -- data is in DMA-BUF,
> uncached, and takes way too long to copy out.
>
> And that's ... wrong. DMA ended seconds ago, complete cache flush
> would be way cheaper than copying single frame out, and I still have
> to deal with uncached frames.
>
> So I have two questions:
>
> 1) Is my analysis correct that, no matter how I get frame from v4l and
> process it on GPU, I'll have to copy it from uncached memory in the
> end?
>
> 2) Does anyone have patches / ideas / roadmap how to solve that? It
> makes GPU unusable for computing, and camera basically unusable for
> video.
>
> Best regards,
> Pavel
--
Best regards,
Mikhail Rudenko