* [PATCH] media: rkisp1: allow non-coherent video capture buffers
@ 2025-01-02 15:35 Mikhail Rudenko
2025-01-03 15:23 ` Laurent Pinchart
0 siblings, 1 reply; 17+ messages in thread
From: Mikhail Rudenko @ 2025-01-02 15:35 UTC (permalink / raw)
To: Dafna Hirschfeld, Laurent Pinchart, Mauro Carvalho Chehab,
Heiko Stuebner
Cc: linux-media, linux-rockchip, linux-arm-kernel, linux-kernel,
Mikhail Rudenko
Currently, the rkisp1 driver always uses coherent DMA allocations for
video capture buffers. However, on some platforms, using non-coherent
buffers can improve performance, especially when CPU processing of
MMAP'ed video buffers is required.
For example, on the Rockchip RK3399 running at maximum CPU frequency,
the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
non-coherent DMA allocation. CPU usage also decreases accordingly.
This change allows userspace to request the allocation of non-coherent
buffers. Note that the behavior for existing users will remain unchanged
unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
allocating buffers.
Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
---
drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
--- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
+++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
@@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap)
q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
q->lock = &node->vlock;
q->dev = cap->rkisp1->dev;
+ q->allow_cache_hints = 1;
ret = vb2_queue_init(q);
if (ret) {
dev_err(cap->rkisp1->dev,
---
base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Best regards,
--
Mikhail Rudenko <mike.rudenko@gmail.com>
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-02 15:35 [PATCH] media: rkisp1: allow non-coherent video capture buffers Mikhail Rudenko
@ 2025-01-03 15:23 ` Laurent Pinchart
2025-01-14 16:00 ` Mikhail Rudenko
0 siblings, 1 reply; 17+ messages in thread
From: Laurent Pinchart @ 2025-01-03 15:23 UTC (permalink / raw)
To: Mikhail Rudenko
Cc: Dafna Hirschfeld, Mauro Carvalho Chehab, Heiko Stuebner,
linux-media, linux-rockchip, linux-arm-kernel, linux-kernel
On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> Currently, the rkisp1 driver always uses coherent DMA allocations for
> video capture buffers. However, on some platforms, using non-coherent
> buffers can improve performance, especially when CPU processing of
> MMAP'ed video buffers is required.
>
> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> non-coherent DMA allocation. CPU usage also decreases accordingly.
What's the time taken by the cache management operations ?
> This change allows userspace to request the allocation of non-coherent
> buffers. Note that the behavior for existing users will remain unchanged
> unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
> allocating buffers.
>
> Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
> ---
> drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
> --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap)
> q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
> q->lock = &node->vlock;
> q->dev = cap->rkisp1->dev;
> + q->allow_cache_hints = 1;
> ret = vb2_queue_init(q);
> if (ret) {
> dev_err(cap->rkisp1->dev,
>
> ---
> base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
> change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
--
Regards,
Laurent Pinchart
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-03 15:23 ` Laurent Pinchart
@ 2025-01-14 16:00 ` Mikhail Rudenko
2025-01-15 8:31 ` Tomasz Figa
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Mikhail Rudenko @ 2025-01-14 16:00 UTC (permalink / raw)
To: Laurent Pinchart
Cc: Dafna Hirschfeld, Mauro Carvalho Chehab, Heiko Stuebner,
linux-media, linux-rockchip, linux-arm-kernel, linux-kernel
Hi Laurent,
On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> video capture buffers. However, on some platforms, using non-coherent
>> buffers can improve performance, especially when CPU processing of
>> MMAP'ed video buffers is required.
>>
>> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> non-coherent DMA allocation. CPU usage also decreases accordingly.
>
> What's the time taken by the cache management operations ?
Sorry for the late reply, your question turned out a little more
interesting than I expected initially. :)
When capturing using Yavta with MMAP buffers under the conditions mentioned
in the commit message, ftrace gives 437.6 +- 1.1 us for
dma_sync_sgtable_for_cpu and 409 +- 14 us for
dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
buffers in this case is more CPU-efficient even when considering cache
management overhead.
When trying to do the same measurements with libcamera, I failed. In a
typical libcamera use case when MMAP buffers are allocated from a
device, exported as dmabufs and then used for capture on the same device
with DMABUF memory type, cache management in kernel is skipped [1]
[2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
DMA_BUF_IOCTL_SYNC from userspace does not work either.
So it looks like to make this change really useful, the above issue of
cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
solved. I'm not an expert in this area, so any advice is kindly welcome. :)
[1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
[2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
[3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
--
Best regards,
Mikhail Rudenko
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-14 16:00 ` Mikhail Rudenko
@ 2025-01-15 8:31 ` Tomasz Figa
2025-01-15 13:24 ` Mikhail Rudenko
2025-01-15 19:13 ` Nicolas Dufresne
2025-02-27 17:05 ` Jacopo Mondi
2 siblings, 1 reply; 17+ messages in thread
From: Tomasz Figa @ 2025-01-15 8:31 UTC (permalink / raw)
To: Mikhail Rudenko
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
Hi Mikhail and Laurent,
On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>
>
> Hi Laurent,
>
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> video capture buffers. However, on some platforms, using non-coherent
> >> buffers can improve performance, especially when CPU processing of
> >> MMAP'ed video buffers is required.
> >>
> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >
> > What's the time taken by the cache management operations ?
>
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
>
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
>
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.
Oops, so I believe this is a bug. When an MMAP buffer is allocated in
the non-coherent mode, those ops should perform proper cache
maintenance.
Let me send a patch to fix this in a couple of days unless someone
does it earlier.
Best regards,
Tomasz
>
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>
> --
> Best regards,
> Mikhail Rudenko
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-15 8:31 ` Tomasz Figa
@ 2025-01-15 13:24 ` Mikhail Rudenko
2025-01-15 14:46 ` Tomasz Figa
0 siblings, 1 reply; 17+ messages in thread
From: Mikhail Rudenko @ 2025-01-15 13:24 UTC (permalink / raw)
To: Tomasz Figa
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
Hi Tomasz,
On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
> Hi Mikhail and Laurent,
>
> On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>>
>>
>> Hi Laurent,
>>
>> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>>
>> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> video capture buffers. However, on some platforms, using non-coherent
>> >> buffers can improve performance, especially when CPU processing of
>> >> MMAP'ed video buffers is required.
>> >>
>> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >
>> > What's the time taken by the cache management operations ?
>>
>> Sorry for the late reply, your question turned out a little more
>> interesting than I expected initially. :)
>>
>> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> buffers in this case is more CPU-efficient even when considering cache
>> management overhead.
>>
>> When trying to do the same measurements with libcamera, I failed. In a
>> typical libcamera use case when MMAP buffers are allocated from a
>> device, exported as dmabufs and then used for capture on the same device
>> with DMABUF memory type, cache management in kernel is skipped [1]
>> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>
> Oops, so I believe this is a bug. When an MMAP buffer is allocated in
> the non-coherent mode, those ops should perform proper cache
> maintenance.
Thanks for pointing this out!
> Let me send a patch to fix this in a couple of days unless someone
> does it earlier.
Now that we know that this is a bug, not an API misuse from my side, I
can fix this myself and send a v2. Would this be okay for you?
> Best regards,
> Tomasz
>
>>
>> So it looks like to make this change really useful, the above issue of
>> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
>> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>>
>> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
>> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
>> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>>
>> --
>> Best regards,
>> Mikhail Rudenko
>>
--
Best regards,
Mikhail Rudenko
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-15 13:24 ` Mikhail Rudenko
@ 2025-01-15 14:46 ` Tomasz Figa
2025-01-15 17:29 ` Mikhail Rudenko
0 siblings, 1 reply; 17+ messages in thread
From: Tomasz Figa @ 2025-01-15 14:46 UTC (permalink / raw)
To: Mikhail Rudenko
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>
> Hi Tomasz,
>
> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
>
> > Hi Mikhail and Laurent,
> >
> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
> >>
> >>
> >> Hi Laurent,
> >>
> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> >>
> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> >> video capture buffers. However, on some platforms, using non-coherent
> >> >> buffers can improve performance, especially when CPU processing of
> >> >> MMAP'ed video buffers is required.
> >> >>
> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >> >
> >> > What's the time taken by the cache management operations ?
> >>
> >> Sorry for the late reply, your question turned out a little more
> >> interesting than I expected initially. :)
> >>
> >> When capturing using Yavta with MMAP buffers under the conditions mentioned
> >> in the commit message, ftrace gives 437.6 +- 1.1 us for
> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> >> buffers in this case is more CPU-efficient even when considering cache
> >> management overhead.
> >>
> >> When trying to do the same measurements with libcamera, I failed. In a
> >> typical libcamera use case when MMAP buffers are allocated from a
> >> device, exported as dmabufs and then used for capture on the same device
> >> with DMABUF memory type, cache management in kernel is skipped [1]
> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> >> DMA_BUF_IOCTL_SYNC from userspace does not work either.
> >
> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in
> > the non-coherent mode, those ops should perform proper cache
> > maintenance.
>
> Thanks for pointing this out!
>
> > Let me send a patch to fix this in a couple of days unless someone
> > does it earlier.
>
> Now that we know that this is a bug, not an API misuse from my side, I
> can fix this myself and send a v2. Would this be okay for you?
I'd be more than happy :)
>
> > Best regards,
> > Tomasz
> >
> >>
> >> So it looks like to make this change really useful, the above issue of
> >> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> >> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> >>
> >> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> >> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> >> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> >>
> >> --
> >> Best regards,
> >> Mikhail Rudenko
> >>
>
>
> --
> Best regards,
> Mikhail Rudenko
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-15 14:46 ` Tomasz Figa
@ 2025-01-15 17:29 ` Mikhail Rudenko
0 siblings, 0 replies; 17+ messages in thread
From: Mikhail Rudenko @ 2025-01-15 17:29 UTC (permalink / raw)
To: Tomasz Figa
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
On 2025-01-15 at 23:46 +09, Tomasz Figa <tfiga@chromium.org> wrote:
> On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>>
>> Hi Tomasz,
>>
>> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
>>
>> > Hi Mikhail and Laurent,
>> >
>> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>> >>
>> >>
>> >> Hi Laurent,
>> >>
>> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>> >>
>> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> >> video capture buffers. However, on some platforms, using non-coherent
>> >> >> buffers can improve performance, especially when CPU processing of
>> >> >> MMAP'ed video buffers is required.
>> >> >>
>> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >> >
>> >> > What's the time taken by the cache management operations ?
>> >>
>> >> Sorry for the late reply, your question turned out a little more
>> >> interesting than I expected initially. :)
>> >>
>> >> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> >> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> >> buffers in this case is more CPU-efficient even when considering cache
>> >> management overhead.
>> >>
>> >> When trying to do the same measurements with libcamera, I failed. In a
>> >> typical libcamera use case when MMAP buffers are allocated from a
>> >> device, exported as dmabufs and then used for capture on the same device
>> >> with DMABUF memory type, cache management in kernel is skipped [1]
>> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> >> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>> >
>> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in
>> > the non-coherent mode, those ops should perform proper cache
>> > maintenance.
>>
>> Thanks for pointing this out!
>>
>> > Let me send a patch to fix this in a couple of days unless someone
>> > does it earlier.
>>
>> Now that we know that this is a bug, not an API misuse from my side, I
>> can fix this myself and send a v2. Would this be okay for you?
>
> I'd be more than happy :)
Done, see [1]. A review would be appreciated. :)
[1] https://lore.kernel.org/all/20250115-b4-rkisp-noncoherent-v2-0-0853e1a24012@gmail.com/
--
Best regards,
Mikhail Rudenko
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-14 16:00 ` Mikhail Rudenko
2025-01-15 8:31 ` Tomasz Figa
@ 2025-01-15 19:13 ` Nicolas Dufresne
2025-02-27 17:05 ` Jacopo Mondi
2 siblings, 0 replies; 17+ messages in thread
From: Nicolas Dufresne @ 2025-01-15 19:13 UTC (permalink / raw)
To: Mikhail Rudenko, Laurent Pinchart
Cc: Dafna Hirschfeld, Mauro Carvalho Chehab, Heiko Stuebner,
linux-media, linux-rockchip, linux-arm-kernel, linux-kernel
Le mardi 14 janvier 2025 à 19:00 +0300, Mikhail Rudenko a écrit :
> Hi Laurent,
>
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > video capture buffers. However, on some platforms, using non-coherent
> > > buffers can improve performance, especially when CPU processing of
> > > MMAP'ed video buffers is required.
> > >
> > > For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > non-coherent DMA allocation. CPU usage also decreases accordingly.
> >
> > What's the time taken by the cache management operations ?
>
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
>
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
>
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
The manual coherency hints are not implemented for DMABuf, and libcamera only do
dmabuf. Someone will have to look into that. This is also why this API have very
low adoption, its breaks easily.
>
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>
> --
> Best regards,
> Mikhail Rudenko
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-01-14 16:00 ` Mikhail Rudenko
2025-01-15 8:31 ` Tomasz Figa
2025-01-15 19:13 ` Nicolas Dufresne
@ 2025-02-27 17:05 ` Jacopo Mondi
2025-02-27 20:46 ` Mikhail Rudenko
2025-02-28 10:00 ` Tomasz Figa
2 siblings, 2 replies; 17+ messages in thread
From: Jacopo Mondi @ 2025-02-27 17:05 UTC (permalink / raw)
To: Mikhail Rudenko
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
Hi Mikhail
On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
>
> Hi Laurent,
>
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> video capture buffers. However, on some platforms, using non-coherent
> >> buffers can improve performance, especially when CPU processing of
> >> MMAP'ed video buffers is required.
> >>
> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >
> > What's the time taken by the cache management operations ?
>
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
>
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
>
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
It would be shame if we let this discussion drop dead.. cache
management policies are relevant for performances, specifically for
cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
>
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
I would like to know from Hans if the decision to disallow cache-hints
for dmabuf importers is a design choice or is deeply rooted in other
reasons I might be missing.
I'm asking because the idea is for libcamera to act solely as dma-buf
importer, the current alloc-export-then-import trick is an helper for
applications to work around the absence of a system allocator.
If the requirement to disable cache-hints for importers cannot be
lifted, for libcamera it means we would not be able to use it.
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>
> --
> Best regards,
> Mikhail Rudenko
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-27 17:05 ` Jacopo Mondi
@ 2025-02-27 20:46 ` Mikhail Rudenko
2025-02-28 2:58 ` Nicolas Dufresne
2025-02-28 10:00 ` Tomasz Figa
1 sibling, 1 reply; 17+ messages in thread
From: Mikhail Rudenko @ 2025-02-27 20:46 UTC (permalink / raw)
To: Jacopo Mondi
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
Hi Jacopo,
On 2025-02-27 at 18:05 +01, Jacopo Mondi <jacopo.mondi@ideasonboard.com> wrote:
> Hi Mikhail
>
> On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
>>
>> Hi Laurent,
>>
>> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>>
>> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> video capture buffers. However, on some platforms, using non-coherent
>> >> buffers can improve performance, especially when CPU processing of
>> >> MMAP'ed video buffers is required.
>> >>
>> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >
>> > What's the time taken by the cache management operations ?
>>
>> Sorry for the late reply, your question turned out a little more
>> interesting than I expected initially. :)
>>
>> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> buffers in this case is more CPU-efficient even when considering cache
>> management overhead.
>>
>> When trying to do the same measurements with libcamera, I failed. In a
>> typical libcamera use case when MMAP buffers are allocated from a
>> device, exported as dmabufs and then used for capture on the same device
>> with DMABUF memory type, cache management in kernel is skipped [1]
>> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>>
>> So it looks like to make this change really useful, the above issue of
>> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
>> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>
> It would be shame if we let this discussion drop dead.. cache
> management policies are relevant for performances, specifically for
> cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
>
>>
>> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
>
> I would like to know from Hans if the decision to disallow cache-hints
> for dmabuf importers is a design choice or is deeply rooted in other
> reasons I might be missing.
>
> I'm asking because the idea is for libcamera to act solely as dma-buf
> importer, the current alloc-export-then-import trick is an helper for
> applications to work around the absence of a system allocator.
>
> If the requirement to disable cache-hints for importers cannot be
> lifted, for libcamera it means we would not be able to use it.
Meanwhile, I have posted a patch, which re-enables cache management ops
for non-coherent dmabufs exported from dma-contig allocator [1]. It is
currently waiting for review.
[1] https://lore.kernel.org/all/20250128-b4-rkisp-noncoherent-v3-1-baf39c997d2a@gmail.com/
>
>> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
>> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>>
>> --
>> Best regards,
>> Mikhail Rudenko
>>
--
Best regards,
Mikhail Rudenko
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-27 20:46 ` Mikhail Rudenko
@ 2025-02-28 2:58 ` Nicolas Dufresne
2025-02-28 9:54 ` Hans Verkuil
0 siblings, 1 reply; 17+ messages in thread
From: Nicolas Dufresne @ 2025-02-28 2:58 UTC (permalink / raw)
To: Mikhail Rudenko, Jacopo Mondi
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
Le jeudi 27 février 2025 à 23:46 +0300, Mikhail Rudenko a écrit :
> Hi Jacopo,
>
> On 2025-02-27 at 18:05 +01, Jacopo Mondi <jacopo.mondi@ideasonboard.com> wrote:
>
> > Hi Mikhail
> >
> > On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> > >
> > > Hi Laurent,
> > >
> > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> > >
> > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > > > Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > > > video capture buffers. However, on some platforms, using non-coherent
> > > > > buffers can improve performance, especially when CPU processing of
> > > > > MMAP'ed video buffers is required.
> > > > >
> > > > > For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > > > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > > > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > > > non-coherent DMA allocation. CPU usage also decreases accordingly.
> > > >
> > > > What's the time taken by the cache management operations ?
> > >
> > > Sorry for the late reply, your question turned out a little more
> > > interesting than I expected initially. :)
> > >
> > > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > > buffers in this case is more CPU-efficient even when considering cache
> > > management overhead.
> > >
> > > When trying to do the same measurements with libcamera, I failed. In a
> > > typical libcamera use case when MMAP buffers are allocated from a
> > > device, exported as dmabufs and then used for capture on the same device
> > > with DMABUF memory type, cache management in kernel is skipped [1]
> > > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> > >
> > > So it looks like to make this change really useful, the above issue of
> > > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> >
> > It would be shame if we let this discussion drop dead.. cache
> > management policies are relevant for performances, specifically for
> > cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
> >
> > >
> > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> >
> > I would like to know from Hans if the decision to disallow cache-hints
> > for dmabuf importers is a design choice or is deeply rooted in other
> > reasons I might be missing.
> >
> > I'm asking because the idea is for libcamera to act solely as dma-buf
> > importer, the current alloc-export-then-import trick is an helper for
> > applications to work around the absence of a system allocator.
> >
> > If the requirement to disable cache-hints for importers cannot be
> > lifted, for libcamera it means we would not be able to use it.
>
> Meanwhile, I have posted a patch, which re-enables cache management ops
> for non-coherent dmabufs exported from dma-contig allocator [1]. It is
> currently waiting for review.
>
> [1] https://lore.kernel.org/all/20250128-b4-rkisp-noncoherent-v3-1-baf39c997d2a@gmail.com/
Thanks for you work, this matches what I was referring to missing in my
previous reply.
I don't think there is any intention to block or deprecate it, but
partially enabling leads to problems. Do we need something in the SG
allocator to ?
Nicolas
>
> >
> > > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> > >
> > > --
> > > Best regards,
> > > Mikhail Rudenko
> > >
>
>
> --
> Best regards,
> Mikhail Rudenko
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-28 2:58 ` Nicolas Dufresne
@ 2025-02-28 9:54 ` Hans Verkuil
0 siblings, 0 replies; 17+ messages in thread
From: Hans Verkuil @ 2025-02-28 9:54 UTC (permalink / raw)
To: Nicolas Dufresne, Mikhail Rudenko, Jacopo Mondi, Tomasz Figa
Cc: Laurent Pinchart, Dafna Hirschfeld, Mauro Carvalho Chehab,
Heiko Stuebner, linux-media, linux-rockchip, linux-arm-kernel,
linux-kernel
On 28/02/2025 03:58, Nicolas Dufresne wrote:
> Le jeudi 27 février 2025 à 23:46 +0300, Mikhail Rudenko a écrit :
>> Hi Jacopo,
>>
>> On 2025-02-27 at 18:05 +01, Jacopo Mondi <jacopo.mondi@ideasonboard.com> wrote:
>>
>>> Hi Mikhail
>>>
>>> On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
>>>>
>>>> Hi Laurent,
>>>>
>>>> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>>>>
>>>>> On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>>>>>> Currently, the rkisp1 driver always uses coherent DMA allocations for
>>>>>> video capture buffers. However, on some platforms, using non-coherent
>>>>>> buffers can improve performance, especially when CPU processing of
>>>>>> MMAP'ed video buffers is required.
>>>>>>
>>>>>> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>>>>>> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>>>>>> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>>>>>> non-coherent DMA allocation. CPU usage also decreases accordingly.
>>>>>
>>>>> What's the time taken by the cache management operations ?
>>>>
>>>> Sorry for the late reply, your question turned out a little more
>>>> interesting than I expected initially. :)
>>>>
>>>> When capturing using Yavta with MMAP buffers under the conditions mentioned
>>>> in the commit message, ftrace gives 437.6 +- 1.1 us for
>>>> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>>>> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>>>> buffers in this case is more CPU-efficient even when considering cache
>>>> management overhead.
>>>>
>>>> When trying to do the same measurements with libcamera, I failed. In a
>>>> typical libcamera use case when MMAP buffers are allocated from a
>>>> device, exported as dmabufs and then used for capture on the same device
>>>> with DMABUF memory type, cache management in kernel is skipped [1]
>>>> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>>>> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>>>>
>>>> So it looks like to make this change really useful, the above issue of
>>>> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
>>>> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>>>
>>> It would be shame if we let this discussion drop dead.. cache
>>> management policies are relevant for performances, specifically for
>>> cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
>>>
>>>>
>>>> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
>>>
>>> I would like to know from Hans if the decision to disallow cache-hints
>>> for dmabuf importers is a design choice or is deeply rooted in other
>>> reasons I might be missing.
>>>
>>> I'm asking because the idea is for libcamera to act solely as dma-buf
>>> importer, the current alloc-export-then-import trick is an helper for
>>> applications to work around the absence of a system allocator.
>>>
>>> If the requirement to disable cache-hints for importers cannot be
>>> lifted, for libcamera it means we would not be able to use it.
>>
>> Meanwhile, I have posted a patch, which re-enables cache management ops
>> for non-coherent dmabufs exported from dma-contig allocator [1]. It is
>> currently waiting for review.
>>
>> [1] https://lore.kernel.org/all/20250128-b4-rkisp-noncoherent-v3-1-baf39c997d2a@gmail.com/
I rely heavily on Tomasz Figa for review of such patches. He was OK with the
patch, but he suggested waiting a bit for more comments. My plan was to pick it
up end of next week if there are no further comments.
Regards,
Hans
>
> Thanks for you work, this matches what I was referring to missing in my
> previous reply.
>
> I don't think there is any intention to block or deprecate it, but
> partially enabling leads to problems. Do we need something in the SG
> allocator to ?
>
> Nicolas
>
>>
>>>
>>>> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
>>>> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>>>>
>>>> --
>>>> Best regards,
>>>> Mikhail Rudenko
>>>>
>>
>>
>> --
>> Best regards,
>> Mikhail Rudenko
>>
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-27 17:05 ` Jacopo Mondi
2025-02-27 20:46 ` Mikhail Rudenko
@ 2025-02-28 10:00 ` Tomasz Figa
2025-02-28 10:18 ` Jacopo Mondi
1 sibling, 1 reply; 17+ messages in thread
From: Tomasz Figa @ 2025-02-28 10:00 UTC (permalink / raw)
To: Jacopo Mondi
Cc: Mikhail Rudenko, Laurent Pinchart, Dafna Hirschfeld,
Mauro Carvalho Chehab, Heiko Stuebner, linux-media,
linux-rockchip, linux-arm-kernel, linux-kernel
Hi Jacopo,
On Fri, Feb 28, 2025 at 2:11 AM Jacopo Mondi
<jacopo.mondi@ideasonboard.com> wrote:
>
> Hi Mikhail
>
> On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> >
> > Hi Laurent,
> >
> > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> >
> > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> > >> video capture buffers. However, on some platforms, using non-coherent
> > >> buffers can improve performance, especially when CPU processing of
> > >> MMAP'ed video buffers is required.
> > >>
> > >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> > >
> > > What's the time taken by the cache management operations ?
> >
> > Sorry for the late reply, your question turned out a little more
> > interesting than I expected initially. :)
> >
> > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > buffers in this case is more CPU-efficient even when considering cache
> > management overhead.
> >
> > When trying to do the same measurements with libcamera, I failed. In a
> > typical libcamera use case when MMAP buffers are allocated from a
> > device, exported as dmabufs and then used for capture on the same device
> > with DMABUF memory type, cache management in kernel is skipped [1]
> > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> >
> > So it looks like to make this change really useful, the above issue of
> > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>
> It would be shame if we let this discussion drop dead.. cache
> management policies are relevant for performances, specifically for
> cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
>
> >
> > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
>
> I would like to know from Hans if the decision to disallow cache-hints
> for dmabuf importers is a design choice or is deeply rooted in other
> reasons I might be missing.
When DMA-buf is used, the responsibility for cache management is
solely on the CPU users' side, so cache-hints don't really apply. It's
the exporter (=allocator) who determines the mapping policy of the
buffer and provides necessary DMA_BUF_SYNC operations (can be no-op if
the buffer is coherent).
Best regards,
Tomasz
>
> I'm asking because the idea is for libcamera to act solely as dma-buf
> importer, the current alloc-export-then-import trick is an helper for
> applications to work around the absence of a system allocator.
>
> If the requirement to disable cache-hints for importers cannot be
> lifted, for libcamera it means we would not be able to use it.
>
>
> > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> >
> > --
> > Best regards,
> > Mikhail Rudenko
> >
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-28 10:00 ` Tomasz Figa
@ 2025-02-28 10:18 ` Jacopo Mondi
2025-02-28 10:28 ` Tomasz Figa
0 siblings, 1 reply; 17+ messages in thread
From: Jacopo Mondi @ 2025-02-28 10:18 UTC (permalink / raw)
To: Tomasz Figa
Cc: Jacopo Mondi, Mikhail Rudenko, Laurent Pinchart, Dafna Hirschfeld,
Mauro Carvalho Chehab, Heiko Stuebner, linux-media,
linux-rockchip, linux-arm-kernel, linux-kernel
Hi Tomasz
On Fri, Feb 28, 2025 at 07:00:57PM +0900, Tomasz Figa wrote:
> Hi Jacopo,
>
> On Fri, Feb 28, 2025 at 2:11 AM Jacopo Mondi
> <jacopo.mondi@ideasonboard.com> wrote:
> >
> > Hi Mikhail
> >
> > On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> > >
> > > Hi Laurent,
> > >
> > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> > >
> > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > >> video capture buffers. However, on some platforms, using non-coherent
> > > >> buffers can improve performance, especially when CPU processing of
> > > >> MMAP'ed video buffers is required.
> > > >>
> > > >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> > > >
> > > > What's the time taken by the cache management operations ?
> > >
> > > Sorry for the late reply, your question turned out a little more
> > > interesting than I expected initially. :)
> > >
> > > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > > buffers in this case is more CPU-efficient even when considering cache
> > > management overhead.
> > >
> > > When trying to do the same measurements with libcamera, I failed. In a
> > > typical libcamera use case when MMAP buffers are allocated from a
> > > device, exported as dmabufs and then used for capture on the same device
> > > with DMABUF memory type, cache management in kernel is skipped [1]
> > > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> > >
> > > So it looks like to make this change really useful, the above issue of
> > > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> >
> > It would be shame if we let this discussion drop dead.. cache
> > management policies are relevant for performances, specifically for
> > cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
> >
> > >
> > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> >
> > I would like to know from Hans if the decision to disallow cache-hints
> > for dmabuf importers is a design choice or is deeply rooted in other
> > reasons I might be missing.
>
> When DMA-buf is used, the responsibility for cache management is
> solely on the CPU users' side, so cache-hints don't really apply. It's
> the exporter (=allocator) who determines the mapping policy of the
> buffer and provides necessary DMA_BUF_SYNC operations (can be no-op if
> the buffer is coherent).
This all makes sense.
I take it as, for libcamera, users of the FrameBufferAllocator helper
(which first exports MMAP buffers from the video device and the
imports them back as DMABUF) won't be able to control the cache
policies.
Now, in the long term, we want FrameBufferAllocator to go away and
have either buffers exported by the consumer (likely DRM) or by a
system wide buffer allocator (when we'll have one) and have the video
devices operate as pure importers. But for the time being the
"first export then import" use case is possibile and valid so I wonder
if we should consider measures to allow controlling caching policies
for this use case too.
>
> Best regards,
> Tomasz
>
> >
> > I'm asking because the idea is for libcamera to act solely as dma-buf
> > importer, the current alloc-export-then-import trick is an helper for
> > applications to work around the absence of a system allocator.
> >
> > If the requirement to disable cache-hints for importers cannot be
> > lifted, for libcamera it means we would not be able to use it.
> >
> >
> > > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> > >
> > > --
> > > Best regards,
> > > Mikhail Rudenko
> > >
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-28 10:18 ` Jacopo Mondi
@ 2025-02-28 10:28 ` Tomasz Figa
2025-02-28 10:48 ` Jacopo Mondi
0 siblings, 1 reply; 17+ messages in thread
From: Tomasz Figa @ 2025-02-28 10:28 UTC (permalink / raw)
To: Jacopo Mondi
Cc: Mikhail Rudenko, Laurent Pinchart, Dafna Hirschfeld,
Mauro Carvalho Chehab, Heiko Stuebner, linux-media,
linux-rockchip, linux-arm-kernel, linux-kernel
On Fri, Feb 28, 2025 at 7:18 PM Jacopo Mondi
<jacopo.mondi@ideasonboard.com> wrote:
>
> Hi Tomasz
>
> On Fri, Feb 28, 2025 at 07:00:57PM +0900, Tomasz Figa wrote:
> > Hi Jacopo,
> >
> > On Fri, Feb 28, 2025 at 2:11 AM Jacopo Mondi
> > <jacopo.mondi@ideasonboard.com> wrote:
> > >
> > > Hi Mikhail
> > >
> > > On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> > > >
> > > > Hi Laurent,
> > > >
> > > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> > > >
> > > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > > >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > > >> video capture buffers. However, on some platforms, using non-coherent
> > > > >> buffers can improve performance, especially when CPU processing of
> > > > >> MMAP'ed video buffers is required.
> > > > >>
> > > > >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > > >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> > > > >
> > > > > What's the time taken by the cache management operations ?
> > > >
> > > > Sorry for the late reply, your question turned out a little more
> > > > interesting than I expected initially. :)
> > > >
> > > > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > > > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > > > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > > > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > > > buffers in this case is more CPU-efficient even when considering cache
> > > > management overhead.
> > > >
> > > > When trying to do the same measurements with libcamera, I failed. In a
> > > > typical libcamera use case when MMAP buffers are allocated from a
> > > > device, exported as dmabufs and then used for capture on the same device
> > > > with DMABUF memory type, cache management in kernel is skipped [1]
> > > > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > > > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> > > >
> > > > So it looks like to make this change really useful, the above issue of
> > > > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > > > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> > >
> > > It would be shame if we let this discussion drop dead.. cache
> > > management policies are relevant for performances, specifically for
> > > cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
> > >
> > > >
> > > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> > >
> > > I would like to know from Hans if the decision to disallow cache-hints
> > > for dmabuf importers is a design choice or is deeply rooted in other
> > > reasons I might be missing.
> >
> > When DMA-buf is used, the responsibility for cache management is
> > solely on the CPU users' side, so cache-hints don't really apply. It's
> > the exporter (=allocator) who determines the mapping policy of the
> > buffer and provides necessary DMA_BUF_SYNC operations (can be no-op if
> > the buffer is coherent).
>
> This all makes sense.
>
> I take it as, for libcamera, users of the FrameBufferAllocator helper
> (which first exports MMAP buffers from the video device and the
> imports them back as DMABUF) won't be able to control the cache
> policies.
>
> Now, in the long term, we want FrameBufferAllocator to go away and
> have either buffers exported by the consumer (likely DRM) or by a
> system wide buffer allocator (when we'll have one) and have the video
> devices operate as pure importers. But for the time being the
> "first export then import" use case is possibile and valid so I wonder
> if we should consider measures to allow controlling caching policies
> for this use case too.
Hmm, I may be missing something, but if FrameBufferAllocator does the
allocation internally in libcamera, why couldn't it use the
V4L2_MEMORY_FLAG_NON_COHERENT REQBUFS/CREATE_BUFS flag?
Note that however, once the buffer is allocated and mapped once, on
many architectures it must keep the same mapping attributes across the
different mappings, due to how the memory hierarchy works. (Not sure
if this is what you are asking for here, though.)
>
> >
> > Best regards,
> > Tomasz
> >
> > >
> > > I'm asking because the idea is for libcamera to act solely as dma-buf
> > > importer, the current alloc-export-then-import trick is an helper for
> > > applications to work around the absence of a system allocator.
> > >
> > > If the requirement to disable cache-hints for importers cannot be
> > > lifted, for libcamera it means we would not be able to use it.
> > >
> > >
> > > > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > > > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> > > >
> > > > --
> > > > Best regards,
> > > > Mikhail Rudenko
> > > >
> > >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-28 10:28 ` Tomasz Figa
@ 2025-02-28 10:48 ` Jacopo Mondi
2025-02-28 11:19 ` Tomasz Figa
0 siblings, 1 reply; 17+ messages in thread
From: Jacopo Mondi @ 2025-02-28 10:48 UTC (permalink / raw)
To: Tomasz Figa
Cc: Jacopo Mondi, Mikhail Rudenko, Laurent Pinchart, Dafna Hirschfeld,
Mauro Carvalho Chehab, Heiko Stuebner, linux-media,
linux-rockchip, linux-arm-kernel, linux-kernel
Hi Tomasz
On Fri, Feb 28, 2025 at 07:28:43PM +0900, Tomasz Figa wrote:
> On Fri, Feb 28, 2025 at 7:18 PM Jacopo Mondi
> <jacopo.mondi@ideasonboard.com> wrote:
> >
> > Hi Tomasz
> >
> > On Fri, Feb 28, 2025 at 07:00:57PM +0900, Tomasz Figa wrote:
> > > Hi Jacopo,
> > >
> > > On Fri, Feb 28, 2025 at 2:11 AM Jacopo Mondi
> > > <jacopo.mondi@ideasonboard.com> wrote:
> > > >
> > > > Hi Mikhail
> > > >
> > > > On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> > > > >
> > > > > Hi Laurent,
> > > > >
> > > > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> > > > >
> > > > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > > > >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > > > >> video capture buffers. However, on some platforms, using non-coherent
> > > > > >> buffers can improve performance, especially when CPU processing of
> > > > > >> MMAP'ed video buffers is required.
> > > > > >>
> > > > > >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > > > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > > > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > > > >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> > > > > >
> > > > > > What's the time taken by the cache management operations ?
> > > > >
> > > > > Sorry for the late reply, your question turned out a little more
> > > > > interesting than I expected initially. :)
> > > > >
> > > > > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > > > > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > > > > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > > > > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > > > > buffers in this case is more CPU-efficient even when considering cache
> > > > > management overhead.
> > > > >
> > > > > When trying to do the same measurements with libcamera, I failed. In a
> > > > > typical libcamera use case when MMAP buffers are allocated from a
> > > > > device, exported as dmabufs and then used for capture on the same device
> > > > > with DMABUF memory type, cache management in kernel is skipped [1]
> > > > > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > > > > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> > > > >
> > > > > So it looks like to make this change really useful, the above issue of
> > > > > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > > > > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> > > >
> > > > It would be shame if we let this discussion drop dead.. cache
> > > > management policies are relevant for performances, specifically for
> > > > cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
> > > >
> > > > >
> > > > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> > > >
> > > > I would like to know from Hans if the decision to disallow cache-hints
> > > > for dmabuf importers is a design choice or is deeply rooted in other
> > > > reasons I might be missing.
> > >
> > > When DMA-buf is used, the responsibility for cache management is
> > > solely on the CPU users' side, so cache-hints don't really apply. It's
> > > the exporter (=allocator) who determines the mapping policy of the
> > > buffer and provides necessary DMA_BUF_SYNC operations (can be no-op if
> > > the buffer is coherent).
> >
> > This all makes sense.
> >
> > I take it as, for libcamera, users of the FrameBufferAllocator helper
> > (which first exports MMAP buffers from the video device and the
> > imports them back as DMABUF) won't be able to control the cache
> > policies.
> >
> > Now, in the long term, we want FrameBufferAllocator to go away and
> > have either buffers exported by the consumer (likely DRM) or by a
> > system wide buffer allocator (when we'll have one) and have the video
> > devices operate as pure importers. But for the time being the
> > "first export then import" use case is possibile and valid so I wonder
> > if we should consider measures to allow controlling caching policies
> > for this use case too.
>
> Hmm, I may be missing something, but if FrameBufferAllocator does the
> allocation internally in libcamera, why couldn't it use the
> V4L2_MEMORY_FLAG_NON_COHERENT REQBUFS/CREATE_BUFS flag?
-I- might be missing something, but reading the below links that
Mikhail reported in a previous email
[1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
[2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
it seems to me that yes, when first exporting you can hint to tell
vb2 to allocate from non-coherent/non-contiguous memory, but then when
switching the device to importer mode (and import the same buffers it
previously exported) the cache sync point would then
be skipped, leading to possible synchronization issues between the cpu
consumer and the device.
>
> Note that however, once the buffer is allocated and mapped once, on
> many architectures it must keep the same mapping attributes across the
> different mappings, due to how the memory hierarchy works. (Not sure
> if this is what you are asking for here, though.)
>
> >
> > >
> > > Best regards,
> > > Tomasz
> > >
> > > >
> > > > I'm asking because the idea is for libcamera to act solely as dma-buf
> > > > importer, the current alloc-export-then-import trick is an helper for
> > > > applications to work around the absence of a system allocator.
> > > >
> > > > If the requirement to disable cache-hints for importers cannot be
> > > > lifted, for libcamera it means we would not be able to use it.
> > > >
> > > >
> > > > > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > > > > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Mikhail Rudenko
> > > > >
> > > >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] media: rkisp1: allow non-coherent video capture buffers
2025-02-28 10:48 ` Jacopo Mondi
@ 2025-02-28 11:19 ` Tomasz Figa
0 siblings, 0 replies; 17+ messages in thread
From: Tomasz Figa @ 2025-02-28 11:19 UTC (permalink / raw)
To: Jacopo Mondi
Cc: Mikhail Rudenko, Laurent Pinchart, Dafna Hirschfeld,
Mauro Carvalho Chehab, Heiko Stuebner, linux-media,
linux-rockchip, linux-arm-kernel, linux-kernel
On Fri, Feb 28, 2025 at 7:48 PM Jacopo Mondi
<jacopo.mondi@ideasonboard.com> wrote:
>
> Hi Tomasz
>
> On Fri, Feb 28, 2025 at 07:28:43PM +0900, Tomasz Figa wrote:
> > On Fri, Feb 28, 2025 at 7:18 PM Jacopo Mondi
> > <jacopo.mondi@ideasonboard.com> wrote:
> > >
> > > Hi Tomasz
> > >
> > > On Fri, Feb 28, 2025 at 07:00:57PM +0900, Tomasz Figa wrote:
> > > > Hi Jacopo,
> > > >
> > > > On Fri, Feb 28, 2025 at 2:11 AM Jacopo Mondi
> > > > <jacopo.mondi@ideasonboard.com> wrote:
> > > > >
> > > > > Hi Mikhail
> > > > >
> > > > > On Tue, Jan 14, 2025 at 07:00:39PM +0300, Mikhail Rudenko wrote:
> > > > > >
> > > > > > Hi Laurent,
> > > > > >
> > > > > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> > > > > >
> > > > > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > > > > >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > > > > >> video capture buffers. However, on some platforms, using non-coherent
> > > > > > >> buffers can improve performance, especially when CPU processing of
> > > > > > >> MMAP'ed video buffers is required.
> > > > > > >>
> > > > > > >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > > > > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > > > > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > > > > >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> > > > > > >
> > > > > > > What's the time taken by the cache management operations ?
> > > > > >
> > > > > > Sorry for the late reply, your question turned out a little more
> > > > > > interesting than I expected initially. :)
> > > > > >
> > > > > > When capturing using Yavta with MMAP buffers under the conditions mentioned
> > > > > > in the commit message, ftrace gives 437.6 +- 1.1 us for
> > > > > > dma_sync_sgtable_for_cpu and 409 +- 14 us for
> > > > > > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> > > > > > buffers in this case is more CPU-efficient even when considering cache
> > > > > > management overhead.
> > > > > >
> > > > > > When trying to do the same measurements with libcamera, I failed. In a
> > > > > > typical libcamera use case when MMAP buffers are allocated from a
> > > > > > device, exported as dmabufs and then used for capture on the same device
> > > > > > with DMABUF memory type, cache management in kernel is skipped [1]
> > > > > > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> > > > > > DMA_BUF_IOCTL_SYNC from userspace does not work either.
> > > > > >
> > > > > > So it looks like to make this change really useful, the above issue of
> > > > > > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> > > > > > solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> > > > >
> > > > > It would be shame if we let this discussion drop dead.. cache
> > > > > management policies are relevant for performances, specifically for
> > > > > cpu access, and your above 7.7ms vs 1.1 ms test clearly shows that.
> > > > >
> > > > > >
> > > > > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> > > > >
> > > > > I would like to know from Hans if the decision to disallow cache-hints
> > > > > for dmabuf importers is a design choice or is deeply rooted in other
> > > > > reasons I might be missing.
> > > >
> > > > When DMA-buf is used, the responsibility for cache management is
> > > > solely on the CPU users' side, so cache-hints don't really apply. It's
> > > > the exporter (=allocator) who determines the mapping policy of the
> > > > buffer and provides necessary DMA_BUF_SYNC operations (can be no-op if
> > > > the buffer is coherent).
> > >
> > > This all makes sense.
> > >
> > > I take it as, for libcamera, users of the FrameBufferAllocator helper
> > > (which first exports MMAP buffers from the video device and the
> > > imports them back as DMABUF) won't be able to control the cache
> > > policies.
> > >
> > > Now, in the long term, we want FrameBufferAllocator to go away and
> > > have either buffers exported by the consumer (likely DRM) or by a
> > > system wide buffer allocator (when we'll have one) and have the video
> > > devices operate as pure importers. But for the time being the
> > > "first export then import" use case is possibile and valid so I wonder
> > > if we should consider measures to allow controlling caching policies
> > > for this use case too.
> >
> > Hmm, I may be missing something, but if FrameBufferAllocator does the
> > allocation internally in libcamera, why couldn't it use the
> > V4L2_MEMORY_FLAG_NON_COHERENT REQBUFS/CREATE_BUFS flag?
>
> -I- might be missing something, but reading the below links that
> Mikhail reported in a previous email
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
>
> it seems to me that yes, when first exporting you can hint to tell
> vb2 to allocate from non-coherent/non-contiguous memory, but then when
> switching the device to importer mode (and import the same buffers it
> previously exported) the cache sync point would then
> be skipped, leading to possible synchronization issues between the cpu
> consumer and the device.
Yeah, and that is 100% expected. As I said in my previous reply, in
the DMA-buf world, it's the responsibility of the users doing the CPU
accesses to ensure the correct synchronization (aka explicit sync).
>
> >
> > Note that however, once the buffer is allocated and mapped once, on
> > many architectures it must keep the same mapping attributes across the
> > different mappings, due to how the memory hierarchy works. (Not sure
> > if this is what you are asking for here, though.)
> >
> > >
> > > >
> > > > Best regards,
> > > > Tomasz
> > > >
> > > > >
> > > > > I'm asking because the idea is for libcamera to act solely as dma-buf
> > > > > importer, the current alloc-export-then-import trick is an helper for
> > > > > applications to work around the absence of a system allocator.
> > > > >
> > > > > If the requirement to disable cache-hints for importers cannot be
> > > > > lifted, for libcamera it means we would not be able to use it.
> > > > >
> > > > >
> > > > > > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> > > > > > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Mikhail Rudenko
> > > > > >
> > > > >
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-02-28 12:06 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-02 15:35 [PATCH] media: rkisp1: allow non-coherent video capture buffers Mikhail Rudenko
2025-01-03 15:23 ` Laurent Pinchart
2025-01-14 16:00 ` Mikhail Rudenko
2025-01-15 8:31 ` Tomasz Figa
2025-01-15 13:24 ` Mikhail Rudenko
2025-01-15 14:46 ` Tomasz Figa
2025-01-15 17:29 ` Mikhail Rudenko
2025-01-15 19:13 ` Nicolas Dufresne
2025-02-27 17:05 ` Jacopo Mondi
2025-02-27 20:46 ` Mikhail Rudenko
2025-02-28 2:58 ` Nicolas Dufresne
2025-02-28 9:54 ` Hans Verkuil
2025-02-28 10:00 ` Tomasz Figa
2025-02-28 10:18 ` Jacopo Mondi
2025-02-28 10:28 ` Tomasz Figa
2025-02-28 10:48 ` Jacopo Mondi
2025-02-28 11:19 ` Tomasz Figa
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).