* [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
@ 2026-04-06 21:49 Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
0 siblings, 1 reply; 8+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-06 21:49 UTC (permalink / raw)
To: linux-media, dri-devel, linaro-mm-sig
Cc: linux-kernel, Xueyuan Chen, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T . J . Mercier, Christian König,
Barry Song
From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
with a more efficient nested loop approach.
Instead of iterating page by page, we now iterate through the scatterlist
entries via for_each_sgtable_sg(). Because pages within a single sg entry
are physically contiguous, we can populate the page array in an inner loop
using simple pointer arithmetic. This saves a significant amount of time.
The WARN_ON check is also pulled out of the loop to save branch
instructions.
Performance results mapping a 2GB buffer on Radxa O6:
- Before: ~1440000 ns
- After: ~232000 ns
(~84% reduction in iteration time, or ~6.2x faster)
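For reference, here is a minimal sketch of the kind of timing instrumentation
behind these numbers (hypothetical code, not part of this patch; it assumes
the caller already holds a reference to a system-heap dma_buf and uses the
existing dma_buf_vmap_unlocked()/dma_buf_vunmap_unlocked() helpers):

/*
 * Hypothetical timing sketch: measure how long it takes to vmap an
 * already-exported system-heap dma_buf.
 */
#include <linux/dma-buf.h>
#include <linux/iosys-map.h>
#include <linux/ktime.h>

static void time_dma_buf_vmap(struct dma_buf *dmabuf)
{
	struct iosys_map map;
	u64 start, elapsed;
	int ret;

	start = ktime_get_ns();
	ret = dma_buf_vmap_unlocked(dmabuf, &map); /* reaches system_heap_do_vmap() */
	elapsed = ktime_get_ns() - start;
	if (ret) {
		pr_err("vmap failed: %d\n", ret);
		return;
	}

	pr_info("vmap of %zu bytes took %llu ns\n", dmabuf->size, elapsed);
	dma_buf_vunmap_unlocked(dmabuf, &map);
}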
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Xueyuan Chen <Xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
drivers/dma-buf/heaps/system_heap.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index b3650d8fd651..769f01f0cc96 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -224,16 +224,21 @@ static void *system_heap_do_vmap(struct system_heap_buffer *buffer)
int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
struct page **pages = vmalloc(sizeof(struct page *) * npages);
struct page **tmp = pages;
- struct sg_page_iter piter;
void *vaddr;
+ u32 i, j, count;
+ struct page *base_page;
+ struct scatterlist *sg;
if (!pages)
return ERR_PTR(-ENOMEM);
- for_each_sgtable_page(table, &piter, 0) {
- WARN_ON(tmp - pages >= npages);
- *tmp++ = sg_page_iter_page(&piter);
+ for_each_sgtable_sg(table, sg, i) {
+ base_page = sg_page(sg);
+ count = sg->length >> PAGE_SHIFT;
+ for (j = 0; j < count; j++)
+ *tmp++ = base_page + j;
}
+ WARN_ON(tmp - pages != npages);
vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
vfree(pages);
--
2.39.3 (Apple Git-146)
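For completeness, the 2GB test buffer used in the measurement above would
typically be allocated from userspace through the system heap's allocation
ioctl; a minimal sketch follows (hypothetical helper, assuming
/dev/dma_heap/system is present, i.e. CONFIG_DMABUF_HEAPS_SYSTEM=y):

/*
 * Hypothetical userspace helper, not part of this patch: allocate a
 * dma-buf from the system heap so a kernel driver can later vmap it.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

static int system_heap_alloc(size_t len)
{
	struct dma_heap_allocation_data data;
	int heap_fd, ret;

	heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
	if (heap_fd < 0)
		return -1;

	memset(&data, 0, sizeof(data));
	data.len = len;				/* e.g. 2UL << 30 for 2GB */
	data.fd_flags = O_RDWR | O_CLOEXEC;

	ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
	close(heap_fd);

	return ret < 0 ? -1 : (int)data.fd;	/* dma-buf fd on success */
}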
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
@ 2026-04-07 7:57 ` Christian König
2026-04-07 11:29 ` Barry Song
0 siblings, 1 reply; 8+ messages in thread
From: Christian König @ 2026-04-07 7:57 UTC (permalink / raw)
To: Barry Song (Xiaomi), linux-media, dri-devel, linaro-mm-sig
Cc: linux-kernel, Xueyuan Chen, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, T . J . Mercier
On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
>
> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> with a more efficient nested loop approach.
>
> Instead of iterating page by page, we now iterate through the scatterlist
> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> are physically contiguous, we can populate the page array with a in an
> inner loop using simple pointer math. This save a lot of time.
>
> The WARN_ON check is also pulled out of the loop to save branch
> instructions.
>
> Performance results mapping a 2GB buffer on Radxa O6:
> - Before: ~1440000 ns
> - After: ~232000 ns
> (~84% reduction in iteration time, or ~6.2x faster)
Well, the real question is: why do you care about vmap performance?
That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
Regards,
Christian.
>
> Cc: Sumit Semwal <sumit.semwal@linaro.org>
> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
> Cc: Brian Starkey <Brian.Starkey@arm.com>
> Cc: John Stultz <jstultz@google.com>
> Cc: T.J. Mercier <tjmercier@google.com>
> Cc: Christian König <christian.koenig@amd.com>
> Signed-off-by: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> drivers/dma-buf/heaps/system_heap.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index b3650d8fd651..769f01f0cc96 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -224,16 +224,21 @@ static void *system_heap_do_vmap(struct system_heap_buffer *buffer)
> int npages = PAGE_ALIGN(buffer->len) / PAGE_SIZE;
> struct page **pages = vmalloc(sizeof(struct page *) * npages);
> struct page **tmp = pages;
> - struct sg_page_iter piter;
> void *vaddr;
> + u32 i, j, count;
> + struct page *base_page;
> + struct scatterlist *sg;
>
> if (!pages)
> return ERR_PTR(-ENOMEM);
>
> - for_each_sgtable_page(table, &piter, 0) {
> - WARN_ON(tmp - pages >= npages);
> - *tmp++ = sg_page_iter_page(&piter);
> + for_each_sgtable_sg(table, sg, i) {
> + base_page = sg_page(sg);
> + count = sg->length >> PAGE_SHIFT;
> + for (j = 0; j < count; j++)
> + *tmp++ = base_page + j;
> }
> + WARN_ON(tmp - pages != npages);
>
> vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> vfree(pages);
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-07 7:57 ` Christian König
@ 2026-04-07 11:29 ` Barry Song
2026-04-22 7:10 ` Christian König
0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-04-07 11:29 UTC (permalink / raw)
To: Christian König
Cc: linux-media, dri-devel, linaro-mm-sig, linux-kernel, Xueyuan Chen,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T . J . Mercier
On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
>
> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> > From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> >
> > Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> > with a more efficient nested loop approach.
> >
> > Instead of iterating page by page, we now iterate through the scatterlist
> > entries via for_each_sgtable_sg(). Because pages within a single sg entry
> > are physically contiguous, we can populate the page array with a in an
> > inner loop using simple pointer math. This save a lot of time.
> >
> > The WARN_ON check is also pulled out of the loop to save branch
> > instructions.
> >
> > Performance results mapping a 2GB buffer on Radxa O6:
> > - Before: ~1440000 ns
> > - After: ~232000 ns
> > (~84% reduction in iteration time, or ~6.2x faster)
>
> Well real question is why do you care about the vmap performance?
>
> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
I agree that in mainline, dma_buf_vmap is not used very often.
Here’s what I was able to find:
1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
ret = dma_buf_vmap(dmabuf, map);
2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
<<drm_gem_shmem_vmap_locked>>
ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
<<etnaviv_gem_prime_vmap_impl>>
ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
However, in the Android ecosystem, system_heap and similar heaps
are widely used across camera, NPU, and media drivers. Many of these
drivers are not in mainline but do use vmap() in real code paths.
Here are some examples from MTK platforms:
1:
[ 6.689849] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 6.689859] dma_buf_vmap_unlocked+0xb8/0x130
[ 6.689861] aov_core_init+0x310/0x718 [mtk_aov
96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
[ 6.689873] mtk_aov_probe+0x434/0x5b4 [mtk_aov
96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
2:
[ 116.181643] __vmap_pages_range_noflush+0x7c4/0x814
[ 116.181645] vmap+0xb4/0x148
[ 116.181647] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 116.181651] dma_buf_vmap_unlocked+0xb8/0x130
[ 116.181653] mtk_cam_vb2_vaddr+0xa0/0xfc [mtk_cam_isp8s
0cf9be6c773a8f14aab9db9ebf53feacb499846a]
[ 116.181682] vb2_plane_vaddr+0x5c/0x78
[ 116.181684] mtk_cam_job_fill_ipi_frame+0xa8c/0x128c [mtk_cam_isp8s
0cf9be6c773a8f14aab9db9ebf53feacb499846a]
3:
[ 116.306178] __vmap_pages_range_noflush+0x7c4/0x814
[ 116.306183] vmap+0xb4/0x148
[ 116.306187] system_heap_vmap+0x17c/0x254 [system_heap
8d35d4ce35bb30d8a623f0b9863998a2528e4175]
[ 116.306209] dma_buf_vmap_unlocked+0xb8/0x130
[ 116.306212] apu_sysmem_alloc+0x168/0x360 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306468] mdw_mem_alloc+0xd8/0x314 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306591] mdw_mem_pool_chunk_add+0x11c/0x400 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306712] mdw_mem_pool_create+0x190/0x2c8 [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
[ 116.306833] mdw_drv_open+0x21c/0x47c [apusys
8fb33cbce3b858d651b9da26fc370090a67cfb70]
While we may want to encourage more of these drivers to move upstream,
some aspects are beyond our control (they belong to different SoC vendors),
but we can at least contribute upstream ourselves.
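For illustration, the pattern behind the traces above is roughly the
following (a hedged sketch of vendor-style code, not taken from any of the
listed modules): the driver imports a system-heap buffer by fd and vmaps it
so the CPU can fill descriptors before handing the buffer to the device.

/*
 * Hypothetical vendor-driver pattern, for illustration only.
 */
#include <linux/dma-buf.h>
#include <linux/err.h>
#include <linux/iosys-map.h>

static int vendor_map_buffer(int dmabuf_fd, struct dma_buf **dmabuf,
			     struct iosys_map *map)
{
	int ret;

	*dmabuf = dma_buf_get(dmabuf_fd);	/* fd received from userspace */
	if (IS_ERR(*dmabuf))
		return PTR_ERR(*dmabuf);

	ret = dma_buf_vmap_unlocked(*dmabuf, map);
	if (ret)
		dma_buf_put(*dmabuf);

	/*
	 * On success, map->vaddr covers the whole buffer; for system-heap
	 * buffers this is the path that goes through system_heap_do_vmap().
	 */
	return ret;
}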
Best Regards
Barry
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-07 11:29 ` Barry Song
@ 2026-04-22 7:10 ` Christian König
2026-05-01 4:15 ` Barry Song
0 siblings, 1 reply; 8+ messages in thread
From: Christian König @ 2026-04-22 7:10 UTC (permalink / raw)
To: Barry Song
Cc: linux-media, dri-devel, linaro-mm-sig, linux-kernel, Xueyuan Chen,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T . J . Mercier
On 4/7/26 13:29, Barry Song wrote:
> On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
>>
>> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
>>> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
>>>
>>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
>>> with a more efficient nested loop approach.
>>>
>>> Instead of iterating page by page, we now iterate through the scatterlist
>>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
>>> are physically contiguous, we can populate the page array with a in an
>>> inner loop using simple pointer math. This save a lot of time.
>>>
>>> The WARN_ON check is also pulled out of the loop to save branch
>>> instructions.
>>>
>>> Performance results mapping a 2GB buffer on Radxa O6:
>>> - Before: ~1440000 ns
>>> - After: ~232000 ns
>>> (~84% reduction in iteration time, or ~6.2x faster)
>>
>> Well real question is why do you care about the vmap performance?
>>
>> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
>
> I agree that in mainline, dma_buf_vmap is not used very often.
> Here’s what I was able to find:
>
> 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
> ret = dma_buf_vmap(dmabuf, map);
> 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
> <<drm_gem_shmem_vmap_locked>>
> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
> <<etnaviv_gem_prime_vmap_impl>>
> ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
> 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
> ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
> 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
>
> However, in the Android ecosystem, system_heap and similar heaps
> are widely used across camera, NPU, and media drivers. Many of these
> drivers are not in mainline but do use vmap() in real code paths.
Well, out-of-tree drivers are not a justification for making upstream changes.
Apart from a handful of workarounds which need CPU access as a fallback, DMA-buf vmap is only used to provide fbdev emulation.
The vmap interface has already given us quite a headache in the first place, and there are a couple of unresolved problems regarding synchronization and coherency.
If a driver that uses dma_buf_vmap so frequently that its performance matters were pushed upstream, I think there would be push back, and the driver developer would need a very good explanation for why that is necessary.
So for now I have to reject that patch.
Regards,
Christian.
>
> As I can show you some of them from MTK platforms:
>
> 1:
> [ 6.689849] system_heap_vmap+0x17c/0x254 [system_heap
> 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> [ 6.689859] dma_buf_vmap_unlocked+0xb8/0x130
> [ 6.689861] aov_core_init+0x310/0x718 [mtk_aov
> 96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
> [ 6.689873] mtk_aov_probe+0x434/0x5b4 [mtk_aov
> 96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
>
> 2:
> [ 116.181643] __vmap_pages_range_noflush+0x7c4/0x814
> [ 116.181645] vmap+0xb4/0x148
> [ 116.181647] system_heap_vmap+0x17c/0x254 [system_heap
> 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> [ 116.181651] dma_buf_vmap_unlocked+0xb8/0x130
> [ 116.181653] mtk_cam_vb2_vaddr+0xa0/0xfc [mtk_cam_isp8s
> 0cf9be6c773a8f14aab9db9ebf53feacb499846a]
> [ 116.181682] vb2_plane_vaddr+0x5c/0x78
> [ 116.181684] mtk_cam_job_fill_ipi_frame+0xa8c/0x128c [mtk_cam_isp8s
> 0cf9be6c773a8f14aab9db9ebf53feacb499846a]
>
> 3:
> [ 116.306178] __vmap_pages_range_noflush+0x7c4/0x814
> [ 116.306183] vmap+0xb4/0x148
> [ 116.306187] system_heap_vmap+0x17c/0x254 [system_heap
> 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> [ 116.306209] dma_buf_vmap_unlocked+0xb8/0x130
> [ 116.306212] apu_sysmem_alloc+0x168/0x360 [apusys
> 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> [ 116.306468] mdw_mem_alloc+0xd8/0x314 [apusys
> 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> [ 116.306591] mdw_mem_pool_chunk_add+0x11c/0x400 [apusys
> 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> [ 116.306712] mdw_mem_pool_create+0x190/0x2c8 [apusys
> 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> [ 116.306833] mdw_drv_open+0x21c/0x47c [apusys
> 8fb33cbce3b858d651b9da26fc370090a67cfb70]
>
> While we may want to encourage more of these drivers to upstream,
> some aspects are beyond our control (different SoC vendors), but we
> can at least contribute upstream ourselves.
>
> Best Regards
> Barry
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-04-22 7:10 ` Christian König
@ 2026-05-01 4:15 ` Barry Song
2026-05-01 15:54 ` T.J. Mercier
0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-05-01 4:15 UTC (permalink / raw)
To: Christian König
Cc: linux-media, dri-devel, linaro-mm-sig, linux-kernel, Xueyuan Chen,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
T . J . Mercier
On Wed, Apr 22, 2026 at 3:10 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 4/7/26 13:29, Barry Song wrote:
> > On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
> >>
> >> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> >>> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> >>>
> >>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> >>> with a more efficient nested loop approach.
> >>>
> >>> Instead of iterating page by page, we now iterate through the scatterlist
> >>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> >>> are physically contiguous, we can populate the page array with a in an
> >>> inner loop using simple pointer math. This save a lot of time.
> >>>
> >>> The WARN_ON check is also pulled out of the loop to save branch
> >>> instructions.
> >>>
> >>> Performance results mapping a 2GB buffer on Radxa O6:
> >>> - Before: ~1440000 ns
> >>> - After: ~232000 ns
> >>> (~84% reduction in iteration time, or ~6.2x faster)
> >>
> >> Well real question is why do you care about the vmap performance?
> >>
> >> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
> >
> > I agree that in mainline, dma_buf_vmap is not used very often.
> > Here’s what I was able to find:
> >
> > 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
> > ret = dma_buf_vmap(dmabuf, map);
> > 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
> > <<drm_gem_shmem_vmap_locked>>
> > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
> > <<etnaviv_gem_prime_vmap_impl>>
> > ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
> > 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
> > ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
> > 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
> > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> >
> > However, in the Android ecosystem, system_heap and similar heaps
> > are widely used across camera, NPU, and media drivers. Many of these
> > drivers are not in mainline but do use vmap() in real code paths.
>
> Well out of tree drivers are not a justification to make an upstream changes.
>
> Apart from a handful of workarounds which need to CPU access as fallback DMA-buf vmap is only used to provide fb dev emulation.
>
> The vmap interface has already given us quite a headache in the first place and there are a couple of unresolved problems regarding synchronization and coherency.
>
> When a driver would be pushed upstream which makes so frequent use of the dma_buf_vmap function that it matters for the performance I think there would be push back on that and the driver developer would require a very good explanation why that is necessary.
>
> So for now I have to reject that patch.
Well, it doesn’t seem to increase complexity, and the code is quite easy
to understand. It would be great if the community could be more welcoming
to developers who are just getting involved, rather than discouraging them.
In practice, no one except the vendors themselves can control whether the
source code of those kernel modules will be upstreamed, but products
can still benefit from the common kernel.
>
> Regards,
> Christian.
>
> >
> > As I can show you some of them from MTK platforms:
> >
> > 1:
> > [ 6.689849] system_heap_vmap+0x17c/0x254 [system_heap
> > 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> > [ 6.689859] dma_buf_vmap_unlocked+0xb8/0x130
> > [ 6.689861] aov_core_init+0x310/0x718 [mtk_aov
> > 96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
> > [ 6.689873] mtk_aov_probe+0x434/0x5b4 [mtk_aov
> > 96e2e5e9457dcdacce3a7629b0600c5dbeca623b]
> >
> > 2:
> > [ 116.181643] __vmap_pages_range_noflush+0x7c4/0x814
> > [ 116.181645] vmap+0xb4/0x148
> > [ 116.181647] system_heap_vmap+0x17c/0x254 [system_heap
> > 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> > [ 116.181651] dma_buf_vmap_unlocked+0xb8/0x130
> > [ 116.181653] mtk_cam_vb2_vaddr+0xa0/0xfc [mtk_cam_isp8s
> > 0cf9be6c773a8f14aab9db9ebf53feacb499846a]
> > [ 116.181682] vb2_plane_vaddr+0x5c/0x78
> > [ 116.181684] mtk_cam_job_fill_ipi_frame+0xa8c/0x128c [mtk_cam_isp8s
> > 0cf9be6c773a8f14aab9db9ebf53feacb499846a]
> >
> > 3:
> > [ 116.306178] __vmap_pages_range_noflush+0x7c4/0x814
> > [ 116.306183] vmap+0xb4/0x148
> > [ 116.306187] system_heap_vmap+0x17c/0x254 [system_heap
> > 8d35d4ce35bb30d8a623f0b9863998a2528e4175]
> > [ 116.306209] dma_buf_vmap_unlocked+0xb8/0x130
> > [ 116.306212] apu_sysmem_alloc+0x168/0x360 [apusys
> > 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> > [ 116.306468] mdw_mem_alloc+0xd8/0x314 [apusys
> > 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> > [ 116.306591] mdw_mem_pool_chunk_add+0x11c/0x400 [apusys
> > 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> > [ 116.306712] mdw_mem_pool_create+0x190/0x2c8 [apusys
> > 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> > [ 116.306833] mdw_drv_open+0x21c/0x47c [apusys
> > 8fb33cbce3b858d651b9da26fc370090a67cfb70]
> >
> > While we may want to encourage more of these drivers to upstream,
> > some aspects are beyond our control (different SoC vendors), but we
> > can at least contribute upstream ourselves.
> >
Best Regards
Barry
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-05-01 4:15 ` Barry Song
@ 2026-05-01 15:54 ` T.J. Mercier
2026-05-04 7:49 ` Christian König
0 siblings, 1 reply; 8+ messages in thread
From: T.J. Mercier @ 2026-05-01 15:54 UTC (permalink / raw)
To: Barry Song
Cc: Christian König, linux-media, dri-devel, linaro-mm-sig,
linux-kernel, Xueyuan Chen, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz
On Thu, Apr 30, 2026 at 9:15 PM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Apr 22, 2026 at 3:10 PM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > On 4/7/26 13:29, Barry Song wrote:
> > > On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
> > >>
> > >> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> > >>> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> > >>>
> > >>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> > >>> with a more efficient nested loop approach.
> > >>>
> > >>> Instead of iterating page by page, we now iterate through the scatterlist
> > >>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> > >>> are physically contiguous, we can populate the page array with a in an
> > >>> inner loop using simple pointer math. This save a lot of time.
> > >>>
> > >>> The WARN_ON check is also pulled out of the loop to save branch
> > >>> instructions.
> > >>>
> > >>> Performance results mapping a 2GB buffer on Radxa O6:
> > >>> - Before: ~1440000 ns
> > >>> - After: ~232000 ns
> > >>> (~84% reduction in iteration time, or ~6.2x faster)
> > >>
> > >> Well real question is why do you care about the vmap performance?
> > >>
> > >> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
> > >
> > > I agree that in mainline, dma_buf_vmap is not used very often.
> > > Here’s what I was able to find:
> > >
> > > 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
> > > ret = dma_buf_vmap(dmabuf, map);
> > > 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
> > > <<drm_gem_shmem_vmap_locked>>
> > > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
> > > <<etnaviv_gem_prime_vmap_impl>>
> > > ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
> > > 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
> > > ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
> > > 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
> > > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > >
> > > However, in the Android ecosystem, system_heap and similar heaps
> > > are widely used across camera, NPU, and media drivers. Many of these
> > > drivers are not in mainline but do use vmap() in real code paths.
> >
> > Well out of tree drivers are not a justification to make an upstream changes.
> >
> > Apart from a handful of workarounds which need to CPU access as fallback DMA-buf vmap is only used to provide fb dev emulation.
> >
> > The vmap interface has already given us quite a headache in the first place and there are a couple of unresolved problems regarding synchronization and coherency.
> >
> > When a driver would be pushed upstream which makes so frequent use of the dma_buf_vmap function that it matters for the performance I think there would be push back on that and the driver developer would require a very good explanation why that is necessary.
> >
> > So for now I have to reject that patch.
>
> Well, it doesn’t seem to increase complexity, and the code is quite easy
> to understand.
I agree with this. This change introduces basically no downsides for
upstream, even if it primarily benefits a rare use case. Since
dma_buf_vmap is exported for driver use, why not enhance the
performance for all callers?
-T.J.
> It would be great if the community could be more welcoming
> to developers who are just getting involved, rather than discouraging them.
>
> Apparently, no one can control whether the source code of those kernel
> modules will be upstreamed except the vendors themselves, but products
> can still benefit from the common kernel.
>
> Best Regards
> Barry
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-05-01 15:54 ` T.J. Mercier
@ 2026-05-04 7:49 ` Christian König
2026-05-05 14:44 ` T.J. Mercier
0 siblings, 1 reply; 8+ messages in thread
From: Christian König @ 2026-05-04 7:49 UTC (permalink / raw)
To: T.J. Mercier, Barry Song
Cc: linux-media, dri-devel, linaro-mm-sig, linux-kernel, Xueyuan Chen,
Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz
On 5/1/26 17:54, T.J. Mercier wrote:
> On Thu, Apr 30, 2026 at 9:15 PM Barry Song <baohua@kernel.org> wrote:
>>
>> On Wed, Apr 22, 2026 at 3:10 PM Christian König
>> <christian.koenig@amd.com> wrote:
>>>
>>> On 4/7/26 13:29, Barry Song wrote:
>>>> On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
>>>>>
>>>>> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
>>>>>> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
>>>>>>
>>>>>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
>>>>>> with a more efficient nested loop approach.
>>>>>>
>>>>>> Instead of iterating page by page, we now iterate through the scatterlist
>>>>>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
>>>>>> are physically contiguous, we can populate the page array with a in an
>>>>>> inner loop using simple pointer math. This save a lot of time.
>>>>>>
>>>>>> The WARN_ON check is also pulled out of the loop to save branch
>>>>>> instructions.
>>>>>>
>>>>>> Performance results mapping a 2GB buffer on Radxa O6:
>>>>>> - Before: ~1440000 ns
>>>>>> - After: ~232000 ns
>>>>>> (~84% reduction in iteration time, or ~6.2x faster)
>>>>>
>>>>> Well real question is why do you care about the vmap performance?
>>>>>
>>>>> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
>>>>
>>>> I agree that in mainline, dma_buf_vmap is not used very often.
>>>> Here’s what I was able to find:
>>>>
>>>> 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
>>>> ret = dma_buf_vmap(dmabuf, map);
>>>> 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
>>>> <<drm_gem_shmem_vmap_locked>>
>>>> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
>>>> 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
>>>> <<etnaviv_gem_prime_vmap_impl>>
>>>> ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
>>>> 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
>>>> ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
>>>> 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
>>>> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
>>>>
>>>> However, in the Android ecosystem, system_heap and similar heaps
>>>> are widely used across camera, NPU, and media drivers. Many of these
>>>> drivers are not in mainline but do use vmap() in real code paths.
>>>
>>> Well out of tree drivers are not a justification to make an upstream changes.
>>>
>>> Apart from a handful of workarounds which need to CPU access as fallback DMA-buf vmap is only used to provide fb dev emulation.
>>>
>>> The vmap interface has already given us quite a headache in the first place and there are a couple of unresolved problems regarding synchronization and coherency.
>>>
>>> When a driver would be pushed upstream which makes so frequent use of the dma_buf_vmap function that it matters for the performance I think there would be push back on that and the driver developer would require a very good explanation why that is necessary.
>>>
>>> So for now I have to reject that patch.
>>
>> Well, it doesn’t seem to increase complexity, and the code is quite easy
>> to understand.
>
> I agree with this. This change introduces basically no downsides for
> upstream, even if it primarily benefits a rare use case. Since
> dma_buf_vmap is exported for driver use, why not enhance the
> performance for all callers?
Because we essentially want to restrict the vmap interface to only the fbdev emulation use case, and not promote or even expand it.
When this matters performance-wise, the caller is clearly doing something wrong, and by improving the performance we just paper over the issue instead of fixing it.
Regards,
Christian.
>
> -T.J.
>
>> It would be great if the community could be more welcoming
>> to developers who are just getting involved, rather than discouraging them.
>>
>> Apparently, no one can control whether the source code of those kernel
>> modules will be upstreamed except the vendors themselves, but products
>> can still benefit from the common kernel.
>>
>> Best Regards
>> Barry
* Re: [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap
2026-05-04 7:49 ` Christian König
@ 2026-05-05 14:44 ` T.J. Mercier
0 siblings, 0 replies; 8+ messages in thread
From: T.J. Mercier @ 2026-05-05 14:44 UTC (permalink / raw)
To: Christian König
Cc: Barry Song, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
Xueyuan Chen, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz
On Mon, May 4, 2026 at 12:49 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 5/1/26 17:54, T.J. Mercier wrote:
> > On Thu, Apr 30, 2026 at 9:15 PM Barry Song <baohua@kernel.org> wrote:
> >>
> >> On Wed, Apr 22, 2026 at 3:10 PM Christian König
> >> <christian.koenig@amd.com> wrote:
> >>>
> >>> On 4/7/26 13:29, Barry Song wrote:
> >>>> On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig@amd.com> wrote:
> >>>>>
> >>>>> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> >>>>>> From: Xueyuan Chen <Xueyuan.chen21@gmail.com>
> >>>>>>
> >>>>>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> >>>>>> with a more efficient nested loop approach.
> >>>>>>
> >>>>>> Instead of iterating page by page, we now iterate through the scatterlist
> >>>>>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> >>>>>> are physically contiguous, we can populate the page array with a in an
> >>>>>> inner loop using simple pointer math. This save a lot of time.
> >>>>>>
> >>>>>> The WARN_ON check is also pulled out of the loop to save branch
> >>>>>> instructions.
> >>>>>>
> >>>>>> Performance results mapping a 2GB buffer on Radxa O6:
> >>>>>> - Before: ~1440000 ns
> >>>>>> - After: ~232000 ns
> >>>>>> (~84% reduction in iteration time, or ~6.2x faster)
> >>>>>
> >>>>> Well real question is why do you care about the vmap performance?
> >>>>>
> >>>>> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
> >>>>
> >>>> I agree that in mainline, dma_buf_vmap is not used very often.
> >>>> Here’s what I was able to find:
> >>>>
> >>>> 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
> >>>> ret = dma_buf_vmap(dmabuf, map);
> >>>> 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
> >>>> <<drm_gem_shmem_vmap_locked>>
> >>>> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> >>>> 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
> >>>> <<etnaviv_gem_prime_vmap_impl>>
> >>>> ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
> >>>> 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
> >>>> ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
> >>>> 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
> >>>> ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> >>>>
> >>>> However, in the Android ecosystem, system_heap and similar heaps
> >>>> are widely used across camera, NPU, and media drivers. Many of these
> >>>> drivers are not in mainline but do use vmap() in real code paths.
> >>>
> >>> Well out of tree drivers are not a justification to make an upstream changes.
> >>>
> >>> Apart from a handful of workarounds which need to CPU access as fallback DMA-buf vmap is only used to provide fb dev emulation.
> >>>
> >>> The vmap interface has already given us quite a headache in the first place and there are a couple of unresolved problems regarding synchronization and coherency.
> >>>
> >>> When a driver would be pushed upstream which makes so frequent use of the dma_buf_vmap function that it matters for the performance I think there would be push back on that and the driver developer would require a very good explanation why that is necessary.
> >>>
> >>> So for now I have to reject that patch.
> >>
> >> Well, it doesn’t seem to increase complexity, and the code is quite easy
> >> to understand.
> >
> > I agree with this. This change introduces basically no downsides for
> > upstream, even if it primarily benefits a rare use case. Since
> > dma_buf_vmap is exported for driver use, why not enhance the
> > performance for all callers?
>
> Because we essentially want to restrict the vmap interface to only the fb dev emulation use case and not promote or even expand it.
>
> When this matters performance wise the caller is clearly doing something wrong and by improving the performance we just paper over the issue instead of fixing it.
Ack, I understand your position.
> Regards,
> Christian.
>
> >
> > -T.J.
> >
> >> It would be great if the community could be more welcoming
> >> to developers who are just getting involved, rather than discouraging them.
> >>
> >> Apparently, no one can control whether the source code of those kernel
> >> modules will be upstreamed except the vendors themselves, but products
> >> can still benefit from the common kernel.
> >>
> >> Best Regards
> >> Barry
>
Thread overview: 8+ messages
2026-04-06 21:49 [PATCH] dma-buf: system_heap: Optimize sg_table-to-pages conversion in vmap Barry Song (Xiaomi)
2026-04-07 7:57 ` Christian König
2026-04-07 11:29 ` Barry Song
2026-04-22 7:10 ` Christian König
2026-05-01 4:15 ` Barry Song
2026-05-01 15:54 ` T.J. Mercier
2026-05-04 7:49 ` Christian König
2026-05-05 14:44 ` T.J. Mercier