* [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 12:20 ` Robin Murphy
2025-12-19 5:36 ` [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper Barry Song
` (6 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_by_myline_op ensures completion of the data cache operations for a
region, while dcache_by_myline_op_nosync only issues them without waiting.
This enables deferred synchronization so completion for multiple regions
can be handled together later.
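At the C level, the pattern the rest of this series builds on top of this macro
looks roughly like the sketch below (illustration only, using the _nosync
helpers added by later patches in the series; region[] and nr_regions are
made-up names for the example):

	/*
	 * Issue maintenance for each region without waiting for it to
	 * complete, then wait once for all of them.
	 */
	for (i = 0; i < nr_regions; i++)
		dcache_clean_poc_nosync(region[i].start, region[i].end);
	dsb(sy);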
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/assembler.h | 79 ++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index f0ca7196f6fa..7d84a9ca7880 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -366,22 +366,7 @@ alternative_else
alternative_endif
.endm
-/*
- * Macro to perform a data cache maintenance for the interval
- * [start, end) with dcache line size explicitly provided.
- *
- * op: operation passed to dc instruction
- * domain: domain used in dsb instruction
- * start: starting virtual address of the region
- * end: end virtual address of the region
- * linesz: dcache line size
- * fixup: optional label to branch to on user fault
- * Corrupts: start, end, tmp
- */
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
- sub \tmp, \linesz, #1
- bic \start, \start, \tmp
-.Ldcache_op\@:
+ .macro __dcache_op_line op, start
.ifc \op, cvau
__dcache_op_workaround_clean_cache \op, \start
.else
@@ -399,14 +384,54 @@ alternative_endif
.endif
.endif
.endif
- add \start, \start, \linesz
- cmp \start, \end
- b.lo .Ldcache_op\@
- dsb \domain
+ .endm
+
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) with dcache line size explicitly provided.
+ *
+ * op: operation passed to dc instruction
+ * domain: domain used in dsb instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * linesz: dcache line size
+ * fixup: optional label to branch to on user fault
+ * Corrupts: start, end, tmp
+ */
+ .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ sub \tmp, \linesz, #1
+ bic \start, \start, \tmp
+.Ldcache_op\@:
+ __dcache_op_line \op, \start
+ add \start, \start, \linesz
+ cmp \start, \end
+ b.lo .Ldcache_op\@
+ dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) with dcache line size explicitly provided.
+ * It won't wait for the completion of the dc operation.
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * linesz: dcache line size
+ * Corrupts: start, end, tmp
+ */
+ .macro dcache_by_myline_op_nosync op, start, end, linesz, tmp
+ sub \tmp, \linesz, #1
+ bic \start, \start, \tmp
+.Ldcache_op\@:
+ __dcache_op_line \op, \start
+ add \start, \start, \linesz
+ cmp \start, \end
+ b.lo .Ldcache_op\@
+ .endm
+
/*
* Macro to perform a data cache maintenance for the interval
* [start, end)
@@ -423,6 +448,20 @@ alternative_endif
dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
.endm
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end). It won't wait for the dc operation to complete.
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * Corrupts: start, end, tmp1, tmp2
+ */
+ .macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2
+ dcache_line_size \tmp1, \tmp2
+ dcache_by_myline_op_nosync \op, \start, \end, \tmp1, \tmp2
+ .endm
+
/*
* Macro to perform an instruction cache maintenance for the interval
* [start, end)
--
2.39.3 (Apple Git-146)
* Re: [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2025-12-19 12:20 ` Robin Murphy
2025-12-21 7:22 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2025-12-19 12:20 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
On 2025-12-19 5:36 am, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> dcache_by_myline_op ensures completion of the data cache operations for a
> region, while dcache_by_myline_op_nosync only issues them without waiting.
> This enables deferred synchronization so completion for multiple regions
> can be handled together later.
This is a super-low-level internal macro with only two users... Frankly I'd
just do as below.
Thanks,
Robin.
----->8-----
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index f0ca7196f6fa..26e983c331c5 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -367,18 +367,17 @@ alternative_endif
.endm
/*
- * Macro to perform a data cache maintenance for the interval
- * [start, end) with dcache line size explicitly provided.
+ * Main loop for a data cache maintenance operation. Caller to provide the
+ * dcache line size and take care of relevant synchronisation afterwards.
*
* op: operation passed to dc instruction
- * domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* linesz: dcache line size
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp
*/
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ .macro raw_dcache_by_line_op op, start, end, linesz, tmp, fixup
sub \tmp, \linesz, #1
bic \start, \start, \tmp
.Ldcache_op\@:
@@ -402,7 +401,6 @@ alternative_endif
add \start, \start, \linesz
cmp \start, \end
b.lo .Ldcache_op\@
- dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
@@ -420,7 +418,8 @@ alternative_endif
*/
.macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
dcache_line_size \tmp1, \tmp2
- dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
+ raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
+ dsb \domain
.endm
/*
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac6..efdb6884058e 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x19, x13
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
add x1, x19, #PAGE_SIZE
- dcache_by_myline_op civac, sy, x19, x1, x15, x20
+ raw_dcache_by_line_op civac, x19, x1, x15, x20
+ dsb sy
b .Lnext
.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
* Re: [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 12:20 ` Robin Murphy
@ 2025-12-21 7:22 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 7:22 UTC (permalink / raw)
To: Robin Murphy
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual,
catalin.marinas, linux-kernel, surenb, iommu, maz, will, ardb,
linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 8:20 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2025-12-19 5:36 am, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > dcache_by_myline_op ensures completion of the data cache operations for a
> > region, while dcache_by_myline_op_nosync only issues them without waiting.
> > This enables deferred synchronization so completion for multiple regions
> > can be handled together later.
>
> This is a super-low-level internal macro with only two users... Frankly I'd
> just do as below.
>
> Thanks,
> Robin.
>
> ----->8-----
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index f0ca7196f6fa..26e983c331c5 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -367,18 +367,17 @@ alternative_endif
> .endm
>
> /*
> - * Macro to perform a data cache maintenance for the interval
> - * [start, end) with dcache line size explicitly provided.
> + * Main loop for a data cache maintenance operation. Caller to provide the
> + * dcache line size and take care of relevant synchronisation afterwards.
> *
> * op: operation passed to dc instruction
> - * domain: domain used in dsb instruction
> * start: starting virtual address of the region
> * end: end virtual address of the region
> * linesz: dcache line size
> * fixup: optional label to branch to on user fault
> * Corrupts: start, end, tmp
> */
> - .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
> + .macro raw_dcache_by_line_op op, start, end, linesz, tmp, fixup
> sub \tmp, \linesz, #1
> bic \start, \start, \tmp
> .Ldcache_op\@:
> @@ -402,7 +401,6 @@ alternative_endif
> add \start, \start, \linesz
> cmp \start, \end
> b.lo .Ldcache_op\@
> - dsb \domain
>
> _cond_uaccess_extable .Ldcache_op\@, \fixup
> .endm
> @@ -420,7 +418,8 @@ alternative_endif
> */
> .macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
> dcache_line_size \tmp1, \tmp2
> - dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
> + raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
> + dsb \domain
> .endm
>
> /*
> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> index 413f899e4ac6..efdb6884058e 100644
> --- a/arch/arm64/kernel/relocate_kernel.S
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
> mov x19, x13
> copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
> add x1, x19, #PAGE_SIZE
> - dcache_by_myline_op civac, sy, x19, x1, x15, x20
> + raw_dcache_by_line_op civac, x19, x1, x15, x20
> + dsb sy
> b .Lnext
> .Ltest_indirection:
> tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
>
Thanks, Robin. That's much better!
dcache_by_line_op_nosync could be:
/*
* Macro to perform a data cache maintenance for the interval
* [start, end) without waiting for completion
*
* op: operation passed to dc instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp1, tmp2
*/
.macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2, fixup
dcache_line_size \tmp1, \tmp2
raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
.endm
Thanks
Barry
* [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
` (5 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_clean_poc_nosync does not wait for the data cache clean to
complete. Later, we wait for completion of all scatter-gather entries
together.
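In terms of the intended contract (a sketch, not the actual implementation),
the existing dcache_clean_poc() is equivalent to the new helper followed by a
full barrier:

	dcache_clean_poc_nosync(start, end);	/* dc cvac per line, no dsb */
	dsb(sy);				/* completion, now left to the caller */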
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 28ab96e808ef..9b6d0a62cf3d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
extern long caches_clean_inval_user_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 503567c864fd..4a7c7e03785d 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -178,6 +178,21 @@ SYM_FUNC_START(__pi_dcache_clean_poc)
SYM_FUNC_END(__pi_dcache_clean_poc)
SYM_FUNC_ALIAS(dcache_clean_poc, __pi_dcache_clean_poc)
+/*
+ * dcache_clean_poc_nosync(start, end)
+ *
+ * Issue clean operations for the D-cache lines in the interval [start, end);
+ * not necessarily cleaned to the PoC until an explicit dsb sy afterwards.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_clean_poc_nosync)
+ dcache_by_line_op_nosync cvac, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_clean_poc_nosync)
+SYM_FUNC_ALIAS(dcache_clean_poc_nosync, __pi_dcache_clean_poc_nosync)
+
/*
* dcache_clean_pop(start, end)
*
--
2.39.3 (Apple Git-146)
* [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
2025-12-19 5:36 ` [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 12:34 ` Robin Murphy
2025-12-19 5:36 ` [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers Barry Song
` (4 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_inval_poc_nosync does not wait for the data cache invalidation to
complete. Later, we defer the synchronization so we can wait for all SG
entries together.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 43 +++++++++++++++++++++--------
2 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9b6d0a62cf3d..382b4ac3734d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..8c1043c9b9e5 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro _dcache_inval_poc_impl, do_sync
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.if \do_sync
dsb sy
+.endif
ret
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ _dcache_inval_poc_impl 1
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue the instructions of D-cache lines for the interval [start, end)
+ * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
+ * sy later
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ _dcache_inval_poc_impl 0
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
2.39.3 (Apple Git-146)
* Re: [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2025-12-19 12:34 ` Robin Murphy
2025-12-21 7:59 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2025-12-19 12:34 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
On 2025-12-19 5:36 am, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> dcache_inval_poc_nosync does not wait for the data cache invalidation to
> complete. Later, we defer the synchronization so we can wait for all SG
> entries together.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> arch/arm64/include/asm/cacheflush.h | 1 +
> arch/arm64/mm/cache.S | 43 +++++++++++++++++++++--------
> 2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
> index 9b6d0a62cf3d..382b4ac3734d 100644
> --- a/arch/arm64/include/asm/cacheflush.h
> +++ b/arch/arm64/include/asm/cacheflush.h
> @@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
> extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
> extern void dcache_inval_poc(unsigned long start, unsigned long end);
> extern void dcache_clean_poc(unsigned long start, unsigned long end);
> +extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
> extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
> extern void dcache_clean_pop(unsigned long start, unsigned long end);
> extern void dcache_clean_pou(unsigned long start, unsigned long end);
> diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> index 4a7c7e03785d..8c1043c9b9e5 100644
> --- a/arch/arm64/mm/cache.S
> +++ b/arch/arm64/mm/cache.S
> @@ -132,17 +132,7 @@ alternative_else_nop_endif
> ret
> SYM_FUNC_END(dcache_clean_pou)
>
> -/*
> - * dcache_inval_poc(start, end)
> - *
> - * Ensure that any D-cache lines for the interval [start, end)
> - * are invalidated. Any partial lines at the ends of the interval are
> - * also cleaned to PoC to prevent data loss.
> - *
> - * - start - kernel start address of region
> - * - end - kernel end address of region
> - */
> -SYM_FUNC_START(__pi_dcache_inval_poc)
> +.macro _dcache_inval_poc_impl, do_sync
> dcache_line_size x2, x3
> sub x3, x2, #1
> tst x1, x3 // end cache line aligned?
> @@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
> 3: add x0, x0, x2
> cmp x0, x1
> b.lo 2b
> +.if \do_sync
> dsb sy
> +.endif
Similarly, don't bother with complication like this, just put the DSB in
the one place it needs to be.
Thanks,
Robin.
> ret
> +.endm
> +
> +/*
> + * dcache_inval_poc(start, end)
> + *
> + * Ensure that any D-cache lines for the interval [start, end)
> + * are invalidated. Any partial lines at the ends of the interval are
> + * also cleaned to PoC to prevent data loss.
> + *
> + * - start - kernel start address of region
> + * - end - kernel end address of region
> + */
> +SYM_FUNC_START(__pi_dcache_inval_poc)
> + _dcache_inval_poc_impl 1
> SYM_FUNC_END(__pi_dcache_inval_poc)
> SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
>
> +/*
> + * dcache_inval_poc_nosync(start, end)
> + *
> + * Issue the instructions of D-cache lines for the interval [start, end)
> + * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
> + * sy later
> + *
> + * - start - kernel start address of region
> + * - end - kernel end address of region
> + */
> +SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
> + _dcache_inval_poc_impl 0
> +SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
> +SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
> +
> /*
> * dcache_clean_poc(start, end)
> *
* [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 12:34 ` Robin Murphy
@ 2025-12-21 7:59 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 7:59 UTC (permalink / raw)
To: robin.murphy
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, iommu,
maz, surenb, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 8:50 PM Robin Murphy <robin.murphy@arm.com> wrote:
[...]
> > diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> > index 4a7c7e03785d..8c1043c9b9e5 100644
> > --- a/arch/arm64/mm/cache.S
> > +++ b/arch/arm64/mm/cache.S
> > @@ -132,17 +132,7 @@ alternative_else_nop_endif
> > ret
> > SYM_FUNC_END(dcache_clean_pou)
> >
> > -/*
> > - * dcache_inval_poc(start, end)
> > - *
> > - * Ensure that any D-cache lines for the interval [start, end)
> > - * are invalidated. Any partial lines at the ends of the interval are
> > - * also cleaned to PoC to prevent data loss.
> > - *
> > - * - start - kernel start address of region
> > - * - end - kernel end address of region
> > - */
> > -SYM_FUNC_START(__pi_dcache_inval_poc)
> > +.macro _dcache_inval_poc_impl, do_sync
> > dcache_line_size x2, x3
> > sub x3, x2, #1
> > tst x1, x3 // end cache line aligned?
> > @@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
> > 3: add x0, x0, x2
> > cmp x0, x1
> > b.lo 2b
> > +.if \do_sync
> > dsb sy
> > +.endif
>
> Similarly, don't bother with complication like this, just put the DSB in
> the one place it needs to be.
>
Thanks, Robin — great suggestion. I assume it can be:
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..99a093d3aecb 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro raw_dcache_inval_poc_macro
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,41 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ raw_dcache_inval_poc_macro
dsb sy
ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue the instructions of D-cache lines for the interval [start, end)
+ * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
+ * sy is issued later
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ raw_dcache_inval_poc_macro
+ ret
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
Does it look good to you?
Thanks
Barry
* [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (2 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (3 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
arch_sync_dma_for_device_batch_add() and
arch_sync_dma_for_cpu_batch_add() batch DMA sync operations,
while arch_sync_dma_batch_flush() waits for their completion
as a group.
On architectures that do not support batching,
arch_sync_dma_for_device_batch_add() and
arch_sync_dma_for_cpu_batch_add() fall back to the non-batched
implementations, and arch_sync_dma_batch_flush() is a no-op.
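The intended call pattern is the one patch 5 uses in kernel/dma/direct.c; a
simplified sketch for a non-coherent device (swiotlb and error handling
omitted):

	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device_batch_add(sg_phys(sg), sg->length, dir);
	arch_sync_dma_batch_flush();	/* one dsb(sy) on arm64 covers all entries */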
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/dma-mapping.c | 24 ++++++++++++++++++++++++
include/linux/dma-map-ops.h | 22 ++++++++++++++++++++++
kernel/dma/Kconfig | 3 +++
4 files changed, 50 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 93173f0a09c7..c8adbf21b7bf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -112,6 +112,7 @@ config ARM64
select ARCH_SUPPORTS_SCHED_CLUSTER
select ARCH_SUPPORTS_SCHED_MC
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ select ARCH_WANT_BATCHED_DMA_SYNC
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caa..9ac1ddd1bb9c 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -31,6 +31,30 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
dcache_inval_poc(start, start + size);
}
+void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ dcache_clean_poc_nosync(start, start + size);
+}
+
+void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ if (dir == DMA_TO_DEVICE)
+ return;
+
+ dcache_inval_poc_nosync(start, start + size);
+}
+
+void arch_sync_dma_batch_flush(void)
+{
+ dsb(sy);
+}
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
unsigned long start = (unsigned long)page_address(page);
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4809204c674c..5ee92c410e3c 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -361,6 +361,28 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir);
+void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir);
+void arch_sync_dma_batch_flush(void);
+#else
+static inline void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ arch_sync_dma_for_device(paddr, size, dir);
+}
+static inline void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ arch_sync_dma_for_cpu(paddr, size, dir);
+}
+static inline void arch_sync_dma_batch_flush(void)
+{
+}
+#endif
+
#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
void arch_sync_dma_for_cpu_all(void);
#else
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 31cfdb6b4bc3..2785099b2fa0 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -78,6 +78,9 @@ config ARCH_HAS_DMA_PREP_COHERENT
config ARCH_HAS_FORCE_DMA_UNENCRYPTED
bool
+config ARCH_WANT_BATCHED_DMA_SYNC
+ bool
+
#
# Select this option if the architecture assumes DMA devices are coherent
# by default.
--
2.39.3 (Apple Git-146)
* [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (3 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-20 17:37 ` kernel test robot
` (4 more replies)
2025-12-19 5:36 ` [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink Barry Song
` (2 subsequent siblings)
7 siblings, 5 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
operations when possible. This significantly improves performance on
devices without hardware cache coherence.
Tangquan's initial results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
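For reference, a minimal sketch of how such a measurement could be done from a
test module (the actual harness is not part of this series; dev, sgl and nents
are assumed to be set up beforehand):

	u64 t0, total_ns = 0;
	int it, ret;

	for (it = 0; it < 200; it++) {
		t0 = ktime_get_ns();
		ret = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
		total_ns += ktime_get_ns() - t0;
		if (ret)
			dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
	}
	pr_info("avg dma_map_sg: %llu ns\n", total_ns / 200);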
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.c | 28 ++++++++++-----
kernel/dma/direct.h | 86 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 95 insertions(+), 19 deletions(-)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..ed2339b0c5e7 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,10 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
- dir);
+ arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -422,7 +423,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +431,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_batch_flush();
+ }
}
/*
@@ -443,14 +446,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
- dma_direct_unmap_phys(dev, sg->dma_address,
+ } else {
+ need_sync = true;
+ dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
sg_dma_len(sg), dir, attrs);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -460,6 +468,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,7 +480,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
- sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
+ need_sync = true;
+ sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
@@ -491,6 +501,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..a211bab26478 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -64,15 +64,11 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
arch_sync_dma_for_device(paddr, size, dir);
}
-static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+static inline void __dma_direct_sync_single_for_cpu(struct device *dev,
+ phys_addr_t paddr, size_t size, enum dma_data_direction dir)
{
- phys_addr_t paddr = dma_to_phys(dev, addr);
-
- if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu_all();
- }
swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
@@ -80,7 +76,31 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, size);
}
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_sync_single_for_cpu_batch_add(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ phys_addr_t paddr = dma_to_phys(dev, addr);
+
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+
+ __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
+}
+#endif
+
+static inline void dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ phys_addr_t paddr = dma_to_phys(dev, addr);
+
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_for_cpu(paddr, size, dir);
+
+ __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
+}
+
+static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -108,9 +128,6 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
}
- if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
return dma_addr;
err_overflow:
@@ -121,6 +138,53 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ dma_addr_t dma_addr = __dma_direct_map_phys(dev, phys, size, dir, attrs);
+
+ if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+
+ return dma_addr;
+}
+#endif
+
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ dma_addr_t dma_addr = __dma_direct_map_phys(dev, phys, size, dir, attrs);
+
+ if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ arch_sync_dma_for_device(phys, size, dir);
+
+ return dma_addr;
+}
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ phys_addr_t phys;
+
+ if (attrs & DMA_ATTR_MMIO)
+ /* nothing to do: uncached and no swiotlb */
+ return;
+
+ phys = dma_to_phys(dev, addr);
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
+
+ swiotlb_tbl_unmap_single(dev, phys, size, dir,
+ attrs | DMA_ATTR_SKIP_CPU_SYNC);
+}
+#endif
+
static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
--
2.39.3 (Apple Git-146)
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
@ 2025-12-20 17:37 ` kernel test robot
2025-12-21 5:15 ` Barry Song
2025-12-21 11:55 ` Leon Romanovsky
` (3 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: kernel test robot @ 2025-12-20 17:37 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
llvm, linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on next-20251219]
[cannot apply to arm64/for-next/core v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251220/202512201836.f6KX6WMH-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251220/202512201836.f6KX6WMH-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512201836.f6KX6WMH-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^
kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
| ^
>> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^
2 errors generated.
vim +/dma_direct_unmap_phys_batch_add +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-20 17:37 ` kernel test robot
@ 2025-12-21 5:15 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 5:15 UTC (permalink / raw)
To: lkp
Cc: v-songbaohua, zhengtangquan, ryan.roberts, oe-kbuild-all,
anshuman.khandual, will, catalin.marinas, llvm, 21cnbao,
linux-kernel, surenb, iommu, maz, robin.murphy, ardb,
linux-arm-kernel, m.szyprowski
>
> All errors (new ones prefixed by >>):
>
> >> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> 456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
> | ^
> kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
> kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
> 188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
> | ^
> >> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> 484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
> | ^
> 2 errors generated.
>
>
Thanks very much for the report.
Can you please check if the below diff fixes the build issue?
From 5541aa1efa19777e435c9f3cca7cd2c6a490d9f1 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Sun, 21 Dec 2025 13:09:36 +0800
Subject: [PATCH] kernel/dma: Fix build errors for dma_direct_map_phys
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512201836.f6KX6WMH-lkp@intel.com/
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.h | 38 ++++++++++++++++++++++++++------------
1 file changed, 26 insertions(+), 12 deletions(-)
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a211bab26478..bcc398b5aa6b 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -138,8 +138,7 @@ static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
-#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
-static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -147,13 +146,13 @@ static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device_batch_add(phys, size, dir);
+ arch_sync_dma_for_device(phys, size, dir);
return dma_addr;
}
-#endif
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -161,13 +160,20 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
return dma_addr;
}
+#else
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return dma_direct_map_phys(dev, phys, size, dir, attrs);
+}
+#endif
-#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
-static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t phys;
@@ -178,14 +184,14 @@ static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
-#endif
-static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t phys;
@@ -196,9 +202,17 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+#else
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ dma_direct_unmap_phys(dev, addr, size, dir, attrs);
+}
+#endif
+
#endif /* _KERNEL_DMA_DIRECT_H */
--
2.39.3 (Apple Git-146)
Thanks
Barry
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
2025-12-20 17:37 ` kernel test robot
@ 2025-12-21 11:55 ` Leon Romanovsky
2025-12-21 19:24 ` Barry Song
2025-12-21 12:36 ` kernel test robot
` (2 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-21 11:55 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 01:36:57PM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
> dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
> operations when possible. This significantly improves performance on
> devices without hardware cache coherence.
>
> Tangquan's initial results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> kernel/dma/direct.c | 28 ++++++++++-----
> kernel/dma/direct.h | 86 +++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 95 insertions(+), 19 deletions(-)
<...>
> if (!dev_is_dma_coherent(dev))
> - arch_sync_dma_for_device(paddr, sg->length,
> - dir);
> + arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
<...>
> -static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> +#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
> +static inline void dma_direct_sync_single_for_cpu_batch_add(struct device *dev,
> + dma_addr_t addr, size_t size, enum dma_data_direction dir)
> +{
> + phys_addr_t paddr = dma_to_phys(dev, addr);
> +
> + if (!dev_is_dma_coherent(dev))
> + arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
> +
> + __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
> +}
> +#endif
> +
> +static inline void dma_direct_sync_single_for_cpu(struct device *dev,
> + dma_addr_t addr, size_t size, enum dma_data_direction dir)
> +{
> + phys_addr_t paddr = dma_to_phys(dev, addr);
> +
> + if (!dev_is_dma_coherent(dev))
> + arch_sync_dma_for_cpu(paddr, size, dir);
> +
> + __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
> +}
> +
I'm wondering why you don't implement this batch-sync support inside the
arch_sync_dma_*() functions. Doing so would minimize changes to the generic
kernel/dma/* code and reduce the amount of #ifdef-based spaghetti.
Thanks.
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-21 11:55 ` Leon Romanovsky
@ 2025-12-21 19:24 ` Barry Song
2025-12-22 8:49 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-21 19:24 UTC (permalink / raw)
To: leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, surenb,
iommu, maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
[...]
> > +
>
> I'm wondering why you don't implement this batch‑sync support inside the
> arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
>
There are two cases: mapping an sg list and mapping a single
buffer. The former can be batched with
arch_sync_dma_*_batch_add() and flushed via
arch_sync_dma_batch_flush(), while the latter requires all work to
be done inside arch_sync_dma_*(). Therefore,
arch_sync_dma_*() cannot always batch and flush.
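Roughly, the distinction is (sketch only, names as in this series):

	/* single-buffer map: no later flush point, so the helper completes itself */
	arch_sync_dma_for_device(paddr, size, dir);	/* includes the dsb(sy) on arm64 */

	/* sg map: per-entry arch_sync_dma_for_device_batch_add(), then a single
	 * arch_sync_dma_batch_flush() once the whole list has been walked */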
But yes, I can drop the ifdef in this patch. I have rewritten the entire
patch as shown below, and it will be tested today prior to
resending v2. Before I send v2, you are very welcome to comment.
From c03aae12c608b25fc1a84931ce78dbe3ef0f1ebe Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Wed, 29 Oct 2025 10:31:15 +0800
Subject: [PATCH v2 FOR DISCUSSION 5/6] dma-mapping: Allow batched DMA sync operations
This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
operations when possible. This significantly improves performance on
devices without hardware cache coherence.
Tangquan's initial results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.c | 28 +++++++++++++++------
kernel/dma/direct.h | 59 +++++++++++++++++++++++++++++++++++++--------
2 files changed, 69 insertions(+), 18 deletions(-)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..ed2339b0c5e7 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,10 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
- dir);
+ arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -422,7 +423,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +431,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_batch_flush();
+ }
}
/*
@@ -443,14 +446,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
- dma_direct_unmap_phys(dev, sg->dma_address,
+ } else {
+ need_sync = true;
+ dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
sg_dma_len(sg), dir, attrs);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -460,6 +468,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,7 +480,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
- sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
+ need_sync = true;
+ sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
@@ -491,6 +501,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..2e25af887204 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -64,13 +64,16 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
arch_sync_dma_for_device(paddr, size, dir);
}
-static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+static inline void __dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir,
+ bool flush)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+ if (flush)
+ arch_sync_dma_batch_flush();
arch_sync_dma_for_cpu_all();
}
@@ -80,9 +83,15 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, size);
}
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+static inline void dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ __dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);
+}
+
+static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+ unsigned long attrs, bool flush)
{
dma_addr_t dma_addr;
@@ -109,8 +118,11 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+ if (flush)
+ arch_sync_dma_batch_flush();
+ }
return dma_addr;
err_overflow:
@@ -121,8 +133,23 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
-static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
+}
+
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
+}
+
+static inline void __dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs,
+ bool flush)
{
phys_addr_t phys;
@@ -132,9 +159,21 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ __dma_direct_sync_single_for_cpu(dev, addr, size, dir, flush);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
+static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ __dma_direct_unmap_phys(dev, addr, size, dir, attrs, true);
+}
+
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ __dma_direct_unmap_phys(dev, addr, size, dir, attrs, false);
+}
#endif /* _KERNEL_DMA_DIRECT_H */
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-21 19:24 ` Barry Song
@ 2025-12-22 8:49 ` Leon Romanovsky
2025-12-23 0:02 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-22 8:49 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> [...]
> > > +
> >
> > I'm wondering why you don't implement this batch‑sync support inside the
> > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> >
>
> There are two cases: mapping an sg list and mapping a single
> buffer. The former can be batched with
> arch_sync_dma_*_batch_add() and flushed via
> arch_sync_dma_batch_flush(), while the latter requires all work to
> be done inside arch_sync_dma_*(). Therefore,
> arch_sync_dma_*() cannot always batch and flush.
Probably in all cases you can call the _batch_ variant, followed by _flush_,
even when handling a single page. This keeps the code consistent across all
paths. On platforms that do not support _batch_, the _flush_ operation will be
a NOP anyway.
I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
You can also minimize the changes in dma_direct_map_phys() by extending
its signature to indicate whether a flush is needed or not.
dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
static inline dma_addr_t dma_direct_map_phys(...., bool flush)
{
....
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
{
arch_sync_dma_for_device(phys, size, dir);
if (flush)
arch_sync_dma_flush();
}
}
Thanks
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-22 8:49 ` Leon Romanovsky
@ 2025-12-23 0:02 ` Barry Song
2025-12-23 2:36 ` Barry Song
2025-12-23 14:14 ` Leon Romanovsky
0 siblings, 2 replies; 30+ messages in thread
From: Barry Song @ 2025-12-23 0:02 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > [...]
> > > > +
> > >
> > > I'm wondering why you don't implement this batch‑sync support inside the
> > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > >
> >
> > There are two cases: mapping an sg list and mapping a single
> > buffer. The former can be batched with
> > arch_sync_dma_*_batch_add() and flushed via
> > arch_sync_dma_batch_flush(), while the latter requires all work to
> > be done inside arch_sync_dma_*(). Therefore,
> > arch_sync_dma_*() cannot always batch and flush.
>
> Probably in all cases you can call the _batch_ variant, followed by _flush_,
> even when handling a single page. This keeps the code consistent across all
> paths. On platforms that do not support _batch_, the _flush_ operation will be
> a NOP anyway.
We have a lot of code outside kernel/dma that also calls
arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
I guess we don’t want to modify so many things?
For kernel/dma, we have only two "single" callers,
kernel/dma/direct.h and kernel/dma/swiotlb.c, and they look quite
straightforward:
static inline void dma_direct_sync_single_for_device(struct device *dev,
dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
swiotlb_sync_single_for_device(dev, paddr, size, dir);
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_device(paddr, size, dir);
}
I guess moving to arch_sync_dma_for_device_batch + flush
doesn’t really look much better, does it?
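i.e. it would become something like this (untested sketch):

static inline void dma_direct_sync_single_for_device(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr = dma_to_phys(dev, addr);

	swiotlb_sync_single_for_device(dev, paddr, size, dir);

	if (!dev_is_dma_coherent(dev)) {
		/* queue the clean, then immediately wait for it */
		arch_sync_dma_for_device_batch_add(paddr, size, dir);
		arch_sync_dma_batch_flush();
	}
}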
>
> I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
Sure.
>
> You can also minimize changes in dma_direct_map_phys() too, by extending
> it's signature to provide if flush is needed or not.
Yes. I have
static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs, bool flush)
and two wrappers:
static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
}
static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
}
If you prefer exposing "flush" directly in dma_direct_map_phys()
and updating its callers with flush=true, I think that’s fine.
It could be also true for dma_direct_sync_single_for_device().
>
> dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
>
> static inline dma_addr_t dma_direct_map_phys(...., bool flush)
> {
> ....
>
> if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
> !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
> {
> arch_sync_dma_for_device(phys, size, dir);
> if (flush)
> arch_sync_dma_flush();
> }
> }
>
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 0:02 ` Barry Song
@ 2025-12-23 2:36 ` Barry Song
2025-12-23 14:14 ` Leon Romanovsky
1 sibling, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-23 2:36 UTC (permalink / raw)
To: 21cnbao, leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
>
> >
> > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
>
> Sure.
>
> >
> > You can also minimize changes in dma_direct_map_phys() too, by extending
> > it's signature to provide if flush is needed or not.
>
> Yes. I have
>
> static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs, bool flush)
>
> and two wrappers:
> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> }
>
> static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> }
>
> If you prefer exposing "flush" directly in dma_direct_map_phys()
> and updating its callers with flush=true, I think that’s fine.
>
> It could be also true for dma_direct_sync_single_for_device().
Sorry for the typo. I meant dma_direct_sync_single_for_cpu().
With flush passed as an argument, the patch becomes the following.
Please feel free to comment before I send v2.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..5c65d213eb37 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,11 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
+ arch_sync_dma_for_device_batch_add(paddr, sg->length,
dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
}
#endif
@@ -422,7 +424,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +432,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_flush();
+ }
}
/*
@@ -443,14 +447,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
+ } else {
+ need_sync = true;
dma_direct_unmap_phys(dev, sg->dma_address,
- sg_dma_len(sg), dir, attrs);
+ sg_dma_len(sg), dir, attrs, false);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
}
#endif
@@ -460,6 +469,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,8 +481,9 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
+ need_sync = true;
sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
- sg->length, dir, attrs);
+ sg->length, dir, attrs, false);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
goto out_unmap;
@@ -491,6 +502,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..b13eb5bfd051 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -65,12 +65,15 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
}
static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+ dma_addr_t addr, size_t size, enum dma_data_direction dir,
+ bool flush)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+ if (flush)
+ arch_sync_dma_flush();
arch_sync_dma_for_cpu_all();
}
@@ -82,7 +85,7 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+ unsigned long attrs, bool flush)
{
dma_addr_t dma_addr;
@@ -109,8 +112,11 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+ if (flush)
+ arch_sync_dma_flush();
+ }
return dma_addr;
err_overflow:
@@ -122,7 +128,8 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+ size_t size, enum dma_data_direction dir, unsigned long attrs,
+ bool flush)
{
phys_addr_t phys;
@@ -132,9 +139,10 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir, flush);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
#endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 37163eb49f9f..d8cfa56a3cbb 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -166,7 +166,7 @@ dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
if (dma_map_direct(dev, ops) ||
(!is_mmio && arch_dma_map_phys_direct(dev, phys + size)))
- addr = dma_direct_map_phys(dev, phys, size, dir, attrs);
+ addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
else if (use_dma_iommu(dev))
addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
else if (ops->map_phys)
@@ -207,7 +207,7 @@ void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size,
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops) ||
(!is_mmio && arch_dma_unmap_phys_direct(dev, addr + size)))
- dma_direct_unmap_phys(dev, addr, size, dir, attrs);
+ dma_direct_unmap_phys(dev, addr, size, dir, attrs, true);
else if (use_dma_iommu(dev))
iommu_dma_unmap_phys(dev, addr, size, dir, attrs);
else if (ops->unmap_phys)
@@ -373,7 +373,7 @@ void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);
else if (use_dma_iommu(dev))
iommu_dma_sync_single_for_cpu(dev, addr, size, dir);
else if (ops->sync_single_for_cpu)
--
2.43.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 0:02 ` Barry Song
2025-12-23 2:36 ` Barry Song
@ 2025-12-23 14:14 ` Leon Romanovsky
2025-12-24 1:29 ` Barry Song
1 sibling, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-23 14:14 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > [...]
> > > > > +
> > > >
> > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > >
> > >
> > > There are two cases: mapping an sg list and mapping a single
> > > buffer. The former can be batched with
> > > arch_sync_dma_*_batch_add() and flushed via
> > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > be done inside arch_sync_dma_*(). Therefore,
> > > arch_sync_dma_*() cannot always batch and flush.
> >
> > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > even when handling a single page. This keeps the code consistent across all
> > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > a NOP anyway.
>
> We have a lot of code outside kernel/dma that also calls
> arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> I guess we don’t want to modify so many things?
Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
>
> for kernel/dma, we have two "single" callers only:
> kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> straightforward:
>
> static inline void dma_direct_sync_single_for_device(struct device *dev,
> dma_addr_t addr, size_t size, enum dma_data_direction dir)
> {
> phys_addr_t paddr = dma_to_phys(dev, addr);
>
> swiotlb_sync_single_for_device(dev, paddr, size, dir);
>
> if (!dev_is_dma_coherent(dev))
> arch_sync_dma_for_device(paddr, size, dir);
> }
>
> I guess moving to arch_sync_dma_for_device_batch + flush
> doesn’t really look much better, does it?
>
> >
> > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
>
> Sure.
>
> >
> > You can also minimize changes in dma_direct_map_phys() too, by extending
> > it's signature to provide if flush is needed or not.
>
> Yes. I have
>
> static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs, bool flush)
My suggestion is to use it directly, without wrappers.
>
> and two wrappers:
> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> }
>
> static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> }
>
> If you prefer exposing "flush" directly in dma_direct_map_phys()
> and updating its callers with flush=true, I think that’s fine.
Yes
>
> It could be also true for dma_direct_sync_single_for_device().
>
> >
> > dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
> >
> > static inline dma_addr_t dma_direct_map_phys(...., bool flush)
> > {
> > ....
> >
> > if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
> > !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
> > {
> > arch_sync_dma_for_device(phys, size, dir);
> > if (flush)
> > arch_sync_dma_flush();
> > }
> > }
> >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 14:14 ` Leon Romanovsky
@ 2025-12-24 1:29 ` Barry Song
2025-12-24 8:51 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-24 1:29 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Wed, Dec 24, 2025 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> > On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > [...]
> > > > > > +
> > > > >
> > > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > > >
> > > >
> > > > There are two cases: mapping an sg list and mapping a single
> > > > buffer. The former can be batched with
> > > > arch_sync_dma_*_batch_add() and flushed via
> > > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > > be done inside arch_sync_dma_*(). Therefore,
> > > > arch_sync_dma_*() cannot always batch and flush.
> > >
> > > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > > even when handling a single page. This keeps the code consistent across all
> > > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > > a NOP anyway.
> >
> > We have a lot of code outside kernel/dma that also calls
> > arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> > I guess we don’t want to modify so many things?
>
> Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
For arch/arm and arch/mips, those are arch-specific implementations;
xen is an exception:
static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t paddr = xen_dma_to_phys(hwdev, dev_addr);
struct io_tlb_pool *pool;
BUG_ON(dir == DMA_NONE);
if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
arch_sync_dma_for_cpu(paddr, size, dir);
else
xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
}
/* NOTE: We use dev_addr here, not paddr! */
pool = xen_swiotlb_find_pool(hwdev, dev_addr);
if (pool)
__swiotlb_tbl_unmap_single(hwdev, paddr, size, dir,
attrs, pool);
}
>
> >
> > for kernel/dma, we have two "single" callers only:
> > kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> > straightforward:
> >
> > static inline void dma_direct_sync_single_for_device(struct device *dev,
> > dma_addr_t addr, size_t size, enum dma_data_direction dir)
> > {
> > phys_addr_t paddr = dma_to_phys(dev, addr);
> >
> > swiotlb_sync_single_for_device(dev, paddr, size, dir);
> >
> > if (!dev_is_dma_coherent(dev))
> > arch_sync_dma_for_device(paddr, size, dir);
> > }
> >
> > I guess moving to arch_sync_dma_for_device_batch + flush
> > doesn’t really look much better, does it?
> >
> > >
> > > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
> >
> > Sure.
> >
> > >
> > > You can also minimize changes in dma_direct_map_phys() too, by extending
> > > it's signature to provide if flush is needed or not.
> >
> > Yes. I have
> >
> > static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs, bool flush)
>
> My suggestion is to use it directly, without wrappers.
>
> >
> > and two wrappers:
> > static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs)
> > {
> > return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> > }
> >
> > static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs)
> > {
> > return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> > }
> >
> > If you prefer exposing "flush" directly in dma_direct_map_phys()
> > and updating its callers with flush=true, I think that’s fine.
>
> Yes
>
OK. Could you take a look at [1] and see if any further
improvements are needed before I send v2?
[1] https://lore.kernel.org/lkml/20251223023648.31614-1-21cnbao@gmail.com/
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-24 1:29 ` Barry Song
@ 2025-12-24 8:51 ` Leon Romanovsky
2025-12-25 5:45 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-24 8:51 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Wed, Dec 24, 2025 at 02:29:13PM +1300, Barry Song wrote:
> On Wed, Dec 24, 2025 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> > > On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > [...]
> > > > > > > +
> > > > > >
> > > > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > > > >
> > > > >
> > > > > There are two cases: mapping an sg list and mapping a single
> > > > > buffer. The former can be batched with
> > > > > arch_sync_dma_*_batch_add() and flushed via
> > > > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > > > be done inside arch_sync_dma_*(). Therefore,
> > > > > arch_sync_dma_*() cannot always batch and flush.
> > > >
> > > > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > > > even when handling a single page. This keeps the code consistent across all
> > > > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > > > a NOP anyway.
> > >
> > > We have a lot of code outside kernel/dma that also calls
> > > arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> > > I guess we don’t want to modify so many things?
> >
> > Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
>
> for arch/arm, arch/mips, they are arch-specific implementations.
> xen is an exception:
Right, and this is the only location outside of kernel/dma where you need to
invoke arch_sync_dma_flush().
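Something like this should be enough there (sketch only):

	if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
		if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr)))) {
			arch_sync_dma_for_cpu(paddr, size, dir);
			/* wait for the deferred maintenance to complete */
			arch_sync_dma_flush();
		} else {
			xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
		}
	}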
>
> static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr,
> size_t size, enum dma_data_direction dir, unsigned long attrs)
> {
> phys_addr_t paddr = xen_dma_to_phys(hwdev, dev_addr);
> struct io_tlb_pool *pool;
>
> BUG_ON(dir == DMA_NONE);
>
> if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
> arch_sync_dma_for_cpu(paddr, size, dir);
> else
> xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
> }
>
> /* NOTE: We use dev_addr here, not paddr! */
> pool = xen_swiotlb_find_pool(hwdev, dev_addr);
> if (pool)
> __swiotlb_tbl_unmap_single(hwdev, paddr, size, dir,
> attrs, pool);
> }
>
> >
> > >
> > > for kernel/dma, we have two "single" callers only:
> > > kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> > > straightforward:
> > >
> > > static inline void dma_direct_sync_single_for_device(struct device *dev,
> > > dma_addr_t addr, size_t size, enum dma_data_direction dir)
> > > {
> > > phys_addr_t paddr = dma_to_phys(dev, addr);
> > >
> > > swiotlb_sync_single_for_device(dev, paddr, size, dir);
> > >
> > > if (!dev_is_dma_coherent(dev))
> > > arch_sync_dma_for_device(paddr, size, dir);
> > > }
> > >
> > > I guess moving to arch_sync_dma_for_device_batch + flush
> > > doesn’t really look much better, does it?
> > >
> > > >
> > > > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
> > >
> > > Sure.
> > >
> > > >
> > > > You can also minimize changes in dma_direct_map_phys() too, by extending
> > > > it's signature to provide if flush is needed or not.
> > >
> > > Yes. I have
> > >
> > > static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs, bool flush)
> >
> > My suggestion is to use it directly, without wrappers.
> >
> > >
> > > and two wrappers:
> > > static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs)
> > > {
> > > return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> > > }
> > >
> > > static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs)
> > > {
> > > return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> > > }
> > >
> > > If you prefer exposing "flush" directly in dma_direct_map_phys()
> > > and updating its callers with flush=true, I think that’s fine.
> >
> > Yes
> >
>
> OK. Could you take a look at [1] and see if any further
> improvements are needed before I send v2?
Everything looks ok, except these renames:
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
Thanks
>
> [1] https://lore.kernel.org/lkml/20251223023648.31614-1-21cnbao@gmail.com/
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-24 8:51 ` Leon Romanovsky
@ 2025-12-25 5:45 ` Barry Song
2025-12-25 12:36 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-25 5:45 UTC (permalink / raw)
To: leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, surenb,
iommu, maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
> > >
> >
> > OK. Could you take a look at [1] and see if any further
> > improvements are needed before I send v2?
>
> Everything looks ok, except these renames:
> - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
Thanks!
I'm happy to drop the rename, as outlined below. Feedback welcome :-)
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index dd2c8586a725..487fb7c355ed 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -87,6 +87,12 @@ int cache_line_size(void);
#define dma_get_cache_alignment cache_line_size
+static inline void arch_sync_dma_flush(void)
+{
+ dsb(sy);
+}
+#define arch_sync_dma_flush arch_sync_dma_flush
+
/* Compress a u64 MPIDR value into 32 bits. */
static inline u64 arch_compact_of_hwid(u64 id)
{
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caa..ae1ae0280eef 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
{
unsigned long start = (unsigned long)phys_to_virt(paddr);
- dcache_clean_poc(start, start + size);
+ dcache_clean_poc_nosync(start, start + size);
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
@@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
if (dir == DMA_TO_DEVICE)
return;
- dcache_inval_poc(start, start + size);
+ dcache_inval_poc_nosync(start, start + size);
}
void arch_dma_prep_coherent(struct page *page, size_t size)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4809204c674c..e7dd8a63b40e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
+#ifndef arch_sync_dma_flush
+static inline void arch_sync_dma_flush(void)
+{
+}
+#endif
+
#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
void arch_sync_dma_for_cpu_all(void);
#else
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 5:45 ` Barry Song
@ 2025-12-25 12:36 ` Leon Romanovsky
2025-12-25 13:31 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-25 12:36 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > >
> > >
> > > OK. Could you take a look at [1] and see if any further
> > > improvements are needed before I send v2?
> >
> > Everything looks ok, except these renames:
> > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
>
> Thanks!
> I'm happy to drop the rename as outlined below-feedback welcome :-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index dd2c8586a725..487fb7c355ed 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -87,6 +87,12 @@ int cache_line_size(void);
>
> #define dma_get_cache_alignment cache_line_size
>
> +static inline void arch_sync_dma_flush(void)
> +{
> + dsb(sy);
> +}
> +#define arch_sync_dma_flush arch_sync_dma_flush
> +
> /* Compress a u64 MPIDR value into 32 bits. */
> static inline u64 arch_compact_of_hwid(u64 id)
> {
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index b2b5792b2caa..ae1ae0280eef 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> {
> unsigned long start = (unsigned long)phys_to_virt(paddr);
>
> - dcache_clean_poc(start, start + size);
> + dcache_clean_poc_nosync(start, start + size);
> }
>
> void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> if (dir == DMA_TO_DEVICE)
> return;
>
> - dcache_inval_poc(start, start + size);
> + dcache_inval_poc_nosync(start, start + size);
> }
>
> void arch_dma_prep_coherent(struct page *page, size_t size)
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 4809204c674c..e7dd8a63b40e 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> }
> #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
>
> +#ifndef arch_sync_dma_flush
You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
as done in the surrounding code.
Thanks
> +static inline void arch_sync_dma_flush(void)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> void arch_sync_dma_for_cpu_all(void);
> #else
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 12:36 ` Leon Romanovsky
@ 2025-12-25 13:31 ` Barry Song
2025-12-25 13:40 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-25 13:31 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 26, 2025 at 1:36 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > > >
> > > >
> > > > OK. Could you take a look at [1] and see if any further
> > > > improvements are needed before I send v2?
> > >
> > > Everything looks ok, except these renames:
> > > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
> >
> > Thanks!
> > I'm happy to drop the rename as outlined below-feedback welcome :-)
> >
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index dd2c8586a725..487fb7c355ed 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -87,6 +87,12 @@ int cache_line_size(void);
> >
> > #define dma_get_cache_alignment cache_line_size
> >
> > +static inline void arch_sync_dma_flush(void)
> > +{
> > + dsb(sy);
> > +}
> > +#define arch_sync_dma_flush arch_sync_dma_flush
> > +
> > /* Compress a u64 MPIDR value into 32 bits. */
> > static inline u64 arch_compact_of_hwid(u64 id)
> > {
> > diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> > index b2b5792b2caa..ae1ae0280eef 100644
> > --- a/arch/arm64/mm/dma-mapping.c
> > +++ b/arch/arm64/mm/dma-mapping.c
> > @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> > {
> > unsigned long start = (unsigned long)phys_to_virt(paddr);
> >
> > - dcache_clean_poc(start, start + size);
> > + dcache_clean_poc_nosync(start, start + size);
> > }
> >
> > void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > if (dir == DMA_TO_DEVICE)
> > return;
> >
> > - dcache_inval_poc(start, start + size);
> > + dcache_inval_poc_nosync(start, start + size);
> > }
> >
> > void arch_dma_prep_coherent(struct page *page, size_t size)
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 4809204c674c..e7dd8a63b40e 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > }
> > #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
> >
> > +#ifndef arch_sync_dma_flush
>
> You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
> as done in the surrounding code.
I've dropped the new Kconfig option and now rely on whether
arch_sync_dma_flush() is provided by the architecture. If an arch
does not define arch_sync_dma_flush() in its asm/cache.h, a no-op
implementation is used instead.
Do you still prefer keeping a config option to match the surrounding
code style? Note that on arm64, arch_sync_dma_flush() is already a
static inline rather than an extern, so it is not strictly aligned
with the others.
Having both CONFIG_ARCH_HAS_SYNC_DMA_FLUSH and
"#ifndef arch_sync_dma_flush" would be redundant.
Another potential optimization would be to drop these options
entirely and handle this via ifndefs, letting each architecture
define the macros in asm/cache.h instead.
Whether an arch implements arch_sync_dma_for_*() as a static inline or
as an external function makes no difference.
- #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
- void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
-		enum dma_data_direction dir);
- #else
+ #ifndef arch_sync_dma_for_cpu
static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
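On the arch side, an out-of-line implementation would only need something
like this (sketch; "arch/foo" is just a placeholder):

	/* arch/foo/include/asm/cache.h */
	void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
			enum dma_data_direction dir);
	#define arch_sync_dma_for_cpu arch_sync_dma_for_cpu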
>
> Thanks
>
> > +static inline void arch_sync_dma_flush(void)
> > +{
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> > void arch_sync_dma_for_cpu_all(void);
> > #else
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 13:31 ` Barry Song
@ 2025-12-25 13:40 ` Leon Romanovsky
0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-25 13:40 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 26, 2025 at 02:31:42AM +1300, Barry Song wrote:
> On Fri, Dec 26, 2025 at 1:36 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > > > >
> > > > >
> > > > > OK. Could you take a look at [1] and see if any further
> > > > > improvements are needed before I send v2?
> > > >
> > > > Everything looks ok, except these renames:
> > > > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > > > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
> > >
> > > Thanks!
> > > I'm happy to drop the rename as outlined below-feedback welcome :-)
> > >
> > > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > > index dd2c8586a725..487fb7c355ed 100644
> > > --- a/arch/arm64/include/asm/cache.h
> > > +++ b/arch/arm64/include/asm/cache.h
> > > @@ -87,6 +87,12 @@ int cache_line_size(void);
> > >
> > > #define dma_get_cache_alignment cache_line_size
> > >
> > > +static inline void arch_sync_dma_flush(void)
> > > +{
> > > + dsb(sy);
> > > +}
> > > +#define arch_sync_dma_flush arch_sync_dma_flush
> > > +
> > > /* Compress a u64 MPIDR value into 32 bits. */
> > > static inline u64 arch_compact_of_hwid(u64 id)
> > > {
> > > diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> > > index b2b5792b2caa..ae1ae0280eef 100644
> > > --- a/arch/arm64/mm/dma-mapping.c
> > > +++ b/arch/arm64/mm/dma-mapping.c
> > > @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> > > {
> > > unsigned long start = (unsigned long)phys_to_virt(paddr);
> > >
> > > - dcache_clean_poc(start, start + size);
> > > + dcache_clean_poc_nosync(start, start + size);
> > > }
> > >
> > > void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > if (dir == DMA_TO_DEVICE)
> > > return;
> > >
> > > - dcache_inval_poc(start, start + size);
> > > + dcache_inval_poc_nosync(start, start + size);
> > > }
> > >
> > > void arch_dma_prep_coherent(struct page *page, size_t size)
> > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > index 4809204c674c..e7dd8a63b40e 100644
> > > --- a/include/linux/dma-map-ops.h
> > > +++ b/include/linux/dma-map-ops.h
> > > @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > }
> > > #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
> > >
> > > +#ifndef arch_sync_dma_flush
> >
> > You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
> > as done in the surrounding code.
>
> I've dropped the new Kconfig option and now rely on whether
> arch_sync_dma_flush() is provided by the architecture. If an arch
> does not define arch_sync_dma_flush() in its asm/cache.h, a no-op
> implementation is used instead.
I know.
>
> Do you still prefer keeping a config option to match the surrounding
> code style?
I don't have a strong preference here. Go ahead and try your current
version and see how people respond.
> Note that on arm64, arch_sync_dma_flush() is already a
> static inline rather than an extern, so it is not strictly aligned
> with the others.
> Having both CONFIG_ARCH_HAS_SYNC_DMA_FLUSH and
> "#ifndef arch_sync_dma_flush" seems duplicated.
>
> Another potential optimization would be to drop these options
> entirely and handle this via ifndefs, letting each architecture
> define the macros in asm/cache.h instead.
>
> Whether arch implements arch_sync_dma_for_xx() as static inline or
> as external functions makes no difference.
>
> - #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> - void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,-
> enum dma_data_direction dir);
> - #else
> + #ifndef arch_sync_dma_for_cpu
> static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> enum dma_data_direction dir)
> {
> }
> #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
>
> >
> > Thanks
> >
> > > +static inline void arch_sync_dma_flush(void)
> > > +{
> > > +}
> > > +#endif
> > > +
> > > #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> > > void arch_sync_dma_for_cpu_all(void);
> > > #else
> > >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
2025-12-20 17:37 ` kernel test robot
2025-12-21 11:55 ` Leon Romanovsky
@ 2025-12-21 12:36 ` kernel test robot
2025-12-22 12:43 ` kernel test robot
2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-21 12:36 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc1 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251221/202512211320.LaiSSLAc-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512211320.LaiSSLAc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512211320.LaiSSLAc-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/dma/direct.c: In function 'dma_direct_unmap_sg':
>> kernel/dma/direct.c:456:25: error: implicit declaration of function 'dma_direct_unmap_phys_batch_add'; did you mean 'dma_direct_unmap_phys'? [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_unmap_phys
kernel/dma/direct.c: In function 'dma_direct_map_sg':
>> kernel/dma/direct.c:484:43: error: implicit declaration of function 'dma_direct_map_phys_batch_add'; did you mean 'dma_direct_map_phys'? [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_map_phys
vim +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (2 preceding siblings ...)
2025-12-21 12:36 ` kernel test robot
@ 2025-12-22 12:43 ` kernel test robot
2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-22 12:43 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
llvm, linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc2 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: i386-buildonly-randconfig-006-20251222 (https://download.01.org/0day-ci/archive/20251222/202512222029.Dd6Vs1Eg-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512222029.Dd6Vs1Eg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512222029.Dd6Vs1Eg-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^
kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
| ^
>> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^
2 errors generated.
vim +/dma_direct_unmap_phys_batch_add +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
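The errors indicate that dma_direct_map_phys_batch_add() and
dma_direct_unmap_phys_batch_add() are only declared on configs that select
the new batching support, leaving other architectures with implicit
declarations. One conventional way to avoid that is a pair of fallback
wrappers in kernel/dma/direct.h, sketched below; CONFIG_ARCH_HAS_DMA_SYNC_BATCH
is a placeholder name rather than the symbol the series actually uses, and
the argument lists are assumed to mirror the synchronous helpers the
compiler suggests above:

#ifndef CONFIG_ARCH_HAS_DMA_SYNC_BATCH	/* placeholder Kconfig symbol */
static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
		phys_addr_t phys, size_t size, enum dma_data_direction dir,
		unsigned long attrs)
{
	/* No batching support: fall back to the fully synchronous helper. */
	return dma_direct_map_phys(dev, phys, size, dir, attrs);
}

static inline void dma_direct_unmap_phys_batch_add(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir,
		unsigned long attrs)
{
	dma_direct_unmap_phys(dev, addr, size, dir, attrs);
}
#endif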
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (3 preceding siblings ...)
2025-12-22 12:43 ` kernel test robot
@ 2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-22 14:00 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc2 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-randconfig-161-20251222 (https://download.01.org/0day-ci/archive/20251222/202512222137.rpXOEE5p-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512222137.rpXOEE5p-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512222137.rpXOEE5p-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/dma/direct.c: In function 'dma_direct_unmap_sg':
>> kernel/dma/direct.c:456:25: error: implicit declaration of function 'dma_direct_unmap_phys_batch_add'; did you mean 'dma_direct_unmap_phys'? [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_unmap_phys
kernel/dma/direct.c: In function 'dma_direct_map_sg':
>> kernel/dma/direct.c:484:43: error: implicit declaration of function 'dma_direct_map_phys_batch_add'; did you mean 'dma_direct_map_phys'? [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_map_phys
vim +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (4 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 6:04 ` [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
Joerg Roedel, linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
Apply batched DMA synchronization to __dma_iova_link() and
iommu_dma_iova_unlink_range_slow(). For multiple
sync_dma_for_device() and sync_dma_for_cpu() calls, we only
need to wait once for the completion of all sync operations,
rather than waiting for each one individually.
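As a rough caller-visible sketch (illustrative only, assuming the
dma_iova_link()/dma_iova_sync() calling convention; phys0/phys1 and the
lengths are made-up placeholders):

	/* each link only issues its cache maintenance, without waiting */
	dma_iova_link(dev, &state, phys0, 0, len0, DMA_TO_DEVICE, 0);
	dma_iova_link(dev, &state, phys1, len0, len1, DMA_TO_DEVICE, 0);
	/* one arch_sync_dma_batch_flush() in dma_iova_sync() covers both */
	dma_iova_sync(dev, &state, 0, len0 + len1);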
I do not have the hardware to test this, so it is marked as
RFC. I would greatly appreciate it if someone could test it.
Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
drivers/iommu/dma-iommu.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c92088855450..95432bdc364f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1837,7 +1837,7 @@ static int __dma_iova_link(struct device *dev, dma_addr_t addr,
int prot = dma_info_to_prot(dir, coherent, attrs);
if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
prot, GFP_ATOMIC);
@@ -1980,6 +1980,8 @@ int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
dma_addr_t addr = state->addr + offset;
size_t iova_start_pad = iova_offset(iovad, addr);
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return iommu_sync_map(domain, addr - iova_start_pad,
iova_align(iovad, size + iova_start_pad));
}
@@ -1993,6 +1995,8 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
size_t iova_start_pad = iova_offset(iovad, addr);
+ bool need_sync_dma = !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO));
dma_addr_t end = addr + size;
do {
@@ -2007,8 +2011,7 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
len = min_t(size_t,
end - addr, iovad->granule - iova_start_pad);
- if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ if (need_sync_dma)
arch_sync_dma_for_cpu(phys, len, dir);
swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
@@ -2016,6 +2019,9 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
addr += len;
iova_start_pad = 0;
} while (addr < end);
+
+ if (need_sync_dma)
+ arch_sync_dma_batch_flush();
}
static void __iommu_dma_iova_unlink(struct device *dev,
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 0/6] dma-mapping: arm64: support batched cache sync
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (5 preceding siblings ...)
2025-12-19 5:36 ` [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink Barry Song
@ 2025-12-19 6:04 ` Barry Song
2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 6:04 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
For reasons unclear, the cover letter was omitted from the
initial posting, despite Gmail indicating it was sent. This
is a resend. Apologies for the noise.
Many embedded ARM64 SoCs still lack hardware cache coherency support, which
causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
sync APIs perform cache maintenance one entry at a time. After each entry,
the implementation synchronously waits for the corresponding region’s
D-cache operations to complete. On architectures like arm64, efficiency can
be improved by issuing all entries’ operations first and then performing a
single batched wait for completion.
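In rough pseudo-code, for a non-coherent device the map path therefore
changes from "issue + wait per entry" to the pattern below (illustrative
only; the helper names are the ones introduced later in this series):

	for_each_sg(sgl, sg, nents, i)
		/* issue the dc operations for this entry, no dsb yet */
		arch_sync_dma_for_device_batch_add(sg_phys(sg), sg->length, dir);
	/* a single dsb waits for everything issued above */
	arch_sync_dma_batch_flush();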
Tangquan's results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
I also ran this patch set on an RK3588 Rock5B+ board and
observed that millions of DMA sync operations were batched.
Changes since the RFC:
* Dropped lots of #ifdef/#else/#endif, as suggested by Catalin and Marek,
thanks!
* Also added IOVA link/unlink batching, marked as RFC since I lack the
hardware to test it. This was suggested by Marek, thanks!
RFC link:
https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
Barry Song (6):
arm64: Provide dcache_by_myline_op_nosync helper
arm64: Provide dcache_clean_poc_nosync helper
arm64: Provide dcache_inval_poc_nosync helper
arm64: Provide arch_sync_dma_ batched helpers
dma-mapping: Allow batched DMA sync operations if supported by the
arch
dma-iommu: Allow DMA sync batching for IOVA link/unlink
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/assembler.h | 79 +++++++++++++++++++-------
arch/arm64/include/asm/cacheflush.h | 2 +
arch/arm64/mm/cache.S | 58 +++++++++++++++----
arch/arm64/mm/dma-mapping.c | 24 ++++++++
drivers/iommu/dma-iommu.c | 12 +++-
include/linux/dma-map-ops.h | 22 ++++++++
kernel/dma/Kconfig | 3 +
kernel/dma/direct.c | 28 +++++++---
kernel/dma/direct.h | 86 +++++++++++++++++++++++++----
10 files changed, 262 insertions(+), 53 deletions(-)
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/6] dma-mapping: arm64: support batched cache sync
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (6 preceding siblings ...)
2025-12-19 6:04 ` [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
@ 2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 6:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
It is unclear why, but the cover letter was missed in the
initial posting, even though Gmail shows it as sent. I am
resending it here as a reply to check whether it appears on
the mailing list. Apologies for the inconvenience.
On Fri, Dec 19, 2025 at 1:37 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>
> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> sync APIs perform cache maintenance one entry at a time. After each entry,
> the implementation synchronously waits for the corresponding region’s
> D-cache operations to complete. On architectures like arm64, efficiency can
> be improved by issuing all entries’ operations first and then performing a
> single batched wait for completion.
>
> Tangquan's results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> I also ran this patch set on an RK3588 Rock5B+ board and
> observed that millions of DMA sync operations were batched.
>
> Changes since the RFC:
> * Dropped lots of #ifdef/#else/#endif, as suggested by Catalin and Marek,
> thanks!
> * Also added IOVA link/unlink batching, marked as RFC since I lack the
> hardware to test it. This was suggested by Marek, thanks!
>
> RFC link:
> https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
>
> Barry Song (6):
> arm64: Provide dcache_by_myline_op_nosync helper
> arm64: Provide dcache_clean_poc_nosync helper
> arm64: Provide dcache_inval_poc_nosync helper
> arm64: Provide arch_sync_dma_ batched helpers
> dma-mapping: Allow batched DMA sync operations if supported by the
> arch
> dma-iommu: Allow DMA sync batching for IOVA link/unlink
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/assembler.h | 79 +++++++++++++++++++-------
> arch/arm64/include/asm/cacheflush.h | 2 +
> arch/arm64/mm/cache.S | 58 +++++++++++++++----
> arch/arm64/mm/dma-mapping.c | 24 ++++++++
> drivers/iommu/dma-iommu.c | 12 +++-
> include/linux/dma-map-ops.h | 22 ++++++++
> kernel/dma/Kconfig | 3 +
> kernel/dma/direct.c | 28 +++++++---
> kernel/dma/direct.h | 86 +++++++++++++++++++++++++----
> 10 files changed, 262 insertions(+), 53 deletions(-)
>
> --
> 2.39.3 (Apple Git-146)
>
^ permalink raw reply [flat|nested] 30+ messages in thread