* [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 12:20 ` Robin Murphy
2025-12-19 5:36 ` [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper Barry Song
` (6 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_by_myline_op ensures completion of the data cache operations for a
region, while dcache_by_myline_op_nosync only issues them without waiting.
This enables deferred synchronization so completion for multiple regions
can be handled together later.
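At the C level, the pattern the rest of this series builds on top of this macro
looks roughly like the sketch below (illustration only, using the _nosync
helpers added by later patches in the series; region[] and nr_regions are
made-up names for the example):

	/*
	 * Issue maintenance for each region without waiting for it to
	 * complete, then wait once for all of them.
	 */
	for (i = 0; i < nr_regions; i++)
		dcache_clean_poc_nosync(region[i].start, region[i].end);
	dsb(sy);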
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/assembler.h | 79 ++++++++++++++++++++++--------
1 file changed, 59 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index f0ca7196f6fa..7d84a9ca7880 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -366,22 +366,7 @@ alternative_else
alternative_endif
.endm
-/*
- * Macro to perform a data cache maintenance for the interval
- * [start, end) with dcache line size explicitly provided.
- *
- * op: operation passed to dc instruction
- * domain: domain used in dsb instruction
- * start: starting virtual address of the region
- * end: end virtual address of the region
- * linesz: dcache line size
- * fixup: optional label to branch to on user fault
- * Corrupts: start, end, tmp
- */
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
- sub \tmp, \linesz, #1
- bic \start, \start, \tmp
-.Ldcache_op\@:
+ .macro __dcache_op_line op, start
.ifc \op, cvau
__dcache_op_workaround_clean_cache \op, \start
.else
@@ -399,14 +384,54 @@ alternative_endif
.endif
.endif
.endif
- add \start, \start, \linesz
- cmp \start, \end
- b.lo .Ldcache_op\@
- dsb \domain
+ .endm
+
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) with dcache line size explicitly provided.
+ *
+ * op: operation passed to dc instruction
+ * domain: domain used in dsb instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * linesz: dcache line size
+ * fixup: optional label to branch to on user fault
+ * Corrupts: start, end, tmp
+ */
+ .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ sub \tmp, \linesz, #1
+ bic \start, \start, \tmp
+.Ldcache_op\@:
+ __dcache_op_line \op, \start
+ add \start, \start, \linesz
+ cmp \start, \end
+ b.lo .Ldcache_op\@
+ dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) with dcache line size explicitly provided.
+ * It won't wait for the completion of the dc operation.
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * linesz: dcache line size
+ * Corrupts: start, end, tmp
+ */
+ .macro dcache_by_myline_op_nosync op, start, end, linesz, tmp
+ sub \tmp, \linesz, #1
+ bic \start, \start, \tmp
+.Ldcache_op\@:
+ __dcache_op_line \op, \start
+ add \start, \start, \linesz
+ cmp \start, \end
+ b.lo .Ldcache_op\@
+ .endm
+
/*
* Macro to perform a data cache maintenance for the interval
* [start, end)
@@ -423,6 +448,20 @@ alternative_endif
dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
.endm
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end). It won't wait for the dc operation to complete.
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * Corrupts: start, end, tmp1, tmp2
+ */
+ .macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2
+ dcache_line_size \tmp1, \tmp2
+ dcache_by_myline_op_nosync \op, \start, \end, \tmp1, \tmp2
+ .endm
+
/*
* Macro to perform an instruction cache maintenance for the interval
* [start, end)
--
2.39.3 (Apple Git-146)
* Re: [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2025-12-19 12:20 ` Robin Murphy
2025-12-21 7:22 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2025-12-19 12:20 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
On 2025-12-19 5:36 am, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> dcache_by_myline_op ensures completion of the data cache operations for a
> region, while dcache_by_myline_op_nosync only issues them without waiting.
> This enables deferred synchronization so completion for multiple regions
> can be handled together later.
This is a super-low-level internal macro with only two users... Frankly I'd
just do as below.
Thanks,
Robin.
----->8-----
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index f0ca7196f6fa..26e983c331c5 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -367,18 +367,17 @@ alternative_endif
.endm
/*
- * Macro to perform a data cache maintenance for the interval
- * [start, end) with dcache line size explicitly provided.
+ * Main loop for a data cache maintenance operation. Caller to provide the
+ * dcache line size and take care of relevant synchronisation afterwards.
*
* op: operation passed to dc instruction
- * domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* linesz: dcache line size
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp
*/
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ .macro raw_dcache_by_line_op op, start, end, linesz, tmp, fixup
sub \tmp, \linesz, #1
bic \start, \start, \tmp
.Ldcache_op\@:
@@ -402,7 +401,6 @@ alternative_endif
add \start, \start, \linesz
cmp \start, \end
b.lo .Ldcache_op\@
- dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
@@ -420,7 +418,8 @@ alternative_endif
*/
.macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
dcache_line_size \tmp1, \tmp2
- dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
+ raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
+ dsb \domain
.endm
/*
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac6..efdb6884058e 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x19, x13
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
add x1, x19, #PAGE_SIZE
- dcache_by_myline_op civac, sy, x19, x1, x15, x20
+ raw_dcache_by_line_op civac, x19, x1, x15, x20
+ dsb sy
b .Lnext
.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
* Re: [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper
2025-12-19 12:20 ` Robin Murphy
@ 2025-12-21 7:22 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 7:22 UTC (permalink / raw)
To: Robin Murphy
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual,
catalin.marinas, linux-kernel, surenb, iommu, maz, will, ardb,
linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 8:20 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2025-12-19 5:36 am, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > dcache_by_myline_op ensures completion of the data cache operations for a
> > region, while dcache_by_myline_op_nosync only issues them without waiting.
> > This enables deferred synchronization so completion for multiple regions
> > can be handled together later.
>
> This is a super-low-level internal macro with only two users... Frankly I'd
> just do as below.
>
> Thanks,
> Robin.
>
> ----->8-----
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index f0ca7196f6fa..26e983c331c5 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -367,18 +367,17 @@ alternative_endif
> .endm
>
> /*
> - * Macro to perform a data cache maintenance for the interval
> - * [start, end) with dcache line size explicitly provided.
> + * Main loop for a data cache maintenance operation. Caller to provide the
> + * dcache line size and take care of relevant synchronisation afterwards.
> *
> * op: operation passed to dc instruction
> - * domain: domain used in dsb instruction
> * start: starting virtual address of the region
> * end: end virtual address of the region
> * linesz: dcache line size
> * fixup: optional label to branch to on user fault
> * Corrupts: start, end, tmp
> */
> - .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
> + .macro raw_dcache_by_line_op op, start, end, linesz, tmp, fixup
> sub \tmp, \linesz, #1
> bic \start, \start, \tmp
> .Ldcache_op\@:
> @@ -402,7 +401,6 @@ alternative_endif
> add \start, \start, \linesz
> cmp \start, \end
> b.lo .Ldcache_op\@
> - dsb \domain
>
> _cond_uaccess_extable .Ldcache_op\@, \fixup
> .endm
> @@ -420,7 +418,8 @@ alternative_endif
> */
> .macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
> dcache_line_size \tmp1, \tmp2
> - dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
> + raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
> + dsb \domain
> .endm
>
> /*
> diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> index 413f899e4ac6..efdb6884058e 100644
> --- a/arch/arm64/kernel/relocate_kernel.S
> +++ b/arch/arm64/kernel/relocate_kernel.S
> @@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
> mov x19, x13
> copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
> add x1, x19, #PAGE_SIZE
> - dcache_by_myline_op civac, sy, x19, x1, x15, x20
> + raw_dcache_by_line_op civac, x19, x1, x15, x20
> + dsb sy
> b .Lnext
> .Ltest_indirection:
> tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
>
Thanks, Robin. That's much better!
dcache_by_line_op_nosync could be:
/*
* Macro to perform a data cache maintenance for the interval
* [start, end) without waiting for completion
*
* op: operation passed to dc instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp1, tmp2
*/
.macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2, fixup
dcache_line_size \tmp1, \tmp2
raw_dcache_by_line_op \op, \start, \end, \tmp1, \tmp2, \fixup
.endm
Thanks
Barry
* [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
` (5 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_clean_poc_nosync does not wait for the data cache clean to
complete. Later, we wait for completion of all scatter-gather entries
together.
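In terms of the intended contract (a sketch, not the actual implementation),
the existing dcache_clean_poc() is equivalent to the new helper followed by a
full barrier:

	dcache_clean_poc_nosync(start, end);	/* dc cvac per line, no dsb */
	dsb(sy);				/* completion, now left to the caller */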
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 28ab96e808ef..9b6d0a62cf3d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
extern long caches_clean_inval_user_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 503567c864fd..4a7c7e03785d 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -178,6 +178,21 @@ SYM_FUNC_START(__pi_dcache_clean_poc)
SYM_FUNC_END(__pi_dcache_clean_poc)
SYM_FUNC_ALIAS(dcache_clean_poc, __pi_dcache_clean_poc)
+/*
+ * dcache_clean_poc_nosync(start, end)
+ *
+ * Issue clean operations for the D-cache lines in the interval [start, end);
+ * not necessarily cleaned to the PoC until an explicit dsb sy afterwards.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_clean_poc_nosync)
+ dcache_by_line_op_nosync cvac, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_clean_poc_nosync)
+SYM_FUNC_ALIAS(dcache_clean_poc_nosync, __pi_dcache_clean_poc_nosync)
+
/*
* dcache_clean_pop(start, end)
*
--
2.39.3 (Apple Git-146)
* [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 5:36 ` [PATCH 1/6] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
2025-12-19 5:36 ` [PATCH 2/6] arm64: Provide dcache_clean_poc_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 12:34 ` Robin Murphy
2025-12-19 5:36 ` [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers Barry Song
` (4 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
dcache_inval_poc_nosync does not wait for the data cache invalidation to
complete. Later, we defer the synchronization so we can wait for all SG
entries together.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 43 +++++++++++++++++++++--------
2 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9b6d0a62cf3d..382b4ac3734d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..8c1043c9b9e5 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro _dcache_inval_poc_impl, do_sync
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.if \do_sync
dsb sy
+.endif
ret
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ _dcache_inval_poc_impl 1
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue the instructions of D-cache lines for the interval [start, end)
+ * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
+ * sy later
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ _dcache_inval_poc_impl 0
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
2.39.3 (Apple Git-146)
* Re: [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2025-12-19 12:34 ` Robin Murphy
2025-12-21 7:59 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2025-12-19 12:34 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
On 2025-12-19 5:36 am, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> dcache_inval_poc_nosync does not wait for the data cache invalidation to
> complete. Later, we defer the synchronization so we can wait for all SG
> entries together.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> arch/arm64/include/asm/cacheflush.h | 1 +
> arch/arm64/mm/cache.S | 43 +++++++++++++++++++++--------
> 2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
> index 9b6d0a62cf3d..382b4ac3734d 100644
> --- a/arch/arm64/include/asm/cacheflush.h
> +++ b/arch/arm64/include/asm/cacheflush.h
> @@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
> extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
> extern void dcache_inval_poc(unsigned long start, unsigned long end);
> extern void dcache_clean_poc(unsigned long start, unsigned long end);
> +extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
> extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
> extern void dcache_clean_pop(unsigned long start, unsigned long end);
> extern void dcache_clean_pou(unsigned long start, unsigned long end);
> diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> index 4a7c7e03785d..8c1043c9b9e5 100644
> --- a/arch/arm64/mm/cache.S
> +++ b/arch/arm64/mm/cache.S
> @@ -132,17 +132,7 @@ alternative_else_nop_endif
> ret
> SYM_FUNC_END(dcache_clean_pou)
>
> -/*
> - * dcache_inval_poc(start, end)
> - *
> - * Ensure that any D-cache lines for the interval [start, end)
> - * are invalidated. Any partial lines at the ends of the interval are
> - * also cleaned to PoC to prevent data loss.
> - *
> - * - start - kernel start address of region
> - * - end - kernel end address of region
> - */
> -SYM_FUNC_START(__pi_dcache_inval_poc)
> +.macro _dcache_inval_poc_impl, do_sync
> dcache_line_size x2, x3
> sub x3, x2, #1
> tst x1, x3 // end cache line aligned?
> @@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
> 3: add x0, x0, x2
> cmp x0, x1
> b.lo 2b
> +.if \do_sync
> dsb sy
> +.endif
Similarly, don't bother with complication like this, just put the DSB in
the one place it needs to be.
Thanks,
Robin.
> ret
> +.endm
> +
> +/*
> + * dcache_inval_poc(start, end)
> + *
> + * Ensure that any D-cache lines for the interval [start, end)
> + * are invalidated. Any partial lines at the ends of the interval are
> + * also cleaned to PoC to prevent data loss.
> + *
> + * - start - kernel start address of region
> + * - end - kernel end address of region
> + */
> +SYM_FUNC_START(__pi_dcache_inval_poc)
> + _dcache_inval_poc_impl 1
> SYM_FUNC_END(__pi_dcache_inval_poc)
> SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
>
> +/*
> + * dcache_inval_poc_nosync(start, end)
> + *
> + * Issue the instructions of D-cache lines for the interval [start, end)
> + * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
> + * sy later
> + *
> + * - start - kernel start address of region
> + * - end - kernel end address of region
> + */
> +SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
> + _dcache_inval_poc_impl 0
> +SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
> +SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
> +
> /*
> * dcache_clean_poc(start, end)
> *
* [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper
2025-12-19 12:34 ` Robin Murphy
@ 2025-12-21 7:59 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 7:59 UTC (permalink / raw)
To: robin.murphy
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, iommu,
maz, surenb, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 8:50 PM Robin Murphy <robin.murphy@arm.com> wrote:
[...]
> > diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
> > index 4a7c7e03785d..8c1043c9b9e5 100644
> > --- a/arch/arm64/mm/cache.S
> > +++ b/arch/arm64/mm/cache.S
> > @@ -132,17 +132,7 @@ alternative_else_nop_endif
> > ret
> > SYM_FUNC_END(dcache_clean_pou)
> >
> > -/*
> > - * dcache_inval_poc(start, end)
> > - *
> > - * Ensure that any D-cache lines for the interval [start, end)
> > - * are invalidated. Any partial lines at the ends of the interval are
> > - * also cleaned to PoC to prevent data loss.
> > - *
> > - * - start - kernel start address of region
> > - * - end - kernel end address of region
> > - */
> > -SYM_FUNC_START(__pi_dcache_inval_poc)
> > +.macro _dcache_inval_poc_impl, do_sync
> > dcache_line_size x2, x3
> > sub x3, x2, #1
> > tst x1, x3 // end cache line aligned?
> > @@ -158,11 +148,42 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
> > 3: add x0, x0, x2
> > cmp x0, x1
> > b.lo 2b
> > +.if \do_sync
> > dsb sy
> > +.endif
>
> Similarly, don't bother with complication like this, just put the DSB in
> the one place it needs to be.
>
Thanks, Robin — great suggestion. I assume it can be:
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..99a093d3aecb 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro raw_dcache_inval_poc_macro
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,41 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ raw_dcache_inval_poc_macro
dsb sy
ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue the instructions of D-cache lines for the interval [start, end)
+ * for invalidation. Not necessarily cleaned to PoC till an explicit dsb
+ * sy is issued later
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ raw_dcache_inval_poc_macro
+ ret
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
Does it look good to you?
Thanks
Barry
* [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (2 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 3/6] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (3 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
arch_sync_dma_for_device_batch_add() and
arch_sync_dma_for_cpu_batch_add() batch DMA sync operations,
while arch_sync_dma_batch_flush() waits for their completion
as a group.
On architectures that do not support batching,
arch_sync_dma_for_device_batch_add() and
arch_sync_dma_for_cpu_batch_add() fall back to the non-batched
implementations, and arch_sync_dma_batch_flush() is a no-op.
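The intended call pattern is the one patch 5 uses in kernel/dma/direct.c; a
simplified sketch for a non-coherent device (swiotlb and error handling
omitted):

	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device_batch_add(sg_phys(sg), sg->length, dir);
	arch_sync_dma_batch_flush();	/* one dsb(sy) on arm64 covers all entries */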
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
arch/arm64/Kconfig | 1 +
arch/arm64/mm/dma-mapping.c | 24 ++++++++++++++++++++++++
include/linux/dma-map-ops.h | 22 ++++++++++++++++++++++
kernel/dma/Kconfig | 3 +++
4 files changed, 50 insertions(+)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 93173f0a09c7..c8adbf21b7bf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -112,6 +112,7 @@ config ARM64
select ARCH_SUPPORTS_SCHED_CLUSTER
select ARCH_SUPPORTS_SCHED_MC
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+ select ARCH_WANT_BATCHED_DMA_SYNC
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caa..9ac1ddd1bb9c 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -31,6 +31,30 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
dcache_inval_poc(start, start + size);
}
+void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ dcache_clean_poc_nosync(start, start + size);
+}
+
+void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned long start = (unsigned long)phys_to_virt(paddr);
+
+ if (dir == DMA_TO_DEVICE)
+ return;
+
+ dcache_inval_poc_nosync(start, start + size);
+}
+
+void arch_sync_dma_batch_flush(void)
+{
+ dsb(sy);
+}
+
void arch_dma_prep_coherent(struct page *page, size_t size)
{
unsigned long start = (unsigned long)page_address(page);
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4809204c674c..5ee92c410e3c 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -361,6 +361,28 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir);
+void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir);
+void arch_sync_dma_batch_flush(void);
+#else
+static inline void arch_sync_dma_for_device_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ arch_sync_dma_for_device(paddr, size, dir);
+}
+static inline void arch_sync_dma_for_cpu_batch_add(phys_addr_t paddr, size_t size,
+ enum dma_data_direction dir)
+{
+ arch_sync_dma_for_cpu(paddr, size, dir);
+}
+static inline void arch_sync_dma_batch_flush(void)
+{
+}
+#endif
+
#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
void arch_sync_dma_for_cpu_all(void);
#else
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 31cfdb6b4bc3..2785099b2fa0 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -78,6 +78,9 @@ config ARCH_HAS_DMA_PREP_COHERENT
config ARCH_HAS_FORCE_DMA_UNENCRYPTED
bool
+config ARCH_WANT_BATCHED_DMA_SYNC
+ bool
+
#
# Select this option if the architecture assumes DMA devices are coherent
# by default.
--
2.39.3 (Apple Git-146)
* [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (3 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 4/6] arm64: Provide arch_sync_dma_ batched helpers Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-20 17:37 ` kernel test robot
` (4 more replies)
2025-12-19 5:36 ` [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink Barry Song
` (2 subsequent siblings)
7 siblings, 5 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
operations when possible. This significantly improves performance on
devices without hardware cache coherence.
Tangquan's initial results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
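For reference, a minimal sketch of how such a measurement could be done from a
test module (the actual harness is not part of this series; dev, sgl and nents
are assumed to be set up beforehand):

	u64 t0, total_ns = 0;
	int it, ret;

	for (it = 0; it < 200; it++) {
		t0 = ktime_get_ns();
		ret = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
		total_ns += ktime_get_ns() - t0;
		if (ret)
			dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
	}
	pr_info("avg dma_map_sg: %llu ns\n", total_ns / 200);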
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.c | 28 ++++++++++-----
kernel/dma/direct.h | 86 +++++++++++++++++++++++++++++++++++++++------
2 files changed, 95 insertions(+), 19 deletions(-)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..ed2339b0c5e7 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,10 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
- dir);
+ arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -422,7 +423,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +431,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_batch_flush();
+ }
}
/*
@@ -443,14 +446,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
- dma_direct_unmap_phys(dev, sg->dma_address,
+ } else {
+ need_sync = true;
+ dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
sg_dma_len(sg), dir, attrs);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -460,6 +468,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,7 +480,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
- sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
+ need_sync = true;
+ sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
@@ -491,6 +501,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..a211bab26478 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -64,15 +64,11 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
arch_sync_dma_for_device(paddr, size, dir);
}
-static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+static inline void __dma_direct_sync_single_for_cpu(struct device *dev,
+ phys_addr_t paddr, size_t size, enum dma_data_direction dir)
{
- phys_addr_t paddr = dma_to_phys(dev, addr);
-
- if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu_all();
- }
swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
@@ -80,7 +76,31 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, size);
}
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_sync_single_for_cpu_batch_add(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ phys_addr_t paddr = dma_to_phys(dev, addr);
+
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+
+ __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
+}
+#endif
+
+static inline void dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ phys_addr_t paddr = dma_to_phys(dev, addr);
+
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_for_cpu(paddr, size, dir);
+
+ __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
+}
+
+static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -108,9 +128,6 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
}
- if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
return dma_addr;
err_overflow:
@@ -121,6 +138,53 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ dma_addr_t dma_addr = __dma_direct_map_phys(dev, phys, size, dir, attrs);
+
+ if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+
+ return dma_addr;
+}
+#endif
+
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ dma_addr_t dma_addr = __dma_direct_map_phys(dev, phys, size, dir, attrs);
+
+ if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ arch_sync_dma_for_device(phys, size, dir);
+
+ return dma_addr;
+}
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ phys_addr_t phys;
+
+ if (attrs & DMA_ATTR_MMIO)
+ /* nothing to do: uncached and no swiotlb */
+ return;
+
+ phys = dma_to_phys(dev, addr);
+ if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
+
+ swiotlb_tbl_unmap_single(dev, phys, size, dir,
+ attrs | DMA_ATTR_SKIP_CPU_SYNC);
+}
+#endif
+
static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
--
2.39.3 (Apple Git-146)
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
@ 2025-12-20 17:37 ` kernel test robot
2025-12-21 5:15 ` Barry Song
2025-12-21 11:55 ` Leon Romanovsky
` (3 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: kernel test robot @ 2025-12-20 17:37 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
llvm, linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on next-20251219]
[cannot apply to arm64/for-next/core v6.16-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20251220/202512201836.f6KX6WMH-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251220/202512201836.f6KX6WMH-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512201836.f6KX6WMH-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^
kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
| ^
>> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^
2 errors generated.
vim +/dma_direct_unmap_phys_batch_add +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-20 17:37 ` kernel test robot
@ 2025-12-21 5:15 ` Barry Song
0 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-21 5:15 UTC (permalink / raw)
To: lkp
Cc: v-songbaohua, zhengtangquan, ryan.roberts, oe-kbuild-all,
anshuman.khandual, will, catalin.marinas, llvm, 21cnbao,
linux-kernel, surenb, iommu, maz, robin.murphy, ardb,
linux-arm-kernel, m.szyprowski
>
> All errors (new ones prefixed by >>):
>
> >> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> 456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
> | ^
> kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
> kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
> 188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
> | ^
> >> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> 484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
> | ^
> 2 errors generated.
>
>
Thanks very much for the report.
Can you please check if the below diff fixes the build issue?
From 5541aa1efa19777e435c9f3cca7cd2c6a490d9f1 Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Sun, 21 Dec 2025 13:09:36 +0800
Subject: [PATCH] kernel/dma: Fix build errors for dma_direct_map_phys
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512201836.f6KX6WMH-lkp@intel.com/
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.h | 38 ++++++++++++++++++++++++++------------
1 file changed, 26 insertions(+), 12 deletions(-)
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a211bab26478..bcc398b5aa6b 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -138,8 +138,7 @@ static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
-#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
-static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -147,13 +146,13 @@ static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device_batch_add(phys, size, dir);
+ arch_sync_dma_for_device(phys, size, dir);
return dma_addr;
}
-#endif
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
@@ -161,13 +160,20 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
return dma_addr;
}
+#else
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return dma_direct_map_phys(dev, phys, size, dir, attrs);
+}
+#endif
-#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
-static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t phys;
@@ -178,14 +184,14 @@ static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
-#endif
-static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t phys;
@@ -196,9 +202,17 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu_batch_add(dev, addr, size, dir);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+#else
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ dma_direct_unmap_phys(dev, addr, size, dir, attrs);
+}
+#endif
+
#endif /* _KERNEL_DMA_DIRECT_H */
--
2.39.3 (Apple Git-146)
Thanks
Barry
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
2025-12-20 17:37 ` kernel test robot
@ 2025-12-21 11:55 ` Leon Romanovsky
2025-12-21 19:24 ` Barry Song
2025-12-21 12:36 ` kernel test robot
` (2 subsequent siblings)
4 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-21 11:55 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 19, 2025 at 01:36:57PM +0800, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
> dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
> operations when possible. This significantly improves performance on
> devices without hardware cache coherence.
>
> Tangquan's initial results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> kernel/dma/direct.c | 28 ++++++++++-----
> kernel/dma/direct.h | 86 +++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 95 insertions(+), 19 deletions(-)
<...>
> if (!dev_is_dma_coherent(dev))
> - arch_sync_dma_for_device(paddr, sg->length,
> - dir);
> + arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
<...>
> -static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> +#ifdef CONFIG_ARCH_WANT_BATCHED_DMA_SYNC
> +static inline void dma_direct_sync_single_for_cpu_batch_add(struct device *dev,
> + dma_addr_t addr, size_t size, enum dma_data_direction dir)
> +{
> + phys_addr_t paddr = dma_to_phys(dev, addr);
> +
> + if (!dev_is_dma_coherent(dev))
> + arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
> +
> + __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
> +}
> +#endif
> +
> +static inline void dma_direct_sync_single_for_cpu(struct device *dev,
> + dma_addr_t addr, size_t size, enum dma_data_direction dir)
> +{
> + phys_addr_t paddr = dma_to_phys(dev, addr);
> +
> + if (!dev_is_dma_coherent(dev))
> + arch_sync_dma_for_cpu(paddr, size, dir);
> +
> + __dma_direct_sync_single_for_cpu(dev, paddr, size, dir);
> +}
> +
I'm wondering why you don't implement this batch-sync support inside the
arch_sync_dma_*() functions. Doing so would minimize changes to the generic
kernel/dma/* code and reduce the amount of #ifdef-based spaghetti.
Thanks.
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-21 11:55 ` Leon Romanovsky
@ 2025-12-21 19:24 ` Barry Song
2025-12-22 8:49 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-21 19:24 UTC (permalink / raw)
To: leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, surenb,
iommu, maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
[...]
> > +
>
> I'm wondering why you don't implement this batch‑sync support inside the
> arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
>
There are two cases: mapping an sg list and mapping a single
buffer. The former can be batched with
arch_sync_dma_*_batch_add() and flushed via
arch_sync_dma_batch_flush(), while the latter requires all work to
be done inside arch_sync_dma_*(). Therefore,
arch_sync_dma_*() cannot always batch and flush.
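Roughly, the distinction is (sketch only, names as in this series):

	/* single-buffer map: no later flush point, so the helper completes itself */
	arch_sync_dma_for_device(paddr, size, dir);	/* includes the dsb(sy) on arm64 */

	/* sg map: per-entry arch_sync_dma_for_device_batch_add(), then a single
	 * arch_sync_dma_batch_flush() once the whole list has been walked */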
But yes, I can drop the ifdef in this patch. I have rewritten the entire
patch as shown below, and it will be tested today prior to
resending v2. Before I send v2, you are very welcome to comment.
From c03aae12c608b25fc1a84931ce78dbe3ef0f1ebe Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Wed, 29 Oct 2025 10:31:15 +0800
Subject: [PATCH v2 FOR DISCUSSION 5/6] dma-mapping: Allow batched DMA sync operations
This enables dma_direct_sync_sg_for_device, dma_direct_sync_sg_for_cpu,
dma_direct_map_sg, and dma_direct_unmap_sg to use batched DMA sync
operations when possible. This significantly improves performance on
devices without hardware cache coherence.
Tangquan's initial results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
kernel/dma/direct.c | 28 +++++++++++++++------
kernel/dma/direct.h | 59 +++++++++++++++++++++++++++++++++++++--------
2 files changed, 69 insertions(+), 18 deletions(-)
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..ed2339b0c5e7 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,10 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
- dir);
+ arch_sync_dma_for_device_batch_add(paddr, sg->length, dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -422,7 +423,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +431,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_batch_flush();
+ }
}
/*
@@ -443,14 +446,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
- dma_direct_unmap_phys(dev, sg->dma_address,
+ } else {
+ need_sync = true;
+ dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
sg_dma_len(sg), dir, attrs);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
}
#endif
@@ -460,6 +468,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,7 +480,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
- sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
+ need_sync = true;
+ sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
@@ -491,6 +501,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..2e25af887204 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -64,13 +64,16 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
arch_sync_dma_for_device(paddr, size, dir);
}
-static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+static inline void __dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir,
+ bool flush)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+ if (flush)
+ arch_sync_dma_batch_flush();
arch_sync_dma_for_cpu_all();
}
@@ -80,9 +83,15 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, size);
}
-static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+static inline void dma_direct_sync_single_for_cpu(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+ __dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);
+}
+
+static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+ unsigned long attrs, bool flush)
{
dma_addr_t dma_addr;
@@ -109,8 +118,11 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+ if (flush)
+ arch_sync_dma_batch_flush();
+ }
return dma_addr;
err_overflow:
@@ -121,8 +133,23 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
return DMA_MAPPING_ERROR;
}
-static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+static inline dma_addr_t dma_direct_map_phys(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
+}
+
+static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
+}
+
+static inline void __dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs,
+ bool flush)
{
phys_addr_t phys;
@@ -132,9 +159,21 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ __dma_direct_sync_single_for_cpu(dev, addr, size, dir, flush);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
+static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ __dma_direct_unmap_phys(dev, addr, size, dir, attrs, true);
+}
+
+static inline void dma_direct_unmap_phys_batch_add(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ __dma_direct_unmap_phys(dev, addr, size, dir, attrs, false);
+}
#endif /* _KERNEL_DMA_DIRECT_H */
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-21 19:24 ` Barry Song
@ 2025-12-22 8:49 ` Leon Romanovsky
2025-12-23 0:02 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-22 8:49 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> [...]
> > > +
> >
> > I'm wondering why you don't implement this batch‑sync support inside the
> > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> >
>
> There are two cases: mapping an sg list and mapping a single
> buffer. The former can be batched with
> arch_sync_dma_*_batch_add() and flushed via
> arch_sync_dma_batch_flush(), while the latter requires all work to
> be done inside arch_sync_dma_*(). Therefore,
> arch_sync_dma_*() cannot always batch and flush.
Probably in all cases you can call the _batch_ variant, followed by _flush_,
even when handling a single page. This keeps the code consistent across all
paths. On platforms that do not support _batch_, the _flush_ operation will be
a NOP anyway.
I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
You can also minimize the changes in dma_direct_map_phys() by extending
its signature to indicate whether a flush is needed or not.
dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
static inline dma_addr_t dma_direct_map_phys(...., bool flush)
{
....
if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
{
arch_sync_dma_for_device(phys, size, dir);
if (flush)
arch_sync_dma_flush();
}
}
Thanks
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-22 8:49 ` Leon Romanovsky
@ 2025-12-23 0:02 ` Barry Song
2025-12-23 2:36 ` Barry Song
2025-12-23 14:14 ` Leon Romanovsky
0 siblings, 2 replies; 30+ messages in thread
From: Barry Song @ 2025-12-23 0:02 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > [...]
> > > > +
> > >
> > > I'm wondering why you don't implement this batch‑sync support inside the
> > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > >
> >
> > There are two cases: mapping an sg list and mapping a single
> > buffer. The former can be batched with
> > arch_sync_dma_*_batch_add() and flushed via
> > arch_sync_dma_batch_flush(), while the latter requires all work to
> > be done inside arch_sync_dma_*(). Therefore,
> > arch_sync_dma_*() cannot always batch and flush.
>
> Probably in all cases you can call the _batch_ variant, followed by _flush_,
> even when handling a single page. This keeps the code consistent across all
> paths. On platforms that do not support _batch_, the _flush_ operation will be
> a NOP anyway.
We have a lot of code outside kernel/dma that also calls
arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
I guess we don’t want to modify so many things?
For kernel/dma, we have only two "single" callers,
kernel/dma/direct.h and kernel/dma/swiotlb.c, and they look quite
straightforward:
static inline void dma_direct_sync_single_for_device(struct device *dev,
dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
swiotlb_sync_single_for_device(dev, paddr, size, dir);
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_device(paddr, size, dir);
}
I guess moving to arch_sync_dma_for_device_batch + flush
doesn’t really look much better, does it?
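i.e. it would become something like this (untested sketch):

static inline void dma_direct_sync_single_for_device(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir)
{
	phys_addr_t paddr = dma_to_phys(dev, addr);

	swiotlb_sync_single_for_device(dev, paddr, size, dir);

	if (!dev_is_dma_coherent(dev)) {
		/* queue the clean, then immediately wait for it */
		arch_sync_dma_for_device_batch_add(paddr, size, dir);
		arch_sync_dma_batch_flush();
	}
}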
>
> I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
Sure.
>
> You can also minimize changes in dma_direct_map_phys() too, by extending
> it's signature to provide if flush is needed or not.
Yes. I have
static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs, bool flush)
and two wrappers:
static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
}
static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
unsigned long attrs)
{
return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
}
If you prefer exposing "flush" directly in dma_direct_map_phys()
and updating its callers with flush=true, I think that’s fine.
It could be also true for dma_direct_sync_single_for_device().
>
> dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
>
> static inline dma_addr_t dma_direct_map_phys(...., bool flush)
> {
> ....
>
> if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
> !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
> {
> arch_sync_dma_for_device(phys, size, dir);
> if (flush)
> arch_sync_dma_flush();
> }
> }
>
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 0:02 ` Barry Song
@ 2025-12-23 2:36 ` Barry Song
2025-12-23 14:14 ` Leon Romanovsky
1 sibling, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-23 2:36 UTC (permalink / raw)
To: 21cnbao, leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
>
> >
> > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
>
> Sure.
>
> >
> > You can also minimize changes in dma_direct_map_phys() too, by extending
> > it's signature to provide if flush is needed or not.
>
> Yes. I have
>
> static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs, bool flush)
>
> and two wrappers:
> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> }
>
> static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> }
>
> If you prefer exposing "flush" directly in dma_direct_map_phys()
> and updating its callers with flush=true, I think that’s fine.
>
> It could be also true for dma_direct_sync_single_for_device().
Sorry for the typo. I meant dma_direct_sync_single_for_cpu().
With flush passed as an argument, the patch becomes the following.
Please feel free to comment before I send v2.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..5c65d213eb37 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -403,9 +403,11 @@ void dma_direct_sync_sg_for_device(struct device *dev,
swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_device(paddr, sg->length,
+ arch_sync_dma_for_device_batch_add(paddr, sg->length,
dir);
}
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
}
#endif
@@ -422,7 +424,7 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
if (!dev_is_dma_coherent(dev))
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
@@ -430,8 +432,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_dma_mark_clean(paddr, sg->length);
}
- if (!dev_is_dma_coherent(dev))
+ if (!dev_is_dma_coherent(dev)) {
arch_sync_dma_for_cpu_all();
+ arch_sync_dma_flush();
+ }
}
/*
@@ -443,14 +447,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
{
struct scatterlist *sg;
int i;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
- if (sg_dma_is_bus_address(sg))
+ if (sg_dma_is_bus_address(sg)) {
sg_dma_unmark_bus_address(sg);
- else
+ } else {
+ need_sync = true;
dma_direct_unmap_phys(dev, sg->dma_address,
- sg_dma_len(sg), dir, attrs);
+ sg_dma_len(sg), dir, attrs, false);
+ }
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
}
#endif
@@ -460,6 +469,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
struct pci_p2pdma_map_state p2pdma_state = {};
struct scatterlist *sg;
int i, ret;
+ bool need_sync = false;
for_each_sg(sgl, sg, nents, i) {
switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -471,8 +481,9 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
*/
break;
case PCI_P2PDMA_MAP_NONE:
+ need_sync = true;
sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
- sg->length, dir, attrs);
+ sg->length, dir, attrs, false);
if (sg->dma_address == DMA_MAPPING_ERROR) {
ret = -EIO;
goto out_unmap;
@@ -491,6 +502,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
sg_dma_len(sg) = sg->length;
}
+ if (need_sync && !dev_is_dma_coherent(dev))
+ arch_sync_dma_flush();
return nents;
out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..b13eb5bfd051 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -65,12 +65,15 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
}
static inline void dma_direct_sync_single_for_cpu(struct device *dev,
- dma_addr_t addr, size_t size, enum dma_data_direction dir)
+ dma_addr_t addr, size_t size, enum dma_data_direction dir,
+ bool flush)
{
phys_addr_t paddr = dma_to_phys(dev, addr);
if (!dev_is_dma_coherent(dev)) {
- arch_sync_dma_for_cpu(paddr, size, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, size, dir);
+ if (flush)
+ arch_sync_dma_flush();
arch_sync_dma_for_cpu_all();
}
@@ -82,7 +85,7 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
static inline dma_addr_t dma_direct_map_phys(struct device *dev,
phys_addr_t phys, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+ unsigned long attrs, bool flush)
{
dma_addr_t dma_addr;
@@ -109,8 +112,11 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
+ if (flush)
+ arch_sync_dma_flush();
+ }
return dma_addr;
err_overflow:
@@ -122,7 +128,8 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
}
static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+ size_t size, enum dma_data_direction dir, unsigned long attrs,
+ bool flush)
{
phys_addr_t phys;
@@ -132,9 +139,10 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
phys = dma_to_phys(dev, addr);
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir, flush);
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
#endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 37163eb49f9f..d8cfa56a3cbb 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -166,7 +166,7 @@ dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
if (dma_map_direct(dev, ops) ||
(!is_mmio && arch_dma_map_phys_direct(dev, phys + size)))
- addr = dma_direct_map_phys(dev, phys, size, dir, attrs);
+ addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
else if (use_dma_iommu(dev))
addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
else if (ops->map_phys)
@@ -207,7 +207,7 @@ void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size,
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops) ||
(!is_mmio && arch_dma_unmap_phys_direct(dev, addr + size)))
- dma_direct_unmap_phys(dev, addr, size, dir, attrs);
+ dma_direct_unmap_phys(dev, addr, size, dir, attrs, true);
else if (use_dma_iommu(dev))
iommu_dma_unmap_phys(dev, addr, size, dir, attrs);
else if (ops->unmap_phys)
@@ -373,7 +373,7 @@ void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops))
- dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+ dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);
else if (use_dma_iommu(dev))
iommu_dma_sync_single_for_cpu(dev, addr, size, dir);
else if (ops->sync_single_for_cpu)
--
2.43.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 0:02 ` Barry Song
2025-12-23 2:36 ` Barry Song
@ 2025-12-23 14:14 ` Leon Romanovsky
2025-12-24 1:29 ` Barry Song
1 sibling, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-23 14:14 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > [...]
> > > > > +
> > > >
> > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > >
> > >
> > > There are two cases: mapping an sg list and mapping a single
> > > buffer. The former can be batched with
> > > arch_sync_dma_*_batch_add() and flushed via
> > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > be done inside arch_sync_dma_*(). Therefore,
> > > arch_sync_dma_*() cannot always batch and flush.
> >
> > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > even when handling a single page. This keeps the code consistent across all
> > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > a NOP anyway.
>
> We have a lot of code outside kernel/dma that also calls
> arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> I guess we don’t want to modify so many things?
Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
>
> for kernel/dma, we have two "single" callers only:
> kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> straightforward:
>
> static inline void dma_direct_sync_single_for_device(struct device *dev,
> dma_addr_t addr, size_t size, enum dma_data_direction dir)
> {
> phys_addr_t paddr = dma_to_phys(dev, addr);
>
> swiotlb_sync_single_for_device(dev, paddr, size, dir);
>
> if (!dev_is_dma_coherent(dev))
> arch_sync_dma_for_device(paddr, size, dir);
> }
>
> I guess moving to arch_sync_dma_for_device_batch + flush
> doesn’t really look much better, does it?
>
> >
> > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
>
> Sure.
>
> >
> > You can also minimize changes in dma_direct_map_phys() too, by extending
> > it's signature to provide if flush is needed or not.
>
> Yes. I have
>
> static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs, bool flush)
My suggestion is to use it directly, without wrappers.
>
> and two wrappers:
> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> }
>
> static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> phys_addr_t phys, size_t size, enum dma_data_direction dir,
> unsigned long attrs)
> {
> return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> }
>
> If you prefer exposing "flush" directly in dma_direct_map_phys()
> and updating its callers with flush=true, I think that’s fine.
Yes
>
> It could be also true for dma_direct_sync_single_for_device().
>
> >
> > dma_direct_map_phys(....) -> dma_direct_map_phys(...., bool flush):
> >
> > static inline dma_addr_t dma_direct_map_phys(...., bool flush)
> > {
> > ....
> >
> > if (dma_addr != DMA_MAPPING_ERROR && !dev_is_dma_coherent(dev) &&
> > !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
> > {
> > arch_sync_dma_for_device(phys, size, dir);
> > if (flush)
> > arch_sync_dma_flush();
> > }
> > }
> >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-23 14:14 ` Leon Romanovsky
@ 2025-12-24 1:29 ` Barry Song
2025-12-24 8:51 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-24 1:29 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Wed, Dec 24, 2025 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> > On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > [...]
> > > > > > +
> > > > >
> > > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > > >
> > > >
> > > > There are two cases: mapping an sg list and mapping a single
> > > > buffer. The former can be batched with
> > > > arch_sync_dma_*_batch_add() and flushed via
> > > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > > be done inside arch_sync_dma_*(). Therefore,
> > > > arch_sync_dma_*() cannot always batch and flush.
> > >
> > > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > > even when handling a single page. This keeps the code consistent across all
> > > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > > a NOP anyway.
> >
> > We have a lot of code outside kernel/dma that also calls
> > arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> > I guess we don’t want to modify so many things?
>
> Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
For arch/arm and arch/mips, those are arch-specific implementations;
xen is an exception:
static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
phys_addr_t paddr = xen_dma_to_phys(hwdev, dev_addr);
struct io_tlb_pool *pool;
BUG_ON(dir == DMA_NONE);
if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
arch_sync_dma_for_cpu(paddr, size, dir);
else
xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
}
/* NOTE: We use dev_addr here, not paddr! */
pool = xen_swiotlb_find_pool(hwdev, dev_addr);
if (pool)
__swiotlb_tbl_unmap_single(hwdev, paddr, size, dir,
attrs, pool);
}
>
> >
> > for kernel/dma, we have two "single" callers only:
> > kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> > straightforward:
> >
> > static inline void dma_direct_sync_single_for_device(struct device *dev,
> > dma_addr_t addr, size_t size, enum dma_data_direction dir)
> > {
> > phys_addr_t paddr = dma_to_phys(dev, addr);
> >
> > swiotlb_sync_single_for_device(dev, paddr, size, dir);
> >
> > if (!dev_is_dma_coherent(dev))
> > arch_sync_dma_for_device(paddr, size, dir);
> > }
> >
> > I guess moving to arch_sync_dma_for_device_batch + flush
> > doesn’t really look much better, does it?
> >
> > >
> > > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
> >
> > Sure.
> >
> > >
> > > You can also minimize changes in dma_direct_map_phys() too, by extending
> > > it's signature to provide if flush is needed or not.
> >
> > Yes. I have
> >
> > static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs, bool flush)
>
> My suggestion is to use it directly, without wrappers.
>
> >
> > and two wrappers:
> > static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs)
> > {
> > return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> > }
> >
> > static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > unsigned long attrs)
> > {
> > return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> > }
> >
> > If you prefer exposing "flush" directly in dma_direct_map_phys()
> > and updating its callers with flush=true, I think that’s fine.
>
> Yes
>
OK. Could you take a look at [1] and see if any further
improvements are needed before I send v2?
[1] https://lore.kernel.org/lkml/20251223023648.31614-1-21cnbao@gmail.com/
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-24 1:29 ` Barry Song
@ 2025-12-24 8:51 ` Leon Romanovsky
2025-12-25 5:45 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-24 8:51 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Wed, Dec 24, 2025 at 02:29:13PM +1300, Barry Song wrote:
> On Wed, Dec 24, 2025 at 3:14 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Dec 23, 2025 at 01:02:55PM +1300, Barry Song wrote:
> > > On Mon, Dec 22, 2025 at 9:49 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Mon, Dec 22, 2025 at 03:24:58AM +0800, Barry Song wrote:
> > > > > On Sun, Dec 21, 2025 at 7:55 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > [...]
> > > > > > > +
> > > > > >
> > > > > > I'm wondering why you don't implement this batch‑sync support inside the
> > > > > > arch_sync_dma_*() functions. Doing so would minimize changes to the generic
> > > > > > kernel/dma/* code and reduce the amount of #ifdef‑based spaghetti.
> > > > > >
> > > > >
> > > > > There are two cases: mapping an sg list and mapping a single
> > > > > buffer. The former can be batched with
> > > > > arch_sync_dma_*_batch_add() and flushed via
> > > > > arch_sync_dma_batch_flush(), while the latter requires all work to
> > > > > be done inside arch_sync_dma_*(). Therefore,
> > > > > arch_sync_dma_*() cannot always batch and flush.
> > > >
> > > > Probably in all cases you can call the _batch_ variant, followed by _flush_,
> > > > even when handling a single page. This keeps the code consistent across all
> > > > paths. On platforms that do not support _batch_, the _flush_ operation will be
> > > > a NOP anyway.
> > >
> > > We have a lot of code outside kernel/dma that also calls
> > > arch_sync_dma_for_* such as arch/arm, arch/mips, drivers/xen,
> > > I guess we don’t want to modify so many things?
> >
> > Aren't they using internal, arch specific, arch_sync_dma_for_* implementations?
>
> for arch/arm, arch/mips, they are arch-specific implementations.
> xen is an exception:
Right, and this is the only location outside of kernel/dma where you need to
invoke arch_sync_dma_flush().
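Something like this should be enough there (sketch only):

	if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
		if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr)))) {
			arch_sync_dma_for_cpu(paddr, size, dir);
			/* wait for the deferred maintenance to complete */
			arch_sync_dma_flush();
		} else {
			xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
		}
	}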
>
> static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr,
> size_t size, enum dma_data_direction dir, unsigned long attrs)
> {
> phys_addr_t paddr = xen_dma_to_phys(hwdev, dev_addr);
> struct io_tlb_pool *pool;
>
> BUG_ON(dir == DMA_NONE);
>
> if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
> if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
> arch_sync_dma_for_cpu(paddr, size, dir);
> else
> xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
> }
>
> /* NOTE: We use dev_addr here, not paddr! */
> pool = xen_swiotlb_find_pool(hwdev, dev_addr);
> if (pool)
> __swiotlb_tbl_unmap_single(hwdev, paddr, size, dir,
> attrs, pool);
> }
>
> >
> > >
> > > for kernel/dma, we have two "single" callers only:
> > > kernel/dma/direct.h, kernel/dma/swiotlb.c. and they looks quite
> > > straightforward:
> > >
> > > static inline void dma_direct_sync_single_for_device(struct device *dev,
> > > dma_addr_t addr, size_t size, enum dma_data_direction dir)
> > > {
> > > phys_addr_t paddr = dma_to_phys(dev, addr);
> > >
> > > swiotlb_sync_single_for_device(dev, paddr, size, dir);
> > >
> > > if (!dev_is_dma_coherent(dev))
> > > arch_sync_dma_for_device(paddr, size, dir);
> > > }
> > >
> > > I guess moving to arch_sync_dma_for_device_batch + flush
> > > doesn’t really look much better, does it?
> > >
> > > >
> > > > I would also rename arch_sync_dma_batch_flush() to arch_sync_dma_flush().
> > >
> > > Sure.
> > >
> > > >
> > > > You can also minimize changes in dma_direct_map_phys() too, by extending
> > > > it's signature to provide if flush is needed or not.
> > >
> > > Yes. I have
> > >
> > > static inline dma_addr_t __dma_direct_map_phys(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs, bool flush)
> >
> > My suggestion is to use it directly, without wrappers.
> >
> > >
> > > and two wrappers:
> > > static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs)
> > > {
> > > return __dma_direct_map_phys(dev, phys, size, dir, attrs, true);
> > > }
> > >
> > > static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
> > > phys_addr_t phys, size_t size, enum dma_data_direction dir,
> > > unsigned long attrs)
> > > {
> > > return __dma_direct_map_phys(dev, phys, size, dir, attrs, false);
> > > }
> > >
> > > If you prefer exposing "flush" directly in dma_direct_map_phys()
> > > and updating its callers with flush=true, I think that’s fine.
> >
> > Yes
> >
>
> OK. Could you take a look at [1] and see if any further
> improvements are needed before I send v2?
Everything looks ok, except these renames:
- arch_sync_dma_for_cpu(paddr, sg->length, dir);
+ arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
Thanks
>
> [1] https://lore.kernel.org/lkml/20251223023648.31614-1-21cnbao@gmail.com/
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-24 8:51 ` Leon Romanovsky
@ 2025-12-25 5:45 ` Barry Song
2025-12-25 12:36 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-25 5:45 UTC (permalink / raw)
To: leon
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, 21cnbao, linux-kernel, surenb,
iommu, maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
> > >
> >
> > OK. Could you take a look at [1] and see if any further
> > improvements are needed before I send v2?
>
> Everything looks ok, except these renames:
> - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
Thanks!
I'm happy to drop the rename, as outlined below. Feedback welcome :-)
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index dd2c8586a725..487fb7c355ed 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -87,6 +87,12 @@ int cache_line_size(void);
#define dma_get_cache_alignment cache_line_size
+static inline void arch_sync_dma_flush(void)
+{
+ dsb(sy);
+}
+#define arch_sync_dma_flush arch_sync_dma_flush
+
/* Compress a u64 MPIDR value into 32 bits. */
static inline u64 arch_compact_of_hwid(u64 id)
{
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caa..ae1ae0280eef 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
{
unsigned long start = (unsigned long)phys_to_virt(paddr);
- dcache_clean_poc(start, start + size);
+ dcache_clean_poc_nosync(start, start + size);
}
void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
@@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
if (dir == DMA_TO_DEVICE)
return;
- dcache_inval_poc(start, start + size);
+ dcache_inval_poc_nosync(start, start + size);
}
void arch_dma_prep_coherent(struct page *page, size_t size)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4809204c674c..e7dd8a63b40e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
+#ifndef arch_sync_dma_flush
+static inline void arch_sync_dma_flush(void)
+{
+}
+#endif
+
#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
void arch_sync_dma_for_cpu_all(void);
#else
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 5:45 ` Barry Song
@ 2025-12-25 12:36 ` Leon Romanovsky
2025-12-25 13:31 ` Barry Song
0 siblings, 1 reply; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-25 12:36 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > >
> > >
> > > OK. Could you take a look at [1] and see if any further
> > > improvements are needed before I send v2?
> >
> > Everything looks ok, except these renames:
> > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
>
> Thanks!
> I'm happy to drop the rename as outlined below-feedback welcome :-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index dd2c8586a725..487fb7c355ed 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -87,6 +87,12 @@ int cache_line_size(void);
>
> #define dma_get_cache_alignment cache_line_size
>
> +static inline void arch_sync_dma_flush(void)
> +{
> + dsb(sy);
> +}
> +#define arch_sync_dma_flush arch_sync_dma_flush
> +
> /* Compress a u64 MPIDR value into 32 bits. */
> static inline u64 arch_compact_of_hwid(u64 id)
> {
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index b2b5792b2caa..ae1ae0280eef 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> {
> unsigned long start = (unsigned long)phys_to_virt(paddr);
>
> - dcache_clean_poc(start, start + size);
> + dcache_clean_poc_nosync(start, start + size);
> }
>
> void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> if (dir == DMA_TO_DEVICE)
> return;
>
> - dcache_inval_poc(start, start + size);
> + dcache_inval_poc_nosync(start, start + size);
> }
>
> void arch_dma_prep_coherent(struct page *page, size_t size)
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 4809204c674c..e7dd8a63b40e 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> }
> #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
>
> +#ifndef arch_sync_dma_flush
You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
as done in the surrounding code.
Thanks
> +static inline void arch_sync_dma_flush(void)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> void arch_sync_dma_for_cpu_all(void);
> #else
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 12:36 ` Leon Romanovsky
@ 2025-12-25 13:31 ` Barry Song
2025-12-25 13:40 ` Leon Romanovsky
0 siblings, 1 reply; 30+ messages in thread
From: Barry Song @ 2025-12-25 13:31 UTC (permalink / raw)
To: Leon Romanovsky
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 26, 2025 at 1:36 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > > >
> > > >
> > > > OK. Could you take a look at [1] and see if any further
> > > > improvements are needed before I send v2?
> > >
> > > Everything looks ok, except these renames:
> > > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
> >
> > Thanks!
> > I'm happy to drop the rename as outlined below-feedback welcome :-)
> >
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index dd2c8586a725..487fb7c355ed 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -87,6 +87,12 @@ int cache_line_size(void);
> >
> > #define dma_get_cache_alignment cache_line_size
> >
> > +static inline void arch_sync_dma_flush(void)
> > +{
> > + dsb(sy);
> > +}
> > +#define arch_sync_dma_flush arch_sync_dma_flush
> > +
> > /* Compress a u64 MPIDR value into 32 bits. */
> > static inline u64 arch_compact_of_hwid(u64 id)
> > {
> > diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> > index b2b5792b2caa..ae1ae0280eef 100644
> > --- a/arch/arm64/mm/dma-mapping.c
> > +++ b/arch/arm64/mm/dma-mapping.c
> > @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> > {
> > unsigned long start = (unsigned long)phys_to_virt(paddr);
> >
> > - dcache_clean_poc(start, start + size);
> > + dcache_clean_poc_nosync(start, start + size);
> > }
> >
> > void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > if (dir == DMA_TO_DEVICE)
> > return;
> >
> > - dcache_inval_poc(start, start + size);
> > + dcache_inval_poc_nosync(start, start + size);
> > }
> >
> > void arch_dma_prep_coherent(struct page *page, size_t size)
> > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > index 4809204c674c..e7dd8a63b40e 100644
> > --- a/include/linux/dma-map-ops.h
> > +++ b/include/linux/dma-map-ops.h
> > @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > }
> > #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
> >
> > +#ifndef arch_sync_dma_flush
>
> You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
> as done in the surrounding code.
I've dropped the new Kconfig option and now rely on whether
arch_sync_dma_flush() is provided by the architecture. If an arch
does not define arch_sync_dma_flush() in its asm/cache.h, a no-op
implementation is used instead.
Do you still prefer keeping a config option to match the surrounding
code style? Note that on arm64, arch_sync_dma_flush() is already a
static inline rather than an extern, so it is not strictly aligned
with the others.
Having both CONFIG_ARCH_HAS_SYNC_DMA_FLUSH and
"#ifndef arch_sync_dma_flush" would be redundant.
Another potential optimization would be to drop these options
entirely and handle this via ifndefs, letting each architecture
define the macros in asm/cache.h instead.
Whether an arch implements arch_sync_dma_for_*() as a static inline or
as an external function makes no difference.
- #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
- void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
-		enum dma_data_direction dir);
- #else
+ #ifndef arch_sync_dma_for_cpu
static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
{
}
#endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
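On the arch side, an out-of-line implementation would only need something
like this (sketch; "arch/foo" is just a placeholder):

	/* arch/foo/include/asm/cache.h */
	void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
			enum dma_data_direction dir);
	#define arch_sync_dma_for_cpu arch_sync_dma_for_cpu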
>
> Thanks
>
> > +static inline void arch_sync_dma_flush(void)
> > +{
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> > void arch_sync_dma_for_cpu_all(void);
> > #else
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-25 13:31 ` Barry Song
@ 2025-12-25 13:40 ` Leon Romanovsky
0 siblings, 0 replies; 30+ messages in thread
From: Leon Romanovsky @ 2025-12-25 13:40 UTC (permalink / raw)
To: Barry Song
Cc: v-songbaohua, zhengtangquan, ryan.roberts, will,
anshuman.khandual, catalin.marinas, linux-kernel, surenb, iommu,
maz, robin.murphy, ardb, linux-arm-kernel, m.szyprowski
On Fri, Dec 26, 2025 at 02:31:42AM +1300, Barry Song wrote:
> On Fri, Dec 26, 2025 at 1:36 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Thu, Dec 25, 2025 at 06:45:09PM +1300, Barry Song wrote:
> > > > > >
> > > > >
> > > > > OK. Could you take a look at [1] and see if any further
> > > > > improvements are needed before I send v2?
> > > >
> > > > Everything looks ok, except these renames:
> > > > - arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > > > + arch_sync_dma_for_cpu_batch_add(paddr, sg->length, dir);
> > >
> > > Thanks!
> > > I'm happy to drop the rename as outlined below-feedback welcome :-)
> > >
> > > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > > index dd2c8586a725..487fb7c355ed 100644
> > > --- a/arch/arm64/include/asm/cache.h
> > > +++ b/arch/arm64/include/asm/cache.h
> > > @@ -87,6 +87,12 @@ int cache_line_size(void);
> > >
> > > #define dma_get_cache_alignment cache_line_size
> > >
> > > +static inline void arch_sync_dma_flush(void)
> > > +{
> > > + dsb(sy);
> > > +}
> > > +#define arch_sync_dma_flush arch_sync_dma_flush
> > > +
> > > /* Compress a u64 MPIDR value into 32 bits. */
> > > static inline u64 arch_compact_of_hwid(u64 id)
> > > {
> > > diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> > > index b2b5792b2caa..ae1ae0280eef 100644
> > > --- a/arch/arm64/mm/dma-mapping.c
> > > +++ b/arch/arm64/mm/dma-mapping.c
> > > @@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> > > {
> > > unsigned long start = (unsigned long)phys_to_virt(paddr);
> > >
> > > - dcache_clean_poc(start, start + size);
> > > + dcache_clean_poc_nosync(start, start + size);
> > > }
> > >
> > > void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > @@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > if (dir == DMA_TO_DEVICE)
> > > return;
> > >
> > > - dcache_inval_poc(start, start + size);
> > > + dcache_inval_poc_nosync(start, start + size);
> > > }
> > >
> > > void arch_dma_prep_coherent(struct page *page, size_t size)
> > > diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> > > index 4809204c674c..e7dd8a63b40e 100644
> > > --- a/include/linux/dma-map-ops.h
> > > +++ b/include/linux/dma-map-ops.h
> > > @@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> > > }
> > > #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
> > >
> > > +#ifndef arch_sync_dma_flush
> >
> > You likely need to wrap this in "#ifdef CONFIG_ARCH_HAS_SYNC_DMA_FLUSH"
> > as done in the surrounding code.
>
> I've dropped the new Kconfig option and now rely on whether
> arch_sync_dma_flush() is provided by the architecture. If an arch
> does not define arch_sync_dma_flush() in its asm/cache.h, a no-op
> implementation is used instead.
I know.
>
> Do you still prefer keeping a config option to match the surrounding
> code style?
I don't have a strong preference here. Go ahead and try your current
version and see how people respond.
> Note that on arm64, arch_sync_dma_flush() is already a
> static inline rather than an extern, so it is not strictly aligned
> with the others.
> Having both CONFIG_ARCH_HAS_SYNC_DMA_FLUSH and
> "#ifndef arch_sync_dma_flush" seems duplicated.
>
> Another potential optimization would be to drop these options
> entirely and handle this via ifndefs, letting each architecture
> define the macros in asm/cache.h instead.
>
> Whether arch implements arch_sync_dma_for_xx() as static inline or
> as external functions makes no difference.
>
> - #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
> - void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,-
> enum dma_data_direction dir);
> - #else
> + #ifndef arch_sync_dma_for_cpu
> static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> enum dma_data_direction dir)
> {
> }
> #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
>
> >
> > Thanks
> >
> > > +static inline void arch_sync_dma_flush(void)
> > > +{
> > > +}
> > > +#endif
> > > +
> > > #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
> > > void arch_sync_dma_for_cpu_all(void);
> > > #else
> > >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
2025-12-20 17:37 ` kernel test robot
2025-12-21 11:55 ` Leon Romanovsky
@ 2025-12-21 12:36 ` kernel test robot
2025-12-22 12:43 ` kernel test robot
2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-21 12:36 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc1 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20251221/202512211320.LaiSSLAc-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251221/202512211320.LaiSSLAc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512211320.LaiSSLAc-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/dma/direct.c: In function 'dma_direct_unmap_sg':
>> kernel/dma/direct.c:456:25: error: implicit declaration of function 'dma_direct_unmap_phys_batch_add'; did you mean 'dma_direct_unmap_phys'? [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_unmap_phys
kernel/dma/direct.c: In function 'dma_direct_map_sg':
>> kernel/dma/direct.c:484:43: error: implicit declaration of function 'dma_direct_map_phys_batch_add'; did you mean 'dma_direct_map_phys'? [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_map_phys
vim +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (2 preceding siblings ...)
2025-12-21 12:36 ` kernel test robot
@ 2025-12-22 12:43 ` kernel test robot
2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-22 12:43 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
llvm, linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc2 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: i386-buildonly-randconfig-006-20251222 (https://download.01.org/0day-ci/archive/20251222/202512222029.Dd6Vs1Eg-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512222029.Dd6Vs1Eg-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512222029.Dd6Vs1Eg-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/dma/direct.c:456:4: error: call to undeclared function 'dma_direct_unmap_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^
kernel/dma/direct.c:456:4: note: did you mean 'dma_direct_unmap_phys'?
kernel/dma/direct.h:188:20: note: 'dma_direct_unmap_phys' declared here
188 | static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
| ^
>> kernel/dma/direct.c:484:22: error: call to undeclared function 'dma_direct_map_phys_batch_add'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^
2 errors generated.
vim +/dma_direct_unmap_phys_batch_add +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
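The errors indicate that dma_direct_map_phys_batch_add() and
dma_direct_unmap_phys_batch_add() are only declared on configs that select
the new batching support, leaving other architectures with implicit
declarations. One conventional way to avoid that is a pair of fallback
wrappers in kernel/dma/direct.h, sketched below; CONFIG_ARCH_HAS_DMA_SYNC_BATCH
is a placeholder name rather than the symbol the series actually uses, and
the argument lists are assumed to mirror the synchronous helpers the
compiler suggests above:

#ifndef CONFIG_ARCH_HAS_DMA_SYNC_BATCH	/* placeholder Kconfig symbol */
static inline dma_addr_t dma_direct_map_phys_batch_add(struct device *dev,
		phys_addr_t phys, size_t size, enum dma_data_direction dir,
		unsigned long attrs)
{
	/* No batching support: fall back to the fully synchronous helper. */
	return dma_direct_map_phys(dev, phys, size, dir, attrs);
}

static inline void dma_direct_unmap_phys_batch_add(struct device *dev,
		dma_addr_t addr, size_t size, enum dma_data_direction dir,
		unsigned long attrs)
{
	dma_direct_unmap_phys(dev, addr, size, dir, attrs);
}
#endif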
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
` (3 preceding siblings ...)
2025-12-22 12:43 ` kernel test robot
@ 2025-12-22 14:00 ` kernel test robot
4 siblings, 0 replies; 30+ messages in thread
From: kernel test robot @ 2025-12-22 14:00 UTC (permalink / raw)
To: Barry Song, catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, oe-kbuild-all, surenb, ardb,
linux-arm-kernel
Hi Barry,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.19-rc2 next-20251219]
[cannot apply to arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Barry-Song/arm64-Provide-dcache_by_myline_op_nosync-helper/20251219-195810
base: linus/master
patch link: https://lore.kernel.org/r/20251219053658.84978-6-21cnbao%40gmail.com
patch subject: [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch
config: x86_64-randconfig-161-20251222 (https://download.01.org/0day-ci/archive/20251222/202512222137.rpXOEE5p-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251222/202512222137.rpXOEE5p-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512222137.rpXOEE5p-lkp@intel.com/
All errors (new ones prefixed by >>):
kernel/dma/direct.c: In function 'dma_direct_unmap_sg':
>> kernel/dma/direct.c:456:25: error: implicit declaration of function 'dma_direct_unmap_phys_batch_add'; did you mean 'dma_direct_unmap_phys'? [-Wimplicit-function-declaration]
456 | dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_unmap_phys
kernel/dma/direct.c: In function 'dma_direct_map_sg':
>> kernel/dma/direct.c:484:43: error: implicit declaration of function 'dma_direct_map_phys_batch_add'; did you mean 'dma_direct_map_phys'? [-Wimplicit-function-declaration]
484 | sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| dma_direct_map_phys
vim +456 kernel/dma/direct.c
439
440 /*
441 * Unmaps segments, except for ones marked as pci_p2pdma which do not
442 * require any further action as they contain a bus address.
443 */
444 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
445 int nents, enum dma_data_direction dir, unsigned long attrs)
446 {
447 struct scatterlist *sg;
448 int i;
449 bool need_sync = false;
450
451 for_each_sg(sgl, sg, nents, i) {
452 if (sg_dma_is_bus_address(sg)) {
453 sg_dma_unmark_bus_address(sg);
454 } else {
455 need_sync = true;
> 456 dma_direct_unmap_phys_batch_add(dev, sg->dma_address,
457 sg_dma_len(sg), dir, attrs);
458 }
459 }
460 if (need_sync && !dev_is_dma_coherent(dev))
461 arch_sync_dma_batch_flush();
462 }
463 #endif
464
465 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
466 enum dma_data_direction dir, unsigned long attrs)
467 {
468 struct pci_p2pdma_map_state p2pdma_state = {};
469 struct scatterlist *sg;
470 int i, ret;
471 bool need_sync = false;
472
473 for_each_sg(sgl, sg, nents, i) {
474 switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
475 case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
476 /*
477 * Any P2P mapping that traverses the PCI host bridge
478 * must be mapped with CPU physical address and not PCI
479 * bus addresses.
480 */
481 break;
482 case PCI_P2PDMA_MAP_NONE:
483 need_sync = true;
> 484 sg->dma_address = dma_direct_map_phys_batch_add(dev, sg_phys(sg),
485 sg->length, dir, attrs);
486 if (sg->dma_address == DMA_MAPPING_ERROR) {
487 ret = -EIO;
488 goto out_unmap;
489 }
490 break;
491 case PCI_P2PDMA_MAP_BUS_ADDR:
492 sg->dma_address = pci_p2pdma_bus_addr_map(
493 p2pdma_state.mem, sg_phys(sg));
494 sg_dma_len(sg) = sg->length;
495 sg_dma_mark_bus_address(sg);
496 continue;
497 default:
498 ret = -EREMOTEIO;
499 goto out_unmap;
500 }
501 sg_dma_len(sg) = sg->length;
502 }
503
504 if (need_sync && !dev_is_dma_coherent(dev))
505 arch_sync_dma_batch_flush();
506 return nents;
507
508 out_unmap:
509 dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
510 return ret;
511 }
512
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (4 preceding siblings ...)
2025-12-19 5:36 ` [PATCH 5/6] dma-mapping: Allow batched DMA sync operations if supported by the arch Barry Song
@ 2025-12-19 5:36 ` Barry Song
2025-12-19 6:04 ` [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 5:36 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
Joerg Roedel, linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
Apply batched DMA synchronization to __dma_iova_link() and
iommu_dma_iova_unlink_range_slow(). For multiple
sync_dma_for_device() and sync_dma_for_cpu() calls, we only
need to wait once for the completion of all sync operations,
rather than waiting for each one individually.
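As a rough caller-visible sketch (illustrative only, assuming the
dma_iova_link()/dma_iova_sync() calling convention; phys0/phys1 and the
lengths are made-up placeholders):

	/* each link only issues its cache maintenance, without waiting */
	dma_iova_link(dev, &state, phys0, 0, len0, DMA_TO_DEVICE, 0);
	dma_iova_link(dev, &state, phys1, len0, len1, DMA_TO_DEVICE, 0);
	/* one arch_sync_dma_batch_flush() in dma_iova_sync() covers both */
	dma_iova_sync(dev, &state, 0, len0 + len1);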
I do not have the hardware to test this, so it is marked as
RFC. I would greatly appreciate it if someone could test it.
Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
drivers/iommu/dma-iommu.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c92088855450..95432bdc364f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1837,7 +1837,7 @@ static int __dma_iova_link(struct device *dev, dma_addr_t addr,
int prot = dma_info_to_prot(dir, coherent, attrs);
if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
- arch_sync_dma_for_device(phys, size, dir);
+ arch_sync_dma_for_device_batch_add(phys, size, dir);
return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
prot, GFP_ATOMIC);
@@ -1980,6 +1980,8 @@ int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
dma_addr_t addr = state->addr + offset;
size_t iova_start_pad = iova_offset(iovad, addr);
+ if (!dev_is_dma_coherent(dev))
+ arch_sync_dma_batch_flush();
return iommu_sync_map(domain, addr - iova_start_pad,
iova_align(iovad, size + iova_start_pad));
}
@@ -1993,6 +1995,8 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
size_t iova_start_pad = iova_offset(iovad, addr);
+ bool need_sync_dma = !dev_is_dma_coherent(dev) &&
+ !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO));
dma_addr_t end = addr + size;
do {
@@ -2007,8 +2011,7 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
len = min_t(size_t,
end - addr, iovad->granule - iova_start_pad);
- if (!dev_is_dma_coherent(dev) &&
- !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+ if (need_sync_dma)
arch_sync_dma_for_cpu(phys, len, dir);
swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
@@ -2016,6 +2019,9 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
addr += len;
iova_start_pad = 0;
} while (addr < end);
+
+ if (need_sync_dma)
+ arch_sync_dma_batch_flush();
}
static void __iommu_dma_iova_unlink(struct device *dev,
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 0/6] dma-mapping: arm64: support batched cache sync
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (5 preceding siblings ...)
2025-12-19 5:36 ` [PATCH RFC 6/6] dma-iommu: Allow DMA sync batching for IOVA link/unlink Barry Song
@ 2025-12-19 6:04 ` Barry Song
2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 6:04 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
From: Barry Song <v-songbaohua@oppo.com>
For reasons unclear, the cover letter was omitted from the
initial posting, despite Gmail indicating it was sent. This
is a resend. Apologies for the noise.
Many embedded ARM64 SoCs still lack hardware cache coherency support, which
causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
sync APIs perform cache maintenance one entry at a time. After each entry,
the implementation synchronously waits for the corresponding region’s
D-cache operations to complete. On architectures like arm64, efficiency can
be improved by issuing all entries’ operations first and then performing a
single batched wait for completion.
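In rough pseudo-code, for a non-coherent device the map path therefore
changes from "issue + wait per entry" to the pattern below (illustrative
only; the helper names are the ones introduced later in this series):

	for_each_sg(sgl, sg, nents, i)
		/* issue the dc operations for this entry, no dsb yet */
		arch_sync_dma_for_device_batch_add(sg_phys(sg), sg->length, dir);
	/* a single dsb waits for everything issued above */
	arch_sync_dma_batch_flush();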
Tangquan's results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
I also ran this patch set on an RK3588 Rock5B+ board and
observed that millions of DMA sync operations were batched.
Changes since the RFC:
* Dropped lots of #ifdef/#else/#endif, as suggested by Catalin and Marek,
thanks!
* Also added IOVA link/unlink batching, marked as RFC since I lack the
hardware to test it. This was suggested by Marek, thanks!
RFC link:
https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
Barry Song (6):
arm64: Provide dcache_by_myline_op_nosync helper
arm64: Provide dcache_clean_poc_nosync helper
arm64: Provide dcache_inval_poc_nosync helper
arm64: Provide arch_sync_dma_ batched helpers
dma-mapping: Allow batched DMA sync operations if supported by the
arch
dma-iommu: Allow DMA sync batching for IOVA link/unlink
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/assembler.h | 79 +++++++++++++++++++-------
arch/arm64/include/asm/cacheflush.h | 2 +
arch/arm64/mm/cache.S | 58 +++++++++++++++----
arch/arm64/mm/dma-mapping.c | 24 ++++++++
drivers/iommu/dma-iommu.c | 12 +++-
include/linux/dma-map-ops.h | 22 ++++++++
kernel/dma/Kconfig | 3 +
kernel/dma/direct.c | 28 +++++++---
kernel/dma/direct.h | 86 +++++++++++++++++++++++++----
10 files changed, 262 insertions(+), 53 deletions(-)
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/6] dma-mapping: arm64: support batched cache sync
2025-12-19 5:36 [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
` (6 preceding siblings ...)
2025-12-19 6:04 ` [PATCH 0/6] dma-mapping: arm64: support batched cache sync Barry Song
@ 2025-12-19 6:12 ` Barry Song
7 siblings, 0 replies; 30+ messages in thread
From: Barry Song @ 2025-12-19 6:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will
Cc: v-songbaohua, zhengtangquan, ryan.roberts, anshuman.khandual, maz,
linux-kernel, iommu, surenb, ardb, linux-arm-kernel
It is unclear why, but the cover letter was missed in the
initial posting, even though Gmail shows it as sent. I am
resending it here as a reply to check whether it appears on
the mailing list. Apologies for the inconvenience.
On Fri, Dec 19, 2025 at 1:37 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>
> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> sync APIs perform cache maintenance one entry at a time. After each entry,
> the implementation synchronously waits for the corresponding region’s
> D-cache operations to complete. On architectures like arm64, efficiency can
> be improved by issuing all entries’ operations first and then performing a
> single batched wait for completion.
>
> Tangquan's results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> I also ran this patch set on an RK3588 Rock5B+ board and
> observed that millions of DMA sync operations were batched.
>
> Changes since the RFC:
> * Dropped lots of #ifdef/#else/#endif, as suggested by Catalin and Marek,
> thanks!
> * Also added IOVA link/unlink batching, marked as RFC since I lack the
> hardware to test it. This was suggested by Marek, thanks!
>
> RFC link:
> https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
>
> Barry Song (6):
> arm64: Provide dcache_by_myline_op_nosync helper
> arm64: Provide dcache_clean_poc_nosync helper
> arm64: Provide dcache_inval_poc_nosync helper
> arm64: Provide arch_sync_dma_ batched helpers
> dma-mapping: Allow batched DMA sync operations if supported by the
> arch
> dma-iommu: Allow DMA sync batching for IOVA link/unlink
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/assembler.h | 79 +++++++++++++++++++-------
> arch/arm64/include/asm/cacheflush.h | 2 +
> arch/arm64/mm/cache.S | 58 +++++++++++++++----
> arch/arm64/mm/dma-mapping.c | 24 ++++++++
> drivers/iommu/dma-iommu.c | 12 +++-
> include/linux/dma-map-ops.h | 22 ++++++++
> kernel/dma/Kconfig | 3 +
> kernel/dma/direct.c | 28 +++++++---
> kernel/dma/direct.h | 86 +++++++++++++++++++++++++----
> 10 files changed, 262 insertions(+), 53 deletions(-)
>
> --
> 2.39.3 (Apple Git-146)
>
^ permalink raw reply [flat|nested] 30+ messages in thread