* [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
Marc Zyngier, linux-kernel, Tangquan Zheng, Xueyuan Chen,
Suren Baghdasaryan, Ard Biesheuvel
From: Barry Song <baohua@kernel.org>
dcache_by_myline_op ensures completion of the data cache operations for a
region, while dcache_by_myline_op_nosync only issues them without waiting.
This enables deferred synchronization so completion for multiple regions
can be handled together later.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/assembler.h | 25 +++++++++++++++++++------
arch/arm64/kernel/relocate_kernel.S | 3 ++-
2 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index d3d46e5f7188..cdbaad41bddb 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -371,14 +371,13 @@ alternative_endif
* [start, end) with dcache line size explicitly provided.
*
* op: operation passed to dc instruction
- * domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* linesz: dcache line size
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp
*/
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ .macro dcache_by_myline_op_nosync op, start, end, linesz, tmp, fixup
sub \tmp, \linesz, #1
bic \start, \start, \tmp
alternative_if ARM64_WORKAROUND_4311569
@@ -412,14 +411,28 @@ alternative_if ARM64_WORKAROUND_4311569
cbnz \start, .Ldcache_op\@
.endif
alternative_else_nop_endif
- dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
/*
* Macro to perform a data cache maintenance for the interval
- * [start, end)
+ * [start, end) without waiting for completion
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * fixup: optional label to branch to on user fault
+ * Corrupts: start, end, tmp1, tmp2
+ */
+ .macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2, fixup
+ dcache_line_size \tmp1, \tmp2
+ dcache_by_myline_op_nosync \op, \start, \end, \tmp1, \tmp2, \fixup
+ .endm
+
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) and wait for completion
*
* op: operation passed to dc instruction
* domain: domain used in dsb instruction
@@ -429,8 +442,8 @@ alternative_else_nop_endif
* Corrupts: start, end, tmp1, tmp2
*/
.macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
- dcache_line_size \tmp1, \tmp2
- dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
+ dcache_by_line_op_nosync \op, \start, \end, \tmp1, \tmp2, \fixup
+ dsb \domain
.endm
/*
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac6..6cb4209f5dab 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x19, x13
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
add x1, x19, #PAGE_SIZE
- dcache_by_myline_op civac, sy, x19, x1, x15, x20
+ dcache_by_myline_op_nosync civac, x19, x1, x15, x20
+ dsb sy
b .Lnext
.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper
2026-02-28 22:12 ` [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: Tangquan Zheng, Barry Song, Ryan Roberts, Leon Romanovsky,
Anshuman Khandual, robin.murphy, Xueyuan Chen, linux-kernel,
Suren Baghdasaryan, iommu, Marc Zyngier, will, Ard Biesheuvel,
linux-arm-kernel, m.szyprowski
On Sun, Mar 01, 2026 at 06:12:16AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_by_myline_op ensures completion of the data cache operations for a
> region, while dcache_by_myline_op_nosync only issues them without waiting.
> This enables deferred synchronization so completion for multiple regions
> can be handled together later.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
Marc Zyngier, linux-kernel, Tangquan Zheng, Xueyuan Chen,
Suren Baghdasaryan, Ard Biesheuvel
From: Barry Song <baohua@kernel.org>
dcache_clean_poc_nosync does not wait for the data cache clean to
complete. Later, we wait for completion of all scatter-gather entries
together.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 28ab96e808ef..9b6d0a62cf3d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
extern long caches_clean_inval_user_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 503567c864fd..4a7c7e03785d 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -178,6 +178,21 @@ SYM_FUNC_START(__pi_dcache_clean_poc)
SYM_FUNC_END(__pi_dcache_clean_poc)
SYM_FUNC_ALIAS(dcache_clean_poc, __pi_dcache_clean_poc)
+/*
+ * dcache_clean_poc_nosync(start, end)
+ *
+ * Issue D-cache clean instructions for the interval [start, end);
+ * lines are not necessarily cleaned to the PoC until a later dsb sy.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_clean_poc_nosync)
+ dcache_by_line_op_nosync cvac, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_clean_poc_nosync)
+SYM_FUNC_ALIAS(dcache_clean_poc_nosync, __pi_dcache_clean_poc_nosync)
+
/*
* dcache_clean_pop(start, end)
*
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper
2026-02-28 22:12 ` [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: Tangquan Zheng, Barry Song, Ryan Roberts, Leon Romanovsky,
Anshuman Khandual, robin.murphy, Xueyuan Chen, linux-kernel,
Suren Baghdasaryan, iommu, Marc Zyngier, will, Ard Biesheuvel,
linux-arm-kernel, m.szyprowski
On Sun, Mar 01, 2026 at 06:12:39AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_clean_poc_nosync does not wait for the data cache clean to
> complete. Later, we wait for completion of all scatter-gather entries
> together.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
Marc Zyngier, linux-kernel, Tangquan Zheng, Xueyuan Chen,
Suren Baghdasaryan, Ard Biesheuvel
From: Barry Song <baohua@kernel.org>
dcache_inval_poc_nosync does not wait for the data cache invalidation to
complete. Later, we defer the synchronization so we can wait for all SG
entries together.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 42 +++++++++++++++++++++--------
2 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9b6d0a62cf3d..382b4ac3734d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..ab75c050f559 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro __dcache_inval_poc_nosync
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,41 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ __dcache_inval_poc_nosync
dsb sy
ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue D-cache invalidate instructions for the interval [start, end).
+ * Lines are not necessarily invalidated until an explicit dsb sy is
+ * issued later.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ __dcache_inval_poc_nosync
+ ret
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper
2026-02-28 22:12 ` [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: Tangquan Zheng, Barry Song, Ryan Roberts, Leon Romanovsky,
Anshuman Khandual, robin.murphy, Xueyuan Chen, linux-kernel,
Suren Baghdasaryan, iommu, Marc Zyngier, will, Ard Biesheuvel,
linux-arm-kernel, m.szyprowski
On Sun, Mar 01, 2026 at 06:12:58AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_inval_poc_nosync does not wait for the data cache invalidation to
> complete. Later, we defer the synchronization so we can wait for all SG
> entries together.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-02-28 22:11 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Barry Song
` (2 preceding siblings ...)
2026-02-28 22:12 ` [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2026-03-03 16:33 ` Marek Szyprowski
2026-03-13 19:36 ` Catalin Marinas
3 siblings, 1 reply; 10+ messages in thread
From: Marek Szyprowski @ 2026-03-03 16:33 UTC (permalink / raw)
To: Barry Song, catalin.marinas, robin.murphy, will, iommu,
linux-arm-kernel
Cc: Juergen Gross, Barry Song, Stefano Stabellini, Ryan Roberts,
Leon Romanovsky, Anshuman Khandual, Marc Zyngier, Joerg Roedel,
linux-kernel, Tangquan Zheng, Xueyuan Chen, Oleksandr Tyshchenko,
Suren Baghdasaryan, Ard Biesheuvel, Huacai Zhou
On 28.02.2026 23:11, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>
> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> sync APIs perform cache maintenance one entry at a time. After each entry,
> the implementation synchronously waits for the corresponding region’s
> D-cache operations to complete. On architectures like arm64, efficiency can
> be improved by issuing all entries’ operations first and then performing a
> single batched wait for completion.
>
> Tangquan's results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> Thanks to Xueyuan for volunteering to take on the testing tasks. He
> put significant effort into validating paths such as IOVA link/unlink
> and SWIOTLB on RK3588 boards with NVMe.
Catalin, Will, I would like to merge this into the dma-mapping tree; please
give your ack or comment if you are okay with the arm64-related parts.
> v3:
> * Fold patches 5/8, 7/8, and 8/8 into patch 4/8 as suggested by Leon,
> reducing the series from 8 patches to 5;
> * Fix the SWIOTLB path by ensuring a sync is issued before memcpy;
> * Add ARCH_HAS_BATCHED_DMA_SYNC Kconfig as suggested by Leon;
> * Collect Reviewed-by tags from Leon and Juergen. Leon's tag is not
> added to patch 4 since it has changed significantly since v2 and
> requires re-review;
> * Rename some asm macros and functions as suggested by Will;
> * Add Xueyuan's Tested-by. His help is greatly appreciated!
> v2 link:
> https://lore.kernel.org/lkml/20251226225254.46197-1-21cnbao@gmail.com/
>
> v2:
> * Refine a large amount of arm64 asm code based on feedback from
> Robin, thanks!
> * Drop batch_add APIs and always use arch_sync_dma_for_* + flush,
> even for a single buffer, based on Leon’s suggestion, thanks!
> * Refine a large amount of code based on feedback from Leon, thanks!
> * Also add batch support for iommu_dma_sync_sg_for_{cpu,device}
> v1 link:
> https://lore.kernel.org/lkml/20251219053658.84978-1-21cnbao@gmail.com/
>
> v1, diff with RFC:
> * Drop a large number of #ifdef/#else/#endif blocks based on feedback
> from Catalin and Marek, thanks!
> * Also add batched iova link/unlink support, marked as RFC since I lack
> the required hardware. This was suggested by Marek, thanks!
> RFC link:
> https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
>
> Barry Song (5):
> arm64: Provide dcache_by_myline_op_nosync helper
> arm64: Provide dcache_clean_poc_nosync helper
> arm64: Provide dcache_inval_poc_nosync helper
> dma-mapping: Separate DMA sync issuing and completion waiting
> dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/assembler.h | 25 ++++++++++---
> arch/arm64/include/asm/cache.h | 5 +++
> arch/arm64/include/asm/cacheflush.h | 2 +
> arch/arm64/kernel/relocate_kernel.S | 3 +-
> arch/arm64/mm/cache.S | 57 +++++++++++++++++++++++------
> arch/arm64/mm/dma-mapping.c | 4 +-
> drivers/iommu/dma-iommu.c | 35 ++++++++++++++----
> drivers/xen/swiotlb-xen.c | 24 ++++++++----
> include/linux/dma-map-ops.h | 6 +++
> kernel/dma/Kconfig | 3 ++
> kernel/dma/direct.c | 23 +++++++++---
> kernel/dma/direct.h | 21 ++++++++---
> kernel/dma/mapping.c | 6 +--
> kernel/dma/swiotlb.c | 7 +++-
> 15 files changed, 171 insertions(+), 51 deletions(-)
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Huacai Zhou <zhouhuacai@oppo.com>
> Cc: Xueyuan Chen <xueyuan.chen21@gmail.com>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-03-03 16:33 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Marek Szyprowski
@ 2026-03-13 19:36 ` Catalin Marinas
2026-03-16 7:24 ` Marek Szyprowski
0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:36 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Juergen Gross, Tangquan Zheng, Barry Song, Stefano Stabellini,
Ryan Roberts, Leon Romanovsky, Anshuman Khandual, will,
Joerg Roedel, Barry Song, linux-kernel, Suren Baghdasaryan, iommu,
Xueyuan Chen, Marc Zyngier, Oleksandr Tyshchenko, robin.murphy,
Ard Biesheuvel, linux-arm-kernel, Huacai Zhou
On Tue, Mar 03, 2026 at 05:33:37PM +0100, Marek Szyprowski wrote:
> On 28.02.2026 23:11, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> > causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
> >
> > For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> > sync APIs perform cache maintenance one entry at a time. After each entry,
> > the implementation synchronously waits for the corresponding region’s
> > D-cache operations to complete. On architectures like arm64, efficiency can
> > be improved by issuing all entries’ operations first and then performing a
> > single batched wait for completion.
> >
> > Tangquan's results show that batched synchronization can reduce
> > dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> > phone platform (MediaTek Dimensity 9500). The tests were performed by
> > pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> > running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> > sg entries per buffer) for 200 iterations and then averaging the
> > results.
> >
> > Thanks to Xueyuan for volunteering to take on the testing tasks. He
> > put significant effort into validating paths such as IOVA link/unlink
> > and SWIOTLB on RK3588 boards with NVMe.
>
> Catalin, Will, I would like to merge this into the dma-mapping tree; please
> give your ack or comment if you are okay with the arm64-related parts.
Sorry for the delay. Yes, feel free to pick them up. I doubt there would
be any conflicts in this area with what I'm merging through the arm64
tree.
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-03-13 19:36 ` Catalin Marinas
@ 2026-03-16 7:24 ` Marek Szyprowski
0 siblings, 0 replies; 10+ messages in thread
From: Marek Szyprowski @ 2026-03-16 7:24 UTC (permalink / raw)
To: Catalin Marinas
Cc: Juergen Gross, Tangquan Zheng, Barry Song, Stefano Stabellini,
Ryan Roberts, Leon Romanovsky, Anshuman Khandual, will,
Joerg Roedel, Barry Song, linux-kernel, Suren Baghdasaryan, iommu,
Xueyuan Chen, Marc Zyngier, Oleksandr Tyshchenko, robin.murphy,
Ard Biesheuvel, linux-arm-kernel, Huacai Zhou
On 13.03.2026 20:36, Catalin Marinas wrote:
> On Tue, Mar 03, 2026 at 05:33:37PM +0100, Marek Szyprowski wrote:
>> On 28.02.2026 23:11, Barry Song wrote:
>>> From: Barry Song <baohua@kernel.org>
>>>
>>> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
>>> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>>>
>>> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
>>> sync APIs perform cache maintenance one entry at a time. After each entry,
>>> the implementation synchronously waits for the corresponding region’s
>>> D-cache operations to complete. On architectures like arm64, efficiency can
>>> be improved by issuing all entries’ operations first and then performing a
>>> single batched wait for completion.
>>>
>>> Tangquan's results show that batched synchronization can reduce
>>> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
>>> phone platform (MediaTek Dimensity 9500). The tests were performed by
>>> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
>>> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
>>> sg entries per buffer) for 200 iterations and then averaging the
>>> results.
>>>
>>> Thanks to Xueyuan for volunteering to take on the testing tasks. He
>>> put significant effort into validating paths such as IOVA link/unlink
>>> and SWIOTLB on RK3588 boards with NVMe.
>> Catalin, Will, I would like to merge this into the dma-mapping tree; please
>> give your ack or comment if you are okay with the arm64-related parts.
> Sorry for the delay. Yes, feel free to pick them up. I doubt there would
> be any conflicts in this area with what I'm merging through the arm64
> tree.
Thanks, applied to dma-mapping-for-next.
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 10+ messages in thread