* [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
@ 2026-02-28 22:11 ` Barry Song
2026-02-28 22:12 ` [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:11 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Joerg Roedel, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Tangquan Zheng,
Huacai Zhou, Xueyuan Chen
From: Barry Song <baohua@kernel.org>
Many embedded ARM64 SoCs still lack hardware cache coherency support, which
causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
sync APIs perform cache maintenance one entry at a time. After each entry,
the implementation synchronously waits for the corresponding region’s
D-cache operations to complete. On architectures like arm64, efficiency can
be improved by issuing all entries’ operations first and then performing a
single batched wait for completion.
Tangquan's results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
sg entries per buffer) for 200 iterations and then averaging the
results.
Thanks to Xueyuan for volunteering to take on the testing tasks. He
put significant effort into validating paths such as IOVA link/unlink
and SWIOTLB on RK3588 boards with NVMe.
v3:
* Fold patches 5/8, 7/8, and 8/8 into patch 4/8 as suggested by Leon,
reducing the series from 8 patches to 5;
* Fix the SWIOTLB path by ensuring a sync is issued before memcpy;
* Add ARCH_HAS_BATCHED_DMA_SYNC Kconfig as suggested by Leon;
* Collect Reviewed-by tags from Leon and Juergen. Leon's tag is not
added to patch 4 since it has changed significantly since v2 and
requires re-review;
* Rename some asm macros and functions as suggested by Will;
* Add Xueyuan's Tested-by. His help is greatly appreciated!
v2 link:
https://lore.kernel.org/lkml/20251226225254.46197-1-21cnbao@gmail.com/
v2:
* Refine a large amount of arm64 asm code based on feedback from
Robin, thanks!
* Drop batch_add APIs and always use arch_sync_dma_for_* + flush,
even for a single buffer, based on Leon’s suggestion, thanks!
* Refine a large amount of code based on feedback from Leon, thanks!
* Also add batch support for iommu_dma_sync_sg_for_{cpu,device}
v1 link:
https://lore.kernel.org/lkml/20251219053658.84978-1-21cnbao@gmail.com/
v1, diff with RFC:
* Drop a large number of #ifdef/#else/#endif blocks based on feedback
from Catalin and Marek, thanks!
* Also add batched iova link/unlink support, marked as RFC since I lack
the required hardware. This was suggested by Marek, thanks!
RFC link:
https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
Barry Song (5):
arm64: Provide dcache_by_myline_op_nosync helper
arm64: Provide dcache_clean_poc_nosync helper
arm64: Provide dcache_inval_poc_nosync helper
dma-mapping: Separate DMA sync issuing and completion waiting
dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/assembler.h | 25 ++++++++++---
arch/arm64/include/asm/cache.h | 5 +++
arch/arm64/include/asm/cacheflush.h | 2 +
arch/arm64/kernel/relocate_kernel.S | 3 +-
arch/arm64/mm/cache.S | 57 +++++++++++++++++++++++------
arch/arm64/mm/dma-mapping.c | 4 +-
drivers/iommu/dma-iommu.c | 35 ++++++++++++++----
drivers/xen/swiotlb-xen.c | 24 ++++++++----
include/linux/dma-map-ops.h | 6 +++
kernel/dma/Kconfig | 3 ++
kernel/dma/direct.c | 23 +++++++++---
kernel/dma/direct.h | 21 ++++++++---
kernel/dma/mapping.c | 6 +--
kernel/dma/swiotlb.c | 7 +++-
15 files changed, 171 insertions(+), 51 deletions(-)
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Huacai Zhou <zhouhuacai@oppo.com>
Cc: Xueyuan Chen <xueyuan.chen21@gmail.com>
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
From: Barry Song <baohua@kernel.org>
dcache_by_myline_op ensures completion of the data cache operations for a
region, while dcache_by_myline_op_nosync only issues them without waiting.
This enables deferred synchronization so completion for multiple regions
can be handled together later.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/assembler.h | 25 +++++++++++++++++++------
arch/arm64/kernel/relocate_kernel.S | 3 ++-
2 files changed, 21 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index d3d46e5f7188..cdbaad41bddb 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -371,14 +371,13 @@ alternative_endif
* [start, end) with dcache line size explicitly provided.
*
* op: operation passed to dc instruction
- * domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* linesz: dcache line size
* fixup: optional label to branch to on user fault
* Corrupts: start, end, tmp
*/
- .macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+ .macro dcache_by_myline_op_nosync op, start, end, linesz, tmp, fixup
sub \tmp, \linesz, #1
bic \start, \start, \tmp
alternative_if ARM64_WORKAROUND_4311569
@@ -412,14 +411,28 @@ alternative_if ARM64_WORKAROUND_4311569
cbnz \start, .Ldcache_op\@
.endif
alternative_else_nop_endif
- dsb \domain
_cond_uaccess_extable .Ldcache_op\@, \fixup
.endm
/*
* Macro to perform a data cache maintenance for the interval
- * [start, end)
+ * [start, end) without waiting for completion
+ *
+ * op: operation passed to dc instruction
+ * start: starting virtual address of the region
+ * end: end virtual address of the region
+ * fixup: optional label to branch to on user fault
+ * Corrupts: start, end, tmp1, tmp2
+ */
+ .macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2, fixup
+ dcache_line_size \tmp1, \tmp2
+ dcache_by_myline_op_nosync \op, \start, \end, \tmp1, \tmp2, \fixup
+ .endm
+
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) and wait for completion
*
* op: operation passed to dc instruction
* domain: domain used in dsb instruction
@@ -429,8 +442,8 @@ alternative_else_nop_endif
* Corrupts: start, end, tmp1, tmp2
*/
.macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
- dcache_line_size \tmp1, \tmp2
- dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
+ dcache_by_line_op_nosync \op, \start, \end, \tmp1, \tmp2, \fixup
+ dsb \domain
.endm
/*
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac6..6cb4209f5dab 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
mov x19, x13
copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
add x1, x19, #PAGE_SIZE
- dcache_by_myline_op civac, sy, x19, x1, x15, x20
+ dcache_by_myline_op_nosync civac, x19, x1, x15, x20
+ dsb sy
b .Lnext
.Ltest_indirection:
tbz x16, IND_INDIRECTION_BIT, .Ltest_destination
--
2.39.3 (Apple Git-146)
* [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
From: Barry Song <baohua@kernel.org>
dcache_clean_poc_nosync does not wait for the data cache clean to
complete. Later, we wait for completion of all scatter-gather entries
together.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 15 +++++++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 28ab96e808ef..9b6d0a62cf3d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
extern long caches_clean_inval_user_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 503567c864fd..4a7c7e03785d 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -178,6 +178,21 @@ SYM_FUNC_START(__pi_dcache_clean_poc)
SYM_FUNC_END(__pi_dcache_clean_poc)
SYM_FUNC_ALIAS(dcache_clean_poc, __pi_dcache_clean_poc)
+/*
+ * dcache_clean_poc_nosync(start, end)
+ *
+ * Issue D-cache clean instructions for the interval [start, end). Lines are
+ * not necessarily cleaned to the PoC until an explicit dsb sy follows.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_clean_poc_nosync)
+ dcache_by_line_op_nosync cvac, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_clean_poc_nosync)
+SYM_FUNC_ALIAS(dcache_clean_poc_nosync, __pi_dcache_clean_poc_nosync)
+
/*
* dcache_clean_pop(start, end)
*
--
2.39.3 (Apple Git-146)
* [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper
@ 2026-02-28 22:12 ` Barry Song
2026-03-13 19:35 ` Catalin Marinas
0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2026-02-28 22:12 UTC (permalink / raw)
To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
linux-arm-kernel
Cc: linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
From: Barry Song <baohua@kernel.org>
dcache_inval_poc_nosync does not wait for the data cache invalidation to
complete. Later, we defer the synchronization so we can wait for all SG
entries together.
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
arch/arm64/include/asm/cacheflush.h | 1 +
arch/arm64/mm/cache.S | 42 +++++++++++++++++++++--------
2 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9b6d0a62cf3d..382b4ac3734d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
extern void dcache_inval_poc(unsigned long start, unsigned long end);
extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
extern void dcache_clean_pop(unsigned long start, unsigned long end);
extern void dcache_clean_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..ab75c050f559 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
ret
SYM_FUNC_END(dcache_clean_pou)
-/*
- * dcache_inval_poc(start, end)
- *
- * Ensure that any D-cache lines for the interval [start, end)
- * are invalidated. Any partial lines at the ends of the interval are
- * also cleaned to PoC to prevent data loss.
- *
- * - start - kernel start address of region
- * - end - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro __dcache_inval_poc_nosync
dcache_line_size x2, x3
sub x3, x2, #1
tst x1, x3 // end cache line aligned?
@@ -158,11 +148,41 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
3: add x0, x0, x2
cmp x0, x1
b.lo 2b
+.endm
+
+/*
+ * dcache_inval_poc(start, end)
+ *
+ * Ensure that any D-cache lines for the interval [start, end)
+ * are invalidated. Any partial lines at the ends of the interval are
+ * also cleaned to PoC to prevent data loss.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+ __dcache_inval_poc_nosync
dsb sy
ret
SYM_FUNC_END(__pi_dcache_inval_poc)
SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
+/*
+ * dcache_inval_poc_nosync(start, end)
+ *
+ * Issue D-cache invalidation instructions for the interval [start, end).
+ * The invalidation is not guaranteed to complete until an explicit dsb sy
+ * is issued later.
+ *
+ * - start - kernel start address of region
+ * - end - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+ __dcache_inval_poc_nosync
+ ret
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
/*
* dcache_clean_poc(start, end)
*
--
2.39.3 (Apple Git-146)
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-02-28 22:11 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Barry Song
` (2 preceding siblings ...)
2026-02-28 22:12 ` [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2026-03-03 16:33 ` Marek Szyprowski
2026-03-13 19:36 ` Catalin Marinas
3 siblings, 1 reply; 10+ messages in thread
From: Marek Szyprowski @ 2026-03-03 16:33 UTC (permalink / raw)
To: Barry Song, catalin.marinas, robin.murphy, will, iommu,
linux-arm-kernel
Cc: linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Joerg Roedel, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Tangquan Zheng,
Huacai Zhou, Xueyuan Chen
On 28.02.2026 23:11, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>
> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> sync APIs perform cache maintenance one entry at a time. After each entry,
> the implementation synchronously waits for the corresponding region’s
> D-cache operations to complete. On architectures like arm64, efficiency can
> be improved by issuing all entries’ operations first and then performing a
> single batched wait for completion.
>
> Tangquan's results show that batched synchronization can reduce
> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> phone platform (MediaTek Dimensity 9500). The tests were performed by
> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> sg entries per buffer) for 200 iterations and then averaging the
> results.
>
> Thanks to Xueyuan for volunteering to take on the testing tasks. He
> put significant effort into validating paths such as IOVA link/unlink
> and SWIOTLB on RK3588 boards with NVMe.
Catalin, Will, I would like to merge this into the dma-mapping tree; please
give your ack, or comment if you are okay with the ARM64-related parts.
> v3:
> * Fold patches 5/8, 7/8, and 8/8 into patch 4/8 as suggested by Leon,
> reducing the series from 8 patches to 5;
> * Fix the SWIOTLB path by ensuring a sync is issued before memcpy;
> * Add ARCH_HAS_BATCHED_DMA_SYNC Kconfig as suggested by Leon;
> * Collect Reviewed-by tags from Leon and Juergen. Leon's tag is not
> added to patch 4 since it has changed significantly since v2 and
> requires re-review;
> * Rename some asm macros and functions as suggested by Will;
> * Add Xueyuan's Tested-by. His help is greatly appreciated!
> v2 link:
> https://lore.kernel.org/lkml/20251226225254.46197-1-21cnbao@gmail.com/
>
> v2:
> * Refine a large amount of arm64 asm code based on feedback from
> Robin, thanks!
> * Drop batch_add APIs and always use arch_sync_dma_for_* + flush,
> even for a single buffer, based on Leon’s suggestion, thanks!
> * Refine a large amount of code based on feedback from Leon, thanks!
> * Also add batch support for iommu_dma_sync_sg_for_{cpu,device}
> v1 link:
> https://lore.kernel.org/lkml/20251219053658.84978-1-21cnbao@gmail.com/
>
> v1, diff with RFC:
> * Drop a large number of #ifdef/#else/#endif blocks based on feedback
> from Catalin and Marek, thanks!
> * Also add batched iova link/unlink support, marked as RFC since I lack
> the required hardware. This was suggested by Marek, thanks!
> RFC link:
> https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/
>
> Barry Song (5):
> arm64: Provide dcache_by_myline_op_nosync helper
> arm64: Provide dcache_clean_poc_nosync helper
> arm64: Provide dcache_inval_poc_nosync helper
> dma-mapping: Separate DMA sync issuing and completion waiting
> dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/assembler.h | 25 ++++++++++---
> arch/arm64/include/asm/cache.h | 5 +++
> arch/arm64/include/asm/cacheflush.h | 2 +
> arch/arm64/kernel/relocate_kernel.S | 3 +-
> arch/arm64/mm/cache.S | 57 +++++++++++++++++++++++------
> arch/arm64/mm/dma-mapping.c | 4 +-
> drivers/iommu/dma-iommu.c | 35 ++++++++++++++----
> drivers/xen/swiotlb-xen.c | 24 ++++++++----
> include/linux/dma-map-ops.h | 6 +++
> kernel/dma/Kconfig | 3 ++
> kernel/dma/direct.c | 23 +++++++++---
> kernel/dma/direct.h | 21 ++++++++---
> kernel/dma/mapping.c | 6 +--
> kernel/dma/swiotlb.c | 7 +++-
> 15 files changed, 171 insertions(+), 51 deletions(-)
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Cc: Huacai Zhou <zhouhuacai@oppo.com>
> Cc: Xueyuan Chen <xueyuan.chen21@gmail.com>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
* Re: [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper
2026-02-28 22:12 ` [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: m.szyprowski, robin.murphy, will, iommu, linux-arm-kernel,
linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
On Sun, Mar 01, 2026 at 06:12:16AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_by_myline_op ensures completion of the data cache operations for a
> region, while dcache_by_myline_op_nosync only issues them without waiting.
> This enables deferred synchronization so completion for multiple regions
> can be handled together later.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper
2026-02-28 22:12 ` [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: m.szyprowski, robin.murphy, will, iommu, linux-arm-kernel,
linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
On Sun, Mar 01, 2026 at 06:12:39AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_clean_poc_nosync does not wait for the data cache clean to
> complete. Later, we wait for completion of all scatter-gather entries
> together.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper
2026-02-28 22:12 ` [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2026-03-13 19:35 ` Catalin Marinas
0 siblings, 0 replies; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:35 UTC (permalink / raw)
To: Barry Song
Cc: m.szyprowski, robin.murphy, will, iommu, linux-arm-kernel,
linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Tangquan Zheng, Xueyuan Chen
On Sun, Mar 01, 2026 at 06:12:58AM +0800, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
>
> dcache_inval_poc_nosync does not wait for the data cache invalidation to
> complete. Later, we defer the synchronization so we can wait for all SG
> entries together.
>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-03-03 16:33 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Marek Szyprowski
@ 2026-03-13 19:36 ` Catalin Marinas
2026-03-16 7:24 ` Marek Szyprowski
0 siblings, 1 reply; 10+ messages in thread
From: Catalin Marinas @ 2026-03-13 19:36 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Barry Song, robin.murphy, will, iommu, linux-arm-kernel,
linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Joerg Roedel, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Tangquan Zheng,
Huacai Zhou, Xueyuan Chen
On Tue, Mar 03, 2026 at 05:33:37PM +0100, Marek Szyprowski wrote:
> On 28.02.2026 23:11, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Many embedded ARM64 SoCs still lack hardware cache coherency support, which
> > causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
> >
> > For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
> > sync APIs perform cache maintenance one entry at a time. After each entry,
> > the implementation synchronously waits for the corresponding region’s
> > D-cache operations to complete. On architectures like arm64, efficiency can
> > be improved by issuing all entries’ operations first and then performing a
> > single batched wait for completion.
> >
> > Tangquan's results show that batched synchronization can reduce
> > dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
> > phone platform (MediaTek Dimensity 9500). The tests were performed by
> > pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
> > running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
> > sg entries per buffer) for 200 iterations and then averaging the
> > results.
> >
> > Thanks to Xueyuan for volunteering to take on the testing tasks. He
> > put significant effort into validating paths such as IOVA link/unlink
> > and SWIOTLB on RK3588 boards with NVMe.
>
> Catalin, Will, I would like to merge this into the dma-mapping tree; please
> give your ack, or comment if you are okay with the ARM64-related parts.
Sorry for the delay. Yes, feel free to pick them up. I doubt there would
be any conflicts in this area with what I'm merging through the arm64
tree.
Thanks.
--
Catalin
* Re: [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync
2026-03-13 19:36 ` Catalin Marinas
@ 2026-03-16 7:24 ` Marek Szyprowski
0 siblings, 0 replies; 10+ messages in thread
From: Marek Szyprowski @ 2026-03-16 7:24 UTC (permalink / raw)
To: Catalin Marinas
Cc: Barry Song, robin.murphy, will, iommu, linux-arm-kernel,
linux-kernel, Barry Song, Leon Romanovsky, Ada Couprie Diaz,
Ard Biesheuvel, Marc Zyngier, Anshuman Khandual, Ryan Roberts,
Suren Baghdasaryan, Joerg Roedel, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Tangquan Zheng,
Huacai Zhou, Xueyuan Chen
On 13.03.2026 20:36, Catalin Marinas wrote:
> On Tue, Mar 03, 2026 at 05:33:37PM +0100, Marek Szyprowski wrote:
>> On 28.02.2026 23:11, Barry Song wrote:
>>> From: Barry Song <baohua@kernel.org>
>>>
>>> Many embedded ARM64 SoCs still lack hardware cache coherency support, which
>>> causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.
>>>
>>> For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
>>> sync APIs perform cache maintenance one entry at a time. After each entry,
>>> the implementation synchronously waits for the corresponding region’s
>>> D-cache operations to complete. On architectures like arm64, efficiency can
>>> be improved by issuing all entries’ operations first and then performing a
>>> single batched wait for completion.
>>>
>>> Tangquan's results show that batched synchronization can reduce
>>> dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
>>> phone platform (MediaTek Dimensity 9500). The tests were performed by
>>> pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
>>> running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB
>>> sg entries per buffer) for 200 iterations and then averaging the
>>> results.
>>>
>>> Thanks to Xueyuan for volunteering to take on the testing tasks. He
>>> put significant effort into validating paths such as IOVA link/unlink
>>> and SWIOTLB on RK3588 boards with NVMe.
>> Catalin, Will, I would like to merge this into the dma-mapping tree; please
>> give your ack, or comment if you are okay with the ARM64-related parts.
> Sorry for the delay. Yes, feel free to pick them up. I doubt there would
> be any conflicts in this area with what I'm merging through the arm64
> tree.
Thanks, applied to dma-mapping-for-next.
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Thread overview: 10+ messages
[not found] <CGME20260228221143eucas1p12a276d4216b6ce0f3c374b093f73acd5@eucas1p1.samsung.com>
2026-02-28 22:11 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Barry Song
2026-02-28 22:12 ` [PATCH v3 1/5] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
2026-03-13 19:35 ` Catalin Marinas
2026-02-28 22:12 ` [PATCH v3 2/5] arm64: Provide dcache_clean_poc_nosync helper Barry Song
2026-03-13 19:35 ` Catalin Marinas
2026-02-28 22:12 ` [PATCH v3 3/5] arm64: Provide dcache_inval_poc_nosync helper Barry Song
2026-03-13 19:35 ` Catalin Marinas
2026-03-03 16:33 ` [PATCH v3 0/5] dma-mapping: arm64: support batched cache sync Marek Szyprowski
2026-03-13 19:36 ` Catalin Marinas
2026-03-16 7:24 ` Marek Szyprowski