linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync
@ 2025-12-26 22:52 Barry Song
  2025-12-26 22:52 ` [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Juergen Gross, Barry Song, Stefano Stabellini, Ryan Roberts,
	Leon Romanovsky, Anshuman Khandual, Marc Zyngier, Joerg Roedel,
	linux-kernel, Tangquan Zheng, Oleksandr Tyshchenko, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel, Huacai Zhou

From: Barry Song <baohua@kernel.org>

Many embedded ARM64 SoCs still lack hardware cache coherency support, which
causes DMA mapping operations to appear as hotspots in on-CPU flame graphs.

For an SG list with *nents* entries, the current dma_map/unmap_sg() and DMA
sync APIs perform cache maintenance one entry at a time. After each entry,
the implementation synchronously waits for the corresponding region’s
D-cache operations to complete. On architectures like arm64, efficiency can
be improved by issuing all entries’ operations first and then performing a
single batched wait for completion.
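
Conceptually, the change is from a per-entry pattern along these lines
(simplified sketch, not the exact kernel code):

	for_each_sg(sgl, sg, nents, i)
		/* each call currently ends with its own dsb(sy) on arm64 */
		arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);

to issuing everything first and waiting once:

	for_each_sg(sgl, sg, nents, i)
		arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
	arch_sync_dma_flush();		/* a single dsb(sy) on arm64 */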

Tangquan's results show that batched synchronization can reduce
dma_map_sg() time by 64.61% and dma_unmap_sg() time by 66.60% on an MTK
phone platform (MediaTek Dimensity 9500). The tests were performed by
pinning the task to CPU7 and fixing the CPU frequency at 2.6 GHz,
running dma_map_sg() and dma_unmap_sg() on 10 MB buffers (10 MB / 4 KB =
2560 sg entries per buffer) for 200 iterations and then averaging the
results.

I also ran this patch set on an RK3588 Rock5B+ board and
observed that millions of DMA sync operations were batched.

v2:
 * Refine a large amount of arm64 asm code based on feedback from
   Robin, thanks!
 * Drop batch_add APIs and always use arch_sync_dma_for_* + flush,
   even for a single buffer, based on Leon’s suggestion, thanks!
 * Refine a large amount of code based on feedback from Leon, thanks!
 * Also add batch support for iommu_dma_sync_sg_for_{cpu,device}
v1 link:
 https://lore.kernel.org/lkml/20251219053658.84978-1-21cnbao@gmail.com/

v1, diff with RFC:
 * Drop a large number of #ifdef/#else/#endif blocks based on feedback
   from Catalin and Marek, thanks!
 * Also add batched iova link/unlink support, marked as RFC since I lack
   the required hardware. This was suggested by Marek, thanks!
RFC link:
 https://lore.kernel.org/lkml/20251029023115.22809-1-21cnbao@gmail.com/

Barry Song (8):
  arm64: Provide dcache_by_myline_op_nosync helper
  arm64: Provide dcache_clean_poc_nosync helper
  arm64: Provide dcache_inval_poc_nosync helper
  dma-mapping: Separate DMA sync issuing and completion waiting
  dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
  dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
  dma-iommu: Support DMA sync batch mode for IOVA link and unlink
  dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu,
    device}

 arch/arm64/include/asm/assembler.h  | 24 +++++++++---
 arch/arm64/include/asm/cache.h      |  6 +++
 arch/arm64/include/asm/cacheflush.h |  2 +
 arch/arm64/kernel/relocate_kernel.S |  3 +-
 arch/arm64/mm/cache.S               | 57 +++++++++++++++++++++++------
 arch/arm64/mm/dma-mapping.c         |  4 +-
 drivers/iommu/dma-iommu.c           | 35 ++++++++++++++----
 drivers/xen/swiotlb-xen.c           | 24 ++++++++----
 include/linux/dma-map-ops.h         |  6 +++
 kernel/dma/direct.c                 | 23 +++++++++---
 kernel/dma/direct.h                 | 21 ++++++++---
 kernel/dma/mapping.c                |  6 +--
 kernel/dma/swiotlb.c                |  4 +-
 13 files changed, 165 insertions(+), 50 deletions(-)

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Cc: Huacai Zhou <zhouhuacai@oppo.com>
--
2.43.0




* [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-26 22:52 ` [PATCH v2 2/8] arm64: Provide dcache_clean_poc_nosync helper Barry Song
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, linux-kernel, Tangquan Zheng, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

dcache_by_line_op ensures completion of the data cache operations for a
region, while the new dcache_by_line_op_nosync only issues them without
waiting. Both are built on raw_dcache_by_myline_op, which omits the
trailing dsb. This enables deferred synchronization so completion for
multiple regions can be handled together later.

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm64/include/asm/assembler.h  | 24 +++++++++++++++++++-----
 arch/arm64/kernel/relocate_kernel.S |  3 ++-
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index f0ca7196f6fa..b408ed61866f 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -371,14 +371,13 @@ alternative_endif
  * [start, end) with dcache line size explicitly provided.
  *
  * 	op:		operation passed to dc instruction
- * 	domain:		domain used in dsb instruction
  * 	start:          starting virtual address of the region
  * 	end:            end virtual address of the region
  *	linesz:		dcache line size
  * 	fixup:		optional label to branch to on user fault
  * 	Corrupts:       start, end, tmp
  */
-	.macro dcache_by_myline_op op, domain, start, end, linesz, tmp, fixup
+	.macro raw_dcache_by_myline_op op, start, end, linesz, tmp, fixup
 	sub	\tmp, \linesz, #1
 	bic	\start, \start, \tmp
 .Ldcache_op\@:
@@ -402,14 +401,13 @@ alternative_endif
 	add	\start, \start, \linesz
 	cmp	\start, \end
 	b.lo	.Ldcache_op\@
-	dsb	\domain
 
 	_cond_uaccess_extable .Ldcache_op\@, \fixup
 	.endm
 
 /*
  * Macro to perform a data cache maintenance for the interval
- * [start, end)
+ * [start, end) and wait for completion
  *
  * 	op:		operation passed to dc instruction
  * 	domain:		domain used in dsb instruction
@@ -420,7 +418,23 @@ alternative_endif
  */
 	.macro dcache_by_line_op op, domain, start, end, tmp1, tmp2, fixup
 	dcache_line_size \tmp1, \tmp2
-	dcache_by_myline_op \op, \domain, \start, \end, \tmp1, \tmp2, \fixup
+	raw_dcache_by_myline_op \op, \start, \end, \tmp1, \tmp2, \fixup
+	dsb \domain
+	.endm
+
+/*
+ * Macro to perform a data cache maintenance for the interval
+ * [start, end) without waiting for completion
+ *
+ * 	op:		operation passed to dc instruction
+ * 	start:          starting virtual address of the region
+ * 	end:            end virtual address of the region
+ * 	fixup:		optional label to branch to on user fault
+ * 	Corrupts:       start, end, tmp1, tmp2
+ */
+	.macro dcache_by_line_op_nosync op, start, end, tmp1, tmp2, fixup
+	dcache_line_size \tmp1, \tmp2
+	raw_dcache_by_myline_op \op, \start, \end, \tmp1, \tmp2, \fixup
 	.endm
 
 /*
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
index 413f899e4ac6..71938eb3a3a3 100644
--- a/arch/arm64/kernel/relocate_kernel.S
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -64,7 +64,8 @@ SYM_CODE_START(arm64_relocate_new_kernel)
 	mov	x19, x13
 	copy_page x13, x12, x1, x2, x3, x4, x5, x6, x7, x8
 	add	x1, x19, #PAGE_SIZE
-	dcache_by_myline_op civac, sy, x19, x1, x15, x20
+	raw_dcache_by_myline_op civac, x19, x1, x15, x20
+	dsb	sy
 	b	.Lnext
 .Ltest_indirection:
 	tbz	x16, IND_INDIRECTION_BIT, .Ltest_destination
-- 
2.43.0




* [PATCH v2 2/8] arm64: Provide dcache_clean_poc_nosync helper
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
  2025-12-26 22:52 ` [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-26 22:52 ` [PATCH v2 3/8] arm64: Provide dcache_inval_poc_nosync helper Barry Song
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, linux-kernel, Tangquan Zheng, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

dcache_clean_poc_nosync does not wait for the data cache clean to
complete. Later, we wait for completion of all scatter-gather entries
together.
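
Illustrative usage (not part of this patch): several regions can be
issued back to back and completed with a single barrier, e.g.

	dcache_clean_poc_nosync(start0, end0);
	dcache_clean_poc_nosync(start1, end1);
	dsb(sy);	/* completes both clean operations */

where start0/end0 and start1/end1 are arbitrary example regions.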

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/cache.S               | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 28ab96e808ef..9b6d0a62cf3d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
 extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
 extern void dcache_inval_poc(unsigned long start, unsigned long end);
 extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
 extern void dcache_clean_pop(unsigned long start, unsigned long end);
 extern void dcache_clean_pou(unsigned long start, unsigned long end);
 extern long caches_clean_inval_user_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 503567c864fd..4a7c7e03785d 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -178,6 +178,21 @@ SYM_FUNC_START(__pi_dcache_clean_poc)
 SYM_FUNC_END(__pi_dcache_clean_poc)
 SYM_FUNC_ALIAS(dcache_clean_poc, __pi_dcache_clean_poc)
 
+/*
+ *	dcache_clean_poc_nosync(start, end)
+ *
+ * 	Issue the D-cache clean instructions for the interval [start, end);
+ * 	the lines are not necessarily cleaned to the PoC until a later dsb sy.
+ *
+ *	- start   - virtual start address of region
+ *	- end     - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_clean_poc_nosync)
+	dcache_by_line_op_nosync cvac, x0, x1, x2, x3
+	ret
+SYM_FUNC_END(__pi_dcache_clean_poc_nosync)
+SYM_FUNC_ALIAS(dcache_clean_poc_nosync, __pi_dcache_clean_poc_nosync)
+
 /*
  *	dcache_clean_pop(start, end)
  *
-- 
2.43.0




* [PATCH v2 3/8] arm64: Provide dcache_inval_poc_nosync helper
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
  2025-12-26 22:52 ` [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
  2025-12-26 22:52 ` [PATCH v2 2/8] arm64: Provide dcache_clean_poc_nosync helper Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-26 22:52 ` [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting Barry Song
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, linux-kernel, Tangquan Zheng, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

dcache_inval_poc_nosync issues the data cache invalidation for a region
without waiting for it to complete, so the synchronization can be
deferred and all SG entries waited on together.

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/cache.S               | 42 +++++++++++++++++++++--------
 2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9b6d0a62cf3d..382b4ac3734d 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -74,6 +74,7 @@ extern void icache_inval_pou(unsigned long start, unsigned long end);
 extern void dcache_clean_inval_poc(unsigned long start, unsigned long end);
 extern void dcache_inval_poc(unsigned long start, unsigned long end);
 extern void dcache_clean_poc(unsigned long start, unsigned long end);
+extern void dcache_inval_poc_nosync(unsigned long start, unsigned long end);
 extern void dcache_clean_poc_nosync(unsigned long start, unsigned long end);
 extern void dcache_clean_pop(unsigned long start, unsigned long end);
 extern void dcache_clean_pou(unsigned long start, unsigned long end);
diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 4a7c7e03785d..99a093d3aecb 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -132,17 +132,7 @@ alternative_else_nop_endif
 	ret
 SYM_FUNC_END(dcache_clean_pou)
 
-/*
- *	dcache_inval_poc(start, end)
- *
- * 	Ensure that any D-cache lines for the interval [start, end)
- * 	are invalidated. Any partial lines at the ends of the interval are
- *	also cleaned to PoC to prevent data loss.
- *
- *	- start   - kernel start address of region
- *	- end     - kernel end address of region
- */
-SYM_FUNC_START(__pi_dcache_inval_poc)
+.macro raw_dcache_inval_poc_macro
 	dcache_line_size x2, x3
 	sub	x3, x2, #1
 	tst	x1, x3				// end cache line aligned?
@@ -158,11 +148,41 @@ SYM_FUNC_START(__pi_dcache_inval_poc)
 3:	add	x0, x0, x2
 	cmp	x0, x1
 	b.lo	2b
+.endm
+
+/*
+ *	dcache_inval_poc(start, end)
+ *
+ * 	Ensure that any D-cache lines for the interval [start, end)
+ * 	are invalidated. Any partial lines at the ends of the interval are
+ *	also cleaned to PoC to prevent data loss.
+ *
+ *	- start   - kernel start address of region
+ *	- end     - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc)
+	raw_dcache_inval_poc_macro
 	dsb	sy
 	ret
 SYM_FUNC_END(__pi_dcache_inval_poc)
 SYM_FUNC_ALIAS(dcache_inval_poc, __pi_dcache_inval_poc)
 
+/*
+ *	dcache_inval_poc_nosync(start, end)
+ *
+ * 	Issue the D-cache invalidation instructions for the interval
+ * 	[start, end). The operations are not guaranteed to be complete until
+ *	an explicit dsb sy is issued later.
+ *
+ *	- start   - kernel start address of region
+ *	- end     - kernel end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_poc_nosync)
+	raw_dcache_inval_poc_macro
+	ret
+SYM_FUNC_END(__pi_dcache_inval_poc_nosync)
+SYM_FUNC_ALIAS(dcache_inval_poc_nosync, __pi_dcache_inval_poc_nosync)
+
 /*
  *	dcache_clean_poc(start, end)
  *
-- 
2.43.0




* [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
                   ` (2 preceding siblings ...)
  2025-12-26 22:52 ` [PATCH v2 3/8] arm64: Provide dcache_inval_poc_nosync helper Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-27 20:07   ` Leon Romanovsky
  2025-12-26 22:52 ` [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_* Barry Song
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Juergen Gross, Barry Song, Stefano Stabellini, Ryan Roberts,
	Leon Romanovsky, Anshuman Khandual, Marc Zyngier, Joerg Roedel,
	linux-kernel, Tangquan Zheng, Oleksandr Tyshchenko, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
always wait for the completion of each DMA buffer. That is,
issuing the DMA sync and waiting for completion is done in a
single API call.

For scatter-gather lists with multiple entries, this means
issuing and waiting is repeated for each entry, which can hurt
performance. Architectures like ARM64 may be able to issue all
DMA sync operations for all entries first and then wait for
completion together.

To address this, arch_sync_dma_for_* now only issues the cache
maintenance operations, and completion is requested separately via a
flush. On arm64, the flush is a dsb instruction in arch_sync_dma_flush().

For now, add arch_sync_dma_flush() after each
arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
no-op on all architectures except arm64, so this patch does not
change existing behavior. Subsequent patches will introduce true
batching for SG DMA buffers.
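
After this patch every existing call site follows the same pattern
(illustrative):

	arch_sync_dma_for_device(paddr, size, dir);
	arch_sync_dma_flush();	/* dsb(sy) on arm64, no-op elsewhere */

which is equivalent to the previous behavior; later patches hoist the
flush out of the per-entry loops.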

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 arch/arm64/include/asm/cache.h |  6 ++++++
 arch/arm64/mm/dma-mapping.c    |  4 ++--
 drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
 drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
 include/linux/dma-map-ops.h    |  6 ++++++
 kernel/dma/direct.c            |  8 ++++++--
 kernel/dma/direct.h            |  9 +++++++--
 kernel/dma/swiotlb.c           |  4 +++-
 8 files changed, 73 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index dd2c8586a725..487fb7c355ed 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -87,6 +87,12 @@ int cache_line_size(void);
 
 #define dma_get_cache_alignment	cache_line_size
 
+static inline void arch_sync_dma_flush(void)
+{
+	dsb(sy);
+}
+#define arch_sync_dma_flush arch_sync_dma_flush
+
 /* Compress a u64 MPIDR value into 32 bits. */
 static inline u64 arch_compact_of_hwid(u64 id)
 {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index b2b5792b2caa..ae1ae0280eef 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -17,7 +17,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 {
 	unsigned long start = (unsigned long)phys_to_virt(paddr);
 
-	dcache_clean_poc(start, start + size);
+	dcache_clean_poc_nosync(start, start + size);
 }
 
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
@@ -28,7 +28,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 	if (dir == DMA_TO_DEVICE)
 		return;
 
-	dcache_inval_poc(start, start + size);
+	dcache_inval_poc_nosync(start, start + size);
 }
 
 void arch_dma_prep_coherent(struct page *page, size_t size)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index c92088855450..6827763a3877 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1095,8 +1095,10 @@ void iommu_dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle,
 		return;
 
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-	if (!dev_is_dma_coherent(dev))
+	if (!dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_cpu(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 
 	swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
@@ -1112,8 +1114,10 @@ void iommu_dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle,
 	phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
 	swiotlb_sync_single_for_device(dev, phys, size, dir);
 
-	if (!dev_is_dma_coherent(dev))
+	if (!dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 }
 
 void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
@@ -1122,13 +1126,16 @@ void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	if (sg_dma_is_swiotlb(sgl))
+	if (sg_dma_is_swiotlb(sgl)) {
 		for_each_sg(sgl, sg, nelems, i)
 			iommu_dma_sync_single_for_cpu(dev, sg_dma_address(sg),
 						      sg->length, dir);
-	else if (!dev_is_dma_coherent(dev))
-		for_each_sg(sgl, sg, nelems, i)
+	} else if (!dev_is_dma_coherent(dev)) {
+		for_each_sg(sgl, sg, nelems, i) {
 			arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
+			arch_sync_dma_flush();
+		}
+	}
 }
 
 void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
@@ -1143,8 +1150,10 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 							 sg_dma_address(sg),
 							 sg->length, dir);
 	else if (!dev_is_dma_coherent(dev))
-		for_each_sg(sgl, sg, nelems, i)
+		for_each_sg(sgl, sg, nelems, i) {
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
+			arch_sync_dma_flush();
+		}
 }
 
 static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
@@ -1219,8 +1228,10 @@ dma_addr_t iommu_dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
 			return DMA_MAPPING_ERROR;
 	}
 
-	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
 		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 
 	iova = __iommu_dma_map(dev, phys, size, prot, dma_mask);
 	if (iova == DMA_MAPPING_ERROR && !(attrs & DMA_ATTR_MMIO))
@@ -1242,8 +1253,10 @@ void iommu_dma_unmap_phys(struct device *dev, dma_addr_t dma_handle,
 	if (WARN_ON(!phys))
 		return;
 
-	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev))
+	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_cpu(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 
 	__iommu_dma_unmap(dev, dma_handle, size);
 
@@ -1836,8 +1849,10 @@ static int __dma_iova_link(struct device *dev, dma_addr_t addr,
 	bool coherent = dev_is_dma_coherent(dev);
 	int prot = dma_info_to_prot(dir, coherent, attrs);
 
-	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
 		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 
 	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
 			prot, GFP_ATOMIC);
@@ -2008,8 +2023,10 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
 			end - addr, iovad->granule - iova_start_pad);
 
 		if (!dev_is_dma_coherent(dev) &&
-		    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+		    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
 			arch_sync_dma_for_cpu(phys, len, dir);
+			arch_sync_dma_flush();
+		}
 
 		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index ccf25027bec1..b79917e785a5 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -262,10 +262,12 @@ static dma_addr_t xen_swiotlb_map_phys(struct device *dev, phys_addr_t phys,
 
 done:
 	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
-		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dev_addr))))
+		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dev_addr)))) {
 			arch_sync_dma_for_device(phys, size, dir);
-		else
+			arch_sync_dma_flush();
+		} else {
 			xen_dma_sync_for_device(dev, dev_addr, size, dir);
+		}
 	}
 	return dev_addr;
 }
@@ -287,10 +289,12 @@ static void xen_swiotlb_unmap_phys(struct device *hwdev, dma_addr_t dev_addr,
 	BUG_ON(dir == DMA_NONE);
 
 	if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
-		if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
+		if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr)))) {
 			arch_sync_dma_for_cpu(paddr, size, dir);
-		else
+			arch_sync_dma_flush();
+		} else {
 			xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
+		}
 	}
 
 	/* NOTE: We use dev_addr here, not paddr! */
@@ -308,10 +312,12 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, dma_addr_t dma_addr,
 	struct io_tlb_pool *pool;
 
 	if (!dev_is_dma_coherent(dev)) {
-		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr))))
+		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr)))) {
 			arch_sync_dma_for_cpu(paddr, size, dir);
-		else
+			arch_sync_dma_flush();
+		} else {
 			xen_dma_sync_for_cpu(dev, dma_addr, size, dir);
+		}
 	}
 
 	pool = xen_swiotlb_find_pool(dev, dma_addr);
@@ -331,10 +337,12 @@ xen_swiotlb_sync_single_for_device(struct device *dev, dma_addr_t dma_addr,
 		__swiotlb_sync_single_for_device(dev, paddr, size, dir, pool);
 
 	if (!dev_is_dma_coherent(dev)) {
-		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr))))
+		if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr)))) {
 			arch_sync_dma_for_device(paddr, size, dir);
-		else
+			arch_sync_dma_flush();
+		} else {
 			xen_dma_sync_for_device(dev, dma_addr, size, dir);
+		}
 	}
 }
 
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4809204c674c..e7dd8a63b40e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -361,6 +361,12 @@ static inline void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 }
 #endif /* ARCH_HAS_SYNC_DMA_FOR_CPU */
 
+#ifndef arch_sync_dma_flush
+static inline void arch_sync_dma_flush(void)
+{
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL
 void arch_sync_dma_for_cpu_all(void);
 #else
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50c3fe2a1d55..a219911c7b90 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -402,9 +402,11 @@ void dma_direct_sync_sg_for_device(struct device *dev,
 
 		swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
 
-		if (!dev_is_dma_coherent(dev))
+		if (!dev_is_dma_coherent(dev)) {
 			arch_sync_dma_for_device(paddr, sg->length,
 					dir);
+			arch_sync_dma_flush();
+		}
 	}
 }
 #endif
@@ -421,8 +423,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 	for_each_sg(sgl, sg, nents, i) {
 		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-		if (!dev_is_dma_coherent(dev))
+		if (!dev_is_dma_coherent(dev)) {
 			arch_sync_dma_for_cpu(paddr, sg->length, dir);
+			arch_sync_dma_flush();
+		}
 
 		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
 
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index da2fadf45bcd..a69326eed266 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -60,8 +60,10 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
 
 	swiotlb_sync_single_for_device(dev, paddr, size, dir);
 
-	if (!dev_is_dma_coherent(dev))
+	if (!dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_device(paddr, size, dir);
+		arch_sync_dma_flush();
+	}
 }
 
 static inline void dma_direct_sync_single_for_cpu(struct device *dev,
@@ -71,6 +73,7 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
 
 	if (!dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_cpu(paddr, size, dir);
+		arch_sync_dma_flush();
 		arch_sync_dma_for_cpu_all();
 	}
 
@@ -109,8 +112,10 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 	}
 
 	if (!dev_is_dma_coherent(dev) &&
-	    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
+	    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
 		arch_sync_dma_for_device(phys, size, dir);
+		arch_sync_dma_flush();
+	}
 	return dma_addr;
 
 err_overflow:
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index a547c7693135..7cdbfcdfef86 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -1595,8 +1595,10 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 		return DMA_MAPPING_ERROR;
 	}
 
-	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+	if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
 		arch_sync_dma_for_device(swiotlb_addr, size, dir);
+		arch_sync_dma_flush();
+	}
 	return dma_addr;
 }
 
-- 
2.43.0




* [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
                   ` (3 preceding siblings ...)
  2025-12-26 22:52 ` [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-27 20:09   ` Leon Romanovsky
  2025-12-26 22:52 ` [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg Barry Song
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, linux-kernel, Tangquan Zheng, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

Instead of performing a flush per SG entry, issue all cache
operations first and then flush once. This ultimately benefits
__dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 kernel/dma/direct.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index a219911c7b90..98bacf562ca1 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -402,12 +402,12 @@ void dma_direct_sync_sg_for_device(struct device *dev,
 
 		swiotlb_sync_single_for_device(dev, paddr, sg->length, dir);
 
-		if (!dev_is_dma_coherent(dev)) {
+		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_device(paddr, sg->length,
 					dir);
-			arch_sync_dma_flush();
-		}
 	}
+	if (!dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
 }
 #endif
 
@@ -423,10 +423,8 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 	for_each_sg(sgl, sg, nents, i) {
 		phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-		if (!dev_is_dma_coherent(dev)) {
+		if (!dev_is_dma_coherent(dev))
 			arch_sync_dma_for_cpu(paddr, sg->length, dir);
-			arch_sync_dma_flush();
-		}
 
 		swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
 
@@ -434,8 +432,10 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
 			arch_dma_mark_clean(paddr, sg->length);
 	}
 
-	if (!dev_is_dma_coherent(dev))
+	if (!dev_is_dma_coherent(dev)) {
+		arch_sync_dma_flush();
 		arch_sync_dma_for_cpu_all();
+	}
 }
 
 /*
-- 
2.43.0




* [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
                   ` (4 preceding siblings ...)
  2025-12-26 22:52 ` [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_* Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-27 20:14   ` Leon Romanovsky
  2025-12-26 22:52 ` [PATCH RFC v2 7/8] dma-iommu: Support DMA sync batch mode for IOVA link and unlink Barry Song
  2025-12-26 22:52 ` [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device} Barry Song
  7 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, linux-kernel, Tangquan Zheng, xen-devel,
	Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

Leon suggested adding a flush argument to
dma_direct_unmap_phys(), dma_direct_map_phys(), and
dma_direct_sync_single_for_cpu(). For single-buffer cases, this
would use flush=true, while for SG cases flush=false would be
used, followed by a single flush after all cache operations are
issued in dma_direct_{map,unmap}_sg().

This ultimately benefits dma_map_sg() and dma_unmap_sg().
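
A simplified sketch of the resulting pattern (illustrative; details such
as swiotlb and P2PDMA handling are omitted):

	/* single-buffer paths keep eager completion */
	dma_direct_map_phys(dev, phys, size, dir, attrs, true);

	/* SG paths defer completion and flush once */
	for_each_sg(sgl, sg, nents, i)
		sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
				sg->length, dir, attrs, false);
	if (!dev_is_dma_coherent(dev))
		arch_sync_dma_flush();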

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 kernel/dma/direct.c  | 17 +++++++++++++----
 kernel/dma/direct.h  | 16 ++++++++++------
 kernel/dma/mapping.c |  6 +++---
 3 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 98bacf562ca1..550a1a13148d 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -447,14 +447,19 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
 {
 	struct scatterlist *sg;
 	int i;
+	bool need_sync = false;
 
 	for_each_sg(sgl,  sg, nents, i) {
-		if (sg_dma_is_bus_address(sg))
+		if (sg_dma_is_bus_address(sg)) {
 			sg_dma_unmark_bus_address(sg);
-		else
+		} else {
+			need_sync = true;
 			dma_direct_unmap_phys(dev, sg->dma_address,
-					      sg_dma_len(sg), dir, attrs);
+					      sg_dma_len(sg), dir, attrs, false);
+		}
 	}
+	if (need_sync && !dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
 }
 #endif
 
@@ -464,6 +469,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 	struct pci_p2pdma_map_state p2pdma_state = {};
 	struct scatterlist *sg;
 	int i, ret;
+	bool need_sync = false;
 
 	for_each_sg(sgl, sg, nents, i) {
 		switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
@@ -475,8 +481,9 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 			 */
 			break;
 		case PCI_P2PDMA_MAP_NONE:
+			need_sync = true;
 			sg->dma_address = dma_direct_map_phys(dev, sg_phys(sg),
-					sg->length, dir, attrs);
+					sg->length, dir, attrs, false);
 			if (sg->dma_address == DMA_MAPPING_ERROR) {
 				ret = -EIO;
 				goto out_unmap;
@@ -495,6 +502,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 		sg_dma_len(sg) = sg->length;
 	}
 
+	if (need_sync && !dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
 	return nents;
 
 out_unmap:
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a69326eed266..d4ad79828090 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -67,13 +67,15 @@ static inline void dma_direct_sync_single_for_device(struct device *dev,
 }
 
 static inline void dma_direct_sync_single_for_cpu(struct device *dev,
-		dma_addr_t addr, size_t size, enum dma_data_direction dir)
+		dma_addr_t addr, size_t size, enum dma_data_direction dir,
+		bool flush)
 {
 	phys_addr_t paddr = dma_to_phys(dev, addr);
 
 	if (!dev_is_dma_coherent(dev)) {
 		arch_sync_dma_for_cpu(paddr, size, dir);
-		arch_sync_dma_flush();
+		if (flush)
+			arch_sync_dma_flush();
 		arch_sync_dma_for_cpu_all();
 	}
 
@@ -85,7 +87,7 @@ static inline void dma_direct_sync_single_for_cpu(struct device *dev,
 
 static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 		phys_addr_t phys, size_t size, enum dma_data_direction dir,
-		unsigned long attrs)
+		unsigned long attrs, bool flush)
 {
 	dma_addr_t dma_addr;
 
@@ -114,7 +116,8 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 	if (!dev_is_dma_coherent(dev) &&
 	    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
 		arch_sync_dma_for_device(phys, size, dir);
-		arch_sync_dma_flush();
+		if (flush)
+			arch_sync_dma_flush();
 	}
 	return dma_addr;
 
@@ -127,7 +130,8 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 }
 
 static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
-		size_t size, enum dma_data_direction dir, unsigned long attrs)
+		size_t size, enum dma_data_direction dir, unsigned long attrs,
+		bool flush)
 {
 	phys_addr_t phys;
 
@@ -137,7 +141,7 @@ static inline void dma_direct_unmap_phys(struct device *dev, dma_addr_t addr,
 
 	phys = dma_to_phys(dev, addr);
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-		dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+		dma_direct_sync_single_for_cpu(dev, addr, size, dir, flush);
 
 	swiotlb_tbl_unmap_single(dev, phys, size, dir,
 					 attrs | DMA_ATTR_SKIP_CPU_SYNC);
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 37163eb49f9f..d8cfa56a3cbb 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -166,7 +166,7 @@ dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
 
 	if (dma_map_direct(dev, ops) ||
 	    (!is_mmio && arch_dma_map_phys_direct(dev, phys + size)))
-		addr = dma_direct_map_phys(dev, phys, size, dir, attrs);
+		addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
 	else if (use_dma_iommu(dev))
 		addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
 	else if (ops->map_phys)
@@ -207,7 +207,7 @@ void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size,
 	BUG_ON(!valid_dma_direction(dir));
 	if (dma_map_direct(dev, ops) ||
 	    (!is_mmio && arch_dma_unmap_phys_direct(dev, addr + size)))
-		dma_direct_unmap_phys(dev, addr, size, dir, attrs);
+		dma_direct_unmap_phys(dev, addr, size, dir, attrs, true);
 	else if (use_dma_iommu(dev))
 		iommu_dma_unmap_phys(dev, addr, size, dir, attrs);
 	else if (ops->unmap_phys)
@@ -373,7 +373,7 @@ void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
 
 	BUG_ON(!valid_dma_direction(dir));
 	if (dma_map_direct(dev, ops))
-		dma_direct_sync_single_for_cpu(dev, addr, size, dir);
+		dma_direct_sync_single_for_cpu(dev, addr, size, dir, true);
 	else if (use_dma_iommu(dev))
 		iommu_dma_sync_single_for_cpu(dev, addr, size, dir);
 	else if (ops->sync_single_for_cpu)
-- 
2.43.0




* [PATCH RFC v2 7/8] dma-iommu: Support DMA sync batch mode for IOVA link and unlink
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
                   ` (5 preceding siblings ...)
  2025-12-26 22:52 ` [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-26 22:52 ` [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device} Barry Song
  7 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, Joerg Roedel, linux-kernel, Tangquan Zheng,
	xen-devel, Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

Apply batched DMA synchronization to __dma_iova_link() and
iommu_dma_iova_unlink_range_slow(). For multiple
arch_sync_dma_for_device() and arch_sync_dma_for_cpu() calls, we only
need to wait once for the completion of all sync operations, rather
than waiting for each one individually.

I do not have the hardware to test this, so it is marked as
RFC. I would greatly appreciate it if someone could test it.
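
From a caller's point of view the intended effect is roughly as below
(illustrative sketch; phys0/phys1 and len0/len1 are example values):

	dma_iova_link(dev, &state, phys0, 0, len0, dir, attrs);    /* issue only */
	dma_iova_link(dev, &state, phys1, len0, len1, dir, attrs); /* issue only */
	dma_iova_sync(dev, &state, 0, len0 + len1);  /* one flush + IOTLB sync */

with the cache-completion dsb now happening once in dma_iova_sync()
rather than inside every link.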

Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 drivers/iommu/dma-iommu.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 6827763a3877..ffa940bdbbaf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1849,10 +1849,8 @@ static int __dma_iova_link(struct device *dev, dma_addr_t addr,
 	bool coherent = dev_is_dma_coherent(dev);
 	int prot = dma_info_to_prot(dir, coherent, attrs);
 
-	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+	if (!coherent && !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
 		arch_sync_dma_for_device(phys, size, dir);
-		arch_sync_dma_flush();
-	}
 
 	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
 			prot, GFP_ATOMIC);
@@ -1995,6 +1993,8 @@ int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
 	dma_addr_t addr = state->addr + offset;
 	size_t iova_start_pad = iova_offset(iovad, addr);
 
+	if (!dev_is_dma_coherent(dev))
+		arch_sync_dma_flush();
 	return iommu_sync_map(domain, addr - iova_start_pad,
 		      iova_align(iovad, size + iova_start_pad));
 }
@@ -2008,6 +2008,8 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
 	size_t iova_start_pad = iova_offset(iovad, addr);
+	bool need_sync_dma = !dev_is_dma_coherent(dev) &&
+			!(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO));
 	dma_addr_t end = addr + size;
 
 	do {
@@ -2023,16 +2025,17 @@ static void iommu_dma_iova_unlink_range_slow(struct device *dev,
 			end - addr, iovad->granule - iova_start_pad);
 
 		if (!dev_is_dma_coherent(dev) &&
-		    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
+		    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO)))
 			arch_sync_dma_for_cpu(phys, len, dir);
-			arch_sync_dma_flush();
-		}
 
 		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
 
 		addr += len;
 		iova_start_pad = 0;
 	} while (addr < end);
+
+	if (need_sync_dma)
+		arch_sync_dma_flush();
 }
 
 static void __iommu_dma_iova_unlink(struct device *dev,
-- 
2.43.0




* [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device}
  2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
                   ` (6 preceding siblings ...)
  2025-12-26 22:52 ` [PATCH RFC v2 7/8] dma-iommu: Support DMA sync batch mode for IOVA link and unlink Barry Song
@ 2025-12-26 22:52 ` Barry Song
  2025-12-27 20:16   ` Leon Romanovsky
  7 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-26 22:52 UTC (permalink / raw)
  To: catalin.marinas, m.szyprowski, robin.murphy, will, iommu,
	linux-arm-kernel
  Cc: Barry Song, Ryan Roberts, Leon Romanovsky, Anshuman Khandual,
	Marc Zyngier, Joerg Roedel, linux-kernel, Tangquan Zheng,
	xen-devel, Suren Baghdasaryan, Ard Biesheuvel

From: Barry Song <baohua@kernel.org>

Apply batched DMA synchronization to iommu_dma_sync_sg_for_cpu() and
iommu_dma_sync_sg_for_device(). For all buffers in an SG list, only
a single flush operation is needed.

I do not have the hardware to test this, so the patch is marked as
RFC. I would greatly appreciate any testing feedback.

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Barry Song <baohua@kernel.org>
---
 drivers/iommu/dma-iommu.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ffa940bdbbaf..b68dbfcb7846 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1131,10 +1131,9 @@ void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
 			iommu_dma_sync_single_for_cpu(dev, sg_dma_address(sg),
 						      sg->length, dir);
 	} else if (!dev_is_dma_coherent(dev)) {
-		for_each_sg(sgl, sg, nelems, i) {
+		for_each_sg(sgl, sg, nelems, i)
 			arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
-			arch_sync_dma_flush();
-		}
+		arch_sync_dma_flush();
 	}
 }
 
@@ -1144,16 +1143,16 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 	struct scatterlist *sg;
 	int i;
 
-	if (sg_dma_is_swiotlb(sgl))
+	if (sg_dma_is_swiotlb(sgl)) {
 		for_each_sg(sgl, sg, nelems, i)
 			iommu_dma_sync_single_for_device(dev,
 							 sg_dma_address(sg),
 							 sg->length, dir);
-	else if (!dev_is_dma_coherent(dev))
-		for_each_sg(sgl, sg, nelems, i) {
+	} else if (!dev_is_dma_coherent(dev)) {
+		for_each_sg(sgl, sg, nelems, i)
 			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
-			arch_sync_dma_flush();
-		}
+		arch_sync_dma_flush();
+	}
 }
 
 static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
-- 
2.43.0




* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-26 22:52 ` [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting Barry Song
@ 2025-12-27 20:07   ` Leon Romanovsky
  2025-12-27 21:45     ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-27 20:07 UTC (permalink / raw)
  To: Barry Song
  Cc: Juergen Gross, Tangquan Zheng, Barry Song, Stefano Stabellini,
	Ryan Roberts, will, Anshuman Khandual, catalin.marinas,
	Joerg Roedel, linux-kernel, Suren Baghdasaryan, iommu,
	Marc Zyngier, Oleksandr Tyshchenko, xen-devel, robin.murphy,
	Ard Biesheuvel, linux-arm-kernel, m.szyprowski

On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
> 
> Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
> always wait for the completion of each DMA buffer. That is,
> issuing the DMA sync and waiting for completion is done in a
> single API call.
> 
> For scatter-gather lists with multiple entries, this means
> issuing and waiting is repeated for each entry, which can hurt
> performance. Architectures like ARM64 may be able to issue all
> DMA sync operations for all entries first and then wait for
> completion together.
> 
> To address this, arch_sync_dma_for_* now only issues the cache
> maintenance operations, and completion is requested separately via a
> flush. On arm64, the flush is a dsb instruction in arch_sync_dma_flush().
> 
> For now, add arch_sync_dma_flush() after each
> arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
> no-op on all architectures except arm64, so this patch does not
> change existing behavior. Subsequent patches will introduce true
> batching for SG DMA buffers.
> 
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
> ---
>  arch/arm64/include/asm/cache.h |  6 ++++++
>  arch/arm64/mm/dma-mapping.c    |  4 ++--
>  drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
>  drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
>  include/linux/dma-map-ops.h    |  6 ++++++
>  kernel/dma/direct.c            |  8 ++++++--
>  kernel/dma/direct.h            |  9 +++++++--
>  kernel/dma/swiotlb.c           |  4 +++-
>  8 files changed, 73 insertions(+), 25 deletions(-)

<...>

> +#ifndef arch_sync_dma_flush
> +static inline void arch_sync_dma_flush(void)
> +{
> +}
> +#endif

Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
options: they make it straightforward to inspect the entire DMA path simply
by looking at the .config.

Thanks,
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>



* Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
  2025-12-26 22:52 ` [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_* Barry Song
@ 2025-12-27 20:09   ` Leon Romanovsky
  2025-12-27 20:52     ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-27 20:09 UTC (permalink / raw)
  To: Barry Song
  Cc: Tangquan Zheng, Barry Song, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, linux-kernel, Suren Baghdasaryan, iommu,
	Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
> 
> Instead of performing a flush per SG entry, issue all cache
> operations first and then flush once. This ultimately benefits
> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> 
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
> ---
>  kernel/dma/direct.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)

<...>

> -		if (!dev_is_dma_coherent(dev)) {
> +		if (!dev_is_dma_coherent(dev))
>  			arch_sync_dma_for_device(paddr, sg->length,
>  					dir);
> -			arch_sync_dma_flush();
> -		}
>  	}
> +	if (!dev_is_dma_coherent(dev))
> +		arch_sync_dma_flush();

This patch should be squashed into the previous one. You introduced
arch_sync_dma_flush() there, and now you are placing it elsewhere.

Thanks



* Re: [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg
  2025-12-26 22:52 ` [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg Barry Song
@ 2025-12-27 20:14   ` Leon Romanovsky
  0 siblings, 0 replies; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-27 20:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Tangquan Zheng, Barry Song, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, linux-kernel, Suren Baghdasaryan, iommu,
	Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sat, Dec 27, 2025 at 11:52:46AM +1300, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
> 
> Leon suggested adding a flush argument to

Let's move this sentence out of the commit message and place it in the
changelog instead.

> dma_direct_unmap_phys(), dma_direct_map_phys(), and
> dma_direct_sync_single_for_cpu(). For single-buffer cases, this
> would use flush=true, while for SG cases flush=false would be
> used, followed by a single flush after all cache operations are
> issued in dma_direct_{map,unmap}_sg().
> 
> This ultimately benefits dma_map_sg() and dma_unmap_sg().
> 
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
> ---
>  kernel/dma/direct.c  | 17 +++++++++++++----
>  kernel/dma/direct.h  | 16 ++++++++++------
>  kernel/dma/mapping.c |  6 +++---
>  3 files changed, 26 insertions(+), 13 deletions(-)

Thanks,
Reviewed-by: Leon Romanovsky <leon@kernel.org>



* Re: [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device}
  2025-12-26 22:52 ` [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device} Barry Song
@ 2025-12-27 20:16   ` Leon Romanovsky
  2025-12-27 20:59     ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-27 20:16 UTC (permalink / raw)
  To: Barry Song
  Cc: Tangquan Zheng, Barry Song, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, Joerg Roedel, linux-kernel, Suren Baghdasaryan,
	iommu, Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sat, Dec 27, 2025 at 11:52:48AM +1300, Barry Song wrote:
> From: Barry Song <baohua@kernel.org>
> 
> Apply batched DMA synchronization to iommu_dma_sync_sg_for_cpu() and
> iommu_dma_sync_sg_for_device(). For all buffers in an SG list, only
> a single flush operation is needed.
> 
> I do not have the hardware to test this, so the patch is marked as
> RFC. I would greatly appreciate any testing feedback.
> 
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Barry Song <baohua@kernel.org>
> ---
>  drivers/iommu/dma-iommu.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index ffa940bdbbaf..b68dbfcb7846 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1131,10 +1131,9 @@ void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
>  			iommu_dma_sync_single_for_cpu(dev, sg_dma_address(sg),
>  						      sg->length, dir);
>  	} else if (!dev_is_dma_coherent(dev)) {
> -		for_each_sg(sgl, sg, nelems, i) {
> +		for_each_sg(sgl, sg, nelems, i)
>  			arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
> -			arch_sync_dma_flush();
> -		}
> +		arch_sync_dma_flush();

This and previous patches should be squashed into the one which
introduced arch_sync_dma_flush().

Thanks


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
  2025-12-27 20:09   ` Leon Romanovsky
@ 2025-12-27 20:52     ` Barry Song
  2025-12-28 14:50       ` Leon Romanovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-27 20:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tangquan Zheng, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, linux-kernel, Suren Baghdasaryan, iommu,
	Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Instead of performing a flush per SG entry, issue all cache
> > operations first and then flush once. This ultimately benefits
> > __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> >
> > Cc: Leon Romanovsky <leon@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <baohua@kernel.org>
> > ---
> >  kernel/dma/direct.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
>
> <...>
>
> > -             if (!dev_is_dma_coherent(dev)) {
> > +             if (!dev_is_dma_coherent(dev))
> >                       arch_sync_dma_for_device(paddr, sg->length,
> >                                       dir);
> > -                     arch_sync_dma_flush();
> > -             }
> >       }
> > +     if (!dev_is_dma_coherent(dev))
> > +             arch_sync_dma_flush();
>
> This patch should be squashed into the previous one. You introduced
> arch_sync_dma_flush() there, and now you are placing it elsewhere.

Hi Leon,

The previous patch replaces all arch_sync_dma_for_* calls with
arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
functional change. The subsequent patches then implement the
actual batching. I feel this is a better approach for reviewing
each change independently. Otherwise, the previous patch would
be too large.

Thanks
Barry
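
As a concrete illustration of the two steps, the sg-sync path looks roughly
like this before and after the batching patch. This is simplified from the
quoted hunks: swiotlb handling is omitted and the sketch_* names are
placeholders, not real kernel helpers.

#include <linux/dma-map-ops.h>
#include <linux/scatterlist.h>

/* After patch 4 only: no functional change, each issue is followed by a wait. */
static void sketch_sync_sg_per_entry(struct device *dev, struct scatterlist *sgl,
				     int nelems, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nelems, i) {
		if (!dev_is_dma_coherent(dev)) {
			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
			arch_sync_dma_flush();
		}
	}
}

/* After patch 5: issue the cache operations for all entries, then wait once. */
static void sketch_sync_sg_batched(struct device *dev, struct scatterlist *sgl,
				   int nelems, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nelems, i)
		if (!dev_is_dma_coherent(dev))
			arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
	if (!dev_is_dma_coherent(dev))
		arch_sync_dma_flush();
}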


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device}
  2025-12-27 20:16   ` Leon Romanovsky
@ 2025-12-27 20:59     ` Barry Song
  0 siblings, 0 replies; 21+ messages in thread
From: Barry Song @ 2025-12-27 20:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Tangquan Zheng, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, Joerg Roedel, linux-kernel, Suren Baghdasaryan,
	iommu, Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sun, Dec 28, 2025 at 9:16 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sat, Dec 27, 2025 at 11:52:48AM +1300, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Apply batched DMA synchronization to iommu_dma_sync_sg_for_cpu() and
> > iommu_dma_sync_sg_for_device(). For all buffers in an SG list, only
> > a single flush operation is needed.
> >
> > I do not have the hardware to test this, so the patch is marked as
> > RFC. I would greatly appreciate any testing feedback.
> >
> > Cc: Leon Romanovsky <leon@kernel.org>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <baohua@kernel.org>
> > ---
> >  drivers/iommu/dma-iommu.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index ffa940bdbbaf..b68dbfcb7846 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1131,10 +1131,9 @@ void iommu_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sgl,
> >                       iommu_dma_sync_single_for_cpu(dev, sg_dma_address(sg),
> >                                                     sg->length, dir);
> >       } else if (!dev_is_dma_coherent(dev)) {
> > -             for_each_sg(sgl, sg, nelems, i) {
> > +             for_each_sg(sgl, sg, nelems, i)
> >                       arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
> > -                     arch_sync_dma_flush();
> > -             }
> > +             arch_sync_dma_flush();
>
> This and previous patches should be squashed into the one which
> introduced arch_sync_dma_flush().

Hi Leon,

The series is structured to first introduce no functional change by
replacing all arch_sync_dma_for_* calls with arch_sync_dma_for_* plus
arch_sync_dma_flush(). Subsequent patches then add batching for
different scenarios as separate changes.

Another issue is that I was unable to find a board that both runs
mainline and exercises the IOMMU paths affected by these changes.
As a result, patches 7 and 8 are marked as RFC, while the other
patches have been tested on a real board running mainline + changes.

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-27 20:07   ` Leon Romanovsky
@ 2025-12-27 21:45     ` Barry Song
  2025-12-28 14:49       ` Leon Romanovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2025-12-27 21:45 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Juergen Gross, Tangquan Zheng, Stefano Stabellini, Ryan Roberts,
	will, Anshuman Khandual, catalin.marinas, Joerg Roedel,
	linux-kernel, Suren Baghdasaryan, iommu, Marc Zyngier,
	Oleksandr Tyshchenko, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sun, Dec 28, 2025 at 9:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
> > From: Barry Song <baohua@kernel.org>
> >
> > Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
> > always wait for the completion of each DMA buffer. That is,
> > issuing the DMA sync and waiting for completion is done in a
> > single API call.
> >
> > For scatter-gather lists with multiple entries, this means
> > issuing and waiting is repeated for each entry, which can hurt
> > performance. Architectures like ARM64 may be able to issue all
> > DMA sync operations for all entries first and then wait for
> > completion together.
> >
> > To address this, arch_sync_dma_for_* now issues DMA operations in
> > batch, followed by a flush. On ARM64, the flush is implemented
> > using a dsb instruction within arch_sync_dma_flush().
> >
> > For now, add arch_sync_dma_flush() after each
> > arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
> > no-op on all architectures except arm64, so this patch does not
> > change existing behavior. Subsequent patches will introduce true
> > batching for SG DMA buffers.
> >
> > Cc: Leon Romanovsky <leon@kernel.org>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Juergen Gross <jgross@suse.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > Signed-off-by: Barry Song <baohua@kernel.org>
> > ---
> >  arch/arm64/include/asm/cache.h |  6 ++++++
> >  arch/arm64/mm/dma-mapping.c    |  4 ++--
> >  drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
> >  drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
> >  include/linux/dma-map-ops.h    |  6 ++++++
> >  kernel/dma/direct.c            |  8 ++++++--
> >  kernel/dma/direct.h            |  9 +++++++--
> >  kernel/dma/swiotlb.c           |  4 +++-
> >  8 files changed, 73 insertions(+), 25 deletions(-)
>
> <...>
>
> > +#ifndef arch_sync_dma_flush
> > +static inline void arch_sync_dma_flush(void)
> > +{
> > +}
> > +#endif
>
> Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
> options: they make it straightforward to inspect the entire DMA path simply
> by looking at the .config.

I am not quite sure how much this benefits users, as the same
information could also be obtained by grepping for
#define arch_sync_dma_flush in the source code.

>
> Thanks,
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>

Thanks very much, Leon, for reviewing this over the weekend. One thing
you might have missed is that I place arch_sync_dma_flush() after all
arch_sync_dma_for_*() calls, for both single and sg cases. I also
used a Python script to scan the code and verify that every
arch_sync_dma_for_*() is followed by arch_sync_dma_flush(), to ensure
that no call is left out.

In the subsequent patches, for sg cases, the per-entry flush is
replaced by a single flush of the entire sg. Each sg case has
different characteristics: some are straightforward, while others
can be tricky and involve additional contexts.

Thanks
Barry
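
For reference, the #define-based override being discussed amounts to the
generic fallback from the quoted hunk plus a per-arch definition. The arm64
body below is only a sketch: the commit message says the flush is a dsb
instruction, but the exact barrier domain here is an assumption, not copied
from the actual patch.

/* Generic fallback, as in the quoted hunk for include/linux/dma-map-ops.h. */
#ifndef arch_sync_dma_flush
static inline void arch_sync_dma_flush(void)
{
}
#endif

/* arm64 override -- sketch only; the real definition lives in the series. */
#define arch_sync_dma_flush arch_sync_dma_flush
static inline void arch_sync_dma_flush(void)
{
	dsb(sy);	/* wait for previously issued cache maintenance to complete */
}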


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-27 21:45     ` Barry Song
@ 2025-12-28 14:49       ` Leon Romanovsky
  2025-12-28 21:38         ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-28 14:49 UTC (permalink / raw)
  To: Barry Song
  Cc: Juergen Gross, Tangquan Zheng, Stefano Stabellini, Ryan Roberts,
	will, Anshuman Khandual, catalin.marinas, Joerg Roedel,
	linux-kernel, Suren Baghdasaryan, iommu, Marc Zyngier,
	Oleksandr Tyshchenko, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sun, Dec 28, 2025 at 10:45:13AM +1300, Barry Song wrote:
> On Sun, Dec 28, 2025 at 9:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
> > > From: Barry Song <baohua@kernel.org>
> > >
> > > Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
> > > always wait for the completion of each DMA buffer. That is,
> > > issuing the DMA sync and waiting for completion is done in a
> > > single API call.
> > >
> > > For scatter-gather lists with multiple entries, this means
> > > issuing and waiting is repeated for each entry, which can hurt
> > > performance. Architectures like ARM64 may be able to issue all
> > > DMA sync operations for all entries first and then wait for
> > > completion together.
> > >
> > > To address this, arch_sync_dma_for_* now issues DMA operations in
> > > batch, followed by a flush. On ARM64, the flush is implemented
> > > using a dsb instruction within arch_sync_dma_flush().
> > >
> > > For now, add arch_sync_dma_flush() after each
> > > arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
> > > no-op on all architectures except arm64, so this patch does not
> > > change existing behavior. Subsequent patches will introduce true
> > > batching for SG DMA buffers.
> > >
> > > Cc: Leon Romanovsky <leon@kernel.org>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Joerg Roedel <joro@8bytes.org>
> > > Cc: Juergen Gross <jgross@suse.com>
> > > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > > Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > ---
> > >  arch/arm64/include/asm/cache.h |  6 ++++++
> > >  arch/arm64/mm/dma-mapping.c    |  4 ++--
> > >  drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
> > >  drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
> > >  include/linux/dma-map-ops.h    |  6 ++++++
> > >  kernel/dma/direct.c            |  8 ++++++--
> > >  kernel/dma/direct.h            |  9 +++++++--
> > >  kernel/dma/swiotlb.c           |  4 +++-
> > >  8 files changed, 73 insertions(+), 25 deletions(-)
> >
> > <...>
> >
> > > +#ifndef arch_sync_dma_flush
> > > +static inline void arch_sync_dma_flush(void)
> > > +{
> > > +}
> > > +#endif
> >
> > Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
> > options: they make it straightforward to inspect the entire DMA path simply
> > by looking at the .config.
> 
> I am not quite sure how much this benefits users, as the same
> information could also be obtained by grepping for
> #define arch_sync_dma_flush in the source code.

It differs slightly. Users no longer need to grep around or guess whether this
platform used the arch_sync_dma_flush path. A simple grep for ARCH_HAVE_ in
/proc/config.gz provides the answer.

> 
> >
> > Thanks,
> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> 
> Thanks very much, Leon, for reviewing this over the weekend. One thing
> you might have missed is that I place arch_sync_dma_flush() after all
> arch_sync_dma_for_*() calls, for both single and sg cases. I also
> used a Python script to scan the code and verify that every
> arch_sync_dma_for_*() is followed by arch_sync_dma_flush(), to ensure
> that no call is left out.
> 
> In the subsequent patches, for sg cases, the per-entry flush is
> replaced by a single flush of the entire sg. Each sg case has
> different characteristics: some are straightforward, while others
> can be tricky and involve additional contexts.

I didn't overlook it, and I understand your rationale. However, this is
not how kernel patches should be structured. You should not introduce
code in patch X and then move it elsewhere in patch X + Y.

Place the code in the correct location from the start. Your patches are
small enough to review as is.

Thanks
> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
  2025-12-27 20:52     ` Barry Song
@ 2025-12-28 14:50       ` Leon Romanovsky
  0 siblings, 0 replies; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-28 14:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Tangquan Zheng, Ryan Roberts, will, Anshuman Khandual,
	catalin.marinas, linux-kernel, Suren Baghdasaryan, iommu,
	Marc Zyngier, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > > From: Barry Song <baohua@kernel.org>
> > >
> > > Instead of performing a flush per SG entry, issue all cache
> > > operations first and then flush once. This ultimately benefits
> > > __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >
> > > Cc: Leon Romanovsky <leon@kernel.org>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > Cc: Marc Zyngier <maz@kernel.org>
> > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > ---
> > >  kernel/dma/direct.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > <...>
> >
> > > -             if (!dev_is_dma_coherent(dev)) {
> > > +             if (!dev_is_dma_coherent(dev))
> > >                       arch_sync_dma_for_device(paddr, sg->length,
> > >                                       dir);
> > > -                     arch_sync_dma_flush();
> > > -             }
> > >       }
> > > +     if (!dev_is_dma_coherent(dev))
> > > +             arch_sync_dma_flush();
> >
> > This patch should be squashed into the previous one. You introduced
> > arch_sync_dma_flush() there, and now you are placing it elsewhere.
> 
> Hi Leon,
> 
> The previous patch replaces all arch_sync_dma_for_* calls with
> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> functional change. The subsequent patches then implement the
> actual batching. I feel this is a better approach for reviewing
> each change independently. Otherwise, the previous patch would
> be too large.

Don't worry about it. Your patches are small enough.

> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-28 14:49       ` Leon Romanovsky
@ 2025-12-28 21:38         ` Barry Song
  2025-12-29 14:40           ` Leon Romanovsky
  2025-12-31 14:43           ` Marek Szyprowski
  0 siblings, 2 replies; 21+ messages in thread
From: Barry Song @ 2025-12-28 21:38 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Juergen Gross, Tangquan Zheng, Stefano Stabellini, Ryan Roberts,
	will, Anshuman Khandual, catalin.marinas, Joerg Roedel,
	linux-kernel, Suren Baghdasaryan, iommu, Marc Zyngier,
	Oleksandr Tyshchenko, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Mon, Dec 29, 2025 at 3:49 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Dec 28, 2025 at 10:45:13AM +1300, Barry Song wrote:
> > On Sun, Dec 28, 2025 at 9:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
> > > > From: Barry Song <baohua@kernel.org>
> > > >
> > > > Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
> > > > always wait for the completion of each DMA buffer. That is,
> > > > issuing the DMA sync and waiting for completion is done in a
> > > > single API call.
> > > >
> > > > For scatter-gather lists with multiple entries, this means
> > > > issuing and waiting is repeated for each entry, which can hurt
> > > > performance. Architectures like ARM64 may be able to issue all
> > > > DMA sync operations for all entries first and then wait for
> > > > completion together.
> > > >
> > > > To address this, arch_sync_dma_for_* now issues DMA operations in
> > > > batch, followed by a flush. On ARM64, the flush is implemented
> > > > using a dsb instruction within arch_sync_dma_flush().
> > > >
> > > > For now, add arch_sync_dma_flush() after each
> > > > arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
> > > > no-op on all architectures except arm64, so this patch does not
> > > > change existing behavior. Subsequent patches will introduce true
> > > > batching for SG DMA buffers.
> > > >
> > > > Cc: Leon Romanovsky <leon@kernel.org>
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > Cc: Joerg Roedel <joro@8bytes.org>
> > > > Cc: Juergen Gross <jgross@suse.com>
> > > > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > > > Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > > ---
> > > >  arch/arm64/include/asm/cache.h |  6 ++++++
> > > >  arch/arm64/mm/dma-mapping.c    |  4 ++--
> > > >  drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
> > > >  drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
> > > >  include/linux/dma-map-ops.h    |  6 ++++++
> > > >  kernel/dma/direct.c            |  8 ++++++--
> > > >  kernel/dma/direct.h            |  9 +++++++--
> > > >  kernel/dma/swiotlb.c           |  4 +++-
> > > >  8 files changed, 73 insertions(+), 25 deletions(-)
> > >
> > > <...>
> > >
> > > > +#ifndef arch_sync_dma_flush
> > > > +static inline void arch_sync_dma_flush(void)
> > > > +{
> > > > +}
> > > > +#endif
> > >
> > > Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
> > > options: they make it straightforward to inspect the entire DMA path simply
> > > by looking at the .config.
> >
> > I am not quite sure how much this benefits users, as the same
> > information could also be obtained by grepping for
> > #define arch_sync_dma_flush in the source code.
>
> It differs slightly. Users no longer need to grep around or guess whether this
> platform used the arch_sync_dma_flush path. A simple grep for ARCH_HAVE_ in
> /proc/config.gz provides the answer.

In any case, it is only two or three lines of code, so I am fine with
either approach. Perhaps Marek, Robin, and others have a point here?

>
> >
> > >
> > > Thanks,
> > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> >
> > Thanks very much, Leon, for reviewing this over the weekend. One thing
> > you might have missed is that I place arch_sync_dma_flush() after all
> > arch_sync_dma_for_*() calls, for both single and sg cases. I also
> > used a Python script to scan the code and verify that every
> > arch_sync_dma_for_*() is followed by arch_sync_dma_flush(), to ensure
> > that no call is left out.
> >
> > In the subsequent patches, for sg cases, the per-entry flush is
> > replaced by a single flush of the entire sg. Each sg case has
> > different characteristics: some are straightforward, while others
> > can be tricky and involve additional contexts.
>
> I didn't overlook it, and I understand your rationale. However, this is
> not how kernel patches should be structured. You should not introduce
> code in patch X and then move it elsewhere in patch X + Y.

I am not quite convinced by this concern. This patch only
separates DMA sync issuing from completion waiting, and it
reflects that the development is done step by step.

>
> Place the code in the correct location from the start. Your patches are
> small enough to review as is.

My point is that this patch places the code in the correct locations
from the start. It splits arch_sync_dma_for_*() into
arch_sync_dma_for_*() plus arch_sync_dma_flush() everywhere, without
introducing any functional changes from the outset.
The subsequent patches clearly show which parts are truly batched.

In the meantime, I do not have a strong preference here. If you think
it is better to move some of the straightforward batching code here,
I can follow that approach. Perhaps I could move patch 5, patch 8,
and the iommu_dma_iova_unlink_range_slow change from patch 7 here,
while keeping

  [PATCH 6] dma-mapping: Support batch mode for
  dma_direct_{map,unmap}_sg

and the IOVA link part from patch 7 as separate patches, since that
part is not straightforward. The IOVA link changes affect both
__dma_iova_link() and dma_iova_sync(), which are two separate
functions and require a deeper understanding of the contexts to
determine correctness. That part also lacks testing.

Would that be okay with you?

Thanks
Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-28 21:38         ` Barry Song
@ 2025-12-29 14:40           ` Leon Romanovsky
  2025-12-31 14:43           ` Marek Szyprowski
  1 sibling, 0 replies; 21+ messages in thread
From: Leon Romanovsky @ 2025-12-29 14:40 UTC (permalink / raw)
  To: Barry Song
  Cc: Juergen Gross, Tangquan Zheng, Stefano Stabellini, Ryan Roberts,
	will, Anshuman Khandual, catalin.marinas, Joerg Roedel,
	linux-kernel, Suren Baghdasaryan, iommu, Marc Zyngier,
	Oleksandr Tyshchenko, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel, m.szyprowski

On Mon, Dec 29, 2025 at 10:38:26AM +1300, Barry Song wrote:
> On Mon, Dec 29, 2025 at 3:49 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sun, Dec 28, 2025 at 10:45:13AM +1300, Barry Song wrote:
> > > On Sun, Dec 28, 2025 at 9:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
> > > > > From: Barry Song <baohua@kernel.org>
> > > > >
> > > > > Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
> > > > > always wait for the completion of each DMA buffer. That is,
> > > > > issuing the DMA sync and waiting for completion is done in a
> > > > > single API call.
> > > > >
> > > > > For scatter-gather lists with multiple entries, this means
> > > > > issuing and waiting is repeated for each entry, which can hurt
> > > > > performance. Architectures like ARM64 may be able to issue all
> > > > > DMA sync operations for all entries first and then wait for
> > > > > completion together.
> > > > >
> > > > > To address this, arch_sync_dma_for_* now issues DMA operations in
> > > > > batch, followed by a flush. On ARM64, the flush is implemented
> > > > > using a dsb instruction within arch_sync_dma_flush().
> > > > >
> > > > > For now, add arch_sync_dma_flush() after each
> > > > > arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
> > > > > no-op on all architectures except arm64, so this patch does not
> > > > > change existing behavior. Subsequent patches will introduce true
> > > > > batching for SG DMA buffers.
> > > > >
> > > > > Cc: Leon Romanovsky <leon@kernel.org>
> > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > > Cc: Will Deacon <will@kernel.org>
> > > > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > > > Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > > > > Cc: Ard Biesheuvel <ardb@kernel.org>
> > > > > Cc: Marc Zyngier <maz@kernel.org>
> > > > > Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > Cc: Joerg Roedel <joro@8bytes.org>
> > > > > Cc: Juergen Gross <jgross@suse.com>
> > > > > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > > > > Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
> > > > > Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > > > > Signed-off-by: Barry Song <baohua@kernel.org>
> > > > > ---
> > > > >  arch/arm64/include/asm/cache.h |  6 ++++++
> > > > >  arch/arm64/mm/dma-mapping.c    |  4 ++--
> > > > >  drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
> > > > >  drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
> > > > >  include/linux/dma-map-ops.h    |  6 ++++++
> > > > >  kernel/dma/direct.c            |  8 ++++++--
> > > > >  kernel/dma/direct.h            |  9 +++++++--
> > > > >  kernel/dma/swiotlb.c           |  4 +++-
> > > > >  8 files changed, 73 insertions(+), 25 deletions(-)
> > > >
> > > > <...>
> > > >
> > > > > +#ifndef arch_sync_dma_flush
> > > > > +static inline void arch_sync_dma_flush(void)
> > > > > +{
> > > > > +}
> > > > > +#endif
> > > >
> > > > Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
> > > > options: they make it straightforward to inspect the entire DMA path simply
> > > > by looking at the .config.
> > >
> > > I am not quite sure how much this benefits users, as the same
> > > information could also be obtained by grepping for
> > > #define arch_sync_dma_flush in the source code.
> >
> > It differs slightly. Users no longer need to grep around or guess whether this
> > platform used the arch_sync_dma_flush path. A simple grep for ARCH_HAVE_ in
> > /proc/config.gz provides the answer.
> 
> In any case, it is only two or three lines of code, so I am fine with
> either approach. Perhaps Marek, Robin, and others have a point here?
> 
> >
> > >
> > > >
> > > > Thanks,
> > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Thanks very much, Leon, for reviewing this over the weekend. One thing
> > > you might have missed is that I place arch_sync_dma_flush() after all
> > > arch_sync_dma_for_*() calls, for both single and sg cases. I also
> > > used a Python script to scan the code and verify that every
> > > arch_sync_dma_for_*() is followed by arch_sync_dma_flush(), to ensure
> > > that no call is left out.
> > >
> > > In the subsequent patches, for sg cases, the per-entry flush is
> > > replaced by a single flush of the entire sg. Each sg case has
> > > different characteristics: some are straightforward, while others
> > > can be tricky and involve additional contexts.
> >
> > I didn't overlook it, and I understand your rationale. However, this is
> > not how kernel patches should be structured. You should not introduce
> > code in patch X and then move it elsewhere in patch X + Y.
> 
> I am not quite convinced by this concern. This patch only
> separates DMA sync issuing from completion waiting, and it
> reflects that the development is done step by step.
> 
> >
> > Place the code in the correct location from the start. Your patches are
> > small enough to review as is.
> 
> My point is that this patch places the code in the correct locations
> from the start. It splits arch_sync_dma_for_*() into
> arch_sync_dma_for_*() plus arch_sync_dma_flush() everywhere, without
> introducing any functional changes from the outset.
> The subsequent patches clearly show which parts are truly batched.
> 
> In the meantime, I do not have a strong preference here. If you think
> it is better to move some of the straightforward batching code here,
> I can follow that approach. Perhaps I could move patch 5, patch 8,
> and the iommu_dma_iova_unlink_range_slow change from patch 7 here,
> while keeping
> 
>   [PATCH 6] dma-mapping: Support batch mode for
>   dma_direct_{map,unmap}_sg
> 
> and the IOVA link part from patch 7 as separate patches, since that
> part is not straightforward. The IOVA link changes affect both
> __dma_iova_link() and dma_iova_sync(), which are two separate
> functions and require a deeper understanding of the contexts to
> determine correctness. That part also lacks testing.

Don't worry about testing. NVME, RDMA and GPU are using this path
and someone will test it.

> 
> Would that be okay with you?

I don't know, need to see the code.

Thanks

> 
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting
  2025-12-28 21:38         ` Barry Song
  2025-12-29 14:40           ` Leon Romanovsky
@ 2025-12-31 14:43           ` Marek Szyprowski
  1 sibling, 0 replies; 21+ messages in thread
From: Marek Szyprowski @ 2025-12-31 14:43 UTC (permalink / raw)
  To: Barry Song, Leon Romanovsky
  Cc: Juergen Gross, Tangquan Zheng, Stefano Stabellini, Ryan Roberts,
	will, Anshuman Khandual, catalin.marinas, Joerg Roedel,
	linux-kernel, Suren Baghdasaryan, iommu, Marc Zyngier,
	Oleksandr Tyshchenko, xen-devel, robin.murphy, Ard Biesheuvel,
	linux-arm-kernel

On 28.12.2025 22:38, Barry Song wrote:
> On Mon, Dec 29, 2025 at 3:49 AM Leon Romanovsky <leon@kernel.org> wrote:
>> On Sun, Dec 28, 2025 at 10:45:13AM +1300, Barry Song wrote:
>>> On Sun, Dec 28, 2025 at 9:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Sat, Dec 27, 2025 at 11:52:44AM +1300, Barry Song wrote:
>>>>> From: Barry Song <baohua@kernel.org>
>>>>>
>>>>> Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device
>>>>> always wait for the completion of each DMA buffer. That is,
>>>>> issuing the DMA sync and waiting for completion is done in a
>>>>> single API call.
>>>>>
>>>>> For scatter-gather lists with multiple entries, this means
>>>>> issuing and waiting is repeated for each entry, which can hurt
>>>>> performance. Architectures like ARM64 may be able to issue all
>>>>> DMA sync operations for all entries first and then wait for
>>>>> completion together.
>>>>>
>>>>> To address this, arch_sync_dma_for_* now issues DMA operations in
>>>>> batch, followed by a flush. On ARM64, the flush is implemented
>>>>> using a dsb instruction within arch_sync_dma_flush().
>>>>>
>>>>> For now, add arch_sync_dma_flush() after each
>>>>> arch_sync_dma_for_*() call. arch_sync_dma_flush() is defined as a
>>>>> no-op on all architectures except arm64, so this patch does not
>>>>> change existing behavior. Subsequent patches will introduce true
>>>>> batching for SG DMA buffers.
>>>>>
>>>>> Cc: Leon Romanovsky <leon@kernel.org>
>>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>>> Cc: Will Deacon <will@kernel.org>
>>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
>>>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>> Cc: Marc Zyngier <maz@kernel.org>
>>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>> Cc: Suren Baghdasaryan <surenb@google.com>
>>>>> Cc: Joerg Roedel <joro@8bytes.org>
>>>>> Cc: Juergen Gross <jgross@suse.com>
>>>>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>>>>> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
>>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
>>>>> Signed-off-by: Barry Song <baohua@kernel.org>
>>>>> ---
>>>>>   arch/arm64/include/asm/cache.h |  6 ++++++
>>>>>   arch/arm64/mm/dma-mapping.c    |  4 ++--
>>>>>   drivers/iommu/dma-iommu.c      | 37 +++++++++++++++++++++++++---------
>>>>>   drivers/xen/swiotlb-xen.c      | 24 ++++++++++++++--------
>>>>>   include/linux/dma-map-ops.h    |  6 ++++++
>>>>>   kernel/dma/direct.c            |  8 ++++++--
>>>>>   kernel/dma/direct.h            |  9 +++++++--
>>>>>   kernel/dma/swiotlb.c           |  4 +++-
>>>>>   8 files changed, 73 insertions(+), 25 deletions(-)
>>>> <...>
>>>>
>>>>> +#ifndef arch_sync_dma_flush
>>>>> +static inline void arch_sync_dma_flush(void)
>>>>> +{
>>>>> +}
>>>>> +#endif
>>>> Over the weekend I realized a useful advantage of the ARCH_HAVE_* config
>>>> options: they make it straightforward to inspect the entire DMA path simply
>>>> by looking at the .config.
>>> I am not quite sure how much this benefits users, as the same
>>> information could also be obtained by grepping for
>>> #define arch_sync_dma_flush in the source code.
>> It differs slightly. Users no longer need to grep around or guess whether this
>> platform used the arch_sync_dma_flush path. A simple grep for ARCH_HAVE_ in
>> /proc/config.gz provides the answer.
> In any case, it is only two or three lines of code, so I am fine with
> either approach. Perhaps Marek, Robin, and others have a point here?

If possible I would suggest following the style already used in the
given code, even if it means a slightly larger patch.

>>>> Thanks,
>>>> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
>>> Thanks very much, Leon, for reviewing this over the weekend. One thing
>>> you might have missed is that I place arch_sync_dma_flush() after all
>>> arch_sync_dma_for_*() calls, for both single and sg cases. I also
>>> used a Python script to scan the code and verify that every
>>> arch_sync_dma_for_*() is followed by arch_sync_dma_flush(), to ensure
>>> that no call is left out.
>>>
>>> In the subsequent patches, for sg cases, the per-entry flush is
>>> replaced by a single flush of the entire sg. Each sg case has
>>> different characteristics: some are straightforward, while others
>>> can be tricky and involve additional contexts.
>> I didn't overlook it, and I understand your rationale. However, this is
>> not how kernel patches should be structured. You should not introduce
>> code in patch X and then move it elsewhere in patch X + Y.
> I am not quite convinced by this concern. This patch only
> separates DMA sync issuing from completion waiting, and it
> reflects that the development is done step by step.
>
>> Place the code in the correct location from the start. Your patches are
>> small enough to review as is.
> My point is that this patch places the code in the correct locations
> from the start. It splits arch_sync_dma_for_*() into
> arch_sync_dma_for_*() plus arch_sync_dma_flush() everywhere, without
> introducing any functional changes from the outset.
> The subsequent patches clearly show which parts are truly batched.
>
> In the meantime, I do not have a strong preference here. If you think
> it is better to move some of the straightforward batching code here,
> I can follow that approach. Perhaps I could move patch 5, patch 8,
> and the iommu_dma_iova_unlink_range_slow change from patch 7 here,
> while keeping
>
>    [PATCH 6] dma-mapping: Support batch mode for
>    dma_direct_{map,unmap}_sg
>
> and the IOVA link part from patch 7 as separate patches, since that
> part is not straightforward. The IOVA link changes affect both
> __dma_iova_link() and dma_iova_sync(), which are two separate
> functions and require a deeper understanding of the contexts to
> determine correctness. That part also lacks testing.
>
> Would that be okay with you?

Yes, this will be okay. The changes are easy to understand, so we don't
need to take such very small steps.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-12-31 14:43 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-26 22:52 ` [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
2025-12-26 22:52 ` [PATCH v2 2/8] arm64: Provide dcache_clean_poc_nosync helper Barry Song
2025-12-26 22:52 ` [PATCH v2 3/8] arm64: Provide dcache_inval_poc_nosync helper Barry Song
2025-12-26 22:52 ` [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting Barry Song
2025-12-27 20:07   ` Leon Romanovsky
2025-12-27 21:45     ` Barry Song
2025-12-28 14:49       ` Leon Romanovsky
2025-12-28 21:38         ` Barry Song
2025-12-29 14:40           ` Leon Romanovsky
2025-12-31 14:43           ` Marek Szyprowski
2025-12-26 22:52 ` [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_* Barry Song
2025-12-27 20:09   ` Leon Romanovsky
2025-12-27 20:52     ` Barry Song
2025-12-28 14:50       ` Leon Romanovsky
2025-12-26 22:52 ` [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg Barry Song
2025-12-27 20:14   ` Leon Romanovsky
2025-12-26 22:52 ` [PATCH RFC v2 7/8] dma-iommu: Support DMA sync batch mode for IOVA link and unlink Barry Song
2025-12-26 22:52 ` [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device} Barry Song
2025-12-27 20:16   ` Leon Romanovsky
2025-12-27 20:59     ` Barry Song
