From: Leon Romanovsky <leon@kernel.org>
To: Barry Song <21cnbao@gmail.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>,
Ryan Roberts <ryan.roberts@arm.com>,
will@kernel.org, Anshuman Khandual <anshuman.khandual@arm.com>,
catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
Suren Baghdasaryan <surenb@google.com>,
iommu@lists.linux.dev, Marc Zyngier <maz@kernel.org>,
xen-devel@lists.xenproject.org,
Robin Murphy <robin.murphy@arm.com>,
Ard Biesheuvel <ardb@kernel.org>,
linux-arm-kernel@lists.infradead.org, m.szyprowski@samsung.com
Subject: Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Date: Wed, 7 Jan 2026 09:54:14 +0200 [thread overview]
Message-ID: <20260107075414.GA11783@unreal> (raw)
In-Reply-To: <CAGsJ_4xYqseJMFXOU39JJW4Lk2ZHXAnRJLhZdVuFLxAi=Dy5sw@mail.gmail.com>
On Wed, Jan 07, 2026 at 08:47:36AM +1300, Barry Song wrote:
> On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 2026-01-06 6:41 pm, Barry Song wrote:
> > > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>
> > >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> > >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>>>
> > >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > >>>>> From: Barry Song <baohua@kernel.org>
> > >>>>>
> > >>>>> Instead of performing a flush per SG entry, issue all cache
> > >>>>> operations first and then flush once. This ultimately benefits
> > >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >>>>>
> > >>>>> Cc: Leon Romanovsky <leon@kernel.org>
> > >>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
> > >>>>> Cc: Will Deacon <will@kernel.org>
> > >>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > >>>>> Cc: Robin Murphy <robin.murphy@arm.com>
> > >>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > >>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >>>>> Cc: Marc Zyngier <maz@kernel.org>
> > >>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> > >>>>> Cc: Suren Baghdasaryan <surenb@google.com>
> > >>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > >>>>> Signed-off-by: Barry Song <baohua@kernel.org>
> > >>>>> ---
> > >>>>> kernel/dma/direct.c | 14 +++++++-------
> > >>>>> 1 file changed, 7 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> <...>
> > >>>>
> > >>>>> - if (!dev_is_dma_coherent(dev)) {
> > >>>>> + if (!dev_is_dma_coherent(dev))
> > >>>>> arch_sync_dma_for_device(paddr, sg->length,
> > >>>>> dir);
> > >>>>> - arch_sync_dma_flush();
> > >>>>> - }
> > >>>>> }
> > >>>>> + if (!dev_is_dma_coherent(dev))
> > >>>>> + arch_sync_dma_flush();
> > >>>>
> > >>>> This patch should be squashed into the previous one. You introduced
> > >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> > >>>
> > >>> Hi Leon,
> > >>>
> > >>> The previous patch replaces all arch_sync_dma_for_* calls with
> > >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> > >>> functional change. The subsequent patches then implement the
> > >>> actual batching. I feel this is a better approach for reviewing
> > >>> each change independently. Otherwise, the previous patch would
> > >>> be too large.
> > >>
> > >> Don't worry about it. Your patches are small enough.
> > >
> > > My hardware does not require a bounce buffer, but I am concerned that
> > > this patch may be incorrect for systems that do require one.
> > >
> > > Now it is:
> > >
> > > void dma_direct_sync_sg_for_cpu(struct device *dev,
> > > struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > > {
> > > struct scatterlist *sg;
> > > int i;
> > >
> > > for_each_sg(sgl, sg, nents, i) {
> > > phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> > >
> > > if (!dev_is_dma_coherent(dev))
> > > arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > >
> > > swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> > >
> > > if (dir == DMA_FROM_DEVICE)
> > > arch_dma_mark_clean(paddr, sg->length);
> > > }
> > >
> > > if (!dev_is_dma_coherent(dev)) {
> > > arch_sync_dma_flush();
> > > arch_sync_dma_for_cpu_all();
> > > }
> > > }
> > >
> > > Should we call swiotlb_sync_single_for_cpu() and
> > > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > > latest data and that the memcpy is correct? I mean:
> >
> > Yes, this and the equivalents in the later patches are broken for all
> > the sync_for_cpu and unmap paths which may end up bouncing (beware some
> > of them get a bit fiddly) - any cache maintenance *must* be completed
> > before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> > and appears to be entirely dead now.
>
> Thanks, Robin. Personally, I would prefer an approach like the one below—
> that is, not optimizing the bounce buffer cases, as they are already slow
> due to hardware limitations with memcpy, and optimizing them would make
> the code quite messy.
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
> for_each_sg(sgl, sg, nents, i) {
> phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
>
> - if (!dev_is_dma_coherent(dev))
> + if (!dev_is_dma_coherent(dev)) {
> arch_sync_dma_for_cpu(paddr, sg->length, dir);
> + if (unlikely(dev->dma_io_tlb_mem))
> + arch_sync_dma_flush();
> + }
>
> swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
>
> I’d like to check with you, Leon, and Marek on your views about this.
I agree with your point that the non‑SWIOTLB path is the performant one and
should be preferred. My concern is that you are accessing the
dma_io_tlb_mem variable directly from direct.c, which looks like a layer
violation.
You likely need to introduce an is_swiotlb_something() helper for this.
BTW, please send a v3 instead of posting incremental follow‑ups.
It's hard to track the changes across multiple small additions.
Thanks.
>
> Thanks
> Barry
WARNING: multiple messages have this Message-ID (diff)
From: Leon Romanovsky <leon@kernel.org>
To: Barry Song <21cnbao@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>,
catalin.marinas@arm.com, m.szyprowski@samsung.com,
will@kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, xen-devel@lists.xenproject.org,
Ada Couprie Diaz <ada.coupriediaz@arm.com>,
Ard Biesheuvel <ardb@kernel.org>, Marc Zyngier <maz@kernel.org>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Suren Baghdasaryan <surenb@google.com>,
Tangquan Zheng <zhengtangquan@oppo.com>
Subject: Re: [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Date: Wed, 7 Jan 2026 09:54:14 +0200 [thread overview]
Message-ID: <20260107075414.GA11783@unreal> (raw)
In-Reply-To: <CAGsJ_4xYqseJMFXOU39JJW4Lk2ZHXAnRJLhZdVuFLxAi=Dy5sw@mail.gmail.com>
On Wed, Jan 07, 2026 at 08:47:36AM +1300, Barry Song wrote:
> On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 2026-01-06 6:41 pm, Barry Song wrote:
> > > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>
> > >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> > >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >>>>
> > >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> > >>>>> From: Barry Song <baohua@kernel.org>
> > >>>>>
> > >>>>> Instead of performing a flush per SG entry, issue all cache
> > >>>>> operations first and then flush once. This ultimately benefits
> > >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> > >>>>>
> > >>>>> Cc: Leon Romanovsky <leon@kernel.org>
> > >>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
> > >>>>> Cc: Will Deacon <will@kernel.org>
> > >>>>> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > >>>>> Cc: Robin Murphy <robin.murphy@arm.com>
> > >>>>> Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
> > >>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >>>>> Cc: Marc Zyngier <maz@kernel.org>
> > >>>>> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> > >>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
> > >>>>> Cc: Suren Baghdasaryan <surenb@google.com>
> > >>>>> Cc: Tangquan Zheng <zhengtangquan@oppo.com>
> > >>>>> Signed-off-by: Barry Song <baohua@kernel.org>
> > >>>>> ---
> > >>>>> kernel/dma/direct.c | 14 +++++++-------
> > >>>>> 1 file changed, 7 insertions(+), 7 deletions(-)
> > >>>>
> > >>>> <...>
> > >>>>
> > >>>>> - if (!dev_is_dma_coherent(dev)) {
> > >>>>> + if (!dev_is_dma_coherent(dev))
> > >>>>> arch_sync_dma_for_device(paddr, sg->length,
> > >>>>> dir);
> > >>>>> - arch_sync_dma_flush();
> > >>>>> - }
> > >>>>> }
> > >>>>> + if (!dev_is_dma_coherent(dev))
> > >>>>> + arch_sync_dma_flush();
> > >>>>
> > >>>> This patch should be squashed into the previous one. You introduced
> > >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> > >>>
> > >>> Hi Leon,
> > >>>
> > >>> The previous patch replaces all arch_sync_dma_for_* calls with
> > >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> > >>> functional change. The subsequent patches then implement the
> > >>> actual batching. I feel this is a better approach for reviewing
> > >>> each change independently. Otherwise, the previous patch would
> > >>> be too large.
> > >>
> > >> Don't worry about it. Your patches are small enough.
> > >
> > > My hardware does not require a bounce buffer, but I am concerned that
> > > this patch may be incorrect for systems that do require one.
> > >
> > > Now it is:
> > >
> > > void dma_direct_sync_sg_for_cpu(struct device *dev,
> > > struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > > {
> > > struct scatterlist *sg;
> > > int i;
> > >
> > > for_each_sg(sgl, sg, nents, i) {
> > > phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> > >
> > > if (!dev_is_dma_coherent(dev))
> > > arch_sync_dma_for_cpu(paddr, sg->length, dir);
> > >
> > > swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> > >
> > > if (dir == DMA_FROM_DEVICE)
> > > arch_dma_mark_clean(paddr, sg->length);
> > > }
> > >
> > > if (!dev_is_dma_coherent(dev)) {
> > > arch_sync_dma_flush();
> > > arch_sync_dma_for_cpu_all();
> > > }
> > > }
> > >
> > > Should we call swiotlb_sync_single_for_cpu() and
> > > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > > latest data and that the memcpy is correct? I mean:
> >
> > Yes, this and the equivalents in the later patches are broken for all
> > the sync_for_cpu and unmap paths which may end up bouncing (beware some
> > of them get a bit fiddly) - any cache maintenance *must* be completed
> > before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> > and appears to be entirely dead now.
>
> Thanks, Robin. Personally, I would prefer an approach like the one below—
> that is, not optimizing the bounce buffer cases, as they are already slow
> due to hardware limitations with memcpy, and optimizing them would make
> the code quite messy.
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 550a1a13148d..a4840f7e8722 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
> for_each_sg(sgl, sg, nents, i) {
> phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
>
> - if (!dev_is_dma_coherent(dev))
> + if (!dev_is_dma_coherent(dev)) {
> arch_sync_dma_for_cpu(paddr, sg->length, dir);
> + if (unlikely(dev->dma_io_tlb_mem))
> + arch_sync_dma_flush();
> + }
>
> swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
>
> I’d like to check with you, Leon, and Marek on your views about this.
I agree with your point that the non‑SWIOTLB path is the performant one and
should be preferred. My concern is that you are accessing the
dma_io_tlb_mem variable directly from direct.c, which looks like a layer
violation.
You likely need to introduce an is_swiotlb_something() helper for this.
BTW, please send a v3 instead of posting incremental follow‑ups.
It's hard to track the changes across multiple small additions.
Thanks.
>
> Thanks
> Barry
next prev parent reply other threads:[~2026-01-07 7:54 UTC|newest]
Thread overview: 64+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-26 22:52 [PATCH v2 0/8] dma-mapping: arm64: support batched cache sync Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-26 22:52 ` [PATCH v2 1/8] arm64: Provide dcache_by_myline_op_nosync helper Barry Song
2025-12-26 22:52 ` Barry Song
2026-01-20 12:27 ` Will Deacon
2026-01-20 12:27 ` Will Deacon
2026-01-26 1:43 ` Barry Song
2026-01-26 1:43 ` Barry Song
2025-12-26 22:52 ` [PATCH v2 2/8] arm64: Provide dcache_clean_poc_nosync helper Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-26 22:52 ` [PATCH v2 3/8] arm64: Provide dcache_inval_poc_nosync helper Barry Song
2025-12-26 22:52 ` Barry Song
2026-01-20 12:33 ` Will Deacon
2026-01-20 12:33 ` Will Deacon
2025-12-26 22:52 ` [PATCH v2 4/8] dma-mapping: Separate DMA sync issuing and completion waiting Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-27 20:07 ` Leon Romanovsky
2025-12-27 20:07 ` Leon Romanovsky
2025-12-27 21:45 ` Barry Song
2025-12-27 21:45 ` Barry Song
2025-12-28 14:49 ` Leon Romanovsky
2025-12-28 14:49 ` Leon Romanovsky
2025-12-28 21:38 ` Barry Song
2025-12-28 21:38 ` Barry Song
2025-12-29 14:40 ` Leon Romanovsky
2025-12-29 14:40 ` Leon Romanovsky
2025-12-31 14:43 ` Marek Szyprowski
2025-12-31 14:43 ` Marek Szyprowski
2026-01-05 12:28 ` Jürgen Groß
2026-01-05 12:28 ` Jürgen Groß
2025-12-26 22:52 ` [PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_* Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-27 20:09 ` Leon Romanovsky
2025-12-27 20:09 ` Leon Romanovsky
2025-12-27 20:52 ` Barry Song
2025-12-27 20:52 ` Barry Song
2025-12-28 14:50 ` Leon Romanovsky
2025-12-28 14:50 ` Leon Romanovsky
2026-01-06 18:41 ` Barry Song
2026-01-06 18:41 ` Barry Song
2026-01-06 19:12 ` Robin Murphy
2026-01-06 19:12 ` Robin Murphy
2026-01-06 19:47 ` Barry Song
2026-01-06 19:47 ` Barry Song
2026-01-07 7:54 ` Leon Romanovsky [this message]
2026-01-07 7:54 ` Leon Romanovsky
2026-01-07 13:16 ` Robin Murphy
2026-01-07 13:16 ` Robin Murphy
2026-01-08 11:45 ` Marek Szyprowski
2026-01-08 11:45 ` Marek Szyprowski
2025-12-26 22:52 ` [PATCH v2 6/8] dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-27 20:14 ` Leon Romanovsky
2025-12-27 20:14 ` Leon Romanovsky
2025-12-26 22:52 ` [PATCH RFC v2 7/8] dma-iommu: Support DMA sync batch mode for IOVA link and unlink Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-26 22:52 ` [PATCH RFC v2 8/8] dma-iommu: Support DMA sync batch mode for iommu_dma_sync_sg_for_{cpu, device} Barry Song
2025-12-26 22:52 ` Barry Song
2025-12-27 20:16 ` Leon Romanovsky
2025-12-27 20:16 ` Leon Romanovsky
2025-12-27 20:59 ` Barry Song
2025-12-27 20:59 ` Barry Song
2026-01-06 19:42 ` Robin Murphy
2026-01-06 19:42 ` Robin Murphy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260107075414.GA11783@unreal \
--to=leon@kernel.org \
--cc=21cnbao@gmail.com \
--cc=anshuman.khandual@arm.com \
--cc=ardb@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=iommu@lists.linux.dev \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=m.szyprowski@samsung.com \
--cc=maz@kernel.org \
--cc=robin.murphy@arm.com \
--cc=ryan.roberts@arm.com \
--cc=surenb@google.com \
--cc=will@kernel.org \
--cc=xen-devel@lists.xenproject.org \
--cc=zhengtangquan@oppo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.