From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Vincent Chen <vincent.chen@sifive.com>,
Alexandre Ghiti <alex@ghiti.fr>,
Albert Ou <aou@eecs.berkeley.edu>,
iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
linux-riscv@lists.infradead.org,
Palmer Dabbelt <palmer@dabbelt.com>,
Paul Walmsley <pjw@kernel.org>,
Tomasz Jeznach <tjeznach@rivosinc.com>,
Will Deacon <will@kernel.org>,
lihangjing@bytedance.com, Xu Lu <luxu.kernel@bytedance.com>,
patches@lists.linux.dev, xieyongji@bytedance.com
Subject: Re: [PATCH v2 0/5] Convert riscv to use the generic iommu page table
Date: Mon, 2 Feb 2026 10:37:20 -0400
Message-ID: <20260202143720.GN2223369@nvidia.com>
In-Reply-To: <8c46864e-4625-49f4-90b3-a7467cec8b7b@arm.com>
On Mon, Feb 02, 2026 at 02:00:07PM +0000, Robin Murphy wrote:
> > DMA-FQ requires two functionalities from the page table:
> > 1) use gather->freelist to avoid a HW UAF (iommupt always does this)
>
> Nope, correct DMA API usage would almost never unmap an entire table, so
> synchronous non-leaf maintenance in that path still doesn't hurt DMA-FQ
> either (e.g. io-pgtable-arm).
Well, it certainly would hurt workloads like IB MRs, which can have
quite a lot of IOVA in a single dma_map_sg(), and we do want to see the
table levels removed to avoid the waste that Pasha has talked
about. Doing single invalidations of potentially a lot of levels in a
DMA-FQ environment is unnecessary overhead.
But I get your point that simple, say storage, use of the DMA API
wouldn't be bothered by this and you could still get a lot of benefit
without using the free list.
> If a pagetable implementation wanted to refcount and eagerly free empty
> tables upon leaf unmaps, then yes it would need deferred freeing, but
> frankly it would be better off just not doing that at all for DMA-FQ anyway
> (as IOVA caching would make it likely to need to repopulate the same level
> of table soon.)
Today it isn't done with refcounts. If the unmapped IOVA range fully
contains a table level, then that table level can go away too. It
does trim interior page tables for large IOVA allocations, but small
ones are unlikely to free anything.
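As a rough illustration of the containment rule described above, here is a
minimal user-space sketch. It is not the iommupt code: range_covers_table()
and its parameters are hypothetical names, and it assumes a table's IOVA
span is a power of two and that iova + len does not wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel API. Returns true when the unmapped
 * range [iova, iova + len) covers the entire IOVA span of the table that
 * translates 'addr'. 'table_span' is the number of bytes of IOVA one table
 * at that level covers; it is assumed to be a power of two.
 */
static bool range_covers_table(uint64_t iova, uint64_t len,
			       uint64_t addr, uint64_t table_span)
{
	uint64_t table_start, table_last, range_last;

	assert(table_span && (table_span & (table_span - 1)) == 0);
	if (!len)
		return false;

	table_start = addr & ~(table_span - 1);
	table_last = table_start + (table_span - 1);
	range_last = iova + len - 1;	/* assumes no wrap-around */

	return iova <= table_start && range_last >= table_last;
}
```

With a 2 MiB span per table, a 4 MiB unmap starting at 0 covers the table at
0x200000, while a 4 KiB unmap inside that table does not, which matches the
point that small allocations are unlikely to free anything.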
> > The one call to iommu_iotlb_sync() is only for the para-virtualization
> > optimization of narrowing invalidations. It would be nonsensical for a
> > driver to enable this optimization and offer IOMMU_CAP_DEFERRED_FLUSH.
>
> Not necessarily - in the PV case it can be desirable to minimise
> over-invalidation *if* you're trapping for targeted invalidations in strict
> mode. However, depending on the usage pattern it may also be beneficial to
> have non-strict let the FQ mechanism batch up work to minimise the number of
> traps taken - e.g. s390 is in this situation, and is precisely why we added
> IOMMU_DMA_OPTS_SINGLE_QUEUE to help optimise for that.
Okay, so if I understand you right, it should check for
iommu_iotlb_gather_queued() and disable PT_FEAT_FLUSH_RANGE_NO_GAPS
mode entirely? I.e., there is no point in doing small invalidations if
the caller is going to do a flush all?
This way the user gets to pick between DMA-FQ and DMA-strict?
Also, Intel would probably benefit from .shadow_on_flush too?
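To make the question concrete, the check being asked about might look
roughly like the sketch below. Everything here is a user-space stand-in with
hypothetical names: struct gather_sketch mimics only the queued state of the
gather, gather_queued() stands in for iommu_iotlb_gather_queued(), and
no_gaps_feature stands in for PT_FEAT_FLUSH_RANGE_NO_GAPS being enabled. It
is an assumption about the shape of the logic, not what the series does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in that models only the "queued" state of a gather. */
struct gather_sketch {
	bool queued;	/* true when the flush queue will flush later */
};

/* Stand-in for iommu_iotlb_gather_queued(); a NULL gather is not queued. */
static bool gather_queued(const struct gather_sketch *gather)
{
	return gather && gather->queued;
}

/*
 * Narrowed (gap-free) invalidations only pay off when the caller is not
 * going to issue a flush-all through the flush queue anyway.
 */
static bool use_narrow_invalidation(const struct gather_sketch *gather,
				    bool no_gaps_feature)
{
	return no_gaps_feature && !gather_queued(gather);
}
```

Under this sketch a strict-mode unmap keeps the narrowed invalidations and a
queued (DMA-FQ) unmap skips them, so the choice follows whichever mode the
user picked.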
Jason