Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mostafa Saleh <smostafa@google.com>
To: Daniel Mentz <danielmentz@google.com>
Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	Will Deacon <will@kernel.org>,
	Pranjal Shrivastava <praan@google.com>,
	Liviu Dudau <liviu.dudau@arm.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Rob Clark <robin.clark@oss.qualcomm.com>
Subject: Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
Date: Tue, 30 Sep 2025 09:10:44 +0000	[thread overview]
Message-ID: <aNuelC9K24z_Ph_G@google.com> (raw)
In-Reply-To: <CAE2F3rB6TYjy0a9yecW4zwBLraaj75YBafEz3DUh8zrLChnuCg@mail.gmail.com>

On Mon, Sep 29, 2025 at 02:00:09PM -0700, Daniel Mentz wrote:
> On Mon, Sep 29, 2025 at 5:21 AM Mostafa Saleh <smostafa@google.com> wrote:
> >
> > On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > > On systems with a non-coherent SMMU, the CPU must perform cache
> > > maintenance operations (CMOs) after updating page table entries (PTEs).
> > > This ensures that the SMMU reads the latest version of the descriptors
> > > and not stale data from memory.
> > >
> > > This requirement can lead to significant performance overhead,
> > > especially when mapping long scatter-gather lists where each sg entry
> > > maps into an iommu_map() call that only covers 4KB of address space.
> > >
> > > In such scenarios, each small mapping operation modifies a single 8-byte
> > > PTE but triggers a cache clean for the entire cache line (e.g., 64
> > > bytes). This results in the same cache line being cleaned repeatedly,
> > > once for each PTE it contains.
> > >
> > > A more efficient implementation performs the cache clean operation only
> > > after updating all descriptors that are co-located in the same cache
> > > line. This patch introduces a mechanism to defer and batch the cache
> > > maintenance:
> > >
> > > A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
> > > When this flag is set, the arm-lpae backend will skip the cache sync
> > > operation for leaf entries within its .map_pages implementation.
> > >
> > > A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
> > > After performing a series of mapping operations, the caller is
> > > responsible for invoking this callback for the entire IOVA range. This
> > > function then walks the page tables for the specified range and performs
> > > the necessary cache clean operations for all the modified PTEs.
> > >
> > > This allows for a single, efficient cache maintenance operation to cover
> > > multiple PTE updates, significantly reducing overhead for workloads that
> > > perform many small, contiguous mappings.
> > >
> > > Signed-off-by: Daniel Mentz <danielmentz@google.com>
> > > ---
> > >  drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
> > >  include/linux/io-pgtable.h     |  7 ++++
> > >  2 files changed, 72 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > > index 7e8e2216c294..a970eefb07fb 100644
> > > --- a/drivers/iommu/io-pgtable-arm.c
> > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > @@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > >       for (i = 0; i < num_entries; i++)
> > >               ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
> > >
> > > -     if (!cfg->coherent_walk)
> > > +     if (!cfg->coherent_walk && !cfg->defer_sync_pte)
> > >               __arm_lpae_sync_pte(ptep, num_entries, cfg);
> > >  }
> > >
> > > @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
> > >       return ret;
> > >  }
> > >
> > > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > > +{
> > > +     struct io_pgtable *iop = &data->iop;
> > > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > > +     int ret = 0, num_entries, max_entries;
> > > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > > +     int i, shift, synced_entries = 0;
> > > +
> > > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > > +     iova_offset = iova & ((1ULL << shift) - 1);
> > > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > > +     ptep += sync_idx_start;
> > > +
> > > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > > +             for (i = 0; i < num_entries; i++) {
> > > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > > +                     unsigned long synced;
> > > +
> > > +                     WARN_ON(!pte);
> > > +
> > > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > > +                             int n = i - synced_entries;
> > > +
> > > +                             if (n) {
> > > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > > +                                     synced_entries += n;
> > > +                             }
> > > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > > +                                                             iopte_deref(pte, data));
> > > +                             synced_entries++;
> > > +                     }
> > > +                     synced = block_size - (iova & (block_size - 1));
> > > +                     size -= synced;
> > > +                     iova += synced;
> > > +             }
> > > +     }
> > > +
> > > +     if (synced_entries != num_entries)
> > > +             __arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> > > +
> > > +     return ret;
> > > +}
> >
> > Can't we rely on the exisiting generic table walker "__arm_lpae_iopte_walk",
> > instead writing a new one, that is already used for iova_to_phys and dirty bit.
> 
> The performance gains of .iotlb_sync_map are achieved by performing
> CMOs on a range of descriptors as opposed to individually on each
> descriptor in isolation. The function __arm_lpae_iopte_walk is
> inherently incompatible with this, because it calls the .visit
> callback once for each descriptor it finds in the specified range. I
> guess I could work around this limitation by saving some state in
> io_pgtable_walk_data and developing a .visit function that tries to
> coalesce individual descriptors into contiguous ranges and delays CMOs
> until it finds a break in continuity. I'm afraid, though, that that
> might hurt performance significantly.

Exactly, I think that would be the way, I don’t have a strong opinion
though, but I’d avoid open coding a new walker unless it’s necessary.
Also, the current walker won’t do ranges, it needs some more changes,
I did that as part of (half of the patch doesn’t apply for this case):
https://lore.kernel.org/all/20241212180423.1578358-38-smostafa@google.com/

Thanks,
Mostafa

next prev parent reply	other threads:[~2025-09-30  9:10 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-27 22:39 [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback Daniel Mentz
2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
2025-09-29 11:58   ` Jason Gunthorpe
2025-09-29 12:24     ` Mostafa Saleh
2025-09-29 12:47       ` Jason Gunthorpe
2025-09-30  0:23         ` Pranjal Shrivastava
2025-09-30  9:27           ` Mostafa Saleh
2025-09-30 14:56             ` Pranjal Shrivastava
2025-09-29 12:25   ` Mostafa Saleh
2025-09-29 12:21 ` [PATCH 1/2] iommu/io-pgtable-arm: " Mostafa Saleh
2025-09-29 21:00   ` Daniel Mentz
2025-09-30  9:10     ` Mostafa Saleh [this message]
2025-11-04 13:28       ` Will Deacon
2025-11-04 13:37         ` Jason Gunthorpe
2025-09-29 19:57 ` Pranjal Shrivastava
2025-09-29 20:42   ` Daniel Mentz
2025-09-29 22:24     ` Pranjal Shrivastava

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aNuelC9K24z_Ph_G@google.com \
    --to=smostafa@google.com \
    --cc=danielmentz@google.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liviu.dudau@arm.com \
    --cc=praan@google.com \
    --cc=robin.clark@oss.qualcomm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.