From: Jason Gunthorpe <jgg@nvidia.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
	kevin.tian@intel.com, praan@google.com, nathan@kernel.org,
	yi.l.liu@intel.com, peterz@infradead.org, mshavit@google.com,
	jsnitsel@redhat.com, smostafa@google.com,
	jeff.johnson@oss.qualcomm.com, zhangzekun11@huawei.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org,
	shameerali.kolothum.thodi@huawei.com
Subject: Re: [PATCH v2 06/11] iommu/arm-smmu-v3: Introduce arm_smmu_s2_parent_tlb_ invalidation helpers
Date: Tue, 15 Apr 2025 20:46:16 -0300
Message-ID: <20250415234616.GB823903@nvidia.com>
In-Reply-To: <Z/69PTDANYagKX3d@Asurada-Nvidia>

On Tue, Apr 15, 2025 at 01:10:37PM -0700, Nicolin Chen wrote:
> On Tue, Apr 15, 2025 at 09:50:42AM -0300, Jason Gunthorpe wrote:
> > struct invalidation_op {
> >     struct arm_smmu_device *smmu;
> >     enum {ATS,S2_VMID_IPA,S2_VMID,S1_ASID} invalidation_op;
> >     union {
> >         u16 vmid;
> >         u32 asid;
> >         u32 ats_id;
> >     };
> >     refcount_t users;
> > };
> > 
> > Then invalidation would just iterate over this list following each
> > instruction. 
> > 
> > When things are attached the list is mutated:
> >  - Normal S1/S2 attach would reuse an ASID for the same instance or
> >    allocate a new list entry, users keeps track of ID sharing
> >  - VMID attach would use the VMID of the vSMMU
> >  - ATS enabled would add entries for each PCI device instead of the
>    separate ATS list
> 
> Interesting. I can see it generalize all the use cases.
> 
> Yet are you expecting a big list combining TLBI and ATC_INV cmds?

It is the idea I had in my head. There isn't really a great reason to
have two lists if one list can handle the required updating and
locking needs.. I imagine the IOTLB entries would be sorted first and
the ATC entries last.
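
A minimal userspace sketch of that ordering, just to make it concrete.
Everything here (inv_kind, inv_entry, smmu_id as a stand-in for the
struct arm_smmu_device pointer, the comparator) is a hypothetical name
for illustration, not existing driver code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical op kinds, loosely mirroring the enum quoted above. */
enum inv_kind { INV_S1_ASID, INV_S2_VMID, INV_S2_VMID_IPA, INV_ATS };

struct inv_entry {
	int smmu_id;		/* stand-in for struct arm_smmu_device * */
	enum inv_kind kind;
	unsigned int id;	/* ASID, VMID or ATS stream ID */
	unsigned int users;	/* how many masters share this entry */
};

static int inv_is_ats(const struct inv_entry *e)
{
	return e->kind == INV_ATS;
}

/* IOTLB entries first, ATC entries last, then by instance, kind and ID. */
static int inv_entry_cmp(const void *a, const void *b)
{
	const struct inv_entry *x = a, *y = b;

	if (inv_is_ats(x) != inv_is_ats(y))
		return inv_is_ats(x) - inv_is_ats(y);
	if (x->smmu_id != y->smmu_id)
		return x->smmu_id < y->smmu_id ? -1 : 1;
	if (x->kind != y->kind)
		return x->kind < y->kind ? -1 : 1;
	if (x->id != y->id)
		return x->id < y->id ? -1 : 1;
	return 0;
}

static void inv_list_sort(struct inv_entry *list, size_t n)
{
	assert(list || !n);
	qsort(list, n, sizeof(*list), inv_entry_cmp);
}
```

Grouping by instance inside each half keeps all commands for one SMMU
adjacent, which is what a batching invalidation loop would want.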

> I think the ATC_INV entries don't need a refcount?

Probably in almost all cases.

But see below about needing two domains in the list at once and recall
that today we temporarily put the same domain in the list twice
sometimes. So it may make a lot of sense to use the refcount in every
entry to track how many masters are using that entry just to keep the
design simple.

> And finding an SID (to remove the device for example) would take
> long, when there are a lot of entries in the list?

It depends how smart you get, bisection search on a sorted linear list
would scale fine. But I don't think we care much about attach/detach
performance, or have such high numbers of attachments that this is
worth optimizing for.
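
For illustration only, a bisection lookup over a list sorted by SID
could look like the sketch below. ats_entry and ats_find_sid() are made
up names, and it assumes the ATC part of the list is sorted by SID:

```c
#include <assert.h>
#include <stddef.h>

struct ats_entry {
	unsigned int sid;	/* stream ID, the assumed sort key */
	unsigned int users;
};

/* Return the index of the entry with @sid, or -1 if it is not present. */
static long ats_find_sid(const struct ats_entry *list, size_t n,
			 unsigned int sid)
{
	size_t lo = 0, hi = n;

	assert(list || !n);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (list[mid].sid < sid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo < n && list[lo].sid == sid) ? (long)lo : -1;
}
```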

> Should the ATS list still be separate, or even an xarray?

I haven't gone through it in enough detail to know. If the invalidation
can use the structure above for ATS and nothing else needs the ATS
list, then perhaps it doesn't need to exist.

> I will refer to their driver. Yet, I wonder what we will gain from
> RCU here? Race condition? Would you elaborate with some use case?

The invalidation path was optimized to avoid locking, look at the
stuff in arm_smmu_atc_inv_domain() that tries to avoid taking the
spinlock protecting the devices list, which the ATS invalidation reads.

So, I imagine a similar lock free scheme would be

invalidation:
  rcu_read_lock()
  list = READ_ONCE(domain->invalidation_ops);
  [execute invalidation on list]
  rcu_read_unlock()

mutate:
   mutex_lock(domain->lock for attachment)
   new_list = kcalloc()
   copy_and_mutate(domain->invalidation_ops, new_list);
   rcu_assign_pointer(domain->invalidation_ops, new_list);
   mutex_unlock(domain->lock for attachment)
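
A rough userspace model of the same copy/publish pattern, with C11
acquire/release atomics standing in for rcu_dereference() and
rcu_assign_pointer(). The struct layout and function names are
hypothetical, the attach mutex is assumed to be held by the caller, and
the deferred free of the old list (kfree_rcu() in the kernel) is left
to the caller:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical immutable snapshot of what to invalidate. */
struct inv_list {
	size_t num;
	unsigned int ids[];
};

struct domain {
	_Atomic(struct inv_list *) invalidation_ops;
};

static void domain_init(struct domain *d)
{
	atomic_init(&d->invalidation_ops, NULL);
}

/* Reader side: lock free, walks whichever snapshot it picked up. */
static size_t domain_invalidate(struct domain *d, unsigned int *out,
				size_t max)
{
	struct inv_list *l = atomic_load_explicit(&d->invalidation_ops,
						  memory_order_acquire);
	size_t i, n = 0;

	for (i = 0; l && i < l->num && n < max; i++)
		out[n++] = l->ids[i];	/* "execute invalidation" */
	return n;
}

/*
 * Writer side: copy, mutate the copy, publish. The old list is handed
 * back so the caller can free it after a grace period.
 */
static int domain_add_id(struct domain *d, unsigned int id,
			 struct inv_list **old_out)
{
	struct inv_list *old = atomic_load_explicit(&d->invalidation_ops,
						    memory_order_relaxed);
	size_t num = old ? old->num : 0;
	struct inv_list *nl;

	assert(old_out);
	nl = calloc(1, sizeof(*nl) + (num + 1) * sizeof(nl->ids[0]));
	if (!nl)
		return -1;
	if (num)
		memcpy(nl->ids, old->ids, num * sizeof(nl->ids[0]));
	nl->ids[num] = id;
	nl->num = num + 1;
	atomic_store_explicit(&d->invalidation_ops, nl,
			      memory_order_release);
	*old_out = old;
	return 0;
}
```

The point of the shape is that a reader never sees a half built list:
it gets either the old snapshot or the new one.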

Then because of RCU you have to deal with some races.

1) HW flushing must be synchronous with the domain attach:
      CPU 1                   CPU 2
    change an IOPTE
    release IOPTs
                              attach a domain
                              release invalidation_ops
    invalidation
       acquire READ_ONCE()
                              acquire IOPTEs
                              update the STE/CD

Such that the HW is guaranteed to either:
 a) see the new value of IOPTE before seeing the STE/CD that could
    cause it to be fetched
 b) is guaranteed to see the invalidation_op for the new STE prior to
    the STE being installed.

IIRC the riscv folks determined that this was a simple smp_mb()..
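
My understanding of the shape, as a userspace sketch: this is the
classic store-buffering pattern, where each side stores its own
variable, issues a full barrier, then loads the other side's variable.
Below, atomic_thread_fence(memory_order_seq_cst) stands in for
smp_mb(), and the two flags and function names are made up. The CPU 2
load only models the ordering of its accesses, not the HW table walk:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int iopte;	/* 0 = old IOPTE value, 1 = new value */
static atomic_int list_has_new;	/* 1 = new master is in invalidation_ops */

/* CPU 1: change the IOPTE, then read the list to invalidate. */
static bool cpu1_change_and_invalidate(void)
{
	atomic_store_explicit(&iopte, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	/* true: the invalidation covers the newly attached master */
	return atomic_load_explicit(&list_has_new,
				    memory_order_relaxed) != 0;
}

/* CPU 2: publish the list, then read IOPTEs before writing the STE/CD. */
static bool cpu2_attach(void)
{
	atomic_store_explicit(&list_has_new, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	/* true: the new IOPTE value is visible before the STE/CD update */
	return atomic_load_explicit(&iopte, memory_order_relaxed) != 0;
}

/* With both fences in place, at least one side must observe the other. */
static void check_no_stale_window(bool invalidated, bool saw_new_iopte)
{
	assert(invalidated || saw_new_iopte);
}
```

Without the fences both loads could return 0, which is exactly the
case where a stale IOTLB entry survives for the new attachment.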

On the detaching side spurious IOTLB invalidation is OK, that will
just cause some performance anomaly. And I think spurious ATC
invalidation is OK too, though it may need a synchronize_rcu() in
device removal due to friendly hot unplug.. IDK

2) Safe domain replacement

The existing code double adds devices to the invalidation lists for
safety. So it would need an algorithm like this:
   
prepare:
    middle_list = copy_and_mutate_add_master(domain->list, new_master);
    final_list = copy_and_mutate_remove_master(middle_list, old_master);
commit:
   // Invalidate both new/old master while we mess with the STE/CD
   rcu_assign_pointer(domain->list, middle_list);
   install_ste()
   // Only invalidate new master
   rcu_assign_pointer(domain->list, final_list);
   kfree_rcu(middle_list);
   kfree_rcu(old_list);

This is needed because there is an intrinsic time window after the STE
is written to memory, but before the STE invalidation sync has
completed in HW, during which we have no idea which of the two domains
the HW is fetching from.
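
A small userspace sketch of that prepare/commit split, assuming a list
is just an array of master IDs. mlist, replace_plan and the helper
names are hypothetical. Plain pointer stores stand in for
rcu_assign_pointer(), and the returned lists would be freed with
kfree_rcu() in the kernel:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct mlist {
	size_t num;
	int masters[];
};

struct replace_plan {
	struct mlist *middle;	/* old and new master */
	struct mlist *final;	/* new master only */
};

static struct mlist *copy_add(const struct mlist *src, int master)
{
	size_t num = src ? src->num : 0;
	struct mlist *dst = malloc(sizeof(*dst) +
				   (num + 1) * sizeof(dst->masters[0]));

	if (!dst)
		return NULL;
	if (num)
		memcpy(dst->masters, src->masters,
		       num * sizeof(dst->masters[0]));
	dst->masters[num] = master;
	dst->num = num + 1;
	return dst;
}

static struct mlist *copy_remove(const struct mlist *src, int master)
{
	struct mlist *dst = malloc(sizeof(*dst) +
				   src->num * sizeof(dst->masters[0]));
	size_t i;

	if (!dst)
		return NULL;
	dst->num = 0;
	for (i = 0; i < src->num; i++)
		if (src->masters[i] != master)
			dst->masters[dst->num++] = src->masters[i];
	return dst;
}

/* All allocations happen here, so the commit steps cannot fail. */
static int replace_prepare(const struct mlist *cur, int new_master,
			   int old_master, struct replace_plan *p)
{
	assert(p);
	p->middle = copy_add(cur, new_master);
	if (!p->middle)
		return -1;
	p->final = copy_remove(p->middle, old_master);
	if (!p->final) {
		free(p->middle);
		p->middle = NULL;
		return -1;
	}
	return 0;
}

/* Step 1, before writing the STE/CD. Returns the list to free later. */
static struct mlist *replace_publish_middle(struct mlist **slot,
					    struct replace_plan *p)
{
	struct mlist *old = *slot;

	*slot = p->middle;
	return old;
}

/* Step 2, after the STE sync completed. Returns the list to free later. */
static struct mlist *replace_publish_final(struct mlist **slot,
					   struct replace_plan *p)
{
	*slot = p->final;
	return p->middle;
}
```

Keeping every allocation in the prepare step matters because the
commit runs around the STE update, where a failure has no good unwind.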

3) IOMMU Device removal

Since the RCU is also protecting the smmu instance memory and queues:

       CPU 1                      CPU 2
    invalidation
      rcu_read_lock()
                                   domain detach
                                   arm_smmu_release_device()
                                   iommu_device_unregister()
      list = READ_ONCE()
      .. list[i]->smmu ..
      rcu_read_unlock()
                                     synchronize_rcu()
                                     kfree(smmu);

But that's easy and we never hotunplug smmu's anyhow.

> > But the end result is we fully disconnect the domain from the smmu
> > instance and all domain types can be shared across all instances if
> > they support the pagetable layout. The invalidation also becomes
> > somewhat simpler as it just sweeps the list and does what it is
> > told. The special ATS list, counter and locking is removed too.
> 
> OK. I'd like to give it another try. Or would you prefer to write
> yourself?

I'd be happy if you can knock it out, or at least determine that it is
too hard or a bad idea. I'm trying to push out the io page table stuff
this cycle.

The only thing that gives me pause is the complexity of the list copy
and mutate, but I didn't try to enumerate all the mutations that are
required. Maybe it is good enough if this is done in a very simple
unoptimized way: 'mutate add master' and 'mutate remove master', each
allocating a new list copy per operation.

Scan the list and calculate the new size. Copy the list discarding
things to delete. Add the new things to the end. Sort.
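
Something like the following userspace sketch, where a single
hypothetical helper covers both add and remove. It simplifies an entry
down to an ID plus a users count, and it over-allocates for the worst
case instead of pre-scanning for the exact size:

```c
#include <assert.h>
#include <stdlib.h>

struct inv_entry {
	unsigned int id;	/* ASID/VMID/SID, simplified to one key */
	unsigned int users;	/* masters sharing this entry */
};

static int inv_cmp_id(const void *a, const void *b)
{
	const struct inv_entry *x = a, *y = b;

	return (x->id > y->id) - (x->id < y->id);
}

/*
 * Build a new sorted list from @old: every ID in @del loses one user
 * (entries that reach zero are discarded) and every ID in @add gains
 * one user (missing entries are appended). @old is left untouched.
 */
static struct inv_entry *inv_list_mutate(const struct inv_entry *old,
					 size_t n_old,
					 const unsigned int *add, size_t n_add,
					 const unsigned int *del, size_t n_del,
					 size_t *n_new)
{
	struct inv_entry *nl = calloc(n_old + n_add + 1, sizeof(*nl));
	size_t i, j, n = 0;

	assert(n_new);
	if (!nl)
		return NULL;

	/* Copy the old entries, discarding the ones nobody uses anymore */
	for (i = 0; i < n_old; i++) {
		struct inv_entry e = old[i];

		for (j = 0; j < n_del; j++)
			if (del[j] == e.id && e.users)
				e.users--;
		if (e.users)
			nl[n++] = e;
	}

	/* Share an existing entry or add a new one to the end */
	for (i = 0; i < n_add; i++) {
		for (j = 0; j < n; j++)
			if (nl[j].id == add[i])
				break;
		if (j < n) {
			nl[j].users++;
		} else {
			nl[n].id = add[i];
			nl[n].users = 1;
			n++;
		}
	}

	qsort(nl, n, sizeof(*nl), inv_cmp_id);
	*n_new = n;
	return nl;
}
```

'mutate add master' would then be a call with an empty del array and
'mutate remove master' a call with an empty add array.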

I'd probably start here, try to write the two mutate functions, check
if those are enough mutate functions, then try to migrate the
invalidation logic over to use the new lists part by part. Building
the new lists can be done first in a series.

From here a future project would be to optimize the invalidation for
multi-SMMU and multi-device... The current code runs everything
serially, but we could push all the invalidation commands to all the
instances, then wait for the syncs to come back from each instance
allowing the HW invalidation to be in parallel. Then similarly do the
ATC in parallel. It is easy to do if the list is sorted already in
order of required operations. This might make most sense for ATC
invalidation since it is always range based and only needs two command
entries?
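
To show the two phase idea with a toy model: the sketch below assumes
the list is already sorted so entries for one instance are adjacent.
mock_smmu and its counters are invented for illustration and do not
model the real command queue:

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for an SMMU instance and its command queue. */
struct mock_smmu {
	unsigned int queued;		/* invalidation commands pushed */
	unsigned int sync_pending;	/* syncs issued, not yet waited on */
	unsigned int sync_done;		/* syncs that have been waited on */
};

struct inv_entry {
	struct mock_smmu *smmu;
	unsigned int id;
};

static void mock_queue_cmd(struct mock_smmu *s, unsigned int id)
{
	(void)id;
	s->queued++;
}

static void mock_issue_sync(struct mock_smmu *s)
{
	s->sync_pending++;
}

static void mock_wait_sync(struct mock_smmu *s)
{
	s->sync_done += s->sync_pending;
	s->sync_pending = 0;
}

/* True when entry @i is the last one for its instance in a sorted list. */
static int last_for_smmu(const struct inv_entry *list, size_t n, size_t i)
{
	return i + 1 == n || list[i + 1].smmu != list[i].smmu;
}

static void invalidate_all(const struct inv_entry *list, size_t n)
{
	size_t i;

	assert(list || !n);

	/* Phase 1: push commands to every instance, one sync each, no wait */
	for (i = 0; i < n; i++) {
		mock_queue_cmd(list[i].smmu, list[i].id);
		if (last_for_smmu(list, n, i))
			mock_issue_sync(list[i].smmu);
	}

	/* Phase 2: only now wait, so the instances work in parallel */
	for (i = 0; i < n; i++)
		if (last_for_smmu(list, n, i))
			mock_wait_sync(list[i].smmu);
}
```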

Jason



