From: Mostafa Saleh <smostafa@google.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
linux-arm-kernel@lists.infradead.org,
Robin Murphy <robin.murphy@arm.com>,
Will Deacon <will@kernel.org>, Moritz Fischer <mdf@kernel.org>,
Moritz Fischer <moritzf@google.com>,
Michael Shavit <mshavit@google.com>,
Nicolin Chen <nicolinc@nvidia.com>,
patches@lists.linux.dev,
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Subject: Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
Date: Wed, 31 Jan 2024 14:34:23 +0000
Message-ID: <Zbpab0NoE2mKBnUc@google.com>
In-Reply-To: <20240130235611.GF1455070@nvidia.com>

On Tue, Jan 30, 2024 at 07:56:11PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2024 at 10:42:13PM +0000, Mostafa Saleh wrote:
>
> > On Thu, Jan 25, 2024 at 07:57:11PM -0400, Jason Gunthorpe wrote:
> > > As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> > > been limited to only work correctly in certain scenarios that the caller
> > > must ensure. Generally the caller must put the STE into ABORT or BYPASS
> > > before attempting to program it to something else.
> > >
> > > The iommu core APIs would ideally expect the driver to do a hitless change
> > > of iommu_domain in a number of cases:
> > >
> > > - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
> > > for the RESV ranges
> > >
> > > - PASID upgrade has IDENTITY on the RID with no PASID, then a PASID paging
> > >   domain is installed. The RID should not be impacted
> > >
> > > - PASID downgrade has IDENTITY on the RID and all PASIDs removed.
> > >   The RID should not be impacted
> > >
> > > - RID does PAGING -> BLOCKING with active PASID, PASIDs should not be
> > >   impacted
> > >
> > > - NESTING -> NESTING for carrying all the above hitless cases in a VM
> > > into the hypervisor. To comprehensively emulate the HW in a VM we should
> > > assume the VM OS is running logic like this and expecting hitless updates
> > > to be relayed to real HW.
> >
> > From my understanding, some of these cases are not implemented (at this point).
> > However, from what I see, most of these cases involve switching from/to
> > identity, for which the current driver would have to block in between; is my
> > understanding correct?
>
> Basically
>
> > As for NESTING -> NESTING, how is that achieved? (and why?)
>
> Through iommufd and it is necessary to reflect hitless transition from
> the VM to the real HW. See VFIO_DEVICE_ATTACH_IOMMUFD_PT
>
> > AFAICT, VFIO will do BLOCKING in between any transition, and that domain
> > should never change while a device is assigned to a VM.
>
> It ultimately calls iommufd_device_replace() which avoids that. Old
> vfio type1 users will force a blocking domain, but type1 will never support
> nesting so it isn't relevant.
>
Thanks, I will check those.
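For reference, this is my (rough, untested) picture of how userspace drives that
replace path through the iommufd-backed cdev uAPI; the hwpt ids here are made up:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: attach the device to one iommufd hw pagetable, then later
 * call the same ioctl with a different pt_id. With the iommufd-backed cdev
 * this goes through iommufd_device_replace() rather than detach + attach,
 * so there is no forced blocking window in between.
 */
static int attach_hwpt(int device_fd, __u32 hwpt_id)
{
        struct vfio_device_attach_iommufd_pt attach = {
                .argsz = sizeof(attach),
                .pt_id = hwpt_id,
        };

        return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}

/* attach_hwpt(fd, hwpt_a); ... attach_hwpt(fd, hwpt_b); -> replace */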
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index 0ffb1cf17e0b2e..690742e8f173eb 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
> > > ARM_SMMU_MAX_MSIS,
> > > };
> > >
> > > +struct arm_smmu_entry_writer_ops;
> > > +struct arm_smmu_entry_writer {
> > > + const struct arm_smmu_entry_writer_ops *ops;
> > > + struct arm_smmu_master *master;
> >
> > I see that only master->smmu is used; is there a reason why we have this
> > struct instead?
>
> The CD patches in part 2 require the master because the CD entry
> memory is shared across multiple CDs, so we iterate the SID list inside
> the update. The STE is the opposite: each STE has its own memory, so we
> iterate the SID list outside the update.
>
> > > +struct arm_smmu_entry_writer_ops {
> > > + unsigned int num_entry_qwords;
> > > + __le64 v_bit;
> > > + void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
> > > + __le64 *used);
> >
> > *writer is not used in this series; I think it would make more sense if
> > it's added in the patch that introduces its use.
>
> Ah, I guess. I think it is used in the test bench.
>
> > > + void (*sync)(struct arm_smmu_entry_writer *writer);
> > > +};
> > > +
> > > +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> > > +
> >
> > Isn't that just STRTAB_STE_DWORDS? Also, it makes more sense not to tie
> > this to the struct but to the actual hardware description, which would
> > never change (whereas the struct can change).
>
> The struct and the HW description are the same. The struct size cannot
> change. Broadly in the series STRTAB_STE_DWORDS is being disfavoured
> for sizeof(struct arm_smmu_ste) now that we have the struct.
>
> After part 3 there are only two references left to that constant, so I
> will likely change part 3 to remove it.
But arm_smmu_ste is defined based on STRTAB_STE_DWORDS, and that macro would
never change as it is tied to the HW. However, in the future we could update
“struct arm_smmu_ste” to hold, say, a refcount, and then
sizeof(struct arm_smmu_ste) would no longer be the size of the STE in the hardware.
IMHO, any reference to the HW STE should be done using the macro.
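For illustration, this is how the two are tied together today, plus a
compile-time check (my suggestion, not in the patch) that would catch the
struct growing software-only state:

/* Today the struct is sized by the HW constant, so the two are equal: */
#define STRTAB_STE_DWORDS       8

struct arm_smmu_ste {
        __le64 data[STRTAB_STE_DWORDS];
};

/* Suggestion only: make the dependency explicit, so any future change to
 * the struct that breaks sizeof(struct arm_smmu_ste) == HW STE size fails
 * to build instead of silently miscomputing entry sizes.
 */
static_assert(sizeof(struct arm_smmu_ste) == STRTAB_STE_DWORDS * sizeof(u64));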
> > > +/*
> > > + * Figure out if we can do a hitless update of entry to become target. Returns a
> > > + * bit mask where 1 indicates that qword needs to be set disruptively.
> > > + * unused_update is an intermediate value of entry that has unused bits set to
> > > + * their new values.
> > > + */
> > > +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> > > + const __le64 *entry, const __le64 *target,
> > > + __le64 *unused_update)
> > > +{
> > > + __le64 target_used[NUM_ENTRY_QWORDS] = {};
> > > + __le64 cur_used[NUM_ENTRY_QWORDS] = {};
> > > + u8 used_qword_diff = 0;
> > > + unsigned int i;
> > > +
> > > + writer->ops->get_used(writer, entry, cur_used);
> > > + writer->ops->get_used(writer, target, target_used);
> > > +
> > > + for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> > > + /*
> > > + * Check that masks are up to date, the make functions are not
> > > + * allowed to set a bit to 1 if the used function doesn't say it
> > > + * is used.
> > > + */
> > > + WARN_ON_ONCE(target[i] & ~target_used[i]);
> > > +
> >
> > I think this should be a BUG, as we don't know the consequences of such a
> > change, and this should never happen in a non-development kernel.
>
> Guidance from Linus is to never use BUG, always use WARN_ON and try to
> recover. If people are running in a high-sensitivity production
> environment they should set the panic_on_warn feature to ensure any
> kernel self-detection of corruption triggers a halt.
>
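Fair enough. For the record, the pattern that implies looks roughly like this
(illustration only, not code from this series):

#include <linux/bug.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Illustration: detect the inconsistency, WARN once, and let the caller
 * back out instead of halting with BUG(). Deployments that want such
 * corruption to be fatal can set the panic_on_warn sysctl so the WARN
 * itself panics the machine.
 */
static int example_check_used_mask(u64 qword, u64 used_mask)
{
        if (WARN_ON_ONCE(qword & ~used_mask))
                return -EINVAL; /* reject the update rather than crash */
        return 0;
}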
> > > +/*
> > > + * Update the STE/CD to the target configuration. The transition from the
> > > + * current entry to the target entry takes place over multiple steps that
> > > + * attempt to make the transition hitless if possible. This function takes care
> > > + * not to create a situation where the HW can perceive a corrupted entry. HW is
> > > + * only required to provide 64 bit atomicity for stores from the CPU, while
> > > + * entries are many 64 bit values in size.
> > > + *
> > > + * The difference between the current value and the target value is analyzed to
> > > + * determine which of three updates are required - disruptive, hitless or no
> > > + * change.
> > > + *
> > > + * In the most general disruptive case we can make any update in three steps:
> > > + * - Disrupting the entry (V=0)
> > > + * - Fill now unused qwords, except qword 0 which contains V
> > > + * - Make qword 0 have the final value and valid (V=1) with a single 64
> > > + * bit store
> > > + *
> > > + * However this disrupts the HW while it is happening. There are several
> > > + * interesting cases where a STE/CD can be updated without disturbing the HW
> > > + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> > > + * because the used bits don't intersect. We can detect this by calculating how
> > > + * many 64 bit values need update after adjusting the unused bits and skip the
> > > + * V=0 process. This relies on the IGNORED behavior described in the
> > > + * specification.
> > > + */
> > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > + __le64 *entry, const __le64 *target)
> > > +{
> > > + unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> > > + __le64 unused_update[NUM_ENTRY_QWORDS];
> > > + u8 used_qword_diff;
> > > +
> > > + used_qword_diff =
> > > + arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> > > + if (hweight8(used_qword_diff) > 1) {
> > > + /*
> > > + * At least two qwords need their in-use bits to be changed. This
> > > + * requires a breaking update, zero the V bit, write all qwords
> > > + * but 0, then set qword 0
> > > + */
> > > + unused_update[0] = entry[0] & (~writer->ops->v_bit);
> > > + entry_set(writer, entry, unused_update, 0, 1);
> > > + entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> > > + entry_set(writer, entry, target, 0, 1);
> > > + } else if (hweight8(used_qword_diff) == 1) {
> > > + /*
> > > + * Only one qword needs its used bits to be changed. This is a
> > > + * hitless update, update all bits the current STE is ignoring
> > > + * to their new values, then update a single "critical qword" to
> > > + * change the STE and finally 0 out any bits that are now unused
> > > + * in the target configuration.
> > > + */
> > > + unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> > > +
> > > + /*
> > > + * Skip writing unused bits in the critical qword since we'll be
> > > + * writing it in the next step anyways. This can save a sync
> > > + * when the only change is in that qword.
> > > + */
> > > + unused_update[critical_qword_index] =
> > > + entry[critical_qword_index];
> > > + entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> > > + entry_set(writer, entry, target, critical_qword_index, 1);
> > > + entry_set(writer, entry, target, 0, num_entry_qwords);
> >
> > The STE is updated in 3 steps.
> > 1) Update all bits from target (except the changed qword)
> > 2) Update the changed qword
> > 3) Remove the bits that are not used by the target STE.
> >
> > In most cases we would issue a sync for 1) and 3) although the hardware ignores
> > the updates, that seems necessary, am I missing something?
>
> "seems [un]necessary", right?
Yes, that's a typo.
> All syncs are necessary because of the way the SMMU HW is permitted to
> cache on a qword-by-qword basis.
>
> Eg with no sync after step 1 the HW cache could have:
>
> QW0 Not present
> QW1 Step 0 (Current)
>
> And then, instantly after step 2 updates QW0 but before it does the
> sync, the HW is permitted to read. Then it would have:
>
> QW0 Step 2
> QW1 Step 0 (Current)
>
> Which is illegal. The HW is allowed to observe a mix of Step[n] and
> Step[n+1] only. Never a mix of Step[n-1] and Step[n+1].
>
> The sync provides a barrier that prevents this. HW can never observe
> the critical qword of step 2 without also observing only new values of
> step 1.
>
> The same argument applies for step 3 -> the next step 1 on a future update.
I see, thanks for the explanation.
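To restate it for myself in code terms, every step goes through entry_set(),
which ends with the sync; this is my paraphrase of that helper from the patch
(possibly not letter-exact):

/* My paraphrase of the patch's entry_set(); may differ in detail. */
static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
                      const __le64 *target, unsigned int start,
                      unsigned int len)
{
        bool changed = false;
        unsigned int i;

        /* 64-bit stores only; the HW may cache each qword independently */
        for (i = start; len != 0; len--, i++) {
                if (entry[i] != target[i]) {
                        WRITE_ONCE(entry[i], target[i]);
                        changed = true;
                }
        }

        /* The sync after each step is the barrier that stops the HW from
         * observing a mix of Step[n-1] and Step[n+1] qwords.
         */
        if (changed)
                writer->ops->sync(writer);
        return changed;
}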
Thanks,
Mostafa