From: Jason Gunthorpe <jgg@nvidia.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: Will Deacon <will@kernel.org>,
Robin Murphy <robin.murphy@arm.com>,
Joerg Roedel <joro@8bytes.org>,
Bjorn Helgaas <bhelgaas@google.com>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Len Brown <lenb@kernel.org>,
Pranjal Shrivastava <praan@google.com>,
Mostafa Saleh <smostafa@google.com>,
Lu Baolu <baolu.lu@linux.intel.com>,
Kevin Tian <kevin.tian@intel.com>,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, vsethi@nvidia.com,
Shuai Xue <xueshuai@linux.alibaba.com>
Subject: Re: [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
Date: Tue, 19 May 2026 16:16:26 -0300 [thread overview]
Message-ID: <20260519191626.GJ3602937@nvidia.com> (raw)
In-Reply-To: <agysAyreCY4huf0I@Asurada-Nvidia>
On Tue, May 19, 2026 at 11:29:23AM -0700, Nicolin Chen wrote:
> On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> > On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > > +void iommu_report_device_broken(struct device *dev)
> > > +{
> > > + struct group_device *gdev;
> > > +
> > > + /*
> > > + * We cannot hold group->mutex here. Rely on iommu_group_broken_worker()
> > > + * to validate dev_has_iommu(). The iommu_group memory is RCU-protected
> > > + * via kfree_rcu() in iommu_group_release(), and group->devices is an
> > > + * RCU-protected list, so the lookup runs entirely under rcu_read_lock.
> > > + *
> > > + * Note the device might have been concurrently removed from the group
> > > + * (list_del_rcu) before iommu_deinit_device() cleared the dev->iommu.
> > > + */
> > > + rcu_read_lock();
> > > + gdev = __dev_to_gdev_rcu(dev);
> > > + if (gdev) {
> >
> > If this is why the RCU is being added it seems like overkill.
> >
> > Just add the worker to struct dev_iommu and push it there so it can
> > use a mutex but I'm confused why are we even adding this function?
> >
> > The entire design of this series was supposed to have the IOMMU driver
> > itself adjust it's "STE" to inhibit translated TLPs synchronosly
> > within its fully locked invalidation loop.
>
> Yes. Surgical STE is done in the driver. But, core-level attaching
> state doesn't reflect correctly. So the driver calls this function
> to notify the core (this is in an invalidation context -- not able
> to use mutex).
>
> > Whats the async worker for?
>
> Then, the core needs to block the device using the similar routine
> to the reset prepare(). And that needs to hold group->mutex, so it
> needs an async worker.
>
> Do you see a much simpler way?
Put the work on the dev_iommu and forget about rcu.
But this is all probably better as some later series if at all. The
driver can block the ATS and the expectation is something will FLR the
device. The FLR will set the blocking and then restore the
domain. None of this async work seems functionally necessary, though
it would be a nice to have. Lets focus on the bare minimum here it, it
is already a difficult enough problem without tacking on these
extras..
Jason
next prev parent reply other threads:[~2026-05-19 19:16 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-19 3:38 [PATCH v4 00/24] iommu/arm-smmu-v3: Quarantine device upon ATC invalidation timeout Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 01/24] PCI: Don't suspend IOMMU when probing reset capability Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 02/24] PCI: Propagate FLR return values to callers Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 03/24] iommu: Convert gdev->blocked from bool to enum gdev_blocked Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 04/24] iommu: Pass in reset result to pci_dev_reset_iommu_done() Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 05/24] iommu: Add reset_device_done callback for hardware fault recovery Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 06/24] iommu: Defer iommu_group free via kfree_rcu() Nicolin Chen
2026-05-19 11:39 ` Jason Gunthorpe
2026-05-19 18:54 ` Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 07/24] iommu: Defer __iommu_group_free_device() to be outside group->mutex Nicolin Chen
2026-05-19 11:47 ` Jason Gunthorpe
2026-05-19 3:38 ` [PATCH v4 08/24] iommu: Change group->devices to RCU-protected list Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 09/24] iommu: Add group pointer to struct group_device Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 10/24] iommu: Add __iommu_group_block_device helper Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device Nicolin Chen
2026-05-19 12:07 ` Jason Gunthorpe
2026-05-19 18:29 ` Nicolin Chen
2026-05-19 19:16 ` Jason Gunthorpe [this message]
2026-05-19 22:30 ` Nicolin Chen
2026-05-19 23:02 ` Jason Gunthorpe
2026-05-19 3:38 ` [PATCH v4 12/24] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 13/24] iommu/arm-smmu-v3: Skip remaining GERROR causes on SFM Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 14/24] iommu/arm-smmu-v3: Introduce per-cmdq cmdq_err_handler callback Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 15/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when CMD_SYNC times out Nicolin Chen
2026-05-19 3:38 ` [PATCH v4 16/24] iommu/arm-smmu-v3: Co-clear pending CMDQ_ERR when queue_has_space() fails Nicolin Chen
2026-05-19 3:39 ` [PATCH v4 17/24] iommu/arm-smmu-v3: Add master in arm_smmu_inv for ATS entries Nicolin Chen
2026-05-19 12:01 ` Jason Gunthorpe
2026-05-19 3:39 ` [PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag Nicolin Chen
2026-05-19 12:06 ` Jason Gunthorpe
2026-05-19 3:39 ` [PATCH v4 19/24] iommu/arm-smmu-v3: Add invs and has_ats to struct arm_smmu_cmdq_batch Nicolin Chen
2026-05-19 12:09 ` Jason Gunthorpe
2026-05-19 3:39 ` [PATCH v4 20/24] iommu/arm-smmu-v3: Introduce arm_smmu_cmdq_batch_issue() wrapper Nicolin Chen
2026-05-19 3:39 ` [PATCH v4 21/24] iommu/arm-smmu-v3: Move arm_smmu_invs_for_each_entry to header Nicolin Chen
2026-05-19 3:39 ` [PATCH v4 22/24] iommu/arm-smmu-v3: Introduce master->ats_invs Nicolin Chen
2026-05-19 12:12 ` Jason Gunthorpe
2026-05-19 3:39 ` [PATCH v4 23/24] iommu/arm-smmu-v3: Serialize STE.EATS and ats_broken updates Nicolin Chen
2026-05-19 3:39 ` [PATCH v4 24/24] iommu/arm-smmu-v3: Block ATS upon an ATC invalidation timeout Nicolin Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260519191626.GJ3602937@nvidia.com \
--to=jgg@nvidia.com \
--cc=baolu.lu@linux.intel.com \
--cc=bhelgaas@google.com \
--cc=iommu@lists.linux.dev \
--cc=joro@8bytes.org \
--cc=kevin.tian@intel.com \
--cc=lenb@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=nicolinc@nvidia.com \
--cc=praan@google.com \
--cc=rafael@kernel.org \
--cc=robin.murphy@arm.com \
--cc=smostafa@google.com \
--cc=vsethi@nvidia.com \
--cc=will@kernel.org \
--cc=xueshuai@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox