From: Nicolin Chen <nicolinc@nvidia.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>, <will@kernel.org>,
<joro@8bytes.org>, <bhelgaas@google.com>, <rafael@kernel.org>,
<lenb@kernel.org>, <praan@google.com>, <kees@kernel.org>,
<baolu.lu@linux.intel.com>, <smostafa@google.com>,
<Alexander.Grest@microsoft.com>, <kevin.tian@intel.com>,
<miko.lenczewski@arm.com>, <linux-arm-kernel@lists.infradead.org>,
<iommu@lists.linux.dev>, <linux-kernel@vger.kernel.org>,
<linux-acpi@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<vsethi@nvidia.com>
Subject: Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
Date: Fri, 6 Mar 2026 12:18:24 -0800
Message-ID: <aas2kHcjJPYFbKSD@Asurada-Nvidia>
In-Reply-To: <20260306140115.GH1651202@nvidia.com>
On Fri, Mar 06, 2026 at 10:01:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 06, 2026 at 01:22:11PM +0000, Robin Murphy wrote:
> > On 2026-03-05 11:41 pm, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 01:15:45PM -0800, Nicolin Chen wrote:
> > >
> > > > You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
> > > > out ATC command?
> > >
> > > Yes, it was my off hand thought.
> > >
> > > > So my test case was to trigger a device fault followed by an ATC
> > > > command. But, I found that the ATC command submission returned 0
> > > > while only the ISR received:
> > > > CMDQ error (cons 0x03000003): ATC invalidate timeout
> > > > arm_smmu_debugfs_atc_write: ATC_INV ret=0
> > > >
> > > > It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
> > > > thread?
> > >
> > > I didn't look, but I thought the CMDQ stops on the ATC invalidation,
> > > flags the error and the ISR NOP's the failing CMDQ entry and restarts
> > > it to resume the thread? Is that something else?
> > >
> > > If so you could insert the STE flush instead of a NOP
> >
> > Nope, sadly the timeout is asynchronous, and CERROR_ATC_INV_SYNC is only
> > reported on the *next* CMD_SYNC - it can't even tell us which CMD_ATC_INV(s)
> > had a problem.
>
> !! That's a good point! The new invalidation code runs many ATC
> invalidations under one sync to optimize for SVA performance so we
> have no idea what devices need to be reset :(
>
> So we really do need to signal to the issuing thread and it will have
> to go back and check how many ATC invalidations are under this sync
> and re-issue one by one to isolate the error then issue the STE change
> and sync. Nothing from an ISR then..

IIUC, we would have two timeouts to identify the device(s), so we
wouldn't need to give up the optimization of batching ATCI cmds?

Will letting a faulty device time out a second time give it a window
to corrupt memory?
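
To make the two-timeout flow concrete, here is a rough userspace model
of how I read it, assuming the first timeout is the one on the batched
sync and the second is the one hit while re-issuing the commands one by
one. It is not driver code: struct atc_inv, issue_with_sync() and
atc_inv_batch_isolate() are made-up names, and the timeout is simulated
with a flag rather than real CMDQ behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one ATC invalidation queued under a sync. */
struct atc_inv {
	unsigned int sid;	/* StreamID of the target device */
	bool faulty;		/* simulation only: device never responds */
};

/*
 * Simulated "issue n commands followed by one sync".
 * Returns -1 if any command in the range timed out, 0 otherwise.
 * Like the real error, it does not say which command failed.
 */
static int issue_with_sync(const struct atc_inv *cmds, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cmds[i].faulty)
			return -1;
	return 0;
}

/*
 * Fast path: the whole batch under a single sync (first timeout).
 * Slow path, only on failure: re-issue each command with its own
 * sync (second timeout) to learn which StreamIDs are bad.
 * Returns the number of StreamIDs written to bad[], which must have
 * room for n entries.
 */
size_t atc_inv_batch_isolate(const struct atc_inv *cmds, size_t n,
			     unsigned int *bad)
{
	size_t nbad = 0;

	assert(cmds && bad);

	if (!issue_with_sync(cmds, n))
		return 0;	/* common case: batching cost nothing extra */

	for (size_t i = 0; i < n; i++)
		if (issue_with_sync(&cmds[i], 1))
			bad[nbad++] = cmds[i].sid;

	return nbad;
}
```

The batch is kept for the common case, and the per-command pass runs
only after a failure. The sketch says nothing about what the faulty
device can do between the two timeouts, which is the part I am asking
about.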
Thanks
Nicolin