From: Samiullah Khawaja <skhawaja@google.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: will@kernel.org, robin.murphy@arm.com, joro@8bytes.org,
bhelgaas@google.com, jgg@nvidia.com, rafael@kernel.org,
lenb@kernel.org, praan@google.com, baolu.lu@linux.intel.com,
xueshuai@linux.alibaba.com, kevin.tian@intel.com,
linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-pci@vger.kernel.org, vsethi@nvidia.com
Subject: Re: [PATCH v2 4/7] iommu/arm-smmu-v3: Mark ATC invalidate timeouts via lockless bitmap
Date: Wed, 18 Mar 2026 22:02:32 +0000
Message-ID: <absfAT4jgL2f3UfR@google.com>
In-Reply-To: <0c5525367cc67ccc84a675544d1d9f8462704065.1773774441.git.nicolinc@nvidia.com>

Hi Nicolin,

On Tue, Mar 17, 2026 at 12:15:37PM -0700, Nicolin Chen wrote:
>An ATC invalidation timeout is a fatal error. While the SMMUv3 hardware is
>aware of the timeout via a GERROR interrupt, the driver thread issuing the
>commands lacks a direct mechanism to verify whether its specific batch was
>the cause or not, as polling the CMD_SYNC status doesn't natively return a
>failure code, making it very difficult to coordinate per-device recovery.
>
>Introduce an atc_sync_timeouts bitmap in the cmdq structure to bridge this
>gap. When the ISR detects an ATC timeout, set the bit corresponding to the
>physical CMDQ index of the faulting CMD_SYNC command.
>
>On the issuer side, after polling completes (or times out), test and clear
>its dedicated bit. If set, override any generic timeout, return -ETIMEDOUT
>to trigger device quarantine.
>
>Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 +++++++++++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>index 36de2b0b2ebe6..3eb12a34b086a 100644
>--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
>@@ -633,6 +633,7 @@ struct arm_smmu_cmdq {
> atomic_long_t *valid_map;
> atomic_t owner_prod;
> atomic_t lock;
>+ unsigned long *atc_sync_timeouts;
> bool (*supports_cmd)(struct arm_smmu_cmdq_ent *ent);
> };
>
>diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>index 01030ffd2fe23..9c8972ebc94f9 100644
>--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>@@ -445,7 +445,10 @@ void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
> * at the CMD_SYNC. Attempt to complete other pending commands
> * by repeating the CMD_SYNC, though we might well end up back
> * here since the ATC invalidation may still be pending.
>+ *
>+ * Mark the faulty batch in the bitmap for the issuer to match.
> */
>+ set_bit(Q_IDX(&q->llq, cons), cmdq->atc_sync_timeouts);
> return;
> case CMDQ_ERR_CERROR_ILL_IDX:
> default:
>@@ -895,9 +898,19 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>
> /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
> if (sync) {
>+ u32 sync_prod;
>+
> llq.prod = queue_inc_prod_n(&llq, n);
>+ sync_prod = llq.prod;
>+
> ret = arm_smmu_cmdq_poll_until_sync(smmu, cmdq, &llq);
>- if (ret) {
>+ if (test_and_clear_bit(Q_IDX(&llq, sync_prod),
>+ cmdq->atc_sync_timeouts)) {

This bit will not be set if the software timeout (1 second) fires first. Do
you know whether the ATC invalidation timeout of the Arm SMMUv3 is shorter
than the software timeout in the driver?

If not, maybe we can handle the software timeout here as well, since the
cmdlist is already known?

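To make the question concrete, here is a rough userspace sketch of the two
paths. It is not driver code: every sketch_* name, the queue size, and the
batch_has_atc_inv flag are made up for illustration, and treating a software
timeout on an ATC_INV batch as an ATC timeout is only the assumption being
asked about, not something the patch does.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical queue size (power of two), standing in for the cmdq's nents. */
#define SKETCH_NENTS		256u
#define SKETCH_BITS_PER_WORD	(8u * sizeof(unsigned long))

/* Userspace stand-in for cmdq->atc_sync_timeouts. */
static _Atomic unsigned long
sketch_atc_timeouts[SKETCH_NENTS / SKETCH_BITS_PER_WORD];

enum sketch_outcome {
	SKETCH_OK,
	SKETCH_ATC_TIMEOUT,	/* would lead to device quarantine */
	SKETCH_GENERIC_TIMEOUT,	/* plain CMD_SYNC timeout */
};

/* Models the set_bit() done by the error handler for the faulting CMD_SYNC. */
static void sketch_mark_atc_timeout(uint32_t idx)
{
	idx &= SKETCH_NENTS - 1;
	atomic_fetch_or(&sketch_atc_timeouts[idx / SKETCH_BITS_PER_WORD],
			1UL << (idx % SKETCH_BITS_PER_WORD));
}

/* Models test_and_clear_bit() on the issuer side. */
static bool sketch_test_and_clear(uint32_t idx)
{
	unsigned long mask, old;

	idx &= SKETCH_NENTS - 1;
	mask = 1UL << (idx % SKETCH_BITS_PER_WORD);
	old = atomic_fetch_and(&sketch_atc_timeouts[idx / SKETCH_BITS_PER_WORD],
			       ~mask);
	return (old & mask) != 0;
}

/*
 * poll_ret is the result of polling the CMD_SYNC (0 or -ETIMEDOUT).
 * batch_has_atc_inv is an assumed hint: the issuer knows its own cmdlist,
 * so it can tell whether the batch contained an ATC invalidation.
 */
static enum sketch_outcome sketch_classify_sync(uint32_t sync_idx, int poll_ret,
						 bool batch_has_atc_inv)
{
	/* Path 1: the hardware reported the ATC timeout and the bit was set. */
	if (sketch_test_and_clear(sync_idx))
		return SKETCH_ATC_TIMEOUT;

	/* Path 2 (assumption): software timeout fired before any bit was set. */
	if (poll_ret == -ETIMEDOUT && batch_has_atc_inv)
		return SKETCH_ATC_TIMEOUT;

	return poll_ret ? SKETCH_GENERIC_TIMEOUT : SKETCH_OK;
}
```

With only path 1, a batch whose poll gives up before the hardware flags the
error would fall through to the generic timeout branch.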
Thanks,
Sami
>+ dev_err_ratelimited(smmu->dev,
>+ "CMD_SYNC for ATC_INV timeout at prod=0x%08x\n",
>+ sync_prod);
>+ ret = -ETIMEDOUT;
>+ } else if (ret) {
> dev_err_ratelimited(smmu->dev,
> "CMD_SYNC timeout at 0x%08x [hwprod 0x%08x, hwcons 0x%08x]\n",
> llq.prod,
>@@ -4458,6 +4471,11 @@ int arm_smmu_cmdq_init(struct arm_smmu_device *smmu,
> if (!cmdq->valid_map)
> return -ENOMEM;
>
>+ cmdq->atc_sync_timeouts =
>+ devm_bitmap_zalloc(smmu->dev, nents, GFP_KERNEL);
>+ if (!cmdq->atc_sync_timeouts)
>+ return -ENOMEM;
>+
> return 0;
> }
>
>--
>2.43.0
>
>