* [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault
@ 2025-03-19 14:43 Connor Abbott
2025-03-19 14:44 ` [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:43 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
drm/msm uses the stall-on-fault model to record the GPU state on the
first GPU page fault to help debugging. On systems where the GPU is
paired with an MMU-500, there were two problems:
1. The MMU-500 doesn't de-assert its interrupt line until the fault is
resumed, which led to a storm of interrupts until the fault handler
was called. If we got unlucky and the fault handler was on the same
CPU as the interrupt, there was a deadlock.
2. The GPU is capable of generating page faults much faster than we can
resume them. The GMU (GPU Management Unit) shares the same context bank
as the GPU, so a sudden burst of page faults would effectively starve
it and trigger a watchdog reset. This was made even worse because the
GPU cannot be reset while there's a pending transaction, which left
the GPU permanently wedged.
Patches 1-3 fix the first problem and are independent of the rest of the
series. Patch 5 fixes the second problem and depends on patch 4, so
there will have to be some cross-tree coordination.
I've rebased this series on the latest linux-next to avoid rebase
troubles.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
---
Changes in v5:
- Don't read CONTEXTIDR for stage 2 domains.
- Clarify that we don't need TLB invalidation when changing
SMMU_CBn_SCTLR.CFCFG.
- Link to v4: https://lore.kernel.org/r/20250304-msm-gpu-fault-fixes-next-v4-0-be14be37f4c3@gmail.com
Changes in v4:
- Add patches 1-2, which fix reading registers in drm/msm when
acknowledging the fault early. This was Robin's preferred solution
compared to making drm/msm's fault handler tell arm-smmu to resume the
fault.
- Link to v3: https://lore.kernel.org/r/20250122-msm-gpu-fault-fixes-next-v3-0-0afa00158521@gmail.com
Changes in v3:
- Acknowledge the fault before resuming the transaction in patch 1.
- Add suggested extra context to commit messages.
- Link to v2: https://lore.kernel.org/r/20250120-msm-gpu-fault-fixes-next-v2-0-d636c4027042@gmail.com
Changes in v2:
- Remove unnecessary _irqsave when locking in IRQ handler (Robin)
- Reuse existing spinlock for CFIE manipulation (Robin)
- Lock CFCFG manipulation against concurrent CFIE manipulation
- Don't use timer to re-enable stall-on-fault. (Rob)
- Use more descriptive name for the function that re-enables
stall-on-fault if the cooldown period has ended. (Rob)
- Link to v1: https://lore.kernel.org/r/20250117-msm-gpu-fault-fixes-next-v1-0-bc9b332b5d0b@gmail.com
---
Connor Abbott (5):
iommu/arm-smmu: Save additional information on context fault
iommu/arm-smmu-qcom: Don't read fault registers directly
iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
iommu/arm-smmu-qcom: Make set_stall work when the device is on
drm/msm: Temporarily disable stall-on-fault after a page fault
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 2 +
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 42 +++++++++++-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 26 ++++++++
drivers/gpu/drm/msm/msm_iommu.c | 9 +++
drivers/gpu/drm/msm/msm_mmu.h | 1 +
drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c | 6 +-
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 67 +++++++++++++++----
drivers/iommu/arm/arm-smmu/arm-smmu.c | 84 ++++++++++++++++++------
drivers/iommu/arm/arm-smmu/arm-smmu.h | 21 +++---
10 files changed, 216 insertions(+), 46 deletions(-)
---
base-commit: 866e43b945bf98f8e807dfa45eca92f931f3a032
change-id: 20250117-msm-gpu-fault-fixes-next-96e3098023e1
Best regards,
--
Connor Abbott <cwabbott0@gmail.com>
* [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
@ 2025-03-19 14:44 ` Connor Abbott
2025-05-06 11:32 ` Will Deacon
2025-03-19 14:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Don't read fault registers directly Connor Abbott
` (3 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:44 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
This will be used by drm/msm for GPU page faults, replacing the manual
register reading it does.
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c | 6 ++--
drivers/iommu/arm/arm-smmu/arm-smmu.c | 35 ++++++++++++++----------
drivers/iommu/arm/arm-smmu/arm-smmu.h | 7 +++--
3 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
index 548783f3f8e89fd978367afa65c473002f66e2e7..5bf1aa4aa941962710f1f14260e133d560aee86f 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c
@@ -393,14 +393,14 @@ irqreturn_t qcom_smmu_context_fault(int irq, void *dev)
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- arm_smmu_read_context_fault_info(smmu, idx, &cfi);
+ arm_smmu_read_context_fault_info(smmu_domain, &cfi);
if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT))
return IRQ_NONE;
if (list_empty(&tbu_list)) {
ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
- cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+ cfi.fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
if (ret == -ENOSYS)
arm_smmu_print_context_fault_info(smmu, idx, &cfi);
@@ -412,7 +412,7 @@ irqreturn_t qcom_smmu_context_fault(int irq, void *dev)
phys_soft = ops->iova_to_phys(ops, cfi.iova);
tmp = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
- cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+ cfi.fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
if (!tmp || tmp == -EBUSY) {
ret = IRQ_HANDLED;
resume = ARM_SMMU_RESUME_TERMINATE;
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index ade4684c14c9b2724a71e2457288dbfaf7562c83..a02078eb968b81a35c1c086ed7007ea2a453ef94 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -405,13 +405,20 @@ static const struct iommu_flush_ops arm_smmu_s2_tlb_ops_v1 = {
};
-void arm_smmu_read_context_fault_info(struct arm_smmu_device *smmu, int idx,
+void arm_smmu_read_context_fault_info(struct arm_smmu_domain *smmu_domain,
struct arm_smmu_context_fault_info *cfi)
{
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ int idx = smmu_domain->cfg.cbndx;
+
cfi->iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
+ cfi->ttbr0 = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_TTBR0);
cfi->fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
- cfi->fsynr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
+ cfi->fsynr0 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
+ cfi->fsynr1 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR1);
cfi->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
+ if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
+ cfi->contextidr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_CONTEXTIDR);
}
void arm_smmu_print_context_fault_info(struct arm_smmu_device *smmu, int idx,
@@ -419,7 +426,7 @@ void arm_smmu_print_context_fault_info(struct arm_smmu_device *smmu, int idx,
{
dev_err(smmu->dev,
"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
- cfi->fsr, cfi->iova, cfi->fsynr, cfi->cbfrsynra, idx);
+ cfi->fsr, cfi->iova, cfi->fsynr0, cfi->cbfrsynra, idx);
dev_err(smmu->dev, "FSR = %08x [%s%sFormat=%u%s%s%s%s%s%s%s%s], SID=0x%x\n",
cfi->fsr,
@@ -437,15 +444,15 @@ void arm_smmu_print_context_fault_info(struct arm_smmu_device *smmu, int idx,
cfi->cbfrsynra);
dev_err(smmu->dev, "FSYNR0 = %08x [S1CBNDX=%u%s%s%s%s%s%s PLVL=%u]\n",
- cfi->fsynr,
- (u32)FIELD_GET(ARM_SMMU_CB_FSYNR0_S1CBNDX, cfi->fsynr),
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_AFR) ? " AFR" : "",
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_PTWF) ? " PTWF" : "",
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_NSATTR) ? " NSATTR" : "",
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_IND) ? " IND" : "",
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_PNU) ? " PNU" : "",
- (cfi->fsynr & ARM_SMMU_CB_FSYNR0_WNR) ? " WNR" : "",
- (u32)FIELD_GET(ARM_SMMU_CB_FSYNR0_PLVL, cfi->fsynr));
+ cfi->fsynr0,
+ (u32)FIELD_GET(ARM_SMMU_CB_FSYNR0_S1CBNDX, cfi->fsynr0),
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_AFR) ? " AFR" : "",
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_PTWF) ? " PTWF" : "",
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_NSATTR) ? " NSATTR" : "",
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_IND) ? " IND" : "",
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_PNU) ? " PNU" : "",
+ (cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_WNR) ? " WNR" : "",
+ (u32)FIELD_GET(ARM_SMMU_CB_FSYNR0_PLVL, cfi->fsynr0));
}
static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
@@ -458,13 +465,13 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
int idx = smmu_domain->cfg.cbndx;
int ret;
- arm_smmu_read_context_fault_info(smmu, idx, &cfi);
+ arm_smmu_read_context_fault_info(smmu_domain, &cfi);
if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT))
return IRQ_NONE;
ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
- cfi.fsynr & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+ cfi.fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
if (ret == -ENOSYS && __ratelimit(&rs))
arm_smmu_print_context_fault_info(smmu, idx, &cfi);
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index e2aeb511ae903302e3c15d2cf5f22e2a26ac2346..ef6915a0d9f62b0a1734a3ee57ea422615036094 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -543,12 +543,15 @@ int arm_mmu500_reset(struct arm_smmu_device *smmu);
struct arm_smmu_context_fault_info {
unsigned long iova;
+ u64 ttbr0;
u32 fsr;
- u32 fsynr;
+ u32 fsynr0;
+ u32 fsynr1;
u32 cbfrsynra;
+ u32 contextidr;
};
-void arm_smmu_read_context_fault_info(struct arm_smmu_device *smmu, int idx,
+void arm_smmu_read_context_fault_info(struct arm_smmu_domain *smmu_domain,
struct arm_smmu_context_fault_info *cfi);
void arm_smmu_print_context_fault_info(struct arm_smmu_device *smmu, int idx,
--
2.47.1
* [PATCH v5 2/5] iommu/arm-smmu-qcom: Don't read fault registers directly
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
2025-03-19 14:44 ` [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
@ 2025-03-19 14:44 ` Connor Abbott
2025-03-19 14:44 ` [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault Connor Abbott
` (2 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:44 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
In some cases drm/msm has to resume a stalled transaction directly in
its fault handler. Experimentally this doesn't work on SMMU500 if the
fault hasn't already been acknowledged by clearing FSR. Rather than
trying to clear FSR in msm's fault handler and implementing a
tricky handshake to avoid accidentally clearing FSR twice, we want to
clear FSR before calling the fault handlers. However, this means that
the contents of the registers can change underneath us in the fault
handler, and msm currently uses a private function to read the register
contents for its own purposes in its fault handler, such as using the
implementation-defined FSYNR1 to determine which block caused the fault.
Fix this by making msm use the register values already read by arm-smmu
itself before clearing FSR, rather than reading the registers directly.
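For illustration only, here is a minimal userspace sketch of the
snapshot-then-acknowledge idea described above. It is not the driver
code: the register model, struct names and helper names are all
hypothetical, and FSR is simply modelled as write-one-to-clear.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one context bank's fault registers. */
struct fake_cb_regs {
	uint64_t far;
	uint32_t fsr;
	uint32_t fsynr1;
};

/* Snapshot taken by the IRQ handler before it acknowledges the fault. */
struct fault_snapshot {
	uint64_t iova;
	uint32_t fsr;
	uint32_t fsynr1;
};

/* Copy the registers first, then acknowledge by clearing the FSR bits. */
static void snapshot_then_ack(struct fake_cb_regs *regs,
			      struct fault_snapshot *snap)
{
	snap->iova = regs->far;
	snap->fsr = regs->fsr;
	snap->fsynr1 = regs->fsynr1;

	/* Modelled as write-one-to-clear: clear exactly what was read. */
	regs->fsr &= ~snap->fsr;
	assert((regs->fsr & snap->fsr) == 0);
}

/* Simulate the hardware recording a later fault in the same registers. */
static void hw_record_fault(struct fake_cb_regs *regs, uint64_t far,
			    uint32_t fsr, uint32_t fsynr1)
{
	regs->far = far;
	regs->fsr |= fsr;
	regs->fsynr1 = fsynr1;
}
```

A fault handler that consumes the snapshot keeps seeing the values of
the fault it was called for, even if a second fault overwrites the
registers after the acknowledgement.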
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 19 +++++++++----------
drivers/iommu/arm/arm-smmu/arm-smmu.c | 14 +++++++-------
drivers/iommu/arm/arm-smmu/arm-smmu.h | 21 +++++++++++----------
3 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 6372f3e25c4bc24cb52f9233095170e8aa510a53..186d6ad4fd1c990398df4dec53f4d58ada9e658c 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -62,16 +62,15 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
struct adreno_smmu_fault_info *info)
{
struct arm_smmu_domain *smmu_domain = (void *)cookie;
- struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
- struct arm_smmu_device *smmu = smmu_domain->smmu;
-
- info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
- info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
- info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
- info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
- info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
- info->ttbr0 = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
- info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+ struct arm_smmu_context_fault_info *cfi = &smmu_domain->cfi;
+
+ info->fsr = cfi->fsr;
+ info->fsynr0 = cfi->fsynr0;
+ info->fsynr1 = cfi->fsynr1;
+ info->far = cfi->iova;
+ info->cbfrsynra = cfi->cbfrsynra;
+ info->ttbr0 = cfi->ttbr0;
+ info->contextidr = cfi->contextidr;
}
static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index a02078eb968b81a35c1c086ed7007ea2a453ef94..c7b5d7c093e71050d29a834c8d33125e96b04d81 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -457,26 +457,26 @@ void arm_smmu_print_context_fault_info(struct arm_smmu_device *smmu, int idx,
static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
{
- struct arm_smmu_context_fault_info cfi;
struct arm_smmu_domain *smmu_domain = dev;
+ struct arm_smmu_context_fault_info *cfi = &smmu_domain->cfi;
struct arm_smmu_device *smmu = smmu_domain->smmu;
static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
int idx = smmu_domain->cfg.cbndx;
int ret;
- arm_smmu_read_context_fault_info(smmu_domain, &cfi);
+ arm_smmu_read_context_fault_info(smmu_domain, cfi);
- if (!(cfi.fsr & ARM_SMMU_CB_FSR_FAULT))
+ if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
return IRQ_NONE;
- ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi.iova,
- cfi.fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+ ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi->iova,
+ cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
if (ret == -ENOSYS && __ratelimit(&rs))
- arm_smmu_print_context_fault_info(smmu, idx, &cfi);
+ arm_smmu_print_context_fault_info(smmu, idx, cfi);
- arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi.fsr);
+ arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi->fsr);
return IRQ_HANDLED;
}
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index ef6915a0d9f62b0a1734a3ee57ea422615036094..ff84ce3b8d8567d3402e945e8277ca2a87df9a4e 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -373,6 +373,16 @@ enum arm_smmu_domain_stage {
ARM_SMMU_DOMAIN_NESTED,
};
+struct arm_smmu_context_fault_info {
+ unsigned long iova;
+ u64 ttbr0;
+ u32 fsr;
+ u32 fsynr0;
+ u32 fsynr1;
+ u32 cbfrsynra;
+ u32 contextidr;
+};
+
struct arm_smmu_domain {
struct arm_smmu_device *smmu;
struct io_pgtable_ops *pgtbl_ops;
@@ -380,6 +390,7 @@ struct arm_smmu_domain {
const struct iommu_flush_ops *flush_ops;
struct arm_smmu_cfg cfg;
enum arm_smmu_domain_stage stage;
+ struct arm_smmu_context_fault_info cfi;
struct mutex init_mutex; /* Protects smmu pointer */
spinlock_t cb_lock; /* Serialises ATS1* ops and TLB syncs */
struct iommu_domain domain;
@@ -541,16 +552,6 @@ struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device *smmu);
void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx);
int arm_mmu500_reset(struct arm_smmu_device *smmu);
-struct arm_smmu_context_fault_info {
- unsigned long iova;
- u64 ttbr0;
- u32 fsr;
- u32 fsynr0;
- u32 fsynr1;
- u32 cbfrsynra;
- u32 contextidr;
-};
-
void arm_smmu_read_context_fault_info(struct arm_smmu_domain *smmu_domain,
struct arm_smmu_context_fault_info *cfi);
--
2.47.1
* [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
2025-03-19 14:44 ` [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
2025-03-19 14:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Don't read fault registers directly Connor Abbott
@ 2025-03-19 14:44 ` Connor Abbott
2025-05-06 12:24 ` Will Deacon
2025-03-19 14:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Make set_stall work when the device is on Connor Abbott
2025-03-19 14:44 ` [PATCH v5 5/5] drm/msm: Temporarily disable stall-on-fault after a page fault Connor Abbott
4 siblings, 1 reply; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:44 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
On some SMMUv2 implementations, including MMU-500, SMMU_CBn_FSR.SS
asserts an interrupt. The only way to clear that bit is to resume the
transaction by writing SMMU_CBn_RESUME, but typically resuming the
transaction requires complex operations (copying in pages, etc.) that
can't be done in IRQ context. drm/msm already has a problem, because
its fault handler sometimes schedules a job to dump the GPU state and
doesn't resume translation until this is complete.
Work around this by disabling context fault interrupts until after the
transaction is resumed. Because other context banks can share an IRQ
line, we may still get an interrupt intended for another context bank.
In that case only SMMU_CBn_FSR.SS will be asserted for this bank and,
since its interrupts are already disabled, we can skip it. This is
accomplished by removing the bit from ARM_SMMU_CB_FSR_FAULT.
SMMU_CBn_FSR.SS won't be asserted unless an external user enabled
stall-on-fault, and they are expected to resume the translation and
re-enable interrupts.
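For illustration only, the ordering described above can be sketched as
a small userspace model. This is not the driver code: the context bank
struct, the helper names and the interrupt model (line asserted while
CFIE is set and any FSR bit is pending) are assumptions, the bit
positions are illustrative, and the locking done in the real handlers
is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions for the simulated registers. */
#define FAKE_SCTLR_CFIE	(1u << 6)
#define FAKE_FSR_TF	(1u << 1)
#define FAKE_FSR_SS	(1u << 30)

/* Hypothetical model of one context bank. */
struct fake_cb {
	uint32_t sctlr;
	uint32_t fsr;
};

/* Assumed model: the IRQ line is level-triggered on any pending FSR bit. */
static bool irq_asserted(const struct fake_cb *cb)
{
	return (cb->sctlr & FAKE_SCTLR_CFIE) && cb->fsr != 0;
}

/* IRQ side: mask the bank if it is stalled, then acknowledge the fault. */
static void fake_fault_irq(struct fake_cb *cb)
{
	uint32_t fsr = cb->fsr;

	if (fsr & FAKE_FSR_SS)
		cb->sctlr &= ~FAKE_SCTLR_CFIE;

	/* The ack clears the fault bits; SS stays set until the resume. */
	cb->fsr &= ~(fsr & ~FAKE_FSR_SS);
}

/* Resume side: writing RESUME clears SS, then interrupts are re-enabled. */
static void fake_resume(struct fake_cb *cb)
{
	assert(cb->fsr & FAKE_FSR_SS);
	cb->fsr &= ~FAKE_FSR_SS;
	cb->sctlr |= FAKE_SCTLR_CFIE;
}
```

In this model the line drops as soon as the handler returns, even
though SS is still set, and it only becomes eligible to fire again
after the resume path has run.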
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 15 ++++++++++-
drivers/iommu/arm/arm-smmu/arm-smmu.c | 41 +++++++++++++++++++++++++++++-
drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 -
3 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 186d6ad4fd1c990398df4dec53f4d58ada9e658c..a428e53add08d451fb2152e3ab80e0fba936e214 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -90,12 +90,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina
struct arm_smmu_domain *smmu_domain = (void *)cookie;
struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
struct arm_smmu_device *smmu = smmu_domain->smmu;
- u32 reg = 0;
+ u32 reg = 0, sctlr;
+ unsigned long flags;
if (terminate)
reg |= ARM_SMMU_RESUME_TERMINATE;
+ spin_lock_irqsave(&smmu_domain->cb_lock, flags);
+
arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+
+ /*
+ * Re-enable interrupts after they were disabled by
+ * arm_smmu_context_fault().
+ */
+ sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR);
+ sctlr |= ARM_SMMU_SCTLR_CFIE;
+ arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr);
+
+ spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
}
#define QCOM_ADRENO_SMMU_GPU_SID 0
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
return IRQ_NONE;
+ /*
+ * On some implementations FSR.SS asserts a context fault
+ * interrupt. We do not want this behavior, because resolving the
+ * original context fault typically requires operations that cannot be
+ * performed in IRQ context but leaving the stall unacknowledged will
+ * immediately lead to another spurious interrupt as FSR.SS is still
+ * set. Work around this by disabling interrupts for this context bank.
+ * It's expected that interrupts are re-enabled after resuming the
+ * translation.
+ *
+ * We have to do this before report_iommu_fault() so that we don't
+ * leave interrupts disabled in case the downstream user decides the
+ * fault can be resolved inside its fault handler.
+ *
+ * There is a possible race if there are multiple context banks sharing
+ * the same interrupt and both signal an interrupt in between writing
+ * RESUME and SCTLR. We could disable interrupts here before we
+ * re-enable them in the resume handler, leaving interrupts enabled.
+ * Lock the write to serialize it with the resume handler.
+ */
+ if (cfi->fsr & ARM_SMMU_CB_FSR_SS) {
+ u32 val;
+
+ spin_lock(&smmu_domain->cb_lock);
+ val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR);
+ val &= ~ARM_SMMU_SCTLR_CFIE;
+ arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val);
+ spin_unlock(&smmu_domain->cb_lock);
+ }
+
+ /*
+ * The SMMUv2 architecture specification says that if stall-on-fault is
+ * enabled the correct sequence is to write to SMMU_CBn_FSR to clear
+ * the fault and then write to SMMU_CBn_RESUME. Clear the interrupt
+ * first before running the user's fault handler to make sure we follow
+ * this sequence. It should be ok if there is another fault in the
+ * meantime because we have already read the fault info.
+ */
+ arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi->fsr);
+
ret = report_iommu_fault(&smmu_domain->domain, NULL, cfi->iova,
cfi->fsynr0 & ARM_SMMU_CB_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
if (ret == -ENOSYS && __ratelimit(&rs))
arm_smmu_print_context_fault_info(smmu, idx, cfi);
- arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, cfi->fsr);
return IRQ_HANDLED;
}
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index ff84ce3b8d8567d3402e945e8277ca2a87df9a4e..5fe8e482457f905529a08aea14ea5656d3e31328 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -214,7 +214,6 @@ enum arm_smmu_cbar_type {
ARM_SMMU_CB_FSR_TLBLKF)
#define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \
- ARM_SMMU_CB_FSR_SS | \
ARM_SMMU_CB_FSR_UUT | \
ARM_SMMU_CB_FSR_EF | \
ARM_SMMU_CB_FSR_PF | \
--
2.47.1
* [PATCH v5 4/5] iommu/arm-smmu-qcom: Make set_stall work when the device is on
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
` (2 preceding siblings ...)
2025-03-19 14:44 ` [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault Connor Abbott
@ 2025-03-19 14:44 ` Connor Abbott
2025-03-19 14:44 ` [PATCH v5 5/5] drm/msm: Temporarily disable stall-on-fault after a page fault Connor Abbott
4 siblings, 0 replies; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:44 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
Up until now we have only called the set_stall callback during
initialization when the device is off. But we will soon start calling it
to temporarily disable stall-on-fault when the device is on, so handle
that by checking if the device is on and writing SCTLR.
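For illustration only, a minimal userspace sketch of the behaviour
described above: always record the requested setting, but touch the
register only when the setting changed and the device is powered. This
is not the driver code. The struct, the `powered` flag standing in for
the runtime PM check, and the write counter are all hypothetical, and
the sketch does not model how the saved setting reaches the hardware
when the device is powered on later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_NUM_CBS		8
#define FAKE_SCTLR_CFCFG	(1u << 7)	/* illustrative bit position */

/* Hypothetical model of the SMMU state relevant to stall-on-fault. */
struct fake_smmu {
	uint32_t stall_enabled;		/* one bit per context bank */
	uint32_t sctlr[FAKE_NUM_CBS];	/* simulated per-bank SCTLR */
	bool powered;			/* stands in for the runtime PM check */
	unsigned int reg_writes;	/* counts simulated register writes */
};

static void fake_set_stall(struct fake_smmu *s, unsigned int cb, bool enabled)
{
	uint32_t mask;
	bool changed;

	assert(cb < FAKE_NUM_CBS);
	mask = 1u << cb;
	changed = !!(s->stall_enabled & mask) != enabled;

	/* The software copy is always updated. */
	if (enabled)
		s->stall_enabled |= mask;
	else
		s->stall_enabled &= ~mask;

	/* Only touch the hardware if something changed and it is on. */
	if (!changed || !s->powered)
		return;

	if (enabled)
		s->sctlr[cb] |= FAKE_SCTLR_CFCFG;
	else
		s->sctlr[cb] &= ~FAKE_SCTLR_CFCFG;
	s->reg_writes++;
}
```

Skipping the write when nothing changed keeps repeated calls cheap,
which matters once the callback is invoked from the fault path.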
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 +++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index a428e53add08d451fb2152e3ab80e0fba936e214..f6bb405573be0ed480a2587a5a780dd711b8d2aa 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -77,12 +77,39 @@ static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
{
struct arm_smmu_domain *smmu_domain = (void *)cookie;
struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
- struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+ u32 mask = BIT(cfg->cbndx);
+ bool stall_changed = !!(qsmmu->stall_enabled & mask) != enabled;
+ unsigned long flags;
if (enabled)
- qsmmu->stall_enabled |= BIT(cfg->cbndx);
+ qsmmu->stall_enabled |= mask;
else
- qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+ qsmmu->stall_enabled &= ~mask;
+
+ /*
+ * If the device is on and we changed the setting, update the register.
+ * The spec pseudocode says that CFCFG is resampled after a fault, and
+ * we believe that no implementations cache it in the TLB, so it should
+ * be safe to change it without a TLB invalidation.
+ */
+ if (stall_changed && pm_runtime_get_if_active(smmu->dev) > 0) {
+ spin_lock_irqsave(&smmu_domain->cb_lock, flags);
+
+ u32 reg = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR);
+
+ if (enabled)
+ reg |= ARM_SMMU_SCTLR_CFCFG;
+ else
+ reg &= ~ARM_SMMU_SCTLR_CFCFG;
+
+ arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, reg);
+
+ spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
+
+ pm_runtime_put_autosuspend(smmu->dev);
+ }
}
static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
--
2.47.1
* [PATCH v5 5/5] drm/msm: Temporarily disable stall-on-fault after a page fault
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
` (3 preceding siblings ...)
2025-03-19 14:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Make set_stall work when the device is on Connor Abbott
@ 2025-03-19 14:44 ` Connor Abbott
4 siblings, 0 replies; 14+ messages in thread
From: Connor Abbott @ 2025-03-19 14:44 UTC (permalink / raw)
To: Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel, Sean Paul,
Konrad Dybcio, Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten
Cc: iommu, linux-arm-msm, linux-arm-kernel, freedreno, Connor Abbott
When things go wrong, the GPU is capable of quickly generating millions
of faulting translation requests per second. When that happens, in the
stall-on-fault model each access will stall until it wins the race to
signal the fault and then the RESUME register is written. This slows
processing page faults to a crawl as the GPU can generate faults much
faster than the CPU can acknowledge them. It also means that all
available resources in the SMMU are saturated waiting for the stalled
transactions, so that other transactions such as transactions generated
by the GMU, which shares translation resources with the GPU, cannot
proceed. This causes a GMU watchdog timeout, which leads to a failed
reset, because GX cannot collapse when there is a transaction pending,
and to a permanently hung GPU.
On older platforms with qcom,smmu-v2, it seems that when one transaction
is stalled subsequent faulting transactions are terminated, which avoids
this problem, but the MMU-500 follows the spec here.
To work around these problems, disable stall-on-fault as soon as we get a
page fault and keep it disabled until a cooldown period after page
faults stop. This allows the GMU some guaranteed time to continue
working. We only use
stall-on-fault to halt the GPU while we collect a devcoredump and we
always terminate the transaction afterward, so it's fine to miss some
subsequent page faults. We also keep it disabled so long as the current
devcoredump hasn't been deleted, because in that case we likely won't
capture another one if there's a fault.
After this commit HFI messages still occasionally time out, because the
crashdump handler doesn't run fast enough to let the GMU resume, but the
driver seems to recover from it. This will probably go away after the
HFI timeout is increased.
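For illustration only, the cooldown policy described above can be
sketched in userspace with explicit millisecond timestamps. This is not
the driver code: the struct and function names are hypothetical, time
is passed in by the caller and assumed to be monotonic, and locking is
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_COOLDOWN_MS 500

/* Hypothetical model of the per-GPU stall-on-fault state. */
struct fake_gpu {
	bool stall_enabled;
	bool crashstate_pending;	/* a previous devcoredump still exists */
	uint64_t stall_reenable_ms;	/* earliest time stalling may return */
};

/* Fault path: turn stalling off and push the re-enable deadline out. */
static void fake_on_fault(struct fake_gpu *g, uint64_t now_ms)
{
	/* With monotonic time the deadline never moves backwards. */
	assert(now_ms + FAKE_COOLDOWN_MS >= g->stall_reenable_ms);

	g->stall_enabled = false;
	g->stall_reenable_ms = now_ms + FAKE_COOLDOWN_MS;
}

/*
 * Submit path: re-enable stalling only after the cooldown has passed
 * and only if a new crashdump would actually be captured.
 */
static void fake_on_submit(struct fake_gpu *g, uint64_t now_ms)
{
	if (!g->stall_enabled &&
	    now_ms > g->stall_reenable_ms &&
	    !g->crashstate_pending)
		g->stall_enabled = true;
}
```

Because every fault pushes the deadline out again, a sustained storm
keeps stalling disabled for as long as it lasts, and checking on submit
avoids the need for a timer.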
Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Rob Clark <robdclark@gmail.com>
---
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 2 ++
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
drivers/gpu/drm/msm/adreno/adreno_gpu.c | 42 ++++++++++++++++++++++++++++++++-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 26 ++++++++++++++++++++
drivers/gpu/drm/msm/msm_iommu.c | 9 +++++++
drivers/gpu/drm/msm/msm_mmu.h | 1 +
6 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index 71dca78cd7a5324e9ff5b14f173e2209fa42e196..670141531112c9d29cef8ef1fd51b74759fdd6d2 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -131,6 +131,8 @@ static void a5xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
struct msm_ringbuffer *ring = submit->ring;
unsigned int i, ibs = 0;
+ adreno_check_and_reenable_stall(adreno_gpu);
+
if (IS_ENABLED(CONFIG_DRM_MSM_GPU_SUDO) && submit->in_rb) {
ring->cur_ctx_seqno = 0;
a5xx_submit_in_rb(gpu, submit);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 0ae29a7c8a4d3f74236a35cc919f69d5c0a384a0..5a34cd2109a2d74c92841448a61ccb0d4f34e264 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -212,6 +212,8 @@ static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
struct msm_ringbuffer *ring = submit->ring;
unsigned int i, ibs = 0;
+ adreno_check_and_reenable_stall(adreno_gpu);
+
a6xx_set_pagetable(a6xx_gpu, ring, submit);
get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP(0),
@@ -335,6 +337,8 @@ static void a7xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
struct msm_ringbuffer *ring = submit->ring;
unsigned int i, ibs = 0;
+ adreno_check_and_reenable_stall(adreno_gpu);
+
/*
* Toggle concurrent binning for pagetable switch and set the thread to
* BR since only it can execute the pagetable switch packets.
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 1238f326597808eb28b4c6822cbd41a26e555eb9..bac586101dc0494f46b069a8440a45825dfe9b5e 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -246,16 +246,53 @@ u64 adreno_private_address_space_size(struct msm_gpu *gpu)
return SZ_4G;
}
+void adreno_check_and_reenable_stall(struct adreno_gpu *adreno_gpu)
+{
+ struct msm_gpu *gpu = &adreno_gpu->base;
+ unsigned long flags;
+
+ /*
+ * Wait until the cooldown period has passed and we would actually
+ * collect a crashdump to re-enable stall-on-fault.
+ */
+ spin_lock_irqsave(&adreno_gpu->fault_stall_lock, flags);
+ if (!adreno_gpu->stall_enabled &&
+ ktime_after(ktime_get(), adreno_gpu->stall_reenable_time) &&
+ !READ_ONCE(gpu->crashstate)) {
+ adreno_gpu->stall_enabled = true;
+
+ gpu->aspace->mmu->funcs->set_stall(gpu->aspace->mmu, true);
+ }
+ spin_unlock_irqrestore(&adreno_gpu->fault_stall_lock, flags);
+}
+
#define ARM_SMMU_FSR_TF BIT(1)
#define ARM_SMMU_FSR_PF BIT(3)
#define ARM_SMMU_FSR_EF BIT(4)
+#define ARM_SMMU_FSR_SS BIT(30)
int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
struct adreno_smmu_fault_info *info, const char *block,
u32 scratch[4])
{
+ struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
const char *type = "UNKNOWN";
- bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
+ bool do_devcoredump = info && (info->fsr & ARM_SMMU_FSR_SS) &&
+ !READ_ONCE(gpu->crashstate);
+ unsigned long irq_flags;
+
+ /*
+ * In case there is a subsequent storm of pagefaults, disable
+ * stall-on-fault for at least half a second.
+ */
+ spin_lock_irqsave(&adreno_gpu->fault_stall_lock, irq_flags);
+ if (adreno_gpu->stall_enabled) {
+ adreno_gpu->stall_enabled = false;
+
+ gpu->aspace->mmu->funcs->set_stall(gpu->aspace->mmu, false);
+ }
+ adreno_gpu->stall_reenable_time = ktime_add_ms(ktime_get(), 500);
+ spin_unlock_irqrestore(&adreno_gpu->fault_stall_lock, irq_flags);
/*
* If we aren't going to be resuming later from fault_worker, then do
@@ -1143,6 +1180,9 @@ int adreno_gpu_init(struct drm_device *drm, struct platform_device *pdev,
adreno_gpu->info->inactive_period);
pm_runtime_use_autosuspend(dev);
+ spin_lock_init(&adreno_gpu->fault_stall_lock);
+ adreno_gpu->stall_enabled = true;
+
return msm_gpu_init(drm, pdev, &adreno_gpu->base, &funcs->base,
gpu_name, &adreno_gpu_config);
}
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.h b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
index dcf454629ce037b2a8274a6699674ad754ce1f07..868d22fde1396def35e2f46bc2b2f4eff611d8fd 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.h
@@ -205,6 +205,30 @@ struct adreno_gpu {
/* firmware: */
const struct firmware *fw[ADRENO_FW_MAX];
+ /**
+ * @fault_stall_lock:
+ *
+ * Serialize changes to stall-on-fault state.
+ */
+ spinlock_t fault_stall_lock;
+
+ /**
+ * @stall_reenable_time:
+ *
+ * If stall_enabled is false, when to reenable stall-on-fault.
+ * Protected by @fault_stall_lock.
+ */
+ ktime_t stall_reenable_time;
+
+ /**
+ * @stall_enabled:
+ *
+ * Whether stall-on-fault is currently enabled. Protected by
+ * @fault_stall_lock.
+ */
+ bool stall_enabled;
+
+
struct {
/**
* @rgb565_predicator: Unknown, introduced with A650 family,
@@ -629,6 +653,8 @@ int adreno_fault_handler(struct msm_gpu *gpu, unsigned long iova, int flags,
struct adreno_smmu_fault_info *info, const char *block,
u32 scratch[4]);
+void adreno_check_and_reenable_stall(struct adreno_gpu *gpu);
+
int adreno_read_speedbin(struct device *dev, u32 *speedbin);
/*
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 2a94e82316f95c5f9dcc37ef0a4664a29e3492b2..8d5380e6dcc217c7c209b51527bf15748b3ada71 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -351,6 +351,14 @@ static void msm_iommu_resume_translation(struct msm_mmu *mmu)
adreno_smmu->resume_translation(adreno_smmu->cookie, true);
}
+static void msm_iommu_set_stall(struct msm_mmu *mmu, bool enable)
+{
+ struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
+
+ if (adreno_smmu->set_stall)
+ adreno_smmu->set_stall(adreno_smmu->cookie, enable);
+}
+
static void msm_iommu_detach(struct msm_mmu *mmu)
{
struct msm_iommu *iommu = to_msm_iommu(mmu);
@@ -399,6 +407,7 @@ static const struct msm_mmu_funcs funcs = {
.unmap = msm_iommu_unmap,
.destroy = msm_iommu_destroy,
.resume_translation = msm_iommu_resume_translation,
+ .set_stall = msm_iommu_set_stall,
};
struct msm_mmu *msm_iommu_new(struct device *dev, unsigned long quirks)
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index 88af4f490881f2a6789ae2d03e1c02d10046331a..2694a356a17904e7572b767b16ed0cee806406cf 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -16,6 +16,7 @@ struct msm_mmu_funcs {
int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
void (*destroy)(struct msm_mmu *mmu);
void (*resume_translation)(struct msm_mmu *mmu);
+ void (*set_stall)(struct msm_mmu *mmu, bool enable);
};
enum msm_mmu_type {
--
2.47.1
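For readers following the cooldown logic in this patch, here is a minimal user-space sketch of the state machine. It is not the driver code: the struct, the function names, and the plain millisecond clock are hypothetical stand-ins, the spinlock is omitted, and only the 500 ms constant and the three re-enable conditions come from the hunks above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cooldown taken from the patch; everything else here is a hypothetical model. */
#define STALL_COOLDOWN_MS 500

struct stall_model {
	bool stall_enabled;        /* stands in for adreno_gpu->stall_enabled */
	uint64_t reenable_time_ms; /* stands in for adreno_gpu->stall_reenable_time */
	bool crashstate_pending;   /* stands in for gpu->crashstate being non-NULL */
};

static void model_init(struct stall_model *m)
{
	m->stall_enabled = true;
	m->reenable_time_ms = 0;
	m->crashstate_pending = false;
}

/* Fault path: turn stalling off and push the deadline out on every fault. */
static void model_on_fault(struct stall_model *m, uint64_t now_ms)
{
	m->stall_enabled = false;
	m->reenable_time_ms = now_ms + STALL_COOLDOWN_MS;
}

/*
 * Submit path: re-enable only once the deadline has strictly passed and no
 * crashstate is waiting to be read. Returns true if stalling was re-enabled
 * by this call.
 */
static bool model_on_submit(struct stall_model *m, uint64_t now_ms)
{
	if (!m->stall_enabled && now_ms > m->reenable_time_ms &&
	    !m->crashstate_pending) {
		m->stall_enabled = true;
		return true;
	}
	return false;
}
```

Because every fault moves the deadline, a continuous storm of faults keeps stalling disabled in this model, and the first submit after a quiet 500 ms turns it back on.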
* Re: [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault
2025-03-19 14:44 ` [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
@ 2025-05-06 11:32 ` Will Deacon
2025-05-06 16:26 ` Connor Abbott
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2025-05-06 11:32 UTC (permalink / raw)
To: Connor Abbott
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Wed, Mar 19, 2025 at 10:44:00AM -0400, Connor Abbott wrote:
> This will be used by drm/msm for GPU page faults, replacing the manual
> register reading it does.
>
> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> Reviewed-by: Rob Clark <robdclark@gmail.com>
> ---
> drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c | 6 ++--
> drivers/iommu/arm/arm-smmu/arm-smmu.c | 35 ++++++++++++++----------
> drivers/iommu/arm/arm-smmu/arm-smmu.h | 7 +++--
> 3 files changed, 29 insertions(+), 19 deletions(-)
[...]
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index ade4684c14c9b2724a71e2457288dbfaf7562c83..a02078eb968b81a35c1c086ed7007ea2a453ef94 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -405,13 +405,20 @@ static const struct iommu_flush_ops arm_smmu_s2_tlb_ops_v1 = {
> };
>
>
> -void arm_smmu_read_context_fault_info(struct arm_smmu_device *smmu, int idx,
> +void arm_smmu_read_context_fault_info(struct arm_smmu_domain *smmu_domain,
> struct arm_smmu_context_fault_info *cfi)
> {
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> + int idx = smmu_domain->cfg.cbndx;
> +
> cfi->iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
> + cfi->ttbr0 = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_TTBR0);
> cfi->fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
> - cfi->fsynr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
> + cfi->fsynr0 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
> + cfi->fsynr1 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR1);
> cfi->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
> + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
> + cfi->contextidr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_CONTEXTIDR);
I think this leaves 'cfi->contextidr' uninitialised for stage-2 domains.
We should probably either zero it here or just zero-initialise the whole
'cfi' struct in arm_smmu_context_fault() with a:
struct arm_smmu_context_fault_info cfi = {};
line.
Will
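The concern above can be shown with a small standalone sketch. The struct and function names below are hypothetical stand-ins, not the driver's types, and the register values are made up. Portable C uses `{0}` here; the kernel is built as GNU C, where the empty-brace form in the review comment has the same effect.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct arm_smmu_context_fault_info. */
struct fault_info_model {
	uint64_t iova;
	uint32_t fsr;
	uint32_t contextidr;
};

enum stage_model { STAGE_S1, STAGE_S2 };

/* Fills contextidr only for stage 1, like the quoted hunk. */
static void read_fault_info_model(enum stage_model stage,
				  struct fault_info_model *cfi)
{
	cfi->iova = 0x1000; /* made-up register contents */
	cfi->fsr = 0x2;
	if (stage == STAGE_S1)
		cfi->contextidr = 0x42;
}

/* With a zero-initialised struct, a stage-2 read leaves contextidr at 0. */
static uint32_t contextidr_after_read(enum stage_model stage)
{
	struct fault_info_model cfi = {0};

	read_fault_info_model(stage, &cfi);
	return cfi.contextidr;
}
```

Without the initialiser, the stage-2 path would return whatever happened to be on the stack, which is the case the review comment is pointing at.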
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-03-19 14:44 ` [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault Connor Abbott
@ 2025-05-06 12:24 ` Will Deacon
2025-05-06 14:08 ` Connor Abbott
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2025-05-06 12:24 UTC (permalink / raw)
To: Connor Abbott
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> On some SMMUv2 implementations, including MMU-500, SMMU_CBn_FSR.SS
> asserts an interrupt. The only way to clear that bit is to resume the
> transaction by writing SMMU_CBn_RESUME, but typically resuming the
> transaction requires complex operations (copying in pages, etc.) that
> can't be done in IRQ context. drm/msm already has a problem, because
> its fault handler sometimes schedules a job to dump the GPU state and
> doesn't resume translation until this is complete.
>
> Work around this by disabling context fault interrupts until after the
> transaction is resumed. Because other context banks can share an IRQ
> line, we may still get an interrupt intended for another context bank,
> but in this case only SMMU_CBn_FSR.SS will be asserted and we can skip
> it assuming that interrupts are disabled which is accomplished by
> removing the bit from ARM_SMMU_CB_FSR_FAULT. SMMU_CBn_FSR.SS won't be
> asserted unless an external user enabled stall-on-fault, and they are
> expected to resume the translation and re-enable interrupts.
>
> Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> Reviewed-by: Rob Clark <robdclark@gmail.com>
> ---
> drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 15 ++++++++++-
> drivers/iommu/arm/arm-smmu/arm-smmu.c | 41 +++++++++++++++++++++++++++++-
> drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 -
> 3 files changed, 54 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 186d6ad4fd1c990398df4dec53f4d58ada9e658c..a428e53add08d451fb2152e3ab80e0fba936e214 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -90,12 +90,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina
> struct arm_smmu_domain *smmu_domain = (void *)cookie;
> struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> struct arm_smmu_device *smmu = smmu_domain->smmu;
> - u32 reg = 0;
> + u32 reg = 0, sctlr;
> + unsigned long flags;
>
> if (terminate)
> reg |= ARM_SMMU_RESUME_TERMINATE;
>
> + spin_lock_irqsave(&smmu_domain->cb_lock, flags);
> +
> arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> +
> + /*
> + * Re-enable interrupts after they were disabled by
> + * arm_smmu_context_fault().
> + */
> + sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR);
> + sctlr |= ARM_SMMU_SCTLR_CFIE;
> + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr);
> +
> + spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
> }
>
> #define QCOM_ADRENO_SMMU_GPU_SID 0
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> return IRQ_NONE;
>
> + /*
> + * On some implementations FSR.SS asserts a context fault
> + * interrupt. We do not want this behavior, because resolving the
> + * original context fault typically requires operations that cannot be
> + * performed in IRQ context but leaving the stall unacknowledged will
> + * immediately lead to another spurious interrupt as FSR.SS is still
> + * set. Work around this by disabling interrupts for this context bank.
> + * It's expected that interrupts are re-enabled after resuming the
> + * translation.
s/translation/transaction/
> + *
> + * We have to do this before report_iommu_fault() so that we don't
> + * leave interrupts disabled in case the downstream user decides the
> + * fault can be resolved inside its fault handler.
> + *
> + * There is a possible race if there are multiple context banks sharing
> + * the same interrupt and both signal an interrupt in between writing
> + * RESUME and SCTLR. We could disable interrupts here before we
> + * re-enable them in the resume handler, leaving interrupts enabled.
> + * Lock the write to serialize it with the resume handler.
> + */
I'm struggling to understand this last part. If the resume handler runs
synchronously from report_iommu_fault(), then there's no need for
locking because we're in interrupt context. If the resume handler can
run asynchronously from report_iommu_fault(), then the locking doesn't
help because the code below could clear CFIE right after the resume
handler has set it.
I think the broader problem is that we can end up reporting a spurious
fault when the irq is shared between context banks. However, your change
to ARM_SMMU_CB_FSR_FAULT fixes that afaict:
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> index ff84ce3b8d8567d3402e945e8277ca2a87df9a4e..5fe8e482457f905529a08aea14ea5656d3e31328 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> @@ -214,7 +214,6 @@ enum arm_smmu_cbar_type {
> ARM_SMMU_CB_FSR_TLBLKF)
>
> #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \
> - ARM_SMMU_CB_FSR_SS | \
> ARM_SMMU_CB_FSR_UUT | \
> ARM_SMMU_CB_FSR_EF | \
> ARM_SMMU_CB_FSR_PF | \
because now any subsequent spurious IRQs will return IRQ_NONE. So I
don't see why we need the locking at all. What am I missing?
Also, I think your change probably exposes a benign bug in the qcom
debug code as qcom_tbu_halt() has the following:
if ((fsr & ARM_SMMU_CB_FSR_FAULT) && (fsr & ARM_SMMU_CB_FSR_SS))
...
Will
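A reduced sketch of the mask change discussed above, for illustration only. The four bit positions are the ones defined in the drm/msm hunk earlier in the thread; the two masks are hypothetical and deliberately shortened (the real ARM_SMMU_CB_FSR_FAULT contains more bits), and handles_irq() is an invented name for the early-out check in arm_smmu_context_fault().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in the drm/msm hunk earlier in the thread. */
#define FSR_TF (1u << 1)
#define FSR_PF (1u << 3)
#define FSR_EF (1u << 4)
#define FSR_SS (1u << 30)

/* Shortened, hypothetical masks: with and without the SS bit. */
#define FSR_FAULT_OLD (FSR_SS | FSR_TF | FSR_PF | FSR_EF)
#define FSR_FAULT_NEW (FSR_TF | FSR_PF | FSR_EF)

/* Models the early-out in the handler: false corresponds to IRQ_NONE. */
static bool handles_irq(uint32_t fsr, uint32_t fault_mask)
{
	return (fsr & fault_mask) != 0;
}
```

With the old mask, an FSR value that has only SS set is treated as a fault; with the new mask the same value is skipped, while a value carrying both TF and SS is still handled.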
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-05-06 12:24 ` Will Deacon
@ 2025-05-06 14:08 ` Connor Abbott
2025-05-06 14:53 ` Will Deacon
0 siblings, 1 reply; 14+ messages in thread
From: Connor Abbott @ 2025-05-06 14:08 UTC (permalink / raw)
To: Will Deacon
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Tue, May 6, 2025 at 8:24 AM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > On some SMMUv2 implementations, including MMU-500, SMMU_CBn_FSR.SS
> > asserts an interrupt. The only way to clear that bit is to resume the
> > transaction by writing SMMU_CBn_RESUME, but typically resuming the
> > transaction requires complex operations (copying in pages, etc.) that
> > can't be done in IRQ context. drm/msm already has a problem, because
> > its fault handler sometimes schedules a job to dump the GPU state and
> > doesn't resume translation until this is complete.
> >
> > Work around this by disabling context fault interrupts until after the
> > transaction is resumed. Because other context banks can share an IRQ
> > line, we may still get an interrupt intended for another context bank,
> > but in this case only SMMU_CBn_FSR.SS will be asserted and we can skip
> > it assuming that interrupts are disabled which is accomplished by
> > removing the bit from ARM_SMMU_CB_FSR_FAULT. SMMU_CBn_FSR.SS won't be
> > asserted unless an external user enabled stall-on-fault, and they are
> > expected to resume the translation and re-enable interrupts.
> >
> > Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> > Reviewed-by: Robin Murphy <robin.murphy@arm.com>
> > Reviewed-by: Rob Clark <robdclark@gmail.com>
> > ---
> > drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 15 ++++++++++-
> > drivers/iommu/arm/arm-smmu/arm-smmu.c | 41 +++++++++++++++++++++++++++++-
> > drivers/iommu/arm/arm-smmu/arm-smmu.h | 1 -
> > 3 files changed, 54 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > index 186d6ad4fd1c990398df4dec53f4d58ada9e658c..a428e53add08d451fb2152e3ab80e0fba936e214 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > @@ -90,12 +90,25 @@ static void qcom_adreno_smmu_resume_translation(const void *cookie, bool termina
> > struct arm_smmu_domain *smmu_domain = (void *)cookie;
> > struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> > struct arm_smmu_device *smmu = smmu_domain->smmu;
> > - u32 reg = 0;
> > + u32 reg = 0, sctlr;
> > + unsigned long flags;
> >
> > if (terminate)
> > reg |= ARM_SMMU_RESUME_TERMINATE;
> >
> > + spin_lock_irqsave(&smmu_domain->cb_lock, flags);
> > +
> > arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> > +
> > + /*
> > + * Re-enable interrupts after they were disabled by
> > + * arm_smmu_context_fault().
> > + */
> > + sctlr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR);
> > + sctlr |= ARM_SMMU_SCTLR_CFIE;
> > + arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_SCTLR, sctlr);
> > +
> > + spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
> > }
> >
> > #define QCOM_ADRENO_SMMU_GPU_SID 0
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > return IRQ_NONE;
> >
> > + /*
> > + * On some implementations FSR.SS asserts a context fault
> > + * interrupt. We do not want this behavior, because resolving the
> > + * original context fault typically requires operations that cannot be
> > + * performed in IRQ context but leaving the stall unacknowledged will
> > + * immediately lead to another spurious interrupt as FSR.SS is still
> > + * set. Work around this by disabling interrupts for this context bank.
> > + * It's expected that interrupts are re-enabled after resuming the
> > + * translation.
>
> s/translation/transaction/
>
> > + *
> > + * We have to do this before report_iommu_fault() so that we don't
> > + * leave interrupts disabled in case the downstream user decides the
> > + * fault can be resolved inside its fault handler.
> > + *
> > + * There is a possible race if there are multiple context banks sharing
> > + * the same interrupt and both signal an interrupt in between writing
> > + * RESUME and SCTLR. We could disable interrupts here before we
> > + * re-enable them in the resume handler, leaving interrupts enabled.
> > + * Lock the write to serialize it with the resume handler.
> > + */
>
> I'm struggling to understand this last part. If the resume handler runs
> synchronously from report_iommu_fault(), then there's no need for
> locking because we're in interrupt context. If the resume handler can
> run asynchronously from report_iommu_fault(), then the locking doesn't
> help because the code below could clear CFIE right after the resume
> handler has set it.
The problem is indeed when the resume handler runs asynchronously.
Clearing CFIE right after the resume handler has set it is normal and
expected. The issue is the opposite, i.e. something like:
- Resume handler writes RESUME and stalls for some reason
- The interrupt handler runs through and clears CFIE while it's already cleared
- Resume handler sets CFIE, assuming that the handler hasn't run yet
but it actually has
This wouldn't happen with only one context bank, because we wouldn't
get an interrupt until the resume handler sets CFIE, but with multiple
context banks and a shared interrupt line we could get a "spurious"
interrupt due to a fault in an earlier context bank that becomes not
spurious if the resume handler writes RESUME before the context fault
handler for this bank reads FSR above.
Connor
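The interleaving described above can be replayed as a deterministic toy model. This is only a sketch under simplifying assumptions: one context bank, two booleans in place of the real registers, and hypothetical names throughout. The "locked" replay assumes, as the patch comment says, that the handler's SCTLR write is serialized against the whole RESUME plus CFIE sequence.

```c
#include <stdbool.h>

/* Toy model of one context bank; field names are hypothetical. */
struct cb_model {
	bool cfie;          /* SCTLR.CFIE: context fault interrupt enable */
	bool fault_pending; /* a stalled fault is recorded in FSR */
};

/* Resume path, step 1: writing RESUME retires the stalled transaction. */
static void write_resume(struct cb_model *cb)
{
	cb->fault_pending = false;
}

/* Resume path, step 2: re-enable context fault interrupts. */
static void set_cfie(struct cb_model *cb)
{
	cb->cfie = true;
}

/* Hardware: the device faults again as soon as it is allowed to run. */
static void device_faults_again(struct cb_model *cb)
{
	cb->fault_pending = true;
}

/* IRQ handler entered via the shared line: mask if a fault is seen. */
static void irq_handler(struct cb_model *cb)
{
	if (cb->fault_pending)
		cb->cfie = false;
}

/* Bad end state: interrupts enabled while a stalled fault is pending. */
static bool storm_possible(const struct cb_model *cb)
{
	return cb->cfie && cb->fault_pending;
}

/* Handler's write lands between RESUME and the CFIE write. */
static bool replay_unlocked(void)
{
	struct cb_model cb = { .cfie = false, .fault_pending = true };

	write_resume(&cb);
	device_faults_again(&cb);
	irq_handler(&cb); /* clears CFIE, which was already clear */
	set_cfie(&cb);    /* resume path re-enables behind the handler */
	return storm_possible(&cb);
}

/* Handler's write waits until RESUME and the CFIE write are both done. */
static bool replay_locked(void)
{
	struct cb_model cb = { .cfie = false, .fault_pending = true };

	write_resume(&cb);
	device_faults_again(&cb);
	set_cfie(&cb);
	irq_handler(&cb); /* runs last, so CFIE ends up clear */
	return storm_possible(&cb);
}
```

In this model replay_unlocked() ends with interrupts enabled and a fault pending, while replay_locked() ends with interrupts masked until the next resume.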
>
> I think the broader problem is that we can end up reporting a spurious
> fault when the irq is shared between context banks. However, your change
> to ARM_SMMU_CB_FSR_FAULT fixes that afaict:
>
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > index ff84ce3b8d8567d3402e945e8277ca2a87df9a4e..5fe8e482457f905529a08aea14ea5656d3e31328 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> > @@ -214,7 +214,6 @@ enum arm_smmu_cbar_type {
> > ARM_SMMU_CB_FSR_TLBLKF)
> >
> > #define ARM_SMMU_CB_FSR_FAULT (ARM_SMMU_CB_FSR_MULTI | \
> > - ARM_SMMU_CB_FSR_SS | \
> > ARM_SMMU_CB_FSR_UUT | \
> > ARM_SMMU_CB_FSR_EF | \
> > ARM_SMMU_CB_FSR_PF | \
>
> because now any subsequent spurious IRQs will return IRQ_NONE. So I
> don't see why we need the locking at all. What am I missing?
>
> Also, I think your change probably exposes a benign bug in the qcom
> debug code as qcom_tbu_halt() has the following:
>
> if ((fsr & ARM_SMMU_CB_FSR_FAULT) && (fsr & ARM_SMMU_CB_FSR_SS))
> ...
>
> Will
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-05-06 14:08 ` Connor Abbott
@ 2025-05-06 14:53 ` Will Deacon
2025-05-06 15:18 ` Connor Abbott
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2025-05-06 14:53 UTC (permalink / raw)
To: Connor Abbott
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> On Tue, May 6, 2025 at 8:24 AM Will Deacon <will@kernel.org> wrote:
> > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > > return IRQ_NONE;
> > >
> > > + /*
> > > + * On some implementations FSR.SS asserts a context fault
> > > + * interrupt. We do not want this behavior, because resolving the
> > > + * original context fault typically requires operations that cannot be
> > > + * performed in IRQ context but leaving the stall unacknowledged will
> > > + * immediately lead to another spurious interrupt as FSR.SS is still
> > > + * set. Work around this by disabling interrupts for this context bank.
> > > + * It's expected that interrupts are re-enabled after resuming the
> > > + * translation.
> >
> > s/translation/transaction/
> >
> > > + *
> > > + * We have to do this before report_iommu_fault() so that we don't
> > > + * leave interrupts disabled in case the downstream user decides the
> > > + * fault can be resolved inside its fault handler.
> > > + *
> > > + * There is a possible race if there are multiple context banks sharing
> > > + * the same interrupt and both signal an interrupt in between writing
> > > + * RESUME and SCTLR. We could disable interrupts here before we
> > > + * re-enable them in the resume handler, leaving interrupts enabled.
> > > + * Lock the write to serialize it with the resume handler.
> > > + */
> >
> > I'm struggling to understand this last part. If the resume handler runs
> > synchronously from report_iommu_fault(), then there's no need for
> > locking because we're in interrupt context. If the resume handler can
> > run asynchronously from report_iommu_fault(), then the locking doesn't
> > help because the code below could clear CFIE right after the resume
> > handler has set it.
>
> The problem is indeed when the resume handler runs asynchronously.
> Clearing CFIE right after the resume handler has set it is normal and
> expected. The issue is the opposite, i.e. something like:
>
> - Resume handler writes RESUME and stalls for some reason
> - The interrupt handler runs through and clears CFIE while it's already cleared
> - Resume handler sets CFIE, assuming that the handler hasn't run yet
> but it actually has
>
> This wouldn't happen with only one context bank, because we wouldn't
> get an interrupt until the resume handler sets CFIE, but with multiple
> context banks and a shared interrupt line we could get a "spurious"
> interrupt due to a fault in an earlier context bank that becomes not
> spurious if the resume handler writes RESUME before the context fault
> handler for this bank reads FSR above.
Ah, gotcha. Thanks for the explanation.
If we moved the RESUME+CFIE into the interrupt handler after the call
to report_iommu_fault(), would it be possible to run the handler as a
threaded irq (see 'context_fault_needs_threaded_irq') and handle the
callback synchronously? In that case, I think we could avoid taking the
lock if we wrote CFIE _before_ RESUME.
Will
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-05-06 14:53 ` Will Deacon
@ 2025-05-06 15:18 ` Connor Abbott
2025-05-15 14:46 ` Will Deacon
0 siblings, 1 reply; 14+ messages in thread
From: Connor Abbott @ 2025-05-06 15:18 UTC (permalink / raw)
To: Will Deacon
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Tue, May 6, 2025 at 10:53 AM Will Deacon <will@kernel.org> wrote:
>
> On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> > On Tue, May 6, 2025 at 8:24 AM Will Deacon <will@kernel.org> wrote:
> > > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > > > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > > > return IRQ_NONE;
> > > >
> > > > + /*
> > > > + * On some implementations FSR.SS asserts a context fault
> > > > + * interrupt. We do not want this behavior, because resolving the
> > > > + * original context fault typically requires operations that cannot be
> > > > + * performed in IRQ context but leaving the stall unacknowledged will
> > > > + * immediately lead to another spurious interrupt as FSR.SS is still
> > > > + * set. Work around this by disabling interrupts for this context bank.
> > > > + * It's expected that interrupts are re-enabled after resuming the
> > > > + * translation.
> > >
> > > s/translation/transaction/
> > >
> > > > + *
> > > > + * We have to do this before report_iommu_fault() so that we don't
> > > > + * leave interrupts disabled in case the downstream user decides the
> > > > + * fault can be resolved inside its fault handler.
> > > > + *
> > > > + * There is a possible race if there are multiple context banks sharing
> > > > + * the same interrupt and both signal an interrupt in between writing
> > > > + * RESUME and SCTLR. We could disable interrupts here before we
> > > > + * re-enable them in the resume handler, leaving interrupts enabled.
> > > > + * Lock the write to serialize it with the resume handler.
> > > > + */
> > >
> > > I'm struggling to understand this last part. If the resume handler runs
> > > synchronously from report_iommu_fault(), then there's no need for
> > > locking because we're in interrupt context. If the resume handler can
> > > run asynchronously from report_iommu_fault(), then the locking doesn't
> > > help because the code below could clear CFIE right after the resume
> > > handler has set it.
> >
> > The problem is indeed when the resume handler runs asynchronously.
> > Clearing CFIE right after the resume handler has set it is normal and
> > expected. The issue is the opposite, i.e. something like:
> >
> > - Resume handler writes RESUME and stalls for some reason
> > - The interrupt handler runs through and clears CFIE while it's already cleared
> > - Resume handler sets CFIE, assuming that the handler hasn't run yet
> > but it actually has
> >
> > This wouldn't happen with only one context bank, because we wouldn't
> > get an interrupt until the resume handler sets CFIE, but with multiple
> > context banks and a shared interrupt line we could get a "spurious"
> > interrupt due to a fault in an earlier context bank that becomes not
> > spurious if the resume handler writes RESUME before the context fault
> > handler for this bank reads FSR above.
>
> Ah, gotcha. Thanks for the explanation.
>
> If we moved the RESUME+CFIE into the interrupt handler after the call
> to report_iommu_fault(), would it be possible to run the handler as a
> threaded irq (see 'context_fault_needs_threaded_irq') and handle the
> callback synchronously? In that case, I think we could avoid taking the
> lock if we wrote CFIE _before_ RESUME.
>
> Will
We need the lock anyway due to the parallel manipulation of CFCFG in
the same register introduced in the next patch. Expanding it to also
cover the write to RESUME is not a huge deal. Also, doing it
synchronously would require rewriting the fault handling in drm/msm
and again I'm trying to fix this serious stability problem now as soon
as possible without getting dragged into rewriting the whole thing.
Connor
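The point about CFCFG and CFIE living in the same register comes down to a lost update between two read-modify-write sequences. A minimal sketch, with the register modelled as a plain integer and hypothetical names and bit positions chosen for illustration:

```c
#include <stdint.h>

/* Illustrative bit positions for the two fields sharing one register. */
#define SCTLR_CFIE  (1u << 6)
#define SCTLR_CFCFG (1u << 7)

/*
 * Two unserialized read-modify-write sequences: both paths read the same
 * old value, so the second write silently discards the first path's change.
 */
static uint32_t rmw_interleaved(uint32_t sctlr)
{
	uint32_t resume_copy = sctlr;    /* resume path reads */
	uint32_t set_stall_copy = sctlr; /* set_stall path reads */

	resume_copy |= SCTLR_CFIE;
	set_stall_copy &= ~SCTLR_CFCFG;

	sctlr = resume_copy;    /* resume path writes back */
	sctlr = set_stall_copy; /* set_stall path writes back, CFIE is lost */
	return sctlr;
}

/* Same two updates with each read-modify-write done as one unit. */
static uint32_t rmw_serialized(uint32_t sctlr)
{
	sctlr |= SCTLR_CFIE;
	sctlr &= ~SCTLR_CFCFG;
	return sctlr;
}
```

Starting from a register with only CFCFG set, the interleaved version returns 0 (the CFIE update is gone), whereas the serialized version returns a value with CFIE set and CFCFG clear.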
* Re: [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault
2025-05-06 11:32 ` Will Deacon
@ 2025-05-06 16:26 ` Connor Abbott
0 siblings, 0 replies; 14+ messages in thread
From: Connor Abbott @ 2025-05-06 16:26 UTC (permalink / raw)
To: Will Deacon
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno
On Tue, May 6, 2025 at 7:32 AM Will Deacon <will@kernel.org> wrote:
>
> On Wed, Mar 19, 2025 at 10:44:00AM -0400, Connor Abbott wrote:
> > This will be used by drm/msm for GPU page faults, replacing the manual
> > register reading it does.
> >
> > Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
> > Reviewed-by: Rob Clark <robdclark@gmail.com>
> > ---
> > drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c | 6 ++--
> > drivers/iommu/arm/arm-smmu/arm-smmu.c | 35 ++++++++++++++----------
> > drivers/iommu/arm/arm-smmu/arm-smmu.h | 7 +++--
> > 3 files changed, 29 insertions(+), 19 deletions(-)
>
> [...]
>
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > index ade4684c14c9b2724a71e2457288dbfaf7562c83..a02078eb968b81a35c1c086ed7007ea2a453ef94 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > @@ -405,13 +405,20 @@ static const struct iommu_flush_ops arm_smmu_s2_tlb_ops_v1 = {
> > };
> >
> >
> > -void arm_smmu_read_context_fault_info(struct arm_smmu_device *smmu, int idx,
> > +void arm_smmu_read_context_fault_info(struct arm_smmu_domain *smmu_domain,
> > struct arm_smmu_context_fault_info *cfi)
> > {
> > + struct arm_smmu_device *smmu = smmu_domain->smmu;
> > + int idx = smmu_domain->cfg.cbndx;
> > +
> > cfi->iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
> > + cfi->ttbr0 = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_TTBR0);
> > cfi->fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
> > - cfi->fsynr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
> > + cfi->fsynr0 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR0);
> > + cfi->fsynr1 = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSYNR1);
> > cfi->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
> > + if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
> > + cfi->contextidr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_CONTEXTIDR);
>
> I think this leaves 'cfi->contextidr' uninitialised for stage-2 domains.
> We should probably either zero it here or just zero-initialise the whole
> 'cfi' struct in arm_smmu_context_fault() with a:
>
> struct arm_smmu_context_fault_info cfi = {};
>
> line.
>
> Will

The next patch moves the context fault info into arm_smmu_domain, which
is already zero-initialized.

Connor
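
The point raised in this exchange can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the driver code: `struct fault_info`, `read_fault_info()`, `contextidr_after_read()` and the `STAGE_*` values are hypothetical stand-ins for `struct arm_smmu_context_fault_info`, `arm_smmu_read_context_fault_info()` and the domain stage, and the register values are made up. It shows that a field written on only one branch still reads back as zero when the caller's storage starts out zeroed, which is the property both the suggested `= {}` initialiser and the zero-initialized `arm_smmu_domain` rely on. The sketch uses `= {0}` because the empty-brace form is a GNU extension before C23.

```c
#include <stdint.h>

enum stage { STAGE_S1, STAGE_S2 };

/* Hypothetical, trimmed-down stand-in for the fault info struct. */
struct fault_info {
	uint64_t iova;
	uint32_t fsr;
	uint32_t contextidr;	/* only read for stage-1 domains */
};

/* Fills contextidr only for stage 1, mirroring the shape of the patch. */
static void read_fault_info(enum stage stage, struct fault_info *cfi)
{
	cfi->iova = 0x1000;	/* made-up register values */
	cfi->fsr = 0x2;
	if (stage == STAGE_S1)
		cfi->contextidr = 0x42;
	/* For STAGE_S2, contextidr keeps whatever the caller put there. */
}

/* The caller zero-initialises, so a skipped field reads back as 0. */
static uint32_t contextidr_after_read(enum stage stage)
{
	struct fault_info cfi = {0};

	read_fault_info(stage, &cfi);
	return cfi.contextidr;
}
```

Without the zero-initialisation in the caller, the stage-2 result would be an indeterminate stack value rather than 0.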
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-05-06 15:18 ` Connor Abbott
@ 2025-05-15 14:46 ` Will Deacon
2025-05-15 17:13 ` Connor Abbott
0 siblings, 1 reply; 14+ messages in thread
From: Will Deacon @ 2025-05-15 14:46 UTC (permalink / raw)
To: Connor Abbott
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno, quic_c_gdjako
On Tue, May 06, 2025 at 11:18:44AM -0400, Connor Abbott wrote:
> On Tue, May 6, 2025 at 10:53 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> > > On Tue, May 6, 2025 at 8:24 AM Will Deacon <will@kernel.org> wrote:
> > > > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > > > > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > > > > return IRQ_NONE;
> > > > >
> > > > > + /*
> > > > > + * On some implementations FSR.SS asserts a context fault
> > > > > + * interrupt. We do not want this behavior, because resolving the
> > > > > + * original context fault typically requires operations that cannot be
> > > > > + * performed in IRQ context but leaving the stall unacknowledged will
> > > > > + * immediately lead to another spurious interrupt as FSR.SS is still
> > > > > + * set. Work around this by disabling interrupts for this context bank.
> > > > > + * It's expected that interrupts are re-enabled after resuming the
> > > > > + * translation.
> > > >
> > > > s/translation/transaction/
> > > >
> > > > > + *
> > > > > + * We have to do this before report_iommu_fault() so that we don't
> > > > > + * leave interrupts disabled in case the downstream user decides the
> > > > > + * fault can be resolved inside its fault handler.
> > > > > + *
> > > > > + * There is a possible race if there are multiple context banks sharing
> > > > > + * the same interrupt and both signal an interrupt in between writing
> > > > > + * RESUME and SCTLR. We could disable interrupts here before we
> > > > > + * re-enable them in the resume handler, leaving interrupts enabled.
> > > > > + * Lock the write to serialize it with the resume handler.
> > > > > + */
> > > >
> > > > I'm struggling to understand this last part. If the resume handler runs
> > > > synchronously from report_iommu_fault(), then there's no need for
> > > > locking because we're in interrupt context. If the resume handler can
> > > > run asynchronously from report_iommu_fault(), then the locking doesn't
> > > > help because the code below could clear CFIE right after the resume
> > > > handler has set it.
> > >
> > > The problem is indeed when the resume handler runs asynchronously.
> > > Clearing CFIE right after the resume handler has set it is normal and
> > > expected. The issue is the opposite, i.e. something like:
> > >
> > > - Resume handler writes RESUME and stalls for some reason
> > > - The interrupt handler runs through and clears CFIE while it's already cleared
> > > - Resume handler sets CFIE, assuming that the handler hasn't run yet
> > > but it actually has
> > >
> > > This wouldn't happen with only one context bank, because we wouldn't
> > > get an interrupt until the resume handler sets CFIE, but with multiple
> > > context banks and a shared interrupt line we could get a "spurious"
> > > interrupt due to a fault in an earlier context bank that becomes not
> > > spurious if the resume handler writes RESUME before the context fault
> > > handler for this bank reads FSR above.
> >
> > Ah, gotcha. Thanks for the explanation.
> >
> > If we moved the RESUME+CFIE into the interrupt handler after the call
> > to report_iommu_fault(), would it be possible to run the handler as a
> > threaded irq (see 'context_fault_needs_threaded_irq') and handle the
> > callback synchronously? In that case, I think we could avoid taking the
> > lock if we wrote CFIE _before_ RESUME.
> >
>
> We need the lock anyway due to the parallel manipulation of CFCFG in
> the same register introduced in the next patch. Expanding it to also
> cover the write to RESUME is not a huge deal. Also, doing it
> synchronously would require rewriting the fault handling in drm/msm
> and again I'm trying to fix this serious stability problem now as soon
> as possible without getting dragged into rewriting the whole thing.

This has never worked though, right? In which case, we should fix it
properly rather than papering over the mess.

Georgi (CC'd) added support for threaded interrupts specifically to
permit sleeping operations in the fault handler. You should be able to
use that and I don't understand why that would require "rewriting the
whole thing". You can kick the async work and then wait for it to
complete, no?

That would then open the door to handling the RESUME in the core driver
in future based on the return value from report_iommu_fault().

You also need to fix qcom_tbu_halt() as I mentioned before.

Will
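
The CFIE race described in the quoted exchange above can be walked through with a small sequential model. This is only a sketch under stated assumptions: `struct bank_model` and every function below are hypothetical, the two booleans stand in for SCTLR.CFIE and a pending stalled fault, and the "lock" is modelled simply by keeping the resume handler's two writes adjacent. It replays the one interleaving from the thread rather than running real concurrent code.

```c
#include <stdbool.h>

/* Hypothetical model of one context bank. */
struct bank_model {
	bool cfie;	/* context fault interrupt enable */
	bool stalled;	/* a stalled fault is pending */
};

static void resume_write_resume(struct bank_model *b) { b->stalled = false; }
static void resume_set_cfie(struct bank_model *b)     { b->cfie = true; }
static void new_fault(struct bank_model *b)           { b->stalled = true; }

/* The IRQ handler bails out if nothing is pending, else clears CFIE. */
static void irq_handler(struct bank_model *b)
{
	if (!b->stalled)
		return;
	b->cfie = false;
}

/* Unlocked: the handler runs between the RESUME and CFIE writes. */
static bool storm_without_lock(void)
{
	struct bank_model b = { .cfie = false, .stalled = true };

	resume_write_resume(&b);	/* resume handler, first half */
	new_fault(&b);			/* bank faults again */
	irq_handler(&b);		/* clears CFIE, already clear */
	resume_set_cfie(&b);		/* resume handler, second half */
	return b.cfie && b.stalled;	/* enabled while stalled */
}

/* Locked: both resume writes complete before the handler's write. */
static bool storm_with_lock(void)
{
	struct bank_model b = { .cfie = false, .stalled = true };

	resume_write_resume(&b);
	resume_set_cfie(&b);
	new_fault(&b);
	irq_handler(&b);		/* clears CFIE after it was set */
	return b.cfie && b.stalled;
}
```

In this model the unlocked ordering ends with the interrupt enabled while a stalled fault is pending, which is the state the patch comment warns about; the locked ordering ends with the interrupt disabled.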
* Re: [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
2025-05-15 14:46 ` Will Deacon
@ 2025-05-15 17:13 ` Connor Abbott
0 siblings, 0 replies; 14+ messages in thread
From: Connor Abbott @ 2025-05-15 17:13 UTC (permalink / raw)
To: Will Deacon
Cc: Rob Clark, Robin Murphy, Joerg Roedel, Sean Paul, Konrad Dybcio,
Abhinav Kumar, Dmitry Baryshkov, Marijn Suijten, iommu,
linux-arm-msm, linux-arm-kernel, freedreno, quic_c_gdjako
On Thu, May 15, 2025 at 10:47 AM Will Deacon <will@kernel.org> wrote:
>
> On Tue, May 06, 2025 at 11:18:44AM -0400, Connor Abbott wrote:
> > On Tue, May 6, 2025 at 10:53 AM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> > > > On Tue, May 6, 2025 at 8:24 AM Will Deacon <will@kernel.org> wrote:
> > > > > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > > > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > > > > > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > > > > > return IRQ_NONE;
> > > > > >
> > > > > > + /*
> > > > > > + * On some implementations FSR.SS asserts a context fault
> > > > > > + * interrupt. We do not want this behavior, because resolving the
> > > > > > + * original context fault typically requires operations that cannot be
> > > > > > + * performed in IRQ context but leaving the stall unacknowledged will
> > > > > > + * immediately lead to another spurious interrupt as FSR.SS is still
> > > > > > + * set. Work around this by disabling interrupts for this context bank.
> > > > > > + * It's expected that interrupts are re-enabled after resuming the
> > > > > > + * translation.
> > > > >
> > > > > s/translation/transaction/
> > > > >
> > > > > > + *
> > > > > > + * We have to do this before report_iommu_fault() so that we don't
> > > > > > + * leave interrupts disabled in case the downstream user decides the
> > > > > > + * fault can be resolved inside its fault handler.
> > > > > > + *
> > > > > > + * There is a possible race if there are multiple context banks sharing
> > > > > > + * the same interrupt and both signal an interrupt in between writing
> > > > > > + * RESUME and SCTLR. We could disable interrupts here before we
> > > > > > + * re-enable them in the resume handler, leaving interrupts enabled.
> > > > > > + * Lock the write to serialize it with the resume handler.
> > > > > > + */
> > > > >
> > > > > I'm struggling to understand this last part. If the resume handler runs
> > > > > synchronously from report_iommu_fault(), then there's no need for
> > > > > locking because we're in interrupt context. If the resume handler can
> > > > > run asynchronously from report_iommu_fault(), then the locking doesn't
> > > > > help because the code below could clear CFIE right after the resume
> > > > > handler has set it.
> > > >
> > > > The problem is indeed when the resume handler runs asynchronously.
> > > > Clearing CFIE right after the resume handler has set it is normal and
> > > > expected. The issue is the opposite, i.e. something like:
> > > >
> > > > - Resume handler writes RESUME and stalls for some reason
> > > > - The interrupt handler runs through and clears CFIE while it's already cleared
> > > > - Resume handler sets CFIE, assuming that the handler hasn't run yet
> > > > but it actually has
> > > >
> > > > This wouldn't happen with only one context bank, because we wouldn't
> > > > get an interrupt until the resume handler sets CFIE, but with multiple
> > > > context banks and a shared interrupt line we could get a "spurious"
> > > > interrupt due to a fault in an earlier context bank that becomes not
> > > > spurious if the resume handler writes RESUME before the context fault
> > > > handler for this bank reads FSR above.
> > >
> > > Ah, gotcha. Thanks for the explanation.
> > >
> > > If we moved the RESUME+CFIE into the interrupt handler after the call
> > > to report_iommu_fault(), would it be possible to run the handler as a
> > > threaded irq (see 'context_fault_needs_threaded_irq') and handle the
> > > callback synchronously? In that case, I think we could avoid taking the
> > > lock if we wrote CFIE _before_ RESUME.
> > >
> >
> > We need the lock anyway due to the parallel manipulation of CFCFG in
> > the same register introduced in the next patch. Expanding it to also
> > cover the write to RESUME is not a huge deal. Also, doing it
> > synchronously would require rewriting the fault handling in drm/msm
> > and again I'm trying to fix this serious stability problem now as soon
> > as possible without getting dragged into rewriting the whole thing.
>
> This has never worked though, right? In which case, we should fix it
> properly rather than papering over the mess.
It has never worked upstream. Which means that everyone is either
carrying this series downstream, or blowing up sometimes. The number
of places carrying this series is quickly multiplying, so it's
becoming more and more painful that this isn't upstream. Not to
mention the downstreams that still aren't aware of this and hang the
whole system sometimes.
>
> Georgi (CC'd) added support for threaded interrupts specifically to
> permit sleeping operations in the fault handler. You should be able to
> use that and I don't understand why that would require "rewriting the
> whole thing". You can kick the async work and then wait for it to
> complete, no?
It would certainly require rewriting the iommu side of it, although it
does get simpler.

Properly handling the drm/msm side of it would also require getting
rid of the fault worker, although I suppose we could just wait on it
as a hack.

I've started trying to see how it would look, but the biggest problem
is that it's going to introduce a lot of complicated cross-tree
dependencies. To fully follow the recommended sequence, we'd have to
do something like:

1. Enable threaded IRQ on Adreno SMMU.
2. Make drm/msm do its devcoredump business immediately.
3. Make iommu in charge of writing RESUME and do it after writing FSR.

The problem is that if we only do 1 and 2, things will be way worse
than before. Today we get some duplicate faults while the devcoredump
is pending, which sometimes, but not always, results in a full system
hang if the devcoredump is scheduled on the same core. With only 1 and
2, the interrupt will never be cleared, due to the MMU-500 behavior of
ignoring writes to RESUME if FSR isn't cleared, and the entire system
will hang every time there's a context fault.

I suppose I could put 3 before 2, and temporarily break devcoredumps?
>
> That would then open the door to handling the RESUME in the core driver
> in future based on the return value from report_iommu_fault().
>
> You also need to fix qcom_tbu_halt() as I mentioned before.
>
> Will
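
The ordering constraint behind step 3 in the reply above can be shown with a toy register model. This is a sketch under stated assumptions, not the driver: `struct cb_model`, `FSR_FAULT_BIT` and the helper functions are hypothetical, FSR is modelled as write-one-to-clear, and the only behavior taken from the thread is that the MMU-500 ignores a write to RESUME while FSR is not cleared.

```c
#include <stdbool.h>
#include <stdint.h>

#define FSR_FAULT_BIT 0x2u	/* hypothetical bit value for the model */

/* Hypothetical model of the context bank registers involved. */
struct cb_model {
	uint32_t fsr;
	bool stalled;
};

/* Model FSR as write-one-to-clear. */
static void write_fsr(struct cb_model *cb, uint32_t val)
{
	cb->fsr &= ~val;
}

/* Model the behavior from the thread: RESUME is ignored unless FSR is clear. */
static void write_resume(struct cb_model *cb)
{
	if (cb->fsr != 0)
		return;
	cb->stalled = false;
}

/* RESUME first, FSR second: the transaction stays stalled. */
static bool still_stalled_resume_first(void)
{
	struct cb_model cb = { .fsr = FSR_FAULT_BIT, .stalled = true };

	write_resume(&cb);
	write_fsr(&cb, FSR_FAULT_BIT);
	return cb.stalled;
}

/* FSR first, RESUME second: the stall is released. */
static bool still_stalled_fsr_first(void)
{
	struct cb_model cb = { .fsr = FSR_FAULT_BIT, .stalled = true };

	write_fsr(&cb, FSR_FAULT_BIT);
	write_resume(&cb);
	return cb.stalled;
}
```

In the model, only the FSR-then-RESUME order releases the stall, which matches the concern that doing steps 1 and 2 without step 3 would leave the fault pending.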
end of thread, other threads:[~2025-05-15 17:16 UTC | newest]
Thread overview: 14+ messages
2025-03-19 14:43 [PATCH v5 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault Connor Abbott
2025-03-19 14:44 ` [PATCH v5 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
2025-05-06 11:32 ` Will Deacon
2025-05-06 16:26 ` Connor Abbott
2025-03-19 14:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Don't read fault registers directly Connor Abbott
2025-03-19 14:44 ` [PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault Connor Abbott
2025-05-06 12:24 ` Will Deacon
2025-05-06 14:08 ` Connor Abbott
2025-05-06 14:53 ` Will Deacon
2025-05-06 15:18 ` Connor Abbott
2025-05-15 14:46 ` Will Deacon
2025-05-15 17:13 ` Connor Abbott
2025-03-19 14:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Make set_stall work when the device is on Connor Abbott
2025-03-19 14:44 ` [PATCH v5 5/5] drm/msm: Temporarily disable stall-on-fault after a page fault Connor Abbott