linux-arm-kernel.lists.infradead.org archive mirror
From: Connor Abbott <cwabbott0@gmail.com>
To: Rob Clark <robdclark@gmail.com>, Will Deacon <will@kernel.org>,
	 Robin Murphy <robin.murphy@arm.com>,
	Joerg Roedel <joro@8bytes.org>,  Sean Paul <sean@poorly.run>,
	Konrad Dybcio <konradybcio@kernel.org>,
	 Abhinav Kumar <quic_abhinavk@quicinc.com>,
	 Dmitry Baryshkov <dmitry.baryshkov@linaro.org>,
	 Marijn Suijten <marijn.suijten@somainline.org>
Cc: iommu@lists.linux.dev, linux-arm-msm@vger.kernel.org,
	 linux-arm-kernel@lists.infradead.org,
	freedreno@lists.freedesktop.org,
	 Connor Abbott <cwabbott0@gmail.com>
Subject: [PATCH v4 0/5] iommu/arm-smmu, drm/msm: Fixes for stall-on-fault
Date: Tue, 04 Mar 2025 11:56:46 -0500
Message-ID: <20250304-msm-gpu-fault-fixes-next-v4-0-be14be37f4c3@gmail.com>

drm/msm uses the stall-on-fault model to record the GPU state on the
first GPU page fault to help debugging. On systems where the GPU is
paired with an MMU-500, there were two problems:

1. The MMU-500 doesn't de-assert its interrupt line until the fault is
   resumed, which led to a storm of interrupts until the fault handler
   was called. If we got unlucky and the fault handler was on the same
   CPU as the interrupt, there was a deadlock.
2. The GPU is capable of generating page faults much faster than we can
   resume them. The GMU (GPU Management Unit) shares the same context
   bank as the GPU, so a sudden burst of page faults would effectively
   starve it and trigger a watchdog reset. This was made even worse
   because the GPU cannot be reset while there's a pending transaction,
   which left the GPU permanently wedged.
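
To make the first problem concrete, here is a small userspace model of a
level-triggered fault interrupt. It is only a sketch: the struct, the
function names and the "mask until resume" approach are assumptions made
for illustration (irq_enabled loosely stands in for the CFIE bit), not
the code in the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one context bank; not the driver's real state. */
struct ctx_bank_model {
	bool fault_pending;    /* a fault is recorded and the transaction is stalled */
	bool irq_enabled;      /* loosely stands in for the CFIE bit */
	unsigned int irq_count; /* number of times the IRQ handler ran */
};

/* The line is level-triggered: asserted while a fault is pending and unmasked. */
static bool irq_line_asserted(const struct ctx_bank_model *cb)
{
	return cb->fault_pending && cb->irq_enabled;
}

/* IRQ handler: optionally mask the interrupt until the fault is resumed. */
static void irq_handler(struct ctx_bank_model *cb, bool mask_until_resume)
{
	cb->irq_count++;
	if (mask_until_resume)
		cb->irq_enabled = false;
}

/* Resuming the stalled transaction clears the fault and unmasks the IRQ. */
static void resume_fault(struct ctx_bank_model *cb)
{
	cb->fault_pending = false;
	cb->irq_enabled = true;
}

/*
 * Sample the line `rounds` times before the deferred fault handler gets
 * to resume the fault, and return how often the IRQ handler was entered.
 */
static unsigned int simulate_fault(bool mask_until_resume, unsigned int rounds)
{
	struct ctx_bank_model cb = {
		.fault_pending = true,
		.irq_enabled = true,
		.irq_count = 0,
	};

	for (unsigned int i = 0; i < rounds; i++) {
		if (irq_line_asserted(&cb))
			irq_handler(&cb, mask_until_resume);
	}

	resume_fault(&cb);
	return cb.irq_count;
}
```

In this model, leaving the interrupt unmasked re-enters the handler on
every round until the resume, while masking it enters the handler once.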

Patches 1-3 fix the first problem and are independent of the rest of the
series. Patch 5 fixes the second problem and depends on patch 4, so
there will have to be some cross-tree coordination.
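
For readers unfamiliar with the idea behind patch 5, the following is a
rough userspace sketch of disabling stall-on-fault after a fault and
re-enabling it once a cooldown has passed, checked lazily rather than
with a timer. The names (stall_state, STALL_COOLDOWN_MS,
maybe_reenable_stall) and the 500 ms value are hypothetical and chosen
only for illustration; the real patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cooldown length; the real value is not given here. */
#define STALL_COOLDOWN_MS 500

/* Hypothetical per-GPU state for this sketch. */
struct stall_state {
	bool stall_enabled;      /* is stall-on-fault currently on? */
	uint64_t reenable_at_ms; /* earliest time it may be turned back on */
};

/*
 * Called on every page fault. Returns true if this fault was stalled,
 * i.e. the GPU state could still be captured for it. Every fault pushes
 * the deadline out, so a burst of faults keeps stalling disabled.
 */
static bool on_page_fault(struct stall_state *s, uint64_t now_ms)
{
	bool was_stalled = s->stall_enabled;

	s->stall_enabled = false;
	s->reenable_at_ms = now_ms + STALL_COOLDOWN_MS;
	return was_stalled;
}

/*
 * Called from some regular path (for example job submission) instead of
 * a timer: re-enable stalling only once the cooldown has ended.
 */
static void maybe_reenable_stall(struct stall_state *s, uint64_t now_ms)
{
	if (!s->stall_enabled && now_ms >= s->reenable_at_ms)
		s->stall_enabled = true;
}
```

With this shape only the first fault of a burst is stalled, so later
faults in the same burst cannot pile up behind pending resumes.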

I've rebased this series on the latest linux-next to avoid rebase
troubles.

Signed-off-by: Connor Abbott <cwabbott0@gmail.com>
---
Changes in v4:
- Add patches 1-2, which fix reading registers in drm/msm when
  acknowledging the fault early. This was Robin's preferred solution
  compared to making drm/msm's fault handler tell arm-smmu to resume the
  fault.
- Link to v3: https://lore.kernel.org/r/20250122-msm-gpu-fault-fixes-next-v3-0-0afa00158521@gmail.com

Changes in v3:
- Acknowledge the fault before resuming the transaction in patch 1.
- Add suggested extra context to commit messages.
- Link to v2: https://lore.kernel.org/r/20250120-msm-gpu-fault-fixes-next-v2-0-d636c4027042@gmail.com

Changes in v2:
- Remove unnecessary _irqsave when locking in IRQ handler (Robin)
- Reuse existing spinlock for CFIE manipulation (Robin)
- Lock CFCFG manipulation against concurrent CFIE manipulation
- Don't use timer to re-enable stall-on-fault. (Rob)
- Use more descriptive name for the function that re-enables
  stall-on-fault if the cooldown period has ended. (Rob)
- Link to v1: https://lore.kernel.org/r/20250117-msm-gpu-fault-fixes-next-v1-0-bc9b332b5d0b@gmail.com

---
Connor Abbott (5):
      iommu/arm-smmu: Save additional information on context fault
      iommu/arm-smmu-qcom: Don't read fault registers directly
      iommu/arm-smmu: Fix spurious interrupts with stall-on-fault
      iommu/arm-smmu-qcom: Make set_stall work when the device is on
      drm/msm: Temporarily disable stall-on-fault after a page fault

 drivers/gpu/drm/msm/adreno/a5xx_gpu.c            |  2 +
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c            |  4 ++
 drivers/gpu/drm/msm/adreno/adreno_gpu.c          | 42 ++++++++++++-
 drivers/gpu/drm/msm/adreno/adreno_gpu.h          | 24 ++++++++
 drivers/gpu/drm/msm/msm_iommu.c                  |  9 +++
 drivers/gpu/drm/msm/msm_mmu.h                    |  1 +
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom-debug.c |  4 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c       | 64 ++++++++++++++-----
 drivers/iommu/arm/arm-smmu/arm-smmu.c            | 78 ++++++++++++++++++------
 drivers/iommu/arm/arm-smmu/arm-smmu.h            | 19 +++---
 10 files changed, 204 insertions(+), 43 deletions(-)
---
base-commit: 866e43b945bf98f8e807dfa45eca92f931f3a032
change-id: 20250117-msm-gpu-fault-fixes-next-96e3098023e1

Best regards,
-- 
Connor Abbott <cwabbott0@gmail.com>




Thread overview: 33+ messages
2025-03-04 16:56 Connor Abbott [this message]
2025-03-04 16:56 ` [PATCH v4 1/5] iommu/arm-smmu: Save additional information on context fault Connor Abbott
2025-03-05 19:09   ` Rob Clark
2025-03-11 18:05   ` Will Deacon
2025-03-11 22:36     ` Connor Abbott
2025-03-12 13:05       ` Will Deacon
2025-03-12 14:59         ` Rob Clark
2025-03-12 16:47           ` Will Deacon
2025-03-12 17:23             ` Rob Clark
2025-03-12 18:01               ` Robin Murphy
2025-03-12 20:20                 ` Rob Clark
2025-03-18 15:46                   ` Will Deacon
2025-03-04 16:56 ` [PATCH v4 2/5] iommu/arm-smmu-qcom: Don't read fault registers directly Connor Abbott
2025-03-05 19:08   ` Rob Clark
2025-03-11 18:08   ` Will Deacon
2025-03-11 19:42     ` Connor Abbott
2025-03-11 20:00       ` Rob Clark
2025-03-12 12:57         ` Robin Murphy
2025-03-12 13:06         ` Will Deacon
2025-03-04 16:56 ` [PATCH v4 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault Connor Abbott
2025-03-05 19:12   ` Rob Clark
2025-03-04 16:56 ` [PATCH v4 4/5] iommu/arm-smmu-qcom: Make set_stall work when the device is on Connor Abbott
2025-03-05 19:09   ` Rob Clark
2025-03-11 18:11   ` Will Deacon
2025-03-11 20:01     ` Connor Abbott
2025-03-12 12:49       ` Will Deacon
2025-03-12 13:30         ` Connor Abbott
2025-03-18 13:36           ` Robin Murphy
2025-03-18 15:47             ` Will Deacon
2025-03-04 16:56 ` [PATCH v4 5/5] drm/msm: Temporarily disable stall-on-fault after a page fault Connor Abbott
2025-03-05 19:07   ` Rob Clark
2025-03-05 19:38     ` Connor Abbott
2025-03-05 19:56       ` Rob Clark
